To assemble and run this example, use the batch file "m3" in the Warpcode directory.
It's at this point, if you were using an inferior processor, that you'd say "oh well, nice effect, shame it's too slow to be of any use", and give up and go down the pub. However, since we have this cool VLIW thing on the MPEs, and we haven't even used it yet, we can stay at home and carry on coding and stand a decent chance of flaying it into usable shape.
I'll now list the code with the new bilinear interpolation stuff in the pixel_gen loop. In this code I have also fixed a stupid mistake that I made in the previous versions of the code. In doing bilinear interpolation, one uses a pixel value and the fractional part of the source pixel address to generate an interpolated colour value using the mul_p instruction. I was forgetting that mul_p only accepts the u and v indices for multiply, and not x and y! Silly! So I have changed around the usage of (xy) and (uv) - (uv) now traverses the source, and (xy) the output buffer. Everything works the same way as before though. Again, I'll add comments in bold wherever something new has been inserted.
; ; warp3.a - now does bilinear interpolation ; but look how slow it goes now... ; here's some definitions .include "merlin.i" ;general Merlin things .include "scrndefs.i" ;defines screen buffers and screen DMA type .start go .segment local_ram .align.v ; buffer for internal pixel map (1 DMA's worth) buffer: .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ; output line buffer (1 DMA's worth x2, for double buffering) line: .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ; DMA command buffer dma__cmd: .dc.s 0,0,0,0,0,0,0,0 ; destination screen address dest: .dc.s dmaScreen2 ; frame counter ctr: .dc.s 0 ; reg equates for this routine x = r8 y = r9 pixel = v1 pixel2 = v0 pixel3 = v6 pixel4 = v7 destx = r12 desty = r13 dma_len = r14 destw = r10 desth = r11 yi = r16 xi = r17 xs = r18 ys = r19 dma_mode = r20 dma_dbase = r21 out_buffer = r22I have declared three extra pixel vectors. These are used in the bilinear interpolation code, where one needs to sample 4 pixel values to arrive at the final result.
.segment instruction_ram go: st_s #$aa,intctl ;turn off any existing video st_s #(local_ram_base+4096),sp ;here's the SP ; clear the source buffer to *random* pixels, using the pseudo random sequence generator ; out of Graphics Gems 1 mv_s #$a3000000,r2 ;This is the mask mv_s #$b3725309,r0 ;A random seed mv_s #buffer,r1 ;Address of the source buffer st_s #64,rc0 ;This is how many pixels to clear cl_srceb: btst #0,r0 ;Check bit zero of the current seed bra eq,nxor ;Do not xor with the mask if it ain't set lsr #1,r0 ;Always shift the mask, whatever happens dec rc0 ;dec the loop counter eor r2,r0 ;If that bit was 1, xor in the mask nxor: bra c0ne,cl_srceb ;loop for all the pixels st_s r0,(r1) ;store the pixel add #4,r1 ;point to next pixel address ; set up a simple cross-shaped test pattern in the buffer RAM mv_s #$51f05a00,r0 ;Pixel colour (a red colour) mv_s #buffer+(32*4),r1 ;Line halfway down buffer mv_s #buffer+16,r2 ;Column halfway across top line of buffer st_s #8,rc0 ;Number of pixels to write testpat: st_s r0,(r1) ;Store pixel value at row address. st_s r0,(r2) ;Store pixel value at column address. dec rc0 ;Decrement loop counter. bra c0ne,testpat ;Loop if counter not equal to 0. add #4,r1 ;Increment row address by one pixel. add #32,r2 ;Increment column address by one line. ; now, initialise video jsr SetUpVideo,nop frame_loop: ; generate a drawscreen address mv_s #dmaScreenSize,r0 ;this lot selects one of mv_s #dmaScreen3,r3 ;three drawscreen buffers ld_s dest,r1 ;this should be inited to a ;valid screen buffer address nop cmp r3,r1 bra ne,updatedraw add r0,r1 nop mv_s #dmaScreen1,r1 ;reset buffer base updatedraw: st_s r1,dest ;set current drawframe address ; actually draw a frame jsr drawframe,nop ; increment the frame counter ld_s ctr,r0 nop add #1,r0 st_s r0,ctr ; set the address of the frame just drawn on the video system jsr SetVidBase ld_s dest,r0 nop ; loop back for the next frame bra frame_loop,nop drawframe: ; save the return address for nested subroutine calls push v7,rz ; ensure that any pending DMA is complete. Whilst it ; is not really necessary at the moment, it is good form, ; for later on we may arrive at the start of a routine ; while DMA is still happening from something we did before. jsr dma_finished,nop ; initialise the bilinear addressing registers st_s #buffer,uvbase ;I want *UV* to point at the buffer here. st_s #$104dd008,uvctl ;UV type, derived as follows: ;Bit 28 set, I wanna use CH-NORM. ;Pixel type set to 4 (32-bit pixels). ;YTILE and VTILE both set to 13 (treat the buffer as an 8x8 tilable bitmap). ;The width is set to 8 pixels. st_s #line,xybase ;set the line buffer address mv_s #line,dma_dbase ;Store the same address as double buffer base. st_s #$1040f040,xyctl ;XY type, derived as follows: ;Bit 28 set, I wanna use CH-NORM. ;Pixel type set to 4 (32-bit pixels). ;XTILE off, YTILE set to mask bits 17-31. ;This means that the integer part of Y is ;constrained to 0 or 1. We use it to switch buffers. ;The width is set to 64 pixels. st_s #0,ry ;Init Y to point to the first buffer.I've swapped the functions of the XY and UV pairs, here and throughout the code, to fix my silly mistake of using XY when I meant to use UV!
; initialise parameters for the routine mv_s #0,desty ;Start at dest y=0 mv_s #0,destx ;Start at dest x=0 ld_s ctr,x ;Use counter, to make it move ld_s ctr,y ;Same for Y lsl #13,x ;make it half a source pixel a frame lsl #14,y ;same mv_s #$2000,xi ;Source X inc mv_s #$400,yi ;Source Y inc mv_s #$c00,xs ;Source X step mv_s #$1400,ys ;Source Y step mv_s #360,destw ;Width of dest rectangle mv_s #240,desth ;Height of dest rectangle sub out_buffer,out_buffer ;select buffer offset of 0 ; now the outer loop warp_outer: push v2 ;save the source X and Y, and the width and height push v3 ;save the dest X and Y ; and now the inner. warp_inner: mv_s #64,r0 ;This is the maximum number of pixels for one DMA. sub r0,destw ;Count them off the total dest width. bra gt,w_1 ;do nothing if this is positive mv_s dma_mode,r1 ;My 'standard' DMA call requires this address in r1. st_s #0,rx ;Point rx at the first pixel of the output buffer add destw,r0 ;If negative, modify the number of pixels to generate. w_1: jsr pixel_gen ;Go and call the pixel generation loop mv_s r0,dma_len ;Set the dma length in my dma vector st_s r0,rc0 ;Set the counter for the pixgen loop ; Pixel gen function will return here after having generated and DMA'd out the pixels cmp #0,destw ;Did the width go negative? bra gt,warp_inner ;No, it did not, carry on the horizontal traverse of the dest rectangle add dma_len,destx ;add dma_len to the dest x position nop ;empty delay slot ; Horizontal span is finished if we fall through to here pop v3 ;restore dest X and Y pop v2 ;restore source X and Y add #1,desty ;point to next line of dest sub #1,desth ;decrement the Y size jmp gt,warp_outer ;loop for entire height add xs,x ;add the X step to the source add ys,y ;add the Y step to the source ; all done! pop v7,rz ;get back return address nop rts t,nop ;and return pixel_gen: ; This is the pixel generation function. It collects *bilerped* pixels from the 8x8 pattern buffer and ; deposits them in the linear destination buffer for output to external RAM. st_s x,ru ;Initialise bilinear U pointer st_s y,rv ;Initialise bilinear V pointer ; Here is the bilerp part. ld_p (uv),pixel ;Grab a pixel from the source addr #1,ru ;go to next horiz pixel ld_p (uv),pixel2 ;Get a second pixel addr #1,rv ;go to next vert pixel ld_p (uv),pixel4 ;get a third pixel addr #-1,ru ;go to prev horizontal pixel ld_p (uv),pixel3 ;get a fourth pixel addr #-1,rv ;go back to original pixelThe bilerp begins. Here we get the four pixels adjacent to our position in uv-space.
sub_p pixel,pixel2 ;make vector between first 2 pixels sub_p pixel3,pixel4 ;make vector between second 2 pixels mul_p ru,pixel2,>>#14,pixel2 ;scale according to fractional part of ru mul_p ru,pixel4,>>#14,pixel4 ;scale according to fractional part of ru add_p pixel2,pixel ;get first intermediate pixel add_p pixel4,pixel3 ;get second intermediate pixelHere we arrive at two intermediate pixel values by interpolating between the two pairs of horizontally adjacent pixels. This is done by calculating the difference between the two horizontal pixels, scaling that difference by the fractional part of the ru index by using mul_p, and then adding back the base pixel value to get the interpolated result.
sub_p pixel,pixel3 ;get vector to final value mul_p rv,pixel3,>>#14,pixel3 ;scale with fractional part of rv nop ;wait for multiply add_p pixel3,pixel ;make final pixel valueHere we do the same thing with the two intermediate values, this time scaling by the rv index, to arrive at the final pixel colour. Then we just write it out as per usual.
dec rc0 ;Decrement the counter st_p pixel,(xy) ;Deposit the pixel in the dest buffer addr #1,rx ;increment the dest buffer pointer bra c0ne,pixel_gen ;Loop for the length of the dest buffer add xi,x ;Add the x-increment add yi,y ;Add the y_increment ; Now, the pixel buffer is full, so it is time to DMA it out to external RAM. ; ; To implement simple double-buffering of the DMA out, we have to do ; the following: wait for (a) the PENDING bit to go clear, which will ; mean that DMA is ready to accept a command; and (b), make sure that ; the ACTIVE level is never greater than (#buffers-1). Here we are using ; 2 buffers, so we wait until it is 1. dma_avail: ld_s mdmactl,r0 ;Get the DMA status. nop btst #4,r0 ;Pending? bra ne,dma_avail ;Yeah, gotta wait. bits #3,>>#0,r0 ;Extract the ACTIVE level cmp #1,r0 ;check against (#buffers-1) bra gt,dma_avail,nop ;Wait until it is OK. ; Now we know DMA is ready, so we can proceed to set up and launch the DMA write. mv_s #dmaFlags,r0 ;Get DMA flags for this screentype. ld_s dest,r1 ;Address of external RAM screen base copy destx,r2 ;destination xpos copy desty,r3 ;destination ypos lsl #16,dma_len,r4 ;shift DMA size up or r4,r2 ;and combine with x-position bset #16,r3 ;make Y size = 1 mv_s #dma__cmd,r4 ;address of DMA command buffer in local RAM st_v v0,(r4) ;set up first vector of DMA command add #16,r4 ;point to next vector add out_buffer,dma_dbase,r0 ;point to the buffer we just drew st_s r0,(r4) ;place final word of DMA command sub #16,r4 ;point back to start of DMA command buffer st_s r4,mdmacptr ;launch the DMA ; Because we are double buffering, there is no need to wait for ; DMA to complete. We can switch buffers, return and get straight on with the ; next line. rts ;Return to the main loops. eor #1,<>#-8,out_buffer ;Toggle the buffer offset twixt 0 and 256. addr #1,ry ;Change the write buffer index.Nice though it is to be able to do pixel-multiplies and all that good stuff, as you can see from the frame rate on this example, even cool pixel-oriented instructions are not enough on their own to yield the kind of shit-kickin' pace that we are really aiming for. So it's time to get into that VLIW stuff at last.
jmp next jmp prev rts nop nop