To assemble and run this example, use the batch file "m6" in the Warpcode directory.
Now here's an interesting tale of how really wanting to do something can make the scales fall from your eyes and allow you to see scope for even more optimisation where you thought there was none to be had.
As stated at the outset, I wanted to traverse the source image in a cool and interesting manner. The bilinear interpolation looked good, but it was still very linear. I had had in mind all along turning this into a nonlinear warp effect by the simple method of adding a couple of extra variables, xii and yii, and adding them to the x-increment and y-increment at each pixel.
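To make the idea concrete, here is a rough Python model of stepping along one scanline with and without the extra xii/yii terms. It is only a sketch: the function name `trace_scanline` and the floating-point values are made up for illustration (the real code works in fixed point), and the order of the two adds within a pixel is simplified.

```python
def trace_scanline(x, y, xi, yi, xii, yii, count):
    """Return the source coordinates visited along one scanline.

    xi/yi are the per-pixel increments; xii/yii are added to those
    increments at every pixel, which is what bends the traverse.
    """
    points = []
    for _ in range(count):
        points.append((x, y))
        x += xi      # step the source position
        y += yi
        xi += xii    # then bend the step itself
        yi += yii
    return points


# With xii = yii = 0 the traverse is a straight line...
linear = trace_scanline(0.0, 0.0, 0.5, 0.25, 0.0, 0.0, 5)
# ...and with a non-zero xii the x step shrinks as we go.
warped = trace_scanline(0.0, 0.0, 0.5, 0.25, -0.125, 0.0, 5)

print([p[0] for p in linear])
print([p[0] for p in warped])
```

The straight version advances x by the same amount every pixel, while the warped one slows down across the line, which is the whole effect in miniature.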
That meant that I had to squeeze into my already well-packed inner loop two extra adds, add xii,xi and add yii,yi. One of them slipped in nicely on the multiplier at the end of the loop. I had one more to add... and I looked and looked at the loop... multiplier saturated... ALU saturated... reg unit saturated... mem unit saturated.... it was increasingly looking like I was going to have to add an extra instruction for just one poxy add, and that was going to drag the framerate down.
The thought of doing that just burned my ass.
I wanted my nice warp effect in 8 ticks of inner loop, and I would give up llamas for life before I would sully my nice, pert, firmly packed inner loop with a single, isolated poxy add instruction.
I made myself a really hot cup of tea and sat down to study the code, looking for any way I could make a hole in the flow for any of the function units. Finally, I saw the light. If you consider the four pixels that you have to pick up for the bilerp to be in this formation:
    {a b}
    {c d}

then, due to the way the pixel registers became freed up during the calculation, I was loading them in the order a,c,b,d. This is somewhat inelegant, as it needs two addr instructions to go from pixel c to pixel b. I grokked that if I were to enter the loop with XY already set up to look at the "c" pixel (i.e. add 1 to ry), then, with a slight modification to the calculation to free up registers in a different order, I could load the pixels in the order (c,a,b,d), thus freeing up two register unit slots, one in the middle and one at the end of the loop. If I were to update rv with an addr instruction instead of adding to Y and then using st_s rv, I could just use ld_s rv,y outside of the loop to maintain the correct value across the scanline. Not having to add to Y in the inner loop would then free up a (precious!) multiplier slot. So, if I were to free up a register to use as a constant four in the inner loop, I could add to the destination buffer pointer using the spare addm slot, and I'd be home free. This I could do if I used the counter rc1 for the outer loop, instead of just subtracting from desth. I would not need to keep desth at all, so I could free up its register for my constant four.
I duly implemented, and it worked. The moral of this little digression being that, even if your inner loop is stuffed to bursting, you can often find a way to squeeze in a little bit more, if you really want to and you drink enough tea.
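For reference, the four-pixel bilerp that all this juggling serves can be modelled in a few lines of Python. This is a sketch of the arithmetic only, using the {a b} {c d} formation from above; the helper names `lerp` and `bilerp` and the sample colours are illustrative, and nothing here models the actual register scheduling.

```python
def lerp(p, q, t):
    """Blend two pixels component by component: p + t * (q - p)."""
    return tuple(pc + t * (qc - pc) for pc, qc in zip(p, q))


def bilerp(a, b, c, d, fu, fv):
    """Bilinear blend of the 2x2 formation {a b} {c d}.

    fu is the horizontal fraction, fv the vertical fraction.
    """
    top = lerp(a, b, fu)        # blend along the top row
    bottom = lerp(c, d, fu)     # blend along the bottom row
    return lerp(top, bottom, fv)


a, b = (0, 0, 0), (8, 0, 0)
c, d = (0, 8, 0), (8, 8, 0)
result = bilerp(a, b, c, d, 0.5, 0.25)
print(result)
```

The result depends only on which pixel plays which role, not on the order in which the four are fetched, which is why the loads can be shuffled to (c,a,b,d) without changing the picture.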
Anyway, with the inner loop nicely sorted, we should finish off the functionality of the warper. We wanted lots of nice parameters for tweaking, remember? Since the main parameters controlling the shape of the warpage are the increment registers xi, yi, xii and yii, it seemed sensible to add a buncha extra params to allow for much tweakage of same. To this end I added the following goodies:
;
; warp6.a - makes the warp nonlinear

; here's some definitions

    .include "merlin.i"         ;general Merlin things
    .include "scrndefs.i"       ;defines screen buffers and screen DMA type
    .start go
    .segment local_ram
    .align.v

; buffer for internal pixel map (1 DMA's worth)

buffer:
    .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

; output line buffer (1 DMA's worth x2, for double buffering)

line:
    .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

; DMA command buffer

dma__cmd:
    .dc.s 0,0,0,0,0,0,0,0

; destination screen address

dest:
    .dc.s dmaScreen2

; frame counter

ctr:
    .dc.s 0

; reg equates for this routine

    x = r8
    y = r9
    pixel = v1
    pixel2 = v0
    pixel3 = v6
    pixel4 = v7
    pixel5 = v3
    destx = r12
    desty = r13
    dma_len = r14
    destw = r10
    desth = r11
    four = r11
    yi = r16
    xi = r17
    xii = r18
    yii = r19
    xs = r24
    ys = r25
    xss = r26
    yss = r27
    xis = r28
    yis = r29
    xsss = r30
    ysss = r31
    dma_mode = r20
    dma_dbase = r21
    out_buffer = r22

Note the new registers for all the parameters. I also shifted around a couple of the others for ease of access of stuff inside the inner loop, and to be easily able to stack certain params that need to be restored at the end of every scanline.
    .segment instruction_ram

go:
    st_s #$aa,intctl                ;turn off any existing video
    st_s #(local_ram_base+4096),sp  ;here's the SP

; clear the source buffer to *random* pixels, using the pseudo random sequence generator
; out of Graphics Gems 1

    mv_s #$a3000000,r2      ;This is the mask
    mv_s #$b3725309,r0      ;A random seed
    mv_s #buffer,r1         ;Address of the source buffer
    st_s #64,rc0            ;This is how many pixels to clear
cl_srceb:
    btst #0,r0              ;Check bit zero of the current seed
    bra eq,nxor             ;Do not xor with the mask if it ain't set
    lsr #1,r0               ;Always shift the seed, whatever happens
    dec rc0                 ;dec the loop counter
    eor r2,r0               ;If that bit was 1, xor in the mask
nxor:
    bra c0ne,cl_srceb       ;loop for all the pixels
    st_s r0,(r1)            ;store the pixel
    add #4,r1               ;point to next pixel address

; set up a simple cross-shaped test pattern in the buffer RAM

    mv_s #$51f05a00,r0      ;Pixel colour (a red colour)
    mv_s #buffer+(32*4),r1  ;Line halfway down buffer
    mv_s #buffer+16,r2      ;Column halfway across top line of buffer
    st_s #8,rc0             ;Number of pixels to write
testpat:
    st_s r0,(r1)            ;Store pixel value at row address.
    st_s r0,(r2)            ;Store pixel value at column address.
    dec rc0                 ;Decrement loop counter.
    bra c0ne,testpat        ;Loop if counter not equal to 0.
    add #4,r1               ;Increment row address by one pixel.
    add #32,r2              ;Increment column address by one line.
; now, initialise video

    jsr SetUpVideo,nop

frame_loop:

; generate a drawscreen address

    mv_s #dmaScreenSize,r0  ;this lot selects one of
    mv_s #dmaScreen3,r3     ;three drawscreen buffers
    ld_s dest,r1            ;this should be inited to a
                            ;valid screen buffer address
    nop
    cmp r3,r1
    bra ne,updatedraw
    add r0,r1
    nop
    mv_s #dmaScreen1,r1     ;reset buffer base
updatedraw:
    st_s r1,dest            ;set current drawframe address

; actually draw a frame

    jsr drawframe,nop

; increment the frame counter

    ld_s ctr,r0
    nop
    add #1,r0
    st_s r0,ctr

; set the address of the frame just drawn on the video system

    jsr SetVidBase
    ld_s dest,r0
    nop

; loop back for the next frame

    bra frame_loop,nop

drawframe:

; save the return address for nested subroutine calls

    push v7,rz

; ensure that any pending DMA is complete. Whilst it
; is not really necessary at the moment, it is good form,
; for later on we may arrive at the start of a routine
; while DMA is still happening from something we did before.

    jsr dma_finished,nop

; initialise the bilinear addressing registers

    st_s #buffer,uvbase     ;I want *UV* to point at the buffer here.
    st_s #buffer,xybase     ;I want XY to point at the buffer here too.
    st_s #$104dd008,uvctl   ;UV type, derived as follows:
                            ;Bit 28 set, I wanna use CH-NORM.
                            ;Pixel type set to 4 (32-bit pixels).
                            ;UTILE and VTILE both set to 13 (treat the buffer as an 8x8 tilable bitmap).
                            ;The width is set to 8 pixels.
    st_s #$104dd008,xyctl   ;XY type, same as UV type
    mv_s #line,dma_dbase    ;Store the same address as double buffer base.
    st_s #$10400000,linpixctl ;Linear pixel mode, derived as follows:
                            ;Bit 28 set, I wanna use CH-NORM.
                            ;Pixel type set to 4 (32-bit pixels).

Notice that, since I have modified the algorithm to have both xy and uv working on the same bitmap, I set up both xy and uv the same way now. In the new version I am addressing the output buffer directly through an address held in a scalar.
Since I am still going to be storing pixels using st_p, I have to set up linpixctl to tell the system what kind of pixels I am using. linpixctl is very similar to xy/uvctl, except that obviously there is no tile or width information.
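As a sanity check on those control words, here is a small Python sketch that pulls the hex values apart. Be warned that the field positions are an assumption: they are inferred purely from the hex digits and the comments in the listing (bit 28, a pixel-type nibble, two tile nibbles, a width in the low bits), not taken from the hardware documentation, and the names `decode_ctl`, `tile_hi` and `tile_lo` are invented for illustration.

```python
def decode_ctl(word):
    """Split a control word into the fields described in the comments.

    ASSUMED layout, inferred from $104dd008 and its comments only:
      bit 28      CH-NORM flag
      bits 20-23  pixel type
      bits 16-19  first tile field
      bits 12-15  second tile field
      bits 0-11   width in pixels
    """
    return {
        "ch_norm": (word >> 28) & 0x1,
        "pixel_type": (word >> 20) & 0xF,
        "tile_hi": (word >> 16) & 0xF,
        "tile_lo": (word >> 12) & 0xF,
        "width": word & 0xFFF,
    }


uv_fields = decode_ctl(0x104DD008)    # value stored to uvctl and xyctl
lin_fields = decode_ctl(0x10400000)   # value stored to linpixctl
print(uv_fields)
print(lin_fields)
```

Under that assumed layout the linpixctl value comes out with the same CH-NORM and pixel type as uvctl/xyctl, but with zero in the tile and width fields, which matches the remark that there is no tile or width information.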
; initialise parameters for the routine

    mv_s #0,desty           ;Start at dest y=0
    mv_s #0,destx           ;Start at dest x=0
    ld_s ctr,x              ;Use counter, to make it move
    ld_s ctr,y              ;Same for Y
    lsl #13,x               ;make it half a source pixel a frame
    lsl #14,y               ;same
    mv_s #$300000,xi        ;Source X inc *in 8:24*
    mv_s #$180000,yi        ;Source Y inc *in 8:24*
    mv_s #-$40,xii          ;x-inc-inc
    mv_s #-$10,yii          ;y-inc-inc
    mv_s #$c0000,xs         ;Source X step *in 8:24*
    mv_s #$140000,ys        ;Source Y step *in 8:24*
    mv_s #-$1000,xss        ;x-step-step (8:24)
    mv_s #-$1c00,yss        ;y-step-step (8:24)
    mv_s #-$300,xis         ;x-increment-step (8:24)
    mv_s #-$40,yis          ;y-increment-step (8:24)
    mv_s #$40,xsss          ;guess
    mv_s #$22,ysss          ;it's obvious really
    mv_s #360,destw         ;Width of dest rectangle
    mv_s #240,desth         ;Height of dest rectangle
    sub out_buffer,out_buffer ;select buffer offset of 0

Obviously, there are now a few more parameters to load. Note that I have given some of the parameters more bits of fraction. This is to allow for specifying the second and third order warp parameters with more precision than is actually used inside the loop.
; now the outer loop

    st_s desth,rc1          ;I am going to use rc1 to count off the height...
    mv_s #4,four            ;gonna use it as a constant :-)

I am using rc1 to count the outer loop iterations, so I can reuse the register that held desth (r11) for a constant that is used to update the output buffer pointer in the inner loop.
warp_outer:
    push v2                 ;save the source X and Y, and the width and height
    push v3                 ;save the dest X and Y
    push v4                 ;lead v4 not into corruption...
    push v6                 ;and deliver the contents of v6 from molestation
    push v7                 ;guess what

These pushes preserve variables that need to be restored at the end of a scanline, and the stuff that is going to get mashed by being used as pixel vectors.
    asr #8,xi               ;convert these to 16:16 for inner loop use
    asr #8,yi               ;convert these to 16:16 for inner loop use

Here I convert the parameters that are held outside the inner loop with extra bits of fraction back to the format that is actually used inside the loop.
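A quick Python sketch of what that shift does to the numbers may help. It assumes the formats are exactly as the comments say (8:24 outside the loop, 16:16 inside); the helper names are made up, and the step of $40 in the last part is just a hypothetical small value chosen to show why the extra fraction bits are worth having.

```python
def to_float_8_24(value):
    """Interpret an integer as 8:24 fixed point."""
    return value / (1 << 24)


def to_float_16_16(value):
    """Interpret an integer as 16:16 fixed point."""
    return value / (1 << 16)


def asr(value, bits):
    """Arithmetic shift right (Python's >> on ints keeps the sign)."""
    return value >> bits


xi_8_24 = 0x300000              # xi as loaded in the listing
xi_16_16 = asr(xi_8_24, 8)      # what the inner loop sees

# Same real value, fewer fraction bits.
print(to_float_8_24(xi_8_24), to_float_16_16(xi_16_16))

# A hypothetical per-scanline step of $40 is too small to survive the shift...
small_step = 0x40
lost = asr(small_step, 8)
# ...but accumulated in 8:24 over four scanlines it shows up in 16:16.
kept = asr(4 * small_step, 8)
print(lost, kept)
```

So holding the slow-changing parameters in 8:24 lets tiny per-scanline tweaks add up over several lines instead of being rounded away each time.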
; and now the inner.

warp_inner:
    mv_s #64,r0             ;This is the maximum number of pixels for one DMA.
    sub r0,destw            ;Count them off the total dest width.
    bra gt,w_1,nop          ;do nothing if this is positive

There was a st_s rx in that delay slot, but due to the new way we are using rx and ry, it is no longer needed. Since that would leave just two empty NOPs in the delay slots, I have binned them and used the bra instruction form that generates the NOPs for you.
    add destw,r0            ;If negative, modify the number of pixels to generate.
w_1:
    jsr pixel_gen           ;Go and call the pixel generation loop
    mv_s r0,dma_len         ;Set the dma length in my dma vector
    st_s r0,rc0             ;Set the counter for the pixgen loop

; Pixel gen function will return here after having generated and DMA'd out the pixels

    cmp #0,destw            ;Did the width go negative?
    bra gt,warp_inner       ;No, it did not, carry on the horizontal traverse of the dest rectangle
    add dma_len,destx       ;add dma_len to the dest x position
    nop                     ;empty delay slot

; Horizontal span is finished if we fall through to here

    pop v7                  ;no surprise
    pop v6                  ;restore the purity of v6
    pop v4                  ;undo the nastiness we did to v4
    pop v3                  ;restore dest X and Y
    pop v2                  ;restore source X and Y

Sundry unstackings to restore what got mashed.
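The way the horizontal loop above carves a scanline into DMA-sized pieces can be modelled in Python as follows. This is a sketch that mirrors the sub/add/cmp sequence as I read it from the listing; the function name `dma_chunks` is invented, and it returns (destx, length) pairs rather than actually generating pixels.

```python
def dma_chunks(destw, max_len=64):
    """Return (destx, length) for each DMA chunk of one scanline."""
    chunks = []
    destx = 0
    while True:
        r0 = max_len
        destw -= r0              # sub r0,destw
        if destw <= 0:           # bra gt,w_1 not taken
            r0 += destw          # add destw,r0: trim the last chunk
        dma_len = r0
        chunks.append((destx, dma_len))
        destx += dma_len         # add dma_len,destx (delay slot, always runs)
        if destw <= 0:           # cmp #0,destw / bra gt,warp_inner not taken
            break
    return chunks


chunks_360 = dma_chunks(360)
print(chunks_360)
```

For the 360-pixel width used here that gives five full 64-pixel chunks followed by one of 40, and a width that is an exact multiple of 64 finishes on a full chunk rather than an empty one.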
    asr #8,xs,r0            ;change these from 8:24
    asr #8,ys,r1            ;change from 8:24
    add #1,desty            ;point to next line of dest
    add r0,x                ;add the X step to the source
    add r1,y                ;add the Y step to the source
    add xss,xs              ;add x step inc
    add yss,ys              ;add y step inc
    add xis,xi              ;add x inc step
{
    add yis,yi              ;add y inc step
    dec rc1                 ;decrement the Y size
}
    jmp c1ne,warp_outer     ;loop for entire height
    add xsss,xss            ;another tweaker for the step
    add ysss,yss

Finally the loop count is decremented and the conditional branch evaluated, and the parameters that change every scanline are updated.
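Here is a rough Python model of that end-of-scanline update, just to show how the parameters cascade into one another. It is a sketch using plain integers for the fixed-point values, with the initial numbers copied from the parameter setup earlier; the dictionary layout and the name `step_scanline` are illustrative only.

```python
def step_scanline(p):
    """Apply one end-of-scanline update and return the new parameters."""
    p = dict(p)
    p["x"] += p["xs"] >> 8      # asr #8,xs,r0 / add r0,x
    p["y"] += p["ys"] >> 8      # asr #8,ys,r1 / add r1,y
    p["xs"] += p["xss"]         # the step itself drifts...
    p["ys"] += p["yss"]
    p["xi"] += p["xis"]         # ...as does the per-pixel increment
    p["yi"] += p["yis"]
    p["xss"] += p["xsss"]       # and the drift of the step drifts too
    p["yss"] += p["ysss"]
    return p


start = {
    "x": 0, "y": 0,
    "xi": 0x300000, "yi": 0x180000,
    "xs": 0xC0000, "ys": 0x140000,
    "xss": -0x1000, "yss": -0x1C00,
    "xis": -0x300, "yis": -0x40,
    "xsss": 0x40, "ysss": 0x22,
}
after_one = step_scanline(start)
print({k: hex(v) for k, v in after_one.items()})
```

Each scanline therefore starts from a slightly different place, with slightly different increments, than a plain linear stepper would give, and that is where the bendiness down the screen comes from.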
; all done!

    pop v7,rz               ;get back return address
    nop
    rts t,nop               ;and return

pixel_gen:

; This is the pixel generation function. It collects *bilerped* pixels from the 8x8 pattern buffer and
; deposits them in the linear destination buffer for output to external RAM.

    mv_s dma_dbase,r15      ;save this in a spare reggy in v3
{
    add out_buffer,dma_dbase ;Generate the real address of the buffer
    push v3                 ;I am going to use v3 as an extra pixel holder.
}

; Now, outside of the actual loop, I am gonna load up my stuff.

    st_s x,ru               ;Initialise bilinear U pointer
    st_s y,rv               ;Initialise bilinear V pointer
    st_s x,rx               ;Initialise bilinear X pointer
    st_s y,ry               ;Initialise bilinear Y pointer
{
    ld_p (uv),pixel         ;Grab a pixel from the source
    addr #1,ru              ;go to next horiz pixel
    add xi,x
}

That add xi,x performs a necessary preincrement of x prior to entering the inner loop.
{
    ld_p (uv),pixel2        ;Get a second pixel
    addr #1,rv              ;go to next vert pixel
}
{
    ld_p (uv),pixel4        ;get a third pixel
    addr #-1,ru             ;go to prev horizontal pixel
    sub #4,dma_dbase        ;point at start of buffer -4
}
{
    ld_p (uv),pixel3        ;get a fourth pixel
    addr #-1,rv             ;go back to original pixel
    sub_sv pixel,pixel2     ;b=b-a
}
    addr #1,ry

The addr #1,ry is the offset that I need on the xy pair to be able to pick up my pixels in a more efficient manner.
bilerp:

; Here is the bilerp part.

{
    mv_v pixel,pixel5       ;save a copy of first pixel, freeing up pixel 1.
    mul_p ru,pixel2,>>#14,pixel2 ;scale according to fractional part of ru
    sub_sv pixel3,pixel4    ;make vector between second 2 pixels
    addr yi,ry              ;Point ry to next y
}
{
    st_s x,(ru)             ;Can now update ru, finished multiplying with it.
    mul_p ru,pixel4,>>#14,pixel4 ;scale according to fractional part of ru
    sub_sv pixel3,pixel
    addr xi,rx              ;(XY) now points at next pixel 1
}
{
    ld_p (xy),pixel3        ;Loading next pixel 1.
    addr #-1,ry             ;Pointing to next pixel 3.
    add_sv pixel2,pixel     ;get first intermediate result
    dec rc0                 ;Decrementing the loop counter.
}

I am loading the pixels in a slightly different order. pixel3 is now loaded before pixel.
{
    ld_p (xy),pixel         ;getting next pixel 3.
    sub_sv pixel,pixel4     ;get vector to final value
    addm four,dma_dbase,dma_dbase
    addr #1,rx              ;Working over to point to pixel 2.
}

And here is where I squeeze in my dest buffer pointer increment on the multiplier.
{
    mul_p rv,pixel4,>>#14,pixel4 ;scale with fractional part of rv
    add_sv pixel2,pixel5    ;add pix2 to the copy of pix1
    addr yi,rv
}

Now rv is updated by using yi and the addr instruction.
{
    ld_p (xy),pixel2        ;load up next pixel2
    addr #1,ry              ;point to next pixel 4
    bra c0ne,bilerp         ;start the branch
    add xii,xi              ;Incrementing the x increment
}

Here's one of the warp parameters getting updated.
{
    ld_p (xy),pixel4        ;get next pixel4
    add_sv pixel4,pixel5    ;make final pixel value
    addr #-1,rx             ;start putting these right
    addm yii,yi,yi          ;do Y-inc-inc
}

And here's the other one.
{
    st_p pixel5,(dma_dbase) ;Deposit the pixel in the dest buffer
    sub_sv pixel,pixel2     ;b=b-a
    addm xi,x,x             ;do x inc
}

; Now, the pixel buffer is full, so it is time to DMA it out to external RAM.
;
; To implement simple double-buffering of the DMA out, we have to do
; the following: wait for (a) the PENDING bit to go clear, which will
; mean that DMA is ready to accept a command; and (b), make sure that
; the ACTIVE level is never greater than (#buffers-1). Here we are using
; 2 buffers, so we wait until it is 1.

dma_avail:
    ld_s mdmactl,r0         ;Get the DMA status.
    nop
    btst #4,r0              ;Pending?
    bra ne,dma_avail        ;Yeah, gotta wait.
    bits #3,>>#0,r0         ;Extract the ACTIVE level
    cmp #1,r0               ;check against (#buffers-1)
    bra gt,dma_avail,nop    ;Wait until it is OK.

; Now we know DMA is ready, so we can proceed to set up and launch the DMA write.

    mv_s #dmaFlags,r0       ;Get DMA flags for this screentype.
    ld_s dest,r1            ;Address of external RAM screen base
    copy destx,r2           ;destination xpos
    copy desty,r3           ;destination ypos
    lsl #16,dma_len,r4      ;shift DMA size up
    or r4,r2                ;and combine with x-position
    bset #16,r3             ;make Y size = 1
    mv_s #dma__cmd,r4       ;address of DMA command buffer in local RAM
    st_v v0,(r4)            ;set up first vector of DMA command
    add #16,r4              ;point to next vector
    add out_buffer,dma_dbase,r0 ;point to the buffer we just drew
    st_s r0,(r4)            ;place final word of DMA command
    sub #16,r4              ;point back to start of DMA command buffer
    st_s r4,mdmacptr        ;launch the DMA

; Because we are double buffering, there is no need to wait for
; DMA to complete. We can switch buffers, return and get straight on with the
; next line.

    rts                     ;Return to the main loops.
{
    ld_s rv,y               ;fix this coz of preincrement
    sub xi,x
}
    eor #1,<>#-8,out_buffer ;Toggle the buffer offset twixt 0 and 256.

The previously empty delay slot is used to do a couple of fixups to stuff that was used in the inner loop.
Because we removed direct adds to y, and that variable needs to be preserved across the whole scanline, we get the value of y from whatever it happened to be at the end of the loop by loading it from rv. We also subtract xi from x, because it has been incremented one extra time - it was preincremented before we started the inner loop.
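The x fixup is easy to get wrong, so here is a small Python model that checks it against a plain "sample, then step" reference. It is a sketch based on my reading of the order in which the listing updates ru, xi and x; the function names are invented, the values are arbitrary integers, and only the x side is modelled (y is simply reloaded from rv).

```python
def run_chunk(x, xi, xii, count):
    """Model x/xi/ru handling for one call of the pixel generator."""
    samples = []
    ru = x                  # st_s x,ru before the loop
    x += xi                 # preincrement: add xi,x
    for _ in range(count):
        samples.append(ru)  # this pixel is sampled at the current ru
        ru = x              # st_s x,(ru): position for the next pixel
        xi += xii           # add xii,xi
        x += xi             # addm xi,x,x
    x -= xi                 # the fixup: sub xi,x in the rts delay slot
    return samples, x, xi


def reference(x, xi, xii, count):
    """The obvious version: sample at x, then step x and bend xi."""
    samples = []
    for _ in range(count):
        samples.append(x)
        x += xi
        xi += xii
    return samples, x, xi


first = run_chunk(0, 10, -1, 3)
second = run_chunk(first[1], first[2], -1, 2)
whole = reference(0, 10, -1, 5)
print(first, second, whole)
```

In this model, two chunks run back to back visit exactly the same source positions as one uninterrupted run, which is the point of subtracting xi on the way out: without it the next chunk would start one increment too far along.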
Well, that's about it for this example. You may like to run warp7, by using the batch file "m7" - this example shows a warping screen image based on a version of the Warp code that has been tweaked to use a 16x16 source tile, and which uses the ctr counter to munge some of the parameters over time, bending the warp. From here, it'd be a simple matter to tweak this core code into something you could use as a trippy background screen somewhere - you might consider, for example, implementing some realtime pattern generation that runs on the source tile rather than using a static, pre-defined image. You can also use this code as a framework for exploring more complex forms of algorithmic pattern generation - how about using more than one source tile, and combining them in some cool way using a more complex inner loop?
You may introduce a few extra ticks into the inner loop by adding more stuff, but don't worry too much about that - so far, we've only been using the one processor! While that warp is running, most of the system is idle! And now that you've got a toehold into writing native MPE code, and you've seen what you can do just using one processor, I'm sure you'll be just itching to code some more interesting stuff and start using more of the capacities of the whole Merlin chip.