Adding Extra Functionality

 Hey, I think that acid must be coming on!

To assemble and run this example, use the batch file "m6" in the Warpcode directory.  
 

 Now here's an interesting tale of how really wanting to do something can make the scales fall from your eyes and allow you to see scope for even more optimisation where you thought there was none to be had.

 As stated at the outset, I wanted to traverse the source image in a cool and interesting manner. The bilinear interpolation looked good, but it was still very linear. I had had in mind all along turning this into a nonlinear warp effect by the simple method of adding a couple of extra variables, xii and yii, and adding them to the x-increment and y-increment at each pixel.
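
To get a feel for what xii and yii do, here is a minimal Python sketch of one axis. It is an illustration only, not the MPE code: the function name trace and the floating-point values are made up for the example, whereas the real loop works in fixed point.

```python
def trace(x, xi, xii, n):
    """Return the n source positions visited along one scanline."""
    positions = []
    for _ in range(n):
        positions.append(x)
        x += xi      # step to the next source position
        xi += xii    # the increment itself drifts: this is the nonlinearity
    return positions

linear = trace(0.0, 0.5, 0.0, 5)      # xii = 0: evenly spaced samples
warped = trace(0.0, 0.5, -0.125, 5)   # xii < 0: each step is shorter than the last
```

With xii at zero the samples are evenly spaced; with a negative xii each step is a little shorter than the one before, and that is what bends the image.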

 That meant that I had to squeeze into my already well-packed inner loop two extra adds, add xii,xi and add yii,yi. One of them slipped in nicely on the multiplier at the end of the loop. I had one more to add... and I looked and looked at the loop... multiplier saturated... ALU saturated... reg unit saturated... mem unit saturated.... it was increasingly looking like I was going to have to add an extra instruction for just one poxy add, and that was going to push the framerate down.

 The thought of doing that just burned my ass.

 I wanted my nice warp effect in 8 ticks of inner loop, and I would give up llamas for life before I would sully my nice, pert, firmly packed inner loop with a single, isolated poxy add instruction.

 I made myself a really hot cup of tea and sat down to study the code, looking for any way I could make a hole in the flow for any of the function units. Finally, I saw the light. If you consider the four pixels that you have to pick up for the bilerp to be in this formation:

{a b}
{c d}
then, due to the way the pixel registers became freed up during the calculation, I was loading them in the order a,c,b,d. This is somewhat inelegant, as it needs two addr instructions to go from pixel c to pixel b. I grokked that if I were to enter the loop with XY already set up to look at the "c" pixel (i.e. add 1 to ry), then, with a slight modification to the calculation to free up registers in a different order, I could load the pixels in the order (c,a,b,d), thus freeing up two register unit slots, one in the middle and one at the end of the loop. If I were to update rv with an addr instruction instead of adding to Y and then using st_s rv, I could just use ld_s rv,y outside of the loop to maintain the correct value across the scanline. Not having to add to Y in the inner loop would then free up a (precious!) multiplier slot. So, if I were to free up a register to use as a constant four in the inner loop, I could add to the destination buffer pointer using the spare addm slot, and I'd be home free. This I could do if I used the counter rc1 for the outer loop, instead of just subtracting from desth. I would not need to keep desth at all, so I could free up its register for my constant four.
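
For reference, the arithmetic that the bilerp performs on those four pixels can be sketched in Python as below. This is a per-channel illustration under my own naming (bilerp, fu, fv are hypothetical); the MPE loop does the same blend on whole pixel vectors, using the fractional parts of ru and rv as the weights.

```python
def bilerp(a, b, c, d, fu, fv):
    """Blend four values laid out as {a b} over {c d}.

    fu and fv are the horizontal and vertical fractions, each in [0, 1).
    """
    top = a + (b - a) * fu          # blend along the top row
    bottom = c + (d - c) * fu       # blend along the bottom row
    return top + (bottom - top) * fv  # blend between the two rows
```

The order in which a, b, c and d are fetched does not change the result, which is why the loads can be shuffled to c,a,b,d purely to suit the register unit.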

 I duly implemented, and it worked. The moral of this little digression being that, even if your inner loop is stuffed to bursting, you can often find a way to squeeze in a little bit more, if you really want to and you drink enough tea.

 Anyway, with the inner loop nicely sorted, we should finish off the functionality of the warper. We wanted lots of nice parameters for tweaking, remember? Since the main parameters controlling the shape of the warpage are the increment registers xi, yi, xii and yii, it seemed sensible to add a buncha extra params to allow for much tweakage of same. To this end I added the following goodies:

   xs, ys      source X and Y step, added to the source x and y at the end of every scanline
   xss, yss    step-step, added to xs and ys every scanline
   xis, yis    increment-step, added to xi and yi every scanline
   xsss, ysss  one more tweaker for the step, added to xss and yss every scanline

 

I'll now list the resulting source code, once again commenting in bold where there's anything of note to say about it.

 



;
; warp6.a - makes the warp nonlinear

; here's some definitions

        .include        "merlin.i"          ;general Merlin things
        .include    "scrndefs.i"            ;defines screen buffers and screen DMA type
        .start  go
        .segment        local_ram
        .align.v

; buffer for internal pixel map (1 DMA's worth)

buffer:

        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

; output line buffer (1 DMA's worth x2, for double buffering)

line:

        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


; DMA command buffer

dma__cmd:

        .dc.s   0,0,0,0,0,0,0,0


; destination screen address

dest:   .dc.s   dmaScreen2

; frame counter

ctr:    .dc.s   0

; reg equates for this routine

        x = r8
        y = r9
        pixel = v1
        pixel2 = v0
        pixel3 = v6
        pixel4 = v7  
        pixel5 = v3      
        destx = r12
        desty = r13
        dma_len = r14
        destw = r10
        desth = r11
        four = r11
        yi = r16
        xi = r17
        xii = r18
        yii = r19
        xs = r24
        ys = r25
        xss = r26
        yss = r27
        xis = r28
        yis = r29
        xsss = r30
        ysss = r31
    
        dma_mode = r20
        dma_dbase = r21
        out_buffer = r22

Note the new registers for all the parameters. I also shifted around a couple of the others for ease of access of stuff inside the inner loop, and to be easily able to stack certain params that need to be restored at the end of every scanline.
        .segment        instruction_ram

go:


        st_s    #$aa,intctl                   ;turn off any existing video
        st_s    #(local_ram_base+4096),sp     ;here's the SP

; clear the source buffer to *random* pixels, using the pseudo random sequence generator
; out of Graphics Gems 1

        mv_s    #$a3000000,r2                   ;This is the mask
        mv_s    #$b3725309,r0                   ;A random seed
        mv_s    #buffer,r1                      ;Address of the source buffer
        st_s    #64,rc0                         ;This is how many pixels to clear
cl_srceb:
        btst    #0,r0                           ;Check bit zero of the current seed
        bra     eq,nxor                         ;Do not xor with the mask if it ain't set
        lsr     #1,r0                           ;Always shift the seed, whatever happens
        dec     rc0                             ;dec the loop counter
        eor     r2,r0                           ;If that bit was 1, xor in the mask
nxor:
        bra     c0ne,cl_srceb                   ;loop for all the pixels
        st_s    r0,(r1)                         ;store the pixel        
        add     #4,r1                           ;point to next pixel address
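
In Python terms, the generator in the loop above behaves roughly like this. The mask and seed are the values from the listing; the function names are hypothetical, and the sketch returns the values in a list rather than storing them to the buffer.

```python
MASK = 0xA3000000   # the mask loaded into r2 in the listing

def next_random(seed):
    """One step of the shift-and-conditionally-XOR generator."""
    low_bit = seed & 1
    seed >>= 1                 # always shift right by one
    if low_bit:
        seed ^= MASK           # XOR in the mask only if the old bit 0 was set
    return seed

def random_pixels(seed, count):
    """Generate count successive values, one per pixel."""
    out = []
    for _ in range(count):
        seed = next_random(seed)
        out.append(seed)
    return out
```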

; set up a simple cross-shaped test pattern in the buffer RAM

        mv_s    #$51f05a00,r0                   ;Pixel colour (a red colour)
        mv_s    #buffer+(32*4),r1               ;Line halfway down buffer
        mv_s    #buffer+16,r2                   ;Column halfway across top line of buffer
        st_s    #8,rc0                          ;Number of pixels to write

testpat:

        st_s    r0,(r1)                         ;Store pixel value at row address.
        st_s    r0,(r2)                         ;Store pixel value at column address.
        dec     rc0                             ;Decrement loop counter.
        bra     c0ne,testpat                    ;Loop if counter not equal to 0.
        add     #4,r1                           ;Increment row address by one pixel.
        add     #32,r2                          ;Increment column address by one line.

; now, initialise video

    jsr SetUpVideo,nop

frame_loop:

; generate a drawscreen address 

    mv_s    #dmaScreenSize,r0       ;this lot selects one of
    mv_s    #dmaScreen3,r3          ;three drawscreen buffers
    ld_s    dest,r1                 ;this should be inited to a
                                    ;valid screen buffer address
    nop
    cmp     r3,r1
    bra     ne,updatedraw
    add     r0,r1             
    nop
    mv_s    #dmaScreen1,r1          ;reset buffer base
updatedraw:
    st_s    r1,dest                 ;set current drawframe address
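
The buffer selection above amounts to the following, sketched in Python. The addresses in the usage lines are invented for illustration, and the sketch assumes, as the listing appears to, that the three screen buffers sit one after another in memory, dmaScreenSize apart.

```python
def next_draw_buffer(dest, screen1, screen3, screen_size):
    """Pick the next draw buffer: step forward, or wrap from the third to the first."""
    if dest == screen3:
        return screen1
    return dest + screen_size

# Hypothetical addresses, just to show the rotation 2 -> 3 -> 1 -> 2.
SCREEN1, SIZE = 0x1000, 0x100
SCREEN2, SCREEN3 = SCREEN1 + SIZE, SCREEN1 + 2 * SIZE
order = []
dest = SCREEN2
for _ in range(3):
    dest = next_draw_buffer(dest, SCREEN1, SCREEN3, SIZE)
    order.append(dest)
```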

; actually draw a frame

    jsr drawframe,nop
    
; increment the frame counter

    ld_s    ctr,r0
    nop
    add #1,r0
    st_s    r0,ctr    
    
; set the address of the frame just drawn on the video system

    jsr SetVidBase
    ld_s    dest,r0
    nop
    
; loop back for the next frame

    bra frame_loop,nop
    


drawframe:

; save the return address for nested subroutine calls

    push    v7,rz
    
; ensure that any pending DMA is complete.  Whilst it
; is not really necessary at the moment, it is good form,
; for later on we may arrive at the start of a routine
; while DMA is still happening from something we did before.

    jsr dma_finished,nop                

; initialise the bilinear addressing registers


        st_s    #buffer,uvbase          ;I want *UV* to point at the buffer here.
        st_s    #buffer,xybase          ;I want XY to point at the buffer here too.
        st_s    #$104dd008,uvctl        ;UV type, derived as follows:
                                        ;Bit 28 set, I wanna use CH-NORM.
                                        ;Pixel type set to 4 (32-bit pixels).
                                        ;UTILE and VTILE both set to 13 (treat the buffer as an 8x8 tilable bitmap).
                                        ;The width is set to 8 pixels.
        st_s    #$104dd008,xyctl        ;XY type, same as UV type
        mv_s    #line,dma_dbase         ;Store the same address as double buffer base.
        st_s    #$10400000,linpixctl    ;Linear pixel mode, derived as follows:
                                        ;Bit 28 set, I wanna use CH-NORM.
                                        ;Pixel type set to 4 (32-bit pixels).
Notice that, since I have modified the algorithm to have both xy and uv working on the same bitmap, I set up both xy and uv the same way now.  In the new version I am addressing the output buffer directly through an address held in a scalar. Since I am still going to be storing pixels using st_p, I have to set up linpixctl to tell the system what kind of pixels I am using. linpixctl is very similar to xy/uvctl, except that obviously there is no tile or width information.
; initialise parameters for the routine

        mv_s    #0,desty                                ;Start at dest y=0
        mv_s    #0,destx                                ;Start at dest x=0
        ld_s    ctr,x                                   ;Use counter, to make it move
        ld_s    ctr,y                                   ;Same for Y
        lsl #13,x                                       ;make it an eighth of a source pixel a frame (16:16)
        lsl #14,y                                       ;same idea, a quarter of a source pixel a frame
        mv_s    #$300000,xi                             ;Source X inc *in 8:24*
        mv_s    #$180000,yi                             ;Source Y inc *in 8:24*
        mv_s    #-$40,xii                               ;x-inc-inc
        mv_s    #-$10,yii                               ;y-inc-inc
        mv_s    #$c0000,xs                              ;Source X step *in 8:24*
        mv_s    #$140000,ys                             ;Source Y step *in 8:24*
        mv_s    #-$1000,xss                             ;x-step-step (8:24)
        mv_s    #-$1c00,yss                             ;y-step-step (8:24)
        mv_s    #-$300,xis                              ;x-increment-step  (8:24)
        mv_s    #-$40,yis                               ;y-increment-step  (8:24)
        mv_s    #$40,xsss                               ;guess
        mv_s    #$22,ysss                               ;it's obvious really    
        mv_s    #360,destw                              ;Width of dest rectangle
        mv_s    #240,desth                              ;Height of dest rectangle
        sub out_buffer,out_buffer                       ;select buffer offset of 0
Obviously, there are now a few more parameters to load. Note that I have given some of the parameters more bits of fraction. This is to allow for specifying the second and third order warp parameters with more precision than is actually used inside the loop.
; now the outer loop

        st_s    desth,rc1                               ;I am going to use rc1 to count off the height...
        mv_s    #4,four                                 ;gonna use it as a constant :-)
I am using rc1 to count the outer loop iterations, so desth no longer needs a register of its own once the height has been copied into rc1. That lets me reuse r11 for the constant four, which updates the output buffer pointer in the inner loop.
warp_outer:

        push    v2      ;save the source X and Y, and the width and height
        push    v3      ;save the dest X and Y  
        push    v4      ;lead v4 not into corruption...
        push    v6      ;and deliver the contents of v6 from molestation
        push    v7      ;guess what
These pushes preserve variables that need to be restored at the end of a scanline, and the stuff that is going to get mashed by being used as pixel vectors.
        asr     #8,xi   ;convert these to 16:16 for inner loop use
        asr     #8,yi   ;convert these to 16:16 for inner loop use
The increments are held outside the inner loop with more bits of fraction (8:24), so here I convert them back to the 16:16 format that is actually used inside the loop.
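
As a quick sanity check on the two formats, here is a small Python sketch of the conversion. The helper names are hypothetical; the constants are the ones loaded in the listing. Python's >> on integers keeps the sign, so it stands in for asr here.

```python
def from_8_24_to_16_16(value):
    """Drop 8 bits of fraction, as asr #8 does."""
    return value >> 8

def as_float_16_16(value):
    """Interpret a 16:16 fixed-point integer as a real number."""
    return value / 65536.0

xi_8_24 = 0x300000                        # source X inc as loaded
xi_16_16 = from_8_24_to_16_16(xi_8_24)    # value used inside the loop
xi_pixels = as_float_16_16(xi_16_16)      # source pixels stepped per dest pixel

# A small per-scanline tweak such as xis = -$300 is only -3 in 16:16,
# which is why the tweaks are accumulated in 8:24 before converting.
xis_16_16 = from_8_24_to_16_16(-0x300)
```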
; and now the inner.

warp_inner:


        mv_s    #64,r0          ;This is the maximum number of pixels for one DMA.
        sub     r0,destw        ;Count them off the total dest width.
        bra     gt,w_1,nop      ;do nothing if this is positive
There was a st_s rx in that delay slot, but due to the new way we are using rx and ry, it is no longer needed.  Since that would leave just two empty NOPs in the delay slots, I have binned them and used the bra instruction form that generates the NOPs for you.
        add     destw,r0                                ;If negative, modify the number of pixels to generate.
w_1:
        jsr     pixel_gen                               ;Go and call the pixel generation loop
        mv_s    r0,dma_len                              ;Set the dma length in my dma vector
        st_s    r0,rc0                                  ;Set the counter for the pixgen loop

; Pixel gen function will return here after having generated and DMA'd out the pixels

        cmp     #0,destw                                ;Did the width go negative?
        bra     gt,warp_inner                           ;No, it did not, carry on the horizontal traverse of the dest rectangle
        add     dma_len,destx                           ;add dma_len to the dest x position
        nop                                             ;empty delay slot
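
The way warp_inner carves a scanline into DMA-sized pieces can be sketched in Python like this. The function name and the list it returns are my own; the logic follows the sub, bra and add sequence above.

```python
MAX_DMA_PIXELS = 64   # the most pixels one DMA can carry, as in the listing

def chunk_lengths(dest_width):
    """Return the pixel count handed to pixel_gen for each chunk of one scanline."""
    lengths = []
    remaining = dest_width
    while True:
        length = MAX_DMA_PIXELS
        remaining -= MAX_DMA_PIXELS      # sub r0,destw
        if remaining <= 0:
            length += remaining          # add destw,r0: trim the final chunk
        lengths.append(length)
        if remaining <= 0:               # cmp #0,destw / bra gt,warp_inner
            break
    return lengths
```

For the 360-pixel width used here, that gives five full chunks of 64 and a final chunk of 40.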

; Horizontal span is finished if we fall through to here

        pop     v7                                      ;no surprise
        pop     v6                                      ;restore the purity of v6
        pop     v4                                      ;undo the nastiness we did to v4
        pop     v3                                      ;restore dest X and Y
        pop     v2                                      ;restore source X and Y
Sundry unstackings to restore what got mashed.
        asr     #8,xs,r0                                ;change these from 8:24
        asr     #8,ys,r1                                ;change from 8:24
        add     #1,desty                                ;point to next line of dest
        add     r0,x                                    ;add the X step to the source
        add     r1,y                                    ;add the Y step to the source
        add     xss,xs                                  ;add x step inc
        add     yss,ys                                  ;add y step inc
        add     xis,xi                                  ;add x inc step
{       add     yis,yi                                  ;add y inc step
        dec     rc1                                     ;decrement the Y size
}
        jmp     c1ne,warp_outer                         ;loop for entire height
        add     xsss,xss                                ;another tweaker for the step
        add     ysss,yss
Finally the loop count is decremented and the conditional branch evaluated, and the parameters that change every scanline are updated.
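
Here is the same end-of-scanline bookkeeping as a Python sketch. The dictionary and function name are hypothetical; the keys match the register equates and the starting values are those loaded in the listing. Each update reads the values as they stood at the start of the scanline, as in the assembly.

```python
def step_scanline(p):
    """Advance the warp parameters by one scanline; p maps names to integers."""
    q = dict(p)
    q["x"] += p["xs"] >> 8     # x and y are 16:16, the steps are 8:24
    q["y"] += p["ys"] >> 8
    q["xs"] += p["xss"]        # the step drifts by the step-step
    q["ys"] += p["yss"]
    q["xi"] += p["xis"]        # the per-pixel increment drifts too
    q["yi"] += p["yis"]
    q["xss"] += p["xsss"]      # and the step-step has its own tweaker
    q["yss"] += p["ysss"]
    return q

start = {
    "x": 0, "y": 0,
    "xi": 0x300000, "yi": 0x180000,
    "xs": 0xC0000, "ys": 0x140000,
    "xss": -0x1000, "yss": -0x1C00,
    "xis": -0x300, "yis": -0x40,
    "xsss": 0x40, "ysss": 0x22,
}
after_one = step_scanline(start)
```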
; all done!

    pop v7,rz                       ;get back return address
    nop
    rts t,nop                       ;and return 


pixel_gen:

; This is the pixel generation function.  It collects *bilerped* pixels from the 8x8 pattern buffer and
; deposits them in the linear destination buffer for output to external RAM.

        mv_s    dma_dbase,r15               ;save this in a spare reggy in v3
{
        add     out_buffer,dma_dbase        ;Generate the real address of the buffer
        push    v3                          ;I am going to use v3 as an extra pixel holder.
}

; Now, outside of the actual loop, I am gonna load up my stuff.

        st_s    x,ru                        ;Initialise bilinear U pointer
        st_s    y,rv                        ;Initialise bilinear V pointer
        st_s    x,rx                        ;Initialise bilinear X pointer
        st_s    y,ry                        ;Initialise bilinear Y pointer
{
        ld_p    (uv),pixel                  ;Grab a pixel from the source
        addr    #1,ru                       ;go to next horiz pixel
        add     xi,x
}
That add xi,x performs a necessary preincrement of x prior to entering the inner loop.
{
        ld_p    (uv),pixel2                 ;Get a second pixel
        addr    #1,rv                       ;go to next vert pixel
}
{
        ld_p    (uv),pixel4                 ;get a third pixel
        addr    #-1,ru                      ;go to prev horizontal pixel
        sub     #4,dma_dbase                ;point at start of buffer -4
}
{
        ld_p    (uv),pixel3                 ;get a fourth pixel
        addr    #-1,rv                      ;go back to original pixel
        sub_sv  pixel,pixel2                ;b=b-a
}       
        addr    #1,ry
The addr #1,ry is the offset that I need on the xy pair to be able to pick up my pixels in a more efficient manner.
 

bilerp:

; Here is the bilerp part.

{
        mv_v    pixel,pixel5                    ;save a copy of first pixel, freeing up pixel 1.
        mul_p   ru,pixel2,>>#14,pixel2          ;scale according to fractional part of ru
        sub_sv  pixel3,pixel4                   ;make vector between second 2 pixels
        addr    yi,ry                           ;Point ry to next y
}
{
        st_s    x,(ru)                          ;Can now update ru, finished multiplying with it.
        mul_p   ru,pixel4,>>#14,pixel4          ;scale according to fractional part of ru
        sub_sv  pixel3,pixel
        addr    xi,rx                           ;(XY) now points at next pixel3 (the "c" pixel)
}
{
        ld_p    (xy),pixel3                     ;Loading next pixel3 (the "c" pixel).
        addr    #-1,ry                          ;Pointing to next pixel (the "a" pixel).
        add_sv  pixel2,pixel                    ;get first intermediate result
        dec     rc0                             ;Decrementing the loop counter.
}
I am loading the pixels in a slightly different order. pixel3 is now loaded before pixel.
{
        ld_p    (xy),pixel                      ;getting next pixel (the "a" pixel).
        sub_sv  pixel,pixel4                    ;get vector to final value
        addm    four,dma_dbase,dma_dbase
        addr    #1,rx                           ;Working over to point to pixel2 (the "b" pixel).
}
And here is where I squeeze in my dest buffer pointer increment on the multiplier.
{
        mul_p   rv,pixel4,>>#14,pixel4  ;scale with fractional part of rv
        add_sv  pixel2,pixel5           ;add pix2 to the copy of pix1
        addr    yi,rv
}
Now rv is updated by using yi and the addr instruction.
{
        ld_p    (xy),pixel2             ;load up next pixel2
        addr    #1,ry                   ;point to next pixel 4
        bra     c0ne,bilerp             ;start the branch
        add     xii,xi                  ;Incrementing the x increment
}
Here's one of the warp parameters getting updated
{
        ld_p    (xy),pixel4             ;get next pixel4
        add_sv  pixel4,pixel5           ;make final pixel value
        addr    #-1,rx                  ;start putting these right      
        addm    yii,yi,yi               ;do Y-inc-inc
}
And here's the other one.
{
        st_p    pixel5,(dma_dbase)      ;Deposit the pixel in the dest buffer
        sub_sv  pixel,pixel2            ;b=b-a
        addm    xi,x,x                  ;do x inc
}

; Now, the pixel buffer is full, so it is time to DMA it out to external RAM.
;
; To implement simple double-buffering of the DMA out, we have to do
; the following:  wait for (a) the PENDING bit to go clear, which will
; mean that DMA is ready to accept a command; and (b), make sure that
; the ACTIVE level is never greater than (#buffers-1).  Here we are using
; 2 buffers, so we wait until it is 1 or less.

dma_avail:

    ld_s    mdmactl,r0              ;Get the DMA status.
    nop
    btst    #4,r0                   ;Pending?
    bra ne,dma_avail                ;Yeah, gotta wait.
    bits    #3,>>#0,r0              ;Extract the ACTIVE level
    cmp #1,r0                       ;check against (#buffers-1)
    bra gt,dma_avail,nop            ;Wait until it is OK.
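
The test that dma_avail spins on can be written out in Python as below. This is a sketch with hypothetical names; in particular I am assuming the ACTIVE level occupies the low four bits of the status word, which is how I read the bits instruction here. Only the shape of the check matters.

```python
PENDING_BIT = 1 << 4      # bit 4 of the status word, tested by btst #4
ACTIVE_MASK = 0xF         # assumed width of the ACTIVE level field
NUM_BUFFERS = 2

def dma_can_launch(status):
    """True when a new DMA command may be issued without overrunning the buffers."""
    if status & PENDING_BIT:
        return False                       # a command is still pending
    return (status & ACTIVE_MASK) <= NUM_BUFFERS - 1
```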

; Now we know DMA is ready, so we can proceed to set up and launch the DMA write.    

    mv_s    #dmaFlags,r0            ;Get DMA flags for this screentype.
    ld_s    dest,r1                 ;Address of external RAM screen base
    copy    destx,r2                ;destination xpos
    copy    desty,r3                ;destination ypos
    lsl #16,dma_len,r4              ;shift DMA size up
    or  r4,r2                       ;and combine with x-position
    bset    #16,r3                  ;make Y size = 1
    mv_s    #dma__cmd,r4            ;address of DMA command buffer in local RAM
    st_v    v0,(r4)                 ;set up first vector of DMA command
    add #16,r4                      ;point to next vector
    add out_buffer,dma_dbase,r0     ;point to the buffer we just drew
    st_s    r0,(r4)                 ;place final word of DMA command
    sub #16,r4                      ;point back to start of DMA command buffer
    st_s    r4,mdmacptr             ;launch the DMA

; Because we are double buffering, there is no need to wait for
; DMA to complete.  We can switch buffers, return and get straight on with the
; next line.

    rts                             ;Return to the main loops.
{
    ld_s    rv,y                    ;fix this coz of preincrement
    sub xi,x
}
    eor #1,<>#-8,out_buffer         ;Toggle the buffer offset twixt 0 and 256.
The previously empty delay slot is used to do a couple of fixups to stuff that was used in the inner loop. Because we removed direct adds to y, and that variable needs to be preserved across the whole scanline, we get the value of y from whatever it happened to be at the end of the loop by loading it from rv. We also subtract xi from x, because it has been incremented one extra time - it was preincremented before we started the inner loop.
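
To convince yourself that the preincrement and the final subtract cancel out, here is a Python sketch of how x and xi evolve over one call to pixel_gen. It is a model with made-up names and small integer values, not the real code, and it tracks only the X side.

```python
def pixel_gen_x(x, xi, xii, n):
    """Mimic the evolution of x and xi while n pixels are generated."""
    sampled = []
    u = x                # ru starts at the current x
    x += xi              # preincrement before entering the loop
    for _ in range(n):
        sampled.append(u)    # this pixel is blended at the old ru
        u = x                # st_s x,(ru): ru takes the preincremented value
        xi += xii            # add xii,xi
        x += xi              # addm xi,x,x
    x -= xi              # fix-up in the rts delay slot
    return x, xi, sampled

# One call for five pixels, versus a call for three followed by a call for two.
whole = pixel_gen_x(0, 8, -1, 5)
first = pixel_gen_x(0, 8, -1, 3)
second = pixel_gen_x(first[0], first[1], -1, 2)
```

Splitting the scanline into chunks visits the same source positions as doing it in one go, so the fix-up leaves x exactly where the next chunk needs it.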

Well, that's about it for this example.  You may like to run warp7, by using the batch file "m7" - this example shows a warping screen image based on a version of the Warp code that has been tweaked to use a 16x16 source tile, and which uses the ctr counter to munge some of the parameters over time, bending the warp.  From here, it'd be a simple matter to tweak this core code into something you could use as a trippy background screen somewhere - you might consider, for example, implementing some realtime pattern generation that runs on the source tile rather than using a static, pre-defined image.  You can also use this code as a framework for exploring more complex forms of algorithmic pattern generation - how about using more than one source tile, and combining them in some cool way using a more complex inner loop?

You may introduce a few extra ticks into the inner loop by adding more stuff, but don't worry too much about that - so far, we've only been using the one processor!  While that warp is running, most of the system is idle!  And now that you've got a toehold into writing native MPE code, and you've seen what you can do just using one processor, I'm sure you'll be just itching to code some more interesting stuff and start using more of the capacities of the whole Merlin chip.

 

