Deep Optimisation

 ahhh, now that's more like it!

To assemble and run this example, use the batch file "m5" in the Warpcode directory.  

 


The framerate is, as you can see, a lot more acceptable than when we first started out. We are now cranking out bilerped pixels faster than the initial naive bit of code could just copy them.

 For deep optimisation of an inner loop, it's generally not good enough to just pack up the instructions that you wrote for the unpacked version and hope for the best. I find it best to take myself away from in front of the computer altogether and make myself a really hot cup of tea and go and pace about in front of the whiteboard.

 I find it helpful to first of all consider what the absolute minimum, best-case timing could be for the task in hand. In the case of the bilinear interpolation I wrote down something like this (ALU operations in the left-hand column, MUL unit operations alongside them, and with a, b, c and d representing the four pixels used in the interpolation). I assumed that at the start of the calculation, all four pixels were already loaded:

 
b=b-a
d=d-c b*RU
d*RU
b=b+a
d=d+c
d=d-b
d*RV
(NOP)
d=d+b
(write pixel)
This represented just the bare bones of the arithmetic operations necessary to produce the interpolated pixel, and I didn't worry too much about other stuff like loading the pixels in.
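In higher-level terms, that skeleton is just the standard two-axis lerp broken into two-operand steps. Here is a quick Python sketch (mine, not from the original source) showing that following the sequence literally gives the textbook bilinear result:

```python
def bilerp_sequence(a, b, c, d, u, v):
    """Follow the whiteboard sequence step by step. u and v stand in
    for the fractional parts that RU/RV feed to the multiplier."""
    b = b - a      # b=b-a
    d = d - c      # d=d-c
    b = b * u      # b*RU
    d = d * u      # d*RU
    b = b + a      # b=b+a  -> top edge:    a + (b-a)*u
    d = d + c      # d=d+c  -> bottom edge: c + (d-c)*u
    d = d - b      # d=d-b  -> vector from top edge to bottom edge
    d = d * v      # d*RV
    d = d + b      # d=d+b  -> final interpolated value
    return d

def bilerp_direct(a, b, c, d, u, v):
    """The textbook form, for comparison."""
    top = a + (b - a) * u
    bottom = c + (d - c) * u
    return top + (bottom - top) * v
```

Running both on the same inputs gives identical results, which is exactly why the nine-slot schedule above is the bare minimum of arithmetic for one output pixel.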

 Looking at the algorithm, I realised I could start the third multiply one tick earlier by calculating one of the differences in parallel with the two initial multiplies, as follows:

 
b=b-a
d=d-c b*RU
c=a-c d*RU
c=c+b
d=d-c
b=a+b d*RV
(NOP)
d=d+b
(write pixel)
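It's worth convincing yourself that the rearrangement is algebraically identical to the first version; a Python sketch of the reordered sequence (again mine, for checking only):

```python
def bilerp_reordered(a, b, c, d, u, v):
    """The rearranged schedule: the c=a-c difference is computed in
    parallel with the initial multiplies, so d*RV can start a tick earlier."""
    b = b - a      # b=b-a
    d = d - c      # d=d-c
    b = b * u      # b*RU
    c = a - c      # c=a-c (runs alongside d*RU)
    d = d * u      # d*RU
    c = c + b      # c=c+b -> (a-c) + (b-a)*u, i.e. top edge minus c
    d = d - c      # d=d-c -> bottom edge minus top edge
    b = a + b      # b=a+b -> top edge: a + (b-a)*u
    d = d * v      # d*RV
    d = d + b      # d=d+b -> final interpolated value
    return d
```

The same inputs come out with the same answer as the straightforward sequence, so nothing has been lost in the shuffle.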
Now, what would be cool would be to start loading pixels for the *next* time around the loop as soon as their respective registers become freed up in the course of the calculation. After a degree of drinking tea and scribbling on the whiteboard I came up with the following:

 
b=b-a
d=d-c b*RU
c=a-c d*RU copy a to e load a
c=c+b
d=d-c load c
e=e+b d*RV load b
(NOP)
e=d+e load d
(write pixel)
By introducing an extra pixel register e, for intermediate results, I could see that registers became free quickly enough that it should be possible to load all four pixels for the next iteration in parallel with the current calculation without lengthening the loop at all. This was looking good, but there were problems. The values of the RU and RV indices would need to be modified to point to the next pixel, *but* they are also needed for the multiplies in the calculation. I would have to delay incrementing them to point to the next pixel until the multiplies were started, and doing that would mean I couldn't do the loads as soon as the registers became available, so I'd be scuppered.

 I did consider using the non-index-register version of mul_p to get around this, but that would have entailed fossicking around with the index values using the ALU to get them in the correct position for the shift inherent in the mul_p instruction, and the ALU is already very well loaded.

 Then I remembered that I was basically wasting the XY index pair by just using it to address a pair of linear buffers. If I were to write to the destination buffer through a direct scalar address, rather than using the index registers, I could use (XY) to point to the next pixel, whilst maintaining the current pixel address in UV for the calculation. It might mean a little extra code to handle the double buffering, but not much - and if it would speed the inner loop, it would be well worth it.

 Once I had that plan of attack in mind, it was just a case of looking to see how I could fit in the various increments and stores to set up the index registers. Eventually, assuming that before entry to the loop not only were the four initial pixels loaded but that the indices XY and UV had been set equal, I came up with this: (p is the pointer to the destination buffer)

 
addm xi,x b=b-a
addm yi,y d=d-c addr yi,RY
c=a-c b*RU mv a,e addr xi,RX
p=p+4 d*RU load a
c=c+b st_io x,RU addr #1,RY
d=d-c load c addr #-1,RY
e=e+b d*RV st_io y,RV addr #1,RX
(NOP) load b addr #1,RY
e=d+e load d addr #-1,RX
st_p e,(p) addr #-1,RY
This introduced an extra tick, due to the addm instructions at the start of the calculation delaying the first multiply. But, if I assumed that the first calculation (addm xi,x and b=b-a) were already done when I entered the main loop, I could squeeze those calculations into the latter part of the loop, once the next a and b were loaded:

 [SETUP]

 (load a,b,c,d)

 (set xy and uv)

 
addm xi,x b=b-a
[LOOP]

 
addm yi,y d=d-c addr yi,RY
c=a-c b*RU mv a,e addr xi,RX
p=p+4 d*RU load a addr #1,RY dec rc0
c=c+b st_io x,RU
d=d-c load c addr #-1,RY
e=e+b d*RV st_io y,RV addr #1,RX
add xi,x load b addr #1,RY bra c0ne,LOOP
e=d+e load d addr #-1,RX
b=b-a st_p e,(p) addr #-1,RY
Which looked pretty promising. Further pondering and drinking of tea finally yielded the following organisation for the inner loop code, shaving another tick off the loop and bringing home the bacon for a total of 814398 ticks for a full screen of 352x256 pixels, or well under one frame @ 60FPS (with three processors still unused!).
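As a quick sanity check on those figures (assuming the quoted total is the 8-tick inner loop plus per-line setup and DMA overhead):

```python
pixels = 352 * 256            # pixels per full screen
inner_loop = pixels * 8       # 8 ticks per pixel in the inner loop
total = 814398                # total quoted for the full screen
overhead = total - inner_loop # everything outside the inner loop: setup, DMA waits, etc.
per_pixel = total / pixels    # overall cost, just over 9 ticks per pixel
```

The inner loop accounts for the overwhelming majority of the budget, which is why shaving single ticks off it was worth all that tea.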

 [SETUP]

 (load a,b,c,d)

 (set xy and uv)

 
addm xi,x b=b-a
[LOOP]

 
d=d-c b*RU mv a,e addr yi,RY
c=a-c d*RU st_io x,RU addr xi,RX
c=c+b load a addr #1,RY dec rc0
addm yi,y d=d-c load c addr #-1,RY
e=e+b d*RV st_io y,RV addr #1,RX
add #4,p load b addr #1,RY bra c0ne,LOOP
addm xi,x e=d+e load d addr #-1,RX
b=b-a st_p e,(p) addr #-1,RY
Here is the inner loop and setup/DMA code as it finally appeared in the source:

pixel_gen:

; This is the pixel generation function.  It collects *bilerped* pixels from the 8x8 pattern buffer and
; deposits them in the linear destination buffer for output to external RAM.

        mv_s    dma_dbase,r15               ;save this in a spare reggy in v3
        add     out_buffer,dma_dbase        ;Generate the real address of the buffer
        push    v3                          ;I am going to use v3 as an extra pixel holder.
Because I am no longer using the XY index registers to address the output buffer, I get its actual address into dma_dbase before I enter my loop. I also stack v3, so I can use it as an extra pixel holder ("e" in the pseudocode).
; Now, outside of the actual loop, I am gonna load up my stuff.

        st_s   x,ru            ;Initialise bilinear U pointer
        st_s   y,rv            ;Initialise bilinear V pointer
        st_s   x,rx            ;Initialise bilinear X pointer
        st_s   y,ry            ;Initialise bilinear Y pointer
{
        ld_p    (uv),pixel      ;Grab a pixel from the source
        addr    #1,ru           ;go to next horiz pixel
}
{
        ld_p    (uv),pixel2     ;Get a second pixel
        addr    #1,rv           ;go to next vert pixel
}
{
        ld_p    (uv),pixel4     ;get a third pixel
        addr    #-1,ru          ;go to prev horizontal pixel
        sub     #4,dma_dbase    ;point at start of buffer -4
}
{
        ld_p    (uv),pixel3     ;get a fourth pixel
        addr    #-1,rv          ;go back to original pixel
        sub_sv  pixel,pixel2    ;b=b-a
}       
        add     xi,x            ;pre increment x
By this point, I have snarfed up my initial pixels, done the preliminary calculations, and set up XY and UV the way they need to be for entry into the loop proper. So now we are ready to dive into those 8 ticks o'glory...
bilerp:

; Here is the bilerp part.
{
        mv_v    pixel,pixel5            ;save a copy of first pixel, freeing up pixel 1.
        mul_p   ru,pixel2,>>#14,pixel2  ;scale according to fractional part of ru
        sub_sv  pixel3,pixel4           ;make vector between second 2 pixels
        addr    yi,ry                   ;Point ry to next y
}
{
        st_s    x,ru                    ;Can now update ru, finished multiplying with it.
        mul_p   ru,pixel4,>>#14,pixel4  ;scale according to fractional part of ru
        sub_sv  pixel3,pixel,pixel3     ;c=a-c (3-operand form)
        addr    xi,rx                   ;(XY) now points at next pixel 1
}
{
        ld_p    (xy),pixel              ;Loading next pixel 1.
        addr    #1,ry                   ;Pointing to next pixel 3.
        add_sv  pixel2,pixel3           ;get first intermediate result
        dec     rc0                     ;Decrementing the loop counter.
}
{
        ld_p    (xy),pixel3             ;getting next pixel 3.
        sub_sv  pixel3,pixel4           ;get vector to final value
        addm    yi,y,y                  ;Point to next y        
        addr    #-1,ry                  ;Working over to point to pixel 2.
}
{
        st_s    y,rv                    ;Can now update this as finished multiplying.
        mul_p   rv,pixel4,>>#14,pixel4  ;scale with fractional part of rv
        add_sv  pixel2,pixel5           ;add pix2 to the copy of pix1
        addr    #1,rx                   ;(xy) now points at pixel 2.
}
{
        ld_p    (xy),pixel2             ;load up next pixel2
        addr    #1,ry                   ;point to next pixel 4
        add     #4,dma_dbase            ;Incrementing the output buffer pointer.
        bra     c0ne,bilerp             ;start the branch
}
{
        ld_p    (xy),pixel4             ;get next pixel4
        add_sv  pixel4,pixel5           ;make final pixel value
        addr    #-1,rx                  ;start putting these right      
        addm    xi,x,x                  ;do x inc
}
{
        st_p    pixel5,(dma_dbase)      ;Deposit the pixel in the dest buffer
        addr    #-1,ry                  ;finish putting these right
        sub_sv  pixel,pixel2            ;b=b-a
}
Heh! That made the bugger work for a living!
; Postamble - get back v3 and the proper buffer address

    pop     v3                      ;restore dma stuff
    nop                             ;empty delay slot
    mv_s    r15,dma_dbase           ;put this back where it was


; Now, the pixel buffer is full, so it is time to DMA it out to external RAM.
;
; To implement simple double-buffering of the DMA out, we have to do
; the following:  wait for (a) the PENDING bit to go clear, which will
; mean that DMA is ready to accept a command; and (b), make sure that
; the ACTIVE level is never greater than (#buffers-1).  Here we are using
; 2 buffers, so we wait until it is 1.

dma_avail:

    ld_s    mdmactl,r0              ;Get the DMA status.
    nop
    btst    #4,r0                   ;Pending?
    bra ne,dma_avail                ;Yeah, gotta wait.
    bits    #3,>>#0,r0              ;Extract the ACTIVE level
    cmp #1,r0                       ;check against (#buffers-1)
    bra gt,dma_avail,nop            ;Wait until it is OK.
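Boiled down, that wait loop is testing a simple predicate on mdmactl: PENDING clear (bit 4, per the btst) and ACTIVE no greater than #buffers-1. A Python sketch of the condition (the exact width of the ACTIVE field is my assumption from the bits extraction):

```python
def dma_ready(mdmactl, nbuffers=2):
    """True when it is safe to hand the DMA engine another command."""
    pending = (mdmactl >> 4) & 1   # bit 4: a command is already pending
    active = mdmactl & 0xF         # low bits: ACTIVE level (field width assumed)
    return pending == 0 and active <= nbuffers - 1
```

With two buffers we spin until at most one transfer is still active, which is what makes the fire-and-forget double buffering below safe.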

; Now we know DMA is ready, so we can proceed to set up and launch the DMA write.    

    mv_s    #dmaFlags,r0            ;Get DMA flags for this screentype.
    ld_s    dest,r1                 ;Address of external RAM screen base
    copy    destx,r2                ;destination xpos
    copy    desty,r3                ;destination ypos
    lsl #16,dma_len,r4              ;shift DMA size up
    or  r4,r2                       ;and combine with x-position
    bset    #16,r3                  ;make Y size = 1
    mv_s    #dma__cmd,r4            ;address of DMA command buffer in local RAM
    st_v    v0,(r4)                 ;set up first vector of DMA command
    add #16,r4                      ;point to next vector
    add out_buffer,dma_dbase,r0     ;point to the buffer we just drew
    st_s    r0,(r4)                 ;place final word of DMA command
    sub #16,r4                      ;point back to start of DMA command buffer
    st_s    r4,mdmacptr             ;launch the DMA

; Because we are double buffering, there is no need to wait for
; DMA to complete.  We can switch buffers, return and get straight on with the
; next line.

        rts                                                 ;Return to the main loops.
    eor #1,<>#-8,out_buffer         ;Toggle the buffer offset twixt 0 and 256.
    nop
There's a couple of extra ticks here for the unstacking of our v3, but given the ticks we saved in the inner loop, it was well worth it!
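As an aside, the eor #1,<>#-8 in the delay slot is what does the buffer flip: rotating the 1 into bit position 8 gives the constant 256, so the XOR toggles out_buffer between the two buffer offsets. In Python terms (my sketch):

```python
def toggle_buffer(out_buffer):
    # Equivalent of eor #1,<>#-8,out_buffer:
    # XOR with 1 rotated into bit 8 (i.e. 256) flips between the two offsets.
    return out_buffer ^ (1 << 8)
```

Each call flips 0 to 256 and back again, so the next line is always rendered into the buffer that isn't currently being DMAed out.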

 

