Optimising the Innermost Loop

 Getting a bit faster now...

To assemble and run this example, use the batch file "m4" in the Warpcode directory.

This is somewhat better: the frame rate has come up a bit, just from doing some packing on the innermost pixel_gen loop. In Real Life, you wouldn't start out by optimising anything in the setup code. Rule #1 is to start the optimisation process from the innermost loop and then, if necessary, work outwards. If your innermost loop is only ten instructions long, it's worth spending all day pondering over those instructions if, at the end of the day, there are only nine of them left. Consider the current example:

 

So even if you don't bother with anything else, you should polish your innermost loop code until it shines. The numbers speak for themselves.
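To put rough numbers on rule #1, here is a small Python sketch. The frame size (320x240), the setup cost (1,000 ticks) and the helper name `ticks_per_frame` are all made up for illustration; they are not measurements from this example. It also assumes one tick per instruction.

```python
def ticks_per_frame(ticks_per_pixel, width, height, setup_ticks):
    """Total ticks for one frame: one-off setup plus the inner loop run once per pixel."""
    return setup_ticks + ticks_per_pixel * width * height

# Hypothetical figures, chosen only to show the scale of the effect.
WIDTH, HEIGHT = 320, 240
SETUP = 1000

baseline = ticks_per_frame(10, WIDTH, HEIGHT, SETUP)

# Option A: shave one instruction off a ten-instruction inner loop.
inner_saving = baseline - ticks_per_frame(9, WIDTH, HEIGHT, SETUP)

# Option B: leave the loop alone and halve the setup code instead.
setup_saving = baseline - ticks_per_frame(10, WIDTH, HEIGHT, SETUP // 2)

print(inner_saving, setup_saving)
```

With these assumed figures, removing a single instruction from the inner loop saves 76,800 ticks per frame, while halving the setup code saves only 500.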

 Here's the pixel-generation loop, after having spent just a few minutes doing 'obvious' packing of instructions:

pixel_gen:

; This is the pixel generation function.  It collects *bilerped* pixels from the 8x8 pattern buffer and
; deposits them in the linear destination buffer for output to external RAM.

        st_s    x,ru                                    ;Initialise bilinear X pointer
        st_s    y,rv                                    ;Initialise bilinear Y pointer

; Here is the bilerp part.

{
        ld_p    (uv),pixel                              ;Grab a pixel from the source
        addr    #1,ru                                   ;go to next horiz pixel
}
{
        ld_p    (uv),pixel2                             ;Get a second pixel
        addr    #1,rv                                   ;go to next vert pixel
}
{
        ld_p    (uv),pixel4                             ;get a third pixel
        addr    #-1,ru                                  ;go to prev horizontal pixel
}
{
        ld_p    (uv),pixel3                             ;get a fourth pixel
        addr    #-1,rv                                  ;go back to original pixel
        sub_sv  pixel,pixel2                            ;make vector between first 2 pixels
}
Here we have just folded the addr instructions in with the pixel loads, and we also start the first ALU operation as soon as the first pixel pair is loaded.
{
        dec     rc0                                     ;Decrement the counter
        mul_p   ru,pixel2,>>#14,pixel2                  ;scale according to fractional part of ru
        add     yi,y                                    ;Add the y-increment
}
        sub_sv  pixel3,pixel4                           ;make vector between second 2 pixels
{
        mul_p   ru,pixel4,>>#14,pixel4                  ;scale according to fractional part of ru
        add     xi,x                                    ;Add the x-increment
}
Here you can see that we have folded some more instructions together, but look what happens next:
        add_sv  pixel2,pixel            ;get first intermediate pixel
        add_sv  pixel4,pixel3           ;get second intermediate pixel
        sub_sv  pixel,pixel3            ;get vector to final value
There is a big pile-up of ALU instructions here that we have to get through before we can start the final multiply to get the result.
        mul_p   rv,pixel3,>>#14,pixel3  ;scale with fractional part of rv
        bra     c0ne,pixel_gen          ;Loop for the length of the dest buffer
        add_sv  pixel3,pixel            ;make final pixel value 
{
        st_p    pixel,(xy)              ;Deposit the pixel in the dest buffer
        addr    #1,rx                   ;increment the dest buffer pointer
}
And finally we are done, for a final inner loop count of 16 ticks per pixel. Now this is still a lot better than the original, naive coding of the bilerp, but it's not nearly as good as we might hope.  Next up, we'll do some serious optimisation.
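As a cross-check on the listing, here is a rough Python model of the data flow for one destination pixel. It is a sketch under several assumptions, not a description of the hardware: it uses floats where the real code uses fixed point with a `>>#14` shift, it assumes the 8x8 pattern buffer wraps at its edges, it reads `sub_sv a,b` as `b = b - a` and `add_sv a,b` as `b = b + a` (the reading that makes the listing's comments consistent), and it counts one tick per packet or unbracketed instruction. The function and helper names are hypothetical.

```python
import math

SIZE = 8  # the 8x8 pattern buffer from the text; wrap-around is an assumption

def vsub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def vadd(a, b):
    return tuple(x + y for x, y in zip(a, b))

def vscale(s, a):
    return tuple(s * x for x in a)

def load(pattern, iu, iv):
    """Stand-in for ld_p (uv): fetch one pixel vector, wrapping at the edges."""
    return pattern[iv % SIZE][iu % SIZE]

def pixel_gen_once(pattern, u, v):
    """Follow the listing's register traffic for one pixel; return (pixel, ticks)."""
    iu, iv = math.floor(u), math.floor(v)
    fu, fv = u - iu, v - iv               # fractional parts of ru and rv

    ticks = 2                             # st_s x,ru and st_s y,rv
    pixel = load(pattern, iu, iv);          ticks += 1  # ld_p + addr #1,ru
    pixel2 = load(pattern, iu + 1, iv);     ticks += 1  # ld_p + addr #1,rv
    pixel4 = load(pattern, iu + 1, iv + 1); ticks += 1  # ld_p + addr #-1,ru
    pixel3 = load(pattern, iu, iv + 1)
    pixel2 = vsub(pixel2, pixel);           ticks += 1  # ld_p + addr + sub_sv
    pixel2 = vscale(fu, pixel2);            ticks += 1  # mul_p + dec + add yi,y
    pixel4 = vsub(pixel4, pixel3);          ticks += 1  # sub_sv on its own
    pixel4 = vscale(fu, pixel4);            ticks += 1  # mul_p + add xi,x
    pixel = vadd(pixel, pixel2);            ticks += 1  # first intermediate pixel
    pixel3 = vadd(pixel3, pixel4);          ticks += 1  # second intermediate pixel
    pixel3 = vsub(pixel3, pixel);           ticks += 1  # vector to final value
    pixel3 = vscale(fv, pixel3);            ticks += 1  # mul_p by fraction of rv
    ticks += 1                                          # bra c0ne,pixel_gen
    pixel = vadd(pixel, pixel3);            ticks += 1  # final pixel value
    ticks += 1                                          # st_p + addr #1,rx
    return pixel, ticks

# A test pattern whose channels vary linearly, so bilerp should reproduce them.
pattern = [[(float(x), float(y), float(x + y)) for x in range(SIZE)]
           for y in range(SIZE)]
result, ticks = pixel_gen_once(pattern, 2.5, 3.25)
print(result, ticks)
```

Under these assumptions the model gives the same 16 ticks as the count above, and it makes the pile-up easy to see: the three vector add/subtract steps before the final multiply each take a tick to themselves.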

 

