To assemble and run this example, use the batch file "m5" in the Warpcode directory.
For deep optimisation of an inner loop, it's generally not good enough to just pack up the instructions that you wrote for the unpacked version and hope for the best. I find it best to take myself away from in front of the computer altogether and make myself a really hot cup of tea and go and pace about in front of the whiteboard.
I find it helpful to first of all consider what the absolute minimum, best case timing could be for the task in hand. In the case of the bilinear interpolation I wrote down something like this (using a different colour for ALU and MUL unit operations, and with a,b,c and d representing the four pixels used in the interpolation). I assumed that at the start of the calculation, all four pixels were already loaded:
b=b-a | |
d=d-c | b*RU |
d*RU | |
b=b+a | |
d=d+c | |
d=d-b | |
d*RV | |
(NOP) | |
d=d+b | |
(write pixel) |
Looking at the algorithm, I realised I could start the third multiply one tick earlier by calculating one of the differences in parallel with the two initial multiplies, as follows:
b=b-a | |
d=d-c | b*RU |
c=a-c | d*RU |
c=c+b | |
d=d-c | |
b=a+b | d*RV |
(NOP) | |
d=d+b | |
(write pixel) |
b=b-a | |||
d=d-c | b*RU | ||
c=a-c | d*RU | copy a to e | load a |
c=c+b | |||
d=d-c | load c | ||
e=e+b | d*RV | load b | |
(NOP) | |||
e=d+e | load d | ||
(write pixel) |
I did consider using the non-index-register version of mul_p to get around this, but that would have entailed fossicking around with the index values using the ALU to get them in the correct position for the shift inherent in the mul_p instruction, and the ALU is already very well loaded.
Then I remembered that I was basically wasting the XY index pair by just using it to address a pair of linear buffers. If I were to write to the destination buffer through a direct scalar address, rather than using the index registers, I could use (XY) to point to the next pixel, whilst maintaining the current pixel address in UV for the calculation. It might mean a little extra code to handle the double buffering, but not much - and if it would speed the inner loop, it would be well worth it.
Once I had that plan of attack in mind, it was just a case of looking to see how I could fit in the various increments and stores to set up the index registers. Eventually, assuming that before entry to the loop not only were the four initial pixels loaded but that the indices XY and UV had been set equal, I came up with this: (p is the pointer to the destination buffer)
addm xi,x | b=b-a | |||
addm yi,y | d=d-c | addr yi,RY | ||
c=a-c | b*RU | mv a,e | addr xi,RX | |
p=p+4 | d*RU | load a | addr xi,RX | |
c=c+b | st_io x,RU | addr #1,RY | ||
d=d-c | load c | addr #-1,RY | ||
e=e+b | d*RV | st_io y,RV | addr #1,RX | |
(NOP) | load b | addr #1,RY | ||
e=d+e | load d | addr #-1,RX | ||
st_p e,(p) | addr #-1,RY |
[SETUP]
(load a,b,c,d)
(set xy and uv)
addm xi,x | b=b-a |
addm yi,y | d=d-c | addr yi,RY | |||
c=a-c | b*RU | mv a,e | addr xi,RX | ||
p=p+4 | d*RU | load a | addr #1,RY | dec rc0 | |
c=c+b | st_io x,RU | ||||
d=d-c | load c | addr #-1,RY | |||
e=e+b | d*RV | st_io y,RV | addr #1,RX | ||
add xi,x | load b | addr #1,RY | bra c0ne,LOOP | ||
e=d+e | load d | addr #-1,RX | |||
b=b-a | st_p e,(p) | addr #-1,RY |
[SETUP]
(load a,b,c,d)
(set xy and uv)
addm xi,x | b=b-a |
d=d-c | b*RU | mv a,e | addr yi,RY | ||
c=a-c | d*RU | st_io x,RU | addr xi,RX | ||
c=c+b | load a | addr #1,RY | dec rc0 | ||
addm yi,y | d=d-c | load c | addr #-1,RY | ||
e=e+b | d*RV | st_io y,RV | addr #1,RX | ||
add #4,p | load b | addr #1,RY | bra c0ne,LOOP | ||
addm xi,x | e=d+e | load d | addr #-1,RX | ||
b=b-a | st_p e,(p) | addr #-1,RY |
pixel_gen: ; This is the pixel generation function. It collects *bilerped* pixels from the 8x8 pattern buffer and ; deposits them in the linear destination buffer for output to external RAM. mv_s dma_dbase,r15 ;save this in a spare reggy in v3 add out_buffer,dma_dbase ;Generate the real address of the buffer push v3 ;I am going to use v3 as an extra pixel holder.Because I am no longer using the xy index registers to address the output buffer, I get the actual address of it into dma_dbase, before I enter my loop. I also stack v3, so I can use it for an extra pixel holder ("e" in the pseudocode).
; Now, outside of the actual loop, I am gonna load up my stuff. st_s x,ru ;Initialise bilinear U pointer st_s y,rv ;Initialise bilinear V pointer st_s x,rx ;Initialise bilinear X pointer st_s y,ry ;Initialise bilinear Y pointer { ld_p (uv),pixel ;Grab a pixel from the source addr #1,ru ;go to next horiz pixel } { ld_p (uv),pixel2 ;Get a second pixel addr #1,rv ;go to next vert pixel } { ld_p (uv),pixel4 ;get a third pixel addr #-1,ru ;go to prev horizontal pixel sub #4,dma_dbase ;point at start of buffer -4 } { ld_p (uv),pixel3 ;get a fourth pixel addr #-1,rv ;go back to original pixel sub_sv pixel,pixel2 ;b=b-a } add xi,x ;pre increment xBy here, I have snarfed up my initial pixels, done the preliminary calculations, and set up XY and UV the way they need to be for entry into the loop proper. So now we are ready to dive into those 8 ticks o'glory...
bilerp: ; Here is the bilerp part. { mv_v pixel,pixel5 ;save a copy of first pixel, freeing up pixel 1. mul_p ru,pixel2,>>#14,pixel2 ;scale according to fractional part of ru sub_sv pixel3,pixel4 ;make vector between second 2 pixels addr yi,ry ;Point ry to next y } { st_s x,ru ;Can now update ru, finished multiplying with it. mul_p ru,pixel4,>>#14,pixel4 ;scale according to fractional part of ru sub_sv pixel3,pixel,pixel3 addr xi,rx ;(XY) now points at next pixel 1 } { ld_p (xy),pixel ;Loading next pixel 1. addr #1,ry ;POinting to next pixel 3. add_sv pixel2,pixel3 ;get first intermediate result dec rc0 ;Decrementing the loop counter. } { ld_p (xy),pixel3 ;getting next pixel 3. sub_sv pixel3,pixel4 ;get vector to final value addm yi,y,y ;Point to next y addr #-1,ry ;Working over to point to pixel 2. } { st_s y,rv ;Can now update this as finished multiplying. mul_p rv,pixel4,>>#14,pixel4 ;scale with fractional part of rv add_sv pixel2,pixel5 ;add pix2 to the copy of pix1 addr #1,rx ;(xy) now points at pixel 2. } { ld_p (xy),pixel2 ;load up next pixel2 addr #1,ry ;point to next pixel 4 add #4,dma_dbase ;Incrementing the output buffer pointer. bra c0ne,bilerp ;start the branch } { ld_p (xy),pixel4 ;get next pixel4 add_sv pixel4,pixel5 ;make final pixel value addr #-1,rx ;start putting these right addm xi,x,x ;do x inc } { st_p pixel5,(dma_dbase) ;Deposit the pixel in the dest buffer addr #-1,ry ;finish putting these right sub_sv pixel,pixel2 ;b=b-a }Heh! That made the bugger work for a living!
; Postamble - get back v3 and the proper buffer address pop v3 ;restore dma stuff nop ;empty delay slot mv_s r15,dma_dbase ;put this back where it was ; Now, the pixel buffer is full, so it is time to DMA it out to external RAM. ; ; To implement simple double-buffering of the DMA out, we have to do ; the following: wait for (a) the PENDING bit to go clear, which will ; mean that DMA is ready to accept a command; and (b), make sure that ; the ACTIVE level is never greater than (#buffers-1). Here we are using ; 2 buffers, so we wait until it is 1. dma_avail: ld_s mdmactl,r0 ;Get the DMA status. nop btst #4,r0 ;Pending? bra ne,dma_avail ;Yeah, gotta wait. bits #3,>>#0,r0 ;Extract the ACTIVE level cmp #1,r0 ;check against (#buffers-1) bra gt,dma_avail,nop ;Wait until it is OK. ; Now we know DMA is ready, so we can proceed to set up and launch the DMA write. mv_s #dmaFlags,r0 ;Get DMA flags for this screentype. ld_s dest,r1 ;Address of external RAM screen base copy destx,r2 ;destination xpos copy desty,r3 ;destination ypos lsl #16,dma_len,r4 ;shift DMA size up or r4,r2 ;and combine with x-position bset #16,r3 ;make Y size = 1 mv_s #dma__cmd,r4 ;address of DMA command buffer in local RAM st_v v0,(r4) ;set up first vector of DMA command add #16,r4 ;point to next vector add out_buffer,dma_dbase,r0 ;point to the buffer we just drew st_s r0,(r4) ;place final word of DMA command sub #16,r4 ;point back to start of DMA command buffer st_s r4,mdmacptr ;launch the DMA ; Because we are double buffering, there is no need to wait for ; DMA to complete. We can switch buffers, return and get straight on with the ; next line. rts ;Return to the main loops. eor #1,<>#-8,out_buffer ;Toggle the buffer offset twixt 0 and 256. nopThere's a couple of extra ticks here for the unstacking of our v3, but given the ticks we saved in the inner loop, it was well worth it!
jmp next jmp prev rts nop nop