Since we're only doing DMA writes in this example, it's easy to implement multiple buffering on the DMA by checking the Pending and Active Level fields in the DMA status register. In a more complex situation where DMA reads and writes may both be pending, more advanced techniques are needed for DMA optimisation, but for this example we'll stick to the easy case. Advanced DMA optimisation could be covered in a later document, if I feel like writing one.
When you're doing this kind of thing, refining the algorithm as a whole rather than optimising code specifically, you'll be very grateful that you didn't just leap in and start excreting big packets as soon as you started coding. Tuning the algorithm is a lot easier while everything is still unpacked and legible.
To implement the double buffering, I just added an extra 64 longs to the output buffer, and then told UV to access that as a 64x2 array of pixels, using V_TILE to constrain v. That way, I can switch between buffers by just adding 1 to v. Here's the code again, with the mods documented:
; ; warp2.a - just get something - anything - up on the screen! ; This just tiles the screen with an 8x8 source tile. ; ; This version adds some simple DMA optimisation - namely, ; double-buffering of the output buffer. ; here's some definitions .include "merlin.i" ;general Merlin things .include "scrndefs.i" ;defines screen buffers and screen DMA type .start go .segment local_ram .align.v ; buffer for internal pixel map (1 DMA's worth) buffer: .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ; output line buffer (1 DMA's worth x2, for double buffering) line: .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0Room here now for 2 buffers.
; DMA command buffer dma__cmd: .dc.s 0,0,0,0,0,0,0,0 ; destination screen address dest: .dc.s dmaScreen2 ; frame counter ctr: .dc.s 0 ; reg equates for this routine x = r8 y = r9 pixel = v1 destx = r12 desty = r13 dma_len = r14 destw = r10 desth = r11 yi = r16 xi = r17 xs = r18 ys = r19 dma_mode = r20 dma_dbase = r21 out_buffer = r22I added a couple of extra reg equates here. dma_dbase holds the base address of the double-buffered output buffer, and out_buffer contains an offset from that base address to the active buffer.
.segment instruction_ram go: st_s #$aa,intctl ;turn off any existing video st_s #(local_ram_base+4096),sp ;here's the SP ; clear the source buffer to *random* pixels, using the pseudo random sequence generator ; out of Graphics Gems 1 mv_s #$a3000000,r2 ;This is the mask mv_s #$b3725309,r0 ;A random seed mv_s #buffer,r1 ;Address of the source buffer st_s #64,rc0 ;This is how many pixels to clear cl_srceb: btst #0,r0 ;Check bit zero of the current seed bra eq,nxor ;Do not xor with the mask if it ain't set lsr #1,r0 ;Always shift the mask, whatever happens dec rc0 ;dec the loop counter eor r2,r0 ;If that bit was 1, xor in the mask nxor: bra c0ne,cl_srceb ;loop for all the pixels st_s r0,(r1) ;store the pixel add #4,r1 ;point to next pixel addressHere I have changed the black background fill of ths source tile into randomly coloured pixels. This looks slightly less boring, and will allow us to see more easily the effects of future code changes.
; set up a simple cross-shaped test pattern in the buffer RAM mv_s #$51f05a00,r0 ;Pixel colour (a red colour) mv_s #buffer+(32*4),r1 ;Line halfway down buffer mv_s #buffer+16,r2 ;Column halfway across top line of buffer st_s #8,rc0 ;Number of pixels to write testpat: st_s r0,(r1) ;Store pixel value at row address. st_s r0,(r2) ;Store pixel value at column address. dec rc0 ;Decrement loop counter. bra c0ne,testpat ;Loop if counter not equal to 0. add #4,r1 ;Increment row address by one pixel. add #32,r2 ;Increment column address by one line. ; now, initialise video jsr SetUpVideo,nop frame_loop: ; generate a drawscreen address mv_s #dmaScreenSize,r0 ;this lot selects one of mv_s #dmaScreen3,r3 ;three drawscreen buffers ld_s dest,r1 ;this should be inited to a ;valid screen buffer address nop cmp r3,r1 bra ne,updatedraw add r0,r1 nop mv_s #dmaScreen1,r1 ;reset buffer base updatedraw: st_s r1,dest ;set current drawframe address ; actually draw a frame jsr drawframe,nop ; increment the frame counter ld_s ctr,r0 nop add #1,r0 st_s r0,ctr ; set the address of the frame just drawn on the video system jsr SetVidBase ld_s dest,r0 nop ; loop back for the next frame bra frame_loop,nop drawframe: ; save the return address for nested subroutine calls push v7,rz ; ensure that any pending DMA is complete. Whilst it ; is not really necessary at the moment, it is good form, ; for later on we may arrive at the start of a routine ; while DMA is still happening from something we did before. jsr dma_finished,nop ; initialise the bilinear addressing registers st_s #buffer,xybase ;I want XY to point at the buffer here. st_s #$104dd008,xyctl ;XY type, derived as follows: ;Bit 28 set, I wanna use CH-NORM. ;Pixel type set to 4 (32-bit pixels). ;XTILE and YTILE both set to 13 (treat the buffer as an 8x8 tilable bitmap). ;The width is set to 8 pixels. st_s #line,uvbase ;set the line buffer address mv_s #line,dma_dbase ;Store the same address as double buffer base. st_s #$1040f040,uvctl ;UV type, derived as follows: ;Bit 28 set, I wanna use CH-NORM. ;Pixel type set to 4 (32-bit pixels). ;UTILE off, VTILE set to mask bits 17-31. ;This means that the integer part of V is ;constrained to 0 or 1. We use it to switch buffers. ;The width is set to 64 pixels. st_s #0,rv ;Init v to point to the first buffer. sub out_buffer,out_buffer ;select buffer offset of 0Since I am treating the 2 output buffers as a 64x2 array, the parameters for uvctl have changed. Now I am specifying a v_tile of 15, which means the v wraps in the range 0-1, and a width of 64, so that v actually means something. I also initialise rv to 0 and the offset out_buffer to 0, so they point to the first buffer. I also save the base address of the buffers in dma_dbase.
; initialise parameters for the routine mv_s #0,desty ;Start at dest y=0 mv_s #0,destx ;Start at dest x=0 ld_s ctr,x ;Use counter, to make it move ld_s ctr,y ;Same for Y lsl #13,x ;make it half a source pixel a frame lsl #14,y ;same mv_s #$2000,xi ;Source X inc mv_s #$400,yi ;Source Y inc mv_s #$c00,xs ;Source X step mv_s #$1400,ys ;Source Y step mv_s #360,destw ;Width of dest rectangle mv_s #240,desth ;Height of dest rectangleI changed the inc and step parameters, so we get a bit of enlargement and distorsion of the grid. And instead of using the field counter from the video interrupt to move the offset, I am using my own counter ctr, which is incremented once per frame that is actually drawn. That way, we can see easily roughly how quick the code is - as we add stuff and the framerate goes down, the resultant scrolling speed of the display will be obvioulsy slower.
; now the outer loop warp_outer: push v2 ;save the source X and Y, and ;the width and height push v3 ;save the dest X and Y ; and now the inner. warp_inner: mv_s #64,r0 ;This is the maximum number of ;pixels for one DMA. sub r0,destw ;Count them off the total dest width. bra gt,w_1 ;do nothing if this is positive nop st_s #0,ru ;Point ru at the first pixel of ;the output buffer add destw,r0 ;If negative, modify the number ;of pixels to generate. jsr pixel_gen ;Go and call the pixel generation loop mv_s r0,dma_len ;Set the dma length in my dma vector st_io r0,rc0 ;Set the counter for the pixgen loop ; Pixel gen function will return here after having ; generated and DMA'd out the pixels cmp #0,destw ;Did the width go negative? bra gt,warp_inner ;No, it did not, carry on the horizontal ;traverse of the dest rectangle add dma_len,destx ;add dma_len to the integer ;part of the dest x position nop ;empty delay slot ; Horizontal span is finished if we fall through to here pop v3 ;restore dest X and Y pop v2 ;restore source X and Y add #1,desty ;point to next line of dest sub #1,desth ;decrement the Y size jmp gt,warp_outer ;loop for entire height add xs,x ;add the X step to the source add ys,y ;add the Y step to the source pop v7,rz ;get back return address nop rts t,nop ;and return pixel_gen: ; This is the pixel generation function. It collects ; pixels from the 8x8 pattern buffer and ; deposits them in the linear destination buffer for ; output to external RAM. st_s x,(rx) ;Initialise bilinear X pointer st_s y,(ry) ;Initialise bilinear X pointer ld_p (xy),pixel ;Grab a pixel from the source dec rc0 ;Decrement the counter st_p pixel,(uv) ;Deposit the pixel in the dest buffer addr #1,ru ;increment the dest buffer pointer bra c0ne,pixel_gen ;Loop for the length of the dest buffer add xi,x ;Add the x-increment add yi,y ;Add the y_increment ; Now, the pixel buffer is full, so it is time to DMA it out to external RAM. ; ; To implement simple double-buffering of the DMA out, we have to do ; the following: wait for (a) the PENDING bit to go clear, which will ; mean that DMA is ready to accept a command; and (b), make sure that ; the ACTIVE level is never greater than (#buffers-1). Here we are using ; 2 buffers, so we wait until it is 1. dma_avail: ld_s mdmactl,r0 ;Get the DMA status. nop btst #4,r0 ;Pending? bra ne,dma_avail ;Yeah, gotta wait. bits #3,>>#0,r0 ;Extract the ACTIVE level cmp #1,r0 ;check against (#buffers-1) bra gt,dma_avail,nop ;Wait until it is OK. ; Now we know DMA is ready, so we can proceed to set up and launch the DMA write. mv_s #dmaFlags,r0 ;Get DMA flags for this screentype. ld_s dest,r1 ;Address of external RAM screen base copy destx,r2 ;destination xpos copy desty,r3 ;destination ypos lsl #16,dma_len,r4 ;shift DMA size up or r4,r2 ;and combine with x-position bset #16,r3 ;make Y size = 1 mv_s #dma__cmd,r4 ;address of DMA command buffer in local RAM st_v v0,(r4) ;set up first vector of DMA command add #16,r4 ;point to next vector add out_buffer,dma_dbase,r0 ;point to the buffer we just drew st_s r0,(r4) ;place final word of DMA command sub #16,r4 ;point back to start of DMA command buffer st_s r4,mdmacptr ;launch the DMA ; Because we are double buffering, there is no need to wait for ; DMA to complete. We can switch buffers, return and get straight on with the ; next line. rts ;Return to the main loops. eor #1,<>#-8,out_buffer ;Toggle the buffer offset twixt 0 and 256. addr #1,rv ;Change the write buffer index.Here is the DMA out with double-buffering. Before initiating the DMA, we wait to see that there is no DMA pending (DMA is ready to accept commands), and that there are no more than (#buffers-1) DMAs waiting to occur. When these conditions are satisfied, we launch the DMA as before, except that now we add out_buffer to dma_dbase to form the current buffer address. Once DMA is underway we add to rv and out_buffer ro point to the opposite buffer, ready for the next pixel generation cycle.
The tail of the code is just the .includes for the video, so I haven't bothered including them here. In the next stage of development, we'll add bilinear filtering to the source tile reads.
jmp next jmp prev rts nop nop