Mild Tweaking

 Yes, it's still a big ugly red grid. But it's a slightly faster big ugly red grid than it was before, and it has random coloured pixels in it.
 
To assemble and run this example, use the batch file "m2" in the Warpcode directory.

 



In this example, we are going to improve the efficiency of the code by adding some double-buffering to the DMA write section.  This will allow us to write out the buffer full of pixels that is generated, and then proceed to generate the next buffer full without hanging around to wait for the first lot to be written to SDRAM once the DMA is launched.

Since we're only doing DMA writes in this example, it's easy to implement multiple buffering on the DMA by checking the Pending and Active Level fields in the DMA status register.  In a more complex situation where DMA reads and writes may both be pending, more advanced techniques are needed for DMA optimisation, but for this example we'll stick to the easy case.  Advanced DMA optimisation could be covered in a later document, if I feel like writing one.

 When you're doing this kind of thing, refining the algorithm as a whole rather than optimising code specifically, you'll be very grateful that you didn't just leap in and start excreting big packets as soon as you started coding. Tuning the algorithm is a lot easier while everything is still unpacked and legible.

 To implement the double buffering, I just added an extra 64 longs to the output buffer, and then told UV to access that as a 64x2 array of pixels, using V_TILE to constrain v. That way, I can switch between buffers by just adding 1 to v. Here's the code again, with the mods documented:

;
; warp2.a - just get something - anything - up on the screen!
; This just tiles the screen with an 8x8 source tile.
;
; This version adds some simple DMA optimisation - namely,
; double-buffering of the output buffer.

; here's some definitions

        .include        "merlin.i"          ;general Merlin things
    .include    "scrndefs.i"        ;defines screen buffers and screen DMA type
        .start  go
        .segment        local_ram
        .align.v

; buffer for internal pixel map (1 DMA's worth)

buffer:

        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

; output line buffer (1 DMA's worth x2, for double buffering)

line:

        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Room here now for 2 buffers.
; DMA command buffer

dma__cmd:

        .dc.s   0,0,0,0,0,0,0,0


; destination screen address

dest:   .dc.s   dmaScreen2

; frame counter

ctr:    .dc.s   0

; reg equates for this routine

        x = r8
        y = r9
        pixel = v1
        destx = r12
        desty = r13
        dma_len = r14
        destw = r10
        desth = r11
        yi = r16
        xi = r17
        xs = r18
        ys = r19
        dma_mode = r20
        dma_dbase = r21
        out_buffer = r22
I added a couple of extra reg equates here. dma_dbase holds the base address of the double-buffered output buffer, and out_buffer contains an offset from that base address to the active buffer.
        .segment        instruction_ram

go:


    st_s    #$aa,intctl                   ;turn off any existing video
    st_s    #(local_ram_base+4096),sp     ;here's the SP

; clear the source buffer to *random* pixels, using the pseudo random sequence generator
; out of Graphics Gems 1

        mv_s    #$a3000000,r2                   ;This is the mask
        mv_s    #$b3725309,r0                   ;A random seed
        mv_s    #buffer,r1                      ;Address of the source buffer
        st_s    #64,rc0                         ;This is how many pixels to clear
cl_srceb:
        btst    #0,r0                           ;Check bit zero of the current seed
        bra     eq,nxor                         ;Do not xor with the mask if it ain't set
        lsr     #1,r0                           ;Always shift the mask, whatever happens
        dec     rc0                             ;dec the loop counter
        eor     r2,r0                           ;If that bit was 1, xor in the mask
nxor:
        bra     c0ne,cl_srceb                   ;loop for all the pixels
        st_s    r0,(r1)                         ;store the pixel        
        add     #4,r1                           ;point to next pixel address
Here I have changed the black background fill of ths source tile into randomly coloured pixels.  This looks slightly less boring, and will allow us to see more easily the effects of future code changes.


; set up a simple cross-shaped test pattern in the buffer RAM

        mv_s    #$51f05a00,r0                   ;Pixel colour (a red colour)
        mv_s    #buffer+(32*4),r1               ;Line halfway down buffer
        mv_s    #buffer+16,r2                   ;Column halfway across top line of buffer
        st_s    #8,rc0                          ;Number of pixels to write

testpat:

        st_s    r0,(r1)                         ;Store pixel value at row address.
        st_s    r0,(r2)                         ;Store pixel value at column address.
        dec     rc0                             ;Decrement loop counter.
        bra     c0ne,testpat                    ;Loop if counter not equal to 0.
        add     #4,r1                           ;Increment row address by one pixel.
        add     #32,r2                          ;Increment column address by one line.

; now, initialise video

    jsr SetUpVideo,nop

frame_loop:

; generate a drawscreen address 

    mv_s    #dmaScreenSize,r0       ;this lot selects one of
    mv_s    #dmaScreen3,r3          ;three drawscreen buffers
    ld_s    dest,r1                 ;this should be inited to a
                                    ;valid screen buffer address
    nop
    cmp     r3,r1
    bra     ne,updatedraw
    add     r0,r1             
    nop
    mv_s    #dmaScreen1,r1          ;reset buffer base
updatedraw:
    st_s    r1,dest                 ;set current drawframe address

; actually draw a frame

    jsr drawframe,nop
    
; increment the frame counter

    ld_s    ctr,r0
    nop
    add #1,r0
    st_s    r0,ctr    
    
; set the address of the frame just drawn on the video system

    jsr SetVidBase
    ld_s    dest,r0
    nop
    
; loop back for the next frame

    bra frame_loop,nop
    


drawframe:

; save the return address for nested subroutine calls

    push    v7,rz
    
; ensure that any pending DMA is complete.  Whilst it
; is not really necessary at the moment, it is good form,
; for later on we may arrive at the start of a routine
; while DMA is still happening from something we did before.

    jsr dma_finished,nop                

; initialise the bilinear addressing registers


        st_s    #buffer,xybase                  ;I want XY to point at the buffer here.
        st_s    #$104dd008,xyctl                ;XY type, derived as follows:
                                                ;Bit 28 set, I wanna use CH-NORM.
                                                ;Pixel type set to 4 (32-bit pixels).
                                                ;XTILE and YTILE both set to 13 (treat the buffer as an 8x8 tilable bitmap).
                                                ;The width is set to 8 pixels.
        st_s    #line,uvbase                    ;set the line buffer address
        mv_s    #line,dma_dbase                 ;Store the same address as double buffer base.
        st_s    #$1040f040,uvctl                ;UV type, derived as follows:
                                                ;Bit 28 set, I wanna use CH-NORM.
                                                ;Pixel type set to 4 (32-bit pixels).
                                                ;UTILE off, VTILE set to mask bits 17-31.
                                                ;This means that the integer part of V is
                                                ;constrained to 0 or 1.  We use it to switch buffers.
                                                ;The width is set to 64 pixels.
        st_s    #0,rv                           ;Init v to point to the first buffer.
        sub out_buffer,out_buffer               ;select buffer offset of 0
Since I am treating the 2 output buffers as a 64x2 array, the parameters for uvctl have changed. Now I am specifying a v_tile of 15, which means the v wraps in the range 0-1, and a width of 64, so that v actually means something. I also initialise rv to 0 and the offset out_buffer to 0, so they point to the first buffer. I also save the base address of the buffers in dma_dbase.
; initialise parameters for the routine

        mv_s    #0,desty                                ;Start at dest y=0
        mv_s    #0,destx                                ;Start at dest x=0
        ld_s    ctr,x                                   ;Use counter, to make it move
        ld_s    ctr,y                                   ;Same for Y
        lsl #13,x                                       ;make it half a source pixel a frame
        lsl #14,y                                       ;same
        mv_s    #$2000,xi                               ;Source X inc
        mv_s    #$400,yi                                ;Source Y inc
        mv_s    #$c00,xs                                ;Source X step
        mv_s    #$1400,ys                               ;Source Y step
        mv_s    #360,destw                              ;Width of dest rectangle
        mv_s    #240,desth                              ;Height of dest rectangle
I changed the inc and step parameters, so we get a bit of enlargement and distorsion of the grid.  And instead of using the field counter from the video interrupt to move the offset, I am using my own counter ctr, which is incremented once per frame that is actually drawn.  That way, we can see easily roughly how quick the code is - as we add stuff and the framerate goes down, the resultant scrolling speed of the display will be obvioulsy slower.
  

; now the outer loop

warp_outer:

        push    v2                      ;save the source X and Y, and
                                        ;the width and height
        push    v3                      ;save the dest X and Y  

; and now the inner.

warp_inner:

        mv_s    #64,r0                  ;This is the maximum number of 
                                        ;pixels for one DMA.
        sub     r0,destw                ;Count them off the total dest width.
        bra     gt,w_1                  ;do nothing if this is positive
        nop
        st_s   #0,ru                    ;Point ru at the first pixel of
                                        ;the output buffer
        add     destw,r0                ;If negative, modify the number
                                        ;of pixels to generate.
        jsr     pixel_gen               ;Go and call the pixel generation loop
        mv_s    r0,dma_len              ;Set the dma length in my dma vector
        st_io   r0,rc0                  ;Set the counter for the pixgen loop

; Pixel gen function will return here after having
; generated and DMA'd out the pixels

        cmp     #0,destw                ;Did the width go negative?
        bra     gt,warp_inner           ;No, it did not, carry on the horizontal 
                                        ;traverse of the dest rectangle
        add     dma_len,destx           ;add dma_len to the integer 
                                        ;part of the dest x position
        nop                             ;empty delay slot

; Horizontal span is finished if we fall through to here

        pop     v3                      ;restore dest X and Y
        pop     v2                      ;restore source X and Y
        add     #1,desty                ;point to next line of dest
        sub     #1,desth                ;decrement the Y size
        jmp     gt,warp_outer           ;loop for entire height
        add     xs,x                    ;add the X step to the source
        add     ys,y                    ;add the Y step to the source

    pop v7,rz                       ;get back return address
    nop
    rts t,nop                       ;and return 


pixel_gen:

; This is the pixel generation function.  It collects 
; pixels from the 8x8 pattern buffer and
; deposits them in the linear destination buffer for 
; output to external RAM.

        st_s   x,(rx)                   ;Initialise bilinear X pointer
        st_s   y,(ry)                   ;Initialise bilinear X pointer
        ld_p    (xy),pixel              ;Grab a pixel from the source
        dec     rc0                     ;Decrement the counter
        st_p    pixel,(uv)              ;Deposit the pixel in the dest buffer
        addr    #1,ru                   ;increment the dest buffer pointer
        bra     c0ne,pixel_gen          ;Loop for the length of the dest buffer
        add     xi,x                    ;Add the x-increment
        add yi,y                        ;Add the y_increment

; Now, the pixel buffer is full, so it is time to DMA it out to external RAM.
;
; To implement simple double-buffering of the DMA out, we have to do
; the following:  wait for (a) the PENDING bit to go clear, which will
; mean that DMA is ready to accept a command; and (b), make sure that
; the ACTIVE level is never greater than (#buffers-1).  Here we are using
; 2 buffers, so we wait until it is 1.

dma_avail:

    ld_s    mdmactl,r0              ;Get the DMA status.
    nop
    btst    #4,r0                   ;Pending?
    bra ne,dma_avail                ;Yeah, gotta wait.
    bits    #3,>>#0,r0              ;Extract the ACTIVE level
    cmp #1,r0                       ;check against (#buffers-1)
    bra gt,dma_avail,nop            ;Wait until it is OK.

; Now we know DMA is ready, so we can proceed to set up and launch the DMA write.    

    mv_s    #dmaFlags,r0            ;Get DMA flags for this screentype.
    ld_s    dest,r1                 ;Address of external RAM screen base
    copy    destx,r2                ;destination xpos
    copy    desty,r3                ;destination ypos
    lsl #16,dma_len,r4              ;shift DMA size up
    or  r4,r2                       ;and combine with x-position
    bset    #16,r3                  ;make Y size = 1
    mv_s    #dma__cmd,r4            ;address of DMA command buffer in local RAM
    st_v    v0,(r4)                 ;set up first vector of DMA command
    add #16,r4                      ;point to next vector
    add out_buffer,dma_dbase,r0     ;point to the buffer we just drew
    st_s    r0,(r4)                 ;place final word of DMA command
    sub #16,r4                      ;point back to start of DMA command buffer
    st_s    r4,mdmacptr             ;launch the DMA

; Because we are double buffering, there is no need to wait for
; DMA to complete.  We can switch buffers, return and get straight on with the
; next line.

    rts                             ;Return to the main loops.
    eor #1,<>#-8,out_buffer         ;Toggle the buffer offset twixt 0 and 256.
    addr    #1,rv                   ;Change the write buffer index.
Here is the DMA out with double-buffering.  Before initiating the DMA, we wait to see that there is no DMA pending (DMA is ready to accept commands), and that there are no more than (#buffers-1) DMAs waiting to occur.  When these conditions are satisfied, we launch the DMA as before, except that now we add out_buffer to dma_dbase to form the current buffer address. Once DMA is underway we add to rv and out_buffer ro point to the opposite buffer, ready for the next pixel generation cycle.

The tail of the code is just the .includes for the video, so I haven't bothered including them here.  In the next stage of development, we'll add bilinear filtering to the source tile reads.

  


jmp next
jmp prev
rts
nop
nop