Initial Implementation

 The none-too-interesting output from the first piece of code. Totally boring it may be, but it serves to let me know that the basic structure is OK.

To assemble and run this example, use the batch file "m1" in the Warpcode directory.


As you can see, this initial output is somewhat less graphically impressive than "Gridrunner" on the Commodore 64, to which it bears a passing resemblance. This early on, I'm not too bothered about that. The most important thing is just to lay the groundwork for your routine, make sure that video is coming up, screen buffering is working, the DMA is putting stuff where it ought to, and that you're getting the results you expect.

 For any coders who have been worrying about this strange VLIW thang that the MPEs do - forget it, at least for now. At initial implementation, it's a big mistake to try and write packed instructions. Packed code is notoriously difficult to read and maintain, and you're well advised not to do any packing until your code is running and doing exactly what you want it to. For now, just code it like you would any other RISC processor.

 You'll notice that I have done some Merlin-esque things, like putting instructions into the delay slots after my JSRs and branches, but that's more because I just can't stand to see a NOP if I can possibly avoid it (NOPs offend the eye) than for purposes of serious optimisation.

 One good habit it's wise to get into is being a bit anal about commenting things. Try to put a good comment on every line of code that describes what that instruction is doing. When you do start to pack instructions, you'll be moving stuff around all over the place, so the old style of commenting "This block does this... and this next section does that..." doesn't really work any more.

 Okay, let's take a stroll through the code and have a look at what's going on.

 

;
; warp1.a - just get something - anything - up on the screen!
; This just tiles the screen with an 8x8 source tile.

; here's some definitions

        .include        "merlin.i"          ;general Merlin things
        .include        "scrndefs.i"        ;defines screen buffers and screen DMA type
        .start  go
        .segment        local_ram
        .align.v
Merlin.i is an include file that names various bits of Merlin.  We put this one in everything or else we'd be referring to hex addresses all the time, and a right pain that'd be.

Scrndefs.i defines some screen addresses in external RAM, and defines the DMA mode flags for a display that is 360 pixels wide by 240 high, and uses Pixel Mode 4 - 32 bit pixels.

Finally I make sure that I am aligned on a vector boundary before I begin to define the structures that I need.  The buffers that follow need to be aligned at least to a long boundary, since they will be accessed via the bilinear XY and UV address-generator load and store pixel commands.  Being aligned to a vector boundary is usually a Good Thing - it's nice to be able to use vector loads and stores when you feel like it.  Tip - if things start going really strange for no apparent reason, one of the first things to check is that one of your structures hasn't slipped out of alignment.  Loading from a misaligned structure won't usually crash the system, but it can definitely mean that you don't get what you expect in the registers after the load.

If in doubt, vector-align.

; buffer for internal pixel map (1 DMA's worth)

buffer:

        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

; output line buffer (1 DMA's worth)

line:

        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
These are my two 256-byte buffers - one for the 8x8 source tile, and the other for a 64-pixel linear output buffer.  We'll hold the source tile in buffer, and generate lines of pixels for output to external RAM in line.
; DMA command buffer

dma__cmd:

        .dc.s   0,0,0,0,0,0,0,0
This is a small buffer where DMA commands are constructed before being passed to the DMA system.  You launch a DMA by first building the command in a small buffer, and then pointing the DMA hardware at the command by storing the buffer's address in the DMA command pointer register.  This command structure must lie on a vector boundary - and we know it does, because we were anal and aligned the preceding buffers on a vector, and this is 512 bytes afterwards, so we're still vector aligned.

We'll be using non-chained, bilinear DMA, and as such, this buffer really only needs to be 5 scalars long.  However, I'm feeling anal about alignment, so I've padded it out to 8 scalars to keep my precious vector alignment.
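The alignment arithmetic above is easy to sanity-check off-target. Here is a minimal Python sketch of it, assuming a scalar is 4 bytes and a vector is four scalars (16 bytes), and using offset 0 as a stand-in for wherever .align.v actually leaves us; the layout helper is made up for illustration and is not part of any Merlin tool.

```python
# Model of the local-RAM layout: offsets are relative to the .align.v point.
SCALAR_BYTES = 4                  # one .dc.s scalar is 32 bits
VECTOR_BYTES = 4 * SCALAR_BYTES   # a vector is four scalars (assumed 16 bytes)

def layout(structures, base=0):
    """Return {name: byte offset} for structures laid out back to back."""
    offsets = {}
    addr = base
    for name, scalars in structures:
        offsets[name] = addr
        addr += scalars * SCALAR_BYTES
    return offsets

# Sizes in scalars, taken from the .dc.s lines in the listing.
offsets = layout([("buffer", 64), ("line", 64), ("dma__cmd", 8)])

for name, off in offsets.items():
    aligned = off % VECTOR_BYTES == 0
    print(f"{name:9s} offset {off:4d}  vector-aligned: {aligned}")
```

If you ever insert an odd-sized structure ahead of dma__cmd, a check like this shows straight away whether the command buffer has slipped off its vector boundary.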

; reg equates for this routine

        x = r8
        y = r9
        pixel = v1
        destx = r12
        desty = r13
        destw = r10
        desth = r11
        yi = r16
        xi = r17
        xs = r18
        ys = r19
        dma_mode = r20
These are my reg equates for the routine. I tend to leave r0-r7 for scratch registers and hacking the kind of small stuff that is too piffling to be bothered writing reg equates for. I'm going to be using x and y to address the source tile; the vector pixel for - you guessed it - holding a pixel; destx and desty are the position on the destination bitmap. destw and desth are the width and height of the destination rectangle. xi and yi are the increment steps taken over the source tile for every horizontal pixel stepped over in the destination space. xs and ys are the step offsets that are added to x and y for each vertical step through destination space. Finally, dma_mode holds a copy of the DMA flags for the destination bitmap.

 Okay, let's go!

 

        .segment        instruction_ram

go:

        st_s    #$aa,intctl                   ;turn off any existing interrupts (e.g. video)
        st_s    #(local_ram_base+4096),sp     ;here's the SP
That store to intctl turns off any interrupts that might be running when the code begins to execute - there probably won't be any, but it's best to be sure.  Then we initialise the SP to the top of MPE data RAM.  (Well, actually, if you're on MPE0 or MPE3 there is more data RAM, but I want this code to run on any MPE, so I'm assuming a 4K maximum of DTRAM.)
; clear the source buffer to black pixels

        mv_s    #$10808000,r0   ;A black pixel
        mv_s    #buffer,r1      ;Address of the source buffer
        st_s   #64,rc0          ;This is how many pixels to clear to black

cl_srceb:

        dec     rc0             ;dec the loop counter
        bra     c0ne,cl_srceb   ;loop for all the pixels
        st_s    r0,(r1)         ;store the black pixel  
        add     #4,r1           ;point to next pixel address
This clears the entire source buffer to the pixel value loaded into r0 at the start. I could have pre-defined it in the .dc.s statements where I defined the buffer, but it's a lot easier to change here, and anyway, I'll be needing this loop to do something else later as the code develops.  We use one of the counter registers, rc0, to count off the iterations of the loop - 64 of them, since we are writing to an 8x8 pixel buffer.
; set up a simple cross-shaped test pattern in the buffer RAM

        mv_s    #$51f05a00,r0           ;Pixel colour (a red colour)
        mv_s    #buffer+(32*4),r1       ;Line halfway down buffer
        mv_s    #buffer+16,r2           ;Column halfway across top line of buffer
        st_io   #8,rc0                  ;Number of pixels to write

testpat:

        st_s    r0,(r1)         ;Store pixel value at row address.
        st_s    r0,(r2)         ;Store pixel value at column address.
        dec     rc0             ;Decrement loop counter.
        bra     c0ne,testpat    ;Loop if counter not equal to 0.
        add     #4,r1           ;Increment row address by one pixel.
        add     #32,r2          ;Increment column address by one line.
This draws a cross in the source buffer, in the colour loaded into r0 at the start. The buffer is 8 lines of 8 32-bit pixels, so the horizontal pointer advances by 4 bytes (one pixel) and the vertical pointer advances by 32 bytes (one line).
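To see what those two loops leave in the buffer, here is a rough Python model of them. It only mimics the pointer arithmetic from the listing (byte offsets 32*4 and 16, strides 4 and 32); the list standing in for local RAM and the make_tile name are purely illustrative.

```python
BLACK = 0x10808000                 # pixel value from the clear loop
RED = 0x51F05A00                   # pixel value from the test-pattern loop
TILE_W = TILE_H = 8
PIXEL_BYTES = 4
ROW_BYTES = TILE_W * PIXEL_BYTES   # 32 bytes per line of the tile

def make_tile():
    tile = [BLACK] * (TILE_W * TILE_H)   # the cl_srceb loop: 64 black pixels
    row_ptr = 32 * PIXEL_BYTES           # buffer+(32*4): start of line 4
    col_ptr = 16                         # buffer+16: column 4 of line 0
    for _ in range(8):                   # the testpat loop
        tile[row_ptr // PIXEL_BYTES] = RED
        tile[col_ptr // PIXEL_BYTES] = RED
        row_ptr += PIXEL_BYTES           # one pixel to the right
        col_ptr += ROW_BYTES             # one line down
    return tile

tile = make_tile()
for y in range(TILE_H):
    print("".join("#" if tile[y * TILE_W + x] == RED else "." for x in range(TILE_W)))
```

The printout is a cross through line 4 and column 4, which is the shape that ends up tiled across the screen.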
; now, initialise video

    jsr SetUpVideo,nop
Now, we need to set up a framework to actually generate a display, and buffer multiple screens so we can animate the display smoothly.  The call to SetUpVideo invokes the necessary voodoo to do that - you don't really need to concern yourself with it right now, but basically it sets up an interrupt routine that mutters the appropriate stuff at the appropriate times to the video display hardware to yield a 360x240, 32-bit, overscanned display area.

Once video is active, we are going to sit in a loop that does the following things, over and over:

frame_loop:

; generate a drawscreen address 

    mv_s    #dmaScreenSize,r0       ;this lot selects one of
    mv_s    #dmaScreen3,r3          ;three drawscreen buffers
    ld_s    dest,r1                 ;this should be inited to a
                                    ;valid screen buffer address
    nop
    cmp     r3,r1
    bra     ne,updatedraw
    add     r0,r1             
    nop
    mv_s    #dmaScreen1,r1          ;reset buffer base
updatedraw:
    st_s    r1,dest                 ;set current drawframe address

; actually draw a frame

    jsr drawframe,nop
    
; set the address of the frame just drawn on the video system

    jsr SetVidBase
    ld_s    dest,r0
    nop
    
; loop back for the next frame

    bra frame_loop,nop
This is the main drawing loop.  The .include file scrndefs.i defines the three screen buffer addresses and the size of an individual screen (dmaScreenSize).  In the data RAM section, we initialised dest to contain the address of one of the buffers.  The code increments dest by one screen size, and resets it to the first screen if it gets incremented past the third screen - we're triple-buffering.  We then call drawframe, which actually does the business on the screen pointed to by dest.  Finally, once drawframe returns, the screen is ready for display, so we call the video driver routine SetVidBase to point the display hardware at the screen we just drew; then we loop back and do it all again.
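The buffer rotation can be modelled in a few lines of Python. The addresses below are hypothetical placeholders (the real ones live in scrndefs.i), and the sketch assumes the three screens sit back to back, dmaScreenSize apart, which is what the add in the listing implies.

```python
# Hypothetical values: the real ones are defined in scrndefs.i.
DMA_SCREEN_SIZE = 0x60000
DMA_SCREEN1 = 0x40000000
DMA_SCREEN2 = DMA_SCREEN1 + DMA_SCREEN_SIZE
DMA_SCREEN3 = DMA_SCREEN2 + DMA_SCREEN_SIZE

def next_drawscreen(dest):
    """Mirror the cmp / bra ne / add / mv_s sequence at frame_loop."""
    if dest == DMA_SCREEN3:          # just drew on the third screen?
        return DMA_SCREEN1           # then reset to the first
    return dest + DMA_SCREEN_SIZE    # otherwise step to the next one

dest = DMA_SCREEN1
sequence = []
for _ in range(6):
    dest = next_drawscreen(dest)
    sequence.append(dest)
print([hex(s) for s in sequence])
```

Note that the model only behaves if dest starts out as one of the three valid buffer addresses, which is exactly the point of the comment on the ld_s in the listing.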
drawframe:

; save the return address for nested subroutine calls

    push    v7,rz
    
; ensure that any pending DMA is complete.  Whilst it
; is not really necessary at the moment, it is good form,
; for later on we may arrive at the start of a routine
; while DMA is still happening from something we did before.

    jsr dma_finished,nop                
So here is the actual start of the routine to draw the screen.  Since we are calling this as a subroutine, and will be calling subroutines within this one, we have to save the rz value so that the RTS will have the correct address, so first off we push v7,rz.  Since we'll be doing DMA, we want to know that the DMA subsystem is not in the middle of something, so we call dma_finished, which returns when it determines that DMA is idle.
 
; initialise the bilinear addressing registers


        st_s    #buffer,xybase              ;I want XY to point at the buffer here.
        st_s    #$104dd008,xyctl            ;XY type, derived as follows:
                                            ;Bit 28 set, I wanna use CH-NORM.
                                            ;Pixel type set to 4 (32-bit pixels).
                                            ;XTILE and YTILE both set to 13 (treat the buffer as an 8x8 tilable bitmap).
                                            ;The width is set to 8 pixels.
This initialises the xy bilinear pixel addressing registers to point to an 8x8 source map at buffer.  We have set tiling on, which means that addresses outside of the 8x8 range get wrapped, and we never read from outside the tile area.
        st_s    #line,uvbase                ;set the line buffer address
        st_s    #$10400000,uvctl            ;UV type, derived as follows:
                                            ;Bit 28 set, I wanna use CH-NORM.
                                            ;Pixel type set to 4 (32-bit pixels).
                                            ;XTILE and YTILE both set to 0 (no tiling).
                                            ;The width is set to 0 (effectively, V is not used in address generation, since this is a line buffer).
        
This does the same as the last lot, but instead points the UV addressing at the linear output buffer. U- and v-tile are not used and the width is set to zero, which basically means that v has no effect (the linear buffer is addressed by u alone).
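If you want to check how those two magic numbers fall out of the comments, here is a small Python sketch. The field positions are inferred from the hex constants in the listing rather than taken from the hardware documentation, and since both tile fields hold the same value here, which nibble is X and which is Y is an assumption.

```python
# Field positions inferred from $104dd008 and $10400000 (assumptions).
CH_NORM_BIT = 28
PIXEL_TYPE_SHIFT = 20
XTILE_SHIFT = 16      # assumed; could equally be the YTILE position
YTILE_SHIFT = 12      # assumed; could equally be the XTILE position

def ctl_word(pixel_type, xtile, ytile, width, ch_norm=True):
    """Pack an xyctl/uvctl-style control word from its fields."""
    word = (1 << CH_NORM_BIT) if ch_norm else 0
    word |= pixel_type << PIXEL_TYPE_SHIFT
    word |= xtile << XTILE_SHIFT
    word |= ytile << YTILE_SHIFT
    word |= width
    return word

xyctl = ctl_word(pixel_type=4, xtile=13, ytile=13, width=8)
uvctl = ctl_word(pixel_type=4, xtile=0, ytile=0, width=0)
print(f"xyctl = ${xyctl:08x}, uvctl = ${uvctl:08x}")
```

Building the constants from named fields like this makes it harder to fat-finger a nibble when you change the tile size or pixel type later on.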

 Right, now the initialisation is done: the source buffer contains an image of sorts, and the XY and UV address generators both point at the right stuff.

 
; initialise parameters for the routine

        mv_s    #0,desty                ;Start at dest y=0
        mv_s    #0,destx                ;Start at dest x=0
        ld_s    __fieldcount,x          ;Use __fieldcount, to make it move
        ld_s    __fieldcount,y          ;Same for Y
        lsl #16,x                       ;make it one whole pixel per fieldcount
        lsl #16,y                       ;same
        mv_s    #$10000,xi              ;Source X inc is 1.0 pixels
        mv_s    #0,yi                   ;Source Y inc is 0 pixels
        mv_s    #0,xs                   ;Source X step is 0 pixels
        mv_s    #$10000,ys              ;Source Y step is 1.0 pixels
        mv_s    #360,destw              ;Width of dest rectangle
        mv_s    #240,desth              ;Height of dest rectangle

Here we define some values for our draw routine.  x and y will be used to index into the source tile; I have loaded them from __fieldcount, which is the field counter incremented once per video field by the display interrupt.  Since this value is constantly incrementing, using it for the offset will make our display scroll diagonally.  The xy offset is a 16:16 fixed-point value, so the fieldcount is shifted up 16 bits to land in the integer part, and the offset moves by one whole pixel per field.  The dest origin is set to (0,0), the dest size to 360x240 pixels, the source increment to (1.0,0) and the source step to (0,1.0). The increment and step values are also 16:16 fixed-point values, because fractional increments are more funky. We're ready to rock.
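For anyone rusty on 16:16 fixed point, here is a tiny Python sketch of the conversions involved. The helper names are invented for illustration; the only facts taken from the listing are that $10000 is 1.0 and that the fieldcount is shifted left by 16.

```python
FRAC_BITS = 16
ONE = 1 << FRAC_BITS              # $10000 represents 1.0

def to_fixed(value):
    """Convert a number to 16:16 fixed point."""
    return int(value * ONE)

def integer_part(fixed):
    """Whole-pixel part of a 16:16 value."""
    return fixed >> FRAC_BITS

def start_offset(fieldcount):
    """lsl #16: put the field counter in the integer part (32-bit register)."""
    return (fieldcount << FRAC_BITS) & 0xFFFFFFFF

xi = to_fixed(1.0)                # source X increment per dest pixel
half = to_fixed(0.5)              # a fractional increment, for comparison
print(hex(xi), hex(half), integer_part(start_offset(5)))
```

With an increment of 0.5 the integer part only changes every second destination pixel, which is the kind of funkiness the fractional bits buy you.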
 

; now the outer loop

warp_outer:

        push    v2      ;save the source X and Y, and the width and height
        push    v3      ;save the dest X and Y
We save the source and dest positions. They are gonna get molested as we step horizontally across the source rectangle, and this way we can just pop them off when it comes time to add the step values at the end of the scanline.
; and now the inner.

warp_inner:

        mv_s    #64,r0          ;This is the maximum number of pixels for one DMA.
        sub     r0,destw        ;Count them off the total dest width.
We intend to do a 64-pixel chunk of the destination scanline, so we deduct that from the remaining width. If the remainder is still positive, the DMA length is the full 64 (in r0).
        bra     gt,w_1          ;do nothing if this is positive
        nop
        st_s   #0,ru            ;Point ru at the first pixel of the output buffer
The two instructions after the branch get executed regardless of whether the branch is taken, as they are in delay slots; here one is empty and the other is initialising RU. Always try and have your delay slots filled with some instructions, however piffling. Nops are so ugly.
        add     destw,r0        ;If negative, modify the number of pixels to generate.
If the width went negative or zero, then it's the end of the scanline, and the DMA length may well be shorter than 64 pixels. Adding the value to r0 leaves it with the correct DMA length.
w_1:
        jsr     pixel_gen       ;Go and call the pixel generation loop
        mv_s    r0,dma_len      ;Set the dma length in my dma vector
        st_s   r0,rc0           ;Set the counter for the pixgen loop
Here is where we actually call the pixel generation function. In the delay slots on the way, the dma length is copied to dma_len, and it also is used to initialise the counter rc0. The pixel generation function fills up the destination buffer with rc0 pixels, and then dma's them out to the address in destx and desty.
; Pixel gen function will return here after having generated and DMA'd out the pixels

        cmp     #0,destw        ;Did the width go negative?
        bra     gt,warp_inner   ;No, it did not, carry on the horizontal 
                                ;traverse of the dest rectangle
        add     dma_len,destx   ;add dma_len to the dest x position
        nop                     ;empty delay slot
If the width did not go negative, we loop on around until it does, filling the destination scanline.
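The chunking logic of the inner loop is worth convincing yourself of on paper. Here is a Python sketch that follows the same subtract-then-fix-up arithmetic; the spans function is just a model for illustration, not part of the routine.

```python
MAX_DMA_PIXELS = 64               # most pixels one DMA will carry here

def spans(width, max_len=MAX_DMA_PIXELS):
    """Return (destx, dma_len) pairs as the inner loop would produce them."""
    out = []
    destx = 0
    destw = width
    while True:
        dma_len = max_len
        destw -= max_len          # sub r0,destw
        if destw <= 0:            # branch to w_1 not taken
            dma_len += destw      # add destw,r0: trim the last chunk
        out.append((destx, dma_len))
        destx += dma_len          # add dma_len,destx
        if destw <= 0:            # bra gt,warp_inner not taken
            break
    return out

print(spans(360))
```

For our 360-pixel line that gives five full 64-pixel chunks followed by a 40-pixel one, and a width that is an exact multiple of 64 still ends on a full-length chunk because adding zero leaves r0 alone.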
; Horizontal span is finished if we fall through to here

        pop     v3              ;restore dest X and Y
        pop     v2              ;restore source X and Y
        add     #1,desty        ;point to next line of dest
        sub     #1,desth        ;decrement the Y size
        jmp     gt,warp_outer   ;loop for entire height
        add     xs,x            ;add the X step to the source
        add     ys,y            ;add the Y step to the source
Here is the tail of the outer loop code, which gets executed when the scanline is complete. Source and destination addresses are restored, and 1 is added to the destination Y position, moving to the next scanline down. The height is decremented by one and, if it is still greater than 0, we loop back for another pass, adding the source step values to the source XY address on the way.
; all done!

    pop v7,rz                    ;get back return address
    nop
    rts t,nop                    ;and return 
And that's it for the actual draw subroutine, apart from the actual function to draw the pixels. It's pretty simple as you can see, and the DMA isn't that much of a pain in the arse, as we'll find out next.

 Now comes the most important part of the routine, the pixel-generation function. Right now, just while I get things going, I'm keeping this stupidly simple. All it does is collect pixels from the source and copy them to the destination buffer, and increment the various buffer pointers.

pixel_gen:

; This is the pixel generation function.  It collects pixels 
; from the 8x8 pattern buffer and
; deposits them in the linear destination buffer for output to 
; external RAM.

        st_s    x,(rx)          ;Initialise bilinear X pointer
        st_s    y,(ry)          ;Initialise bilinear Y pointer
        ld_p    (xy),pixel      ;Grab a pixel from the source
        dec     rc0             ;Decrement the counter
        st_p    pixel,(uv)      ;Deposit the pixel in the dest buffer
        addr    #1,ru           ;increment the dest buffer pointer
        bra     c0ne,pixel_gen  ;Loop for the length of the dest buffer
        add     xi,x            ;Add the x-increment
        add yi,y                ;Add the y_increment
It's totally non-optimal, but it's plain to see what's going on. At this stage, it's important to do everything in a very obvious way, so you know everything's working properly. There's plenty of time to worry about being optimal later. You'll be spending a lot of time staring at this inner loop code.
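Here is roughly what that loop computes, as a Python model. It assumes the load simply picks the pixel at the integer part of x and y, wrapped to the 8x8 tile by the tiling settings, and ignores the fractional bits; the stand-in tile of numbered pixels is there only so the result is easy to read.

```python
def pixel_gen(tile, x, y, xi, yi, count, tile_size=8):
    """Copy `count` pixels from a tiled source into a linear line buffer.

    x, y, xi and yi are 16:16 fixed-point values, as in the listing.
    """
    line = []
    for _ in range(count):
        sx = (x >> 16) % tile_size      # integer part, wrapped to the tile
        sy = (y >> 16) % tile_size
        line.append(tile[sy * tile_size + sx])
        x += xi                         # add xi,x
        y += yi                         # add yi,y
    return line, x, y

# Stand-in tile: pixel value = its index, so the wrap is visible.
demo_tile = list(range(64))
line, x, y = pixel_gen(demo_tile, 6 << 16, 1 << 16, 0x10000, 0, 4)
print(line)
```

Starting at column 6 of line 1, the four pixels come from columns 6, 7, 0 and 1, so you can see the read wrapping round inside the tile rather than wandering off the end of the buffer.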
; If it falls through here, the output buffer is full.
; So I am gonna call my general dma out
; function, which waits for DMA available, then
; starts the command going

        push    v0,rz           ;Save the return address
We're about to call a subroutine from within a subroutine, so we need to push the return address held in rz before we do.
    mv_s    #dmaFlags,r0            ;Get DMA flags for this screentype.
    ld_s    dest,r1                 ;Address of external RAM screen base
    copy    destx,r2                ;destination xpos
    copy    desty,r3                ;destination ypos
    lsl #16,dma_len,r4              ;shift DMA size up
    or  r4,r2                       ;and combine with x-position
    bset    #16,r3                  ;make Y size = 1
    mv_s    #dma__cmd,r4            ;address of DMA command buffer in local RAM
    st_v    v0,(r4)                 ;set up first vector of DMA command
    add #16,r4                      ;point to next vector
    mv_s    #line,r0                ;address of line buffer in local RAM
    st_s    r0,(r4)                 ;place final word of DMA command
    sub #16,r4                      ;point back to start of DMA command buffer
    st_s    r4,mdmacptr             ;launch the DMA
This code chunk launches a bilinear DMA event.  dmaFlags is defined in scrndefs.i and is specific to our 360-pixel wide, Mode 4 screen. It is the first scalar of the DMA command, and I get it into r0.  Next, the base of the current screen buffer is loaded into r1 from dest.  The next two scalars define the x and y position of the DMA and the size - position in the low 16 bits, size in the high 16 bits of each scalar, one each for X and Y.  We are transferring a line of pixels that is dma_len wide and 1 pixel high, so we set up r2 and r3 accordingly.  Then we place the first four scalars of the DMA command into the buffer at dma__cmd, using a vector store.  The final scalar of the command is the internal buffer address, so we add that to the command structure.  Finally, we launch the DMA, by placing the address of the command buffer into mdmacptr.
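The five scalars of the command can be sketched in Python like so. The flags value and the addresses are hypothetical placeholders (dmaFlags really comes from scrndefs.i), and the function name is made up; only the packing of size into the high 16 bits and position into the low 16 bits is taken from the description above.

```python
DMA_FLAGS = 0x0     # hypothetical placeholder for dmaFlags from scrndefs.i

def bilinear_dma_cmd(flags, screen_base, destx, desty, dma_len, local_addr):
    """Build the five scalars of a one-line bilinear DMA command."""
    xword = (dma_len << 16) | destx   # X size high, X position low
    yword = (1 << 16) | desty         # Y size of 1 high, Y position low
    return [flags, screen_base, xword, yword, local_addr]

# Hypothetical screen base and line-buffer address, for illustration only.
cmd = bilinear_dma_cmd(DMA_FLAGS, 0x40000000, destx=320, desty=10,
                       dma_len=40, local_addr=0x20100100)
print([f"${word:08x}" for word in cmd])
```

The first four entries are what goes out with the vector store, and the last one is the scalar stored 16 bytes further on.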
        jsr     dma_finished,nop    ;Call a function that waits until DMA is finished
Our mission here is almost done. The call to dma_finished ensures that the DMA system has completed writing out the output buffer, so we can return and start filling it with fresh pixels. Of course it's kind of silly to have to hang around and wait for that to happen, and we'll do something about that in a later version of the code.
        pop     v0,rz           ;Restore the return address.
        nop                     ;Delay while the pop completes.
        rts t,nop                     ;Return to the main loops.
Finally, with the DMA done and the buffer ready for re-use, we pop off the old return address, and rts the hell out of here.  The t,nop form just means that we don't have to put nops in for the delay slots of the RTS - it saves a bit of space, and there isn't really anything useful to do in those slots.
dma_finished:

; Wait 'till all DMA has actually finished

        ld_s    mdmactl,r0              ;get DMA status
        nop
        bits #4,>>#0,r0                 ;wait until Pending and Active Level are zero
        bra     ne,dma_finished,nop
        rts t,nop
This routine, dma_finished, simply polls mdmactl and loops until the status indicates that all DMA has completed, then returns.

; here is the video stuff

    .include    "video.def"
    .include    "video.s"
These two includes define the video parameters for our display mode, and include the interrupt and setup routines for the video display stuff.



 And that's it - simple, non-optimised, not-trying-to-be-a-clever-git kind of code, but that's all we want at the moment: just to get something up and running, and get a framework in place that we can build on. Next, we are going to add some simple DMA optimisation, to save having to use dma_finished at all.

 

