To assemble and run this example, use the batch file "m1" in the Warpcode directory.
For any coders who have been worrying about this strange VLIW thang that the MPEs do - forget it, at least for now. At initial implementation, it's a big mistake to try and write packed instructions. Packed code is notoriously difficult to read and maintain, and you're well advised not to do any packing until your code is running and doing exactly what you want it to. For now, just code it like you would any other RISC processor.
You'll notice that I have done some Merlin-esque things - like putting instructions into the delay slots after my JSRs and branches - but that's more because I just can't stand to see a NOP if I can possibly avoid it (NOPs offend the eye) than for purposes of serious optimisation.
One good habit it's wise to get into is being a bit anal about commenting things. Try to put a good comment on every line of code that describes what that instruction is doing. When you do start to pack instructions, you'll be moving stuff around all over the place, so the old style of commenting "This block does this... and this next section does that..." doesn't really work any more.
Okay, let's take a stroll through the code and have a look at what's going on.
    ;
    ; warp1.a - just get something - anything - up on the screen!
    ; This just tiles the screen with an 8x8 source tile.

    ; here's some definitions

        .include "merlin.i"       ;general Merlin things
        .include "scrndefs.i"     ;defines screen buffers and screen DMA type
        .start go
        .segment local_ram
        .align.v

Merlin.i is an include file that names various bits of Merlin. We put this one in everything or else we'd be referring to hex addresses all the time, and a right pain that'd be.
Scrndefs.i defines some screen addresses in external RAM, and defines the DMA mode flags for a display that is 360 pixels wide by 240 high, and uses Pixel Mode 4 - 32 bit pixels.
Finally I make sure that I am aligned on a vector boundary before I begin to define the structures that I need. The buffers that follow need to be aligned at least to a long boundary, since they will be accessed via the bilinear, XY and UV-address-generator load and store pixel commands. Being aligned to a vector boundary is usually a Good Thing - it's nice to be able to use vector loads and stores when you feel like it. Tip - if things start going really strange for no apparent reason, one of the first things to check is that one of your structures hasn't slipped out of alignment. Loading from a misaligned structure won't usually crash the system, but it can definitely mean that you don't get what you expect in the registers after the load.
If in doubt, vector-align.
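If you want to sanity-check alignment on paper, the arithmetic is simple enough to sketch in Python. This is only a model, not anything the assembler provides: it assumes a vector is four 32-bit scalars (16 bytes), and the helper names and the address used are made up for illustration.

```python
VECTOR_BYTES = 16  # assumption: one vector = four 32-bit scalars


def is_vector_aligned(addr):
    """True if addr sits on a 16-byte (vector) boundary."""
    return addr % VECTOR_BYTES == 0


def align_up(addr, boundary=VECTOR_BYTES):
    """Round addr up to the next multiple of boundary (a power of two)."""
    return (addr + boundary - 1) & ~(boundary - 1)


# Hypothetical address that has slipped out of alignment by one scalar.
slipped = 0x1004
print(is_vector_aligned(slipped))  # False
print(hex(align_up(slipped)))      # 0x1010
```

Two 256-byte buffers placed one after the other from an aligned start stay aligned, since 256 is a multiple of 16.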
    ; buffer for internal pixel map (1 DMA's worth)

    buffer:
        .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

    ; output line buffer (1 DMA's worth)

    line:
        .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        .dc.s 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

These are my two 256-byte buffers - one for the 8x8 source tile, and the other for a 64-pixel linear output buffer. We'll hold the source tile in buffer, and generate lines of pixels for output to external RAM in line.
    ; DMA command buffer

    dma__cmd:
        .dc.s 0,0,0,0,0,0,0,0

This is a small buffer where DMA commands are constructed before being passed to the DMA system. You launch a DMA by first building the command in a small buffer, and then pointing the DMA hardware at the command by storing the buffer's address in the DMA command pointer register. This command structure must lie on a vector boundary - and we know it does, because we were anal and aligned the preceding buffers on a vector, and this is 512 bytes afterwards, so we're still vector aligned.
We'll be using non-chained, bilinear DMA, and as such, this buffer really only needs to be 5 scalars long. However, I'm feeling anal about alignment, so I've padded it out to 8 scalars to keep my precious vector alignment.
    ; reg equates for this routine

        x = r8
        y = r9
        pixel = v1
        destx = r12
        desty = r13
        destw = r10
        desth = r11
        yi = r16
        xi = r17
        xs = r18
        ys = r19
        dma_mode = r20

These are my reg equates for the routine. I tend to leave r0-r7 for scratch registers and hacking the kind of small stuff that is too piffling to be bothered writing reg equates for. I'm going to be using x and y to address the source tile; the vector pixel for - you guessed it - holding a pixel; destx and desty are the position on the destination bitmap. destw and desth are the width and height of the destination rectangle. xi and yi are the increment steps taken over the source tile for every horizontal pixel stepped over in the destination space. xs and ys are the step offsets that are added to x and y for each vertical step through destination space. Finally, dma_mode holds a copy of the DMA flags for the destination bitmap.
Okay, let's go!
        .segment instruction_ram

    go:
        st_s #$aa,intctl                  ;turn off any existing video
        st_s #(local_ram_base+4096),sp    ;here's the SP

That store to intctl turns off any interrupts that might actually be running when the code begins to execute - there probably won't be, but it's best to be sure. Then, we initialise the SP to the top of MPE data RAM. (Well, actually, if you're on MPE0 or MPE3 there is more data RAM, but I want this code to run on any MPE, so I'm assuming a 4K maximum of DTRAM.)
    ; clear the source buffer to black pixels

        mv_s #$10808000,r0      ;A black pixel
        mv_s #buffer,r1         ;Address of the source buffer
        st_s #64,rc0            ;This is how many pixels to clear to black
    cl_srceb:
        dec rc0                 ;dec the loop counter
        bra c0ne,cl_srceb       ;loop for all the pixels
        st_s r0,(r1)            ;store the black pixel
        add #4,r1               ;point to next pixel address

This clears the entire source buffer to the pixel value loaded into r0 at the start. I could have pre-defined it in the .dc.s statements where I defined the buffer, but it's a lot easier to change here, and anyway, I'll be needing this loop to do something else later as the code develops. We use one of the counter registers, rc0, to count off the iterations of the loop - 64 of them, since we are writing to an 8x8 pixel buffer.
    ; set up a simple cross-shaped test pattern in the buffer RAM

        mv_s #$51f05a00,r0      ;Pixel colour (a red colour)
        mv_s #buffer+(32*4),r1  ;Line halfway down buffer
        mv_s #buffer+16,r2      ;Column halfway across top line of buffer
        st_io #8,rc0            ;Number of pixels to write
    testpat:
        st_s r0,(r1)            ;Store pixel value at row address.
        st_s r0,(r2)            ;Store pixel value at column address.
        dec rc0                 ;Decrement loop counter.
        bra c0ne,testpat        ;Loop if counter not equal to 0.
        add #4,r1               ;Increment row address by one pixel.
        add #32,r2              ;Increment column address by one line.

This draws a cross in the source buffer, in the colour loaded into r0 at the start. The buffer is 8 lines of 8 32-bit pixels, so the horizontal pointer advances by 4 and the vertical pointer advances by 32.
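To see what ends up in the tile, here is a rough Python model of the clear loop and the cross loop. It is a sketch only: the buffer is a flat list of 64 pixels rather than bytes, so the byte offsets 32*4 and 16 become pixel indices 32 and 4, and the function names are made up for illustration.

```python
TILE = 8                 # the source tile is 8x8 pixels
BLACK = 0x10808000       # pixel value used by the clear loop
RED = 0x51F05A00         # pixel value used for the cross


def make_test_tile():
    """Model of the clear loop followed by the cross-drawing loop."""
    buf = [BLACK] * (TILE * TILE)
    row_index = 32       # buffer+(32*4) bytes = pixel 32 = start of row 4
    col_index = 4        # buffer+16 bytes = pixel 4 = column 4 of row 0
    for _ in range(TILE):
        buf[row_index] = RED
        buf[col_index] = RED
        row_index += 1       # add #4,r1  (one pixel along the row)
        col_index += TILE    # add #32,r2 (one line down the column)
    return buf


def render(buf):
    """Return the tile as 8 strings, '#' for cross pixels and '.' for black."""
    return ["".join("#" if buf[y * TILE + x] == RED else "." for x in range(TILE))
            for y in range(TILE)]


for row in render(make_test_tile()):
    print(row)
```

Row 4 comes out as a solid line and every other row has a single pixel set in column 4, which is the cross you see tiled across the display.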
    ; now, initialise video

        jsr SetUpVideo,nop

Now, we need to set up a framework to actually generate a display, and buffer multiple screens so we can animate the display smoothly. The call to SetUpVideo invokes the necessary voodoo to do that - you don't really need to concern yourself with it right now, but basically it sets up an interrupt routine that mutters the appropriate stuff at the appropriate times to the video display hardware to yield a 360x240, 32-bit, overscanned display area.
Once video is active, we are going to sit in a loop that does the following things, over and over:
    frame_loop:

    ; generate a drawscreen address

        mv_s #dmaScreenSize,r0      ;this lot selects one of
        mv_s #dmaScreen3,r3         ;three drawscreen buffers
        ld_s dest,r1                ;this should be inited to a
                                    ;valid screen buffer address
        nop
        cmp r3,r1
        bra ne,updatedraw
        add r0,r1
        nop
        mv_s #dmaScreen1,r1         ;reset buffer base
    updatedraw:
        st_s r1,dest                ;set current drawframe address

    ; actually draw a frame

        jsr drawframe,nop

    ; set the address of the frame just drawn on the video system

        jsr SetVidBase
        ld_s dest,r0
        nop

    ; loop back for the next frame

        bra frame_loop,nop

This is the main drawing loop. The .include file scrndefs.i defines the three screen buffer addresses and the size of an individual screen (dmaScreenSize). In the data RAM section, we initialised dest to contain the address of one of the buffers. The code increments dest by one screen size, and resets it to the first screen if it gets incremented past the third screen - we're triple-buffering. We then call drawframe, which actually does the business on the screen pointed to by dest. Finally, once drawframe returns, the screen is ready for display, so we call the video driver routine SetVidBase to point the display hardware at the screen we just drew; then we loop back and do it all again.
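The buffer-selection logic at the top of the loop is easy to lose among the delay slots, so here is a small Python model of it. The addresses and screen size are hypothetical stand-ins for the values in scrndefs.i, and the model assumes the three screens sit one screen size apart, which is what the add in the assembly relies on.

```python
def next_draw_buffer(dest, screen1, screen3, screen_size):
    """Model of the buffer rotation: step on one screen, or wrap to the first.

    The assembly compares dest with the third screen before adding; if dest
    already points at the third screen, it is reset to the first instead.
    """
    if dest != screen3:
        return dest + screen_size
    return screen1


# Hypothetical values, for illustration only.
SCREEN1 = 0x40000000
SIZE = 0x00060000
SCREEN3 = SCREEN1 + 2 * SIZE

dest = SCREEN1
for _ in range(4):
    dest = next_draw_buffer(dest, SCREEN1, SCREEN3, SIZE)
    print(hex(dest))
```

Starting from the first screen, the draw address visits the second, the third, the first, the second, and so on.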
    drawframe:

    ; save the return address for nested subroutine calls

        push v7,rz

    ; ensure that any pending DMA is complete. Whilst it
    ; is not really necessary at the moment, it is good form,
    ; for later on we may arrive at the start of a routine
    ; while DMA is still happening from something we did before.

        jsr dma_finished,nop

So here is the actual start of the routine to draw the screen. Since we are calling this as a subroutine, and will be calling subroutines within this one, we have to save the rz value so that the RTS will have the correct address, so first off we push v7,rz. Since we'll be doing DMA, we want to know that the DMA subsystem is not in the middle of something, so we call dma_finished, which returns when it determines that DMA is idle.
    ; initialise the bilinear addressing registers

        st_s #buffer,xybase       ;I want XY to point at the buffer here.
        st_s #$104dd008,xyctl     ;XY type, derived as follows:
                                  ;Bit 28 set, I wanna use CH-NORM.
                                  ;Pixel type set to 4 (32-bit pixels).
                                  ;XTILE and YTILE both set to 13 (treat the buffer as an 8x8 tilable bitmap).
                                  ;The width is set to 8 pixels.

This initialises the xy bilinear pixel addressing registers to point to an 8x8 source map at buffer. We have set tiling on, which means that addresses outside of the 8x8 range get wrapped, and we never read from outside the tile area.
        st_s #line,uvbase         ;set the line buffer address
        st_s #$10400000,uvctl     ;UV type, derived as follows:
                                  ;Bit 28 set, I wanna use CH-NORM.
                                  ;Pixel type set to 4 (32-bit pixels).
                                  ;XTILE and YTILE both set to 0 (no tiling).
                                  ;The width is set to 0 (effectively, V is not used in address generation, since this is a line buffer).

This does the same as the last lot, but instead points the UV addressing at the linear output buffer. U- and v-tile are not used and the width is set to zero, which basically means that v has no effect (the linear buffer is addressed by u alone).
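If you are wondering how those two magic numbers fall out of the comments, here is a hedged Python sketch that packs the fields back together. Only "bit 28" is stated outright in the comments; the other field positions (pixel type at bit 20, XTILE at bit 16, YTILE at bit 12, width in the low bits) are inferred from the two constants rather than taken from the hardware documentation, so treat the layout as an assumption and check it against the register reference.

```python
def pack_ctl(ch_norm, pixel_type, xtile, ytile, width):
    """Pack addressing-control fields into a 32-bit word (assumed layout)."""
    word = 0
    if ch_norm:
        word |= 1 << 28                  # bit 28: CH-NORM (from the comments)
    word |= (pixel_type & 0xF) << 20     # assumed position of the pixel type
    word |= (xtile & 0xF) << 16          # assumed position of XTILE
    word |= (ytile & 0xF) << 12          # assumed position of YTILE
    word |= width & 0xFFF                # assumed: width lives in the low bits
    return word


xyctl = pack_ctl(True, 4, 13, 13, 8)     # 8x8 tiled source map
uvctl = pack_ctl(True, 4, 0, 0, 0)       # untiled line buffer, width 0
print(hex(xyctl), hex(uvctl))            # 0x104dd008 0x10400000
```

With that layout the packed words match the constants stored to xyctl and uvctl, which at least shows the comments and the hex agree with each other.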
Right, now the initialisation is done, the source buffer contains an image of sorts, and all the DMA points to the right stuff.
    ; initialise parameters for the routine

        mv_s #0,desty           ;Start at dest y=0
        mv_s #0,destx           ;Start at dest x=0
        ld_s __fieldcount,x     ;Use __fieldcount, to make it move
        ld_s __fieldcount,y     ;Same for Y
        lsl #16,x               ;make it one whole pixel per fieldcount
        lsl #16,y               ;same
        mv_s #$10000,xi         ;Source X inc is 1.0 pixels
        mv_s #0,yi              ;Source Y inc is 0 pixels
        mv_s #0,xs              ;Source X step is 0 pixels
        mv_s #$10000,ys         ;Source Y step is 1.0 pixels
        mv_s #360,destw         ;Width of dest rectangle
        mv_s #240,desth         ;Height of dest rectangle

Here we define some values for our draw routine. x and y will be used to index into the source tile; I have loaded them from __fieldcount, which is the field counter incremented once per video field by the display interrupt. Since this value is constantly incrementing, using it for the offset will make our display scroll diagonally. The xy offset is a 16:16 value, so the fieldcount is shifted up 16 bits, so the integer part gets incremented once per field. The dest origin is set to (0,0), the dest size to 360x240 pixels, the source increment to (1.0,0) and the source step to (0,1.0). The increment and step values are 16:16 fixed point values, because fractional increments are more funky. We're ready to rock.
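If 16:16 fixed point is new to you, this little Python sketch shows the arithmetic: the top 16 bits are whole pixels and the bottom 16 bits are the fraction. The helper names are made up for illustration, and the 32-bit mask just models the register width.

```python
FRAC_BITS = 16
ONE = 1 << FRAC_BITS      # 1.0 in 16:16, i.e. $10000
MASK32 = 0xFFFFFFFF       # registers are 32 bits wide


def to_fixed(value):
    """Convert a number of pixels to a 16:16 fixed point word."""
    return int(round(value * ONE)) & MASK32


def integer_part(fixed):
    """Whole-pixel part of a 16:16 word."""
    return (fixed & MASK32) >> FRAC_BITS


def field_to_offset(fieldcount):
    """Model of lsl #16: one whole pixel of offset per video field."""
    return (fieldcount << FRAC_BITS) & MASK32


print(hex(to_fixed(1.0)))                    # 0x10000
print(integer_part(field_to_offset(5)))      # 5

# A fractional increment of 0.5 only moves on a whole pixel every second step.
x = 0
for _ in range(4):
    x = (x + to_fixed(0.5)) & MASK32
print(integer_part(x))                       # 2
```

So an increment of $10000 steps exactly one source pixel per destination pixel, and anything smaller stretches the tile.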
    ; now the outer loop

    warp_outer:
        push v2             ;save the source X and Y, and the width and height
        push v3             ;save the dest X and Y

We save the source and dest positions. They are gonna get molested as we step horizontally across the source rectangle, and this way we can just pop them off when it comes time to add the step values at the end of the scanline.
    ; and now the inner.

    warp_inner:
        mv_s #64,r0         ;This is the maximum number of pixels for one DMA.
        sub r0,destw        ;Count them off the total dest width.

We intend to do a 64-pixel chunk of the destination scanline, so we deduct that from the remaining width. If that does not go negative, the dma length is 64 (in r0).
        bra gt,w_1          ;do nothing if this is positive
        nop
        st_s #0,ru          ;Point ru at the first pixel of the output buffer

The previous two instructions get executed anyway regardless of the conditional branch, as they are in delay slots; here one is initialising RU and the other is empty. Always try and have your delay slots filled with some instructions, however piffling. Nops are so ugly.
        add destw,r0        ;If negative, modify the number of pixels to generate.

If the width went negative or zero, then it's the end of the scanline, and the DMA length may well be shorter than 64 pixels. Adding the value to r0 leaves it with the correct DMA length.
    w_1:
        jsr pixel_gen       ;Go and call the pixel generation loop
        mv_s r0,dma_len     ;Set the dma length in my dma vector
        st_s r0,rc0         ;Set the counter for the pixgen loop

Here is where we actually call the pixel generation function. In the delay slots on the way, the dma length is copied to dma_len, and it is also used to initialise the counter rc0. The pixel generation function fills up the destination buffer with rc0 pixels, and then DMAs them out to the address in destx and desty.
    ; Pixel gen function will return here after having generated and DMA'd out the pixels

        cmp #0,destw        ;Did the width go negative?
        bra gt,warp_inner   ;No, it did not, carry on the horizontal
                            ;traverse of the dest rectangle
        add dma_len,destx   ;add dma_len to the dest x position
        nop                 ;empty delay slot

If the width did not go negative, we loop on around until it does, filling the destination scanline.
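The way a scanline gets chopped into DMA-sized pieces can be modelled in a few lines of Python. This is just a sketch of the arithmetic in the inner loop (subtract 64, fix up the last chunk by adding the negative remainder); the function name is made up for illustration.

```python
def dma_chunks(width, max_len=64):
    """Model of the inner loop: return the DMA length used for each chunk."""
    chunks = []
    destw = width
    while True:
        r0 = max_len
        destw -= r0            # sub r0,destw
        if destw <= 0:         # branch to w_1 not taken
            r0 += destw        # add destw,r0 (destw is zero or negative here)
        chunks.append(r0)
        if destw <= 0:         # cmp #0,destw / bra gt,warp_inner falls through
            break
    return chunks


print(dma_chunks(360))   # [64, 64, 64, 64, 64, 40]
```

For our 360-pixel scanline that is five full 64-pixel DMAs and one 40-pixel DMA to finish the line off.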
    ; Horizontal span is finished if we fall through to here

        pop v3              ;restore dest X and Y
        pop v2              ;restore source X and Y
        add #1,desty        ;point to next line of dest
        sub #1,desth        ;decrement the Y size
        jmp gt,warp_outer   ;loop for entire height
        add xs,x            ;add the X step to the source
        add ys,y            ;add the Y step to the source

Here is the tail of the outer loop code, which gets executed when the scanline is complete. Source and destination addresses are restored, and 1 is added to the destination Y position, moving to the next scanline down. The height is decremented by one and if it isn't 0, we loop back for another pass, adding the source step values to the source XY address on the way.
    ; all done!

        pop v7,rz           ;get back return address
        nop
        rts t,nop           ;and return

And that's it for the actual draw subroutine, apart from the actual function to draw the pixels. It's pretty simple as you can see, and the DMA isn't that much of a pain in the arse, as we'll find out next.
Now comes the most important part of the routine, the pixel-generation function. Right now, just while I get things going, I'm keeping this stupidly simple. All it does is collect pixels from the source and copy them to the destination buffer, and increment the various buffer pointers.
    pixel_gen:

    ; This is the pixel generation function. It collects pixels
    ; from the 8x8 pattern buffer and
    ; deposits them in the linear destination buffer for output to
    ; external RAM.

        st_s x,(rx)             ;Initialise bilinear X pointer
        st_s y,(ry)             ;Initialise bilinear Y pointer
        ld_p (xy),pixel         ;Grab a pixel from the source
        dec rc0                 ;Decrement the counter
        st_p pixel,(uv)         ;Deposit the pixel in the dest buffer
        addr #1,ru              ;increment the dest buffer pointer
        bra c0ne,pixel_gen      ;Loop for the length of the dest buffer
        add xi,x                ;Add the x-increment
        add yi,y                ;Add the y_increment

It's totally non-optimal, but it's plain to see what's going on. At this stage, it's important to do everything in a very obvious way, so you know everything's working properly. There's plenty of time to worry about being optimal later. You'll be spending a lot of time staring at this inner loop code.
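For anyone who would rather read the loop in a high-level language first, here is a rough Python model of it. It is a sketch under assumptions, not a description of the hardware: it treats the pixel load as fetching the pixel at the integer part of the 16:16 x and y, and models the 8x8 tiling as wrapping that integer part modulo 8. The function name and the numbered test tile are made up for illustration.

```python
TILE = 8
MASK32 = 0xFFFFFFFF


def pixel_gen(tile, x, y, xi, yi, count):
    """Fill a line buffer with count pixels sampled from an 8x8 tile.

    x, y, xi and yi are 16:16 fixed point words. Returns the line buffer and
    the updated x and y, as they would be left in the registers.
    """
    line = []
    for _ in range(count):
        tx = (x >> 16) % TILE            # integer part, wrapped by the tiling
        ty = (y >> 16) % TILE
        line.append(tile[ty * TILE + tx])
        x = (x + xi) & MASK32            # add xi,x
        y = (y + yi) & MASK32            # add yi,y
    return line, x, y


# Number the tile's pixels 0..63 so the sampling order is easy to see.
tile = list(range(TILE * TILE))
line, x, y = pixel_gen(tile, 0, 0, 0x10000, 0, 10)
print(line)   # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]
```

Walking ten pixels along the top row wraps back to pixel 0 after pixel 7, which is the tiling doing its job.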
    ; If it falls through here, the output buffer is full.
    ; So I am gonna call my general dma out
    ; function, which waits for DMA available, then
    ; starts the command going

        push v0,rz              ;Save the call stack pointer

We're about to call a subroutine from within a subroutine, so we need to push the call stack pointer before we do.
        mv_s #dmaFlags,r0       ;Get DMA flags for this screentype.
        ld_s dest,r1            ;Address of external RAM screen base
        copy destx,r2           ;destination xpos
        copy desty,r3           ;destination ypos
        lsl #16,dma_len,r4      ;shift DMA size up
        or r4,r2                ;and combine with x-position
        bset #16,r3             ;make Y size = 1
        mv_s #dma__cmd,r4       ;address of DMA command buffer in local RAM
        st_v v0,(r4)            ;set up first vector of DMA command
        add #16,r4              ;point to next vector
        mv_s #line,r0           ;address of line buffer in local RAM
        st_s r0,(r4)            ;place final word of DMA command
        sub #16,r4              ;point back to start of DMA command buffer
        st_s r4,mdmacptr        ;launch the DMA

This code chunk launches a bilinear DMA event. dmaFlags is defined in scrndefs.i and is specific to our 360-pixel wide, Mode 4 screen. It is the first scalar of the DMA command, and I get it into r0. Next, the base of the current screen buffer is loaded into r1 from dest. The next two scalars define the x and y position of the DMA and the size - position in the low 16 bits, size in the high 16 bits of each scalar, one each for X and Y. We are transferring a line of pixels that is dma_len wide and 1 pixel high, so we set up r2 and r3 accordingly. Then we place the first four scalars of the DMA command into the buffer at dma__cmd, using a vector store. The final scalar of the command is the internal buffer address, so we add that to the command structure. Finally, we launch the DMA, by placing the address of the command buffer into mdmacptr.
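The five scalars of the command are easier to picture laid out as a list, so here is a hedged Python sketch that builds them the way the assembly does. The flag and address values are hypothetical placeholders (the real ones come from scrndefs.i and the linker), and the function name is made up for illustration.

```python
def build_line_dma_command(flags, screen_base, destx, desty, dma_len, line_addr):
    """Build the five scalars of a one-line bilinear DMA command."""
    xword = ((dma_len & 0xFFFF) << 16) | (destx & 0xFFFF)  # size high, x low
    yword = (1 << 16) | (desty & 0xFFFF)                   # height 1, y low
    return [flags, screen_base, xword, yword, line_addr]


# Hypothetical values, for illustration only.
cmd = build_line_dma_command(
    flags=0x12345678,        # stand-in for dmaFlags
    screen_base=0x40000000,  # stand-in for the address held in dest
    destx=64,
    desty=10,
    dma_len=40,
    line_addr=0x20100100,    # stand-in for the address of line
)
print([hex(word) for word in cmd])
```

With destx = 64 and dma_len = 40 the X scalar is $00280040, and with desty = 10 the Y scalar is $0001000A: position in the low half, size in the high half.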
        jsr dma_finished,nop    ;Call a function that waits until DMA is finished -

Our mission here is almost done. The call to dma_finished ensures that the DMA system has completed writing out the output buffer, so we can return and start filling it with fresh pixels. Of course it's kind of silly to have to hang around and wait for that to happen, and we'll do something about that in a later version of the code.
        pop v0,rz               ;Restore the call stack pointer.
        nop                     ;Delay while the pop completes.
        rts t,nop               ;Return to the main loops.

Finally, with the DMA done and the buffer ready for re-use, we pop off the old return address, and rts the hell out of here. The t,nop form just means that we don't have to put nops in for the delay slots of the RTS - it saves a bit of space, and there isn't really anything useful to do in those slots.
    dma_finished:

    ; Wait 'till all DMA has actually finished

        ld_s mdmactl,r0         ;get DMA status
        nop
        bits #4,>>#0,r0         ;wait until Pending and Active Level are zero
        bra ne,dma_finished,nop
        rts t,nop

This routine, dma_finished, simply polls mdmactl and loops until the status indicates that all DMA has completed, then returns.
    ; here is the video stuff

        .include "video.def"
        .include "video.s"

These two includes define the video parameters for our display mode, and include the interrupt and setup routines for the video display stuff.