In the first place, code everything linearly so that you can be sure that your algorithm is working correctly.
When you have that running, examine your algorithm for possible shortcuts and cheats, and implement them while the code is still in a readable format.
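As a small illustration of checking a shortcut against the straightforward version while everything is still readable, here is a minimal sketch in C. The averaging example and all of the function names are hypothetical and chosen only for illustration; the point is simply that the linear version stays around as a reference, and the shortcut is compared against it over every possible input before it goes anywhere near the inner loop.

```c
#include <assert.h>
#include <stdint.h>

/* Straightforward version: widen, add, halve. This is the reference. */
static uint8_t average_reference(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b) / 2u);
}

/* Shortcut version: no widening needed, because
 * a + b == 2 * (a & b) + (a ^ b). */
static uint8_t average_shortcut(uint8_t a, uint8_t b)
{
    return (uint8_t)((a & b) + ((a ^ b) >> 1));
}

/* Compare the two over all 65536 input pairs; returns the mismatch count. */
static unsigned count_mismatches(void)
{
    unsigned bad = 0;
    for (unsigned a = 0; a < 256; a++) {
        for (unsigned b = 0; b < 256; b++) {
            if (average_reference((uint8_t)a, (uint8_t)b) !=
                average_shortcut((uint8_t)a, (uint8_t)b)) {
                bad++;
            }
        }
    }
    return bad;
}

/* Call this once from a test harness before trusting the shortcut. */
static void check_shortcut(void)
{
    assert(count_mismatches() == 0);
}
```

When the input space is too large to sweep exhaustively, the same comparison can be run over a spread of sample values and the awkward edge cases instead.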
Once you are ready to pack up your inner loop, go offline and make plenty of tea. I like to work at a whiteboard, laying out pseudocode as columns, with one column and colour per function unit. Some people use spreadsheets. Pick a method that is comfortable for you. Work out the minimum number of steps your algorithm requires, forgetting for a moment about the things that have to go on in the periphery, like juggling index registers and loading and storing pixels. You may spot extra shortcuts while you are doing this - test them out before your code gets too convoluted.
Be prepared to slip the phase of your loop a bit. It's okay to have a bit of setup code that is executed before you dive into the critical loop, so that you can begin your calculation and already be preparing for the next iteration while you are proceeding through the current one.
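The idea of slipping the loop's phase can be sketched in C, with the caveat that this only shows the shape of the loop and not how it would be packed into packets. The process function and the other names below are hypothetical stand-ins. The first load happens in setup code before the loop, each pass starts the load for the next iteration while finishing the calculation for the current one, and a little drain code after the loop finishes off the last element.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-pixel work, standing in for the real calculation. */
static uint32_t process(uint32_t pixel)
{
    return pixel * 3u + 1u;
}

/* Linear version: load, compute and store all within one iteration. */
static void run_linear(const uint32_t *src, uint32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t p = src[i];
        dst[i] = process(p);
    }
}

/* Phase-slipped version: the load runs one iteration ahead of the compute. */
static void run_slipped(const uint32_t *src, uint32_t *dst, size_t n)
{
    if (n == 0) {
        return;
    }
    assert(src != NULL && dst != NULL);

    uint32_t current = src[0];          /* setup: first load, before the loop */
    for (size_t i = 0; i + 1 < n; i++) {
        uint32_t next = src[i + 1];     /* begin the next iteration's load */
        dst[i] = process(current);      /* finish the current iteration */
        current = next;
    }
    dst[n - 1] = process(current);      /* drain: the last element */
}
```

Both functions should produce identical output for any input, so the linear one doubles as a check on the slipped one.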
If your inner loop would benefit from having extra registers free, or from doing things in a slightly more complex way outside the loop in order to free up resources inside it, then do it. A few pushes and pops before entering the loop to free registers don't matter at all if it means you save a tick in there.
When you're putting everything together, there are a few helpful little tricks that you should remember:
Avoid ALU bottlenecks by using the multiplier. ADDM and SUBM are extremely useful for incidental additions and subtractions that don't need to set the condition codes, such as incrementing pointers. You can also use the multiplier to get up to three register-to-register moves in one packet: if you have a register with a zero in it, you can issue mv_s (or mv_v), copy and addm source,zeroreg,dest all together.
Use the counters and index registers to full effect. Using rc0 and rc1 to count your iterations is greatly to be preferred to using a standard register and the ALU. You can stick a dec rc0 onto just about any packet, and the branch condition flag isn't smashed if you happen to be doing other, more important stuff with the ALU before you get to your branch. When you're stepping over a data structure, addr instructions can handle a lot of common pointer manipulations - again leaving the ALU proper free to do more useful stuff.
Remember that the index registers are read/write. You could dump perfectly good values into the index registers if you weren't using 'em for anything else, and then be able to manipulate them with addr in parallel with anything else you might have going on in the ALU and the multiplier. You'll need to allow a spare register and a tick to ld_io them out when you need them, but staying off the ALU and the multiplier might be worth it to you.