Unless you really need the precision, floating-point computation just isn't fast enough to be worth using. This is especially true on the Yaroze where FP instructions are emulated in software. Fortunately with just a little effort you can instead implement a fixed-point numbering system. Using integers is much faster and more natural for the Yaroze (or any CPU for that matter). This article will touch on a few issues regarding fixed-point notation such as: how many bits to use, where to place the binary point, and algorithms for doing fixed-point arithmetic.

So you've decided to use fixed-point notation for your new game. The amount of precision you need and the range of values you represent will help you choose how many bits to use. In this discussion I may refer to 8-bit quantities as "chars", 16-bits as "shorts" and 32-bits as "longs".

First you must understand how a fixed-point number is
composed. There are two parts: the mantissa and the decimal.
When you add the number of bits used for each you should get
the total number of bits in your data type (16 for shorts).
It's important to note that the binary point is something
that **you** define ("binary point" is used when talking
about base-2 numbers as opposed to "decimal point" when
talking about base-10 numbers). We never actually tell the
program where the binary point is. It's something we need to
keep track of ourselves. The mantissa portion is simply the
integer part of the number. The decimal part is all the stuff
to the right of the binary point. For instance if we had
11100.101, the mantissa would be 11100 (28 in base-10) and the
decimal would be 101 (1 * 2^{-1} + 0 * 2^{-2} +
1 * 2^{-3} = 0.625 in base-10).

Since determining the number of bits to use and where to
place the binary point go hand-in-hand we'll just jump in with
an example. The first thing you need to determine is how large
a number space you want to represent. Maybe these numbers will
represent X and Y screen coordinates for your sprites. If your
game world is at most 5 screens wide and 5 screens tall then
you will need to represent values up to 1600 (320*5 - assuming
you're using low resolution). A quick calculator computation
of log_{2}1600 gives you about 10.6. Rounding up means
you would need at least 11 bits of mantissa (which actually
gives you a screen size of 2048x2048). Also suppose your
decimal portion needs to have a precision of plus or minus 0.1
units. A log_{2}0.1 gives -3.3. Flip the sign and you
see that you need at least 4 bits to dedicate to the decimal
portion (giving you an actual precision of +/- 0.0625).

You have a total of 15 bits now (11 for mantissa, 4 for decimal). Since 32-bit numbers are more difficult to deal with (as we will see when trying to multiply them), your best bet is to stay with a total of 16 bits. You only need 15 so you can use that extra bit to expand either the mantissa (doubling your maximum screen size) or the decimal (improving the precision by a factor of two).

It's always nice to have too many bits to use, but what if you don't have enough? What if our world size needs to be 10 screens wide (3200 pixels) and we need a precision of +/- 0.01? Then we would need 13 bits for mantissa and 7 bits for decimal. Okay in this case we might be tempted to use longs (giving us 12 extra bits to work with - wohoo!). But what if we're flexible? We're only 4 bits over the length of a "short". We could reduce the mantissa by 1 bit and the decimal by 3 bits. If you can live with it then go for it. This is a trade off you'll have to handle on a case-by-case basis.

NOTE: For all of these examples I will assume a fixed-point notation of 12 bits of mantissa and 4 bits of decimal.

So we know how large our numbers will be and where the binary point is located. How do we use them? In two cases, addition and subtraction, the answer is easy. Just treat them like any other number! Think about our normal base-10 system. If we want to add 10.7 to 16.5 we can simply remove their decimal points (making 107 and 165), add the two integers (272), and put the decimal point back where it was (27.2). The overflow from the decimal portion carries directly into the integer part. Bottom line, unless your result is too large to represent in the bits you have, you can just add (or subtract) normally.

u_short source1, source2; u_long result; . . . result = (u_long) source1 + (u_long) source2; // Do the add if (result & 0xFFFF0000) { // Check for overflow - you can remove >overflow code< // this code if you can guarantee the } // add will never overflow |

Unfortunately, life's not as nice when doing multiplication and division. The problem is finding out where to put the binary point in the result.

The first thing to consider when you multiply an X-bit number by a Y-bit number is that your result will be X+Y bits long. This is why I said that 32-bit multiplicands would be a pain; when you multiply two of these you end up with a 64-bit number which can't be contained in one variable (unless your CPU has 64-bit registers - which the Yaroze does not). You would need to put in a lot more steps to get the result you want. In the interests of simplicity we will not talk about how to deal with this case; that will be the subject of a future article.

One point I should bring up is that you can significantly
speed up multiplication by using shifts instead of multiplies.
If you are multiplying by 16, for instance, that is the same
as shifting left by 4 bits. If you are not multiplying by a
power of 2, you can break it into several shifts and adds.
For example, (X * 10) is the same as (X * (8 + 2)) which is
equivalent to ((X * 8) + (X * 2)). So you can shift X left
by 3 bits (2^{3} = 8), save this value, shift X left
by 1 bit (2^{1} = 2), and add the two together. You
can do this for any multiplication, but depending on how many
terms it takes it may or may not be faster than doing an
actual multiply.

This is only helpful when one of the operands is a constant. If they both are variables then we have to do it the hard way. Put your thinking caps on for this.

If we multiply two 16-bit short variables we can store the result in a 32-bit long to play with it. We need to do one minor adjustment before we can use the result though. Like stated above, the number of bits in the decimal portion of the result is equal to the sum of the decimal bits of the two multiplicands. If our fixed-point notation has 4 bits of decimal in each of the two inputs then our result will have 8 decimal bits. To get this back down to 4 bits, we shift it right by 4.

So we're left with a 32-bit result: the lower 4 bits are decimal, the next 24 bits are mantissa (12 bits each from the two sources), and the top 4 bits are unused (since we shifted right by 4 bits, they don't contain actual data). To convert this back to our 16 bit fixed-point format we have to condense the 24 bits down to 12. If the multiplication result was small enough that it didn't use the top 12 of those bits then we're peachy. Just copy the lower 16 bits of the result to our new value. However, if the multiplication did end up with a result larger than we can represent in 16 bits we need to handle it differently. You will have to handle this one on your own depending on your program. Two options are to saturate the result at the largest possible 16-bit value (all 1's for an unsigned value), or just raise a flag that the multiplication overflowed and have some other code handle it.

#define NUM_DECIMAL_BITS 4 // Number of bits representing the decimal u_short source1, source2; u_long result; result = ((u_long) source1 * (u_long) source2) >> NUM_DECIMAL_BITS; if (result & 0xFFFF0000L) { // Check if number is too large to fit >overflow code< // in 16 bits. } |

Division is a little nicer since we won't have to worry
about having our result being too large; the number of bits
will be reduced instead of increased. Of course division
naturally takes much longer (in terms of CPU cycles) than
addition or even multiplication. If you can't avoid it then
just try to minimize how many divisions you do (combine
multiple divisions into one step if possible). Also, similar
to multiplication, remember that shifting can accomplish the
same thing if you are dividing by a power of 2. The value
(X/8) is the same as (X shifted right 3 bits). Note you
*cannot* break up divisions the same way we broke up
(X*10) in the previous section.

Before we divide two numbers, we need to do some
pre-shifting. This is because the act of dividing will cause
us to lose some significant bits. When you divide a number
with X decimal bits by a number with Y decimal bits, the
result has X-Y bits for the decimal. So, assuming X and Y are
equal, if we just do a straight divide we well end up with
zero bits for the decimal! To keep this from happening we
need to shift the first source to the left Y bits
*before* doing the division (you may need to cast it into
a larger data type, like a "long", before dividing). Then
just do the division. The bottom 16 bits of the result will
hold your final value.

#define NUM_DECIMAL_BITS 4 // Number of bits representing the decimal u_short source1, source2; u_long result; result = ((u_long) source1 << NUM_DECIMAL_BITS) / (u_long) source2; |

So that's it! You can now implement a fixed-point numbering system in your game if you keep your values 16 bits or less. Next issue I will discuss some algorithms for multiplying and dividing large sized sources.

This web page and its contents are © 1997 Scott Cartier