DSI Tech

Working with Fixed Point Numbers

Unless you really need the precision, floating-point computation just isn't fast enough to be worth using. This is especially true on the Yaroze where FP instructions are emulated in software. Fortunately with just a little effort you can instead implement a fixed-point numbering system. Using integers is much faster and more natural for the Yaroze (or any CPU for that matter). This article will touch on a few issues regarding fixed-point notation such as: how many bits to use, where to place the binary point, and algorithms for doing fixed-point arithmetic.

Mantissa & Decimal

So you've decided to use fixed-point notation for your new game. The amount of precision you need and the range of values you represent will help you choose how many bits to use. In this discussion I may refer to 8-bit quantities as "chars", 16-bits as "shorts" and 32-bits as "longs".

First you must understand how a fixed-point number is composed. There are two parts: the mantissa and the decimal. When you add the number of bits used for each you should get the total number of bits in your data type (16 for shorts). It's important to note that the binary point is something that you define ("binary point" is used when talking about base-2 numbers as opposed to "decimal point" when talking about base-10 numbers). We never actually tell the program where the binary point is. It's something we need to keep track of ourselves. The mantissa portion is simply the integer part of the number. The decimal part is all the stuff to the right of the binary point. For instance if we had 11100.101, the mantissa would be 11100 (28 in base-10) and the decimal would be 101 (1 * 2^-1 + 0 * 2^-2 + 1 * 2^-3 = 0.625 in base-10).

Since determining the number of bits to use and where to place the binary point go hand-in-hand we'll just jump in with an example. The first thing you need to determine is how large a number space you want to represent. Maybe these numbers will represent X and Y screen coordinates for your sprites. If your game world is at most 5 screens wide and 5 screens tall then you will need to represent values up to 1600 (320*5 - assuming you're using low resolution). A quick calculator computation of log₂1600 gives you about 10.6. Rounding up means you would need at least 11 bits of mantissa (which actually gives you a screen size of 2048x2048). Also suppose your decimal portion needs to have a precision of plus or minus 0.1 units. A log₂0.1 gives -3.3. Flip the sign and you see that you need at least 4 bits to dedicate to the decimal portion (giving you an actual precision of +/- 0.0625).

You have a total of 15 bits now (11 for mantissa, 4 for decimal). Since 32-bit numbers are more difficult to deal with (as we will see when trying to multiply them), your best bet is to stay with a total of 16 bits. You only need 15 so you can use that extra bit to expand either the mantissa (doubling your maximum screen size) or the decimal (improving the precision by a factor of two).

It's always nice to have too many bits to use, but what if you don't have enough? What if our world size needs to be 10 screens wide (3200 pixels) and we need a precision of +/- 0.01? Then we would need 13 bits for mantissa and 7 bits for decimal. Okay in this case we might be tempted to use longs (giving us 12 extra bits to work with - wohoo!). But what if we're flexible? We're only 4 bits over the length of a "short". We could reduce the mantissa by 1 bit and the decimal by 3 bits. If you can live with it then go for it. This is a trade off you'll have to handle on a case-by-case basis.

Arithmetic

NOTE: For all of these examples I will assume a fixed-point notation of 12 bits of mantissa and 4 bits of decimal.

So we know how large our numbers will be and where the binary point is located. How do we use them? In two cases, addition and subtraction, the answer is easy. Just treat them like any other number! Think about our normal base-10 system. If we want to add 10.7 to 16.5 we can simply remove their decimal points (making 107 and 165), add the two integers (272), and put the decimal point back where it was (27.2). The overflow from the decimal portion carries directly into the integer part. Bottom line, unless your result is too large to represent in the bits you have, you can just add (or subtract) normally.

  u_short source1, source2;
  u_long result;
    .
    .
    .
  result = (u_long) source1 + (u_long) source2;  // Do the add

  if (result & 0xFFFF0000) {   // Check for overflow - you can remove
    >overflow code<            // this code if you can guarantee the
  }                            // add will never overflow

Multiplication

Unfortunately, life's not as nice when doing multiplication and division. The problem is finding out where to put the binary point in the result.

The first thing to consider when you multiply an X-bit number by a Y-bit number is that your result will be X+Y bits long. This is why I said that 32-bit multiplicands would be a pain; when you multiply two of these you end up with a 64-bit number which can't be contained in one variable (unless your CPU has 64-bit registers - which the Yaroze does not). You would need to put in a lot more steps to get the result you want. In the interests of simplicity we will not talk about how to deal with this case; that will be the subject of a future article.

One point I should bring up is that you can significantly speed up multiplication by using shifts instead of multiplies. If you are multiplying by 16, for instance, that is the same as shifting left by 4 bits. If you are not multiplying by a power of 2, you can break it into several shifts and adds. For example, (X * 10) is the same as (X * (8 + 2)) which is equivalent to ((X * 8) + (X * 2)). So you can shift X left by 3 bits (2³ = 8), save this value, shift X left by 1 bit (2¹ = 2), and add the two together. You can do this for any multiplication, but depending on how many terms it takes it may or may not be faster than doing an actual multiply.

This is only helpful when one of the operands is a constant. If they both are variables then we have to do it the hard way. Put your thinking caps on for this.

If we multiply two 16-bit short variables we can store the result in a 32-bit long to play with it. We need to do one minor adjustment before we can use the result though. Like stated above, the number of bits in the decimal portion of the result is equal to the sum of the decimal bits of the two multiplicands. If our fixed-point notation has 4 bits of decimal in each of the two inputs then our result will have 8 decimal bits. To get this back down to 4 bits, we shift it right by 4.

So we're left with a 32-bit result: the lower 4 bits are decimal, the next 24 bits are mantissa (12 bits each from the two sources), and the top 4 bits are unused (since we shifted right by 4 bits, they don't contain actual data). To convert this back to our 16 bit fixed-point format we have to condense the 24 bits down to 12. If the multiplication result was small enough that it didn't use the top 12 of those bits then we're peachy. Just copy the lower 16 bits of the result to our new value. However, if the multiplication did end up with a result larger than we can represent in 16 bits we need to handle it differently. You will have to handle this one on your own depending on your program. Two options are to saturate the result at the largest possible 16-bit value (all 1's for an unsigned value), or just raise a flag that the multiplication overflowed and have some other code handle it.

  #define NUM_DECIMAL_BITS  4  // Number of bits representing the decimal

  u_short source1, source2;
  u_long result;

  result = ((u_long) source1 * (u_long) source2) >> NUM_DECIMAL_BITS;
  
  if (result & 0xFFFF0000L) { // Check if number is too large to fit 
    >overflow code<           // in 16 bits.
  }

Division

Division is a little nicer since we won't have to worry about having our result being too large; the number of bits will be reduced instead of increased. Of course division naturally takes much longer (in terms of CPU cycles) than addition or even multiplication. If you can't avoid it then just try to minimize how many divisions you do (combine multiple divisions into one step if possible). Also, similar to multiplication, remember that shifting can accomplish the same thing if you are dividing by a power of 2. The value (X/8) is the same as (X shifted right 3 bits). Note you cannot break up divisions the same way we broke up (X*10) in the previous section.

Before we divide two numbers, we need to do some pre-shifting. This is because the act of dividing will cause us to lose some significant bits. When you divide a number with X decimal bits by a number with Y decimal bits, the result has X-Y bits for the decimal. So, assuming X and Y are equal, if we just do a straight divide we well end up with zero bits for the decimal! To keep this from happening we need to shift the first source to the left Y bits before doing the division (you may need to cast it into a larger data type, like a "long", before dividing). Then just do the division. The bottom 16 bits of the result will hold your final value.

  #define NUM_DECIMAL_BITS  4  // Number of bits representing the decimal

  u_short source1, source2;
  u_long result;

  result = ((u_long) source1 << NUM_DECIMAL_BITS) / (u_long) source2;

Good Luck!

So that's it! You can now implement a fixed-point numbering system in your game if you keep your values 16 bits or less. Next issue I will discuss some algorithms for multiplying and dividing large sized sources.

Top