DSI Tech

Fixed-Point Notation with 32-bit Values

As mentioned last issue, I will describe how to write fixed-point arithmetic routines using 32-bit values. I am assuming you have read the last installment or have a grasp of fixed point notation.

The largest variable size the Yaroze can handle is 32-bits. Therefore when our computation (addition, subtraction, and multiplication) might give a result with more than 32 bits, we will need to devise special routines to deal with them. If you want to use more than 32 bits in each value then you are talking a lot more work; also the CPU time spent in your computation functions goes up dramatically. If you still want to do it you can take the concepts presented below and apply them to larger value sizes. For this article, though I will only consider variables 32 bits long. Again I will cover addition (and subtraction) first and then multiplication and division.

Arithmetic

Adding (or subtracting) two 32-bit values can give you a result with 33 bits. Since variables don't hold that many bits we need to do something special to determine overflow. The simplest way to deal with this is to only use 31 bit values for each source. Then the result will be 32 bits which lets us just add normally.

If we need that extra bit then it is necessary to have some code which detects the overflow into the 33rd bit. We can't just add the two 32-bit values because the final carry-out (which represents the overflow condition) is lost. We can get around this by shifting the two inputs to the right one bit before doing the add. We will lose some accuracy from the loss of the lower bit, but can compensate for that later.

So shift and do the add to get the result. The 32nd bit determines the overflow condition. If it's set then we had overflow (I am assuming unsigned values - if you are using signed numbers then you need to compute overflow accordingly). But what about those bits we shifted out? Well in the worst case they could both be 1. Then when they are added together they will cause a carry into the second bit. If the result is all ones then this carry will propagate all the way up and will also signal an overflow. The last part of our overflow computation has to check for this.

  u_long x, y, overflow_check, result;

  result = x + y;                                 // Get the raw result
  overflow_check = (x >> 1) + (y >> 1);
  if ((overflow_check & 0x80000000L) ||           // Check if upper bit is set...
      ((x & 0x00000001L) && (y & 0x00000001L) &&  //   or if the last bit
       (overflow_check == 0x7FFFFFFFL))) {        //   propagates all the way up.
    >overflow handler code<
  }

Multiplication

This one is very tricky. Unless your two inputs only use the lower 16 bits then you are going to have overflow into the upper 32-bits of the result (remember multiplying two 32-bit numbers will result in a 64-bit value). How (or if) you convert this back to your original 32-bit fixed-point format is implementation-specific. Hopefully you don't need need to keep that number permanently (i.e. it's a temporary result in the middle a series of computations). I'll leave how to deal with that up to you. Like before some suggestions are to saturate the result at the largest possible 32-bit value, or just raise a flag and have some other code deal with the overflow.

When multiplying two 32-bit numbers we need to split each of the operands up into two (or more) variables which contain a lower number of bits. For example, each 32-bit input can be split into two 16-bit quantities (one for the lower word and one for the upper word). Consider a value X. This is the same as (Y*2¹⁶ + Z) where Y is the upper 16 bits and Z is the lower 16 bits of X. For now we're going to drop the 2¹⁶ and just call the upper 16 bits "Y". When we multiply two of these numbers we can distribute the terms and solve each on their own, combining them all in the end.

  (Y1 + Z1) * (Y2 + Z2) = (Z1 * Z1) + (Y1 * Z2) + (Z1 * Y2) + (Y1 * Y2)

In our program we can add them all together to get the final 64-bit result. The key is remembering the relative magnitudes of each factor. Every time we multiply by a "Y", that portion of the result is shifted left 16 bits (so multiplying two "Y"s together is shifted left 32 bits). This following diagram shows how the four factors will add together. The last term is shifted left by 32 bits, the second and third are shifted by 16-bits and the first is not shifted.

  <---Y1 * Y2--->................
  ........<---Y1 * Z2--->........
  ........<---Y2 * Z1--->........
+ ................<---Z1 * Z2--->
---------------------------------
  <---------- Result ----------->

Start from the bottom and work your way up. First multiply Z1 by Z2. The lower 16 bits of that result are equal to the lower 16 bits of the result (none of the other factors will affect those bits). Then take the upper 16 bits and add them to the next two factors (making sure to line them up correctly). The lower 16 bits of that addition result are the next 16 bits of the final output. Take the upper 16 bits of the add and combine it with the last factor. That becomes the upper 32-bits of the result.

Note this takes four 16-bit multiplies to do one 32-bit multiply. Thus it isn't very efficient. Steps should be taken to optimize the design to potentially eliminate some multiplies.

  u_long x, y, factor1, factor2, factor3, factor4, result_high, result_low;
  u_long high_x, high_y, low_x, low_y;

  high_x = (x & 0xFFFF0000L) >> 16;  // ANDs and SHIFTs are not too intensive
  high_y = (y & 0xFFFF0000L) >> 16;
  low_x  = (x & 0x0000FFFFL);
  low_y  = (y & 0x0000FFFFL);

  factor1 = low_x  * low_y;
  factor2 = low_x  * high_y;
  factor3 = high_x * low_y;
  factor4 = high_x * high_y;

  result_low = factor1 & 0x0000FFFFL;          // Bottom 16 bits of result
  factor1 >>= 16;
  factor1 += factor2;
  factor1 += factor3;
  result_low |= (factor1 & 0x0000FFFFL) << 16; // Next 16 bits of result
  factor1 >>= 16;
  result_high = factor1 + factor4;             // High 32 bits of result

Division

Aside from the penalty you pay for the division instruction itself, the process of dividing is simpler than multiplication. This stems from the fact that division causes you to lose bits rather than gain them. Your result will easily fit within your 32-bit variable.

If you'll remember from last time, to divide we need to first shift source 1 to the left by the number of decimal bits in source 2. If you're using all 32 bits of your variable this is a problem. Similar to multiplication we will have to split up source 1 (but not source 2).

Consider 32-bit values X1 and X2. We want to get the result X1/X2. As before, X1 can be represented as (Y1 + Z1) with Y1 assumed to have 2¹⁶ times the magnitude. Thus we have (Y1 + Z1)/X2 which can be split to (Y1/X2) + (Z1/X2). Since Y1 and Z1 are only 16 bits each, we can preshift them with no problems (as long as the number of decimal bits in X2 is 16 or less). Just keep in mind that when you add the two factors together again you must compensate for relative magnitudes.

  #define NUM_DECIMAL_BITS  10

  u_long x, y, high_x, low_x, tmp1, tmp2, result;

  high_x = (x & 0xFFFF0000L) >> 16;
  low_x  =  x & 0x0000FFFFL;

  tmp1 = (high_x << NUM_DECIMAL_BITS) / y;  // Pre-shift the numerator
  tmp2 = (low_x << NUM_DECIMAL_BITS) / y;

  result = tmp2 + (tmp1 << 16);

That completes this discussion. If I made a typo (or an outright error) in this article please write me quick so I can fix it. I hope it was of some interest to you.

Top