BCL Refresher: Floating Point Types - The Good, The Bad and The Ugly [Inbar Gazit, Matthew Greig]

So here is another BCL refresher on the topic of floating point types in the BCL.

Believe it or not, we have 3 different floating point types: Single (float), Double and Decimal. Each has its own characteristics and abilities, so let's try to learn a little bit about them and what we can do with them.

So here's a question: how is a number like 231.312 represented in bits? There are a few techniques one can use. One way is to shift the number so it becomes 231312 (an integer) and remember how far it was shifted (3 decimal places); this is called floating-point representation, and it is how the BCL types (System.Single, System.Double and System.Decimal) work. Another way is to store the 231 (whole part) and the .312 (fraction part) separately; this is called fixed-point representation.

The main difference between Decimal and Single/Double is the base used to store the number. Decimal uses a base 10 representation, so it can give an exact representation of any number that fits in a fixed number of digits when written in base 10. Single and Double are stored in base 2, so for many numbers that are easy to write in base 10 they can only give us the closest approximation (for example 0.1, which in a Single is actually stored as 0.100000001490116119384765625). But when we print them with the default "G" format we get the result rounded back to what we expect. Conversely, Single and Double give an exact representation of any number that fits in a fixed number of digits when written in base 2.
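Here is a quick way to see that rounding in action, as a minimal sketch (the exact digits in the second line can vary slightly between runtime versions, but the idea is the same):

float f = 0.1F;
Console.WriteLine(f);                          // prints 0.1 - the default "G" format rounds for display
Console.WriteLine(((double)f).ToString("R"));  // prints something like 0.100000001490116 - closer to what is actually stored
Console.WriteLine(0.1M);                       // prints 0.1 - Decimal stores it exactly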

Single is a 32-bit floating point type that can represent numbers from negative 3.402823e38 to positive 3.402823e38 with 7 decimal digits of precision (even though internally up to 9 digits are maintained). However, Single only approximates most numbers: there are far more values in that range than distinct 32-bit patterns, so it cannot give every number its own representation. To see how different numbers are represented, use this code:

float f = 2.3F;
foreach (byte b in System.BitConverter.GetBytes(f))
    Console.Write(b.ToString("X2"));   // dump each byte of the Single as two hex digits
Console.Read();

By replacing the 2.3F with various numbers you can see the bit representation (in hex) of various numbers. Notice how numbers that are close enough end up with the same representation (try 12007.13009F and 12007.13008F; both print 859C3B46).
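In fact, you can check that collapse directly without looking at the bytes at all (a one-liner using the same two literals):

Console.WriteLine(12007.13009F == 12007.13008F);   // True - both literals round to the same Single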

When we store these numbers we store three parts: a sign (1 bit), an exponent (8 bits), and a mantissa (23 bits). The sign is simply 0 if the number is positive and 1 if it is negative. The exponent is similar to the shift we talked about earlier, except for two things. First, it is in base 2 rather than base 10, so instead of the number of decimal places shifted it is the number of binary places shifted. Second, what is stored is the shift plus an offset rather than just the shift, so that we can easily represent negative shifts as well as positive ones (i.e. keep precision for numbers between -1 and 1). In Single that offset is 127, so instead of raw values from 0 to 255 the 8-bit exponent can represent shifts from -127 to 128. Additionally, the exponent values 0x00 and 0xFF are reserved (they are used for zeros, denormals, infinities and NaN), so the formula below does not apply to them.

The exponent used will be the one that shifts the value being represented to a number between 1 and 2 (including 1, excluding 2). That means the value we now need to store always has a whole part of 1 plus some fractional part. Since the whole part is always 1 we don't need to store it at all: we already know it is 1. The fractional part is stored in the 23-bit mantissa. Since there are 23 bits in the mantissa, the largest value it can hold is 8388607, and the fraction it represents is measured in units of 1/8388608 (that is, 1/2^23). So we can think of the fractional value as being rounded to the nearest 8388608th, with the numerator of that fraction stored in the mantissa. Writing this as a formula, the value stored is

Value = -1^(sign) * 2^(exponent - offset) * (1 + (mantissa / 2^(mantissa bits)))

Adding in the values for the Single type, an offset of 127 and 23 mantissa bits (2^23 = 8388608), we get:

SingleValue = -1^(sign) * 2^(exponent - 127) * (1 + (mantissa / 8388608))
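To make the formula concrete, here is a minimal sketch that pulls the three fields out of a Single and plugs them back into SingleValue (it only handles normal numbers, i.e. exponents other than the reserved 0x00 and 0xFF):

float f = 0.1F;
int bits = BitConverter.ToInt32(BitConverter.GetBytes(f), 0);

int sign     = (bits >> 31) & 0x1;      // 1 bit
int exponent = (bits >> 23) & 0xFF;     // 8 bits, offset by 127
int mantissa = bits & 0x7FFFFF;         // 23 bits

Console.WriteLine("sign={0} exponent={1} mantissa={2}", sign, exponent, mantissa);

double value = Math.Pow(-1, sign) * Math.Pow(2, exponent - 127) * (1 + mantissa / 8388608.0);
Console.WriteLine(value);               // prints roughly 0.100000001490116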

We understand that this can be a little confusing, so let's take a closer look at a couple of values to better understand how they are represented. If we try the values 1.0 and 0.1 we get the following output from our code: 0000803F and CDCCCC3D respectively. The order of these bytes can make discussing them a little tricky (BitConverter.GetBytes hands back the low-order byte first on a typical little-endian machine), so let's reorder them to 3F800000 and 3DCCCCCD to ease the discussion a little. Now if we convert these to binary and separate them into the sign, exponent and mantissa we get the following:

1.0 stored as 0 01111111 00000000000000000000000.

0.1 stored as 0 01111011 10011001100110011001101

If we convert these values back to base 10 we get

1.0 stored as sign of 0, exponent of 127, mantissa of 0.

0.1 stored as sign of 0, exponent of 123, mantissa of 5033165

Now let's put these values into our formula for the stored value. Remember that Single does not store most numbers exactly, so this lets us see what number is actually being kept (the decimal expansions below are written out in full, which is what lets us see the imprecision a Single introduces). For 1.0 we get:

SingleValue = -1^(0) * 2^(127 - 127) * (1 + (0 / 8388608))

SingleValue = 1 * 2^0 * (1+0)

SingleValue = 1 * 1 * 1

SingleValue = 1

In the case of 1.0, it turns out that Single does represent the number exactly. The representation is still imprecise, though, since other numbers will also be represented as exactly 1, and given just the Single we cannot tell whether it started out as exactly 1.0 or as something else that maps to 1.0 (e.g. 1.0000000000001). To see how imprecision is introduced, let's find the value stored in our 0.1 case.

SingleValue = -1^(0) * 2^(123 - 127) * (1 + (5033165 / 8388608))

SingleValue = 1 * 2^(-4) * (1 + 0.60000002384185791015625)

SingleValue = 1 * 0.0625 * 1.60000002384185791015625

SingleValue = 0.100000001490116119384765625

Now we see that 0.1 is actually stored as the value 0.100000001490116119384765625. The reason a number as seemingly simple as 0.1 is not stored exactly is that we are storing it in base 2, not base 10: 0.1 is trivial to write in base 10 but has no finite representation in base 2.
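If you want to convince yourself of that result, here is a quick check (the long literal is simply the exact expansion we just computed):

Console.WriteLine(0.1F == 0.100000001490116119384765625F);  // True - both literals round to the same Single
Console.WriteLine(0.1F == 0.1);                              // False - the Double 0.1 is a different (closer) approximation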

Now, let's talk about Double. Double does a better job of capturing values precisely because it uses 64 bits, so it can represent numbers in the range of negative 1.79769313486232e308 to positive 1.79769313486232e308 with 15 digits of accuracy. Double uses a 1-bit sign, an 11-bit exponent with an offset of 1023, and a 52-bit mantissa. Again, this is only an approximation of the number and not an exact representation. By running the same code as before on a Double we actually get different results for 12007.13009 (C503CAA69073C740) and 12007.13008 (EF2076A69073C740). Notice that the right part is the same, since the numbers are similar in value, but the left part (the low-order mantissa bytes, which print first) is different: that is where the extra precision of the Double type shows up.
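To see those bytes yourself, the same loop from before works once the variable's type is changed to double:

double d = 12007.13009;
foreach (byte b in System.BitConverter.GetBytes(d))
    Console.Write(b.ToString("X2"));
// the loop prints C503CAA69073C740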

The third floating point type is Decimal, and it's rather different. With Decimal we have 96 bits representing numbers from negative 79,228,162,514,264,337,593,543,950,335 to positive 79,228,162,514,264,337,593,543,950,335. However, within that (smaller) range and its limited precision, Decimal represents base 10 numbers exactly rather than approximating them. We need slightly different code to see the bits in Decimal:

Decimal myDecimal = 2.3M;
foreach (Int32 i in Decimal.GetBits(myDecimal))   // four Int32s: low, mid and high parts of a 96-bit integer, then flags (scale and sign)
    Console.Write(i.ToString("X4"));

This will output 00170000000010000, and if we change the 2.3 to 23 we'll get 0017000000000000. The part that disappeared is the scale stored in the last Int32: it tells us how many decimal places to shift, in other words how many times the 96-bit integer (0x17, i.e. 23, in both cases) gets divided by 10 (try .23M and you'll get 00170000000020000, a scale of 2, and so on).
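If you'd rather pull the scale out programmatically than read it off the hex dump, the last Int32 returned by Decimal.GetBits carries it in bits 16 through 23, with the sign in bit 31. A small sketch:

decimal myDecimal = 2.3M;
int[] parts = Decimal.GetBits(myDecimal);      // low, mid and high 32 bits of the 96-bit integer, then the flags
int scale = (parts[3] >> 16) & 0xFF;           // how many times to divide the integer by 10
bool isNegative = (parts[3] & (1 << 31)) != 0;
Console.WriteLine("integer (low part) = {0}, scale = {1}, negative = {2}", parts[0], scale, isNegative);
// prints: integer (low part) = 23, scale = 1, negative = False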

OK, so which one do you choose? Let’s look at this example:

// the same subtraction carried out with each of the three types
Single s1 = 1300.40F;
Single s2 = 1359.48F;
Single s = s2 - s1;
s *= 1000000000000;

Double d1 = 1300.40;
Double d2 = 1359.48;
Double d = d2 - d1;
d *= 1000000000000;

Decimal e1 = 1300.40M;
Decimal e2 = 1359.48M;
Decimal e = e2 - e1;
e *= 1000000000000;

Console.WriteLine(d);
Console.WriteLine(s);
Console.WriteLine(e);

When trying this code you’ll get 3 different answers to the same calculation!

Single tells you the answer is 59079960000000, Double tells you the answer is 59079999999999.9, and Decimal tells you the answer is 59080000000000. Which one is it? Well, if you calculate it yourself you'll see that it's 59080000000000. So, in this example it would seem that Decimal is giving us the best result, and that will generally be true for calculations in the "normal" range of numbers. Decimal is therefore targeting financial applications, where numbers tend to stay in the range of 10 to 20 digits and getting an exact decimal answer matters more than raw range. If, however, you need very large numbers or very small numbers, you may want to choose Double, which has a much greater range and the ability to represent very small numbers. As for Single, you should only use it if you're space conscious, say you have an array with millions of items and are trying to improve the memory utilization of your application.

OK, now, after reading all this we know you are going to ask—who is the bad, who is the good and who is the ugly? Well, since we have three types, there aren’t that many options are there? :-)