Fun with Floating Point Arithmetic, Part Four

01/18/2005

A reader also asked the other day why it is that in VBScript, CSng(0.1) = CDbl(0.1) is False.

Forget about binary floating point for a moment. Suppose that we had two fixed-point decimal systems, say one with five digits after the decimal point and one with ten. You want to represent one-third. In our first system, the closest we can get is 0.33333. In our second system, the closest we can get is 0.3333333333.

Now we compare these two things. But this is comparing apples to oranges -- two things need to be the same type to sensibly compare them. We have a choice -- we can either convert the type with more precision to the less-precise format and then compare, or we can convert the less-precise type to the more precise format and compare.

If we did the former, then we'd truncate the long one and it would compare equal to the short one. If we did the latter, then clearly they would not be equal because we'd be comparing 0.3333300000 with 0.3333333333.
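Here is a minimal sketch of that choice, written in Python rather than VBScript, using the standard decimal module to stand in for the two fixed-point systems. The to_fixed helper and the variable names are made up for illustration, and the helper truncates rather than rounds, to match the description above.

```python
from decimal import Decimal, ROUND_DOWN

def to_fixed(value, digits):
    """Truncate a Decimal to `digits` places after the decimal point."""
    quantum = Decimal(1).scaleb(-digits)  # e.g. 0.00001 when digits is 5
    return value.quantize(quantum, rounding=ROUND_DOWN)

third = Decimal(1) / Decimal(3)

short = to_fixed(third, 5)    # one-third in the five-digit system
long_ = to_fixed(third, 10)   # one-third in the ten-digit system

# Choice 1: convert the more precise value to the less precise format.
narrowed_equal = to_fixed(long_, 5) == short

# Choice 2: convert the less precise value to the more precise format.
widened_short = to_fixed(short, 10)
widened_equal = widened_short == long_

print(short, long_, widened_short)
print(narrowed_equal, widened_equal)
```

Narrowing throws away the extra digits and the two values agree; widening pads the short value with zeros and they do not.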

The analogy holds for doubles and singles. In a single, 1/10 in binary is 0.00011001100110011001100. In a double, it's 0.000110011001100110011001100110011001100110011001100110. If we compare by converting the double to a single, then clearly they are equal -- and in fact, about half a billion doubles which are close enough to 1/10 also compare equal. If we compare by converting the single to a double before comparing, clearly they are not equal.

VBScript always converts to the more precise format before doing the comparison.
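A rough way to see both outcomes, sketched in Python rather than VBScript: Python floats are doubles, and round-tripping a value through the struct module's 4-byte "f" format rounds it to the nearest single. The to_single helper is hypothetical. It only imitates the effect of CSng and is not how VBScript is implemented.

```python
import struct

def to_single(x):
    """Round a double to the nearest single, handed back as a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

double_tenth = 0.1
single_tenth = to_single(0.1)  # the single nearest 1/10, widened exactly

# Compare in double precision: the single's missing bits read as zeros.
widened_equal = single_tenth == double_tenth

# Compare in single precision: the double's extra bits are discarded.
narrowed_equal = to_single(double_tenth) == single_tenth

# A different double that is close enough to 1/10 lands on the same single.
nearby = 0.1 + 1e-10
nearby_equal = to_single(nearby) == single_tenth

print(widened_equal, narrowed_equal, nearby_equal)
```

The first comparison is the one that corresponds to CSng(0.1) = CDbl(0.1), and it comes out False.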

You might think that this is kind of bogus. Surely if we're comparing a more precise value to a less precise value, it makes sense to say that the less significant bits are, well, less significant and throw them away. By converting the less precise format to a more precise format, we are essentially manufacturing new precision that didn't previously exist. We're just making it up out of whole cloth.

In the world of science, there's a word for that. It's called "fraud".

Yep, we are totally cheating. This is one of those unfortunate "gotchas" which you've got to be very careful of if you're mixing doubles with singles. There is some justification for it though.

Consider addition, for example. If you have a single, say 0.10000000000000000000000, and a double 0.10000000000000000000000111000…, and you add them together, do you expect that the result will be a single or a double? For this situation, many people would say that the sensible thing to do is to treat the single as a double and add them together, rather than losing the information in the less significant bits of the double. Yet this is once more manufacturing new precision for the single.

It comes down to a simple decision. Which is more important: not losing existing information, or not creating new information arbitrarily?

Once you pick which factor is more important, you've got to apply the rules that entails consistently. You can't say that for addition and subtraction, you convert singles to doubles, but do it the other way for comparisons. If you do that then you get into the rather ridiculous situation that two numbers can have a nonzero difference and yet compare as equal!
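To make that ridiculous situation concrete, here is a small Python sketch of a hypothetical language that subtracts by widening but compares by narrowing. The to_single helper is an illustrative stand-in for a single-precision conversion, and none of this is VBScript's actual behaviour, which widens in both cases.

```python
import struct

def to_single(x):
    """Round a double to the nearest single, handed back as a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

s = to_single(0.1)  # plays the role of the single
d = 0.1             # plays the role of the double

# Subtraction done by widening the single to a double.
difference = s - d

# Inconsistent rule: comparison done by narrowing the double to a single.
equal_if_narrowed = to_single(d) == s

# Consistent rule: comparison also done by widening.
equal_if_widened = s == d

print(difference, equal_if_narrowed, equal_if_widened)
```

Under the mixed rules the difference is nonzero and yet the values compare as equal. Widening for both operations keeps the two answers in agreement.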

The Visual Basic designers decided that loss of information is worse than manufacturing new information and applied that rule consistently to the variant arithmetic logic. Hence, the same goes for operations between integer and floating point types; the integer types are converted to floating point types and the operations are done in floats. You'd certainly never say that 100 + 0.25 should avoid manufacturing new precision, convert the double to an integer, and result in 100, I hope. Similarly, comparisons between the integer 100 and the double 100.25 are done by converting the integer to a double, not converting the double to an integer.

In one case, a comparison can be done by converting to neither type. If you're comparing a 32-bit integer to a single-precision float, you can't convert the single to an integer or the integer to a single without one of them being potentially lossy. In that case, both are converted to doubles. In the VBScript implementation we consult a handy table for what conversion is used when comparing currency, 8-byte float, 4-byte float, 4-byte integer and 2-byte integer to each other.
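The 32-bit integer versus single case can be sketched in Python as follows. This is an illustration under assumptions, not the VBScript code: to_single is a hypothetical helper that rounds to single precision via the struct module, and the particular values are chosen only because they show the loss.

```python
import struct

def to_single(x):
    """Round a double to the nearest single, handed back as a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

n = 33_554_433                 # 2**25 + 1, fits in a 32-bit integer
f = to_single(33_554_432.0)    # 2**25, exactly representable as a single

# Integer to single is lossy: a single has only 24 significant bits.
equal_via_single = to_single(float(n)) == f

# Single to integer is lossy too: the fractional part is dropped.
equal_via_int = int(to_single(0.5)) == 0

# Both to double is lossless: a double holds any 32-bit integer
# and any single exactly.
equal_via_double = float(n) == float(f)

print(equal_via_single, equal_via_int, equal_via_double)
```

Only the comparison done in doubles reports that the integer and the single are different.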

(As an aside, in JScript .NET, where we have 64 bit integers and 64 bit floats which could be compared, we're in this cleft stick again, but this time with no clear way out! There is no larger type to which both can be losslessly converted. Comparing a 64 bit integer to a 64 bit float is a bad idea.)
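The same problem one size up can be sketched in Python, whose int is unbounded and whose float is a 64-bit double. Python's own mixed comparison operators compare exactly, so the conversions are written out explicitly here to mimic a language that has to pick one of the two types. The values are illustrative only.

```python
big = 2**54 + 1        # fits comfortably in a signed 64-bit integer
neighbour = 2**54      # a different integer

# Integer to float is lossy: a double has only 53 significant bits.
as_float = float(big)
integer_changed = int(as_float) != big
equal_via_float = float(big) == float(neighbour)

# Float to integer is lossy: the fractional part is dropped.
equal_via_int = int(0.5) == 0

print(integer_changed, equal_via_float, equal_via_int)
```

Whichever of the two types the comparison is done in, some pair of unequal values ends up comparing as equal.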

Unless you have really compelling backwards-compatibility reasons, avoid using single precision floats altogether. In VBScript both a single and a double are stored as a 16 byte VARIANT, so there is no space savings. And on the chip, both single and double precision floats are converted to an internal extended format (which I believe is 80 bits), processed in that format, and then converted back to singles or doubles when the operation is done. There are no significant savings in either time or space obtained by using singles, and you get potentially a lot of pain because things don't compare the way you might think they do. Avoid, avoid, avoid.
