Floating Point: A Dark & Scary Corner of Code Generation
Why using == with float & double is always wrong
As the dev lead for the .NET JIT compiler, I tend to see bugs coming from customers before anyone else. At least once a month, I get a bug from a customer that tends to look something like this:
This code works fine when it runs (debug, retail, x86, x64), but when it runs on (retail, debug, x64, x86) it doesn’t behave the way my good solid understanding of mathematics tells me it should. Please fix this obvious bug in the compiler immediately!
This is followed by some code snippet that includes a float or double, along with either an equality comparison, or base-10 output, or both. I’ve written a whole lot of responses which have run the gamut from “please go read the Wikipedia entry about floating point” to a detailed explanation about the binary representation of IEEE 754 floating point. Let me first apologize to anyone who’s gotten the “please go read Wikipedia” response. That’s generally only sent when I’m tired & crabby. That said, it’s an excellent article, and if you find this information interesting, you should definitely go read it.
So, in a completely self-serving effort to have a prebaked response to these bugs, let’s discuss what makes floating point so fundamentally confusing & difficult to use properly. There are basically 3 characteristics of FP, all somewhat related, that cause untold confusion. The first is the concept of significant digits. The second is the fact that floating point is not a base 10 representation, but a base 2 representation. And the third is that floating point is really a polynomial representation of a number where the exponent is also variable. And because this is my story, I’m going to begin in the middle.
Problem #2: Base 10 vs. Base 2.
Back when you were learning long division in grade school/primary school/evil villain preschool, you learned that only certain rational numbers (values that can be represented as the quotient of two integers) can be accurately represented as a decimal value. Then you learned to draw bars over the repeating digits, and for most people things continued on. But if you’re like me, you dug in a little deeper and figured out that the only fractions that are accurately representable in base 10 are the ones whose divisor, once the fraction is fully simplified, has no prime factors other than 5 and 2. Any other factor sitting in there turns into repeating decimal digits. 5’s and 2’s are special in base 10 because they’re the prime factors of 10. If I had stuck with math in college, I could probably prove this mathematically, but I went for computer science, where if it holds for all integers from -2^31 to 2^31, it must be true (“Proof by Exhaustion” is what my scientific computation professor called it). So when you’re representing values in base 2, you can only accurately represent a fraction if the divisor’s factors are 2’s and nothing else, for the exact same reason. That means the only way a number can be accurately represented in floating point is if its representation involves nothing but a sum of powers of two: 0.5 and 0.375 make the cut, but 0.1 (one tenth, with that 5 hiding in the divisor) does not. There’s a high level overview of problem #2. Let’s move on to Problem #3.
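The divisor rule is easy to check mechanically. Here’s a minimal sketch in C#; the class and method names are made up for illustration and aren’t part of any framework. It reduces the fraction, then strips out every factor the divisor shares with the base. If nothing is left over, the expansion terminates.

```csharp
using System;

static class TerminatingFractions
{
    // Greatest common divisor (Euclid), used to reduce the fraction.
    static long Gcd(long a, long b)
    {
        while (b != 0) { long t = a % b; a = b; b = t; }
        return Math.Abs(a);
    }

    // Hypothetical helper: true if numerator/denominator can be written
    // with a finite number of digits in the given base.
    public static bool Terminates(long numerator, long denominator, int numberBase)
    {
        if (denominator == 0) throw new ArgumentException("denominator must be nonzero");
        if (numberBase < 2) throw new ArgumentException("base must be at least 2");

        // Fully simplify the fraction first.
        long d = Math.Abs(denominator / Gcd(numerator, denominator));

        // Strip every factor the divisor shares with the base.
        long g;
        while ((g = Gcd(d, numberBase)) > 1) d /= g;

        // Anything left over is a prime factor the base does not have.
        return d == 1;
    }

    public static void Main()
    {
        Console.WriteLine(Terminates(3, 8, 2));    // True:  3/8 is 0.011 in binary
        Console.WriteLine(Terminates(1, 10, 10));  // True:  0.1 in decimal
        Console.WriteLine(Terminates(1, 10, 2));   // False: the 5 in the divisor never goes away
        Console.WriteLine(Terminates(1, 3, 10));   // False: 0.333... in decimal
    }
}
```

The third line is the one that bites people: the literal 0.1 looks perfectly tidy in source code, but there is no finite binary fraction that equals it.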
Problem #3: Variable exponents
The IEEE floating point representation contains 3 components: a sign bit, a mantissa, and an exponent. The mantissa is basically the set of coefficients sitting in each term of a very simple polynomial. I’m not really a math guy (college math cured me of that problem), but let’s for the sake of discussion use $m_x$ for the digits in the mantissa, and $e$ for the value of the exponent. Normal IEEE single-precision floating point has a sign bit plus 31 bits of ‘numeric’ value expression: 7 bits for the exponent (plus a sign) and 23 stored bits for the mantissa, which works out to 24 binary digits once you count a leading 1 that never gets stored. Double precision is the same representation, but with more bits of accuracy in both the exponent and the mantissa, so we’ll just stick with single-precision. The algebraic expression of a single-precision floating value is represented like this:
$$1 \cdot 2^{e} + m_0 \cdot 2^{e-1} + m_1 \cdot 2^{e-2} + \dots + m_{21} \cdot 2^{e-22} + m_{22} \cdot 2^{e-23}$$
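If you’d like to see those pieces in a real value, here’s a rough sketch that pulls a float apart and then rebuilds it by summing the polynomial one term at a time. The class and method names are invented for this example. One assumption worth calling out: the hardware stores the exponent as 8 bits with a bias of 127 rather than as a sign plus 7 bits, so the code subtracts 127 to recover e. The rebuild only handles normal values (no zeros, denormals, infinities, or NaNs).

```csharp
using System;

static class FloatAnatomy
{
    // Splits a float into its three stored fields.
    public static void Decode(float value, out int sign, out int storedExponent, out int mantissaBits)
    {
        int bits = BitConverter.ToInt32(BitConverter.GetBytes(value), 0);
        sign = (bits >> 31) & 1;               // 1 bit
        storedExponent = (bits >> 23) & 0xFF;  // 8 bits, biased by 127
        mantissaBits = bits & 0x7FFFFF;        // 23 stored bits (the leading 1 is implied)
    }

    // Rebuilds a normal value by adding up the polynomial, term by term.
    public static double Rebuild(int sign, int storedExponent, int mantissaBits)
    {
        int e = storedExponent - 127;          // remove the bias
        double term = Math.Pow(2.0, e);        // the implied leading 1, times 2^e
        double total = term;
        for (int i = 0; i < 23; i++)
        {
            term /= 2.0;                                 // now 2^(e-1-i)
            int m = (mantissaBits >> (22 - i)) & 1;      // m_i, most significant first
            total += m * term;
        }
        return sign == 1 ? -total : total;
    }

    public static void Main()
    {
        // 0.15625 is 1.25 * 2^-3, so it fits exactly.
        Decode(0.15625f, out int s, out int exp, out int man);
        Console.WriteLine($"sign={s} e={exp - 127} mantissa=0x{man:X6}");
        Console.WriteLine(Rebuild(s, exp, man) == 0.15625);   // True
    }
}
```

Only one mantissa bit is set for 0.15625 (m_1, worth 2^-5), so the whole value is 2^-3 + 2^-5.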
If you consider the mostly standard way of doing multi-digit addition, this is a very similar representation, with one very important difference: you only have 24 columns to use! So if you’re adding 2 different numbers, but they have the same exponent, everything works as you would expect. But if your numbers have different values in the ‘exponent’ field, the smaller one loses precision. As an example, let’s design a similar base 10 system with 4 digits of accuracy. Adding the numbers 1999 and 5432000, both of which can be represented accurately in 4 digits (with an exponent) looks like this:
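      5.432    x 10^6
    + 0.001999 x 10^6   (that's 1.999 x 10^3, shifted over so the columns line up)
    = 5.433    x 10^6   (only 4 columns survive)
So I just lost the 999, because my representation doesn’t allow me to represent it any more accurately than that. You might hope it would at least result in 5434000, and a system that rounds to the nearest value (which is what IEEE does by default) would give you that, while my simple one just chops. Either way, the digits that don’t fit in the columns are gone for good.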
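The same thing happens in real single-precision math once the two exponents are far enough apart. Here’s a small sketch (the helper name is made up). With 24 binary digits, 2^24 is the point where a float can no longer hold every integer, so adding 1 to it changes nothing.

```csharp
using System;

static class LostDigits
{
    // Adds two floats and forces the result back to single precision.
    // The explicit cast matters: without it, the runtime is allowed to
    // carry the intermediate result at higher precision.
    public static float AddSingle(float a, float b)
    {
        return (float)(a + b);
    }

    public static void Main()
    {
        float big = 16777216f;   // 2^24

        // The 1 falls off the end of the 24 columns.
        Console.WriteLine(AddSingle(big, 1f) == big);   // True

        // 2 is exactly one step at this scale, so it survives.
        Console.WriteLine(AddSingle(big, 2f) == big);   // False
    }
}
```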
Okay, now stick with me, because we’re headed back to the first point.
Problem #1: Significant Digits
In high school chemistry, my teacher spent at least half the time drilling into our heads the idea of ‘significant digits’. We were only allowed to report data to the right number of significant digits, which reflected the accuracy of our measuring equipment. If we had measuring equipment that would report accuracy to the gram and milliliter, we were expected to estimate the next digit. So you’d report that the goo had a mass of 21.3g and a volume of 32.1mL, but if you were reporting density, you didn’t report that it was .6635514g/mL: you only had 3 significant digits of accuracy, so reporting anything beyond 3 digits is noise. The density was .664 g/mL. The final digit was actually expected to fluctuate, simply because you were estimating it based on the quality of your instruments, and your eyeballs. The same is true for IEEE floating point. The last digit of the value will fluctuate, because every result gets squeezed into the nearest value that fits, and whatever didn’t fit is gone by the time the next operation comes along. So error begins to grow from that single bit, on up. The interesting thing about significant digits is that some operations drive accuracy down quickly. Multiplying large matrices can wipe out accuracy entirely in short order. Why do I know this? Because a couple of engineers at Boeing told me many years ago, and I believed them because if you can make almost a million pounds fly, I figure you know your stuff.
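You don’t need a large matrix to watch the error grow from that last bit. As a simple sketch (names invented for the example), add 0.1 to itself a million times. On paper the answer is 100000. I’m deliberately not quoting the printed totals, since the exact digits depend on how your runtime formats them.

```csharp
using System;

static class Accumulation
{
    // Sums 0.1f 'count' times, rounding to single precision at every step.
    public static float SumSingle(int count)
    {
        float sum = 0f;
        for (int i = 0; i < count; i++)
            sum = (float)(sum + 0.1f);
        return sum;
    }

    // Same loop in double precision.
    public static double SumDouble(int count)
    {
        double sum = 0.0;
        for (int i = 0; i < count; i++)
            sum += 0.1;
        return sum;
    }

    public static void Main()
    {
        const int count = 1000000;   // the exact answer would be 100000
        Console.WriteLine(SumSingle(count).ToString("R"));
        Console.WriteLine(SumDouble(count).ToString("R"));
    }
}
```

The single-precision total should land hundreds away from 100000, because once the running sum gets large, each 0.1 has to be rounded to a much coarser step before it can be added. The double-precision total is far closer, but it has the same disease: the error is just pushed further to the right.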
Wait, so how does this affect me?
Putting these three things together lands in a place where your numeric/algebraic intuition is wrong for the abstraction. The single & double precision floating point types are abstractions, and they provide operators that mostly work the way you might expect. But those operators, when used with integer types, have a much clearer direct link to algebra. Integers are easy & make sense. Equality is not only possible, but easily, provably, correct. When accuracy is an issue, well, truncation occurs in the obvious place: division. Overflow wraps around, just like you figured out in CS201. The values that can be accurately represented land in a nice, regular, smooth cadence. And if you’ve got overflow & truncation understood, you’re golden. Floating point, however, doesn’t have a smooth, logical representation. Overflow rarely occurs, but truncation and rounding errors occur everywhere. The values that can be accurately represented are kind of irregular, and how far apart they sit depends on the magnitude of the value being calculated. And then there are things like NaNs, where fundamental concepts like reflexivity fail (a NaN isn’t even equal to itself), and denormals where accuracy becomes even weirder. Ugh!
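Here’s a short side-by-side sketch of those claims, assuming C#’s default unchecked arithmetic for the integer part. Nothing in it is specific to any one machine; it’s just the rules as described above.

```csharp
using System;

static class Oddities
{
    public static void Main()
    {
        // Integer overflow wraps around.
        int max = int.MaxValue;
        Console.WriteLine(unchecked(max + 1) == int.MinValue);   // True

        // Integer truncation happens in the obvious place: division.
        Console.WriteLine(7 / 2);                                // 3

        // NaN is not equal to anything, including another NaN.
        double x = double.NaN;
        double y = double.NaN;
        Console.WriteLine(x == y);                               // False
        Console.WriteLine(double.IsNaN(x));                      // True: the test to use instead

        // Representable doubles get sparser as the magnitude grows.
        double big = 9007199254740992.0;                         // 2^53
        Console.WriteLine((double)(big + 1.0) == big);           // True: the next double is 2 away
    }
}
```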
Okay, but why shouldn’t I use ==?
Everything in our floating point representation is an approximation. So whenever you use an equality expression, the question actually being answered is “does the approximation of the first expression exactly equal the approximation of the second expression”? And how often is that actually the question you want answered? I believe the only people who ever write code where that’s really what they mean are engineers validating either hardware or compilers.
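The classic demonstration, as a small sketch: neither 0.1, 0.2, nor 0.3 is exactly representable, so == ends up comparing the approximation of a sum of approximations against a different approximation.

```csharp
using System;

static class ExactEquality
{
    public static void Main()
    {
        double a = 0.1 + 0.2;   // nearest double to the sum of two approximations
        double b = 0.3;         // nearest double to 0.3

        Console.WriteLine(a == b);                    // False
        Console.WriteLine(Math.Abs(a - b) < 1e-15);   // True: they differ only in the last bit

        // Round-trip formatting shows that the two values really are different.
        Console.WriteLine(a.ToString("R"));
        Console.WriteLine(b.ToString("R"));
    }
}
```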
Doesn’t liberal use of epsilon make the problem go away?
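This is sort of true, but despite the fact that the .NET framework exposes a value called “Double.Epsilon” as well as “Single.Epsilon”, their values are mostly useless for this: each one is the smallest positive value its type can represent (a denormal), not a sensible tolerance for numbers of ordinary size. The “correct” way to write an equality comparison is to do something like this: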
    if (Math.Abs(a - b) < epsilon)
        Console.WriteLine("Good enough for government work!");
but thinking back to our variable exponent problem, the right value of epsilon is a function of the least significant digit that can be represented at the magnitude of the larger of the two values. Something more like this:
    if (Math.Abs(a - b) < LeastSignificantDigit(Math.Max(Math.Abs(a), Math.Abs(b))))
        Console.WriteLine("Actually good enough, really!");
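LeastSignificantDigit isn’t a real framework method, so here’s one way it might be sketched for doubles, under some loudly stated assumptions: UnitInLastPlace and NearlyEqual are names I made up, the tolerance is counted in steps of the last bit, and the default of 4 steps is arbitrary. The comparison uses <= rather than < so that neighboring values can count as equal. Treat it as an illustration of the idea, not as a comparison you should paste into production code.

```csharp
using System;

static class ApproximateEquality
{
    // Hypothetical stand-in for LeastSignificantDigit: the gap between a
    // finite double and the next representable double of larger magnitude.
    public static double UnitInLastPlace(double value)
    {
        if (double.IsNaN(value) || double.IsInfinity(value)) return double.NaN;
        double magnitude = Math.Abs(value);
        long bits = BitConverter.DoubleToInt64Bits(magnitude);
        double next = BitConverter.Int64BitsToDouble(bits + 1);
        return next - magnitude;
    }

    // True when a and b are within maxUlps steps of each other, measured at
    // the scale of the larger magnitude. maxUlps is the caller's choice;
    // 4 is an arbitrary default, not a recommendation.
    public static bool NearlyEqual(double a, double b, int maxUlps = 4)
    {
        if (double.IsNaN(a) || double.IsNaN(b)) return false;
        if (a == b) return true;   // also covers equal infinities
        double scale = Math.Max(Math.Abs(a), Math.Abs(b));
        return Math.Abs(a - b) <= maxUlps * UnitInLastPlace(scale);
    }

    public static void Main()
    {
        double sum = 0.1 + 0.2;

        // Double.Epsilon as a tolerance only matches values that are identical.
        Console.WriteLine(Math.Abs(sum - 0.3) < double.Epsilon);   // False

        // A tolerance scaled to the values themselves behaves sensibly.
        Console.WriteLine(NearlyEqual(sum, 0.3));                  // True
        Console.WriteLine(NearlyEqual(1.0, 1.0001));               // False

        // Neighboring doubles near 1e20 are 16384 apart, so a fixed
        // tolerance like 1e-9 can never call them equal.
        double big = 1e20;
        double neighbor = big + 1e4;   // rounds to the next double up
        Console.WriteLine(Math.Abs(big - neighbor) < 1e-9);        // False
        Console.WriteLine(NearlyEqual(big, neighbor));             // True
    }
}
```

Even this has sharp edges (values straddling zero and denormals, for a start), and picking the number of steps to tolerate is a judgment call about your algorithm.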
So why don’t we just compile that in, instead of doing the silly bit-wise comparison that we do? Because there are a small number of people who understand this stuff far better than me (or my team) and they Know What They’re Doing. And sometimes the algorithm where ‘approximate equality’ is needed still cares about the difference between a and b, even when that difference is tiny next to the larger of the two. So instead, we just pretend like algebra still works on this misleading abstraction over the top of some very complex polynomial arithmetic, and I continue to resolve bug reports that arise from developers stumbling across this representation as “By Design”.
Well, this has been fun. I hope it’s helped folks understand a little bit more about why Floating Point Math is Hard (do NOT tell my 12-year-old daughter I said that!). I’d like to apologize to the well-informed floating-point people out there. I’m absolutely certain that I screwed up a number of details. Some were intentional (the implied 1 on the mantissa, though it snuck into the algebraic expression, and the way the biased exponent is actually represented), others weren’t (and if I knew what they were, I wouldn’t have screwed them up, so I don’t have any examples). Thanks for sticking with me, and good luck in your future floating point endeavors!
.NET Code Gen Dev Lead