Floating-point arithmetic


What has a mantissa field and an exponent field, is built into nearly every major programming language, and yet is really understood by relatively few programmers? The answer is floating-point arithmetic. Most programmers realize that floating-point computation is "not exact" and that error can accumulate over time, but they have little conception of which operations cause how much error to build up how quickly, or of how to avoid and repair these kinds of problems.


The applied field of mathematics that studies floating-point arithmetic (among other things) is numerical analysis. The algorithms created by this discipline go directly into advanced numerical libraries used in scientific computing, such as LAPACK, and are carefully crafted to be as numerically stable as possible (you would be surprised how horribly wrong the answers produced by a naive Gaussian elimination algorithm or Newton's method can be in some cases). One of the best sources I know for a crash course in the basics of floating-point theory, including rounding error and common sources of error, the IEEE format and operations, and compiler optimization of floating-point computations, is David Goldberg's What Every Computer Scientist Should Know About Floating-Point Arithmetic (PDF here).
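To make that Gaussian elimination remark concrete, here is a minimal sketch (my own example, not from Goldberg's paper or this post) of a 2x2 system where naive elimination uses a tiny pivot and loses the answer entirely, while partial pivoting recovers it:

    /* Solve  eps*x + y = 1  and  x + y = 2  with eps tiny.
       The true solution is x ~ 1, y ~ 1. */
    #include <stdio.h>

    int main(void)
    {
        double eps = 1e-20;

        /* Naive elimination: use the tiny eps as the pivot. */
        double m   = 1.0 / eps;           /* enormous multiplier */
        double a22 = 1.0 - m * 1.0;       /* the 1.0 is swamped and lost to rounding */
        double b2  = 2.0 - m * 1.0;       /* the 2.0 is swamped too */
        double y1  = b2 / a22;            /* comes out as exactly 1 */
        double x1  = (1.0 - y1) / eps;    /* comes out as 0: catastrophically wrong */

        /* Partial pivoting: swap rows so the larger entry is the pivot. */
        double m2 = eps / 1.0;
        double y2 = (1.0 - m2 * 2.0) / (1.0 - m2 * 1.0);  /* about 1 */
        double x2 = 2.0 - y2;                             /* about 1 */

        printf("naive:   x = %g, y = %g\n", x1, y1);
        printf("pivoted: x = %g, y = %g\n", x2, y2);
        return 0;
    }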


If you don't feel like reading a detailed paper on the topic, then let me sum things up in one simple rule: use doubles. Your first line of defense against roundoff error should always be to add precision: it's easy and it works. Don't think that floats are fine just because you only "need" 3 to 7 digits of accuracy; you need more accuracy in the intermediate results to ensure that you end up with those 3 to 7 digits. In some cases the performance impact is unacceptable, or the error is still too high, and you must design (or find) clever numerical algorithms to get the most out of your bits, but doing so up front is simply premature optimization. In fact, I'd go so far as to say that you should always use the highest-precision floating-point type available to you, and only think about lowering it when a performance problem manifests. This strategy will save you a lot of headaches in the end.
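As a quick illustration of why intermediate precision matters, here is a small sketch (my own example, not from this post): summing 0.1 ten million times, where the exact answer is 1,000,000. The float accumulator drifts far from the true value, while the double accumulator stays correct to many significant digits:

    #include <stdio.h>

    int main(void)
    {
        float  fsum = 0.0f;
        double dsum = 0.0;

        for (int i = 0; i < 10000000; i++) {
            fsum += 0.1f;   /* roundoff piles up in the 24-bit significand */
            dsum += 0.1;    /* same error source, but 53 bits absorb it */
        }
        printf("float sum:  %f\n", fsum);   /* off by a large margin */
        printf("double sum: %f\n", dsum);   /* wrong only in the last few digits */
        return 0;
    }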


But is precision really a substitute for good algorithms? Surprisingly, in many cases, yes it is. I wrote a short C program computing log(1 + x) for small x in three ways (a reconstruction is sketched below):



  • The direct, obvious way, using floats. Mean relative error: 0.0157
  • A more numerically stable algorithm given by x·log(1 + x)/((1 + x) − 1), using floats. Mean relative error: 3.2 × 10⁻⁸
  • The direct, obvious way, using doubles. Mean relative error: 3.1 × 10⁻¹¹

Relative error was roughly constant in each case across the trial values. As you can see, a good algorithm helps a lot, but there are many times when precision really can make up for a stupid algorithm.
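For reference, here is a hedged reconstruction of that experiment. It is not the original program, and the range of trial values for x below is my own guess, so the exact error figures will differ, but it implements the three formulations from the list, using the C library's log1p as the reference value:

    #include <stdio.h>
    #include <math.h>

    /* 1. Direct formula in single precision: the low bits of x are lost
          when 1 + x is rounded to a float. */
    static float log1p_naive_float(float x)  { return logf(1.0f + x); }

    /* 2. The stable rewrite x*log(1 + x)/((1 + x) - 1): the rounding error
          in (1 + x) largely cancels between numerator and denominator. */
    static float log1p_stable_float(float x)
    {
        float w = 1.0f + x;
        return (w == 1.0f) ? x : x * logf(w) / (w - 1.0f);
    }

    /* 3. Direct formula again, but with double intermediates. */
    static double log1p_naive_double(double x) { return log(1.0 + x); }

    int main(void)
    {
        double err1 = 0.0, err2 = 0.0, err3 = 0.0;
        int n = 0;

        for (double x = 1e-6; x < 1e-3; x *= 1.01, n++) {
            double exact = log1p(x);   /* high-quality reference */
            err1 += fabs(log1p_naive_float((float)x)  - exact) / exact;
            err2 += fabs(log1p_stable_float((float)x) - exact) / exact;
            err3 += fabs(log1p_naive_double(x)        - exact) / exact;
        }
        printf("mean relative error, naive float:  %g\n", err1 / n);
        printf("mean relative error, stable float: %g\n", err2 / n);
        printf("mean relative error, naive double: %g\n", err3 / n);
        return 0;
    }

(Link against the math library, e.g. with -lm, on most Unix-like systems.)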


Derrick Coetzee
This posting is provided "AS IS" with no warranties, and confers no rights.

Comments (2)

  1. Derrick,

    Outstanding. I found you through Wes Moise (cause he somehow always knows all of the cool people). Subscribed. You’ve been blogging, what, 10 days. I’m already a fan.

    This post (and your others) covers information that is obvious yet often ignored or missed in the daily habits of programmers, especially those involved in numerical analysis or other peripheral disciplines.

    Thanks and regards,

    —O

  2. Andy says:

    This is a huge pet peeve of mine! Developers who don’t understand how floating point actually works. MSFT finally listened and is allowing us to control intermediate precision in its 2005 compiler, which is awesome, but then they go and make long doubles store their precision in the same space as a regular double. WTF? Who was the genius that came up with that idea? David Goldberg’s paper that you pointed to is something I have pointed many people to as well, but only a few of them ever read it, much less actually understand it. Then you try and make people like that work with a chipset that doesn’t support floating-point types, so you have to write your own fixed-point libraries, and the major headaches ensue. I don’t know if where you work you come into contact with regular corporate developers a lot or not, but the lack of understanding out here in corporation land of basic Comp Sci principles is astounding. Questions like "what’s a data structure?" are so normal for me to hear that I don’t even bat an eye nowadays when I hear stuff like that.
