Floating-point arithmetic


What has a mantissa field and an exponent field, is built into nearly every major programming language, and yet is really understood by relatively few programmers? The answer is floating-point arithmetic. Most programmers realize that floating-point computation is "not exact" and that error can accumulate over time, but they have little conception of which operations cause how much error to build up how quickly, or of how to avoid and repair these kinds of problems.


The applied field of mathematics that studies floating-point arithmetic (among other things) is numerical analysis. The algorithms created by this discipline go directly into advanced numerical libraries used in scientific computing, such as LAPACK, and are carefully crafted to be as numerically stable as possible (you would be surprised how horribly wrong the answers produced by a naive Gaussian elimination algorithm or Newton's method can be in some cases). One of the best sources I know for a crash course in the basics of floating-point theory, including rounding error and common sources of error, the IEEE format and operations, and compiler optimization of floating-point computations, is David Goldberg's What Every Computer Scientist Should Know About Floating-Point Arithmetic (PDF here).
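To make that Gaussian elimination remark concrete, here is a minimal sketch (my own example, not from Goldberg's paper or this post) of a 2x2 system where naive elimination uses a tiny pivot and loses the answer entirely, while partial pivoting recovers it:

    /* Solve  eps*x + y = 1  and  x + y = 2  with eps tiny.
       The true solution is x ~ 1, y ~ 1. */
    #include <stdio.h>

    int main(void)
    {
        double eps = 1e-20;

        /* Naive elimination: use the tiny eps as the pivot. */
        double m   = 1.0 / eps;           /* enormous multiplier */
        double a22 = 1.0 - m * 1.0;       /* the 1.0 is swamped and lost to rounding */
        double b2  = 2.0 - m * 1.0;       /* the 2.0 is swamped too */
        double y1  = b2 / a22;            /* comes out as exactly 1 */
        double x1  = (1.0 - y1) / eps;    /* comes out as 0: catastrophically wrong */

        /* Partial pivoting: swap rows so the larger entry is the pivot. */
        double m2 = eps / 1.0;
        double y2 = (1.0 - m2 * 2.0) / (1.0 - m2 * 1.0);  /* about 1 */
        double x2 = 2.0 - y2;                             /* about 1 */

        printf("naive:   x = %g, y = %g\n", x1, y1);
        printf("pivoted: x = %g, y = %g\n", x2, y2);
        return 0;
    }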


If you don't feel like reading a detailed paper on the topic, then let me sum things up in one simple rule: use doubles. Your first line of defense against roundoff error should always be to add precision: it's easy and it works. Don't think that floats are fine just because you only "need" 3 to 7 digits of accuracy; you need more accuracy in the intermediate results to ensure that you end up with those 3 to 7 digits. In some cases the performance impact is unacceptable, or the error is still too high, and you must design (or find) clever numerical algorithms to get the most out of your bits, but doing so up front is simply premature optimization. In fact, I'd go so far as to say that you should always use the highest-precision floating-point type available to you, and only think about lowering it when a performance problem manifests. This strategy will save you a lot of headaches in the end.
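As a quick illustration of why intermediate precision matters, here is a small sketch (my own example, not from this post): summing 0.1 ten million times, where the exact answer is 1,000,000. The float accumulator drifts far from the true value, while the double accumulator stays correct to many significant digits:

    #include <stdio.h>

    int main(void)
    {
        float  fsum = 0.0f;
        double dsum = 0.0;

        for (int i = 0; i < 10000000; i++) {
            fsum += 0.1f;   /* roundoff piles up in the 24-bit significand */
            dsum += 0.1;    /* same error source, but 53 bits absorb it */
        }
        printf("float sum:  %f\n", fsum);   /* off by a large margin */
        printf("double sum: %f\n", dsum);   /* wrong only in the last few digits */
        return 0;
    }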


But is precision really a substitute for good algorithms? Surprisingly, in many cases, yes it is. I wrote a short C program computing log(1 + x) for small x in three ways (a reconstruction is sketched below):



  • The direct, obvious way, using floats. Mean relative error: 0.0157
  • A more numerically stable algorithm given by x·log(1 + x)/((1 + x) − 1), using floats. Mean relative error: 3.2 × 10⁻⁸
  • The direct, obvious way, using doubles. Mean relative error: 3.1 × 10⁻¹¹

Relative error was roughly constant in each case across the trial values. As you can see, a good algorithm helps a lot, but there are many times when precision really can make up for a stupid algorithm.
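For reference, here is a hedged reconstruction of that experiment. It is not the original program, and the range of trial values for x below is my own guess, so the exact error figures will differ, but it implements the three formulations from the list, using the C library's log1p as the reference value:

    #include <stdio.h>
    #include <math.h>

    /* 1. Direct formula in single precision: the low bits of x are lost
          when 1 + x is rounded to a float. */
    static float log1p_naive_float(float x)  { return logf(1.0f + x); }

    /* 2. The stable rewrite x*log(1 + x)/((1 + x) - 1): the rounding error
          in (1 + x) largely cancels between numerator and denominator. */
    static float log1p_stable_float(float x)
    {
        float w = 1.0f + x;
        return (w == 1.0f) ? x : x * logf(w) / (w - 1.0f);
    }

    /* 3. Direct formula again, but with double intermediates. */
    static double log1p_naive_double(double x) { return log(1.0 + x); }

    int main(void)
    {
        double err1 = 0.0, err2 = 0.0, err3 = 0.0;
        int n = 0;

        for (double x = 1e-6; x < 1e-3; x *= 1.01, n++) {
            double exact = log1p(x);   /* high-quality reference */
            err1 += fabs(log1p_naive_float((float)x)  - exact) / exact;
            err2 += fabs(log1p_stable_float((float)x) - exact) / exact;
            err3 += fabs(log1p_naive_double(x)        - exact) / exact;
        }
        printf("mean relative error, naive float:  %g\n", err1 / n);
        printf("mean relative error, stable float: %g\n", err2 / n);
        printf("mean relative error, naive double: %g\n", err3 / n);
        return 0;
    }

(Link against the math library, e.g. with -lm, on most Unix-like systems.)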


Derrick Coetzee
This posting is provided "AS IS" with no warranties, and confers no rights.

Comments (2)

  1. Derrick,

    Outstanding. I found you through Wes Moise (cause he somehow always knows all of the cool people). Subscribed. You’ve been blogging, what, 10 days. I’m already a fan.

    This post (and your others) covers information that is obvious yet often ignored or missed in the daily habits of programmers, especially those involved in numerical analysis or other peripheral disciplines.

    Thanks and regards,

    —O

  2. Andy says:

    This is a huge pet peeve of mine! Developers who don’t understand how floating point actually works. MSFT finally listened and is allowing us to control intermediate precision in its 2005 compiler, which is awesome, but then they go and make long doubles store their precision in the same space as a regular double. WTF? Who was the genius that came up with that idea? David Goldberg’s paper that you pointed to is something I have pointed many people to as well, but only a few of them ever read it, much less actually understand it. Then you try and make people like that work with a chipset that doesn’t support floating-point types, so you have to write your own fixed-point libraries, and the major headaches ensue. I don’t know if where you work you come into contact with regular corporate developers a lot or not, but the lack of understanding out here in corporation land of basic Comp Sci principles is astounding. Questions like "what’s a data structure?" are so normal for me to hear that I don’t even bat an eye nowadays when I hear stuff like that.
