Fun with Floating Point Arithmetic, Part Five

I went to Joel Spolsky's geek dinner at Crossroads the other night, which was a lot of fun. I didn't get much of a chance to chat with Joel, as he was surrounded by a cadre of adoring fans three deep the whole time.  I mostly hung out with KC and Larry and some other attendees and had an interesting talk about digital rights management, corporate blogging, and the difficulties of finding good quality Jackie Chan movies in the original Chinese.

I ran into Wesner Moise, who I first met long ago when he was working on Access or Excel or one of those Office kind of products but haven't really run into since. He was rather surprised that I remembered his name, but hey, how many Wesner Moises do you think I meet? 

Anyway, coincidentally, Wesner has also been running a series on the perils of floating point and integer mathematics on his blog recently. Check it out! 


I said a while back that floating point math is nothing like the math we're used to. Consider some of the properties that define real number addition. For instance:

closure: x + y is a number
commutative: x + y = y + x
unique zero: a + b = a if and only if b = 0
associative: (x + y ) + z = x + (y + z)

and so on, and similarly for multiplication. Commutivity still holds, but many of these rules do not work in floating point arithmetic.  I'll dig into a few of them here -- for more math rules that don't work in floating point, see Wesner's blog articles on the subject.  He lists dozens of them!

Consider the closure property, for example. It's not true in VBScript:

print 10^308 + 10^308 ' Overflow error

It is true in JScript, if you consider Infinity to be a number.

The commutative property is true in both VBScript and JScript, but the unique zero, and hence the associative property, are true in neither:

print 10^20 = 10^20 + 5000 prints True -- So clearly, zero is not a unique number which, when added, results in the same value. (Zero is unique in that it is the only number when added to EVERY number, results in no change. But almost every number has multiple values which result in no change when added.)

That means that the associative property goes out the window:

print 10^20 + (5000 + 5000) = (10^20 + 5000) + 5000 prints False

The fact that the order in which you make the additions can affect the result makes a difference if you are designing algorithms that must add up lots of little things to one big thing. In those cases, you should try to add together all the little things first, and then add the total to the big thing. That way, the small additions are done with the most precision possible.

There's a related error due to rounding off. 1/100 cannot be represented perfectly accurately in binary any more than 1/3 can be represented perfectly in decimal:

var steps = 100;
var start = 10;
var stop = 11;
var current = start;
  current = current + (stop-start)/steps;
} while (current < stop)

Which ends up with


This actually takes 101 steps, because the last one through the loop gets to 10.999999999999978, which is less than 11.0.  This tiny accumulated error results in the algorithm running for 1% too many steps. A better algorithm is to do the looping in integers and compute the current value anew every time:

var steps = 100;
var start = 10;
var current;
for (var step = 0; step < steps ; ++step)
  current = start + step/steps;

That actually runs the right number of steps.

A corollary of this is that you should almost never compare two floating point numbers for equality, because you never know when some rounding error might have crept in. Rather, subtract them and look at the absolute difference.  For instance, if you know that x and y are positive floats that are likely to be close to each other, don't say if (x==y), say if (Math.abs(x-y) < 0.0001) or if (Math.abs(x-y) < 0.0000001 * x) or whatever makes sense in your application.  (If you need to deal with NaNs and infinities, simple subtraction is anything but!)

We saw above that addition of a small number to a large number can lead to an erroneous result, but the error was extremely small.  An error of 5000 in a number as big as 1020 is 0.00000000000002%, which ain't bad.  We'd expect that subtraction of two numbers very close to each other would also produce an erroneous result, but with a similar error percentage.

We'd be wrong. Here's an example:

Suppose you have to solve an arbitrary quadratic equation:

A x2 + B x + C = 0

The solutions are well known, so we can write a little subroutine:

Sub SolveQuadratic(A, B, C)
Discriminant = B*B-4*A*C
If Discriminant < 0 Then
Print "No real solutions"
Print (-B + Sqr(Discriminant)) / (2 * A)
Print (-B - Sqr(Discriminant)) / (2 * A)
End If
End Sub

SolveQuadratic 2, 5, -12

And sure enough, it prints out -4 and 1.5, done. What about

SolveQuadratic 1, -10000000.0000001, 1

The correct solutions are 10000000 and 0.0000001, but this prints out 10000000 and 9.96515154838562E-08, yielding an error of nearly 3.5%! We've got about a hundred trillion times as much error here as we did in the addition. Why?

Because B and the root of the discriminant are very, very close to each other.  You only get 15 decimal places of accuracy, and we've used them all up. Therefore, the difference is going to be very inaccurate. Their sum, however, is going to be quite accurate, since they're of similar size and precision.

Fortunately, in this example there's a trick. The product of the two solutions is always C / A, so we can use this fact to write a better algorithm:

Sub SolveQuadratic(A, B, C)
Discriminant = B*B-4*A*C
If Discriminant < 0 Then
Print "No real solutions"
Exit Sub
End If
Soln1 = (-B - Sqr(Discriminant)) / (2 * A)
Soln2 = (-B + Sqr(Discriminant)) / (2 * A)
If Abs(Soln1) < Abs(Soln2) Then Soln1 = Soln2
Soln2 = C / (A * Soln1)
Print Soln1
Print Soln2
End Sub

Which produces a much more accurate result.

But wait a minute -- addition is order-dependent, as we've seen. so are multiplication and division. Should that be ( C / A ) / Soln1 or C / (A * Soln1) ? Or does it even matter? Figuring that out is left as an exercise for the reader!

Comments (11)

  1. Peter Milley says:

    Commutativity, at least, should hold in any system which meets IEEE standards for floating-point arithmetic. Roughly speaking, the computed result of any operation should be the representable number which is closest to the actual result of the operation. Since the actual values of a+b and b+a are the same, the computed values of a+b and b+a had better be the same as well.

    Breaking the other rules occasionally is unavoidable in a floating-point setting, but there’s no excuse for breaking commutivity.

  2. Nicholas Allen says:

    IEEE is not necessarily commutative due to the non-uniqueness of NaNs and SNaNs. A binary operation involving two NaNs resulting in NaN can vary depending on the specific NaN value of a particular operand. Many implementations use a single unique NaN though and can ignore this.

  3. Eric Lippert says:

    Correct — if one lives in the (strange!) world where one distinguishes between different NaN values, then that can be the case.

    But for regular numbers, IEEE floats always have commutative addition.

  4. someone says:

    "print FormatNumber(10^20)

    print FormatNumber(10^20 + 250000)




    So clearly, zero is not a unique number which, when added, results in the same value."

    Clearly? Why should I assume that if "print FormatNumber(x)" and "print FormatNumber(y)" print the same string, x and y must be equal?

  5. Eric Lippert says:

    Aha! You got me!

    In fact, you have no reason to believe that. Because of the decimal rounding issues, this is a precision-losing operation. In fact those two numbers are not equal.

    I’ll fix the text.

  6. another someone says:

    Re Peter Milley’s comments. I haven’t really read IEEE’s standards itself, but didn’t floating point arithmatic only promise to give one of the two represented numbers that are closest to the desired value?

    a+b and b+a could produce two different floating point numbers c and d such that c <= a+b <= d. I believe the process is still compliant to IEEE standard if there does not exist any floating point number e such that c < e < d.

  7. Frederik Slijkerman says:

    One quite unrelated aspect of IEEE floats is that you can compare them as integers, if they are both positive.

    This used to be handy in assembler because float comparisions were costly in x86 processors prior to the Pentium II/III family when the FCOMI instruction was added.

    Example with singles:

    fld [x]

    fld [y]


    fstsw ax


    jg …

    Is equivalent to:

    mov eax, [x]

    cmp eax, [y]

    jg …

    If the floats can be negative, it is more complicated. Doubles can be compared as well in two steps.

  8. Nicholas Allen says:

    You can compare floats as integers because floating point numbers are lexicographically ordered. Two negative floats can also be compared directly as integers although you then have to negate the result. The sign causes a problem because floats use sign-magnitude while integers use 2’s-complement. However, you can convert by subtracting a negative valued float as an integer from 0x80000000 (or the appropriate number for your bit size float).

    This can also be used as the test for the difference between floats. The difference between two floats as integers is their difference in ulps. Frequently this can turn out to be a better measure of "nearly equal" than using the code Eric posted above. A drawback to this method is that Infinity is 1 ulp away from the largest finite number, possibly not what you want! NaNs can also cause bad comparisons although on x86 they typically appear as millions of ulps away from the finite numbers.

  9. KC Lemson says:

    Great article, except you made one fatal flaw… there is no such thing as a good quality Jackie Chan movies in the original Chinese.


  10. Eric Lippert says:

    We’ll sit down sometime and watch "Operation Condor", the English version and "Armour Of God 2: Operation Condor" in the original Chinese, and you’ll very quickly realize that the original is the far superior film.

    Note that English "Operation Condor" is cut from "Armour of God 2: Operation Condor", but English "Operation Condor 2: The Armour of the Gods" is cut from "Armour of God", part 1. It can be very confusing. Armour of God 2: Operation Condor is the superior of the two, though Armour of God 1 does have the scene that almost killed Jackie Chan — it was a fall from a tree that went wrong and cracked his skull. I believe he still has a metal plate in his skull as a result.

    (The "Drunken Master" movies are also in reverse chronology in English, by the way.)

  11. Michael says:

    Can you explain why

    WSH.echo 1317.445 – 1300




Skip to main content