A number of readers asked for an explanation of my offhand comment that Benford’s Law can be used to show that binary is the best base for doing floating point math.

One of the desired properties of a floating point system is that the “representation error” is as small as possible. For example, suppose we want to express “one third”, in ten-place fixed point decimal notation. The closest we can get is “0.3333333333”, which has a representation error of 0.0000000000333333…

A useful way to get a handle on representation error is to look at the **granularity** of the system. By granularity I mean the smallest difference we can make between two values. For ten-place fixed point decimal notation, the smallest nonzero difference between any two numbers is 10^{-10}. In a floating point system, the granularity changes as the exponent changes. Really large numbers have large granularity; the difference between two successive floats might be millions or billions. And really small numbers have tiny granularity. **The maximum possible representation error is half the granularity**. Since there is a clear relationship between granularity and representation error, I’m just going to talk about granularity from now on. Small granularity = small representation error = goodness.

Let’s consider two similar systems, one which uses a binary mantissa and one which uses a hexadecimal mantissa. I’m going to continue my convention of showing binary numbers in blue, and we’ll do hex in green.

Suppose we’ve got on the one hand, 32 bit IEEE floats, aka “singles”. That is, we’ve got one sign bit, nine bits of exponent biased by -255 (so exponents can be from -255 to 256), a 22 bit mantissa, and an implicit leading “1.” So a number like

(0, 100000011, 1100000000000000000000) would be +1.1100 x 2^{4} = 28

I’m getting sick of spelling out the exponents in non-biased binary. From now on, I’m just going to give them in decimal. And I’m going to give the sign bit by plus and minus, not zero and one.

We now want a hex system with roughly similar range that doesn’t take up more storage. Suppose we’ve got one sign bit, seven bits of exponent biased by -63, and six hex digits for the mantissa. We don’t get a leading “1.” because not every hex number can be so expressed, so we’ll have to use a leading 0. That system has roughly the same range, and if we were to “behind the scenes” express this thing as bits, we’d still be using 32 bits. One for the sign, seven for the exponent, and 24 for the mantissa.

In this system,

(+, 2, A00000) would be +0.A00000 x 16^{2} = 160

An immediate disadvantage of this system is that numbers have multiple representations.

(+, 2, A00000) and (+, 3, 0A0000) are the same number. Let’s ignore that for now. We’ll ignore all hex mantissas with leading zeros. (And besides, the binary IEEE system wastes lots of cases for denormals and NaNs too, so it’s not clear that this is any worse.)

What’s the granularity of the hex system when, say, the exponent is 2? Well, consider a number like

(+, 2, A00000) – the smallest possible number higher than this is (+, 2, A00001) which is 2^{-16} larger. Clearly, 2^{-16} is the granularity for all values with an exponent of 2. More generally, if the exponent is N then the granularity is 2^{4N-24}.

How would we represent

(+, 2, A00000) in our binary system? That would be (+, 7, 0100000000000000000000) = +1.01 x 2^{7} = 160. The next largest number that can be represented in our system is (+, 7, 0100000000000000000001) which is 2^{-15} larger. So the hex system has smaller granularity and hence smaller representation error, and is therefore the better system, right?

Not so fast. Let’s make a chart.

Decimal Hex Binary Binary Granularity Exponent

16 (+, 2,

100000) (+, 4, 0000000000000000000000) -18

32 (+, 2, 200000) (+, 5, 0000000000000000000000) -17

48 (+, 2, 300000) (+, 5, 1000000000000000000000) -17

64 (+, 2, 400000) (+, 6, 0000000000000000000000) -16

80 (+, 2, 500000) (+, 6, 0100000000000000000000) -16

96 (+, 2, 600000) (+, 6, 1000000000000000000000) -16

112 (+, 2, 700000) (+, 6, 1100000000000000000000) -16

128 (+, 2, 800000) (+, 7, 0000000000000000000000) -15

144 (+, 2, 900000) (+, 7, 0010000000000000000000) -15

160 (+, 2, A00000) (+, 7, 0100000000000000000000) -15

176 (+, 2, B00000) (+, 7, 0110000000000000000000) -15

192 (+, 2, C00000) (+, 7, 1000000000000000000000) -15

208 (+, 2, D00000) (+, 7, 1010000000000000000000) -15

224 (+, 2, E00000) (+, 7, 1100000000000000000000) -15

240 (+, 2, F00000) (+, 7, 1110000000000000000000) -15

The hex system always has a granularity exponent of -16. In 8/15ths of the cases, the binary system has worse granularity than the hex system. In 4/15ths of the cases, they have the same granularity, and in only 3/15ths of the cases, the binary system has better granularity. Therefore the hex system is clearly the same or better most of the time. We should use this system instead of binary floats.

Not so fast! We’ve forgotten Benford’s Law! That’s true if any number is as likely as any other, but that’s not realistic.

Suppose the numbers that we are manipulating obey Benford’s Law. In that case, we would expect that fully a quarter of them would begin with 1 when encoded in hex. We’d expect another quarter of them to begin with 2 or 3, another quarter to begin with 4, 5, 6 or 7, and the remaining quarter to begin with 8, 9, A, B, C, D, E and F. If we make that assumption then we must conclude that the binary system is better half the time, equal a quarter of the time, and worse a quarter of the time.

Clearly this isn’t the case just for the exponent 2. For any hex exponent N, the hex system will have a granularity exponent of 4N-24. For numbers in the range expressible by the hex system with that exponent, a quarter of the time, the binary system will have a granularity exponent of 4N-26, a quarter of the time it’ll be 4N-25, a quarter of the time it’ll be 4N-24, and the remaining quarter it’ll be 4N-23, so three-quarters of the time it’ll be as good or better. **On average, the binary system is considerably better if data are distributed according to Benford’s Law. **

This is just one example, not a proof. But we could generalize this example and show that for any system with a given number of bits, binary mantissas yield smaller representation errors on average than mantissas in any other base.

Effectively in IEEE the granularity is a function of magnitude, or, absolute error varies but relative error remains (more) constant. I don’t think it’s good to mix ‘Benfords Law’ in here, as it isn’t that some numbers are better than others, or more popular (think of Google ðŸ˜€ ) its just a mechanism that attempts to keep the relative error constant.

If you wanted to maintain constant absolute error (and hence varying relative error) then normalized #s work pretty nicely and all the math can be done with integers and a little fiddling

That said, nice post

Indeed, the typical relative representation error for the binary system is always between 24 and 25 (binary) orders of magnitude smaller than the value being approximated.

The typical relative representation error for the hex system is between 22 and 26 orders of magnitude.

So the binary system does have the nice property that the average relative representation error is more tightly bound, and that alone might be a good reason to choose it.

But we also need to demonstrate not just that the bounds are tighter, but that the 26 best case for the hex system does not outweigh the 24/25 cases for the binary system.

It IS the case that floating point chips will on average do more math on values with a hex leading significant digit of 1, 2 and 3 than with 8-F put together.

Hmmmmm. OK, for non-normalized numeric ranges and ruling out nyc investment bankers calculating their bonuses (and once upon a time, msft people calculating option yields) I’d agree.

However now something has intrigued me – Is it per chance an inverse exponential that that dropoff follows ? Is a signifcant digit of 8 several orders of magnitude less likely than a significant digit of 0 ?

If it is exponential, and purely as a mental exercise (the vast majority of the world has problems enought with ieee764), I wonder if Benfords law implies there is a more efficient representational form – in short, is there a mechanism where one can steal the magnitude bits for lower values (0,1,2,3) and use them in the mantissa ? (basically, make it more expensive/less accurate for the larger numbers in order to benefit the majority of the use cases)

Consider an extreme case, where you dropped the range by permanently stealing a bit from the exponent (partial bit swipe possible but too complicated to be fun) – Now, if this bit was set to 1, you could assume that there was no exponent, and that all remaining bits were part of the mantissa

I only say this because of another interesting side effect of base 2 systems – every bit you add doubles the resolution, so swiping back 1 or two bits for the most common exponents could double or quadruple the resolution of the mantissa.

This is purely a bit of hypothetical fun however – in the real world I’ll stay with fixed point ðŸ™‚

> Is a signifcant digit of 8 several orders of magnitude less likely than a significant digit of 0 ?

I assume you mean "of 1", not "of 0" — no numbers begin with zero except zero!

Yes, it is an exponential. In decimal, the probability that a number begins with a 1 is 30%, the probability that it begins with a 9 is 4.5%.

It is an irony that in the hex scheme I mentioned above, that the typical relative error is smallest for exactly the situations where the leading digit is less likely.

As you conjecture, if we could come up with a system that had the opposite property — as numbers get larger leading digits, their precision diminishes, thereby granting better precision to more common numbers — we might be able to squeeze out an even lower typical relative representation error.

However, such a system is probably too complicated to be practical.

Somewhere at home I have a math journal (you may have had the same one, Eric) which describes the reason for the distribution of the numbers in a slightly more detailed way than "multiplication explains it" …

It goes something like this: whatever distribution the leading digits you encounter have, you’d expect that distribution to be the same no matter what base you used to represent the numbers. A constant distribution of decimal digits will only be stable across representations if the numbers you encounter are evenly distributed across the real number system , which is preposterous (we encounter numbers between 1 and 999 far more often than numbers between six gooogle + 8635001 and six google + 8635999). This kind of dropoff of leading digits mirrors the dropoff of numbers you typically encounter (well, it is the same dropoff, really).

It’s the same reason why more scientific constants begin with 1 than with any other digit, which is sometimes used to indicate mystical forces at work in the structure of the universe. Actually the opposite is true; any other scenario would mean something magic about base 10.

Yeah, I deliberately glossed over that.

It’s not just base ten being not special either — we’d also expect that scientific constants should begin with 1 in base ten about 30% of the time whether they were measured in Calories or dynes!

Essentially, Benford’s Law works on any system in which the probability distribution for first digits ought to be invariant over changes in scale. You can use straightforward differential equations to show that the only such distribution must be logarithmic.