While I was poking through my old numerical analysis textbooks to refresh my memory for this series on floating point arithmetic, I came across one of my favourite weird facts about math.

A nonzero base-ten integer starts with some digit other than zero. You might naively expect that given a bunch of "random" numbers, you'd see every digit from 1 to 9 about equally often. You'd see as many 2's as 9's. You'd see each digit as the leading digit about 11% of the time. For example, consider the integers between 100000 and 999999. One ninth of them begin with 1, one ninth begin with 2, etc.

But in real-life datasets, that's not the case at all. If you just start grabbing thousands or millions of "random" numbers from newspapers and magazines and books, you soon see that about 30% of the numbers begin with 1, and it falls off rapidly from there. About 18% begin with 2, all the way down to less than 5% for 9.

This oddity was discovered by Simon Newcomb in 1881, and then rediscovered by Frank Benford, a physicist, in 1938. As is often the case, the fact became associated with the second discoverer and is now known as Benford's Law.

Benford's Law has lots of practical applications. For instance, people who just make up numbers wholesale on their tax returns tend to pick "average seeming" numbers, and to humans, "average seeming" means "starts with a five". People think, I want something between $1000 and $10000, let's say, $5624. The IRS routinely scans tax returns to find unusually high percentages of leading 5's and examines those more carefully.

Benford's result was carefully studied by many statisticians and other mathematicians, and we now have a multi-base form of the law. Given a bunch of numbers in base B, we'd expect to see leading digit n approximately ln (1 + 1/n) / ln B of the time.
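To make the formula concrete, here is a minimal sketch that evaluates it for every leading digit in a given base. The function names benfordFraction and benfordTable are mine, invented for illustration, and console.log assumes a host such as Node or a browser.

```javascript
// Expected fraction of numbers whose leading digit is n when written in the given base.
function benfordFraction(n, base) {
  return Math.log(1 + 1 / n) / Math.log(base);
}

// Fractions for every possible leading digit, 1 through base - 1.
function benfordTable(base) {
  var table = [];
  for (var n = 1; n < base; ++n) {
    table.push(benfordFraction(n, base));
  }
  return table;
}

// Print the base-ten predictions as percentages.
var decimal = benfordTable(10);
for (var i = 0; i < decimal.length; ++i) {
  console.log((i + 1) + ": " + (decimal[i] * 100).toFixed(1) + "%");
}
```

For base ten this gives about 30.1% for a leading 1, falling to about 4.6% for a leading 9. In base two the formula gives ln 2 / ln 2 = 1 for the only possible leading digit, which is just another way of saying that every nonzero binary number starts with 1.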

But what could possibly explain Benford's Law?

Multiplication. Most numbers we see every day are not random quantities in and of themselves. They're usually computed quantities with some aspect of multiplication to them.

Consider, for example, any property which grows on a percentage basis. Like, say, the Dow Jones Industrial Average. It typically grows a few percent a year. Suppose, just to pick a rate, that on average the DJIA grows at 7% a year. At that rate, it doubles about every ten years. Suppose that the DJIA is 10000. After ten years of having 1 as the leading digit, it finally gets to 20000. Ten more years go by, but in those ten years it doubles to 40000, not 30000. Therefore, those ten years were spent about half starting with 2, and about half starting with 3. Ten more years go by, and it doubles again to 80000. Now 4, 5, 6 and 7 all have to share a single ten-year span as the leading digit. Eventually we get up to 100000, and spend another ten years starting with 1. Pick a random date and you'd expect that the DJIA on that day would be roughly twice as likely to start with 1 as with 2, and roughly four times as likely to start with 1 as with 5.
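The growth argument can be checked with a rough sketch. Everything here is an assumption made for illustration: a perfectly steady 7% annual rate, 250 samples per year, and a 340-year run, chosen because at 7% a year the value grows tenfold about every 34 years. The function name leadingDigitDays is hypothetical.

```javascript
// Count, for each leading digit, how many sampled days an idealized index
// spends with that digit in front while it compounds at a fixed rate.
function leadingDigitDays(startValue, annualRate, years, daysPerYear) {
  var counts = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]; // index 0 is unused
  var dailyFactor = Math.pow(1 + annualRate, 1 / daysPerYear);
  var value = startValue;
  for (var day = 0; day < years * daysPerYear; ++day) {
    var lead = Number(String(Math.floor(value)).charAt(0));
    counts[lead]++;
    value *= dailyFactor;
  }
  return counts;
}

// Assumed parameters: start at 10000, 7% a year, 340 years, 250 days a year.
var days = leadingDigitDays(10000, 0.07, 340, 250);
var totalDays = 340 * 250;
for (var d = 1; d <= 9; ++d) {
  console.log(d + ": " + (100 * days[d] / totalDays).toFixed(1) + "% of days");
}
```

Under these assumptions the index spends close to 30% of its days starting with 1, a little less than twice as many as it spends starting with 2, and a little less than four times as many as it spends starting with 5.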

We can easily write a program that demonstrates Benford's Law. As we multiply more and more numbers together, they tend to clump based on Benford's Law:

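// counters[m][d] counts how often d was the leading digit of a product
// that used m multiplications. Column 0 is unused: no number starts with 0.
var counters = [
  [0,0,0,0,0,0,0,0,0,0],
  [0,0,0,0,0,0,0,0,0,0],
  [0,0,0,0,0,0,0,0,0,0],
  [0,0,0,0,0,0,0,0,0,0] ];

for (var multiplications = 0 ; multiplications <= 3; ++multiplications)
{
  for (var trial = 0 ; trial < 10000 ; ++trial)
  {
    // Build a product of (multiplications + 1) random integers from 1 to 1000.
    var num = 1;
    for (var mult = 0 ; mult <= multiplications; ++mult)
      num = num * (Math.floor(Math.random() * 1000) + 1);
    var lead = num.toString().charAt(0);
    counters[multiplications][lead]++;
  }
}

// print is supplied by the script host; substitute console.log or WScript.Echo as needed.
print(counters.join("\n"));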

A typical run produces data from which we can draw this table:

              Leading Digit
Mults         1     2     3     4     5     6     7     8     9
0          1102  1069  1085  1125  1167  1107  1083  1124  1138
1          2416  1752  1443  1162  1019   756   643   453   356
2          3046  1854  1265   979   778   632   551   468   427
3          3090  1814  1197   924   779   661   582   491   462
predicted  3010  1761  1249   969   792   669   580   511   458

As you can see, with no multiplications, the distribution is a flat 11% for each. But by the time we get up to two or three multiplications, we're almost exactly at the distribution predicted by Benford's Law.

What does this have to do with floating point math? Well, we could conceivably design chips that did decimal or hexadecimal floating-point arithmetic. Would such chips yield more accurate results? Well, recall that last time, we used the fact that you can stuff a leading 1 onto a bit field to define a number. Binary is the only system in which every number except 0 begins with a leading 1! You can make a statistical argument which shows that bases other than binary, in which you cannot always assume the leading digit, have on average a larger representation error. The argument is somewhat subtle, so I'm not going to actually go through the details of it, but suffice it to say that we can show that for typical uses, binary is the least error-producing system we can come up with, given that we'll almost always be working with data which follow Benford's Law.

Thanks, Eric. Great educational post!

Heh, you going to put this on the daily WTF? I was with Eric until the last paragraph and then I was going "WTF"!

Thanks for taking the time to write this up. Really interesting!

How odd, I was just discussing Benford with a mate over a pizza the other day, and we sort of touched upon most of this.

Great post, as always though.

Subtle indeed. What does the leading digit have to do with errors in representation?

I remember that the guy who discovered this noticed that a used book containing the logarithmic tables was worn more around the numbers starting with the "1" digit, and less and less going to higher digits...

Interesting, but I'm sure if you gave a bunch of people in the financial industry the option of working with floating point in base ten, they would jump at the chance of an order-of-magnitude speed improvement. Their error bounds are defined in base 10 anyway.

1/12/2005 2:50 PM Adi Oltean

> a used book containing the logarithmic tables

> was worn more around the numbers starting

> with the "1" digit,

Yes, I think Martin Gardner published that story (though was he first or second to do so?). Anyway, it made me think that the opposite book, containing exponential tables, would be uniformly worn. The contents of such a book would be uniformly distributed on the value of the argument (say x) but skewed on the value of the exponential (e**x), and most of the values found in it would be where the numeral for the e**x value starts with a relatively low digit.

1/12/2005 4:10 PM James

> Interesting, but I'm sure if you gave a

> bunch of people in the financial industry

> the option of working with floating point in

> base ten, they would jump at the chance of

> an order-of-magnitude speed improvement.

What speed improvement? Hardware to do base 10 arithmetic is grossly slower than hardware to do base 2 arithmetic, and making it floating instead of fixed would not lessen that slowness.

> Their error bounds are defined in base 10

> anyway.

Sure. Do you mean that financial programmers would find an order-of-magnitude speed improvement by doing programming using base 10 floating instead of either base 10 fixed or binary floating? I doubt that very much.

'Course, financial programmers will always have it easier than programmers for space ships or even cars. Accuracy equivalent to twenty decimal digits can be done by using twenty decimal digits, but that's nowhere near accurate enough for some kinds of physical operations.

Norman Diamond, Mike Cowlishaw's presentation to the C committee from before Kona contains some interesting figures (I missed it, being C++):

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1037.pdf

top post. great exposition (i'd encountered this before, but hadn't realised it wasn't benford's)

Eric,

thanks - that was fun.

WM_THX

thomas woelfer

Cheers for another fascinating post, Eric!

Reminds me of a favorite Dilbert cartoon, in which he gets a tour of the accounting dungeon... The tour guide says, "Over here we have our Random Number Generator..." as they pass a troll spouting the words "NINE NINE NINE NINE NINE NINE NINE NINE". Dilbert asks, "Are you sure that's random?"

The response--"That's the problem with randomness: you can never be sure." Heh.

Oh, and thanks for the tax tips!

~ewall

Awesome post.

Just as a comment...

The advantage of decimal over the current format is that most readings taken "in the real world" by "real people" of anything of scientific or financial interest aren't in base 2, and therefore anything that better represents things of interest to most people in the real world in the manner in which they are measured is likely to be better accepted.

It's slower, and optimizing hardware for it is more difficult, but if given a choice between being able to represent a .1 interval precisely as it is measured or not being able to represent it precisely as it is measured, I think most people would pick slow and precise over fast and inaccurate for their simulations, etc.

Hence the "decimal" type in C#, VB6 and VB.NET. A decimal is a 96 bit integer scaled by a power of ten. The power can be any exponent from 0 to -28, which means that 0.1 can be represented exactly.

Similarly, the Currency type in VBScript is a 64 bit integer with a fixed exponent of 10^-4. That can also represent 0.1 exactly.

Both systems trade execution speed for decimal precision. I don't know why more people don't use them.
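The trade-off can be sketched in a few lines. This is not how decimal or Currency are implemented; it only illustrates the idea of an integer scaled by a power of ten, using the Currency scale of 10^-4. The names SCALE and formatScaled are made up for the example, and the formatter handles only non-negative values.

```javascript
// Binary floating point cannot represent 0.1 exactly, so the error shows up quickly.
console.log(0.1 + 0.2 === 0.3); // false

// Scaled-integer sketch: store each value as a whole number of ten-thousandths.
var SCALE = 10000;
var tenth = 1000;        // 0.1
var fifth = 2000;        // 0.2
var threeTenths = 3000;  // 0.3
console.log(tenth + fifth === threeTenths); // true

// Turn a non-negative scaled integer back into decimal text with four places.
function formatScaled(units) {
  var whole = Math.floor(units / SCALE);
  var fraction = String(units % SCALE);
  while (fraction.length < 4) fraction = "0" + fraction;
  return whole + "." + fraction;
}

console.log(formatScaled(tenth + fifth)); // 0.3000
```

The addition is exact because it is plain integer arithmetic; the cost is that every multiplication and division has to rescale, which is part of why such types are slower than hardware binary floating point.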

Multiplications explain the tendency for financial data to follow Benford's, but what about data from nature? River lengths show it for instance, irrespective of the units (miles, kilometres - even inches) used.

Look at it this way: take a set of river lengths. Pick Measuring System #1, say miles, and determine the distribution of first digits.

Now pick a random number, say 12.3. Say a "frob" is 1/12.3 miles. Multiply every length by that number, so that you've now got a set of measurements in frobs, not miles.

Check out the new distribution.

Would you expect the distributions of first digits to be roughly the same whether in miles or frobs? Would you expect it to be the same given ANY choice of measuring unit?

Such sets are "scale invariant", and the only distribution which is scale invariant is the Benford's Law distribution.
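Here is a sketch of that experiment with made-up data. I don't have real river lengths, so the first data set is an assumption: values spread evenly on a logarithmic scale across six orders of magnitude, which is one way to get a Benford-like set. The second set is spread evenly on an ordinary linear scale from 1 to 1000, for contrast. The helper names are hypothetical.

```javascript
// Leading digit of a positive number, read from its exponential notation.
function leadingDigit(x) {
  return Number(x.toExponential().charAt(0));
}

// Share of values starting with each digit; element 0 is the share for digit 1.
function digitShares(values) {
  var counts = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
  for (var i = 0; i < values.length; ++i) counts[leadingDigit(values[i])]++;
  var shares = [];
  for (var d = 1; d <= 9; ++d) shares.push(counts[d] / values.length);
  return shares;
}

// Convert every measurement to a new unit by multiplying by a fixed factor.
function scaleAll(values, factor) {
  return values.map(function (v) { return v * factor; });
}

var N = 60000;
var logSpread = [];    // assumed "river lengths" in miles, even on a log scale
var linearSpread = []; // contrast: even on a linear scale from 1 to 1000
for (var i = 0; i < N; ++i) {
  logSpread.push(Math.pow(10, 6 * (i + 0.5) / N));
  linearSpread.push(1 + 999 * (i + 0.5) / N);
}

var FROBS_PER_MILE = 12.3;
var logMiles = digitShares(logSpread);
var logFrobs = digitShares(scaleAll(logSpread, FROBS_PER_MILE));
var linMiles = digitShares(linearSpread);
var linFrobs = digitShares(scaleAll(linearSpread, FROBS_PER_MILE));

console.log("log-spread, leading 1:    " + logMiles[0].toFixed(3) + " -> " + logFrobs[0].toFixed(3));
console.log("linear-spread, leading 1: " + linMiles[0].toFixed(3) + " -> " + linFrobs[0].toFixed(3));
```

With these assumed inputs the log-spread set keeps roughly 30% leading 1's in both units, while the linearly spread set jumps from about 11% to well over 20% once it is converted to frobs. Only the first set is scale invariant.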

Very interesting. Now I'm wondering if as a Fibonacci Series. In Elliott Wave Theory, the stock market moves up and down based on the Fibonacci series. And this series can be found ad infinitum in long term and short term charts. So my question is, does Benford's Law apply if, let's say, the stock market hit 1000? Could you drop the first digit and apply Benford's Law to the 2nd digit as if it were the 1st? And so forth. Am I just blabbering or are there patterns within patterns using Benford's Law?

This is a "MSD" law (Most Significant Digit). It is interesting to think about what an LSD law would look like. Consider what I will call a "Grocery Store Receipt" law. I am thinking about a receipt from a grocery store that shows quantities, prices, and total cost (Q*P). Assume quantities are uniformly distributed (not true in reality) & that prices are uniformly distributed. Then the distribution of the LSD of total costs is NOT uniformly distributed.
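A tiny enumeration illustrates the claim about the least significant digit. It assumes a toy model in which the last digits of the quantity and of the price are each equally likely to be 0 through 9 and are independent of one another; the function name lastDigitCounts is invented for the example.

```javascript
// For all 100 equally likely pairs of last digits (q, p), count how often
// each digit turns up as the last digit of the product q * p.
function lastDigitCounts() {
  var counts = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
  for (var q = 0; q <= 9; ++q) {
    for (var p = 0; p <= 9; ++p) {
      counts[(q * p) % 10]++;
    }
  }
  return counts;
}

console.log(lastDigitCounts().join(" ")); // 27 4 12 4 12 9 12 4 12 4
```

Out of 100 pairs, the product ends in 0 in 27 of them, in each of 2, 4, 6 and 8 in 12, in 5 in 9, and in each of 1, 3, 7 and 9 in only 4. So even with perfectly uniform inputs, the last digit of the total is far from uniform.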

Jerry -

Check out the graphic chart at this link:

http://www.aicpa.org/pubs/jofa/may1999/nigrini.htm

As per Mark Nigrini's web site, I made a list of the sizes of all the files on my C: drive (Windows machine) and applied Benford's Law to the list of numbers. It was quite amazing to see that the law applied to the file sizes with two anomalies: those beginning with the digits 5 and 7. These happen to be font files and DLLs. Applying Benford's to not only the first digit, but the first and second combination, first, second and third, etc., can reveal all types of anomalies.

There are a wide variety of uses for the application of Benford's law - detection of tax fraud, voting (ballot box) fraud, "curb-stoning" in surveys, etc. I have also posted an Excel workbook at http://www.ezrstats.com/EZSXL/Test_Benfords_law.xls with all the formulae - first 1-3 digits, last 1-2 digits, second digit, etc.

Eric,

Do you have a VBA code example that will help me understand your example?

Could Benford's law be used to test the legitimacy of a Gaussian time series?

There is a free utility which will do Benford's law analysis on Comma Separated Value files here:

http://www.checkyourdata.com

sounds related to the central limit theorem and the tendency to converge on a normal distribution when there's enough things going on.

I don't see the connection at all. Can you explain? - Eric

Here's what is, in my opinion, a very elegant explanation of the "law": http://www.dsprelated.com/.../55.php

I found an Excel spreadsheet to investigate Benford's Law at investexcel.net/.../benfords-law-excel