Numbers are bad.

That might sound like a strange statement coming from a professional software developer with a degree in mathematics. Let me try to justify that.

Numbers are bad the way that power tools are bad. I mean, I like power tools. A lot. But you give me a power tool and pretty soon I start making up excuses to use it, whether it makes any sense to use it or not. And do I take the time to use it carefully, and read the manual first, and put on safety glasses? Usually. But maybe not always — I haven’t nailed my foot to the floor yet, but it’s probably only a matter of time. When used judiciously by well-trained people, power tools can do great things, but there are a lot of trips to the emergency room for the rest of us dilettantes.

I see this all the time at work, on the news, everywhere: people lured in by the power of numbers, making up new, but not necessarily sensible, uses for the numeric power tools at their disposal.

Let me give you an example of a time when I almost misused numbers recently.

Once a year in the summer at Microsoft we have a formal review of what we’ve accomplished over the last twelve months and what we want to accomplish in the next. The review process is kind of complicated, and I don’t want to go into the boring details of how it all works. Suffice to say that by the time we’re done we have a long document describing accomplishments and goals, where the last thing on the review before the signatures is one of three categories. We could call the categories “I Met My Goals”, “I Exceeded My Goals” and “I Totally Rocked!” (In actuality there are more categories, but the vast majority of employees fall into one of these three buckets, so we might as well consider just these three.) The review system here is not perfect, but it works pretty well and I have no complaints with it.

I said that we *could* call the categories “I Rocked!” and so on, but *in fact* we call them “3.0”, “3.5” and “4.0”. By naming the categories after numbers, **it becomes very tempting to use power tools on them**. Last night I got to thinking *“I wonder what my career average for yearly reviews has been?”* Fortunately, I stopped myself because I realized that I was about to nail my foot to the floor.

When you make something into a number you make it amenable to all the power tools that three thousand years of mathematicians have invented, **whether or not those power tools make the slightest bit of sense**. For example, you can’t take an average of some Golds, some Silvers and some Bronzes. “Average” simply doesn’t apply to those things! But (let’s make up some fictional numbers for a fictional employee Alice) you can take an average of three 4.0’s, two 3.5’s and a 3.0! What could it possibly mean, that average? What can we deduce from it? **Nothing, because we do not know whether the weighting is actually sensible.**

After six years, Alice has a career average of 3.67 — so what? Compare Alice against Bob, who has a career of six 3.0’s. How do we compare two numbers? How about percentages? Yeah, that’s the ticket! Clearly Alice rocks 22% more than Bob. Great! But to what do we apply that 22%? Salary? Bonus? Vacation time? What does it mean?

The 22% has no meaning because I’m not taking a percentage of anything meaningful! I have no evidence that the system was deliberately designed so that 3.0, 3.5 and 4.0 have meaningful mathematical properties when averaged. What if the numbers were 1.0, 10.0 and 100.0? Then Alice would have an average of 53, Bob would have an average of 1, Alice is over 5000% “better”! What if the numbers were 0.0, 1.0 and 2.0? Then Alice would have an average of 1.33, Bob would have an average of 0.0, and **percentage differences would suddenly cease to make any sense.**
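To make the relabeling game concrete, here is a quick sketch in Python. The scores are the fictional ones for Alice and Bob above; everything else is just arithmetic on labels.

```python
def average(xs):
    """Arithmetic mean of a list of numbers."""
    return sum(xs) / len(xs)

# Fictional review histories from above.
alice = [4.0, 4.0, 4.0, 3.5, 3.5, 3.0]
bob = [3.0] * 6

print(average(alice))                     # 3.666...
print(average(bob))                       # 3.0
print(average(alice) / average(bob) - 1)  # ~0.22 -- the dubious "22% better"

# Same three categories, relabeled 1.0 / 10.0 / 100.0:
relabel = {3.0: 1.0, 3.5: 10.0, 4.0: 100.0}
alice2 = [relabel[x] for x in alice]
bob2 = [relabel[x] for x in bob]
print(average(alice2))                    # 53.5
print(average(alice2) / average(bob2) - 1)  # ~52.5 -- now "5250% better"!

# Relabeled 0.0 / 1.0 / 2.0, Bob's average is 0.0 and the percentage
# comparison divides by zero. The "comparison" simply stops making sense.
```

Same employees, same reviews, three wildly different "comparisons" — which is exactly the point: the labels carry no units, so the arithmetic carries no meaning.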

**By changing the weightings, the comparative benefit of being an overachiever changes and even the set of mathematical tools you can use changes.** If the weightings are arbitrary, an average is also arbitrary, so two averages cannot be sensibly compared. This is the old computer science maxim in action, only this time it’s “arbitrariness in, arbitrariness out”. You can slice and dice numbers until the cows come home, but since the original data were completely arbitrary choices that do not reflect any measurable feature of reality, the results are going to be little more than curiosities.

ASIDE: I am reminded of a sign I saw in a pottery factory I once toured in England. It said that the 1100 degree Celsius kiln was “eleven times hotter than a kettle boiling at 100 degrees Celsius.” I, being a smartass, asked the tour guide whether it was also “negative eleven hundred times hotter than an ice cube at -1 degrees Celsius.” I got in reply a very strange look. Some things you just can’t sensibly add, subtract, multiply and divide even though they are numbers. (Temperatures can only be divided by temperatures when measured in an absolute scale, like degrees Kelvin.)

It gets worse. In this particular case, we don’t even need to consider the question of whether the weighting is sensible to determine that averages are meaningless. We can determine from first principles that an average like “3.67” is meaningless.

Consider a pile of two-by-fours, all of which are exactly 3.0, 3.5 or 4.0 feet long. You toss them into three piles based on their size, multiply the number in each pile by the length, add ‘em up, divide through by the total number, and you obtain an extremely accurate average. Why you care what the average is, I don’t know — but at least you definitely have an accurate average.

Now consider a pile of two-by-fours of completely random lengths, but all between 2.75 and 4.25 feet long. Divide them up into piles by rounding to the nearest of 3.0, 3.5 and 4.0, and again, multiply the number in each pile by 3.0, 3.5 or 4.0, add ‘em up, divide through, and what have you got?

You’ve lost so much information by rounding off that you’ve got an “average” which is only likely to be close to the actual average when the number of two-by-fours is large. Furthermore, I said “random”, but I didn’t say what the *distribution* was. In a “normal” or “bell curve” distribution, a 2.76 is NOT necessarily just as likely as a 3.01, and you have to take that into account when determining what the likelihood of error in the average is.

When the total number of two-by-fours is tiny — say, six — and you’re averaging three 4.0 +/- 0.25, two 3.5 +/- 0.25 and one 3.0 +/- 0.25, well, I’m not sure what you’ve got. You’ve got 3.67 +/- “something in some neighbourhood of 0.25, some of the time”, where I could work out the “somethings” if I dug out my STATS 241 notes (or called up my statistician friend David, which would be faster).

My point is that because of rounding error, the so-called “average” is smeared over so large a range that it is largely useless.

Probably many of those 3.5’s are “actually” 3.7’s who didn’t *quite* make it to the 4.0 bucket. But that information about the extra 0.2 is lost when it comes time to take an average. 3.67 is way, way too precise. All we know about Alice’s average is that her *actual* average is somewhere between 3.0 and 4.0, probably closer to 4.0 — which we knew already just from the range of the data!
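The information loss from bucketing is easy to simulate. Here is a hypothetical sketch; the bucket values and the 2.75-to-4.25 range come from the two-by-four example above, and the uniform distribution is just one convenient assumption.

```python
import random

BUCKETS = (3.0, 3.5, 4.0)

def nearest_bucket(length):
    """Round a length to the nearest of 3.0, 3.5, 4.0."""
    return min(BUCKETS, key=lambda b: abs(length - b))

def true_and_bucketed_averages(n, rng):
    """Draw n uniform lengths in [2.75, 4.25] and average them
    both exactly and after rounding to the nearest bucket."""
    lengths = [rng.uniform(2.75, 4.25) for _ in range(n)]
    true_avg = sum(lengths) / n
    bucket_avg = sum(nearest_bucket(x) for x in lengths) / n
    return true_avg, bucket_avg

rng = random.Random(0)
# With six boards the bucketed "average" can be off by a good chunk of
# the 0.25 rounding error; with sixty thousand it settles down close to
# the true average (for a uniform distribution, at least -- a skewed
# distribution would behave differently, which is the bell-curve caveat).
for n in (6, 60_000):
    t, b = true_and_bucketed_averages(n, rng)
    print(n, round(t, 3), round(b, 3))
```

Each individual board contributes at most 0.25 of rounding error, so the bucketed average can never be off by more than 0.25 — but with only six boards it routinely is off by a meaningful fraction of that, which is exactly the smearing described above.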

And we’re just on averages! We haven’t even begun to consider the power-tool-mishap possibilities of, say, trend lines.

I’m glad I stopped myself. As a developer and a mathematician, I love both the practicality of numbers and the beauty of mathematics for its own sake. **But just because something has a number attached to it doesn’t mean that any of that mathematical machinery is the right tool for the job**. Sometimes numbers are just convenient labels, not mathematical entities.

In other news, I was looking at the blog server statistics last night. Of the 207 Microsoft bloggers on this site, I’m ninth in terms of page hits from non-rss-aggregator browsers on a per week basis. Clearly, I rock. But what about you guys, my readers? If we take the number of comments and divide by the number of posts on a per-week basis, and then take a (geometric fit) trend line, I see that… OW! MY FOOT!

Oh, it’s so funny you mention this. At my former place of employment, this sort of "reporting" happened all the time. What a joke. Thing is, I tried to explain to management why some of the stats they were creating were spurious. All to no avail.

I rate this entry 1315. I promise that this number is not meaningless.

Once again, your blog has forced me to think and to smile. Thanks! Now if I could get my 13 year old daughter to get these concepts, I’d be in business. And as for the rating of 1315, come on…this is clearly a 2630.

If I stubbed my foot while reading this, does it count?

I think a similar principle applies to grades in school, as they often have to be fitted to a curve to match the expected distribution – which often means that you don’t know how much you *really* got. And the average grade (of all the students who take a particular subject) becomes meaningless.

> Temperatures can only be divided by temperatures when measured in an absolute scale, like degrees Kelvin

Well, you can also divide temperature differences. It’s not entirely wrong to say that if you touched that kiln with your hand (which has a temperature of ~35 degrees C) it would feel (1100-35)/(100-35) ~= 16 times hotter than the kettle (or to be more precise it would cause a burn 16 times faster :).

So the guide was actually not that far off.

Correct — that would be measuring relative rate of heat transfer to a given object, not relative total heat energy. If we’re measuring heat transfer, it is the case that an 1100 degree kiln transfers heat _into an object at zero Celsius_ 11 times faster than a 100 degree kettle. (And it is also the case that the transfer happens "-1100" times faster than it would from an object at -1 — the negative makes sense because the heat is being transferred OUT in one case and IN in the other.)

But whoever wrote the sign simply didn’t know what they were talking about. They were trying to get across the idea of total heat, not heat transfer.

We could also talk about the extra "latent" heat required to boil a kettle, which is heat energy that’s gone unaccounted for in this analysis, but I think I’ve laboured the point enough…

You are right, Eric, one-half of nothing is still nothing…

ok, we all agreed that this system doesn’t work. What would be the prefect one?

And we all know that "non-RSS-aggregator" stats are useless anyway because all the kool kids are using aggregators these days… or didn’t you get the memo?

Eric – longtime listener, first-time caller.

Haven’t checked out any of the other MS blogs, but yours is a great resource and geekily funny.

As a JS developer it’s nice to see how some of this stuff works behind the scenes (at least in MS stuff).

Keep writing and screw the numbers.

-David

WSU Student

All too true. I don’t know how many times I’ve had to make this point to people who insist on using numerical Likert scales.

> ok, we all agreed that this system doesn’t work. What would be the prefect one?

Hold on a minute — I just said "The review system here is not perfect, but it works pretty well and I have no complaints with it." We are NOT all agreed that this system doesn’t work.

In fact, it works very well. It’s not perfect, but we don’t need a _perfect_ system, we need an _acceptable_ system.

My point is not at all that the review system is broken, because it isn’t. I’m using the review system as an example of a system where numbers which do not actually quantify anything can be accidentally used as though they were quantities.

I could have picked any number of such systems — Likert scales, as the previous commenter notes, are a good example.

Pain quantification is another example — when recovering from surgery, for example, patients are often asked to rate their pain on a scale from 1 to 10, and then get painkillers appropriate to their level. There are all kinds of ways to misuse those numbers.

I picked the review system because that’s the example that got me on this line of thought, not because I think there is anything particularly broken with our system.

So basically, this is a long way to explain away your low career score average.

ha ha, just kidding. Thanks for another though-provoking post.

er, "thought" provoking post.

> Temperatures can only be divided by temperatures when measured in an absolute scale, like degrees Kelvin

I don’t think it should be degrees Kelvin for two reasons, firstly you don’t use degrees with kelvin measurements for precisely the reasons you’ve stated, secondly it should be a lowercase ‘k’.

Pedantically yours or completely wrong

Joe

PS Fantastic blog

Excellent point, and I like what Damit brought up as well.

I have been particularly annoyed at the current GPA system, which doesn’t really say anything about what a student is good at.

I had friends in high school with near 4.0 GPAs who were taking classes that an elementary school kid could probably get an A in. Yet I was getting a 3.03 taking numerous AP classes. When you looked at the numbers, it looked like I was the dumb one (maybe I was, since I should have just skipped the worthless AP classes and enjoyed my life a little).

I often wondered if it wouldn’t be better to break things down into simple skill sets with a rating of pass or fail. Either you know it or not.

Then there wouldn’t be those people who, with the 3.0 in algebra, get into trig, but who really shouldn’t have gone on, since the things they missed may have been really key concepts.

Heck, if I were an employer, I would much rather see if a person’s skill sets matched my job requirements more than if the person got a 4.0 in a related major. 4.0 doesn’t tell me how good the person is, or what they are particularly good at.

Anyway, sorry for going on and on, this one struck a nerve (who am I kidding, I always go on and on).

Later

Re: "I’m ninth in terms of page hits "

Dude..

Google search of VBscript+Pause+reference stumbled me onto your blog and i HAVE to tell you…

YOU DO ROCK ! ! !

Fantastically entertaining yet still very educational blog. Thank you very much for sharing. ~ Bravo Sincerely, SloLearner

Hey, thank you for writing a great article about a topic that so many people seem to get wrong. A professor at the university I study at was back from a long hospital stay, rambling about the lack of understanding of exactly this topic. He stated that the whole medical system could be much more efficient if all the fancy values that can be measured were really understood and interpreted correctly by the doctors. Sadly, he died of a heart attack. Maybe, if he’s right, he would still be alive; maybe not, but I strongly believe he had a point.

I really find this article interesting, as well as helpful for preparing my statistics presentation.