Be skeptical of advertised spam blocking rates

Whenever I see rival companies boast about their spam-blocking rates, I have to be a little skeptical.  I have seen some claim that they catch 98% of all spam.  My skepticism is fueled by the fact that I cannot figure out how they generate those statistics.

I am well aware that I am probably missing something, that I am simply not accounting for the obvious and have overlooked an easier method.  You see, I know a little bit about statistical quality control.  When manufacturers publish their failure or success rates, they literally have QA teams pulling pieces of their equipment off the line and manually inspecting them.  If they make 30,000 widgets, they will sample 1000 of them and verify that they are working.  If 950 of them are working, then the manufacturer can claim that 95% of their widgets work properly.  More precisely, the number is 95%, +/- 3.02%, 19 times out of 20 (in other words, if they repeated the sampling 20 times, then in about 19 of those runs the interval they compute, roughly 92% to 98%, would contain the true rate).
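
For the curious, here is roughly where a figure like that comes from.  Below is a minimal sketch of the standard normal-approximation margin of error at 95% confidence (z = 1.96, the "19 times out of 20"), assuming the conservative worst case p = 0.5 plus a finite population correction; the exact decimals depend on which corrections you apply, which is why my 3.02% and the printout differ slightly.

```python
import math

def margin_of_error(n, p=0.5, z=1.96, population=None):
    """95% confidence margin of error for a sample proportion.

    n          -- sample size
    p          -- assumed proportion (0.5 is the conservative worst case)
    z          -- z-score for the confidence level (1.96 ~ 19 times out of 20)
    population -- if given, apply the finite population correction
    """
    moe = z * math.sqrt(p * (1 - p) / n)
    if population is not None:
        # sampling without replacement from a finite batch shrinks the margin
        moe *= math.sqrt((population - n) / (population - 1))
    return moe

# 1000 widgets sampled out of a batch of 30,000
print(f"+/- {margin_of_error(1000, population=30_000):.2%}")  # +/- 3.05%
```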

It seems clear to me that in order to advertise a spam-blocking percentage, you would have to do the same thing.  You would have to capture a random sample of mail, sort it by hand into spam and non-spam, and then run it through your spam filter as it existed at the time you captured the sample.  Only then could you realistically know what your spam-blocking percentage is.  The catch is that it takes time to sort a corpus like that.  Not only that, but a 3% sampling error is pretty high; I would want something closer to a 1.5% sampling error.  To get that, based on a message population of 30 million messages (which is low), you would need a corpus of over 4000 messages.  Who's got time to sort a corpus of 4000 messages?  I've sorted 4000 messages before (many times).  It takes a couple of hours, but if you want to do it properly and ensure that there are no errors in your corpus, it takes forever.  Dear Lord, does it take forever... my glasses prescription has gotten stronger in the past year thanks to all my sorting.
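
The 4000 figure isn't hand-waving, by the way; it falls out of the same formula, inverted to solve for the sample size.  A quick sketch, again assuming the conservative p = 0.5 at 95% confidence:

```python
import math

def sample_size(margin, p=0.5, z=1.96):
    """Sample size needed for a given margin of error at 95% confidence."""
    return math.ceil((z ** 2) * p * (1 - p) / margin ** 2)

# Corpus needed for a +/- 1.5% margin. The population (30 million messages)
# is so large that the finite population correction changes nothing.
print(sample_size(0.015))  # 4269 -- a shade over the 4000 ballpark
```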

Even if you capture a sample of mail on a given day, you only get the spam-blocking rate for that day.  Is your spam filtering improving or getting worse over time?  Eventually you'll need to update your numbers, so you'll have to run the test again (and trust me, I hate sorting corpora, and so will the spam analyst who does the test).  After a few tests you'll eventually see a trend, but in order to keep accurate tabs on your effectiveness, you need to keep running tests periodically.
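
And if you do run the test periodically, you need a way to tell whether a change between two runs is real or just sampling noise.  One standard tool for that is a two-proportion z-test; here is a sketch with made-up numbers:

```python
import math

def rates_differ(caught_a, n_a, caught_b, n_b, z=1.96):
    """Two-proportion z-test: is the change between two runs more than noise?"""
    p_a, p_b = caught_a / n_a, caught_b / n_b
    pooled = (caught_a + caught_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return abs(p_a - p_b) > z * se

# Hypothetical runs: January caught 3800 of 4000 spam, June caught 3880 of 4000.
print(rates_differ(3800, 4000, 3880, 4000))  # True: 95% -> 97% is a real shift
```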

That's the only way I can see to effectively measure spam-blocking efficiency.  If anyone else has a better, more accurate way, I would love to hear it (because it's my job here at Microsoft to measure this... and I'm not looking forward to doing it).  So, whenever we hear how such-and-such company has xx% spam blocking, we need to ask ourselves, "Just how did they get those figures?"

Of course, there are shortcuts.  We can do one really good test and use those figures indefinitely.  Or we can measure how much total mail we block, estimate our non-spam traffic, and treat everything left over as spam (sketched below).  Or we can sort a corpus of only 1000 messages (much quicker) and run the tests on that, but if everyone did that, the figures would all be similar and the roughly 3% margins of error would erase any significant differences.  Then the company with the best marketing would eventually be the winner.  Hmm, maybe Microsoft has the upper hand there after all.
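
For what it's worth, that second shortcut is just arithmetic on aggregate counts.  A sketch where every figure is invented purely for illustration:

```python
# All numbers here are hypothetical.
total_inbound = 30_000_000  # messages seen this month
blocked       = 24_500_000  # messages the filter rejected as spam
est_legit     =  4_000_000  # guess at legitimate traffic, from business data

est_spam = total_inbound - est_legit  # everything that isn't legit is spam
catch_rate = blocked / est_spam
print(f"{catch_rate:.1%}")  # 94.2%
```

Its weakness is obvious: the blocked count includes any false positives, and the legitimate-traffic estimate is a guess, so the resulting catch rate is only as trustworthy as that guess.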