Why you need large data sets to measure yourself effectively

There are a few organizations out there that measure the effectiveness of various vendors – Virus Bulletin, West Coast Labs, and ICSA to name a few.  These organizations compare filters by measuring each one’s spam effectiveness on a scale of 1-100%, along with its false positive ratio.  Measuring spam effectiveness isn’t all that difficult.  Spam occurs in huge volumes, so in order to get a representative sample, you can fairly easily collect large numbers of messages.

The problem comes with false positives.  If you use a small sample size (fewer than 2500 messages, for example), there are two problems:

  1. One false positive – which can occur randomly – will throw off the test and make you look really bad.  For example, if the test only contains 2500 messages and you get 1 FP, that is a 0.04% error rate.  0.04% is a huge error rate, and there are very few filters that would ever publish that statistic.  The problem is that on a bad day, a filter can flag a single legitimate message as spam.  One vendor’s 0% vs another’s 0.04% looks bad, but the reality is that the difference is one message.  However, it is percentages that get published, not the number of false positives in the latest test.

  2. You need a large sample set to ensure that your statistical measurements are accurate.  Let’s say that a filter advertises a 1/2500 FP rate (measured against legitimate messages, not total messages including spam + non-spam).  You take 2500 messages, run them through the spam filter and lo-and-behold, you get 1 false positive!  Does this mean that your false positive rate is actually 1/2500?  What happens if you run the test again and this time you get zero false positives?  Does this mean that you are actually better than 1/2500?

    The answer is not necessarily.  The fact is that you may have gotten lucky.  Suppose you have a stream of 10,000 messages and you sample the middle 2500.  Your true FP rate might be 1/2500, which means that across the total stream there are 4 false positives.  Yet because of the way you run the test, the following occurs:

    [Chart: a 10,000-message stream containing 4 false positives, with the sampled 2500-message window falling in a stretch that contains none of them]

You can see from the chart above that there are 4 false positives in the entire data set, but you just so happened to pick the exact 2500-message interval in which zero false positives occur, underrepresenting the true rate.  Had you shifted your window one message forward or one message backward, your measured FP rate would have been 1/2500 instead of zero.  Similarly, had you picked the interval from n=7500 to n=10,000, your measured FP rate would have been overrepresented at 1/1250.
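To make the windowing effect concrete, here is a small Python sketch (with made-up false positive positions – not the actual data behind the chart) that measures the same kind of 10,000-message stream through a 2500-message window at different offsets:

```python
# Hypothetical positions of the 4 false positives in a 10,000-message stream.
# Chosen so the middle window sees none of them, mirroring the chart above.
fp_positions = {2499, 5000, 8000, 9000}

def measured_fps(start, width=2500):
    """Count how many false positives fall inside the sampled window."""
    return sum(1 for p in fp_positions if start <= p < start + width)

for start in (0, 2500, 2501, 5000, 7500):
    count = measured_fps(start)
    rate = f"1/{2500 // count}" if count else "0"
    print(f"window [{start}, {start + 2500}): {count} FP(s), measured rate {rate}")
```

The true rate across the whole stream is 4/10,000 = 1/2500, yet depending on where the window lands, the measured rate comes out as 0, 1/2500, or 1/1250.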

The fact is that if your FP rate is 1/2500 and you take a sample of 2500 messages, the probability of measuring exactly 1 false positive is 36.8%, or slightly more than 1/3.  You could run the test again on another data stream of 2500 messages and measure zero false positives; the probability of that occurring is also 36.8%.  Or you could measure exactly 2 false positives; the probability of that occurring is 18.4%.  The probability of measuring at least 1 false positive is 63.2% (the complement of the 36.8% chance of measuring zero – these are binomial probabilities).  You also have a 26% chance of measuring 2 or more false positives in your sampling interval, which would make your measured rate double your true FP rate or worse.  Thus, you need to ensure that when you do your measurement, you do it properly, because if you don’t, your results will be misleading.  You will either think that you are doing better than you are, or worse than you are.
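If you want to check these numbers yourself, here is a quick Python sketch (my own illustration, assuming a true FP rate of exactly 1/2500 and a 2500-message sample) that computes the binomial probabilities quoted above:

```python
from math import comb

n, p = 2500, 1 / 2500   # sample size and assumed true FP rate

def binom_pmf(k):
    """Probability of observing exactly k false positives in n messages."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p0, p1 = binom_pmf(0), binom_pmf(1)
print(f"P(exactly 0 FPs) = {p0:.1%}")            # ~36.8%
print(f"P(exactly 1 FP)  = {p1:.1%}")            # ~36.8%
print(f"P(exactly 2 FPs) = {binom_pmf(2):.1%}")  # ~18.4%
print(f"P(at least 1 FP) = {1 - p0:.1%}")        # ~63.2%
print(f"P(2 or more FPs) = {1 - p0 - p1:.1%}")   # ~26.4%
```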

The rule of thumb is that in order to measure your FP rate, you need a sample roughly triple the reciprocal of that rate.  For example, if your FP rate is 1/2500, then you need to sample 2500 x 3 = 7500 messages.  If you sample 7500 messages and get 3 FPs, then your FP rate is 1/2500.  If you sample 2500 messages and get zero FPs, then you don’t know yet what your FP rate is.
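A rough way to see why tripling helps (again, a back-of-the-envelope sketch of my own, not part of any test methodology): compute the expected number of FPs and the chance of seeing at least one at a few sample sizes, assuming a true rate of 1/2500.

```python
true_rate = 1 / 2500   # assumed true FP rate

for sample_size in (2500, 5000, 7500):
    expected_fps = true_rate * sample_size
    p_at_least_one = 1 - (1 - true_rate) ** sample_size
    print(f"{sample_size:>5} messages: expect {expected_fps:.0f} FP(s), "
          f"P(at least 1 FP) = {p_at_least_one:.0%}")
```

With 7500 messages you expect 3 false positives and have roughly a 95% chance of seeing at least one, so a zero-FP result actually tells you something.  With only 2500 messages, a zero-FP result is almost as likely as not.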

Why does all of this matter?  It matters because unless you understand this, industry-published statistics are “white-washed.”  Let’s assume that a published FP rate is 1/250,000 messages (and let’s assume that it is measured against legitimate messages, not legitimate plus spam).  If a test only uses 2500 messages, then the probability of measuring even 1 false positive is only about 1%.  To put it another way, there is a roughly 99% chance that your test will not catch any false positives, but that’s because the test is too small.  In order to measure very small FP rates, you need very large corpuses.
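Here is the same kind of calculation for the 1/250,000 claim (assumed numbers, purely illustrative): the chance that a test of a given size observes no false positives at all when the true rate really is 1/250,000.

```python
claimed_rate = 1 / 250_000   # assumed true FP rate

for corpus_size in (2_500, 250_000, 750_000):
    p_zero = (1 - claimed_rate) ** corpus_size   # chance the test sees no FPs at all
    print(f"{corpus_size:>7} messages: P(no FPs observed) = {p_zero:.1%}")
```

A 2500-message test will almost certainly report zero false positives no matter how good or bad the filter actually is; only at corpus sizes in the hundreds of thousands does the test have any real power to tell filters apart.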

And this matters because many vendors advertise an FP rate of 1/250,000 or better.  How could you accurately measure this?  If you do it with proper statistical sampling, you would need a corpus of at least 750,000 (3/4 of a million) legitimate messages.  How easy is it to generate a corpus of legitimate mail containing nearly a million messages?  If you are a testing organization, there are barriers to this:

  1. You must get sign-off from the people whose mail you are sampling.  You should not just randomly sample people’s mail and add it to a corpus, because when you verify the results afterwards, you are reading other people’s mail.  How do privacy statutes apply in this case?

  2. Even if you assembled a corpus once, you need to keep updating it.  Spam is always changing, and therefore so are spam rules and spam engines.  If a vendor measured a good FP rate against a corpus generated 6 months ago, its engine may have evolved since then such that it has a higher FP rate on newer types of mail today.  Although non-spam stays relatively consistent and doesn’t change much over time, it still changes.  This means that you need to acquire a big corpus on a regular basis.

I bring this up because the challenges of measuring antispam effectiveness are non-trivial.  Testers do a good job, but whenever you see a test published, you need to ask yourself how representative of reality it actually is.