Why change the FP metrics?

11/23/2007

In the comments on my other post, on the other side of accurate metrics, a fellow blogger writes the following:

In my experience every vendor who quotes a FP figure bases it on the total number of inbound messages (including those that get 5xx-rejected).

On the other hand, it is arguably the fairest way to measure FPs, as it reflects the total workload of the spam filter. All those messages have to go through the filter, so it makes sense to reflect them in the calculations.

For the past two weeks, I have been arguing internally here at Microsoft that measuring false positives as a proportion of total inbound traffic does not accurately represent the user experience, and that we should therefore avoid using it. Consider two filters:

Using Spam Filter A - a user receives 100 legitimate messages and 1000 spam messages. The spam filter correctly delivers 95 of the legitimate messages and marks the other 5 as spam. Using the traditional way of measuring false positives, the FP rate is 5 / (100 + 1000) = 5 / 1100 = 0.45%

Using Spam Filter B - a user receives 100 legitimate messages and 3000 spam messages. The spam filter correctly delivers 93 of the legitimate messages and marks the other 7 as spam. Using the traditional way, the FP rate is 7 / (100 + 3000) = 7 / 3100 = 0.23%

Using the traditional FP metric, Spam Filter A's FP rate is roughly double Spam Filter B's, so Filter B looks better. However, this ignores the effect of false positives on the user experience. Measured as a proportion of legitimate mail, Spam Filter A's FP rate is 5 / 100 = 5%, while Spam Filter B's is 7 / 100 = 7%. Looking at it this way, Spam Filter A is superior.
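The arithmetic for the two filters can be checked with a minimal sketch like the one below. The function names `fp_rate_total` and `fp_rate_legit` are made up for illustration and are not part of any product; the message counts are the ones from the example above.

```python
def fp_rate_total(false_positives, legit, spam):
    """Traditional metric: false positives as a share of ALL inbound mail."""
    return false_positives / (legit + spam)


def fp_rate_legit(false_positives, legit):
    """Proposed metric: false positives as a share of legitimate mail only."""
    return false_positives / legit


# Message counts taken from the Spam Filter A / Spam Filter B example.
filters = {
    "A": {"false_positives": 5, "legit": 100, "spam": 1000},
    "B": {"false_positives": 7, "legit": 100, "spam": 3000},
}

for name, f in filters.items():
    total = fp_rate_total(f["false_positives"], f["legit"], f["spam"])
    legit = fp_rate_legit(f["false_positives"], f["legit"])
    print(f"Filter {name}: {total:.2%} of total mail, {legit:.2%} of legitimate mail")
```

The two metrics rank the filters in opposite order: B wins on the total-mail figure, A wins on the legitimate-mail figure.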

I propose that measuring false positives as a proportion of a user's legitimate mail stream is the proper metric for the following reasons:

The increase in spam this year is making the traditional method irrelevant. Doubling the volume of spam without improving accuracy on non-spam does not make for a better filter, yet it halves the traditional FP rate. The growth in spam merely dwarfs the amount of legitimate mail, making it a smaller piece of the puzzle. The metric is skewed.
A user's legitimate mail stream stays more or less constant; at the most, it increases slowly over time. They are talking to the same people, subscribing to the same newsletters and reading the same jokes. Thus, when they look at FPs, they experience them relative to how many messages they wanted to see, not relative to the messages they wanted to see plus the ones they didn't.
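To illustrate the first reason, here is a small sketch that holds the filter's behavior on legitimate mail fixed (5 false positives out of 100 legitimate messages, as with Spam Filter A) and only grows the spam volume. The spam volumes beyond 1000 are hypothetical values chosen for illustration, not measured data.

```python
LEGIT = 100          # legitimate messages the user receives (from the example)
FALSE_POSITIVES = 5  # legitimate messages wrongly marked as spam (from the example)

# Hypothetical spam volumes: each step doubles the previous one.
spam_volumes = [1000, 2000, 4000]

# Traditional metric: denominator includes spam, so it shrinks as spam grows.
total_based = [FALSE_POSITIVES / (LEGIT + spam) for spam in spam_volumes]

# Proposed metric: denominator is legitimate mail only, so spam volume is irrelevant.
legit_based = [FALSE_POSITIVES / LEGIT for _ in spam_volumes]

for spam, t, l in zip(spam_volumes, total_based, legit_based):
    print(f"spam={spam}: {t:.2%} of total mail, {l:.2%} of legitimate mail")
```

The filter loses the same 5 messages in every row, yet the total-mail figure keeps falling as spam grows, while the legitimate-mail figure does not move.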

Finally, with regard to the commenter's last point:

Personally, I can see both sides of the argument, but the pragmatic fact is that "the market" measures FPs as a proportion of *total* email, so arguments that they should do otherwise are a bit academic.

This is a valid point. This is certainly the way the market (i.e., the industry) advertises its FP rates. I would counter by saying that the market has been ambiguous on the point; if not in the past, then certainly now. It's time for a redefinition of success.
