I just finished a series on spam metrics and submitted it to CEAS, hoping to get it accepted so that I could speak at the conference this year. I put it together in two days. Well, as it turns out, it was rejected.
The reviews were anonymous, but since they rejected it, I believe I have the right to respond to the comments. Here is what the first reviewer said:
Does not reference other work on which FP rate to use. See for example Joel Snyder’s discussion on which FP rate to use: http://www.networkworld.com/reviews/2004/122004spamside3.html
In this article, he proposes the use of the PPV.
Does not discuss the ROC approach used by researchers. See all sorts of papers at CEAS and also see the TREC methodology documents. In the introduction, the paper claims that vendors can make various claims about the effectiveness of filters.
In the software industry, testing the quality of software is difficult. I would say that evaluating the quality of spam filters is done better than most. Quite a few reputable organizations / magazines have published evaluations.
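For context, the ROC approach the reviewer is referring to plots a filter's catch rate against its false positive rate as you sweep the spam-score threshold. Here's a minimal sketch of the idea; the scores and labels are made-up illustration data, not output from any real filter:

```python
# Minimal sketch of the ROC approach: sweep the decision threshold on a
# filter's spam scores and record the catch rate / FP rate trade-off.
# All data below is invented for illustration.

def roc_points(scores, labels):
    """labels: True = spam. Returns (fp_rate, catch_rate) per threshold."""
    spam_total = sum(labels)
    legit_total = len(labels) - spam_total
    points = []
    for threshold in sorted(set(scores), reverse=True):
        flagged = [lab for s, lab in zip(scores, labels) if s >= threshold]
        tp = sum(flagged)            # spam correctly flagged
        fp = len(flagged) - tp       # legitimate mail incorrectly flagged
        points.append((fp / legit_total, tp / spam_total))
    return points

scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [True, True, False, True, False, False, False]
for fpr, catch in roc_points(scores, labels):
    print(f"FP rate {fpr:.2f} -> catch rate {catch:.2f}")
```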
On the first point, I'll credit them that: I don't reference other work on which FP rate to use. The FP rate implied by the PPV in the linked article is my first metric for FP rate, messages incorrectly flagged as spam / total messages flagged as spam. Yet the reviewer, and even the writer of the article, plainly assume that this is the metric everyone uses. It isn't; both Postini and MessageLabs use my second metric for FP rate, messages incorrectly flagged as spam / total legitimate messages, as part of their SLAs (and if they don't, then the language is ambiguous, probably intentionally).
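To make the difference concrete, here's a quick sketch of the two definitions computed from the same filter output. The mail counts are made-up illustration numbers, not vendor data:

```python
# Two competing definitions of "false positive rate" from one filter run.
# All counts are invented; note that spam typically outnumbers legitimate
# mail, which is exactly why the two denominators diverge.
true_positives = 90_000   # spam correctly flagged as spam
false_positives = 50      # legitimate mail incorrectly flagged as spam
true_negatives = 9_950    # legitimate mail correctly delivered

# Definition 1 (the complement of PPV): incorrectly flagged / total flagged
fp_rate_flagged = false_positives / (false_positives + true_positives)

# Definition 2 (the SLA-style rate): incorrectly flagged / total legitimate
fp_rate_legit = false_positives / (false_positives + true_negatives)

print(f"FP rate, share of flagged mail:    {fp_rate_flagged:.4%}")
print(f"FP rate, share of legitimate mail: {fp_rate_legit:.4%}")
```

Same filter, same mailbox, and the two "FP rates" differ by nearly an order of magnitude. That's the ambiguity I'm talking about.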
I don't have a problem with either definition of FP rate. My point is that the researcher and the reviewer might think it's obvious which FP rate ought to be used, but in the industry we don't agree. That's why I was proposing a common set of metrics for the industry to adopt, not trying to define something that no one has ever thought of.
Secondly, the reviewer says that evaluating the quality of spam filters is done better (i.e., more easily) than most, in contrast with testing software in general, which is difficult. If that were actually true, you wouldn't need so many different metrics, each with its own caveats, to tell you which filter is the best.
The linked article has PPV, NPV, and a pile of conditions: this filter is better than that one, but you also have to remember the FP rate; then again, the catch rate on this one is good enough that the FP rate balances out. With so many variables and so many competing definitions, you can't make a determination without weighing one thing against another.
This is confusing, and that's the point... again. I was trying to illustrate a move toward a common set of metrics that everyone could agree on. And at the end of the day, you really need one metric to tell you which is the best. If your boss walks into your office and says "Quick, tell me, which filter is the best?" what are you going to say? "Well, if you look at this factor, filter A is better; if you look at that factor, filter B comes out ahead..." and so forth. My common set of metrics was designed to wrap everything into one number and say "This particular filter is the best one because this super-metric says so." No hedging, just a straight answer.
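As a sketch of what I mean, here's a hypothetical super-metric. The 50x penalty on false positives and the mail counts are assumptions for illustration only, not the actual weighting from my series:

```python
# Hypothetical "super-metric": fold catch rate and FP rate into one
# number so two filters can be ranked with a straight answer. The
# weighting (a false positive hurts 50x more than a missed spam) is an
# assumption chosen for illustration.

def super_metric(caught_spam, missed_spam, false_positives, legit_total,
                 fp_penalty=50):
    catch_rate = caught_spam / (caught_spam + missed_spam)
    fp_rate = false_positives / legit_total
    return catch_rate - fp_penalty * fp_rate

# Filter A: better catch rate; Filter B: fewer false positives.
score_a = super_metric(caught_spam=9_900, missed_spam=100,
                       false_positives=40, legit_total=10_000)
score_b = super_metric(caught_spam=9_700, missed_spam=300,
                       false_positives=5, legit_total=10_000)
print(f"Filter A: {score_a:.3f}  Filter B: {score_b:.3f}")
print("Best filter:", "A" if score_a > score_b else "B")
```

Filter A catches more spam, Filter B flags less legitimate mail, and the single score picks a winner instead of hedging. You can argue about the weights, but at least everyone would be arguing about the same number.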
In the stock market, if someone asked me which of the stocks I invested in was the best, I could say "Well, this one had accelerating sales, this one had a good price-volume relationship, and this one had the highest dividend rate, so it depends on what you look at." I could say that, but the correct answer is "the one that made me the most money." Simple, no hedging.
And when it comes to spam filters, that was my point.