The Merits and Hazards of Two-Point Filtering

In my previous post, I defined Two-Point Filtering as the process of using an end-user feedback loop to train a spam filter without verifying the user classifications. I borrowed the Web 2.0 term to refer to the greater community of people contributing to the filter, as opposed to a narrower group of people (i.e., spam analysts) training it. There are two big advantages to doing two-point filtering.

  1. The first, and biggest, advantage is scale: instead of a small group of people training the filter, you have a large community submitting to the feedback loop. There are only so many hours in a day that a small set of people (spam analysts) can spend training a filter; a large community gives the filter the leverage of many different sets of eyes across a much wider variety of spam. A lot of people can train a filter faster than a few people can.
  2. Mistakes can correct themselves. Some people will misclassify spam or ham, but because of the user-feedback loop a wider range of people will eventually classify the same mail correctly and the mistakes will wash out (or rather, users will correct each other).

These are two tremendous advantages of Two-Point Filtering. Unfortunately, if it were that simple, nobody would have felt the need to build expert-curated alternatives to Wikipedia such as Scholarpedia. The difference between a filter that relies primarily on Two-Point Filtering and one that relies on something else (or supplements it with something else) is the difference between a good spam filter and a great spam filter.
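To make that loop concrete, here is a rough sketch of what a purely two-point filter looks like. The class, the tokenization, and the scoring are simplifications of my own, not any particular product's code; the point is only that every user click on "report spam" or "not spam" becomes a training example immediately, with nothing standing between the click and the model.

```python
# A minimal sketch of a pure two-point feedback loop: every end-user
# report trains the filter directly, with no verification in between.
# The class, tokenization, and scoring are simplifications of my own.
from collections import Counter

class TwoPointFilter:
    def __init__(self):
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.spam_count = 0
        self.ham_count = 0

    def user_report(self, message: str, is_spam: bool) -> None:
        """An end user clicked 'report spam' or 'not spam'.
        The click is trusted as-is and becomes training data."""
        tokens = message.lower().split()
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_count += 1
        else:
            self.ham_tokens.update(tokens)
            self.ham_count += 1

    def spam_score(self, message: str) -> float:
        """Crude per-token spamminess, averaged over the message.
        A real filter would use naive Bayes or better; the point here
        is only where the training data comes from."""
        tokens = message.lower().split()
        score = 0.0
        for token in tokens:
            spam_freq = (self.spam_tokens[token] + 1) / (self.spam_count + 2)
            ham_freq = (self.ham_tokens[token] + 1) / (self.ham_count + 2)
            score += spam_freq / (spam_freq + ham_freq)
        return score / max(len(tokens), 1)
```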

In this business, speed and accuracy count, and they count a lot. Two-Point Filtering has a speed advantage when it comes to initially classifying mail as spam, but it suffers from misclassification and from the time it takes to respond to that misclassification. Most users want spam kept out of their inbox, but they consider it even more important to keep valid mail out of their spam quarantine. The one thing I've noticed about the general end-user community when it comes to classifying mail as spam and not-spam is that it is not particularly accurate. Legitimate mail gets reported as spam (less often) and spam gets reported as non-spam (very often). That misclassification necessarily leads to errors and delays: errors because the filter trained on the wrong data, and delays in responding to those errors because the filter has to retrain. In the meantime, we get false positives (legitimate mail marked as spam) and false negatives (spam not detected and delivered to the user's inbox).
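Here is the catch in back-of-the-envelope form. The tiny corpus and the 30% error rate below are made up purely for illustration, and the code reuses the TwoPointFilter sketch from above; all it shows is that wrong clicks become wrong training data the moment they happen.

```python
# Hypothetical illustration, reusing the TwoPointFilter sketch above:
# the same filter trained on clean labels versus labels where users
# flip a share of their classifications.
import random

def train(spam_filter, corpus, error_rate):
    for message, is_spam in corpus:
        # With probability error_rate the user reports the wrong label,
        # and that wrong label goes straight into the model.
        label = is_spam if random.random() >= error_rate else not is_spam
        spam_filter.user_report(message, label)

corpus = [("cheap meds act now", True),
          ("free prize claim winner now", True),
          ("meeting moved to three pm", False),
          ("quarterly report attached", False)] * 50

clean, noisy = TwoPointFilter(), TwoPointFilter()
train(clean, corpus, error_rate=0.0)
train(noisy, corpus, error_rate=0.3)   # 30% of reports are wrong

probe = "claim your free prize now"
print(clean.spam_score(probe), noisy.spam_score(probe))
# The noisy filter's score drifts toward the middle: it has literally
# been taught that some of this spam is legitimate mail, and it keeps
# making that mistake until enough later reports retrain it.
```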

It would be better if this community of classifiers paid much more attention to accuracy than to speed of classification. I've been doing this for a long time and I can classify mail as spam and non-spam very accurately; I trust myself to do it properly more than I trust a person selected at random. It therefore makes more sense to have a smaller number of people classify the mail accurately than to throw everything into a self-training system without verifying the classifications. Accuracy counts! I can't stress that enough. In my opinion, it's better to make sure your system is training on correct data than to train it on anything and everything without verifying that data. Mistakes can be very expensive to correct and customers will notice. There are tradeoffs between accuracy and speed, but never forget that the former is incredibly important.
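In practice, that means putting a verification gate in front of the training step. The sketch below wraps the earlier filter; the class name, the agreement threshold, and the "several independent users must agree" rule are my own inventions, a cheap stand-in for a real analyst review, not anyone's production design.

```python
# Hypothetical sketch of a verification gate: user reports are queued
# and only become training data once a trusted analyst confirms them,
# or (as a cheaper stand-in) once several independent users agree.
from collections import defaultdict

class VerifiedFeedbackQueue:
    def __init__(self, spam_filter, agreement_threshold=3):
        self.filter = spam_filter          # e.g. a TwoPointFilter
        self.threshold = agreement_threshold
        self.pending = defaultdict(lambda: {"spam": 0, "ham": 0})

    def user_report(self, message: str, is_spam: bool) -> None:
        """Collect the report, but do not train on it yet."""
        votes = self.pending[message]
        votes["spam" if is_spam else "ham"] += 1
        # Promote only on unanimous agreement; conflicting reports stay
        # queued until an analyst resolves them.
        if votes["spam"] >= self.threshold and votes["ham"] == 0:
            self._promote(message, True)
        elif votes["ham"] >= self.threshold and votes["spam"] == 0:
            self._promote(message, False)

    def analyst_verdict(self, message: str, is_spam: bool) -> None:
        """A trusted analyst's call skips the voting and trains directly."""
        self._promote(message, is_spam)

    def _promote(self, message: str, is_spam: bool) -> None:
        self.filter.user_report(message, is_spam)
        self.pending.pop(message, None)
```

You give up some speed (reports sit in the queue until they're confirmed), but everything that reaches the model is data you can actually trust.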

Some may simply reply, "Oh, so what, a self-training system classifies mail quickly. We can live with the initial inaccuracy because eventually the errors will correct themselves. It's good enough." I would agree that such a system may be good enough, but it would not be a great system. Great systems respond quickly and they are accurate; they are accurate from the very beginning.