The Merits and Hazards of Two-Point Filtering

In my previous post, I defined Two-Point Filtering as the process of using an end-user feedback loop to train a spam filter without verifying the users’ classifications.  I borrowed the Web 2.0 term to refer to the greater community of people contributing to the filter, as opposed to a narrower group of people (i.e., spam analysts) training it.  Two-Point Filtering has a couple of big advantages.

  1. The first, and major, advantage is that rather than having a small group of people training the filter, you have a large community of people submitting to the feedback loop.  A small set of people (spam analysts) has only so many hours in a day to spend training a filter, whereas a large community gives the filter the leverage of many different sets of eyes coming across a much wider variety of spam.  A lot of people can train a filter faster than a few people can.
  2. Mistakes can correct themselves.  While some people may misclassify spam or ham, because of the user-feedback loop eventually a larger range of people will correctly classify the mail and the mistakes will fix themselves (or rather, users will correct each other).
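
To make the second point concrete, here is a minimal sketch of how conflicting submissions for one message could settle on a single label by simple majority vote.  The function name `consensus_label`, the `min_votes` cutoff and the sample votes are all made up for illustration; this is not a description of how any particular filter aggregates feedback.

```python
from collections import Counter

def consensus_label(votes, min_votes=3):
    """Return the majority label for one message, or None if undecided.

    votes     -- list of "spam"/"ham" labels submitted by different users
    min_votes -- hypothetical cutoff: below this we do not trust the result
    """
    if len(votes) < min_votes:
        return None  # too few submissions to outvote a mistake
    ranked = Counter(votes).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: the community has not settled yet
    return ranked[0][0]

# Illustrative feedback only: one user mislabels msg-1, the rest outvote them.
feedback = {
    "msg-1": ["spam", "spam", "ham", "spam"],
    "msg-2": ["ham"],
    "msg-3": ["spam", "ham", "spam", "ham"],
}
for msg_id, votes in feedback.items():
    print(msg_id, consensus_label(votes))
```

Note that the message with one vote and the message with a tied vote stay undecided.  The correction only happens once enough people have weighed in, and that waiting is where the delay comes from.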

These are two tremendous advantages of Two-Point Filtering.  Unfortunately, if it were that simple, one of the founders of Wikipedia never would have created Citizendium.  The difference between a filter that primarily relies on Two-Point Filtering and one that relies on something else (or supplements it with something else) is the difference between a good spam filter and a great spam filter.

In this business, speed and accuracy count, and they count a lot.  While Two-Point Filtering has the advantage of speed when it comes to initially classifying mail as spam, it suffers from two weaknesses: mail gets misclassified, and it takes time to respond to that misclassification.  Most users want their spam to stay out of their inbox, but they find it even more important to keep valid mail out of their spam quarantine.  The one thing I’ve noticed about the general end-user community when it comes to classifying mail as spam and not-spam is that they are not particularly accurate.  Legitimate mail gets classified as spam (less often) and spam gets classified as non-spam (very often).  This misclassification necessarily leads to errors and delays: errors because the spam/non-spam classifier trained on the wrong data, and delays in responding to the errors because the classifier has to retrain.  In the meantime, we get false positives (legitimate mail marked as spam) and false negatives (spam not detected as spam and delivered to the user’s inbox).

It would be better if this community of classifiers paid much more attention to accuracy than to speed of classification.  I’ve been doing this for a long time and I can classify mail as spam and non-spam very accurately.  I trust myself to do it properly more than I trust a person selected at random.  Thus, it would make more sense to select fewer people who classify the mail accurately than to simply throw all the mail into a self-training system without verifying that classification.  Accuracy counts!  I can’t stress that enough.  In my opinion, it’s better to make sure that your system is training on the correct data than to simply train it on anything without verifying the accuracy of that data.  Mistakes can be very expensive to correct and customers will notice.  There are tradeoffs between accuracy and speed, but never forget that the former is incredibly important.
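
One way to verify the data before training on it can be sketched as follows, assuming a small set of messages that spam analysts have already labeled correctly.  Each submitter is graded against that set, and only labels from submitters above an accuracy threshold are kept.  The function names, the 0.9 threshold and the sample data are all hypothetical and chosen purely for illustration.

```python
def user_accuracy(user_labels, verified):
    """Fraction of a user's labels that agree with analyst-verified labels.

    Returns None when the user labeled none of the verified messages,
    because there is nothing to grade them on.
    """
    graded = [msg for msg in user_labels if msg in verified]
    if not graded:
        return None
    correct = sum(1 for msg in graded if user_labels[msg] == verified[msg])
    return correct / len(graded)

def trusted_users(feedback, verified, min_accuracy=0.9):
    """Users whose graded accuracy meets the (assumed) threshold."""
    trusted = set()
    for user, labels in feedback.items():
        accuracy = user_accuracy(labels, verified)
        if accuracy is not None and accuracy >= min_accuracy:
            trusted.add(user)
    return trusted

# Illustrative data only: analysts verified four messages.
verified = {"m1": "spam", "m2": "ham", "m3": "spam", "m4": "ham"}
feedback = {
    "user_a": {"m1": "spam", "m2": "ham", "m3": "spam", "m4": "ham", "m5": "spam"},
    "user_b": {"m1": "spam", "m2": "spam", "m3": "ham", "m4": "ham", "m6": "ham"},
}

trusted = trusted_users(feedback, verified)
# Keep only labels from trusted users as training data.
training = {
    msg: label
    for user in trusted
    for msg, label in feedback[user].items()
}
print(trusted, training)
```

The cost is the record-keeping: someone has to maintain the verified set, and the grades have to be recomputed as users submit more feedback.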

Some may simply reply “Oh, so what, a self-training system is fast and classifies mail quickly.  We can live with the initial inaccuracy because eventually the errors will correct themselves.  It’s good enough.”  I would agree that such a system may be good enough, but it would not be a great system. Great systems respond quickly and they are accurate; they are accurate from the very beginning.

Comments

  1. Nikki says:

    I like the thinking behind a two-point system, especially that it uses a large pool of users to increase the accuracy of spam/non-spam classifications. The speed problem you mentioned is a result of the dependency on user actions. With spam today being blasted out by huge botnets, spam outbreaks are concentrated into very short time periods. So as fast as humans may be, spam botnets will always be faster. This is the reason an automated system that listens to massive amounts of email traffic can detect spam with high accuracy and low false positives, without putting the end user to work.

    I mean, isn’t the whole point of computers that they can do work better and faster than humans? I don’t see humans beating botnets with millions of zombies, and endless CPU.

  2. tzink says:

    Two-Point filtering has as its advantage a large pool of users.  I think that selective sampling of the user-feedback and then grading the feedback would improve its performance.  For example, if User A consistently classifies their messages properly while User B does it right 50% of the time, then ignore User B’s classifications.

    This requires some up-front work and on-going maintenance (with regards to record-keeping) but in my view the tradeoff is worth it.

  3. Nikki says:

    A lot of the image-based spam sent today is randomly generated, meaning no two spam messages in a campaign/outbreak are exactly alike. So if I’m in your network and I classify one of these random image-spam messages as spam, how does that protect anyone else?

    Even if I spot unsophisticated mass-spam at 1:00pm, by the time I classify it as spam everyone else in the network will already have it in their inboxes. Remember, the spam messages are sent out by the millions from massive botnets in a very short period of time.

    Speed is a very serious issue. Even if you can try to make the human network fast, can it ever possibly be fast enough?

  4. Your definition of "two point" doesn’t actually mention that you want to use multiple users for feedback.

    I’m not convinced that is a good idea.  As you say, you know what spam is and others don’t.  

    You can use only your own judgements to train the filter.  A decent filter will learn very quickly and then have an error rate of less than 1% — you only have to train the filter when it makes a mistake.

    I have seen no evidence that training a filter on somebody else’s email will improve its performance on your email.  At the very beginning, sure.  Train on a handful of somebody else’s messages.  But beyond that you’re better off solo.


    P.S. Hotmail does something like what you say:  it polls lots of users about what is spam and what is not.  GMail (which works better, I think) learns when you correct it.
