Theorem 3 – spam filters are not 100% effective at catching spam because some spam can contain legitimate content

Theorem 3 is a corollary to Theorem 2; the two basically go together.

One of the things that spam filters must do is catch as much spam as possible.  This would be fairly easy if it weren’t for the fact that a great deal of spam contains content that is routinely found in legitimate messages, or is written to resemble a legitimate message.  Social-engineering spam that advertises products is notorious for this.  Consider the following example:

Hi, how’s it going with your weight loss program?  I wasn’t doing too well with it myself, but I thought I would give it a whirl.  You should try going to the following site.

http://www.some spam

Talk to you later.

On the surface, this appears to be very legitimate.  Each of those sentences, in and of itself, is perfectly reasonable and might be used in a casual conversation between people.  A spam filter would need to interpret the context of the message in order to determine whether or not it is spam.  Even a person, if not familiar with this type of spam, might be fooled into clicking on the above link (tsk, tsk).
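To make the point concrete, here is a minimal sketch of a naive phrase-matching filter.  The phrase list and the function name are made up for illustration and are not taken from any real product; the idea is only to show that a message like the one above contains nothing for such a filter to latch onto.

```python
# Hypothetical blocklist of "spammy" phrases; not taken from any real filter.
SPAM_PHRASES = {"viagra", "free money", "act now", "winner"}

def keyword_score(message: str) -> int:
    """Count how many blocklisted phrases appear in the message."""
    text = message.lower()
    return sum(1 for phrase in SPAM_PHRASES if phrase in text)

# The example message from above, minus the link.
weight_loss_spam = (
    "Hi, how's it going with your weight loss program? "
    "I wasn't doing too well with it myself, but I thought I would give it a whirl. "
    "You should try going to the following site. Talk to you later."
)
obvious_spam = "WINNER! Claim your free money, act now!"

print(keyword_score(weight_loss_spam))  # 0
print(keyword_score(obvious_spam))      # 3
```

The conversational message scores zero because every sentence in it is ordinary language.  Adding words like "weight loss" to the list would catch it, but would also catch real conversations between friends.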

Another point of contention is examining the content of email and classifying all mail that contains poor grammar as spam.  Some time ago, one of my managers jokingly remarked that we ought to classify all email that uses butchered grammar as spam, but then he realized that some of his own messages would be classified as spam as well.  The comment was made in jest, but he actually hit the nail on the head: people, in everyday conversations, can and do use exceedingly poor grammar.  Emoticons and internet-speak (abbreviations like LOL and LMAO) are only the tip of the iceberg.  People will often use poor grammar when addressing each other, “forget” to include punctuation, call each other names, and so forth.  Email that looks butchered to some people is perfectly understandable to others.  Thus, poor message composition alone is not good enough grounds for classifying a message as spam, because while spam can contain that content, so can regular mail.
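A rough sketch of what a "bad grammar" rule might look like shows the problem.  The slang list, the scoring scheme, and the sample messages are all assumptions made up for this example; a real filter would be far more elaborate, but it would run into the same overlap.

```python
import re

# Hypothetical list of internet-speak tokens, for illustration only.
SLANG = {"lol", "lmao", "u", "ur", "gonna"}

def sloppiness_score(message: str) -> int:
    """Crude score: one point per slang word, plus one if there is no end punctuation."""
    words = re.findall(r"[a-z']+", message.lower())
    score = sum(1 for word in words if word in SLANG)
    if not re.search(r"[.!?]", message):
        score += 1
    return score

friend_mail = "lol u gonna be at the game tonight"
spam_mail = "u have won a prize lol claim ur reward now"
formal_mail = "Please find the quarterly report attached."

print(sloppiness_score(friend_mail))  # 4
print(sloppiness_score(spam_mail))    # 4
print(sloppiness_score(formal_mail))  # 0
```

The note from a friend scores exactly as badly as the spam, so any cutoff that blocks the spam on this basis blocks the friend as well.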

In fact, even spam that makes extensive use of JavaScript or HTML is borrowing from newsletters that do the same thing.  Spam that contains a single image (stock spam is notorious for this) is borrowing from the legitimate practice of people sending pictures of themselves or their kids to others.  It is difficult to target this type of mail because there is nothing inherently spammy about its structure.  In order to make a determination, the spam filter would need to either decode the image (image processing is processor intensive) or take a gamble that this type of message contains only spam.  Unfortunately for the filter, a significant number of people use these quite legitimate constructs when sending each other mail.

Theorems 2 and 3 are two sides of the same coin.  A spam filter cannot unilaterally be overly aggressive because doing so would result in flagging legitimate mail that uses the same patterns.
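The trade-off can be illustrated with a toy example.  The scores below are invented, and the 0-to-1 "spamminess" scale is just an assumption standing in for whatever a given filter produces; the point is only what happens as the blocking threshold moves.

```python
from typing import Tuple

# Made-up (score, is_spam) pairs; the scores stand in for any filter's output.
SCORED_MAIL = [
    (0.95, True), (0.80, True), (0.55, True), (0.30, True),
    (0.70, False), (0.40, False), (0.20, False), (0.05, False),
]

def evaluate(threshold: float) -> Tuple[int, int]:
    """Return (missed_spam, flagged_legit) when mail scoring >= threshold is blocked."""
    missed = sum(1 for score, is_spam in SCORED_MAIL if is_spam and score < threshold)
    flagged = sum(1 for score, is_spam in SCORED_MAIL if not is_spam and score >= threshold)
    return missed, flagged

for threshold in (0.25, 0.50, 0.75):
    missed, flagged = evaluate(threshold)
    print(f"threshold={threshold:.2f} missed_spam={missed} flagged_legit={flagged}")

# threshold=0.25 missed_spam=0 flagged_legit=2
# threshold=0.50 missed_spam=1 flagged_legit=1
# threshold=0.75 missed_spam=2 flagged_legit=0
```

Because some legitimate mail scores higher than some spam, no threshold in this toy data gets both numbers to zero.  Lowering it catches more spam at the cost of legitimate mail, and raising it does the reverse.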
