Theorem 2 – Spam filters are not 100% effective at avoiding false positives because legitimate email messages can contain spammy content

In the world of spam filtering, a false positive is a legitimate message that is flagged as spam. Sometimes these are newsletters, sometimes personal messages, and other times business-related messages.

False positives occur because spam filters have to figure out, usually by means of heuristics, whether a message is spam or legitimate. This process would be a lot simpler if end users always wrote email in a structured fashion.

The problem is that users writing legitimate messages routinely insert content that is commonly found in spam. The prime example is people who swear routinely. Curse words are very often used to refer to sexual acts, a staple topic of spam, and therefore a spam filter, quite rightly, targets words like these. There are other legitimate words in the English language that turn up in the same context. However, when a person uses informal language, particularly swear words, in a person-to-person message, the spam filter will often flag the message as spam even though it is not unsolicited commercial email. While a spam filter is good at identifying patterns, it is not nearly as good at interpreting the context of the patterns it finds.

A second example is the classic 419 scam. A 419 is a message that claims to come from a person in Africa whose relative died in an accident or a military coup. Right before their death, the relative very conveniently managed to stash a load of money somewhere, and the sender needs your help to recover it. In exchange for sending them a large fee to get the process going, you get to keep a percentage of the money. Now, these messages are frequently WRITTEN IN ALL UPPER-CASE. Thus, when we see them, we instantly recognize them as 419 scams because no experienced user knowingly writes email in all upper-case letters. It is the internet equivalent of shouting. The problem arises when inexperienced users write subject lines, or even the body text, in upper case. Stock and financial reports do this as well when they are written in plain text and want section headings to stand out. While we might think that all-upper-case email is very spammy, the fact is that legitimate messages can and do have this characteristic.
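To make the upper-case problem concrete, here is a minimal Python sketch of an upper-case heuristic. The function names, the 0.8 threshold, and the sample messages are all made up for illustration; this is not how any particular filter implements the check. It shows that a scam and a plain-text report with upper-case headings trip the same rule.

```python
def upper_case_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters that are upper case."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for c in letters if c.isupper()) / len(letters)


def looks_shouted(text: str, threshold: float = 0.8) -> bool:
    """Hypothetical rule: flag text that is mostly upper case.

    The 0.8 threshold is an assumption chosen for this example.
    """
    return upper_case_ratio(text) >= threshold


# Invented sample messages.
scam = "DEAR FRIEND, I NEED YOUR HELP TO RECOVER MY LATE FATHER'S FUNDS."
report = "Q3 SUMMARY\nREVENUE\nUP FROM LAST QUARTER"
personal = "Hi Sam, are we still on for lunch Friday?"

for name, msg in [("scam", scam), ("report", report), ("personal", personal)]:
    print(name, looks_shouted(msg))
```

The rule catches the scam, but it also catches the report, which is exactly the false positive described above. On its own, the pattern cannot tell the two apart.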

Another example of content that is difficult to filter is the newsletter. Spammy newsletters are usually advertising and often make extensive use of HTML tags to make the email look nicer (I assume that nice-looking spam generates more leads and revenue). However, a legitimate newsletter can use these exact same HTML constructs, so in order to avoid flagging the legitimate uses, a spam filter would have to intelligently interpret the contents of those newsletters. As of this writing, no spam filter is capable of doing this. The problem once again is that spam and legitimate messages can contain the same content, which makes it more difficult for a filter to target the spam and avoid the legitimate mail.

The trick, then, is to use a combination technique wherein a spam filter requires several patterns that are normally found in spam, but not in legitimate messages, to appear together. While a legitimate message can contain some spammy content, it is unlikely to contain a lot of it. If a spam filter comes across a message with multiple patterns that are routinely found in spam, it can be fairly certain that the message is spam.
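One way to picture the combination technique is a weighted score that only crosses the spam threshold when several patterns fire together. The sketch below is an illustration under stated assumptions: the rule names, regular expressions, weights, and the 5.0 threshold are all invented for this example and are not taken from any real filter.

```python
import re


def mostly_upper(text: str) -> bool:
    """True when more than 70% of the letters are upper case (assumed cut-off)."""
    letters = [c for c in text if c.isalpha()]
    return bool(letters) and sum(c.isupper() for c in letters) / len(letters) > 0.7


# Hypothetical rules as (name, predicate, weight); all weights are invented.
RULES = [
    ("mostly_upper_case", mostly_upper, 2.0),
    ("asks_for_fee",
     lambda t: re.search(r"\bfee\b", t, re.I) is not None, 2.0),
    ("deceased_relative",
     lambda t: re.search(r"\b(late|deceased)\s+(father|husband|uncle)\b", t, re.I) is not None, 3.0),
    ("percentage_offer",
     lambda t: re.search(r"\d+\s*%", t) is not None, 1.5),
]

SPAM_THRESHOLD = 5.0  # assumed; no single rule above reaches it alone


def spam_score(text: str):
    """Return the total weight and the names of the rules that matched."""
    hits = [(name, weight) for name, pred, weight in RULES if pred(text)]
    return sum(w for _, w in hits), [n for n, _ in hits]


def is_spam(text: str) -> bool:
    total, _ = spam_score(text)
    return total >= SPAM_THRESHOLD


# Invented sample messages.
report = "QUARTERLY RESULTS\nREVENUE ROSE IN ALL REGIONS"
scam = "MY LATE FATHER LEFT $20M. SEND A FEE AND KEEP 30% OF THE FUNDS."

print(spam_score(report), is_spam(report))
print(spam_score(scam), is_spam(scam))
```

The upper-case report matches one rule and stays well under the threshold, while the scam matches all four and lands well over it. Where to set the weights and the threshold is the hard part, and the numbers here are placeholders.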

These types of messages are in the bottom line of the Spam Curve, the so-called overlap area. They contain patterns that are not especially unique to either spam or non-spam. The corollary is that these messages will never fall on the far right-hand side of the curve because by definition (according to Theorem 1), they are not extremely clean. They are partially clean, and the key is to weigh the dirty against the clean in deciding whether to allow them past the filter.
