Theorem 5 – The precision of an anti-spam pattern matching technique is inversely proportional to its risk

We have seen how spam and legitimate messages can share common characteristics, and that to increase a spam filter's effectiveness, it must be able to detect the boundaries of that overlap.

This last theorem is only marginally related to the first four.  It can be paraphrased as follows: the more generic the pattern matching, the riskier the technique.

An aggressive pattern-matching technique might target specific phrases in a message.  For example, the word “lust” can occur in spam, but it can also occur in legitimate messages.  However, suppose we extend the length of the phrase to “see models lust after huge other models and have it caught on tape.”  The longer phrase can obviously be filtered much more aggressively because it is unlikely to occur in a legitimate context.  However, note that the longer the phrase, the harder it is to strike spam pre-emptively, because it is relatively easy for a spammer to change what the phrase contains.  If an anti-spam filter targeted that specific phrase, but then a spammer changed the words to “see models lust after huge other models and get it caught on camera,” the filter is defeated.
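The fragility of exact-phrase matching can be sketched in a few lines.  This is a minimal illustration, not a real filter; the phrase list and messages are the examples from the text.

```python
# Hypothetical list of known spam phrases, matched verbatim.
SPAM_PHRASES = [
    "see models lust after huge other models and have it caught on tape",
]

def is_spam_exact(message: str) -> bool:
    """Flag a message only if it contains a known spam phrase word for word."""
    body = message.lower()
    return any(phrase in body for phrase in SPAM_PHRASES)

# The exact phrase is caught...
print(is_spam_exact(
    "See models lust after huge other models and have it caught on tape!"))  # True
# ...but swapping two words defeats the filter entirely.
print(is_spam_exact(
    "See models lust after huge other models and get it caught on camera!"))  # False
```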

Of course, a good filter would make the matching pattern more generic.  It might include several alternatives for the words the spammer is likely to change.  This, of course, increases the risk of causing a false positive.  In the example above the risk is small, but as the phrase gets shorter, the possibility of hitting upon a legitimate phrase goes up.
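One way to make the pattern more generic is a regular expression with alternations for the variable words.  The particular word lists here are illustrative assumptions; each alternation widens coverage, and each widening nudges the false-positive risk upward.

```python
import re

# Allow alternatives for the words a spammer is likely to vary
# (the verb and the recording medium). Illustrative, not exhaustive.
PATTERN = re.compile(
    r"see models lust after huge other models and "
    r"(?:have|get) it caught on (?:tape|camera|video)",
    re.IGNORECASE,
)

def is_spam_generic(message: str) -> bool:
    """Flag a message if any variant of the phrase appears."""
    return PATTERN.search(message) is not None

# The reworded spam is now caught as well.
print(is_spam_generic(
    "see models lust after huge other models and get it caught on camera"))  # True
```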

Blacklists (or blocklists) are an interesting study.  If a filter decides to reject all mail coming from a specific IP address or IP range, there is inherent risk in doing so: what happens if legitimate mail comes from that IP address one day?  In cases like these we use historical patterns of mail traffic coming from those IP ranges.  We work on the assumption that all the mail coming from that IP address is dirty; the volume of spam is so high that we are prepared to assume the risk because, in all probability, no legitimate mail will ever come from that address.  If it does, the IP address will be removed from the list.  Clearly, we assume some risk in blocklisting IP addresses.  It would be less risky to check every single day whether mail from those IP addresses was legitimate.  This, however, is far too much work and would waste too much time, so we don't do it.  The tradeoff is between time spent monitoring those blocklists and the assumption that all mail from those addresses is dirty.  In practice, IPs that are blocklisted very rarely go on to send legitimate mail.
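Mechanically, a blocklist check is simple; the risk lives in how the list is built, not how it is consulted.  A sketch using Python's standard `ipaddress` module, with documentation-reserved ranges standing in for real blocklisted ones:

```python
import ipaddress

# Hypothetical blocklist: ranges assumed, from historical traffic, to
# send only spam. These are documentation ranges used as stand-ins.
BLOCKLIST = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.42/32"),
]

def is_blocklisted(sender_ip: str) -> bool:
    """Reject-at-connect check: does the sending IP fall in a blocked range?"""
    addr = ipaddress.ip_address(sender_ip)
    return any(addr in net for net in BLOCKLIST)

print(is_blocklisted("203.0.113.77"))  # True  -- inside the blocked /24
print(is_blocklisted("192.0.2.1"))     # False -- not listed
```

Removing a wrongly listed address is just deleting its entry, which is why the cheap remedy ("take it off the list if legitimate mail shows up") makes the up-front risk tolerable.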

We can make use of subject lines, SPF records, similar email addresses in the To line, and so forth, but all of these techniques carry some risk.  Even though we think of these as patterns used by spammers, legitimate users routinely send mail from outside their SPF record while claiming to be from that domain, and routinely send mail to recipients with similar email addresses.  These pattern techniques, while good, still carry some degree of risk.
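To make the "similar email addresses in the To line" signal concrete, here is a hedged sketch of one possible heuristic (the threshold values are assumptions, not a real filter's tuning): many near-identical local parts among the recipients often indicates a dictionary attack, yet legitimate mail to a family or team (jsmith1, jsmith2, ...) can look exactly the same — hence the risk.

```python
from difflib import SequenceMatcher

def similar_recipients(recipients: list[str], threshold: float = 0.8) -> bool:
    """Heuristic: do most recipient local parts closely resemble each other?"""
    local_parts = [r.split("@")[0].lower() for r in recipients]
    pairs = [(a, b)
             for i, a in enumerate(local_parts)
             for b in local_parts[i + 1:]]
    if not pairs:
        return False
    similar = sum(1 for a, b in pairs
                  if SequenceMatcher(None, a, b).ratio() >= threshold)
    # Flag when at least half of all recipient pairs are near-duplicates.
    return similar / len(pairs) >= 0.5

# Looks like a dictionary attack -- or a family's mailboxes.
print(similar_recipients(
    ["jsmith1@example.com", "jsmith2@example.com", "jsmith3@example.com"]))  # True
print(similar_recipients(["alice@example.com", "bob@example.com"]))          # False
```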

One method of fighting spam with little risk is taking snapshots of entire messages and comparing future messages with known spam messages.  If an anti-spam expert classifies known spam messages quickly and an automated engine does the snapshotting and comparing, then there is little risk of overlap.  On the other hand, this process, while precise, is not particularly robust: even small changes will defeat it.  But on the other hand (as Tevye would say), the chance of false positives is virtually eliminated.  It is also incredibly good at detecting the overlap in the bottom line of the Z-Spam Curve.  The trick, then, is to automate the process quickly, without human intervention.  The snapshot process could also be made a bit fuzzier, but this increases the risk.  However, if the risk stays within an acceptable range, then clearly this is a desirable goal.
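A minimal sketch of the snapshot idea, assuming a hash over the message body as the "snapshot."  Normalizing case and whitespace before hashing is the mildest form of the fuzziness mentioned above; anything beyond a trivial rewording still defeats the match, which is exactly the precision/robustness tradeoff.

```python
import hashlib

def snapshot(body: str) -> str:
    """Fingerprint a message body, normalizing case and whitespace first."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Corpus of snapshots taken from messages already classified as spam.
KNOWN_SPAM = {snapshot("Buy now!!!   Limited   offer.")}

def is_known_spam(body: str) -> bool:
    return snapshot(body) in KNOWN_SPAM

# Same message modulo case/whitespace: caught.
print(is_known_spam("buy now!!! limited offer."))   # True
# A one-character change produces a different hash: missed.
print(is_known_spam("Buy now!!! Limited offers."))  # False
```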

So, to conclude, in spam processing we are faced with a dilemma:

1. We can make our spam fighting techniques very generic based on some logical assumptions and heuristics, but this is likely to cause false positives because spam and non-spam can resemble each other.

2. We can make our spam fighting techniques quite precise, but this is not deployable on a wide scale because even small changes in the patterns will defeat the process.  It is also hard to scale because much human intervention is required to identify the patterns in the first place, and it is a reactive process rather than a proactive one.
