Recall from previous discussions that some spam and non-spam messages can closely resemble each other. While most spam filters can easily detect email that is obviously spam, many of them have trouble telling spam from non-spam when the two types of messages share commonalities. As a result of these commonalities, there exists an “overlap” portion of the Spam Curve. Spam filters must sacrifice some of the aggressiveness of their filtering algorithms to account for the overlap.
A very good spam filter is much better at resolving this overlap area than a run-of-the-mill filter. For example, suppose that the overlap area resembles the following:
A second-rate spam filter (perhaps Yahoo! Mail Spam Guard) would have difficulty filtering that line because of the proximity of spam and non-spam. However, suppose that a better mail filter were able to look a little closer and found the following:
A still better filter might be able to see the following:
Even more sensitive filters might see this:
The point is that while the first line looks mostly jumbled up and random, the next few lines show more discernible patterns. Each ‘x’ represents a spam pattern while each ‘o’ represents a non-spam pattern, and the longer the pattern, the easier it is to filter.
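One way to make "the longer the pattern" concrete is to measure runs of identical marks. The sketch below is only an illustration: the two strings are invented for this example and are not the patterns from the lines above, and the function names are hypothetical.

```python
from itertools import groupby


def run_lengths(pattern: str) -> list[tuple[str, int]]:
    """Collapse a string of 'x'/'o' marks into (symbol, run length) pairs."""
    return [(symbol, len(list(group))) for symbol, group in groupby(pattern)]


def mean_run_length(pattern: str) -> float:
    """Average length of the runs in the pattern (0.0 for an empty string)."""
    runs = run_lengths(pattern)
    return len(pattern) / len(runs) if runs else 0.0


# Invented example strings, not the original figures.
coarse_view = "xoxoxoxoxoxo"  # spam and non-spam alternate at every step
finer_view = "xxxoooxxxooo"   # the same number of marks, but in longer runs

print(mean_run_length(coarse_view))  # 1.0
print(mean_run_length(finer_view))   # 3.0
```

Under this toy measure, the second string is the easier one to filter because each run is long enough to be treated as a block of spam or a block of non-spam.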
Consider the implications of this concept: automating the process of spam filtering does not improve filtering effectiveness unless it results in finer-grained detection within the overlap. Writing spam filter algorithms that target single words will not increase spam filtering effectiveness because single words can often be used legitimately. Automating the process of spam filtering through machine learning may be fine and dandy, but it doesn’t mean much if the filter is only getting better at blocking the spam it was already blocking anyway. Getting new blacklists into production quickly will reduce the amount of spam getting through from IP addresses on those blacklists, but this does not address the issue of message overlap.
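The single-word problem can be shown with a small sketch. The messages, the word "pills", and both helper functions are made up for illustration and do not describe any particular filter.

```python
def contains_word(message: str, word: str) -> bool:
    """True if the single word appears anywhere in the message."""
    return word.lower() in message.lower().split()


def contains_phrase(message: str, phrase: str) -> bool:
    """True if the words of the phrase appear consecutively in the message."""
    words = message.lower().split()
    target = phrase.lower().split()
    if not target:
        return False
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


# Hypothetical messages for illustration only.
legitimate = "your refill of blood pressure pills is ready at the pharmacy"
spam = "buy enlargement pills now at a huge discount"

# The single word fires on both messages, so it cannot separate them.
print(contains_word(legitimate, "pills"), contains_word(spam, "pills"))
# The longer pattern fires only on the spam message.
print(contains_phrase(legitimate, "enlargement pills"),
      contains_phrase(spam, "enlargement pills"))
```

The first line prints True True and the second prints False True, which is the point of the paragraph above: a rule keyed on one word blocks legitimate mail along with the spam, while a longer pattern does not.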
This naturally brings us to the issue of blacklists. A blacklist rejects all mail from an IP address based on the reputation of the sender, without examining the message. It can be argued that this is a better way to block spam because it reduces the load going through the filter. However, this raises the question of whether that spam would have gotten through the filters to begin with. Has the filtering effectiveness actually been improved? Not really; we have only reduced the amount of manual work needed to tweak the filter settings. What about the mail that doesn’t come from blacklisted IP addresses? If the filter cannot detect that overlap line very well, then the spam will get through the filter.
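A minimal sketch of this arrangement follows. Everything in it is an assumption made for illustration: the IP addresses come from the ranges reserved for documentation, and `naive_content_filter` is a stand-in for whatever content filter sits behind the blacklist.

```python
from typing import Callable

# Hypothetical blacklist; the address is from a documentation-only range.
BLACKLIST = {"203.0.113.7"}


def naive_content_filter(body: str) -> bool:
    """Stand-in content filter: flags only very obvious spam."""
    return "enlargement" in body.lower()


def filter_message(
    sender_ip: str,
    body: str,
    content_filter: Callable[[str], bool] = naive_content_filter,
    blacklist: set[str] = BLACKLIST,
) -> tuple[str, bool]:
    """Return (verdict, whether the content filter had to run)."""
    if sender_ip in blacklist:
        # Rejected on reputation alone; the message body is never examined.
        return "reject", False
    verdict = "reject" if content_filter(body) else "accept"
    return verdict, True


print(filter_message("203.0.113.7", "cheap enlargement pills"))
print(filter_message("198.51.100.4", "cheap enlargement pills"))
print(filter_message("198.51.100.4", "a message from the overlap area"))
```

The first call saves the content filter some work, but the second call shows the same spam would have been rejected anyway. The third call is the case the blacklist does nothing for: the sender has no bad reputation, so the outcome depends entirely on how well the content filter handles the overlap.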
Also note a corollary of Theorem 4: a mutation in the type of mail getting through a filter does not imply a lack of effectiveness in the filter if the type of spam getting through falls outside of the overlap area. If a spammer is sending incredibly spammy enlargement pill spam and changes a few words and the spam gets through the filter, this does not mean that the filter is ineffective. The filter just has to be updated to keep up with the mutation of the spam. This can be achieved in a couple of ways, either algorithmically or manually (with human intervention). In either case, the filter is still blocking the spam, with a slight delay until it sees examples of the new variant. Pre-emptive blocking of the mail would presuppose that the algorithm or human intervention is capable of predicting how spammers will change the mail to get it through the filters. This, of course, is impossible. Theorem 4 implies that if a spammer changes spam on the left-hand side of the Spam Curve, both a good filter and a poor filter will eventually catch it. However, if a spammer changes spam in the overlap area of the Spam Curve, a good filter will catch it in time but a poor filter will not, no matter how much time is allotted to it.
The logistics of updating heuristics and algorithms to catch very spammy mail are a factor in how quickly a filter can respond to mail that rapidly mutates, but a good filter will also be able to intelligently distinguish the good messages from the bad when the two types are very similar in content.
Most spam filters use all types of shortcuts in order to catch the very spammy mail. Since this is where the greatest volume of spam occurs, it makes sense to do this. Shortcuts include blacklists, SenderID for reputation-based email lookups, phishing lookups, and so forth. However, note that these signals only exist if the sender of the mail (i.e., a spammer) already has a bad reputation. They say nothing about the overlap area which, by definition, contains content that appears to be spammy and non-spammy at the same time. Being on a blacklist makes a mail considerably spammier, regardless of the content. The same is true for mail that references sites cross-referenced with phishing sites. The weakness in relying on these shortcuts is that while they will catch a lot of spam because there was lots of volume to build them from, the overlap area is unlikely to have readily available shortcuts. The volume of overlap messages is much lower and detection becomes more difficult. A good spam filter will need to work harder in order to catch these.
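The way a shortcut makes a mail "spammier, regardless of the content" can be sketched as an additive score. The weights, the threshold, and the `Signals` structure below are all hypothetical values chosen for illustration and are not taken from any real filter.

```python
from dataclasses import dataclass

# Hypothetical weights and threshold, chosen only for illustration.
BLACKLIST_WEIGHT = 5.0
PHISHING_WEIGHT = 5.0
SPAM_THRESHOLD = 5.0


@dataclass
class Signals:
    content_score: float                  # how spammy the content alone looks
    on_blacklist: bool = False            # sender IP has a bad reputation
    links_to_phishing_site: bool = False  # URL matched a phishing lookup


def spam_score(signals: Signals) -> float:
    """Add the reputation shortcuts on top of the content score."""
    score = signals.content_score
    if signals.on_blacklist:
        score += BLACKLIST_WEIGHT
    if signals.links_to_phishing_site:
        score += PHISHING_WEIGHT
    return score


def is_spam(signals: Signals) -> bool:
    return spam_score(signals) >= SPAM_THRESHOLD


# Mild content from a blacklisted sender is pushed over the threshold.
print(is_spam(Signals(content_score=1.0, on_blacklist=True)))  # True
# An overlap message has no shortcut signals, so only the content counts.
print(is_spam(Signals(content_score=2.5)))                     # False
```

In this sketch the overlap message is decided by `content_score` alone, which is why the quality of the content analysis matters most exactly where the shortcuts are missing.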