A promising new antispam technique – does it deliver what it promises?

I’m always skeptical when I read about new antispam techniques, especially those coming out of academia. Today, while browsing news stories, I came across an article entitled “Scientists devise new technique to get rid of spam mail”. Here are some excerpts:

Researchers have proposed a new statistical framework for spam filtering that can quickly and efficiently block unwanted messages in your email inbox.

When I first read this, I was like “Oh, a new technique using statistics! Please, tell me more!” After all, using statistics to fight spam is one of my specialties.

Scientists from the Concordia University have conducted a comprehensive study of several spam filters in the process of developing a new and efficient one.

“Our new method for spam filtering is able to adapt to the dynamic nature of spam emails and accurately handle spammers’ tricks by carefully identifying informative patterns, which are automatically extracted from both text and images content of spam emails,” said Researcher Ola Amayri in a statement.

Until now, the majority of research in the domain of email spam filtering has focused on the automatic extraction and analysis of the textual content of spam emails and has ignored the rich nature of image-based content.

My curiosity quickly turned to disappointment. What is “new” in this technique is that the filter extracts the textual content from the image and then runs patterns against it. For example, suppose you got a spam message with the following image:

image

This filter could extract the URL www.my.example.com and then feed it into other parts of the spam engine. The article continues:

When these tricks are used in combination, traditional spam filters are powerless to stop the messages, because they normally focus on either text or images but rarely both, the study found.

“The majority of previous research has focused on the textual content of spam emails, ignoring visual content found in multimedia content, such as images. By considering patterns from text and images simultaneously, we’ve been able to propose a new method for filtering out spam,” said researcher Ola Amayri.

Amayri explained that new spam messages often employ sophisticated tricks, such as deliberately obscuring text, obfuscating words with symbols, and using batches of the same images with different backgrounds and colours that might contain random text from the web.

By conducting extensive experiments on traditional spam filtering methods that were general and limited to patterns found in texts or images, the new method is much stronger, based on techniques used in pattern recognition and data mining, to filter out unwanted emails.

These assertions are not true. While this technique might be a new research method of filtering spam, it’s years behind modern spam filters. Modern filters are quite capable of extracting different parts of a message and considering them when they occur together. As I say in my post Combating Phishing, there are numerous techniques that filters use:

  1. IP Reputation

    This is the most common filter and all good ones use it. Filters maintain lists of IPs that are malicious or sending spam and block them at the SMTP level before the message has even been accepted for content scanning (some accept the message and use the IP’s reputation as a weight in the filter).

  2. URL reputation

    Similar to IP reputation, modern filters extract URLs and examine them against reputation lists, or even use forward resolution to examine the IP space the URL points to. This is the main gap this technique aims to fill.

  3. Sender authentication

    Most filters use checks including SenderID, SPF, DKIM and DMARC to make spam decisions.

  4. Content filtering

    The last piece of this puzzle is content filtering. These are rules – keywords, tokens, phrases or regular expressions – that operate on the various parts of a message including the message body, headers and attachments. These pieces are considered together, assigned a weight, and then added up to make a spam or non-spam decision.
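
To make the “assigned a weight, and then added up” part concrete, here is a minimal sketch in Python. The rule names, patterns, weights and threshold are all made up for illustration; they are not taken from any real filter, and production engines use far more rules and tune the weights from data.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    pattern: re.Pattern
    part: str      # which part of the message the rule operates on
    weight: float  # positive = spammy, negative = evidence of legitimate mail


# Hypothetical rules and threshold, chosen only to show the mechanism.
RULES = [
    Rule("subject_free", re.compile(r"\bfree\b", re.I), "subject", 1.5),
    Rule("body_click_here", re.compile(r"click here", re.I), "body", 2.0),
    Rule("body_unsubscribe", re.compile(r"unsubscribe", re.I), "body", -0.5),
]
SPAM_THRESHOLD = 3.0


def score_message(parts: dict[str, str]) -> float:
    """Add up the weights of every rule that matches its message part."""
    return sum(
        rule.weight
        for rule in RULES
        if rule.pattern.search(parts.get(rule.part, ""))
    )


def is_spam(parts: dict[str, str]) -> bool:
    """Make the spam or non-spam decision from the summed score."""
    return score_message(parts) >= SPAM_THRESHOLD
```

Calling `is_spam({"subject": "Free pills", "body": "Click here now"})` fires two rules whose weights sum past the threshold, while a message that only mentions “unsubscribe” scores below it. The point is that no single rule decides; the pieces are considered together.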

This “new” method aims to fill in the gaps in #2 and #4. While it is true that URLs cannot be extracted out of images very easily, it is not true that content filtering cannot catch this type of spam. What’s the issue?

  1. There are other properties of messages besides the content within an image

    Every image has MIME properties, including the file name, encoding and file size. Many spammers name their files with predictable patterns and content filters can match those. Put together, these can be an indicator of spam (for example, if a message has no content other than an image, and the image has a certain file name, and it comes from an IP that’s never appeared before, that is suspicious).

  2. There are ways to catch images beyond text extraction

    Second, many filters create signatures or fingerprints based upon spam messages going to spam traps. They then create signatures on the images within the message (since they are attached and encoded in base64). You don’t need to extract the content to match the spammy content it is hiding because you can just compare the image’s signature with your database of known bad signatures.

  3. Its unique catch rate is limited

    Third, way back in 2007 and 2008, Hotmail’s Smartscreen filter did use image extraction and analysis to catch certain campaigns. They ended up moving away from it because its unique incremental catch rate was negligible. Everything that the filter caught with image extraction could be caught with other antispam techniques. This is especially important – existing methods are very good at catching image spam without doing image content extraction.

  4. It is computationally expensive

    Fourth, why wouldn’t you want to do image content extraction if it helps you catch a little more? Because image content extraction is very expensive. Filters are scanning millions of messages per day. Doing this type of processing incurs a major CPU hit. Filters are fast but they are not that fast. Large-scale filters must scale to the types of volumes they are accustomed to seeing.

  5. The dynamics work against spammers

    Fifth, image spam is not the huge problem it was years ago. The reason spammers send image spam is to avoid spam filters examining the URLs within a message. However, if a URL within a message is within an image, the user can’t click on it, either. They must manually type the URL into their browser and this drops the spammer’s click-through rate. It is much more effective to have a one-click solution.

    Furthermore, sending spam with images eats up the spammer’s bandwidth. You can only send so many messages depending on their size. To get around the lower click-through rate, you need to send more messages. But because you are constrained by how much mail you can actually send, your spam campaign needs to last longer (i.e., it will take longer to send ten thousand 50 KB messages with an image than it will to send ten thousand 10 KB short messages). But if you’re sending spam for that long, IP blocklists detect it, update, and then you’re blocked from sending spam before you’ve even gotten your entire campaign out.

    Which means you’re out of luck.
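
To illustrate point 2 above (matching an image’s signature instead of extracting its content), here is a rough sketch in Python using only the standard library. I’m assuming a plain SHA-256 hash as the “signature” and an in-memory set as the database of known bad signatures; both are simplifications for illustration. An exact hash is defeated by images that vary per message, so real filters would likely use fuzzier fingerprints, and the function names here are my own.

```python
import hashlib
from email import message_from_bytes
from email.message import EmailMessage

# Hypothetical signature database, e.g. built from mail hitting spam traps.
KNOWN_BAD_IMAGE_HASHES: set[str] = set()


def image_fingerprints(raw_message: bytes) -> list[str]:
    """Return a SHA-256 hex digest for every image part in the message."""
    msg = message_from_bytes(raw_message)
    fingerprints = []
    for part in msg.walk():
        if part.get_content_maintype() == "image":
            # decode=True undoes the base64 transfer encoding.
            payload = part.get_payload(decode=True)
            if payload:
                fingerprints.append(hashlib.sha256(payload).hexdigest())
    return fingerprints


def matches_known_image(raw_message: bytes) -> bool:
    """True if any attached image matches a known bad signature."""
    return any(
        fp in KNOWN_BAD_IMAGE_HASHES
        for fp in image_fingerprints(raw_message)
    )


def build_demo_message(image_bytes: bytes) -> bytes:
    """Build a small test message with one attached image (demo helper)."""
    msg = EmailMessage()
    msg["Subject"] = "demo"
    msg.set_content("see attached")
    msg.add_attachment(
        image_bytes, maintype="image", subtype="png", filename="offer.png"
    )
    return msg.as_bytes()
```

Note that nothing here looks inside the image. Hashing the decoded bytes is cheap compared with extracting text from the picture, which is part of why this kind of approach scales where content extraction does not.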

For all of these reasons, this “new” research method isn’t new at all and isn’t something I would implement. The idea has floated around in the industry for years but it hasn’t caught on.

On to the next technique.