Spam filters and foreign spam, part 2 – Collisions

Note: this post contains some explicit language relating to stuff commonly found in spam.

One of the most basic ways of fighting spam is content filtering.  At its simplest, you search the content of an email for words, patterns and phrases that are spammy.  Here’s an example from a message that landed in my inbox:

Diana sent a card and wrote this to you:
“My love for you will always be the same”

Just click on the following link to see your E-card:

Regards, 101Christmas-Carols.

As a spam analyst, I might write a regex like the following:

\bclick on the following link to see your e-?card\b

It’s elegant in its simplicity, although not all that robust.  Of course, spam analysts generally write their rules to be much more flexible than that.  These types of regexes target English-language messages.  But problems arise when messages arrive in other languages, you have a global rule set, and you don’t know which words will hit in which other languages.
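To make the rule concrete, here is a minimal Python sketch that applies the regex above to the sample message. The case-insensitive flag is an assumption (the post does not say which flags the filter uses), and `matches_ecard_rule` is a hypothetical helper name, not part of any real filter.

```python
import re

# The rule from the post. IGNORECASE is an assumption: the sample message
# says "E-card" while the rule is written in lowercase.
ECARD_RULE = re.compile(
    r"\bclick on the following link to see your e-?card\b",
    re.IGNORECASE,
)


def matches_ecard_rule(body: str) -> bool:
    """Return True if the message body contains the e-card phrase."""
    return ECARD_RULE.search(body) is not None


sample = "Just click on the following link to see your E-card:"
print(matches_ecard_rule(sample))
print(matches_ecard_rule("Just click the link below to view your greeting"))
```

The second call shows why the rule is not very robust: a small rewording of the spam template is enough to slip past it.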

Porn spam is one type of spam that is prone to collisions.  A number of our organizations are sensitive to this type of mail and want to cut down on all spam containing explicit images or language.  To that end, some spam filters allow you to block single words completely.  The net result is that you block much of your porn spam, but you can also end up blocking legitimate mail.  English speakers use a lot of slang, and many of those slang words also appear in legitimate contexts.  If you’ll allow me to get somewhat graphic:

  • This is escalating in a tit-for-tat battle of words
  • The former vice-president of the United States was Dick Cheney
  • Here is a cum. list of everything (short for cumulative)

In some contexts these words are clearly pornographic in English, but in the sentences above they are used legitimately.  Block on them and you can end up stopping a lot of legitimate mail.  It is always a risk to block mail based on the existence of a single word.
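The false positives are easy to reproduce. The sketch below assumes a hypothetical single-word block list built from the three words in the examples, compiled into one case-insensitive rule with word boundaries. `blocked_words_in` is an illustrative helper, not a real filter API.

```python
import re

# Hypothetical single-word block list, taken from the examples above.
BLOCKED_WORDS = ["tit", "dick", "cum"]

# One case-insensitive alternation with a word boundary on each side.
SINGLE_WORD_RULE = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in BLOCKED_WORDS) + r")\b",
    re.IGNORECASE,
)


def blocked_words_in(text: str) -> list[str]:
    """Return the sorted, lowercased blocked words found in the text."""
    return sorted({m.group(0).lower() for m in SINGLE_WORD_RULE.finditer(text)})


legitimate = [
    "This is escalating in a tit-for-tat battle of words",
    "The former vice-president of the United States was Dick Cheney",
    "Here is a cum. list of everything",
]

for line in legitimate:
    # Every one of these legitimate sentences trips the single-word rule.
    print(blocked_words_in(line), "<-", line)
```

Word boundaries keep the rule from firing inside longer words such as "cumulative", but they do nothing for a hyphenated phrase, a proper name, or an abbreviation.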

The problem gets worse for English speakers who take spam words that seem safe to block and translate them into foreign languages.  Consider what would happen to an organization that translated the following words:

  • F*ck in English is fick in German, but fick is a common, inoffensive word in Swedish (it means “might”, according to my online translators).  If you translate this word and block on it while receiving legitimate Swedish mail, you will get false positives.
  • Pussy in English is translated into German as Pflaume, which literally means plum.  Whether this is spammy depends on the context (the same is true in English).
  • Threesome in English is Dreier in German, but Dreier is also a proper name in German.

The wider your footprint of inbound mail (i.e., the more global your organization), the more restricted you are in what you can block on when translating single words from English.  Many organizations, as I stated earlier, want to block all sensitive words.  But this becomes difficult when you can’t be certain that what is bad in one language is not benign in another.
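The collisions in the list above can be sketched in a few lines. Everything here is illustrative: the two rules stand in for translated block words, the sample messages are invented, and `find_collisions` is a hypothetical helper that reports which rule hits legitimate mail in which language.

```python
import re

# Hypothetical rules produced by translating English block words into German.
TRANSLATED_RULES = {
    "fick": re.compile(r"\bfick\b", re.IGNORECASE),
    "dreier": re.compile(r"\bdreier\b", re.IGNORECASE),
}

# Invented examples of legitimate inbound mail, keyed by language.
LEGITIMATE_MAIL = {
    "Swedish": ["Jag fick ditt mejl igår."],
    "German": ["Mit freundlichen Grüßen, Anna Dreier"],
    "English": ["Please find the invoice attached."],
}


def find_collisions(rules, mail_by_language):
    """Return (rule name, language, message) for each rule hit on legitimate mail."""
    hits = []
    for name, pattern in rules.items():
        for language, messages in mail_by_language.items():
            for message in messages:
                if pattern.search(message):
                    hits.append((name, language, message))
    return hits


for name, language, message in find_collisions(TRANSLATED_RULES, LEGITIMATE_MAIL):
    print(f"rule '{name}' collides with legitimate {language} mail: {message}")
```

A rule written with German spam in mind ends up firing on an ordinary Swedish sentence and on a German signature line, which is exactly the kind of hit that goes unnoticed when the rule author does not read those languages.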

These are the kinds of things to be aware of when writing spam rules that do straight-up blocking on content.  After a while you start to become paranoid about the unintended consequences of word translations.  Nobody is an expert in every language, so what I generally recommend is to avoid blocking on a single word.  Phrases are much better.

Comments (2)

  1. Retired Ninja says:

    .. or even…

    use combinations of single words in meta rules and catch the stuff without risks of false positives

    (but I guess that is way too advanced)
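
A minimal sketch of the meta-rule idea from the comment above, assuming a meta rule simply means "fire only when several distinct single-word rules match the same message". The word list, the threshold of two, and the name `meta_rule_fires` are all hypothetical; real filters define meta rules in their own configuration syntax.

```python
import re

# Hypothetical word list, reusing the words from the post's examples.
WORDS = ["tit", "dick", "cum"]

# One weak single-word rule per word; none of them blocks mail on its own.
WORD_RULES = {
    word: re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    for word in WORDS
}


def meta_rule_fires(text: str, minimum_distinct: int = 2) -> bool:
    """Fire only when at least `minimum_distinct` different word rules match."""
    distinct = sum(1 for pattern in WORD_RULES.values() if pattern.search(text))
    return distinct >= minimum_distinct


print(meta_rule_fires("The former vice-president of the United States was Dick Cheney"))
print(meta_rule_fires(" ".join(WORDS)))
```

Each legitimate sentence from the post contains only one of the words, so the combined rule leaves it alone. This lowers the risk of false positives rather than removing it, since a legitimate message can still contain two such words by coincidence.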