Blocking foreign languages in Forefront Online

Article
10/20/2011

From time to time, we hear customers complain of problems with foreign language mail – mostly Chinese but sometimes Portuguese. We hear these complaints and understand the frustration; however, the problem of foreign language spam is more complicated than regular spam.

Chinese language spam occurs in much smaller volumes than regular spam. It is not a “wide” problem. Instead, what we see is pockets of dissatisfaction: small groups of users who receive far more Chinese spam than the rest of our user base. Thus, if we have 100 users, perhaps 3-5 of them will have lots of Chinese spam messages hitting their inboxes.

One of the problems in combatting Chinese spam is that IP blocklists do not contain very many IPs of the spammers who send it. Furthermore, not all Chinese spam originates in China; in fact, less than half of it does. Complicating matters further still, lots of Chinese spam contains URLs that are not on URL blocklists, and much of it contains no URLs at all. Thus, the primary methods of combatting spam are often not available to filters when the spam is Chinese. This does not mean that IP or URL reputation filtering does not work on Chinese spam, but rather that a disproportionately high amount of Chinese spam is resistant to these techniques compared to spam sent in English or Russian.

Second, much of the Chinese spam is “gray” mail, that is, mail that advertises products, seminars, or magazines. Gray mail is more difficult to filter than regular spam, and Chinese gray mail compounds the difficulty.

Finally, Chinese spam contains Chinese characters (well, no kidding). Much of the industry is based in North America or Europe and our primary languages are English, Russian, German or some other western language. Many of our spam rules target word patterns and phrases in the body content. I don’t understand the grammatical structure of logographic languages and so for me to go out and write a regular expression based on Mandarin characters just doesn’t work the same as it does for words, patterns and phrases in English. It is not as flexible nor as predictive.

For customers who have problems with Chinese language spam, there are some workarounds. One trick is to block all Chinese language mail. Many foreign languages are encoded in their own charsets. For example, back in the day, English was mostly encoded in the US-ASCII charset. Nowadays, it is more frequently (but not exclusively) encoded in ISO-8859-1 or UTF-8. Many foreign languages, however, are often encoded in charsets other than those.

There are some character sets that are almost always reserved for a particular language. Here is my list:

Chinese – gb2312, gbk, EUC-cn, ISO-2022-cn, and Big5 (Taiwanese)
Japanese – ISO-2022-JP, EUC-jp, Shift-JIS
Korean – ISO-2022-kr, johab, ks_c_5601_1987, EUC-kr
Russian – KOI8-R, Windows-1251, ISO-8859-5
Turkish – Windows-1254, ISO-8859-9
Arabic – ISO-8859-6, Windows-1256
Greek – ISO-8859-7, Windows-1253
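The list above can be turned into a lookup table. What follows is a minimal sketch in Python using only the standard library, not anything Forefront Online exposes. The function names and the normalization step (lowercasing, treating "_" like "-") are my own assumptions for illustration.

```python
from email import message_from_string

# Charset names copied from the list above.
LANGUAGE_CHARSETS = {
    "Chinese": ["gb2312", "gbk", "EUC-cn", "ISO-2022-cn", "Big5"],
    "Japanese": ["ISO-2022-JP", "EUC-jp", "Shift-JIS"],
    "Korean": ["ISO-2022-kr", "johab", "ks_c_5601_1987", "EUC-kr"],
    "Russian": ["KOI8-R", "Windows-1251", "ISO-8859-5"],
    "Turkish": ["Windows-1254", "ISO-8859-9"],
    "Arabic": ["ISO-8859-6", "Windows-1256"],
    "Greek": ["ISO-8859-7", "Windows-1253"],
}


def normalize(charset):
    """Lowercase and treat '_' like '-' so 'Shift_JIS' matches 'Shift-JIS'.

    This normalization is an assumption of the sketch, not a documented rule.
    """
    return charset.strip().lower().replace("_", "-")


# Reverse lookup: normalized charset name -> language.
CHARSET_TO_LANGUAGE = {
    normalize(name): language
    for language, names in LANGUAGE_CHARSETS.items()
    for name in names
}


def likely_languages(raw_message):
    """Return the languages suggested by the charsets a message declares."""
    msg = message_from_string(raw_message)
    found = set()
    for part in msg.walk():
        charset = part.get_content_charset()  # None if no charset is declared
        if charset and normalize(charset) in CHARSET_TO_LANGUAGE:
            found.add(CHARSET_TO_LANGUAGE[normalize(charset)])
    return found
```

Note that this only reads the charset label the sender declared; it does not look at the text itself.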

If you are having a problem with foreign language mail, then you can create a policy rule to block mail in those languages. Here is a screenshot of what it would look like:


Create a new policy rule and leave every field blank except for the Message drop-down property, then click the blue “Edit” link next to the character sets. Add the ones for the Chinese language, or for any language you want to block. I recommend putting different languages in separate rules in case you get false positives.
Set the action to “Quarantine.” This way, if you do get messages that are false positives, you can go back and retrieve them later rather than asking the sender to resend them to you.
If you get false positives from senders, you can create one-off whitelist exceptions for them according to my instructions in this post: How to whitelist a sender for inbound mail.
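To make the logic of these steps concrete, here is a rough model in Python. It is not the Forefront Online rule engine or its API. The allowed-sender list, the addresses, and the function name are hypothetical, and I am assuming that a sender exception is checked before the charset rule.

```python
from email import message_from_string
from email.utils import parseaddr

# Charsets for the one language this rule targets (Chinese, from the list above).
BLOCKED_CHARSETS = {"gb2312", "gbk", "euc-cn", "iso-2022-cn", "big5"}

# Hypothetical one-off exceptions for senders who triggered false positives.
ALLOWED_SENDERS = {"colleague@example.com"}


def policy_action(raw_message):
    """Return 'quarantine' or 'deliver' for a raw message string."""
    msg = message_from_string(raw_message)

    # Assumption: sender exceptions win over the charset rule.
    sender = parseaddr(msg.get("From", ""))[1].lower()
    if sender in ALLOWED_SENDERS:
        return "deliver"

    # Quarantine rather than reject, so false positives can be retrieved later.
    for part in msg.walk():
        charset = part.get_content_charset()
        if charset and charset in BLOCKED_CHARSETS:
            return "quarantine"
    return "deliver"
```

Keeping one set of charsets per rule mirrors the advice above: if a rule causes false positives, only that language's rule needs adjusting.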

Using these, you can cut down on the amount of spam encoded in foreign languages in your inbox. However, there are some caveats to doing this:

Not all Chinese language spam is encoded in the Chinese charsets listed above. The UTF-8 charset encodes every language – English, French, German, Russian, Japanese, Spanish… and Chinese. Chinese spam that is encoded in UTF-8 will still bypass this rule. You should not create a rule to block UTF-8 because you will block far too much legitimate mail.

You can cut down on the amount of Chinese spam you receive by creating these rules for catching charsets, but unfortunately you won’t catch it all.
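The UTF-8 caveat is easy to demonstrate. In this small Python sketch, the same two Chinese characters are sent once as gb2312 and once as UTF-8. The message builder, the sender address, and the rule function are hypothetical stand-ins for a charset rule, not the actual filter.

```python
from email import message_from_bytes

CHINESE_CHARSETS = {"gb2312", "gbk", "euc-cn", "iso-2022-cn", "big5"}

# "Hello" in Chinese, written with escapes so this example stays ASCII.
TEXT = "\u4f60\u597d"


def build_message(charset):
    """Build a tiny raw message whose body is TEXT encoded in `charset`."""
    headers = (
        "From: sender@example.com\r\n"
        f"Content-Type: text/plain; charset={charset}\r\n"
        "Content-Transfer-Encoding: 8bit\r\n"
        "\r\n"
    ).encode("ascii")
    return headers + TEXT.encode(charset)


def rule_matches(raw_bytes):
    """True if the declared charset is one of the Chinese charsets."""
    charset = message_from_bytes(raw_bytes).get_content_charset()
    return charset in CHINESE_CHARSETS


def body_text(raw_bytes):
    """Decode the body using the charset the message declares."""
    msg = message_from_bytes(raw_bytes)
    return msg.get_payload(decode=True).decode(msg.get_content_charset())


gb_message = build_message("gb2312")
utf8_message = build_message("utf-8")

# Same text in both messages, but only the gb2312 one trips the rule.
print(rule_matches(gb_message), rule_matches(utf8_message))
print(body_text(gb_message) == body_text(utf8_message))
```

Both messages carry identical Chinese text, yet only the charset label differs, which is all a charset rule can see.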
Not all mail encoded in Chinese charsets is Chinese. There are two exceptions to this.

The first is that sometimes there will be someone that you are communicating with that is of Chinese descent and has a Chinese signature in their email as their tagline. Their entire message might be in English but they want to include a phrase in Mandarin. If they do, the message will be encoded in a Chinese charset (or perhaps UTF-8 but you’re not blocking this charset anyhow). When that happens, you will get a false positive and will have to create a whitelist entry for that sender or sender’s domain as per my instructions above.

The second situation is stranger. Sometimes, for reasons unknown to me, an otherwise ordinary message has the occasional character encoded in a foreign-language charset. Consider the following mail snippet from my own email:

From: John Jones
To: Terry Zink
Sent: Thursday, July 14, 2011 5:54 PM
Subject: RE: Code and Test Automation Complete Criteria Status

Regarding blocking bugs, we have also used “Priority 0” to mark bugs as blocking. I am assuming that we including those bugs here.

Thanks.

-John

Okay, you say. So what? Looks pretty normal. It does look normal, but this message was encoded in ISO-2022-JP, which is a Japanese charset. Here is what the message looks like in the raw source:

From: John Jones
To: Terry Zink
Sent: Thursday, July 14, 2011 5:54 PM
Subject: RE: Code and Test Automation Complete Criteria Status

Regarding blocking bugs, we have also used =1B$B!H=1B(BPriority 0=1B$B!I=1B= (B to mark bugs as blocking. I am assuming that we including those bugs her= e. Thanks.

-John

You see those quote marks in the original email surrounding the words Priority 0? In the raw source, they are encoded as ISO-2022-JP escape sequences, shown here in quoted-printable form. There is no actual Japanese content in the message. Yet for some reason the message encoded these quote marks using a charset used primarily for Japanese characters.
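You can check what those byte sequences are with a few lines of Python. This sketch decodes only the quoted phrase from the raw source above. One assumption: I have rejoined the "=1B= (B" fragment into "=1B(B", treating the "= " as a soft line break that got flattened when the source was pasted.

```python
import quopri

# The quoted phrase from the raw source, in quoted-printable form.
# "=1B" is the ESC byte; "=1B$B" switches to the Japanese character set and
# "=1B(B" switches back to ASCII.
RAW_QP = b"=1B$B!H=1B(BPriority 0=1B$B!I=1B(B"

# Step 1: undo the quoted-printable transfer encoding.
raw_bytes = quopri.decodestring(RAW_QP)

# Step 2: decode the resulting bytes as ISO-2022-JP.
text = raw_bytes.decode("iso2022_jp")

# Collect the characters that fall outside plain ASCII.
non_ascii = [f"U+{ord(ch):04X}" for ch in text if ord(ch) > 127]

print(text)
print(non_ascii)  # ['U+201C', 'U+201D'], the left and right curly quotes
```

The only non-ASCII characters in the phrase are the two curly quote marks; everything else is plain ASCII.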

Why did it do this?

I have no idea. But I know that it happens and it happens more often than you think.
Finally, using charsets to catch foreign language spam doesn’t work with all languages. It only works for languages that are not written in the Roman alphabet (those written in Cyrillic, Chinese, Korean, Thai, etc.). Western languages such as Portuguese, Spanish, French, and German are usually encoded in ISO-8859-1, UTF-8, or windows-1252. If you block these charsets, you will end up blocking a lot of legitimate mail.

We are looking into using other technologies to perform language identification without resorting to looking at charsets, but they are not available yet.

There you go. If you are having problems with foreign language mail arriving in your inbox, there are workarounds. These are not perfect, but hopefully you can get some relief by implementing them.
