A few days ago, Yahoo antispam chief Mark Risher hosted a Q&A session with users and answered both pre-submitted and live questions. I thought I’d chime in with some selected quotes from the session and add my own thoughts to Mark’s.
dlippman: Why don’t emails with the word "Lottery" and a few other Spam characteristics automatically go into my Spam folder?
Mark: I really wish we could! Catching a specific word is really hard. On the one hand, there are the risks that we’ll catch something legitimate — “Campus housing lottery this Friday” — which is what we call a “false positive.” (A “false positive” is any time our filters mistakenly mark something as spam when it isn’t). On the other hand, if we build a filter for one specific word, there are often about a bazillion other ways the bad guys can spell it and still get their point across. Did you know there were 600,426,974,379,824,381,952 ways to spell \/!@g.r.a? (check out http://cockeyed.com/lessons/viagra/viagra.html)
Indeed. Back when I was a spam analyst, actively writing and adjusting spam rules every day, we developed some rules of thumb. Writing spam rules on single words is risky because supposedly spammy words turn up in legitimate mail all the time. What if a researcher at Pfizer were discussing the latest results of a Viagra test?
This comes up quite often with the sensitive word list. Some customers want no x-rated words in their inbox and ask us "Why don’t you block this word? Or that word?" The reason is that in American slang (at least), curse words are used in "legitimate" contexts, so you can’t block them outright. Think about how many times you curse in real life, perhaps while coding, when you can’t figure out why your code doesn’t produce the correct output…
On the other hand, spammy words that occur in longer phrases are much more suspicious and much less prone to false positives (FPs). Blocking on "Lottery" in the subject line is risky, but blocking on "Win the Spanish lottery!" is much less so. The rule of thumb in spam filtering is that the less generic the phrase, the more aggressive you can be with its spam weight.
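To make the rule of thumb concrete, here is a minimal sketch of weighted rules in Python. The rule list, the weights, the threshold, and the function names are all made up for illustration and are not taken from any real filter; the only point is that a generic single word gets a low weight while a specific phrase gets a high one.

```python
import re

# Hypothetical rules: (pattern, weight). The weights are illustrative only.
RULES = [
    # Generic single word: common in legitimate mail, so a low weight.
    (re.compile(r"\blottery\b", re.IGNORECASE), 1.0),
    # Specific phrase: rarely legitimate, so a much higher weight.
    (re.compile(r"\bwin the spanish lottery\b", re.IGNORECASE), 5.0),
]

# Assumed cutoff: a subject scoring at or above this is treated as spam.
SPAM_THRESHOLD = 5.0


def spam_score(subject):
    """Add up the weights of every rule whose pattern appears in the subject."""
    return sum(weight for pattern, weight in RULES if pattern.search(subject))


def is_spam(subject):
    """Return True when the combined rule weight reaches the threshold."""
    return spam_score(subject) >= SPAM_THRESHOLD
```

With these made-up numbers, "Campus housing lottery this Friday" only trips the single-word rule and stays under the threshold, while "Win the Spanish lottery!" trips both rules and is flagged.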
brip: Customer Care tells me that I need to forward with full headers if I’m reporting spam, but when I try to do that the headers are never there. What should I do?
Ryan: To forward with headers users have to take two steps. First you will need to reveal the headers for the message. In Classic you can look for a "Full Headers" link just below the bottom right corner of the message. In All-New Mail there is a Header dropdown just above the top right corner of the message. Once you have exposed the full headers you can copy and paste them into the message as you are forwarding it.
This is something we also tell our customers. Forwarding full headers is crucial to fighting spam. Why? Because the headers of the email tell us much that is not available in the body contents. While we do write spam rules on body and subject content, the headers tell us the following:
- Who sent the message (i.e., who they claimed to be in the SMTP MAIL FROM)
- What IP address sent the message
- The route the mail traveled on its way to you (i.e., intermediate hops)
- Suspicious message headers… what does the Message-ID say? What does the HELO say? What do the Received headers say?
- What did our own spam filter say? Many filters insert x-headers into the message, and receiving those headers lets us know what the filter already said, so we know where to start when we want to block the message.
In other words, there is a much richer set of content in the message headers than is available in the message body alone. Much of the time, simply forwarding a message to the abuse email address loses the message headers, rendering the report much less useful to the spam abuse team.
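As a rough illustration of what an analyst can pull out of full headers, here is a small Python sketch using the standard library's `email` package. The sample message, its addresses, and the `summarize_headers` helper are invented for this example; real headers vary widely between mail systems, and the IP extraction below is a naive pattern match rather than a proper Received-header parser.

```python
import re
from email import message_from_string

# Made-up sample message; every host, address, and header value is illustrative.
RAW_MESSAGE = """\
Return-Path: <winner@example.com>
Received: from mail.example.net (mail.example.net [192.0.2.10])
\tby mx.example.org with ESMTP; Mon, 1 Jan 2007 10:00:00 -0000
Received: from unknown (HELO friend) (198.51.100.7)
\tby mail.example.net with SMTP; Mon, 1 Jan 2007 09:59:58 -0000
Message-ID: <12345@friend>
X-Spam-Status: No, score=2.1
From: "Lottery Office" <winner@example.com>
Subject: Win the Spanish lottery!

Claim your prize.
"""

# Naive IPv4 matcher; good enough for a sketch, not for production parsing.
IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def summarize_headers(raw):
    """Collect the header fields discussed above into a plain dict."""
    msg = message_from_string(raw)
    received = msg.get_all("Received") or []
    return {
        # Return-Path records the SMTP MAIL FROM as seen at final delivery.
        "envelope_sender": msg.get("Return-Path"),
        # Received headers are prepended, so the first one is the latest hop.
        "received_ips": [ip for hop in received for ip in IPV4.findall(hop)],
        "message_id": msg.get("Message-ID"),
        # Any X- header, which is where many filters record their verdict.
        "filter_headers": {k: v for k, v in msg.items()
                           if k.lower().startswith("x-")},
    }


summary = summarize_headers(RAW_MESSAGE)
```

None of these fields survive a plain forward that carries only the body, which is why the abuse team asks for the full headers.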