I was recently thinking about what a person who fights spam (like me) does all day. In other words, what is a day in the life of a spam analyst like?
The question for me is two-fold, because the stuff I do now is quite different from when I first started three years ago. So, I’m going to break this into two posts: one for what I do now and one for what I did back then. The idea is to break down how we go about fighting spam.
Back when I first started, my daily routine eventually settled into three tasks:
- Process false positive submissions
- Process spam abuse submissions
- Process IP blocklist candidates and delistings
The method of going through the false positive inbox was in itself a process:
- Separate the messages by cause of filtering (i.e., was it filtered by a spam rule, or by one of our automated processes?). I wrote back-end scripts to do this.
- Move joke messages or obvious spam into the not-valid folders. I wrote a back-end script to do this as well; it was the first one I wrote.
- Manually go through and separate the wheat from the chaff. You wouldn’t believe how many invalid submissions come to the false positive inbox.
- Go through the messages one by one and fix the broken rule. I wrote some partial automation scripts to do this as well, so once I fixed one rule, any other messages caused by that same rule were subsequently moved. Since I had already pre-sorted them, I could assume all remaining messages were valid.
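The grouping step above — fix one rule, clear every message that rule caught — can be sketched in a few lines of Python. The `X-Filter-Rule` header name and the message format here are my own assumptions for illustration, not the actual system's.

```python
from collections import defaultdict
from email import message_from_string

def triage(raw_messages):
    """Bucket raw messages by the rule that filtered them (hypothetical
    X-Filter-Rule header), so fixing one rule clears its whole bucket."""
    buckets = defaultdict(list)
    for raw in raw_messages:
        msg = message_from_string(raw)
        rule = msg.get("X-Filter-Rule", "unknown")  # default for unsorted mail
        buckets[rule].append(msg)
    return buckets

# Two submissions caught by the same (made-up) rule, one with no header:
msgs = [
    "X-Filter-Rule: OBFUSCATED_STOCKS\nSubject: portfolio update\n\nbody",
    "X-Filter-Rule: OBFUSCATED_STOCKS\nSubject: quarterly report\n\nbody",
    "Subject: no filter header at all\n\nbody",
]
buckets = triage(msgs)
print(sorted((rule, len(m)) for rule, m in buckets.items()))
# → [('OBFUSCATED_STOCKS', 2), ('unknown', 1)]
```

Once the messages are bucketed this way, repairing a single rule resolves every submission in its bucket at once, which is the "divide and conquer" payoff described below.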
My goal for false positives was to divide and conquer. I tried to automate as much as I could, but the biggest part of what I did was still separating good mail from bad mail. It didn’t help that the file system was slow, which is why I wrote the automated scripts to begin with. For the most part, I could keep up with all false positives every day. On Mondays, I usually saw between 1,500 and 2,000 submissions, and volume always declined through the week, with Friday being the lowest day. For some reason, Wednesday was often higher than Tuesday.
I became very good at looking at a message and, without even checking the headers, telling why it was filtered. Even when one of our automated filters caused the false positive, I could still tell which one it was, because each tended to hit the same types of mail.
Also, because I modified so many spam rules that were written by humans, I became very good at predicting which spam rules written by others would be effective and which were likely to cause false positives. For example, consider the word "stocks". A spammer could spell it like any of the following:
stock.s stock-s stock^s stock!s
It didn’t take me long to figure out rules-of-thumb for writing rules on obfuscated phrases. I might be tempted to write a rule like the following:

stock\Ws

That’s a regular expression where \W can be any non-letter, non-number character. This doesn’t work well because the following word is legitimate:

stock's
There are a number of nuances like that when writing regular expressions to match words. My experience with the predictive elements of false positives is something that stuck with me after I moved on from writing rules to Program Management.
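The pitfall above is easy to demonstrate in Python's `re` module. The "safer" pattern at the end is my own illustrative refinement, not necessarily the rule the team actually used:

```python
import re

# The naive rule: "stock", any non-word character, then "s".
naive = re.compile(r"stock\Ws")

obfuscated = ["stock.s", "stock-s", "stock^s", "stock!s"]
print(all(naive.search(w) for w in obfuscated))       # → True: catches the spam

# But it also fires on perfectly legitimate English:
print(bool(naive.search("the stock's price rose")))   # → True: false positive

# One possible refinement: exclude the apostrophe from the separator class.
safer = re.compile(r"stock[^\w']s")
print(all(safer.search(w) for w in obfuscated))       # → True: still catches spam
print(bool(safer.search("the stock's price rose")))   # → False: no false positive
```

Even this refinement has edge cases (smart quotes, for one), which is exactly the kind of nuance the paragraph above is talking about.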