I've lamented that automating analysis of something that requires a human to examine is exceedingly difficult. Still, there are a few rules of thumb when it comes to spam that I have picked up over the years.
- Use statistics in analysis, but make sure that there's room for noise.
In my previous post, I said that I like to look for occurrences that are more than three standard deviations from the average. However, this only works when you have a large enough sample size. If you have an observation set that normally runs between 1 and 2 and one day you get a 3, that can trigger an alert even though nothing meaningful has changed. There should be some minimum threshold before you start using the statistical analysis technique.
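Here is a minimal Python sketch of that idea. The function name `is_outlier` and the defaults `min_samples=14` and `min_count=50` are assumptions for illustration, not values from my scripts. It also only flags values on the high side of the average, which is a simplification.

```python
from statistics import mean, stdev

def is_outlier(history, today, min_samples=14, min_count=50, k=3.0):
    """Flag today's count if it is more than k standard deviations above the mean.

    min_samples and min_count are illustrative defaults (assumptions).
    """
    # Too little history: the standard deviation is not trustworthy yet.
    if len(history) < min_samples:
        return False
    # Minimum threshold: tiny counts (1, 2, 3...) are treated as noise.
    if today < min_count:
        return False
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation
    return today > mu + k * sigma
```

With a history that hovers between 1 and 2, a 3 is statistically "unusual" but falls under the minimum count, so it never raises an alert.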
- Allow for weekends.
Weekends are days where legitimate email drops by a factor of 3. When doing analysis, make sure that you allow for weekends by checking which day of the week the data comes from. The same is true for holidays.
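One way to allow for this, sketched below in Python, is to compare a day only against earlier days of the same kind. The `HOLIDAYS` set, the `day_type` and `comparable_history` names, and the date-to-count dictionary are all assumptions made up for this example.

```python
from datetime import date

# Hypothetical holiday list; in practice this would come from a calendar you maintain.
HOLIDAYS = {date(2024, 12, 25)}

def day_type(d):
    """Classify a date as 'off' (weekend or holiday) or 'work'."""
    # date.weekday() returns 5 for Saturday and 6 for Sunday.
    if d in HOLIDAYS or d.weekday() >= 5:
        return "off"
    return "work"

def comparable_history(daily_counts, target):
    """Return counts from days of the same type as target, oldest first.

    daily_counts maps date -> message count (assumed shape).
    """
    kind = day_type(target)
    return [
        count
        for d, count in sorted(daily_counts.items())
        if d != target and day_type(d) == kind
    ]
```

A Sunday is then measured against other weekend days and holidays rather than against a busy Tuesday.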
- Track corner cases.
I have discovered that we have a lot of email aliases that show up every single day in our outbound logs that have consistent patterns of mail marked as spam. However, these aliases are legitimate and usually are forwarding from one email address to another.
To that end, I keep a text file of corner cases to exclude. That is, I search for email addresses that have a pattern of spamming, but exclude the following exceptions...
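As a rough sketch of how such an exclusion file can be applied in Python: the one-address-per-line format with `#` comment lines, and the function names, are assumptions for illustration.

```python
def load_exclusions(lines):
    """Build a set of addresses to ignore from an iterable of text lines.

    Assumed format: one address per line; blank lines and lines
    starting with '#' are skipped.
    """
    exclusions = set()
    for line in lines:
        entry = line.strip().lower()
        if entry and not entry.startswith("#"):
            exclusions.add(entry)
    return exclusions

def filter_suspects(suspects, exclusions):
    """Drop known-legitimate aliases from a list of suspect addresses."""
    return [addr for addr in suspects if addr.strip().lower() not in exclusions]
```

Because an open file is an iterable of lines, the file handle can be passed straight to `load_exclusions`.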
- Allow for user error/noise and accept the fact you'll miss something.
When it comes to processing feedback loops, there's a lot of noise. Users regularly mark non-spam as spam. We have feedback loops with some large free email providers, and if we followed up on everything, we'd be going through billing reports, email notifications, etc. Users don't seem to be able to tell the difference between spam and their monthly billing.
To that end, we have thresholds for each feedback loop. The one above requires at least 25 instances of abuse in the past 24 hours before we take action or it shows up on our screens. Perhaps there is a legitimate case of 24 instances and our filters screen it out. I accept the fact that we'll miss that, and usually if it's a problem we'll get a ton more complaints. Spammers usually go big, not small.
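A per-loop threshold like this can be sketched in a few lines of Python. The 25-in-24-hours figure is the one mentioned above; the loop name `provider_a`, the tuple layout of a complaint, and the function name are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

# One threshold per feedback loop. "provider_a" is a made-up name.
THRESHOLDS = {"provider_a": 25}

def senders_over_threshold(complaints, loop, now, window=timedelta(hours=24)):
    """Return {sender: count} for senders at or above the loop's threshold.

    complaints is an iterable of (loop_name, sender, timestamp) tuples
    (an assumed shape, not a real feedback loop format).
    """
    cutoff = now - window
    counts = Counter(
        sender
        for name, sender, ts in complaints
        if name == loop and cutoff <= ts <= now
    )
    limit = THRESHOLDS[loop]
    return {sender: n for sender, n in counts.items() if n >= limit}
```

A sender with 24 complaints is dropped on purpose. That is the miss I am willing to accept.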
- Compress information where it makes sense.
In one of my scripts, I parse through logs and get a summary view. One thing I do is look for null senders with mail marked as spam. I don't bother to get a customer-by-customer breakdown of this data. I simply store the aggregate of all this information in the database. From my point of view, the compressed information, namely the aggregate total, is useful to me, whereas who is sending the messages is not, because at the moment I don't plan to take action on it.
If the situation changes in the future, then maybe I'd start to store it. But not before then.
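To make the aggregation concrete, here is a small Python sketch using SQLite. The record fields (`day`, `sender`, `customer`, `is_spam`), the table name `null_sender_spam`, and the choice of SQLite are all assumptions for the example, not a description of my actual logs or database.

```python
import sqlite3
from collections import Counter

def null_sender_spam_totals(records):
    """Count spam messages with a null sender per day, ignoring the customer.

    records is an iterable of dicts with 'day', 'sender', 'customer'
    and 'is_spam' keys (a hypothetical parsed-log shape).
    """
    totals = Counter()
    for r in records:
        # A null sender appears as an empty address, often written "<>".
        if r["is_spam"] and r["sender"] in ("", "<>"):
            totals[r["day"]] += 1
    return totals

def store_totals(conn, totals):
    """Persist only the per-day aggregate; no per-customer rows are kept."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS null_sender_spam "
        "(day TEXT PRIMARY KEY, total INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO null_sender_spam (day, total) VALUES (?, ?)",
        sorted(totals.items()),
    )
    conn.commit()
```

The customer field is read from the log but deliberately never written anywhere, which keeps the table at one row per day.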
Those are some of the tips I have learned. Nothing revolutionary I guess, but it's nice to have it burned into my memory so next time I don't have to start from scratch and re-learn everything.