When we tackled the problem of outbound spam, one of the biggest challenges was automating the analysis.
Let's say we sign up for feedback loops (FBLs). Many of these FBLs contain a lot of noise, so it often takes a human to look at each report and verify that the message is, indeed, actual outbound spam and not, say, someone reporting their telco bill as spam.
How do we parse through logs and look for outbound spammers? We can sometimes see that an IP exceeds its historical outbound mail count. But how do you know whether a particular email address, say email@example.com, with a lot of messages flagged as spam, is actually spamming rather than sending legitimate mail? Suppose we go further and parse the logs to decode the action taken on each message (so we can see which antispam rules it hit). Even then, it takes somebody experienced with antispam techniques to make the judgment call that the outbound customer in question is spamming and that the filter is not generating false positives.
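As a starting point, the first signal above (an IP exceeding its historical outbound mail count) can be automated with a simple baseline comparison. The sketch below is only an illustration: the log format, the field names, and the threshold multiplier are all assumptions, not any real mail server's output.

```python
from collections import Counter

# Hypothetical log format (one line per outbound message):
#   2024-05-01T10:00:00 ip=203.0.113.7 from=a@example.com action=pass
# Both the format and the 3x threshold below are assumptions for illustration.

def parse_line(line):
    """Extract ip, sender, and filter action from one log line."""
    fields = dict(tok.split("=", 1) for tok in line.split()[1:])
    return fields["ip"], fields["from"], fields["action"]

def flag_anomalies(log_lines, historical_avg, threshold=3.0):
    """Return IPs whose outbound count exceeds threshold x their historical daily average."""
    counts = Counter()
    for line in log_lines:
        ip, _sender, _action = parse_line(line)
        counts[ip] += 1
    return {
        ip: n
        for ip, n in counts.items()
        if n > threshold * historical_avg.get(ip, 1)
    }

logs = [
    "2024-05-01T10:00:00 ip=203.0.113.7 from=a@example.com action=pass",
    "2024-05-01T10:00:05 ip=203.0.113.7 from=a@example.com action=spam-rule-17",
    "2024-05-01T10:00:09 ip=203.0.113.7 from=a@example.com action=spam-rule-17",
    "2024-05-01T10:00:12 ip=198.51.100.2 from=b@example.com action=pass",
]
print(flag_anomalies(logs, historical_avg={"203.0.113.7": 0.5, "198.51.100.2": 2.0}))
# → {'203.0.113.7': 3}
```

A check like this only surfaces candidates; as noted above, deciding whether a flagged sender is truly spamming still takes someone who understands the antispam rules the messages hit.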
This comes down to what I refer to as "the automation of analysis." We can generate data, but how do we analyze it? Doing so manually is easy enough for a few days or a few users, but the problem is one of scale: with millions of users, we simply can't dedicate someone full time to performing this analysis every day.
In the next couple of posts, I'll propose some solutions to the problem of analysis automation. Gathering data is one thing; scaling up the analysis so that it is automated (or requires minimal human intervention) is another.