Lately, I have tasked myself with the enormous challenge of automating (or at least partially automating) the processing of all of our false positive submissions.
For years I assumed that false positive processing would always require manual work. Because we get such a high volume of FP submissions that are actually spam (ie, stock spam, pharmaspam, image spam, adverts, etc), we could not simply assume that every submission was legitimate. In fact, it was more like the other way around: false positive submissions were more likely to be illegitimate than not. This was what made it so tedious: FPs had to be sorted manually to separate the wheat from the chaff, and only then could they be processed for real (ie, adjusted so that they don't get hit anymore).
Even within non-spam submissions, what constitutes a real false positive? Is a dirty joke a legit FP submission? We filter business mail, so should we count jokes at all? It's a tough question because how we automate it could throw off our stats. If we assume that everything quasi-legit is legit, even though we don't really consider dirty jokes valid FPs, then our numbers look a little worse than they actually are.
But let's leave aside the question of jokes and advertisements for now. What I want to know is whether it is possible to partially automate the false positive process. Can it at least be streamlined? I think the answer is "yes, but it's not easy."
Disclaimer: I'm not that clueless. I had already written a lot of scripts to move messages around based on content, but I need a better way to move and categorize messages without having to examine them afterwards.
Recall that a few months ago I posted a series on the spam theorems, and that most email is either mostly clean or mostly dirty. When it comes to people who send mail, most of them send either very high volumes of spam or very low volumes of it. So, in theory, if I examined the historical traffic patterns of sending IPs, I could discard submissions from very dirty IPs and set aside those from very clean ones in a special location without ever having to sort them manually.
The problem is that this process, filtering out the very dirty IPs and setting aside the very clean ones, only disposes of around 20-30% of false positive submissions. I still have a ton of garbage to weed through. While gray senders (IPs whose historical traffic is 10% to 89% spam) make up only around 10% of our overall mail stream, they comprise 70-80% of false positive submissions. That makes it very difficult to guess, based on a sender's historical mailing pattern, whether or not I can move a submission around without looking at it manually.
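The triage above can be sketched in a few lines. This is a minimal illustration, not the production code: the function name and interface are made up, and the only numbers taken from the text are the gray-sender boundaries (10% to 89% historical spam).

```python
def classify_sender(spam_count: int, total_count: int) -> str:
    """Bucket a sending IP by its historical spam ratio.

    Cut-offs follow the gray-sender definition in the text:
    under 10% spam is "clean", 10%-89% is "gray", 90%+ is "dirty".
    """
    if total_count == 0:
        return "unknown"  # no history at all: still needs a human
    ratio = spam_count / total_count
    if ratio < 0.10:
        return "clean"   # set aside as a likely real false positive
    if ratio < 0.90:
        return "gray"    # ambiguous: still needs manual sorting
    return "dirty"       # almost certainly spam; discard

print(classify_sender(2, 100))   # clean
print(classify_sender(50, 100))  # gray
print(classify_sender(99, 100))  # dirty
```

The awkward part, as noted above, is that the middle bucket is exactly where most FP submissions land, so this alone doesn't save much sorting.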
The algorithm I am using (ie, inventing from scratch) requires a certain spam threshold, a certain number of legit messages per day, a certain number of spam messages per day, and a certain number of days for which we have traffic. I then tweak the parameters, copy the matching submissions to a special subfolder, and quickly glance through them. On top of that, we have different categories of false positive submissions (based on the technology that flagged them as spam), and so far I have only managed to work through 3/4 of one subcategory. Clearly, this process is going to take longer than I had originally thought.
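To make the four parameters concrete, here is a hedged sketch of that filter. Every threshold value, field name, and the direction of the per-day spam check are illustrative assumptions; the text only says such parameters exist and get tweaked.

```python
from dataclasses import dataclass

@dataclass
class SenderHistory:
    days_observed: int     # days of traffic we have for this IP
    legit_per_day: float   # average legit messages per day
    spam_per_day: float    # average spam messages per day

    @property
    def spam_ratio(self) -> float:
        total = self.legit_per_day + self.spam_per_day
        return self.spam_per_day / total if total else 0.0

def auto_accept(h: SenderHistory,
                max_spam_ratio: float = 0.10,    # assumed spam threshold
                min_legit_per_day: float = 50.0, # assumed legit-volume floor
                max_spam_per_day: float = 5.0,   # assumed spam-volume ceiling
                min_days: int = 30) -> bool:     # assumed history length
    """True if a submission from this sender can be copied to the
    'probably legit' subfolder without manual inspection."""
    return (h.days_observed >= min_days
            and h.legit_per_day >= min_legit_per_day
            and h.spam_per_day <= max_spam_per_day
            and h.spam_ratio <= max_spam_ratio)

print(auto_accept(SenderHistory(60, 200, 1)))  # True
print(auto_accept(SenderHistory(5, 200, 1)))   # False: too little history
```

Tweaking the parameters then becomes a matter of re-running the filter with different arguments and eyeballing what lands in the subfolder.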
But, when all is said and done, I think this is going to be worth it: if we can reduce FP sorting time by 80%, that frees up a ton of time to work on other stuff.