The nuances of measuring spam effectiveness

Story time.

A couple of years ago, I was tasked with coming up with a mechanism for measuring how good our spam filters were.  At the time, we had a rough idea of where we stood.  We could kind of tell by looking at abuse statistics; if more people were submitting spam complaints, it meant we were getting worse.  If fewer people complained, it meant we were doing better.

The problem with this is that the evidence is strictly anecdotal.  The spam team had a relatively reliable, if intuitive, feel for how things were going, but it was rare that anybody ever trusted our gut feelings (even though we were rarely wrong).  So, that meant building a mechanism to measure spam effectiveness automatically.

Since I was the one designing this process, I set a few simple requirements:

  1. It had to be ongoing.
  2. The method had to be automated.
  3. It could not interfere with legitimate mail flow.
  4. It had to be statistically relevant.
  5. It had to scale.

Each of these requirements was an adventure in and of itself.  Let's take a look at each one.
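To make those requirements concrete, here's a minimal sketch of what a single measurement pass might look like.  Everything in it is a hypothetical stand-in: the spam trap feed, the filter verdict, and the 95% catch rate are invented for illustration, not a description of the real system.

    import random
    import time

    def spam_trap_feed(n):
        """Hypothetical stand-in for a feed of known-spam messages (a spam trap)."""
        return [f"known-spam-{i}" for i in range(n)]

    def filter_says_spam(message):
        """Hypothetical stand-in for the real filter's verdict."""
        return random.random() < 0.95  # pretend ~95% catch rate

    def measure_once(sample_size=1000):
        """One automated pass: run known spam through the filter, return the catch rate."""
        caught = sum(filter_says_spam(m) for m in spam_trap_feed(sample_size))
        return caught / sample_size

    if __name__ == "__main__":
        # Requirements 1 and 2: ongoing and automated -- run this from a
        # scheduler (cron, a service) rather than by hand.
        # Requirement 3: only known spam is sampled, so legitimate mail flow
        # is never touched.
        # Requirement 4: sample_size controls how statistically relevant each
        # pass is.
        # Requirement 5: passes are independent, so they can be fanned out
        # across many machines.
        rate = measure_once()
        print(f"{time.strftime('%Y-%m-%d %H:%M')}  catch rate: {rate:.1%}")

The design choice that carries most of the weight is requirement 3: by sampling a feed of known spam rather than live customer mail, the measurement can run constantly without any risk to legitimate messages.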

Measuring effectiveness requires constant monitoring

One of the drawbacks of the studies I see today is that a study is usually done over a window of time, the results are published, and then those results float around for a very long time.  While this makes a good snapshot in time, it is not representative of the real world.  Spam changes; it morphs, it blends, and it is not constant.  Not only do you need to be able to respond quickly to new spam outbreaks, you must also be able to measure when your effectiveness slides.

In stock investing, the art of looking at stock charts is called technical analysis.  You wouldn't be able to get an accurate gauge of a stock's performance simply by looking at one day's worth of data.  You must look at multiple time frames to see what something is doing.  Even if you just look at a stock's fundamentals, you need to compare them against their prior history.  Is the stock making more or less money per quarter than in previous quarters?  Is growth increasing or decreasing?  Relative performance is crucial.
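Translating the analogy back to spam filtering, here's a minimal sketch of detecting a slide by comparing a short recent window of catch rates against a longer baseline.  The daily rates and the one-point alert threshold are invented for illustration.

    def moving_average(values, window):
        """Average of the last `window` values."""
        tail = values[-window:]
        return sum(tail) / len(tail)

    # Hypothetical daily catch rates (percent), oldest first,
    # with a slide at the end.
    daily_catch_rates = [95.1, 95.3, 94.9, 95.2, 95.0, 94.8,
                         94.1, 93.5, 92.8, 92.2]

    baseline = moving_average(daily_catch_rates, window=len(daily_catch_rates))
    recent = moving_average(daily_catch_rates, window=3)

    # If the recent window trails the baseline by more than a point,
    # something changed: a new outbreak, a broken rule, a bad deploy.
    if baseline - recent > 1.0:
        print(f"Effectiveness slide: recent {recent:.1f}% vs baseline {baseline:.1f}%")

The exact windows and thresholds matter less than the principle: you are always comparing the filter against its own history, never against a one-time snapshot.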

With a spam filter, the same logic applies: a historical database of the filter's effectiveness is key.  While this may seem obvious to anyone who wants to measure something, we need to think about the bigger picture (I'll sketch one possible storage layout after this list):

  • Where do you store the data?
  • How do you display it?
  • Who has access to it?
  • How long do you store it for?
  • Can you drill down deeper into the data and extrapolate trends?
  • Can you publish the data?
  • What kind of redundancy should the servers that store the data have?  If you lose the data, is that critical?
  • What does the data ultimately mean?  If a dip occurs, who explains why it dipped?  Should that explanation be automated?  Interpreted by a human?
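None of those questions answer themselves, but to make the storage side concrete, here is a minimal sketch of what an effectiveness store might look like, using SQLite purely for illustration.  The table, column names, and sample row are assumptions for the sketch, not the system we actually built.

    import sqlite3

    conn = sqlite3.connect(":memory:")  # a real store would live on redundant servers
    conn.execute("""
        CREATE TABLE IF NOT EXISTS catch_rate (
            measured_at TEXT NOT NULL,    -- ISO-8601 timestamp of the pass
            filter_name TEXT NOT NULL,    -- which filter or rule set was tested
            sample_size INTEGER NOT NULL, -- messages tested in the pass
            caught      INTEGER NOT NULL  -- messages flagged as spam
        )
    """)

    # Record one measurement pass (values invented for illustration).
    conn.execute(
        "INSERT INTO catch_rate VALUES (?, ?, ?, ?)",
        ("2011-04-01T12:00:00", "edge-filter", 1000, 947),
    )
    conn.commit()

    # Drill-down: weekly catch rate per filter, oldest first -- the kind of
    # query that lets you extrapolate trends or explain a dip.
    for row in conn.execute("""
        SELECT filter_name,
               strftime('%Y-%W', measured_at) AS week,
               100.0 * SUM(caught) / SUM(sample_size) AS pct
        FROM catch_rate
        GROUP BY filter_name, week
        ORDER BY week
    """):
        print(row)

Storing raw counts rather than precomputed percentages keeps the drill-down honest: you can re-aggregate by day, week, or filter after the fact instead of being stuck with whatever granularity you picked up front.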

Once you decide that you are going to continuously monitor your data, you have to maintain the hardware and software that does that monitoring.  There is a cost associated with that, both in people and in machines.  Being prepared to assume that cost is one of the things that must be accounted for when you decide to measure something and want access to that data to be global.

In other words, just one guy shouldn't be collecting it.