The nuances of measuring spam effectiveness, part 4

I'd like to continue my series on measuring the effectiveness of a spam filter.  The requirements I have so far are that it has to be continuous, statistically relevant, automated, and transparent to end users.  There is one more requirement.

Measuring spam effectiveness requires a solution that scales.

This is very closely related to my requirement for statistical relevance.  But whereas statistical relevance is about the mathematics, scaling is about clever engineering.

Let's suppose that, like most effectiveness measurements, you decide to measure 500 messages per day.  Wow, a whole 500 messages.  Then, someone like me comes along and says "500 is not enough.  I want you to capture 150,000 messages."  That's 300x as much mail as the simple solution handles.  Can you do it?

In order to scale, the necessary infrastructure must be built in from the start.  There are a number of knobs and dials that can be adjusted (a sketch of what the configuration might look like follows the list):

  1. Increase the number of IPs you sample - the more IPs, the wider the distribution of mail.
  2. Increase the amount of mail you sample per IP - more mail per IP also widens the distribution.
  3. Increase the rate at which mail is randomly sampled - for example, there should be a setting somewhere such that 10% of the mail going through a random mail host is sampled and copied to the effectiveness network, which operates in parallel with production.
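
To make those knobs concrete, here is a minimal sketch of such a configuration in Python.  The names (SamplingConfig, sample_ip_count, max_messages_per_ip, sample_rate) are hypothetical, not from any real system:

    # Hypothetical configuration object for the three sampling knobs above.
    from dataclasses import dataclass

    @dataclass
    class SamplingConfig:
        sample_ip_count: int      # knob 1: how many IPs to sample from
        max_messages_per_ip: int  # knob 2: per-IP sampling budget
        sample_rate: float        # knob 3: fraction of mail copied, e.g. 0.10

    # Values would be loaded from a config file so they can be changed
    # without redeploying anything.
    config = SamplingConfig(
        sample_ip_count=200,
        max_messages_per_ip=1000,
        sample_rate=0.10,
    )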

That's simple enough: the ability to dynamically float the volume of sampled mail up or down should be controlled by a config file.  But secondly, there should be some throttles.  For example, there could be a maximum limit: if we sample 10% of all mail, we capture everything that sampling selects until the sampled stream reaches 5 messages per second (~400,000 messages per day).  Once we hit that rate, we start dropping messages rather than relaying them through the network.
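
As a rough sketch of that throttle, assuming one decision per message and a one-second counting window, a simple rate gate looks like this.  The function name maybe_sample() is my own; the constants mirror the numbers above:

    import random
    import time

    SAMPLE_RATE = 0.10    # fraction of mail randomly copied
    MAX_PER_SECOND = 5    # hard cap, roughly 400,000 messages per day

    _window_start = time.monotonic()
    _window_count = 0

    def maybe_sample(message) -> bool:
        """Return True if this message should be copied to the effectiveness network."""
        global _window_start, _window_count
        if random.random() >= SAMPLE_RATE:
            return False                    # not selected by the random sample
        now = time.monotonic()
        if now - _window_start >= 1.0:      # start a new one-second window
            _window_start, _window_count = now, 0
        if _window_count >= MAX_PER_SECOND:
            return False                    # over the throttle: drop, don't relay
        _window_count += 1
        return True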

In other words, the amount of mail that can flow through the network should be configurable, and there should be maximum throttles so that the network is never overwhelmed with too much mail.  Because mail is still being randomly sampled, this should neither pose a problem nor throw off the results.
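
Tying it together, the production mail path stays trivial.  Here deliver() and relay_to_effectiveness_network() are hypothetical stand-ins for whatever the real mail host does; the point is that sampling is a side effect that never blocks delivery:

    def handle_message(message):
        deliver(message)                             # normal production path, untouched
        if maybe_sample(message):                    # from the sketch above
            relay_to_effectiveness_network(message)  # parallel copy for measurement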

That's the basic summary of what I can think of.  It's non-trivial to build and measure all of this stuff, but the end game is a network that accurately mimics a production environment and can be used to try out new things.