The other half of accurate metrics

My previous post on accurate metrics covered spam-in-the-inbox (SITI).  Spam is one side of the picture; false positives (FPs) are the other.

Whereas we measure spam as a proportion of what the user sees, we can measure false positives as a proportion of the user's legitimate mail stream.  I have seen many organizations say that their FP rate is 1/250,000 messages, but that is quite vague.  Is that 250,000 total messages received, spam + nonspam, or is it 1 FP per 250,000 legitimate messages?  If it is per total messages received, then the metric gets easier to hit as spam keeps going up, because the denominator grows while a person's legitimate mail stream stays the same.
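To make the denominator problem concrete, here is a minimal sketch in Python.  The traffic numbers and the helper name fp_rate are made up for illustration; they are not measurements from any real filter.

```python
def fp_rate(false_positives: int, denominator: int) -> float:
    """Return false positives as a fraction of the chosen denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return false_positives / denominator


# Hypothetical daily traffic, chosen only for illustration.
legit_messages = 250_000
spam_messages = 2_250_000  # assumption: spam is 90% of total traffic
false_positives = 10

# Same filter, same 10 mistakes, two different denominators.
per_legit = fp_rate(false_positives, legit_messages)
per_total = fp_rate(false_positives, legit_messages + spam_messages)

print(f"1 FP per {1 / per_legit:,.0f} legitimate messages")
print(f"1 FP per {1 / per_total:,.0f} total messages")
```

With these invented numbers the same filter is 1 FP per 25,000 legitimate messages but 1 FP per 250,000 total messages, which is why the denominator has to be stated.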

That leaves false positives per legitimate message as the metric to use.  I would say that 1/100,000 should be the minimum goal to shoot for.  This corresponds to an FP rate of 0.001%.

The bonus of collecting both spam-in-the-inbox (SITI) and the FP rate is that we can plot the two metrics on a scatter plot and calculate the correlation coefficient between them to see if any trend exists (i.e., does a higher SITI correspond to a lower FP rate?).
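As a rough sketch of the correlation step, the Python below computes the Pearson correlation coefficient by hand.  The paired measurements are invented for illustration, and the function name pearson_r is my own; real data would come from whatever system records SITI and FP rates over time.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations (the 1/n factors cancel in the ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical paired measurements, one pair per reporting period.
siti_percent = [4.0, 6.0, 8.0, 10.0, 12.0]
fps_per_100k_legit = [5.0, 4.0, 3.5, 2.0, 1.0]

r = pearson_r(siti_percent, fps_per_100k_legit)
print(f"r = {r:.3f}")
```

A value of r near -1 would suggest that a higher SITI goes with a lower FP rate, a value near 0 would suggest no linear relationship, and in either case a handful of points like this is far too few to draw conclusions from.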

Once we have FP rates and SITI, we need to figure out how a poor FP rate should affect SITI.  For example, suppose we have the following:

FP rate = 1/22,000

SITI = 8%

That's a decent spam metric, but a high false positive rate.  If we baseline the FP rate to 1/100,000, how does that affect (increase) the spam-in-the-inbox number?  One way we could look at it is the following:

Baseline = 1/100,000

FP rate = 1/22,000

100,000 / 22,000 = 4.55 (rounded)

Equivalent SITI = 4.55 x 8% = 36%

That's one way of looking at it, but it assumes that tightening the FP rate by some factor raises SITI by the same factor.  That assumption is something I just pulled out of the air, and it is probably not reflective of reality.  More work needs to be done in this area to refine this model.
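For anyone who wants to play with the numbers, here is the same back-of-the-envelope calculation as a small Python function.  The name equivalent_siti and its parameters are my own, and the proportional scaling is exactly the unverified assumption described above, not an established model.

```python
def equivalent_siti(siti: float, fp_rate: float,
                    baseline_fp_rate: float = 1 / 100_000) -> float:
    """Scale SITI by how far the observed FP rate is from the baseline.

    ASSUMPTION: SITI rises in direct proportion as the FP rate is
    tightened to the baseline. This is a guess, not a measured relation.
    """
    if fp_rate <= 0 or baseline_fp_rate <= 0:
        raise ValueError("FP rates must be positive")
    scale = fp_rate / baseline_fp_rate  # e.g. (1/22,000) / (1/100,000)
    return siti * scale


# The example from the text: SITI of 8% with an FP rate of 1/22,000.
result = equivalent_siti(siti=0.08, fp_rate=1 / 22_000)
print(f"Equivalent SITI = {result:.0%}")
```

Nothing in this formula caps the result at 100%: a SITI of 30% with the same FP rate would come out above 100%, which is one more sign that the proportional assumption is too crude.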