Numbers don't lie, but they can confuse (part 2)

12/13/2007

As I was saying in my previous post, statistics, and correlation coefficients and scatter plots in particular, are an excellent way of checking whether a suspected relationship between components of the spam filter is real or spurious.

Now that I have a derived Spam-in-the-inbox (SITI) value, I calculated the correlation coefficient between pairs of filter components, SITI among them. I wanted to see which factors affect the amount of spam getting delivered to the end user.
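For readers who want to try this on their own numbers, here is a minimal sketch of a correlation coefficient calculation. It assumes the Pearson coefficient is the one meant (the post does not say which), and the function name `pearson` and the daily figures are invented purely for illustration; they are not real filter data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Made-up daily figures, for illustration only (not real filter data).
delivered = [100, 120, 90, 150, 130]   # messages delivered per day
siti = [2.0, 2.6, 1.9, 3.1, 2.5]       # hypothetical SITI value per day

r = pearson(delivered, siti)
print(f"r = {r:+.2f}")
```

The result always falls between -1 and +1. A value near +1 means the two series rise and fall together, a value near -1 means one rises as the other falls, and a value near 0 means there is little linear relationship.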

One relationship that has a positive correlation is the number of messages we deliver to the end-user and SITI. In other words, the more messages we deliver, the higher the SITI value. Of course, this is hardly revolutionary; more messages going to users means they see more spam in their inbox.

What I am looking for is unusual relationships. The strongest positive correlation with SITI is virus rejections. Properly interpreted, the more viruses we reject, the higher the SITI value. The correlation is quite strong, +0.48.

I find this difficult to explain. Taken at face value, it means that when our virus filters are catching more, we are delivering more spam to the end user. I don't know why this would be. One would think that during a virus blitz, the spam filters wouldn't be affected. On the other hand, perhaps everyone else is getting infected, flipped into a botnet and sending out a new round of spam.

However, this theory is contradicted by the relationship between total inbound mail and total viruses rejected. If newly infected machines were pumping out spam, total mail should rise along with viruses. Instead, the correlation between those two is negative, meaning that when we catch more viruses, it corresponds to a decrease in total mail (not to mention a decrease in messages caught by blacklists and a decrease in messages filtered by content).
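The check described here can be sketched as a small pairwise table: compute the coefficient for every pair of series, then compare the sign of one pair against what the theory predicts. Everything in this sketch is an assumption for illustration, including the series names, the helper `correlation_table`, and the numbers, which were invented to mimic the pattern described above and are not real filter data.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_table(series):
    """Return {(name_a, name_b): r} for every pair of named series."""
    return {
        (a, b): pearson(series[a], series[b])
        for a, b in combinations(series, 2)
    }

# Invented daily figures, for illustration only.
series = {
    "viruses_rejected": [10, 40, 15, 60, 25],
    "total_inbound": [900, 700, 850, 600, 800],
    "siti": [2.0, 2.9, 2.2, 3.4, 2.4],
}

table = correlation_table(series)
for (a, b), r in table.items():
    print(f"{a} vs {b}: {r:+.2f}")

# The botnet theory predicts that viruses and total inbound mail rise
# together, i.e. a positive coefficient for this pair.
consistent = table[("viruses_rejected", "total_inbound")] > 0
print("consistent with botnet theory:", consistent)
```

With these invented numbers the viruses-versus-inbound coefficient comes out negative, so the flag is false, which is the same shape of result that undercuts the theory in the text.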

A puzzling phenomenon indeed.
