As I was saying in my previous post, statistics, and correlation and scatter plots in particular, are excellent ways of checking whether a suspected relationship between components of the spam filter is real or spurious.

Now that I have a derived Spam-in-the-inbox value (SITI), I calculated the correlation coefficient between SITI and a number of other filter components. I wanted to see which factors affect the amount of spam getting delivered to the end-user.
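For readers who want to try the same kind of check, here is a minimal sketch of computing a Pearson correlation coefficient between SITI and other series. The column names (`delivered`, `viruses_rejected`, `siti`), the helper `correlations_with`, and all of the numbers are made up for illustration; they are not the real filter data, and this is not necessarily how the figures in this post were produced.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def correlations_with(target, table):
    """Correlate every other series in `table` against the `target` series."""
    return {
        name: pearson(series, table[target])
        for name, series in table.items()
        if name != target
    }


# Hypothetical daily figures, for illustration only (not real filter data).
daily = {
    "delivered": [100, 120, 90, 150, 130],
    "viruses_rejected": [10, 14, 8, 20, 15],
    "siti": [2.0, 2.6, 1.9, 3.1, 2.7],
}

if __name__ == "__main__":
    for name, r in correlations_with("siti", daily).items():
        print(f"{name}: {r:+.2f}")
```

A value near +1 means the two series rise and fall together, a value near -1 means one rises as the other falls, and a value near 0 means no linear relationship.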

One relationship that has a positive correlation is the number of messages we deliver to the end-user and SITI. In other words, the more messages we deliver, the higher the SITI value. Of course, this is hardly revolutionary; more messages going to users means they see more spam in their inbox.

What I am looking for is unusual relationships. The strongest positive correlation with SITI involves virus rejections: the more viruses we reject, the higher the SITI value. The correlation is quite strong, +0.48.

I find this difficult to explain. It means that when our virus filters are rejecting more viruses, we are also delivering more spam to the end user. I don’t know why this would be. One would think that during a virus blitz, the spam filters wouldn’t be affected. On the other hand, perhaps everyone else is getting infected, flipped into a botnet and sending around a new round of spam.

However, this theory is disproven by checking the relationship between total inbound mail and total viruses rejected. The correlation between those two is negative, meaning when we catch more viruses, it corresponds to a decrease in total mail (not to mention a decrease in messages caught by blacklists and decrease in messages filtered by content).

A puzzling phenomenon indeed.

"perhaps everyone else is getting infected, flipped into a botnet and sending around a new round of spam."

That theory sounds very likely indeed.

"However, this theory is disproven by checking the relationship between total inbound mail and total viruses rejected. The correlation between those two is negative"

I don’t see how the latter correlation disproves the earlier theory. Total inbound mail drops because some other providers check for viruses as you do. Now imagine that nobody checked for viruses. Total inbound mail would be up, transmission of viruses would be up, and spam would be up. Your earlier theory remains consistent.

What I meant was that when we catch more viruses, we deliver more spam to the user even though total inbound mail is down. That’s the part I found puzzling.

There are more viruses, so you catch more viruses, and people who don’t run antivirus programs catch more viruses too, and that second kind of catch leads to more spam being sent.
