A few months ago I posted something called the Spam Curve. The essence of the post was that most mail content was either very spammy or very clean. If this could be represented on a chart, the chart would look like a letter U. Most content would fall on the extreme sides of the chart.
I did some traffic analysis on our own data for Dec 14, 2006. One day’s worth of data is not a lot to hang your hat on, but it’s a start. In my research, I checked our historical database for all IPs sending us mail and how we marked that mail (i.e., as spam or non-spam). It turns out that the majority of our senders are either sending us no spam (0% of the messages they sent were marked as spam) or sending us nothing but spam (100% of the messages were marked as spam).
These statistics do not account for volume, so every IP address carries equal weight. This means that a zombie we missed that sent us a single message would be scored at 0% spam. Still, nearly 1/3 of all of our senders were sending us 0% spam, and just over a third were sending us 100% spam. If I widen the bands to less than 10% spam and over 90% spam, fully 1/3 of our senders are sending less than 10% spam and 59% are sending over 90% spam. Together, they account for 93% of the individual IPs that are sending us mail.
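To make the calculation concrete, here is a minimal Python sketch of the per-IP tally described above. The record format, the function names, and the sample addresses are all assumptions made up for illustration; this is not our actual database schema or code.

```python
from collections import defaultdict

def spam_fraction_by_ip(records):
    """records: iterable of (ip, is_spam) pairs, one pair per message (assumed format)."""
    totals = defaultdict(int)
    spam = defaultdict(int)
    for ip, is_spam in records:
        totals[ip] += 1
        if is_spam:
            spam[ip] += 1
    # Fraction of each IP's messages that were marked as spam.
    return {ip: spam[ip] / totals[ip] for ip in totals}

def sender_shares(fractions):
    """Share of IPs below 10% spam, above 90% spam, and in between.
    Every IP counts once, regardless of how much mail it sent."""
    n = len(fractions)
    low = sum(1 for f in fractions.values() if f < 0.10)
    high = sum(1 for f in fractions.values() if f > 0.90)
    return {"low": low / n, "high": high / n, "middle": (n - low - high) / n}

# Made-up sample data using documentation-only addresses.
records = [
    ("192.0.2.1", False), ("192.0.2.1", False),
    ("192.0.2.2", True),
    ("192.0.2.3", True), ("192.0.2.3", False),
    ("192.0.2.4", False),  # a single unmarked message: a missed zombie would look like this
]
fractions = spam_fraction_by_ip(records)
shares = sender_shares(fractions)
print(shares)
```

Note that the last sender lands in the "low" band on the strength of one message, which is exactly the equal-weight caveat mentioned above.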
There is also a spike at 50% spam. In other words, other than the large jumps at the ends, the next largest group is senders whose mail is 50% spam and 50% non-spam. It seems strange that there would be so many borderline mail senders, but my guess would be that these are newsletters.
The next piece of research to conduct is to determine what volume of mail these IPs are sending us and see how it stacks up.
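One possible way to set up that comparison is sketched below in Python: the same histogram is built twice, once counting each IP once and once counting each IP by the number of messages it sent. The input shape, the function name, the ten-point buckets, and the sample numbers are assumptions for illustration only, not results from our data.

```python
from collections import Counter

def spam_histogram(per_ip, weighted=False):
    """per_ip: dict of ip -> (spam_count, total_count), with total_count > 0 (assumed format).
    Returns {bucket: share}, where bucket b covers spam rates from b*10% up to
    (but not including) (b+1)*10%, and bucket 10 holds senders at exactly 100%."""
    buckets = Counter()
    for spam, total in per_ip.values():
        bucket = (10 * spam) // total
        # Weighted: each IP counts by its message volume. Unweighted: each IP counts once.
        buckets[bucket] += total if weighted else 1
    denom = sum(buckets.values())
    return {b: c / denom for b, c in sorted(buckets.items())}

# Made-up sample data using documentation-only addresses.
per_ip = {
    "198.51.100.1": (0, 1),      # one clean message
    "198.51.100.2": (0, 50),     # small clean sender
    "198.51.100.3": (900, 900),  # high-volume spammer
    "198.51.100.4": (25, 50),    # half and half
}
by_ip = spam_histogram(per_ip)
by_volume = spam_histogram(per_ip, weighted=True)
print(by_ip)
print(by_volume)
```

In this toy data, half of the IPs fall in the cleanest bucket, but once volume is taken into account most of the mail comes from the single 100% spam sender. Whether the real data shifts the same way is the open question.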
One final note: the above reports do not account for mail that we drop at the edge due to sender blocks.