The Spam Curve - a theory

I have a theory about the nature of spam and non-spam.  In terms of legitimacy, the two sit at opposite ends of a spectrum and, for the most part, share very few characteristics.  However, there is still a small proportion of overlap, which makes some messages hard to tell apart and makes 100% accurate spam filtering difficult to reach.

x o
x o
xx oo
xx oo
xxx ooo
xxx ooo
xxx ooo
xxxx oooo
xxxxx ooooo
xxxxxx oooooo
xxxxxxxx oooooo
xxxxxxxxxxxxx ooooooooooooo
xxxxxxxxxxxxxxxxxxxxxxxxxx oooooooooooooooooooooooooo

If the horizontal axis is a legitimacy scale, then the curve on the left is spam and the curve on the right is non-spam. Each is concentrated at its own end of the scale, but note that each curve has a very long tail. Because those tails reach toward each other, the distribution of spam and non-spam looks more like this in practice:

x o
x o
xx oo
xx oo
xxx ooo
xxx ooo
xxx ooo
xxxx oooo
xxxxx ooooo
xxxxxx oooooo
xxxxxxxx oooooo
xxxxxxxxxxxxx ooooooooooooo
xxxxxxoxoxoxoxoxoxoxoxoxoxoxoxooooo

In these charts, the horizontal axis is the legitimacy of the message and the vertical axis is the number (or concentration) of messages at that level. Most spam is incredibly spammy and most non-spam is very legitimate. However, the two types overlap in the middle, and the messages in that overlap are the hardest to filter because they share characteristics of both spam and non-spam.

From here on out, I will refer to this figure as the Spam Curve. Note that even though the two curves are drawn the same size, spam vastly outnumbers non-spam. The figures are not drawn to scale.
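
For anyone who wants to play with the idea, here is a rough Python sketch of the Spam Curve. Everything in it is my own assumption for illustration, not a measurement: the 0-to-1 legitimacy score, the Beta distributions used to shape the two curves, the 9-to-1 spam to non-spam ratio, and the function names simulate_scores and count_errors. It draws scores for both kinds of mail and counts what lands on the wrong side of a cutoff.

```python
import random


def simulate_scores(n_spam, n_ham, seed=0):
    """Draw hypothetical legitimacy scores in [0, 1] for spam and non-spam.

    The Beta shapes are assumptions chosen only so that spam piles up near 0,
    non-spam piles up near 1, and both have long tails toward the middle.
    """
    rng = random.Random(seed)
    spam = [rng.betavariate(2, 8) for _ in range(n_spam)]
    ham = [rng.betavariate(8, 2) for _ in range(n_ham)]
    return spam, ham


def count_errors(spam, ham, threshold):
    """Treat scores below the threshold as spam and count both kinds of mistake."""
    missed_spam = sum(1 for s in spam if s >= threshold)  # spam that gets delivered
    blocked_ham = sum(1 for h in ham if h < threshold)    # good mail that gets blocked
    return missed_spam, blocked_ham


if __name__ == "__main__":
    # Assumed 9:1 ratio to reflect spam outnumbering non-spam.
    spam, ham = simulate_scores(n_spam=9000, n_ham=1000)
    for threshold in (0.3, 0.5, 0.7):
        missed, blocked = count_errors(spam, ham, threshold)
        print(f"threshold={threshold}: missed spam={missed}, blocked non-spam={blocked}")
```

Moving the cutoff only trades one kind of mistake for the other. As long as the tails overlap, no single threshold on this scale separates the two groups perfectly.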

I have also considered naming this curve after myself. One candidate is the Zpam Curve (where Zpam is pronounced "spam"). Another is the Spamm Curve. A third is the Z-Spam Curve. Maybe I should just forget the cutesy names and stick with Spam Curve.