# The Spam Curve – a theory

I have a theory about the nature of spam and non-spam.  In terms of legitimacy, the two sit at opposite ends of a spectrum and, for the most part, share very few characteristics.  However, there is still a small region of overlap where it is difficult to tell them apart, and that overlap is what makes filtering spam with 100% accuracy so hard.

x                                                     o
x                                                     o
xx                                                   oo
xx                                                   oo
xxx                                                 ooo
xxx                                                 ooo
xxx                                                 ooo
xxxx                                               oooo
xxxxx                                             ooooo
xxxxxx                                           oooooo
xxxxxxxx                                         oooooo
xxxxxxxxxxxxx                             ooooooooooooo
xxxxxxxxxxxxxxxxxxxxxxxxxx   oooooooooooooooooooooooooo

If the horizontal axis is a legitimacy scale, then the curve on the left is spam and the curve on the right is non-spam.  Each is concentrated at its own end of the graph, but note that both have very long tails stretching toward the middle.  In practice, then, the distribution of spam and non-spam looks more like this:

x                                 o
x                                 o
xx                               oo
xx                               oo
xxx                             ooo
xxx                             ooo
xxx                             ooo
xxxx                           oooo
xxxxx                         ooooo
xxxxxx                       oooooo
xxxxxxxx                     oooooo
xxxxxxxxxxxxx         ooooooooooooo
xxxxxxoxoxoxoxoxoxoxoxoxoxoxoxooooo

In these charts, the horizontal axis is the legitimacy of the message and the vertical axis is the number (or concentration) of messages at that level of legitimacy.  In its usual form, a message is either incredibly spammy or very legitimate.  However, there is some overlap between the two types, and the messages in this overlap are the most difficult to filter, since they share characteristics of both spam and non-spam.
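To make the overlap idea concrete, here is a minimal Python sketch of the two curves.  Everything in it is an assumption for illustration: the geometric decay shape, the 20 bins, the `decay` value, and the function names `spam_curve` and `overlap` are mine, not measurements of real mail.  It builds two mirror-image histograms over the legitimacy scale and measures how much area they share.

```python
def spam_curve(n_bins=20, decay=0.6):
    """Return (spam, non_spam) histograms over a legitimacy scale.

    Bin 0 is the least legitimate, bin n_bins - 1 the most legitimate.
    Each histogram is normalized so its bins sum to 1.
    The geometric decay is an assumed shape, chosen only for illustration.
    """
    # Spam peaks at the low-legitimacy end and tails off to the right.
    spam = [decay ** i for i in range(n_bins)]
    # Non-spam is the mirror image, peaking at the high-legitimacy end.
    non_spam = spam[::-1]
    total = sum(spam)
    spam = [w / total for w in spam]
    non_spam = [w / total for w in non_spam]
    return spam, non_spam


def overlap(spam, non_spam):
    """Shared area of two normalized histograms (0 = disjoint, 1 = identical)."""
    return sum(min(s, n) for s, n in zip(spam, non_spam))


if __name__ == "__main__":
    spam, non_spam = spam_curve()
    print(f"overlap: {overlap(spam, non_spam):.3%}")
```

With these made-up parameters the shared area is a small but nonzero fraction of each curve, which is the point of the picture: most messages are easy to classify, and the hard ones live where the two tails cross.  Pushing `decay` closer to 1 fattens the tails and enlarges the overlap.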

From here on out, I will refer to this figure as the Spam Curve.  I would like to point out that even though the two curves are drawn the same size, spam vastly outnumbers non-spam.  The figures are not drawn to scale.

I have also considered naming this curve after myself.  Perhaps the Zpam Curve (where zpam is pronounced spam) is a candidate.  Another candidate is the Spamm Curve.  A third is the Z-Spam Curve.  Maybe I should just forget the cutesy names and stick with Spam Curve.