Numbers don't lie, but they can confuse (part 1)

One of the things I do here at Microsoft is look at numbers.  I have a table of statistics that I look at, not every day, but certainly a few times per week.  It's a table of the daily number of messages we block, how many are blocked by content filtering vs. blacklists, how many messages pass through to the end user, the total number of messages, and so forth.  One of the numbers I have to derive is the number of legitimate messages seen on our network.  The formula I use is complicated and it's not perfect, but the trending data is useful.

Long-time readers will note that one of my favorite statistical measures is correlation.  I like to see how much a change in one variable affects another.  This is easy to do in Excel: given a table of figures, take two columns and type in =CORREL(A1:A120,B1:B120), assuming your data runs from row 1 to row 120.  The correlation coefficient will appear, and it tells us how closely the two series of data move together.
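
If you'd rather do the same calculation outside Excel, here is a minimal sketch in Python using NumPy.  The column names and figures are made up for illustration; they stand in for any two columns of the daily stats table.

    import numpy as np

    # Hypothetical daily figures exported from the stats table
    blocks = [1200, 1350, 1100, 1500, 1250]   # e.g. messages blocked per day
    siti   = [300,  280,  330,  250,  290]    # e.g. spam-in-the-inbox per day

    # np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
    # entry is the same value Excel's =CORREL() gives.
    r = np.corrcoef(blocks, siti)[0, 1]
    print(f"correlation coefficient: {r:.3f}")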

You can also make a scatter plot of the two variables.  Again, Excel has this feature built in: highlight both series of cells and create a scatter plot.  These are also useful for viewing relationships, but by itself a scatter plot isn't much use unless you enhance it a little bit.  You have to click the data series and add a linear regression (trend) line, and also make sure that the equation and R² value are visible.
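
Here is a rough Python equivalent of that enhanced scatter plot, using matplotlib and a NumPy least-squares fit in place of Excel's trend line.  Again, the data is hypothetical; the point is just to get the slope, intercept, and R² onto the chart.

    import numpy as np
    import matplotlib.pyplot as plt

    # Same hypothetical columns as before
    blocks = np.array([1200, 1350, 1100, 1500, 1250])
    siti   = np.array([300,  280,  330,  250,  290])

    # Fit a straight line (degree-1 polynomial) through the points
    slope, intercept = np.polyfit(blocks, siti, 1)
    r = np.corrcoef(blocks, siti)[0, 1]

    plt.scatter(blocks, siti)
    xs = np.linspace(blocks.min(), blocks.max(), 100)
    plt.plot(xs, slope * xs + intercept)
    plt.title(f"y = {slope:.3f}x + {intercept:.1f},  R² = {r**2:.3f}")
    plt.xlabel("blocks")
    plt.ylabel("SITI")
    plt.show()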

The R² value is the square of the correlation coefficient.  It tells you how much of the variance in one variable is explained by the other.  For example, if the R² is 0.073, then about 7% of the movement of variable A is "caused" by variable B (of course, in statistics, correlation does not equal causation).  We also know that the correlation coefficient is the square root of 0.073, or ±0.270.  Squaring turns a negative into a positive, so we can't recover the original sign of the correlation coefficient from R² alone, but we can see it on the scatter plot: a downward-sloping regression line means the correlation coefficient was negative.
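
As a quick numeric check of that relationship, here is a small Python snippet.  The slope value is a made-up number standing in for whatever the regression line on your chart shows; it only matters for recovering the sign.

    import math

    r_squared = 0.073
    r_magnitude = math.sqrt(r_squared)   # about 0.270

    # The sign is lost by squaring; recover it from the slope of the
    # regression line on the scatter plot (negative slope => negative r).
    slope = -0.12                        # hypothetical slope read off the chart
    r = math.copysign(r_magnitude, slope)
    print(f"R² = {r_squared}, |r| = {r_magnitude:.3f}, r = {r:.3f}")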

Scatter plots and correlation coefficients are useful for avoiding spurious correlations.  The rule of thumb I use is that the correlation coefficient multiplied by the number of observations should be at least 10; for example, a coefficient greater than 0.10 based upon 100 observations is useful.  So, if we have 50 observations and a coefficient of 0.20 (50 × 0.20 = 10), that is statistically useful, while 30 observations with a coefficient of 0.25 (30 × 0.25 = 7.5) is not significant.
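
Expressed as code, that rule of thumb might look like the sketch below.  The function name and the threshold parameter are my own labels, not anything standard; it simply encodes the coefficient-times-observations test.

    def worth_trusting(r: float, n: int, threshold: float = 10.0) -> bool:
        """Rule of thumb: |correlation coefficient| * observations >= 10."""
        return abs(r) * n >= threshold

    print(worth_trusting(0.20, 50))   # True:  0.20 * 50 = 10
    print(worth_trusting(0.25, 30))   # False: 0.25 * 30 = 7.5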

This helps me determine whether or not certain anti-spam technologies are actually useful.  Let's suppose that we have a new content filter and I observe its effect on spam-in-the-inbox, or SITI.  If someone says "Oh, this new filter is really helping out," all I need to do is plot the number of blocks by this new filter against SITI and determine the correlation coefficient.  If its magnitude is less than 0.10, then we know the statement is false, or at least that person's perception of the filter's effectiveness is spurious.  If the correlation coefficient is -0.50, then we know that the more this new filter catches, the less spam people will see in their inbox.
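
Putting the pieces together, a hypothetical evaluation of such a claim might look like this in Python.  The helper function, its threshold, and the daily counts are all invented for illustration; the logic just mirrors the reasoning in the paragraph above.

    import numpy as np

    def evaluate_filter(filter_blocks, siti, min_strength=0.10):
        """Hypothetical helper: correlate a filter's daily block counts
        against spam-in-the-inbox (SITI) and interpret the result."""
        r = np.corrcoef(filter_blocks, siti)[0, 1]
        if abs(r) < min_strength:
            return r, "no measurable effect; the perception is probably spurious"
        if r < 0:
            return r, "the more the filter catches, the less spam reaches the inbox"
        return r, "blocks and inbox spam rise together; the filter is not driving any improvement"

    # Hypothetical daily counts
    blocks = [500, 650, 700, 800, 900, 950]
    siti   = [410, 380, 360, 330, 300, 290]
    r, verdict = evaluate_filter(blocks, siti)
    print(f"r = {r:.2f}: {verdict}")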

In this way, statistics can help us measure the relationships between different parts of the system.  Given this background, in my next post I will look at some puzzling things I have discovered on our network.