Using statistics to aid in analysis

One of the tools I like to use when automating human analysis of logs is statistics.  How do you detect anomalies using statistical theory?

When poring over logs, tables and stats, the one thing we need to realize is that we don't have to confirm that things are normal; instead, we search for things that are abnormal.  When I look for abnormal things, I like to find things that are seriously out of the ordinary.  To do this, I make some assumptions about the nature of the data.

Suppose I were looking for a customer that had gotten compromised and started sending outbound spam.  One case to search for is a sudden, rapid increase in outbound mail delivered from an IP.  To detect this:

  • Get a daily average of the IP's sending history
  • Find the standard deviation
  • Look for any days where the sending history exceeds the average + 3 standard deviations
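The steps above can be sketched in a few lines of Python.  This is a minimal illustration of the technique, not the author's actual tooling; the function name and the sample numbers are made up for the example.

```python
import statistics

def find_anomalies(daily_counts, sigmas=3):
    """Flag days whose count exceeds the mean + 3 standard deviations.

    daily_counts: list of per-day outbound message counts for one IP.
    Returns (day_index, count) pairs worth investigating.
    """
    mean = statistics.mean(daily_counts)
    stdev = statistics.pstdev(daily_counts)  # population standard deviation
    threshold = mean + sigmas * stdev
    return [(day, count) for day, count in enumerate(daily_counts)
            if count > threshold]

# Hypothetical sending history: steady volume, then a spike on the last day.
history = [120] * 13 + [950]
print(find_anomalies(history))  # the spike day is flagged
```

Note that the spike itself inflates both the mean and the standard deviation, so a very short history can mask its own outliers; in practice you want a reasonably long baseline window.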

For normally distributed data, roughly 99.7% of values fall within three standard deviations of the mean.  If we see an IP with sending history that exceeds this range, then we have detected an anomaly.  Why would an IP suddenly start sending so much mail that it exceeds its average by a wide margin?  We can't necessarily take action on this automatically (perhaps it is a newsletter) but we can flag it for investigation.

I refine this process a little by taking timing into account.  I exclude weekends from the average of the IP's history because legitimate traffic on weekends declines by at least a factor of 3.  However, when looking for days that exceed the average, I do include them.  I have found that this technique is very useful for alerting me to anomalies.
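That refinement — build the baseline from weekdays only, but test every day including weekends — might look like this.  Again a hypothetical sketch; the function name and sample data are invented for illustration.

```python
import statistics
from datetime import date, timedelta

def weekday_baseline_anomalies(counts_by_date, sigmas=3):
    """Compute the baseline from weekdays only, but test every day.

    counts_by_date: dict mapping datetime.date -> outbound message count.
    Weekend traffic legitimately drops, so including it would drag the
    average down and widen the deviation; a weekday-only baseline is tighter.
    """
    weekday_counts = [c for d, c in counts_by_date.items() if d.weekday() < 5]
    mean = statistics.mean(weekday_counts)
    stdev = statistics.pstdev(weekday_counts)
    threshold = mean + sigmas * stdev
    # Check every day, weekends included, against the weekday baseline.
    return sorted(d for d, c in counts_by_date.items() if c > threshold)

# Two weeks of hypothetical data: weekdays ~100, weekends ~30,
# except one Saturday where a compromised host spikes to 500.
counts = {}
start = date(2024, 1, 1)  # a Monday
for i in range(14):
    d = start + timedelta(days=i)
    counts[d] = 100 if d.weekday() < 5 else 30
counts[date(2024, 1, 6)] = 500
print(weekday_baseline_anomalies(counts))  # the spiking Saturday is flagged
```

A weekend spike like this would slip past a naive all-days average, because the quiet weekends pull the mean down and the variance up; the weekday-only baseline catches it.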