Leveraging Azure Diagnostics to Troubleshoot Intermittent Problems

One of the practices I recommend for Windows Azure services (and really for any service) is self-monitoring: detecting problems that may not be fatal but that indicate a serious problem is developing.

I happened to run across a good example of how to do this on the Azure Miscellany blog.

In this case, the writer, Joe, had a web role whose CPU usage spiked under certain circumstances that arose only in production.  Of course, the problem occurred only intermittently and couldn't be reproduced in the development fabric (all the really gnarly problems manifest like this).

In the post, Joe describes how he added code to his web role to monitor its CPU usage.  When usage went above 90%, the monitoring code triggered a crash dump of the role, which he could then analyze offline to identify where the spike was happening.
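To make the idea concrete, here is a minimal sketch of that kind of threshold monitor, written in Python rather than the .NET code a real role would use. It is not Joe's implementation. The class name, the `on_spike` callback (standing in for whatever triggers the crash dump), and the requirement that usage stay high for several consecutive samples are all my own assumptions for illustration.

```python
from typing import Callable, Iterable


class CpuSpikeMonitor:
    """Calls on_spike once when CPU usage stays above a threshold.

    Hypothetical helper: on_spike stands in for "trigger a crash dump".
    """

    def __init__(self, threshold: float, sustained_samples: int,
                 on_spike: Callable[[float], None]) -> None:
        self.threshold = threshold
        self.sustained_samples = sustained_samples
        self.on_spike = on_spike
        self._consecutive = 0
        self._fired = False

    def observe(self, cpu_percent: float) -> bool:
        """Record one sample; return True if this sample fired the action."""
        if cpu_percent > self.threshold:
            self._consecutive += 1
        else:
            # Usage dropped back down: reset so a later spike can fire again.
            self._consecutive = 0
            self._fired = False
        if self._consecutive >= self.sustained_samples and not self._fired:
            self._fired = True
            self.on_spike(cpu_percent)
            return True
        return False


def run_monitor(samples: Iterable[float], monitor: CpuSpikeMonitor) -> int:
    """Feed samples to the monitor and return how many times it fired."""
    return sum(1 for sample in samples if monitor.observe(sample))


if __name__ == "__main__":
    dumps = []  # records the readings at which a "dump" was requested
    monitor = CpuSpikeMonitor(90.0, 3, dumps.append)
    run_monitor([50, 95, 96, 97, 98, 40, 99], monitor)
    print(dumps)  # prints [97]
```

Requiring a few consecutive high samples is a design choice to avoid taking a dump on a momentary blip. Firing only once per spike keeps the monitor from producing a pile of dumps while the CPU stays pegged.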

This general approach can be used in a number of situations.  Basically, if you can write code to detect when your role (whether a worker role or a web role) is in a bad state, you can trigger a crash dump to capture that state.  Alternatively, you could adjust some of the SourceSwitch values at that point to increase logging verbosity and see what's going on.  In some cases, no run-time action is required: you just need to log that something odd is happening, with enough information for operations staff to resolve the problem (e.g., a configuration setting that was not set correctly during a deployment).
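The verbosity idea can be sketched the same way. The example below uses Python's standard `logging` module as a stand-in for a .NET SourceSwitch, so treat it as an analogy rather than Azure Diagnostics code. The logger name, the queue-depth health check, and its limit are hypothetical.

```python
import logging

# Hypothetical logger name; stands in for a trace source in the role.
logger = logging.getLogger("role.diagnostics")


def is_unhealthy(queue_depth: int, max_depth: int = 1000) -> bool:
    """Assumed health check: the work queue is backing up."""
    return queue_depth > max_depth


def adjust_verbosity(log: logging.Logger, unhealthy: bool) -> int:
    """Log at DEBUG while the role looks unhealthy, WARNING otherwise."""
    level = logging.DEBUG if unhealthy else logging.WARNING
    log.setLevel(level)
    return level


if __name__ == "__main__":
    logging.basicConfig()
    adjust_verbosity(logger, is_unhealthy(queue_depth=5000))
    logger.debug("queue is backing up; detailed tracing is now on")
```

Dropping back to the quieter level once the check passes keeps the extra log volume limited to the period when it is actually useful.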

You could even have a separate role monitor the troubled role and then use the Azure Service Management API to adjust the role that is experiencing problems.

While at Microsoft, I worked with a Microsoft Research developer on a prototype tool called HiLighter that identified patterns in log data to alert operational staff to problems.  Because this analysis runs in real time alongside the production services and learns from past problems, it can identify a developing problem more quickly than a human looking through reams of real-time log and performance data.

Think of Diagnostics broadly – not just as logging, but also as active monitoring of problems in your roles.