Logging and Instrumentation

Over the years I’ve helped to architect, develop, review, and troubleshoot various solutions ranging from windows applications up to huge enterprise scale applications and I keep banging my head against one particular area – Logging and Instrumentation.

Consider a typical distributed enterprise application consisting of a number of servers exposing IIS Applications, Web Services and Databases, its running fine for months and then one day the consumers of said application start to report poor performance problems or your operations people start to see server instability.

Ideally you’d like to be able to walk up to your system and see “at a glance” what part of your system is causing the performance problems, Are the Front End Web Services overloaded and queuing incoming traffic? Are Web Service calls to a back-end system taking an age to return? Are your database calls timing out?

In my experience this level of instrumentation is never available in one place meaning you have to trawl through many different sources of information (IIS Logs across multiple machines, Event Viewer on each machine, SQL Server logs, PerfMon, NetMon), thus making it hard and nearly impossible to diagnose where the problem might be easily. 

Then there’s instrumentation inside your code, using the aforementioned logs you might be able to narrow down the performance problem but you then need to have a bit more granularity to identify what might be going on – it could be a Web Service call your making slowing you down, a database query suddenly returning many more rows than you expected, etc.).

At the very least you code should have interface level “in and out” type logging and timing, that way you can see which calls into your code are taking the time, and ideally whenever you “call out” to an external system or bit of code so you can be sure where the time is being spent.

So, enough of the lecture J  How can we help you to get this instrumentation into your code today?

I am a massive fan of the Enterprise Instrumentation Framework (EIF) toolkit that was released a few years back; it’s a very powerful framework which .NET developers can use to add instrumentation to their code. Instrumentation output is pushed into pluggable event sinks, out of the box you get Windows Trace, WMI and the Event Viewer and you can very easily build your own sinks to perhaps output trace into a MSMQ for later insertion into a SQL Server database, etc.

The pluggable nature of the sinks is great for future proofing and the out-of-the-box sinks are enough for most customers, Windows Trace uses the ultra fast kernel ETW stuff that Windows uses internally the overhead is extremely minimal. WMI is great for integration into WMI monitoring software such as Microsoft Operations Manager (MOM) and Event Viewer to flag up alerts.

There’s plenty of good documentation on EIF so I won’t drill into it any further, however this is an often overlooked feature of EIF called Request Tracing. Request Tracing allows you to link all trace output to a specific logical request, i.e. if a user calls into an ASP.NET page and you then subsequently calls half a dozen other .NET components all trace output will be “grouped” under a unique Request Identifier allowing you to see what happened for that given user.

You can even extend EIF to pass this Request Identifier across tiers; out of the box with Remoting EIF will pass the Identifier across tiers and you can easily use a WSE Filter to pass the token for Web Service calls. I’ve got a sample somewhere if anyone is interested.

In addition to this the time of a given “Request” is also timed and logged as part of the trace record, so you get an idea of how long the end to end request took, you can also nest Request Traces to be able to time specific parts of a Request. In addition to this great feature you also get a free performance counter that allows you to monitor Execution Time of your Request.

Here’s the code required to start a Request Trace, you need to define a RequestEventSource for each timing “block”:

staticRequestEventSource Method1Source = new RequestEventSource("Method1", "Does something");

// .....

using( new RequestTrace(Method1Source) )
      {
         // Your code here
      }

Using PerfMon you will then find this Request Trace listed under the Event Sources:Request performance object:

Using PerfMon you can then monitor the execution time, I’ve created a simple test harness that demonstrates most of the calls executing sub second and then hitting a (forced) problem where the calls start to take over 4 Seconds. 

As you can hopefully see this could help you to narrow down the cause of performance problems inside your application, especially if you have a number of RequestTrace blocks inside your code. You can of course write your own Performance Counters but this approach is a bit more straightforward – and free J

You of course also get all of this information logged to a Trace File if you prefer to analyse it that way, and as I mentioned before any Trace messages you output within the Request Trace Block will be linked to that entire request, e.g:

TraceMessageEvent.Raise("Hello World");

I can’t emphasise just how important this type of information is, and if you have the time look into adding your own Performance Counters to your code to give you a view into what your application is doing, they will prove invaluable – just like the built-in Windows counters….

You can use EIF from anything you like, Windows Applications, ASP.NET, BizTalk….