Rico's Instrumentation Aphorisms

A few months ago, Mary Gray of the Management Practices Team came to talk to me about good practices for creating performance counters and doing measurements generally.  She interviewed me on the topic for about an hour and was madly scribbling notes the whole time while I talked a mile a minute.  What's below is a slightly edited version what she took away from the interview.  I thought it was interesting enough that you guys might like to see it so here it is.

Mary, thank you for allowing me to share.

Adding instrumentation in the form of events and performance counters to your software is one of the most important things you can do to make your component or application more manageable by IT personnel, more supportable by CSS, more easily tuned and debugged by developers and testers.

The OS already has performance counters you can use for such resources as CPU, disk, memory, and network resources. These are the primary resources that you will need to track for most software. You don't need to add a lot of performance counters or events to your software for raw resources; the trick is to correlate what your software thinks it is doing with the operating system resource impact of those operations.

Judiciously added instrumentation allows you to more easily pinpoint the states that lead to poor performance or failure. Well designed events inform monitoring software and IT admins about whether the software is operating normally, in a degraded state, or has failed completely. Good tracing events in conjunction with perf counters related to the work of the software allow diagnosis and tracking of trends. Events targeted to the administrator can identify what work was being done for which user context when a failure occurs.

Rico's Instrumentation Aphorisms

Instrumentation aphorism #1: Attribute the cost, don't describe it.

To attribute costs, the important word is "correlation". You want to correlate what your software thinks it is doing to what the operating system knows about resource usage. You can use (e.g.) ETW tracing events to mark the beginning and end of "jobs" or transactions in your software's work life.

What is a “transaction” in the runtime life of your software?  Is it a mouse click event?  A business transaction of some kind?  An HTML page delivered to the user?  A database query performed?  Whatever it is, look at your critical resources and consider the cost per unit of work.  For example, consider CPU cycles per transaction, network bytes per transaction, disk i/o’s per transaction, etc.  

Tracing events, to be useful, need to be associated with the higher level transactions of the software rather than associated with the life of single objects. You can have too many events and events at too low a level or marking time intervals that are too short to be useful. This use of events and perf counters just creates overwhelming noise and does not allow you to see trends easily.

This correlation between the work of the software and resources should also be used in administrator events marking changes of state, not just tracing events. Administrators are running the software for a reason and have every interest in knowing why (e.g.) MOM 2005 is reporting a degraded state for it - why the system is slowing down or why the software is banging away at the disks continuously. These events, as opposed to tracing events should provide actionable advice.

Instrumentation aphorism #2: Account for consumption.

To account for consumption, you will want to calculate rates rather than just measure occurrences. Look at the resource costs per unit-of-work of work. What is your software accomplishing to justify its consumption of CPU, memory, disk, network , or other resources? Expressing resource costs in a per-unit-of work fashion will help you to see which costs are reasonable and which are problems. You want to be able to trace or to inform adminstrators what resources are being used.

The operating system already gives you a variety of performance counters that measure CPU consumption, disk I/Os, memory usage, and network activity. These are the primary measuring sticks you need to compare to what your software is doing. The performance counters you add are most useful when they calculate the rate of work accomplished.

You can generate tracing events that tell you the rate of work, what the user context is. The combination of events that mark the start and end of transactions with rate counters allows developers and CSS people to pinpoint the resource that is being pinched and wrecking performance or starting a death spiral to failure.

If you are considering a design which sequesters a chunk of memory which your software managed, you may want to think twice about it. The OS already tracks memory resources. If you manage your own memory, then you have to duplicate the operating system plumbing to be able to diagnose performance problems and failures. The programming and maintenance costs for this may outweigh the hoped-for design benefits.