How To Do A Good Performance Investigation

I find that sometimes people have difficulty just getting started when doing a performance analysis – meaning they’re faced with a potentially big problem and don’t know where to begin.  Over the years many people have come to me under those circumstances and asked me how I would approach the problem.  So today I’m trying to distill those bits of advice – my Modus Operandi – into some simple steps in the hope that it might help others to get off to a good start.

So here it is, Rico’s advice on how to do a good performance investigation.

Preliminaries

The first thing to remember is not to try to do this in a rushed way.  The more of a hurry you are in to get to the bottom of the problem, the less you can afford to be hasty.  Be deliberate and careful.  Block out a good chunk of time to think about the problem.  Make sure you have the resources you need to succeed – that means enough access to the right hardware and the right people.  Prepare a log book – electronic if you like – so that you can keep notes and interim results at each step.  This will be the basis of your final report and it will be an invaluable reference to anyone who follows in your footsteps – even if (especially if?) that someone is you.

Step 1 – Get the Lay of the Land

Before you look at any code, talk to the people involved.  You’ll want to get a feel for what they are trying to accomplish.  What are their key difficulties?  What is inherently hard about this problem?  What is the overall organization of the system?  What are the inherent limitations of those choices?  What is the chief complaint with the current system?  What would a successful system look like?  How do they measure success?

In addition to a basic understanding of how the system is intended to work, the key question you want to answer is this:  Which resource is the one that should be critical in this system?  Is the problem fundamentally CPU bound, disk bound, or network bound?  If things were working perfectly, what would be constraining the performance?
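To make that question concrete, a quick back-of-envelope bound is often enough.  Here’s a minimal sketch in Python; the workload and hardware figures are hypothetical placeholders, not measurements from any real system.

```python
# Sketch: a lower bound on how fast the job could possibly go if the
# "correct" resource were the only constraint.  All figures are hypothetical.
bytes_to_read = 2 * 1024**3        # suppose the job must read 2 GB from disk
disk_throughput = 100 * 1024**2    # and the disk sustains about 100 MB/s

lower_bound_s = bytes_to_read / disk_throughput
print(f"best case if purely disk bound: ~{lower_bound_s:.0f} seconds")

# If the job actually takes several minutes, something other than the disk
# is the real constraint and Step 2 should reveal what it is.
```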

Step 2 – Identify the current bottlenecks

In this step we cast a broad net to see what resource is the current limiting factor.  The tool I reach for first is PerfMon.  Look at key counters like CPU Usage, Memory Usage, Disk and Network I/O.  Examine these on all the machines involved if this is a server problem (i.e. check all the tiers).  Check for high levels of lock contention.
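If you’d rather capture that broad net from a script than by watching the counters interactively, a sampling loop along these lines works.  This sketch is Python and uses the third-party psutil package, which is my stand-in here and not part of the toolset described above.

```python
# Sketch: coarse system-wide sampling in the spirit of watching PerfMon's
# CPU, memory, disk, and network counters.  Requires the psutil package.
import psutil

def sample(interval_s=5, samples=12):
    """Print coarse CPU, memory, disk, and network figures every few seconds."""
    prev_disk = psutil.disk_io_counters()
    prev_net = psutil.net_io_counters()
    for _ in range(samples):
        cpu = psutil.cpu_percent(interval=interval_s)   # % busy over the interval
        mem = psutil.virtual_memory().percent           # % physical memory in use
        disk = psutil.disk_io_counters()
        net = psutil.net_io_counters()
        disk_mb = (disk.read_bytes - prev_disk.read_bytes +
                   disk.write_bytes - prev_disk.write_bytes) / 1e6
        net_mb = (net.bytes_recv - prev_net.bytes_recv +
                  net.bytes_sent - prev_net.bytes_sent) / 1e6
        print(f"cpu={cpu:5.1f}%  mem={mem:5.1f}%  "
              f"disk={disk_mb:8.1f} MB  net={net_mb:8.1f} MB")
        prev_disk, prev_net = disk, net

if __name__ == "__main__":
    sample()
```

Run something like this on each tier while the problem workload is active and the limiting resource usually stands out quickly.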

At this point you should be able to identify which resource is currently the one that is limiting performance.  Often it is not the resource identified in Step 1. 

If it is the correct resource – the one that is supposed to be the limiting factor in this kind of computing – that’s a good preliminary sign that sensible algorithms have been selected.  If it’s the wrong resource it means you are going to be looking for design problems where a supposedly non-critical resource has been overused to the point that it has become the critical resource.   The design will have to be altered such that it does not have this (fundamentally unnecessary) dependence.

Step 3 – Drill down on the current bottleneck

A common mistake in a performance analysis is to try to do step 3 before step 2.  This is going to lead to a good bit of waste because you could do a deep analysis of, say, CPU usage when CPU usage isn’t the problem.  Instead, choose tools that are good at measuring the problem resource and don’t worry so much about the others for now.  If it’s hard to measure the resource in question, add instrumentation for this resource if possible.  The goal is to find out as much as you can about what is causing the (over) use of the bottleneck resource.
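If there’s no ready-made counter for the suspect resource, even very crude instrumentation is better than guessing.  Here is a minimal sketch of what I mean, in Python; the resource name and call site are hypothetical placeholders for whatever you actually need to measure.

```python
# Sketch: ad hoc instrumentation that counts uses of a hard-to-measure
# resource and the time spent in it.  Resource names here are hypothetical.
import collections
import contextlib
import time

_counts = collections.Counter()
_elapsed = collections.defaultdict(float)

@contextlib.contextmanager
def tracked(resource_name):
    """Count one use of the named resource and accumulate time spent in it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        _counts[resource_name] += 1
        _elapsed[resource_name] += time.perf_counter() - start

def report():
    for name, count in _counts.most_common():
        print(f"{name}: {count} uses, {_elapsed[name]:.3f}s total")

# Hypothetical usage at a suspect call site:
#   with tracked("backend-rpc"):
#       call_backend(request)
```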

When doing this analysis, consider factors that control the resources at different system levels, starting from the largest and going to the smallest.  Is there something about the overall system architecture that is causing overuse of this resource?  Is it something in the overall application design?  Or is it a local problem with a module or subcomponent?  Most significant problems are larger in scope; look at those first.  They are the easiest to diagnose and sometimes the easiest to correct.  E.g. if caching were disabled on a web server you could expect big problems in the back-end servers.  You’ll want to make sure caching is working properly before you decide to (e.g.) add more indexes to make some query faster.

Your approach will need to be tailored to the resource and the system.  For CPU problems a good time profiler can be invaluable.  For SQL problems there’s of course SQL Profiler (find the key queries) and Query Analyzer (view the plans).  For memory issues there are abundant performance counters that can be helpful, including the raw memory counters and the .NET memory management counters.  Tracking virtual memory use over time can be helpful – sampling with vadump is handy.  Examining key resources by breaking into the system with a debugger can also be useful.
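As a small illustration of tracking memory over time, the sketch below periodically samples another process’s memory, in the spirit of running vadump against it at intervals.  It’s Python with the third-party psutil package, not one of the tools named above, and the ten-second interval is arbitrary.

```python
# Sketch: sample one process's memory use over time.  Requires psutil.
# Usage: python watch_memory.py <pid>
import sys
import time
import psutil

def watch_memory(pid, interval_s=10):
    proc = psutil.Process(pid)
    while True:
        info = proc.memory_info()
        print(f"{time.strftime('%H:%M:%S')}  "
              f"rss={info.rss / 1e6:8.1f} MB  vms={info.vms / 1e6:8.1f} MB")
        time.sleep(interval_s)

if __name__ == "__main__":
    watch_memory(int(sys.argv[1]))
```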

Step 4 – Identify anomalies in the measurements

The most interesting performance problems are almost always highlighted by anomalous observed costs.  Things are happening that shouldn’t be happening or need not happen.  If the critical resource isn’t the “correct” one as identified in Step 1 it’s almost certainly the case that your root-cause analysis will find an undesired use of the resource.  If the resource was the “correct” one then you’ll be looking for overuse to get an improvement.

In both cases it is almost always helpful to look at the resource costs per unit of work.  What is the “transaction” in this system?  Is it a mouse click event?  Is it a business transaction of some kind?  Is it an HTML page delivered to the user?  A database query performed?  Whatever it is, look at your critical resource and consider the cost per unit of work.  E.g. consider CPU cycles per transaction, network bytes per transaction, disk I/Os per transaction, etc.  These metrics will often show you the source of the mistake – consider my recent analysis of Performance Quiz #6, where I looked at the number of string allocations per line of input and then at the bytes allocated per byte in the input file.  Those calculations were both easy and revealing.

Expressing your costs in a per-unit-of-work fashion will help you to see which costs are reasonable and which are problems.
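Here’s the arithmetic in miniature.  The totals below are made-up placeholders rather than real measurements; the point is just that dividing by the number of transactions turns big opaque numbers into costs you can judge at a glance.

```python
# Sketch: convert raw resource totals into per-unit-of-work costs.
# All of these figures are hypothetical.
measurements = {
    "cpu cycles": 48_000_000_000,
    "network bytes": 220_000_000,
    "disk I/Os": 36_000,
    "bytes allocated": 512_000_000,
}
transactions = 10_000

for name, total in measurements.items():
    print(f"{name + ' per transaction:':32} {total / transactions:,.1f}")
```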

Step 5 – Create a hypothesis and verify it

Based on the analysis in step 4 you should be able to find a root cause for the current bottleneck.  Design an experiment to validate that this is the case.  This can be a very simple test such as “if we change the settings [like so], it should make the problem much worse.”  Take advantage of any kind of fairly quick validation that you can do… the worst thing to do is to plunge into some massive correction without being sure that you’ve really hit the problem.  If no quick test is obvious, consider doing only a small fraction of the corrective work.  Perhaps just enough for one test case to function – validate that before you proceed.
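Here’s a toy sketch of the kind of quick experiment I mean.  The workload and the setting being toggled are hypothetical stand-ins; the idea is simply to flip the suspect knob and confirm the effect moves the way your hypothesis predicts before you invest in the real fix.

```python
# Sketch: a quick hypothesis check.  The workload below is a hypothetical
# stand-in for the real operation; toggling the suspect setting should make
# the problem much worse if the hypothesis is right.
import timeit

def run_workload(cache_enabled):
    cache = {}
    total = 0
    for i in range(100_000):
        key = i % 1_000
        if cache_enabled and key in cache:
            total += cache[key]
        else:
            value = sum(range(key))   # stand-in for the expensive work
            cache[key] = value
            total += value
    return total

with_cache = timeit.timeit(lambda: run_workload(True), number=3)
without_cache = timeit.timeit(lambda: run_workload(False), number=3)
print(f"with cache:    {with_cache:.2f}s")
print(f"without cache: {without_cache:.2f}s")   # expected: much worse
```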

The trick to doing great performance work is to be able to try various things and yet spend comparatively little time on the losing strategies while quickly finding and acting on the winners.

Step 6 – Apply the finished corrections, verify, and repeat as needed

After Step 5 you should be highly confident that you have a winning change on your hands.  Go ahead and finish it up to production quality and apply those changes.  Make sure things went as you expected and only then move on.  If your changes were not too sweeping you can probably resume at Step 2, or maybe even Step 3.  If they were very big changes you might have to go back to Step 1.

Step 7 – Write a brief report

Summarize your method and findings for your teammates.  It’s invaluable as a learning exercise for yourself and as a long-term resource for your team. 

Post Script

I wrote a follow-up article which offers more prescriptive advice about managed code specifically – see Narrowing Down Performance Problems in Managed Code.