Why measure when you can guess?

I recently bought a new house, and along the way our agent recommended we get a sewer inspection. Those things are so cool! The guy arrived with a roll of tubing attached to a gizmo that looked like a prop from City of Lost Children, unwound this into the sewer, and got a realtime video feed back to his laptop.

The bad news is he found a cave-in just past the south wall (but don’t worry, it won’t cost too much to fix). In fact the inspector suspected this even before we saw it on the video. He had a feeling something was wrong because he noticed a depression in the lawn.

But here’s the thing: even though he was able to diagnose the problem by applying a mixture of experience and intuition, we still sent down the camera to confirm his guess before shelling out the $$$ to buy the house. Guesses are no substitute for hard data, especially when large sums of money are involved!

Software optimization works the same way. Intuition can be a useful starting point, but if you really care to understand what’s going on, there’s nothing quite like looking inside the sewer and seeing for yourself. And yet, I’ve lost count of how many times I’ve seen variants on the following conversation:

Programmer: Can someone tell me how to speed up my Frobnicate method?

Expert: How do you know Frobnicate is the bottleneck? Have you profiled this program?

Programmer: Well no, but I’m pretty sure; I mean it’s got to be, right?

Expert: Do you even have a profiler?

Programmer: Well, not exactly…

Before you can hope to speed up a program, you have to know what part of the code is making it slow. The problem with guessing is that if you guess wrong, you can waste a lot of time trying to speed up something that wasn’t slow in the first place.

People are often tempted to add timing instructions to their code, using Stopwatch to record how long various sections take to run. There is indeed a time and place for this technique (I will blog more about it later), but it is a blunt instrument and prone to error. Common problems:

  • Can take a long time to hone in on the slowest pieces of code
  • The results are only as good as the places you think to add instrumentation
    • If you don’t think to add a timer around the slowest method, you may never even notice it
    • No good for spotting those "duh" things that ought to be fast, but are actually slow because of a silly mistake
  • The timing code itself takes some time to run, which may throw off the results
  • Easy to make mistakes in your timing code, in which case the results are worse than useless

A better solution is to use a profiling tool. Profilers are to game programmers what spirit levels are to carpenters. Sure, you can do good work without one, but at some point you get fed up of eyeballing shelf after shelf ("do you mind checking if this looks straight while I hold it in place? Ok, down a bit on the left, sorry, I meant the other way, ok, that looks good; wait, no, it seems wonky now I’m standing over here…") and you realize it is worth a trip to Home Depot to spend $15 for the proper tool.

Convinced yet?

Ok, now the bad news: there is no single tool that will tell you everything in one place. To truly understand performance, you must combine data from several tools. Here are the techniques I use most often, starting with the most important:

  1. Use a sampling profiler to see where your CPU time is being spent. There are many to choose from: ANTS, NProf, VSTS, VTune, and more that I forget right now. Never leave home without one! Unfortunately there are no such profilers for Xbox or Zune, but you can still get useful information by profiling a Windows version of your game, as the hotspots are usually similar on all platforms.

  2. Make sure your profiler isn’t being tricked by indirect garbage collection costs. Garbage collection is sneaky because it is a "play now, pay later" feature. Memory allocations are very fast in .NET, but if you do too many of them the garbage collector will kick in at some later time, and that can be expensive. A sampling profiler is no use for diagnosing this, because it can only tell you where the CPU is spending its time, and does not understand enough to know what earlier code is causing collections to occur. The solution is to use Remote Performance Monitor to see how long GC is taking on Xbox, then if it turns out to be a problem, use CLR Profiler on Windows to see exactly where the garbage is coming from, and optimize as necessary. Even if the Windows version of your game already runs fast enough, using CLR Profiler on the Windows version is the best way to understand what is causing GC on Xbox.
  3. Make sure your profiler isn’t being tricked by graphics rendering costs. Graphics is another "play now, pay later" feature which can lead to surprising profiler results. You need to understand how the CPU and GPU run asynchronously from each other, then work out whether you are CPU or GPU bound. If GPU bound, narrow down what part of the GPU is your bottleneck and try some performance experiments. Also watch out for pipeline stalls.

  4. Add a framerate counter. Use a special profiling project configuration that selects variable timestep and disables vsync, and launch via Ctrl+F5 to keep the debugger out of the way. Now you can check the framerate whenever you add new code, so you will notice if something slows down more than you were expecting. The sooner you spot a problem, the easier it is to fix, as you can fire up the profiler while the new code is still fresh in your mind.

  5. If you are trying to choose between design alternatives and want to understand the performance characteristics of various options ("which is faster: a virtual method or an interface?"), consider microbenchmarking specific language or runtime features in isolation. I will write more on this later.

  6. People are often surprised when I list .NET Reflector as a performance tool, but this is invaluable when you want to understand the performance characteristics of library code that you rely on. Pop quiz: does List<T>.Sort generate garbage?

Comments (13)

  1. John says:

    Is the answer to the Pop Quiz "Yes?"

    I think I see some sort of singleton deep down in sort where you get obtain an instance to a new ArraySortHelper. Not positive if that’s correct, though. Am I getting hot, Shawn? Great post!

  2. blog_derekh says:

    Great post! I can’t wait to go through all the links and this will make for a great reference for the forum. How do you find out if List<T>.Sort creates garbage using .NET Reflector?

  3. StatusUnkown says:

    I’d imagine List<T>.Sort will only generate garbage where T is a struct that doesn’t have a default comparer.

    I’ve made this mistake myself in the past with a SortedList, and it’s a perfect example of your point.

    (This is what I assume was happening)

    Structs are copy by val, meaning the default comparer can’t do a primitive comparison such as a pointer comparison. So it had to use reflection to pull an array of members from the struct and do comparisons on them instead!

    The garbage collector went ballistic on the xbox. CLR Profiler was indicating huge numbers of object arrays being collected, but surprise surprise I never noticed the performance hit on the PC :-)

    Luckily I have a GC diagnostic graph which made it immediately obvious something was badly wrong (showing a collection almost every frame on the xbox)

  4. Pete says:

    What’s faster …?

    1) if(foo == null)


        throw new CustomException("foo");




    2) try




      catch (NullReferenceException)


        throw new CustomException("foo");


  5. Brandon Furtwangler says:

    this is good info.

    In step 2 I would add SciTech’s memory profiler (memprofiler.com).  It’s much better than the CLR profiler…although not free.

  6. ShawnHargreaves says:

    > <Pete> What’s faster …?  …

    A perfect situation to try some microbenchmarking! Stay tuned: that’s coming up next-but-one…

  7. ShawnHargreaves says:

    John & StatusUnknown:

    This actually depends on which overload of List<T>.Sort you use (there are 4).

    If you look in Reflector, you will see that the first two (which take no arguments or an IComparer<T> interface) just chain to the fourth version (which has three arguments). That fourth version does some range checking, then chains to Array.Sort.

    The third overload (which takes a Comparison<T> delegate) uses Array.FunctorComparer to wrap the Comparison<T> delegate in an IComparer<T> interface, which it passes to Array.Sort. Garbage any time you call this overload!

    Digging through what happens inside Array.Sort is a little more complex, since this does a bunch of one-time initialization the first time it is called on any given type. But once you make it through the couple of levels of factory code, it looks like this will be fine as long as the type you are sorting has a garbage-free comparison function (either it is a reference type, or implements IComparable<T>, or you provide an explicit comparison interface).

  8. Pete says:

    "A perfect situation to try some microbenchmarking! Stay tuned: that’s coming up next-but-one…"


  9. Great post.  I’m in the middle of trying to find a memory leak that only sometimes appears and only appears after hours of my game being paused, when I have my debug console showing run-time data.  CLR Profiler produces such crazy sized logs that I can’t just run it for hours.  I wish there was a nice way to find out who is using the memory when the OutOfMemoryException is thrown…

  10. aaron says:

    " I wish there was a nice way to find out who is using the memory when the OutOfMemoryException is thrown…"

    cant you just put a breakpoint where you catch the exception then leave it running over night

  11. ShawnHargreaves says:

    > cant you just put a breakpoint where you catch the exception then leave it running over night

    That may work, but it also could pick up some other unrelated allocation, if some piece of rogue code is allocating almost all the memory in the machine, which then causes some other piece of harmless and correctly working code to run out of memory the next time asks for what would normally be a sensible/small/safe allocation.

    My favorite way to diagnose memory leaks is to compare the heap contents at two points in time using CLR Profiler. You don’t need to wait for it to actually run out of memory: just look at what is on the heap after 30 seconds of running, then again at what is on the heap after a couple of minutes. If you have a leak, the heap will have grown, and by seeing what type of object makes up the difference, you can identify where the leak is coming from.

  12. Business Software Dorset says:

    Hi Shawn. I have a performance problem, and was wondering if you could help me. Looked around for a suitable place to ask this question, and this seemed as good as any.

    I am creating a 3D post-it note application using XNA. The app gives users a board they can zoom in and out on.  It works fine when there are a couple of hundred post-its, but I am wanting to support something in the region of a 1000 at least. As the number of post-its go up the frame-rate goes down. Eventually this becomes significant.

    I listened to your talk on optimization that comes with the slides – the one that comes up if you google xna performance.

    Because XNA has no direct 3D text writing capability I am drawing the text into a render target, then storing the result into a texture, then using an effect to draw the texture on to my 3D surface.

    I have been using a separate texture for each post-it, but as far as I can tell that means I have to call effect.begin and effect.end for each post-it. Any attempt to change the texture inside effect.begin and effect.end leads to some error or another. And your talk explains having a lot of effect.begin effect.end inside a loop is bad for performance.

    After I listened to you talk I thought I would instead draw several of post-it notes on to the same texture and, as it were, draw them as a batch. Then using HLSL as demonstrated in the multi-instancing sample I would pass that texture into a modified version of the .fx file and that way be able to draw batches of post-its in a single call. I have been drawing my post-its on render targets 128×128 and then storing the result to a Texture2D.

    However when I come to make a larger render target so I can draw several post-its on it, I bump up against not being able to create a render target that is larger than the current graphics device back buffer. In the general case the user does not have the app full-screen, so I am restricted to created Texture2D s of an arbitrarily small size. Or am I?

    In a way, that is my first question. Is there a way of drawing textures up to the size of

    GraphicsDevice.GraphicsDeviceCapabilities.MaxTextureWidth x GraphicsDevice.GraphicsDeviceCapabilities.MaxTextureHeight

    that is independent of current graphicsdevice window or screen size?

    If not, could you recommend a way of stitching Texture2D s together so I can pass a single texture to my HLSL code along with an array of vectors to point to the right place within the Texture2D for any given post-it.

    Or perhaps, as I have explained the background to what I am trying to do, there is a better way that is not on my radar? Please do not say "use a different framework instead of XNA" or "use 2D instead of 3D" … I have come too far for that! … 😎