How do I debug a problem in someone else’s code?

There’s been an interesting confluence of discussions at https://weblogs.asp.net about debugging other people’s code. JeremyK started the ball rolling, and Eric’s picked it up. So I figured I ought to add some more details from my end. First off, Eric’s “part two” article is an absolute must-read.

Just about every week I end up having to debug a problem in somebody else’s code. Either it’s something I’m testing that doesn’t work (Hmm. After my changes, why doesn’t Winamp play music any more?), or it’s someone on my team who’s having a problem (Can you help me figure out why CoCreateInstance isn’t creating my object?).

The first thing I do when I’m debugging is to ensure that windbg is installed on the machine, and that the NT symbols are up-to-date (or that the machine is using Microsoft’s public symbol server). By the way, the symbols that Microsoft publishes for Windows are almost exactly the same symbols we use internally. The internal symbols carry some more information, like line numbers and structure definitions (and routine names for static functions), but I rarely need that information when debugging. The routine names are almost always enough to get me started.
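To make the symbol server part concrete, here is a minimal sketch of the symbol path string the debugger reads from the _NT_SYMBOL_PATH environment variable. The server URL is Microsoft’s public symbol server; the cache directory C:\symbols and the helper name symbol_path are just illustrative choices, not anything the debugger requires.

```python
# Sketch: build a value for the _NT_SYMBOL_PATH environment variable that
# points the debugger at Microsoft's public symbol server.
# ASSUMPTION: the cache directory C:\symbols is an arbitrary illustrative
# choice, and symbol_path is a made-up helper name.

PUBLIC_SYMBOL_SERVER = "https://msdl.microsoft.com/download/symbols"


def symbol_path(cache_dir=r"C:\symbols", server=PUBLIC_SYMBOL_SERVER):
    """Return a 'srv*<local cache>*<server>' symbol path string."""
    return "srv*{}*{}".format(cache_dir, server)


if __name__ == "__main__":
    print("_NT_SYMBOL_PATH=" + symbol_path())
```

With the variable set to that value, downloaded symbols are cached locally, so only the first session against a given build pays the download cost.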

Oftentimes I come into people’s offices and ask to use windbg, but they say “I’ve got Visual Studio, why can’t you use that?” Well, the answer is simply: “Because Visual Studio doesn’t have the level of command line support that windbg has.” It really doesn’t, although it’s improved immensely in recent versions. Windbg offers an essentially unlimited command history window, so I can look backwards through the history and see what’s changed.

Also, when I’m debugging (even when I have source code available) I almost always debug in assembly language single step mode. This way, even if I miss the decision point that caused the failure I can look back in the history to see what failed. Windbg’s command line window is essential for that: I get registers AND code at the same time. I’ve had other developers at Microsoft look over my shoulder as I’m debugging and exclaim in surprise “I didn’t know anyone actually ever looked at the assembly language any more!” Well, I do. Sue me :)

The next thing that’s important is to be fearless. I can’t count the number of times that people have said to me “Wait, that’s OLE’s code. Why are you debugging in OLE’s code?” Well, if you want to understand the problem, you need to look at the code. Even if you don’t have the symbols, you need to look at the code. It can be quite daunting to debug someone else’s code, but press on. At a minimum, you might learn something.

The other thing I always keep in mind relates to Eric’s Calvin & Hobbes comment: “I have got to start listening to those quiet, nagging doubts.”

Look at the routines that are being called. If I’m debugging something, then at every procedure call, I ask myself “Could this be the source of the error?” If it could, I step into it. If it couldn’t, I step over it. But when I do step over it, I always look at the EAX register. On x86, that’s where the C calling conventions leave the return value of the function (AL or AX if it’s a function that returns a bool or a word). So if I’m debugging CoCreateInstance and I see it call “CoCreateInstanceEx”, I’ll step into it. It’s likely that CoCreateInstance is just a wrapper around CoCreateInstanceEx, and CoCreateInstanceEx is going to be the real routine that returns the failure.
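Reading EAX is easier once you know how to pull it apart. The sketch below is only an illustration (the helper names are made up), but the bit layout is the documented HRESULT format: a severity bit on top, an 11-bit facility, and a 16-bit code. It also shows why AL or AX is what matters for small return types: the upper bits of EAX may hold leftover garbage.

```python
# Sketch: interpret the value left in EAX after stepping over a call.
# The field layout follows the documented HRESULT format; the helper names
# (low_byte, low_word, decode_hresult) are made up for this illustration.

def low_byte(eax):
    """AL: the low 8 bits, where a one-byte bool result lives."""
    return eax & 0xFF


def low_word(eax):
    """AX: the low 16 bits, where a 16-bit result lives."""
    return eax & 0xFFFF


def decode_hresult(eax):
    """Split a 32-bit HRESULT into (failed, facility, code)."""
    eax &= 0xFFFFFFFF
    failed = bool(eax >> 31)          # severity bit: 1 means failure
    facility = (eax >> 16) & 0x7FF    # 11-bit facility field
    code = eax & 0xFFFF               # 16-bit status code
    return failed, facility, code


if __name__ == "__main__":
    # 0x80040154 is REGDB_E_CLASSNOTREG, a common CoCreateInstance failure.
    print(decode_hresult(0x80040154))
    # 0x80070005 is E_ACCESSDENIED: facility 7 (Win32), code 5.
    print(decode_hresult(0x80070005))
    # A bool returned in AL: only the low byte is meaningful.
    print(low_byte(0x12345601))
```

So an EAX of 0x80040154 after CoCreateInstanceEx returns is a failure from facility 4 with code 0x154, which is the “class not registered” error.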

The next thing is to iterate over the failure. At some point you’ll step over the real cause of the failure. When this happens, restart the app and retry the failure case. And this time, step into the function instead of stepping over it. Keep an eye on what is going on. Every function that fails should trigger a quiet nagging doubt. Please keep in mind, though, that it’s entirely possible that the failure is expected, so you need to use critical thinking when evaluating the failure. For example, if the function calls RegOpenKeyEx and the RegOpenKeyEx call fails, then check the registry key in question and see if the failure is supposed to happen or not. Maybe the registry key they’re opening is an optional key.
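The iteration is really a narrowing search down the call tree, and it can be modeled in a few lines. This is a toy model only: the Call record, the “expected” flag, the sample trace, and the routine named HypotheticalLookupClass are all invented for illustration. In a real session this information comes from stepping in the debugger, not from a data structure.

```python
# Sketch: model "step over first, step into the failing call on the next run".
# ASSUMPTION: the Call record, the "expected" flag, and the sample trace are
# hypothetical; HypotheticalLookupClass is not a real Windows routine.

class Call:
    def __init__(self, name, failed, expected=False, children=()):
        self.name = name            # routine that was called
        self.failed = failed        # did it return a failure code?
        self.expected = expected    # failure judged harmless (optional key)
        self.children = list(children)


def find_culprit(call):
    """Return the chain of calls leading to the deepest suspicious failure.

    Each loop iteration stands for one restart of the app: step over the
    calls made by the current routine, then step into the first one whose
    failure is not expected.
    """
    path = [call.name]
    while True:
        suspects = [c for c in call.children if c.failed and not c.expected]
        if not suspects:
            return path
        call = suspects[0]
        path.append(call.name)


def sample_trace():
    """A made-up trace of a failing CoCreateInstance call."""
    required = Call("RegOpenKeyEx(required key)", failed=True)
    optional = Call("RegOpenKeyEx(optional key)", failed=True, expected=True)
    lookup = Call("HypotheticalLookupClass", failed=True, children=[required])
    ex = Call("CoCreateInstanceEx", failed=True, children=[optional, lookup])
    return Call("CoCreateInstance", failed=True, children=[ex])


if __name__ == "__main__":
    print(" -> ".join(find_culprit(sample_trace())))
```

Note that the failing open of the optional key is skipped. That is the critical thinking step: a failure only earns a “step into” on the next run if it isn’t supposed to happen.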

The other thing to keep in mind is that this stuff takes practice. I’ve debugged through COM activation enough times that I know where to put the breakpoints right away. That wasn’t always the case; I’ve spent enough time looking at problems that I’ve pretty much learned my way around the code by trial and a great deal of error.

Of course, all the discussion above assumes that the problem you’re debugging is simple and easily reproducible. This is true for the vast majority of problems I’ve debugged over the years, but every once in a while you run into one that takes hours of work to reproduce. Those are harder to deal with, especially when the crash appears to be in someone else’s code. If it takes a long time to reproduce the problem, then you need to be very careful when stepping through the code, because stepping over the wrong call costs you another long reproduction. It can be quite frustrating, I know.

Oh, and always take every debugging session as an opportunity to learn something new. For instance, as I mentioned above, CoCreateInstance is a wrapper around CoCreateInstanceEx. Well, if your application uses both CoCreateInstance and CoCreateInstanceEx, then you can speed up your application’s load time slightly by replacing all the calls to CoCreateInstance with calls to CoCreateInstanceEx, which means one less routine that needs to be loaded into your DLL.