How do I debug a problem in someone else’s code?

There’s been an interesting confluence of discussions about debugging other people’s code.  JeremyK started the ball rolling, and Eric has picked it up.  So I figured I ought to add some more details from my end.  First off, Eric’s “part two” article is an absolute must-read.

Just about every week I end up having to debug a problem in somebody else’s code.  Either it’s something I’m testing that doesn’t work (Hmm.  After my changes, why doesn’t winamp play music any more?), or it’s someone on my team that’s having a problem (Can you help me figure out why CoCreateInstance isn’t creating my object?).

The first thing I do when I’m debugging is to ensure that windbg is installed on the machine, and that the NT symbols are up-to-date (or that they’re using Microsoft’s public symbol server).  Btw, the symbols that Microsoft publishes for Windows are almost exactly the same symbols we use internally.  The internal symbols carry some more information, such as line numbers, structure definitions, and routine names for static functions, but I rarely need that when debugging; the routine names are almost always enough to get me started.

Oftentimes I come into someone’s office and ask to use windbg, and they say “I’ve got Visual Studio, why can’t you use that?”  Well, the answer is simply: “Because Visual Studio doesn’t have the level of command line support that windbg has.”  It really doesn’t, although it’s improved immensely in recent versions.  Windbg offers an essentially unlimited command history window, so I can scroll back through it and see what’s changed.

Also, when I’m debugging (even when I have source code available) I almost always debug in assembly language single step mode.  This way, even if I miss the decision point that caused the failure I can look back in the history to see what failed.  Windbg’s command line window is essential for that: I get registers AND code at the same time.  I’ve had other developers at Microsoft look over my shoulder as I’m debugging and exclaim in surprise “I didn’t know anyone actually ever looked at the assembly language any more!”  Well, I do.  Sue me :)

The next thing that’s important is to be fearless.  I can’t count the number of times that people have said to me “Wait, that’s OLE’s code.  Why are you debugging in OLE’s code?”  Well, if you want to understand the problem, you need to look at the code.  Even if you don’t have the symbols, you need to look at the code.  It can be quite daunting to debug someone else’s code, but press on.  At a minimum, you might learn something.

The other thing I always keep in mind relates to Eric’s Calvin & Hobbes comment: “I have got to start listening to those quiet, nagging doubts.”

Look at the routines that are being called.  If I’m debugging something, then at every procedure call, I ask myself “Could this be the source of the error?”  If it could be, I step into it.  If it couldn’t, I step over it.  But when I step over a call, I always look at the EAX register.  That’s where the x86 C calling conventions leave the return value of the function (AL or AX if it’s a function that returns a bool or a word).  So if I’m debugging CoCreateInstance, then if I see it call “CoCreateInstanceEx”, I’ll step into it: it’s likely that CoCreateInstance is just a wrapper around CoCreateInstanceEx and CoCreateInstanceEx is going to be the real routine that returns the failure.
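When the value left in EAX is an HRESULT, as it is for CoCreateInstance, it helps to split it into its parts.  Here is a minimal, portable sketch of that decoding, assuming the standard HRESULT layout (severity in bit 31, an 11-bit facility in bits 16 through 26, a 16-bit code in the low word).  The hr_* helper names are made up for this illustration; on Windows the SDK macros FAILED, HRESULT_FACILITY and HRESULT_CODE do the same job.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helpers for illustration only; they mirror the Windows SDK
   macros FAILED, HRESULT_FACILITY and HRESULT_CODE. */

/* Bit 31 is the severity bit: set means failure. */
int hr_failed(uint32_t hr) { return (hr >> 31) != 0; }

/* Bits 16..26 hold the facility (7 = FACILITY_WIN32, 4 = FACILITY_ITF). */
unsigned hr_facility(uint32_t hr) { return (hr >> 16) & 0x7FFu; }

/* The low 16 bits hold the error code within that facility. */
unsigned hr_code(uint32_t hr) { return hr & 0xFFFFu; }

/* Print a value read out of EAX in decoded form. */
void hr_describe(uint32_t eax)
{
    printf("0x%08lX: %s, facility %u, code %u\n",
           (unsigned long)eax,
           hr_failed(eax) ? "failure" : "success",
           hr_facility(eax), hr_code(eax));
}
```

For example, 0x80040154 (REGDB_E_CLASSNOTREG, the classic “class not registered” result from CoCreateInstance) decodes as a failure in facility 4 with code 0x154, while 0x80070002 decodes to facility 7 (Win32) with code 2, which is ERROR_FILE_NOT_FOUND.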

The next thing is to iterate over the failure.  At some point you’ll step over the real cause of the failure.  When this happens, restart the app and retry the failure case.  And this time, step into the function instead of stepping over it.  Keep an eye on what is going on.  Every function that fails should trigger a quiet nagging doubt.  Please keep in mind though, it’s entirely possible that the failure is expected, so you need to use critical thinking when evaluating it.  For example, if the function calls RegOpenKeyEx and RegOpenKeyEx fails, then check the registry key in question and see if the failure is supposed to happen or not.  Maybe the registry key they’re opening is an optional key.
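That judgment call can be written down as a tiny decision table.  This is only a sketch: it assumes the value you see in EAX after RegOpenKeyEx is a plain Win32 error code (RegOpenKeyEx returns one directly rather than an HRESULT), and the triage_reg_open function, the verdict names, and the key_is_optional flag are all hypothetical, standing in for what you work out by looking at the key yourself.

```c
/* Values match ERROR_SUCCESS and ERROR_FILE_NOT_FOUND from winerror.h;
   redefined under other names so the sketch builds anywhere. */
enum { W32_SUCCESS = 0, W32_FILE_NOT_FOUND = 2 };

/* Hypothetical verdicts for a failed (or not) registry open. */
enum verdict { NOT_A_FAILURE, EXPECTED_FAILURE, SUSPICIOUS_FAILURE };

/* status:          the return value of RegOpenKeyEx (what is in EAX).
   key_is_optional: your own conclusion after checking the key in question. */
enum verdict triage_reg_open(long status, int key_is_optional)
{
    if (status == W32_SUCCESS)
        return NOT_A_FAILURE;       /* keep stepping */
    if (status == W32_FILE_NOT_FOUND && key_is_optional)
        return EXPECTED_FAILURE;    /* missing optional key: keep stepping */
    return SUSPICIOUS_FAILURE;      /* this one deserves a closer look */
}
```

A missing key that the code treats as optional comes back as EXPECTED_FAILURE; anything else that is not a success is worth a nagging doubt.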

The other thing to keep in mind is that this stuff takes practice.  I’ve debugged through COM activation enough times that I know where to put the breakpoints right away.  That wasn’t always the case; I’ve spent enough time looking at problems that I’ve pretty much learned my way around the code by trial and a great deal of error.

Of course all the discussion above assumes that the problem you’re debugging is simple and easily reproducible.  This is true for the vast majority of problems I’ve debugged over the years, but every once in a while you run into one that takes hours of work to reproduce.  Those are harder to deal with, especially when the crash appears to be in someone else’s code.  If it takes a long time to reproduce the problem then you need to be very careful when stepping through the code, because stepping over the cause costs you another full reproduction.  It can be quite frustrating, I know.

Oh, and always take every debugging session as an opportunity to learn something new.  For instance, as I mentioned above, CoCreateInstance is a wrapper around CoCreateInstanceEx.  Well, if your application uses both CoCreateInstance and CoCreateInstanceEx, then you can speed up your application’s load time slightly by replacing all the calls to CoCreateInstance with calls to CoCreateInstanceEx, which removes one routine that needs to be loaded into your DLL.



Comments (8)

  1. Phaeron says:

    You forgot a step: before you even walk over to the person’s office, have them try "@err,hr" or "eax,hr" in the debugger first.

    Visual Studio’s debugger is more polished, but for the tough bugs WinDbg definitely owns. !locks, the ability to force a stack trace to use a different base address (kb =12fc00), and scripted breakpoints are very useful. That I can attach it noninvasively to a process that VS is already debugging is great too. And for those of us who don’t have access to the prerelease VS.NET 2005, it’s the only debugger available for AMD64.

    The interface has been improving steadily, particularly with the addition of dockable windows, but I still find it clumsy in many ways. The constant "save base workspace information?" dialogs are really annoying.

    The public symbol server is great and I use it all the time, but WinDbg does seem to have some problems with it: when I have it hooked up, C++ struct evaluations take a really long time because for some reason it keeps hitting the symbol server. VS.NET doesn’t have that problem.

  2. I love the dockable windows, they’re a huge help. And the constant save prompts get really tiresome quickly. And I often don’t go to their office – the !remote command is your friend – it gets you a command line prompt into their debugger.

  3. Pavel Lebedinsky says:

    > The next thing is to iterate over the failure. At some point you’ll step over the real cause of the failure. When this happens, restart the app and retry the failure case. And this time, step into the function instead of stepping over the function.

    On average, this process will converge on the point of failure in something like O(log(N)) steps, where N is the number of nested functions called by the top-level function that fails.

    This is not bad from the algorithmic complexity point of view, but unfortunately each step is very time consuming (restart the app, reproduce the problem, step into the function that failed the last time, start stepping over).

    An alternative is to use the ‘wt -or’ command in windbg to do the stepping for you (for those unfamiliar with windbg – this command will single-step through the entire function, printing a tree of all child functions that are called and their return values).

    Even though it’s O(N), in many cases it can actually outperform the manual O(log(N)) process. You can also combine both approaches using -l, -m and -i switches in wt.

    Sometimes it’s also possible to locate the point of failure in close to O(1) time. Put a breakpoint on kernel32!SetLastError that stops if the error code is the one you’re looking for, and continues execution otherwise. Since SetLastError is typically called only if something fails, this can be much faster than either of the previous two methods.

  4. Larry,

    >> Well, if you want to understand the problem, you need to look at the code

    indeed. and this is exactly why it would help a lot to have easier access to some of the windows sources… it makes understanding one’s own problems so much easier.


    thomas woelfer

  5. Keith Moore [exmsft] says:

    I still prefer cdb/ntsd for user-mode debugging and kd for kernel-mode debugging.

    I admit it — I’m a troglodyte.

    And Larry – Give a big "high five" to the folks that created the MS symbol server. With all of the service packs/updates/whatever these days, having easy access to *correct* symbols via the symbol server is a total life saver.

  6. Pavel: Absolutely – I don’t tend to use wt because it perturbs timing too much, but you’re absolutely right.

    Thomas, 99% of my debugging is done without the code.

    When I’m debugging a problem in winamp, I don’t have the source code. I don’t even have the symbols. It doesn’t stop me. When I’m debugging components in windows outside multimedia I don’t have the source code. It doesn’t stop me. When I’m debugging office applications I don’t have the source code. It doesn’t stop me.

    Source code is a CRUTCH. If you know what you’re doing you don’t need it.

    Keith, as far as I know, Andre’s responsible for putting the symbol server out there.

  7. Sreevalli says:

    plz send the solution

  8. I agree with Larry, as long as you know what you want to do, source code is not necessary. I debugged all of these apps and more without source code. And I think sometimes it is easier to find a bug in debug without the source.
