Retail code debugging

Retail code debugging is one of those necessary evils. It's difficult, but the only way to completely avoid it is to not have retail code, which unfortunately usually requires you to avoid having any customers. Anyway, I figured I would give a brief tour of what you can trust and what you can't trust while you are retail debugging.

Basics

First, the most important advice I can ever give: ALWAYS generate PDBs, and always keep the PDBs for any binary that you ship. If you work for a software company that doesn't do this, let me know so that I can avoid their products :).

Next, for the 1.0 and 1.1 versions of the CLR, retail debugging of managed code is pretty much out. Unless you start the application under the debugger or do some other funky thing, the Just-In-Time compiler will not generate the data on how it mapped IL instructions to native instructions. If you happen to have a reliable repro that doesn't go away when launched under the debugger, then you can debug your code that way. However, if you have a reliable repro, why mess with retail debugging? Anyway, because of this I am only going to talk about native debugging from here on out.

Variables

What you can trust:

  • global/static variables. Unless you are looking at a minidump without heap, global and static variables can be trusted.
  • member variables of a class or structure, assuming you have a valid pointer to that class/struct. The compiler is not allowed to optimize the layout of a structure. So if you can find a pointer to that structure, then the debugger can display the structure correctly.
  • registers on the top frame. The debugger just reads these from the thread context, so they will always be correct.
  • $vframe. $vframe tells you the 'virtual frame pointer'. This is the memory address where you can find the stack frame. If the function has a true stack frame, memory will be laid out like the table below. $vframe is extremely helpful when retail debugging because it tells you whereabouts on the stack to look for your local variables.

    Address                        Contents
    addresses less than $vframe    local variables
    $vframe                        the old EBP value
    $vframe+4                      the return address
    $vframe+8                      the first parameter
    $vframe+...                    additional parameters

NOTE: Other than the return address, everything else is going to depend on compiler optimizations and calling conventions (for calling conventions, see Raymond's blog). However, this is a good rule of thumb.
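To make the arithmetic concrete, here is a minimal sketch of how you might turn a $vframe value into addresses to paste into the memory window. It assumes 32-bit x86 and a true EBP frame (push ebp / mov ebp, esp), so that the saved EBP sits at the frame pointer and the return address one slot above it. The names FrameSlots, SlotsFromVframe, ParameterAddress and LocalAddress are made up for this illustration; they are not debugger APIs.

```cpp
#include <cassert>
#include <cstdint>

// One stack slot on 32-bit x86.
constexpr std::uint32_t kSlotSize = 4;

// Hypothetical helper type: the two fixed slots of a true EBP frame.
struct FrameSlots {
    std::uint32_t savedEbp;       // address holding the caller's EBP
    std::uint32_t returnAddress;  // address holding the return address
};

// Assumption: the function has a true stack frame, so $vframe
// points at the saved EBP and the return address is one slot higher.
inline FrameSlots SlotsFromVframe(std::uint32_t vframe) {
    return FrameSlots{vframe, vframe + kSlotSize};
}

// Address of the index-th stack-passed parameter (0-based).
// Assumption: every parameter occupies exactly one slot, which is not
// true for doubles, structs passed by value, or register parameters.
inline std::uint32_t ParameterAddress(std::uint32_t vframe, unsigned index) {
    return vframe + 2 * kSlotSize + index * kSlotSize;
}

// Address of a local that the disassembly shows as [ebp - offset].
inline std::uint32_t LocalAddress(std::uint32_t vframe, std::uint32_t offset) {
    assert(offset > 0 && offset <= vframe);  // locals live below the frame pointer
    return vframe - offset;
}
```

For example, if $vframe evaluates to 0x0012FF40, ParameterAddress(0x0012FF40, 0) gives 0x0012FF48, which is where I would start looking for the first parameter. Treat the results as hints to verify against the disassembly, not as guarantees.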

So what can't you trust? Anything accessed from a local or parameter. Unfortunately, this is almost everything (example: member variables are accessed via a parameter: 'this').

So, what do you do?

  • The disassembly window is a great guide to finding locals. Look at how your data is accessed.
  • Evaluating $vframe in the memory window will often be helpful.
  • Find a global. If you have a bunch of interconnected objects, finding one will often let you find a bunch more since the debugger will always get the struct layout correct.
  • The return code is probably in eax. 'eax,hr' is your friend if you are doing COM programming.
  • Cast the pointer to an 'IUnknown*'. If the pointer is really an object that implements COM interfaces, the debugger can use the vtable to figure out what type the object really is.

Callstack

As long as you have symbols for every frame on the callstack, the callstack window should produce accurate results. Caveats:

  • Functions that produce identical disassembly will be merged together. The classic example is the ATL templates. I don't know how many times I have seen CComPtr<O1>::~CComPtr calling O2::Release which in turn calls O3::~O3. Since the CComPtr destructor is the same for all objects, and the implementation of Release is the same for all objects, those get merged together, leaving only the O3 destructor to tell you what object is actually being destroyed.
  • Functions written in assembly don't get symbols. This can be a problem with crashes in a CRT routine written in assembly.
  • The stack may be corrupt, in which case the debugger cannot walk it reliably.
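The merging caveat is easier to picture with a small example. This is a hedged sketch, not ATL: Widget, Gadget and ReleaseAndClear are hypothetical stand-ins for two COM classes and a smart-pointer cleanup path. The two instantiations compile to the same machine code, which is exactly what a folding linker (for instance the Microsoft linker with /OPT:ICF) is allowed to merge. Whether they actually end up at the same address depends on your compiler and linker settings, so the last function only reports what happened.

```cpp
#include <cassert>

// Two unrelated hypothetical types whose Release() bodies are identical.
struct Widget {
    long refs = 1;
    long Release() { return --refs; }
};

struct Gadget {
    long refs = 1;
    long Release() { return --refs; }
};

// Stand-in for a smart pointer's cleanup: the generated code does not
// depend on T beyond the (identical) layout and Release() body.
template <typename T>
long ReleaseAndClear(T*& p) {
    assert(p != nullptr);
    long remaining = p->Release();
    p = nullptr;
    return remaining;
}

// Reports whether the linker gave both instantiations one address.
// If it did, a callstack can only show one of the two names.
inline bool InstantiationsWereFolded() {
    using AnyFn = void (*)();
    long (*forWidget)(Widget*&) = &ReleaseAndClear<Widget>;
    long (*forGadget)(Gadget*&) = &ReleaseAndClear<Gadget>;
    return reinterpret_cast<AnyFn>(forWidget) ==
           reinterpret_cast<AnyFn>(forGadget);
}
```

If InstantiationsWereFolded() returns true in your retail build, a frame labelled ReleaseAndClear<Widget> may really be a call on a Gadget. That is the same effect as seeing CComPtr<O1> on the callstack while an O3 is being destroyed.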

Stepping / Breakpoints

Stepping and breakpoints should both work reliably, but this doesn't mean that they will work like you expect.

  • Functions can get inlined, which turns your 'step into' into a 'step over'
  • Functions that produce identical disassembly will be merged together. This will produce lots of confusing results: stepping into a function and winding up at a seemingly wrong function, or setting a breakpoint on a function and having that breakpoint get mapped to some other place.
  • Instruction re-ordering can cause stepping to look wrong. This can cause stepping to skip around, or to visit lines of code that are only partially executed.

The disassembly window is generally your friend when you are stepping through code.