Hard and Soft Mode Debugging or The Woes of Soft Mode

I had to explain this a little while ago and I wrote up something that I thought was generally interesting.   This is only approximately correct (even the examples are flawed) but I think you can get the idea.

I had first-hand experience trying to get a soft-mode debugger working when I was the debugger lead on Visual C++ 1.0; that was just oodles and oodles of fun.

The easiest way to contrast “hard-mode” and “soft-mode” is as follows:

In hard-mode the debuggee never needs to run in response to any debugger inquiry.  In soft-mode it is normal/natural for the debuggee to "help" the debugger in various ways, usually via an agent. Some hard-mode debuggers do allow you to force the debuggee to run code, but they don't require it in the normal performance of their job.

Notably, hard-mode debuggers can debug a core dump; soft-mode debuggers cannot. Hard-mode debuggers find it easy to attach to an already-running process and do not ever cause deadlocks in the debuggee.  Soft-mode debuggers can provide access to more/richer debuggee information, but they may or may not be able to attach to already-running processes (because it may be “too late” to insert the needed agent). They also broadly suffer from reentrancy problems, because they call debuggee functions at a time when data invariants are not valid. That can introduce every imaginable failure mode, up to and including total destruction of the debuggee’s execution environment.
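To make the contrast concrete, here is a minimal sketch of the two styles. Everything in it is hypothetical: the "point" layout, the `Agent` class, and the function names are made up for illustration and do not correspond to any real debugger. The hard-mode path only decodes bytes, so it works on a dead image; the soft-mode path needs something alive inside the debuggee to do the work.

```python
import struct

# Assumed layout for illustration: a "point" is two little-endian 32-bit ints.
POINT = struct.Struct("<ii")


def hard_mode_read_point(memory, address):
    """Decode a point straight from raw bytes; no debuggee code runs."""
    return POINT.unpack_from(memory, address)


class Agent:
    """Hypothetical in-process helper; it only exists while the debuggee is alive."""

    def __init__(self, live_objects):
        self.live_objects = live_objects

    def evaluate(self, name, attr):
        # This runs debuggee code (here, an attribute or property lookup).
        return getattr(self.live_objects[name], attr)


def soft_mode_evaluate(agent, name, attr):
    """Ask the debuggee, via its agent, to compute a value for us."""
    if agent is None:
        raise RuntimeError("no agent available: nothing in the debuggee can run")
    return agent.evaluate(name, attr)


# A pretend core dump: 4 bytes of padding, then the point (3, 4) at address 4.
core_dump = bytes(4) + POINT.pack(3, 4)
```

With only `core_dump` in hand, `hard_mode_read_point(core_dump, 4)` still works, while `soft_mode_evaluate(None, ...)` has nobody to talk to and fails. The same shape explains the "too late to attach" problem: if the agent was never inserted, the soft-mode path has nothing to call.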

Classically, soft-mode debuggers place limitations on the scope in which they will debug so as to avoid obvious reentrancy problems and generally give the user of the debugger a decent experience.  They can be very effective, but getting there is rarely as easy as it superficially appears.

Let me give a simple example of how soft-mode can "mess things up."

I am debugging function f1 and I put a breakpoint in it.  It is halfway through its job when it hits the breakpoint; the debuggee is now “stopped”.  I ask for the value of some innocuous-looking property p1 via evaluation.  The debugger causes the debuggee to execute the correct code for p1.  However, the program logic in p1 assumes that p1 can never be called while f1 is running, much less while f1 is half complete. As a result p1 corrupts its internal state. The debug session is now useless.
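Here is a small sketch of that scenario. The names f1 and p1 come from the story above; the `Account` class, the cached total, and the `breakpoint_hook` parameter (which stands in for the debugger stopping the program and evaluating an expression) are assumptions added purely for illustration.

```python
class Account:
    """Hypothetical debuggee object; the fields are illustrative only."""

    def __init__(self, checking, savings):
        self.checking = checking
        self.savings = savings
        self._cached_total = None  # lazily filled in by p1

    @property
    def p1(self):
        # Looks innocuous, but it caches its answer and silently assumes
        # it is never called while f1 is part-way through a transfer.
        if self._cached_total is None:
            self._cached_total = self.checking + self.savings
        return self._cached_total

    def f1(self, amount, breakpoint_hook=None):
        # Move money from checking to savings.
        self._cached_total = None  # invalidate the cache up front
        self.checking -= amount
        # The invariant is broken here: the money has left checking
        # but has not yet arrived in savings.
        if breakpoint_hook is not None:
            breakpoint_hook(self)  # "stopped at the breakpoint"
        self.savings += amount


def run(evaluate_at_breakpoint):
    acct = Account(100, 50)
    hook = (lambda a: a.p1) if evaluate_at_breakpoint else None
    acct.f1(30, hook)
    # Return what p1 reports and what the true total is.
    return acct.p1, acct.checking + acct.savings
```

Without the evaluation, `run(False)` reports a total that matches reality. With it, `run(True)` leaves p1 permanently reporting the half-finished total it cached at the breakpoint, even though the program never did anything wrong on its own. Looking at the value changed the value.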

The unwanted reentrancy I described above is typically less fatal than a total corruption, and in practice you tend to get pretty good results if you use some care.  There are more subtle kinds of reentrancy, and other safeguards that soft-mode debuggers require; in some sense, the case I described was an “easy” one. To avoid the bigger disasters, soft-mode debuggers place significant limitations on what they will debug.  For instance, you cannot use Visual Studio in mixed mode to break in the VM itself; it’s “out of bounds”.

Contrariwise, native-only mode (“hard-mode”) has no such limits.  It’s normal/natural to debug the VM like that. Hard-mode is impervious to reentrancy issues.

The trouble is that if you naively combine hard- and soft-modes in a joint harness (mixed mode) you have dragged the “native-only” half into the world of soft-mode.  Allow me to illustrate.

The user starts the mixed debugger and uses the hard-mode features to set a breakpoint somewhere in the method dispatch code of the managed VM.  The debugger dutifully stops the VM when that breakpoint is hit.  Now the user tries to evaluate a managed expression… this has no hope of working: the VM is not in a consistent state and cannot be asked to do work.  Most places in the VM are not safe stopping points.

OK, so you might say, “You can’t debug the VM itself, that’s fair,” but that isn’t the extent of the damage.  Suppose we instead set a native breakpoint in the system memory allocator.  Any attempt to evaluate managed expressions while the allocator is in an interim state is likely to result in a corrupt system.

All right, so not system methods, and not the VM.  What if I set a breakpoint in some other native library that isn’t a crucial system resource? I can still fail, because even if dispatching the managed expression succeeds, that code could call, via interop, any native method at all, including ones related to the system I am trying to debug. That system, whatever it is, is now faced with reentrancy issues.  Effectively I’ve injected soft-mode problems into my hard-mode debugger because the soft-mode features have “contaminated” the hard-mode approach.

The trouble is that the managed expressions you might most want to evaluate are likely to be precisely the ones that involve the use of the interop features that access the native code you are trying to debug.

There are other cases that are worth mentioning. You might say, “well people shouldn’t set breakpoints in weird places like that and expect their methods to work, it’s unreasonable.”  But, tragically, the effective stopping point is often not directly chosen by the user.

Let me give another example:  I observe some strangeness in one of my finalizer methods.  I wish to investigate, so I set a breakpoint in the middle of my finalizer.  Shortly afterwards the debugger stops my program, having never hit my breakpoint.  I inspect the stack and I find that I am in the context of a thrown exception that originated in the memory manager, which has thrown due to an invalid de-allocation request in a destructor my finalizer called via interop.  Now, since an allocator operation is in flight, the allocator’s internal data structures happen to not be consistent at this point. I attempt to evaluate a managed expression to look at state associated with my finalizer, but I’m back to the same problem I had before: I can’t, because the allocator is out of commission.

Things only get worse in a system that supports multi-threading.  Reentrancy issues can and do create deadlocks in addition to corruption. A normally impossible deadly embrace can be completed by a debugger-induced resource grab.
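A rough sketch of the simplest version of that resource grab follows. All the names are hypothetical, and the "freeze" is simulated with an event rather than a real debugger suspending a thread. One thread is stopped while holding a lock; the debugger then asks the debuggee to evaluate something that needs the same lock. A timeout is used so the sketch reports the problem instead of actually hanging.

```python
import threading


def demo(timeout=0.2):
    state_lock = threading.Lock()   # protects some debuggee state
    frozen = threading.Event()      # set once the worker is "stopped"
    resume = threading.Event()      # set when the "debugger" resumes it

    def worker():
        with state_lock:
            frozen.set()
            # Stands in for the thread being frozen by the debugger
            # while it is inside its critical section.
            resume.wait()

    def evaluate_expression():
        # Stands in for debugger-requested code that needs the lock.
        if not state_lock.acquire(timeout=timeout):
            return "blocked"  # without the timeout this would never return
        state_lock.release()
        return "ok"

    t = threading.Thread(target=worker)
    t.start()
    frozen.wait()                    # the debuggee is now "stopped"
    during = evaluate_expression()   # evaluation while the worker is frozen
    resume.set()                     # let the worker finish
    t.join()
    after = evaluate_expression()    # same evaluation once the lock is free
    return during, after
```

In normal execution the worker would release the lock promptly and the evaluation code would never wait long. It is only the combination of "freeze that thread" and "now run this code" that completes the embrace, and the debuggee's own logic offers no protection against it.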

In practice, a composite debugger system faces daunting challenges to keep working robustly in a wide variety of circumstances; limiting damage due to unexpected reentrancy is perhaps the biggest problem.