Hard and Soft Mode Debugging or The Woes of Soft Mode


I had to explain this a little while ago and I wrote up something that I thought was generally interesting.   This is only approximately correct (even the examples are flawed) but I think you can get the idea.


I had firsthand experience trying to get a soft-mode debugger working when I was the debugger lead on Visual C++ 1.0, and that was just oodles and oodles of fun.


The easiest way to contrast “hard-mode” and “soft-mode” is as follows:

In hard-mode the debuggee never needs to run in response to any debugger inquiry.  In soft-mode it is normal/natural for the debuggee to “help” the debugger in various ways, usually via an agent. Although some hard-mode debuggers allow you to force the debuggee to run code, they don’t require it in the normal performance of their job.



Notably, hard-mode debuggers can debug a core dump; soft-mode debuggers cannot. Hard-mode debuggers find it easy to attach to an already-running process and do not ever cause deadlocks in the debuggee.  Soft-mode debuggers can provide access to more/richer debuggee information, may or may not be able to attach to already running processes (because it may be “too late” to insert the needed agent) and broadly suffer from reentrancy problems that can introduce every imaginable failure mode up to and including total destruction of the debuggee’s execution environment (by virtue of calling debuggee functions at a time when data invariants are not valid.)
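The contrast can be sketched with a toy model. This is only an illustration, not how any real debugger is built: the `Debuggee` class, the `summary` method, and both `*_inspect` helpers are hypothetical names invented for the example. The hard-mode path reads raw data from a dead snapshot, while the soft-mode path has to run debuggee code and changes the debuggee in the process.

```python
import copy


class Debuggee:
    """Toy stand-in for a debuggee: raw state plus code that interprets it."""

    def __init__(self):
        self.items = [3, 1, 2]
        self.calls = 0  # counts how often debuggee code has run

    def summary(self):
        # Debuggee code: a richer view, but running it mutates the debuggee.
        self.calls += 1
        return f"{len(self.items)} items, max={max(self.items)}"


def hard_mode_inspect(snapshot):
    """Read raw state only; works on a dead snapshot (a 'core dump')."""
    return dict(snapshot)


def soft_mode_inspect(live):
    """Ask the live debuggee to run its own code on the debugger's behalf."""
    return live.summary()


debuggee = Debuggee()
core_dump = copy.deepcopy(vars(debuggee))  # data only, nothing left to execute

hard_view = hard_mode_inspect(core_dump)   # raw fields, debuggee untouched
soft_view = soft_mode_inspect(debuggee)    # richer answer, debuggee has now run
```

The snapshot can be inspected long after the process is gone, but it can only ever give you fields. The richer `summary` answer needs a live debuggee that is in a fit state to run code.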



Classically, soft-mode debuggers place limitations on the scope in which they will debug so as to avoid obvious reentrancy problems and generally give the user of the debugger a decent experience.  They can be very effective, but getting one to work is rarely as easy as it superficially appears.



Let me give a simple example of how soft-mode can “mess things up.”



I am debugging function f1, so I put a breakpoint in it.  It is halfway through its job when it hits the breakpoint, and the debuggee is now “stopped”.  I ask for the value of some innocuous looking property p1 via evaluation.  The debugger causes the debuggee to execute the correct code for p1.  However, the program logic in p1 assumes that p1 can never be called while f1 is running, much less while f1 is half complete. As a result p1 corrupts its internal state. The debug session is now useless.
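Here is a minimal sketch of that scenario. The names f1 and p1 come from the example above; everything else (the `Account` class, its fields, the cached average, and the `breakpoint_hook` parameter standing in for a debugger stop) is an assumption made up for illustration.

```python
class Account:
    """Hypothetical debuggee object; invariant: _total == sum(_entries)."""

    def __init__(self):
        self._entries = [10, 20]
        self._total = 30
        self._cache = None

    @property
    def p1(self):
        # Assumes the invariant holds, and caches what it computes.
        if self._cache is None:
            self._cache = self._total / len(self._entries)
        return self._cache

    def f1(self, amount, breakpoint_hook=None):
        self._cache = None
        self._entries.append(amount)   # invariant is broken from here...
        if breakpoint_hook:
            breakpoint_hook(self)      # "breakpoint" halfway through f1
        self._total += amount          # ...until here


seen = []


def debugger_eval(obj):
    # Soft-mode evaluation of p1 while f1 is half complete.
    seen.append(obj.p1)


clean = Account()
clean.f1(30)
clean_value = clean.p1    # computed after f1 finished

poked = Account()
poked.f1(30, breakpoint_hook=debugger_eval)
poked_value = poked.p1    # stale value cached during the "debug session"
```

In this sketch the evaluation does not crash anything. It quietly leaves a wrong cached value behind, so the program misbehaves after it resumes, and only because the debugger looked.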



The unwanted reentrancy I described above is typically less fatal than a total corruption, and in practice you tend to get pretty good results if you use some care.  There are more subtle kinds of reentrancy, and other safeguards that soft-mode debuggers require; in some sense, the case I describe was an “easy” one. To avoid the bigger disasters, soft-mode debuggers place significant limitations on what they will debug.  For instance, you cannot use Visual Studio in mixed mode to break in the VM itself; it’s “out of bounds”.



Contrariwise native-only mode (“hard-mode”) has no such limits.  It’s normal/natural to debug the VM like that. Hard-mode is impervious to reentrancy issues.



The trouble is that if you naively combine hard- and soft-modes in a joint harness (mixed mode), you have dragged the “native-only” half into the world of soft-mode.  Allow me to illustrate.



The user starts the mixed debugger, uses the hard-mode features to set a breakpoint somewhere in the method dispatch code of the managed VM.  The debugger dutifully stops the VM when that breakpoint is hit.  Now the user tries to evaluate a managed expression… this has no hope of working, the VM is not in a consistent state and cannot be asked to do work.   Most places in the VM are not safe stopping points.



OK, so you might say, “You can’t debug the VM itself, that’s fair,” but that isn’t the extent of the damage.  Suppose we instead set a native breakpoint in the system memory allocator.   Any attempt to evaluate managed expressions while the allocator is in an interim state is likely to result in a corrupt system.



All right, so not system methods, and not the VM.  What if I set a breakpoint in some other native library that isn’t a crucial system resource? I can still fail because, even if dispatching the managed expression succeeds, that code could call, via interop, any native method at all, including ones related to the system I am trying to debug. That system, whatever it is, is now faced with reentrancy issues.  Effectively I’ve injected soft-mode problems into my hard-mode debugger because the soft-mode features have “contaminated” the hard-mode approach.



The trouble is that the managed expressions you might most want to evaluate are likely to be precisely the ones that involve the use of the interop features that access the native code you are trying to debug.



There are other cases that are worth mentioning. You might say, “well people shouldn’t set breakpoints in weird places like that and expect their methods to work, it’s unreasonable.”  But, tragically, the effective stopping point is often not directly chosen by the user.



Let me give another example:  I observe some strangeness in one of my finalizer methods.  I wish to investigate, so I set a breakpoint in the middle of my finalizer.  Shortly afterwards the debugger stops my program, having never hit my breakpoint.  I inspect the stack and find that I am in the context of an exception that originated in the memory manager, thrown because of an invalid de-allocation request in a destructor my finalizer called via interop.  Since an allocator operation is in flight, the allocator’s internal data structures are not consistent at this point.  I now attempt to evaluate a managed expression to look at state associated with my finalizer, but I’m back to the same problem I had before: I can’t, because the allocator is out of commission.



Things only get worse in a system that supports multi-threading.  Reentrancy issues can and do create deadlocks in addition to corruption. A normally impossible deadly embrace can be completed by a debugger induced resource grab.
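A debugger-induced resource grab can be sketched as follows. This is a single-threaded simulation under stated assumptions: `state_lock`, `get_caption`, and `evaluate_at_breakpoint` are hypothetical names, and the timeout exists only so the demonstration finishes. Real debuggee code would typically block with no timeout and simply hang.

```python
import threading

state_lock = threading.Lock()   # non-reentrant lock guarding debuggee state
state = {"caption": "hello"}


def get_caption(timeout=0.1):
    """Debuggee code a debugger might call to evaluate an expression."""
    # The timeout only keeps this demo finite; without it the call would hang.
    if not state_lock.acquire(timeout=timeout):
        return None  # this is where the real program would have deadlocked
    try:
        return state["caption"]
    finally:
        state_lock.release()


def evaluate_at_breakpoint():
    """Simulate stopping code that holds state_lock, then evaluating."""
    state_lock.acquire()        # the stopped code is in the middle of an update
    try:
        return get_caption()    # debugger-induced grab of the same lock
    finally:
        state_lock.release()


while_stopped = evaluate_at_breakpoint()  # None: the lock was unavailable
when_idle = get_caption()                 # "hello": no one holds the lock
```

The program itself never calls `get_caption` while holding `state_lock`, so on its own it cannot deadlock this way. Only the debugger's evaluation closes the cycle.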



In practice, a composite debugger system faces daunting challenges to keep working robustly in a wide variety of circumstances; limiting damage due to unexpected reentrancy is perhaps the biggest problem.

Comments (5)

  1. nksingh says:

    Yikes!  That sounds bad.  This would require a lot of VM and maybe OS support, but how about snapshotting the process and doing all the soft-mode evaluations on the copy?  There are still a lot of things that are out of bounds, like changing the state of resources that are outside of your program’s memory space, but at least you wouldn’t lose the initial failure you’re trying to solve.

  2. Tanveer Badar says:

    I’ll have to read this about 4 more times to fully understand. 🙁

  3. ricom says:

    One of the things you can do is to not really run code (say function evaluations) in the image of the debuggee but "pretend" to do so (i.e. sort of emulate running, starting from where you are). That can be very effective, especially for some modest functions, but it turns into a nightmare quickly as soon as you hit even innocuous looking calls like SendMessage(…,WM_GETTEXT,…), which you need in order to do something as simple as getting the caption text of a window.

    And what does WM_GETTEXT do?  In principle, anything it wants to… anything at all.  Though in practice usually very safe stuff.  Usually…

    It’s actually quite hellish.

    It was doubly hellish on Win16 with no real threads and globally visible memory.  But the CLR debugging system faces many analogous problems.

  4. ricom says:

    If you give me some feedback as to the parts that are hardest to understand I could break those down in a later posting.

  5. Not important says:

    If you invoke a managed property then you execute code in the debuggee. De facto you have changed the state of the debuggee, and now you are debugging something other than what you set out to debug.

    Most applications are probably simple enough that changing their state is no big deal – you will notice right away if invoking managed properties starts breaking stuff.

    If you are investigating some intricate piece of code, then you should probably not invoke any managed code and should limit yourself to inspecting memory (i.e. fields, vtables, etc.).