How do Managed Breakpoints work?


In this blog entry, I’ll explain how setting a source-level breakpoint in a managed debugger works under the hood, from end to end.

 

Here’s an overview of the pipeline of components:

1)      End-user

2)      Debugger (such as Visual Studio or MDbg).

3)      CLR Debugging Services (which we call “The Right Side”). This is the implementation of ICorDebug (in mscordbi.dll).

—- process boundary between Debugger and Debuggee —-

4)      CLR. This is mscorwks.dll. This contains the in-process portion of the debugging services (which we call “The Left Side”) which communicates directly with the RS in stage #3.

5)      Debuggee’s code (such as the end user’s C# program)

 

At a 10,000-foot level, the action starts at stage #1 when the user sets a breakpoint (perhaps by pressing F9 in Visual Studio), trickles down to stage #5 where the user’s code runs until it hits the breakpoint, and then trickles back up to stage #1 to notify the user.

 

I mention various source files which you can look up in Rotor for more details. It will also help to have some mild familiarity with Windows Structured Exception Handling (SEH) before reading this.

 

Here’s what happens at a much gorier level of detail.

 

Part 1: Adding the breakpoint.

1)      [User] The user presses F9 on some source line.

 

2)      [Debugger] The debugger must bind the breakpoint to a function and IL-offset. The function may or may not be jitted yet. It can use the Symbol Store interfaces to use the sequence point maps in the PDB to bind. (See ISymUnmanagedMethod::GetSequencePoints in CorSym.idl).

If the code is not yet loaded, the debugger cannot bind the breakpoint yet. Visual Studio shows unbound breakpoints as hollow circles to indicate they will not be hit. Since the debugger is notified when a module is loaded (via ICorDebugManagedCallback::LoadModule), it can listen for module-load events and bind the breakpoint as soon as a module containing the relevant code is loaded.
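As a rough illustration, binding amounts to searching the method’s sequence-point map for the requested source line. (This is a hypothetical sketch, not the real ISymUnmanagedMethod API; the function name and data shapes here are invented.)

```python
def bind_breakpoint(sequence_points, requested_line):
    """Map a source line to an IL offset using a method's sequence points.

    sequence_points: list of (source_line, il_offset) pairs, sorted by
    source_line, as a debugger might read them out of the PDB.
    Returns the IL offset of the first sequence point at or after the
    requested line, or None if the breakpoint cannot be bound yet.
    """
    for line, il_offset in sequence_points:
        if line >= requested_line:
            return il_offset
    return None

# Example: a method whose sequence points cover source lines 10, 12, and 15.
points = [(10, 0x00), (12, 0x05), (15, 0x12)]
print(bind_breakpoint(points, 12))   # exact match on a sequence point
print(bind_breakpoint(points, 13))   # snaps forward to the next sequence point
print(bind_breakpoint(points, 99))   # past the method: stays unbound
```

Note the "snap forward" behavior: pressing F9 on a line with no code lands the breakpoint on the next line that does have a sequence point, which matches what Visual Studio appears to do.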

 

3)      [Debugger] Once the breakpoint is bound, the debugger can obtain an ICorDebugCode for the function. It then calls ICorDebugCode::CreateBreakpoint to set the breakpoint, which returns an ICorDebugBreakpoint object. The debugger can remember this object to associate the breakpoint with some action for future use. This lets the debugger build more advanced breakpoint features, like conditional breakpoints and hit counts, on top of the basic breakpoint support provided by the CLR.
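For example, a debugger could layer a hit-count breakpoint on top of the CLR’s plain breakpoint: the CLR stops on every hit, and the debugger decides whether to surface the stop or silently continue. (A hypothetical sketch of the debugger-side bookkeeping; the real wiring is through the ICorDebugManagedCallback::Breakpoint callback.)

```python
class HitCountBreakpoint:
    """Debugger-side bookkeeping layered over a plain CLR breakpoint.

    The CLR notifies the debugger on every hit; this wrapper only
    reports a 'real' stop once the hit count reaches the target.
    """
    def __init__(self, target_hits):
        self.target_hits = target_hits
        self.hits = 0

    def on_breakpoint_callback(self):
        # Invoked from the debugger's Breakpoint callback on each hit.
        self.hits += 1
        if self.hits >= self.target_hits:
            return "stop"        # show the user the breakpoint
        return "continue"        # resume the debuggee silently

bp = HitCountBreakpoint(target_hits=3)
print([bp.on_breakpoint_callback() for _ in range(4)])
```

The same shape works for conditional breakpoints: replace the hit-count test with an expression evaluation, as described in step 18 below.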

 

4)      [Right-Side] The right-side (RS) just packages this information into an event and sends it across the process boundary to the left-side (LS). The RS blocks waiting for a reply from the LS.

 

5)      [Left-Side] The left-side has a helper thread listening for events from the RS. (These events are all defined in src\debug\inc\dbgipcevents.h)

If the method is already jitted, then the LS uses the IL-to-native maps to find the native address at which to place the breakpoint. It will then inject a native break opcode (0xCC, or “int3”, on x86) at that address and remember the opcode replaced by the int3 so that it can restore it later when it removes the breakpoint. This part is the same as what a native debugger would do.

The left-side will keep track of:

- the address,

- the opcode to restore,

- the RS’s breakpoint object (so that it can identify the breakpoint when it’s hit).

If the method is not yet jitted, the LS will listen for a jit-complete notification telling it when the method is jitted, and then proceed as in the jitted case. This jit-complete notification is not exposed through ICorDebug (though it is exposed through the profiling API).
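In spirit, the LS’s patching bookkeeping looks like the following. (This is a toy model operating on a byte buffer, not real process memory; the function names are invented.)

```python
INT3 = 0xCC  # x86 break opcode

def set_breakpoint(code, native_offset, patch_table):
    """Patch an int3 into the jitted code, remembering the original byte."""
    patch_table[native_offset] = code[native_offset]
    code[native_offset] = INT3

def remove_breakpoint(code, native_offset, patch_table):
    """Restore the original opcode byte when the breakpoint is removed."""
    code[native_offset] = patch_table.pop(native_offset)

# Toy 'jitted code': a few instruction bytes in a mutable buffer.
code = bytearray([0x55, 0x8B, 0xEC, 0x33, 0xC0, 0xC3])
patches = {}                      # native offset -> saved original opcode

set_breakpoint(code, 3, patches)  # breakpoint over the byte at offset 3
print(hex(code[3]), hex(patches[3]))
remove_breakpoint(code, 3, patches)
print(hex(code[3]), patches)
```

The real LS record also carries the RS’s breakpoint object alongside the saved opcode, so that when an int3 fires it can identify which breakpoint was hit.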

 

6)      [Left-Side] The left-side sends an acknowledgement back to the RS that the breakpoint has been successfully applied.

 

7)      [Right-Side] The right-side returns success from the ICorDebug* calls.

 

8)      [Debugger] The debugger adds the breakpoint to its own tables and displays it as appropriate.

 

 

Part 2: Running to hit the breakpoint.

9)      [Debuggee’s code] A managed thread is running. If it executes the line the breakpoint is set at, it will execute the native break opcode. This generates a hardware exception (code=0x80000003), much as if the thread had executed a divide-by-zero.

 

10)   [Left-Side] The CLR injects specific Structured-Exception-Handling (SEH) filters before running managed code. The OS invokes these filters during the first pass of exception handling with the breakpoint exception. The break opcode is still in the instruction stream, but the thread’s context is now inside the SEH filter.

 

11)  [Left-Side] The filter notifies the LS that a native breakpoint has been hit at a given address.

 

12)   [Left-Side] The thread looks up the address and recognizes it as a breakpoint it set. It then sends an event to the RS notifying it that the breakpoint has been hit.

 

13)  [Right-Side] The RS has an event thread listening for events from the LS (the counterpart to the LS’s helper thread), which queues the event. The RS will not dispatch this event to the debugger yet, since the debuggee is still running.

 

Part 3: Notifying the debugger

14)   [Left-Side] Now that the breakpoint is hit, the CLR needs to suspend all managed threads (so that the process can be inspected) and notify the debugger. The thread that just hit the breakpoint will ping the helper thread to request that the runtime be suspended. It will block itself inside of the SEH filter waiting for the debuggee to be suspended. Once the debuggee is suspended, all threads will remain blocked until the debuggee is resumed. This ensures that threads are not running while the debugger is trying to inspect them!

 

15)   [Left-Side] The helper thread will asynchronously suspend the runtime using the same logic as a GC suspension. If other threads hit debug events during this window, those events are just queued as well.

 

16)   [Left-Side] Once the runtime is suspended, the helper thread will send a “Sync-complete” event to the RS to notify it that the debuggee has now been suspended. The runtime will remain suspended until the debugger resumes it by calling ICorDebugProcess/AppDomain::Continue().

 

17)  [Right-Side] The RS receives the sync-complete, and then flushes all of its queued events. For each queued event, it will invoke a particular callback on ICorDebugManagedCallback (see CordbProcess::DispatchRCEvent in src\debug\di\process.cpp). For the breakpoint, it invokes ICorDebugManagedCallback::Breakpoint where the ICorDebugBreakpoint object is one of the parameters.
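One way to model the RS’s queue-and-dispatch behavior across steps 13, 17, and 22 is the following. (A hypothetical sketch; the real logic lives in CordbProcess::DispatchRCEvent, and the class and method names here are invented.)

```python
class RightSideEventQueue:
    """Queues LS debug events while the debuggee runs; dispatches them
    to the debugger's callback once the sync-complete arrives."""
    def __init__(self, callback):
        self.callback = callback   # stands in for ICorDebugManagedCallback
        self.pending = []

    def on_ls_event(self, event):
        # Step 13: events arriving before sync-complete are just queued.
        self.pending.append(event)

    def on_sync_complete(self):
        # Step 17: debuggee is suspended; dispatch the first queued event.
        if self.pending:
            self.callback(self.pending.pop(0))

    def continue_(self):
        # Step 22: dispatch the next queued event, else resume the LS.
        if self.pending:
            self.callback(self.pending.pop(0))
            return "dispatched"
        return "sent continue to LS"

seen = []
rs = RightSideEventQueue(seen.append)
rs.on_ls_event("Breakpoint on thread 1")
rs.on_ls_event("Breakpoint on thread 2")
rs.on_sync_complete()
print(seen)
print(rs.continue_())   # a second event was queued: dispatch it
print(rs.continue_())   # queue empty: actually resume the debuggee
```

This captures why Continue() in step 22 may not resume the debuggee at all: if multiple threads hit breakpoints before the suspension completed, each Continue just pops the next queued event.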

 

18)  [Debugger] The debugger implements the callback object, so it gets notified. For basic breakpoints, it just stops the shell and sets the current thread and source file appropriately, so the user sees the breakpoint they just hit. If the debugger implements conditional breakpoints, it can evaluate the condition now; if the condition is false, it resumes the debuggee immediately without ever notifying the user.

 

19)  [Debugger / User] While the debuggee is stopped, the user can do all sorts of inspection activity, such as looking at callstacks and local variables, and adding other breakpoints.

 

Part 4: The Debugger continues

20)  [User] The user continues past the breakpoint (such as pressing F5 in Visual Studio, or typing “Go” in MDbg).

 

21)  [Debugger] The debugger calls ICorDebugAppDomain::Continue().

 

22)  [Right-Side] The RS checks if there are any more events to dispatch. If there are, the RS dispatches the next event. Otherwise, the RS sends a continue event to the LS’s helper thread.

 

23)  [Left-Side] The helper thread gets the continue event and resumes all threads it had previously suspended.

 

Part 5: The thread moves past the breakpoint.

24)  [Left-Side] The thread that hit the breakpoint is still in the SEH filter, but now pops out of its wait. The thread will eventually return from the SEH filter and resume executing code at the context where the exception was originally raised (the address of the breakpoint). The break opcode is still in the instruction stream, so if the thread just returned immediately, it would re-hit the break opcode and never move past it.

If we remove the break opcode, that effectively deactivates the breakpoint and opens a race where another thread might slip through and execute the instruction at the breakpoint without actually hitting it.

Instead, the thread will allocate an auxiliary buffer and copy the instruction that is under the break opcode into that buffer. It will execute the instruction from this buffer and thus never needs to remove the break opcode.

 

25)  [Left-Side] The SEH filter updates the IP of the context to point into this auxiliary buffer, enables the single-step flag, and then returns from the filter. This is effectively a long-jump to the auxiliary buffer. The single-step flag is a hardware flag (bit 0x100 in the EFLAGS register on x86) which tells the CPU to execute a single instruction and then raise a hardware single-step exception (code=0x80000004).
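Conceptually, setting up the displaced execution looks like the following. (Again a toy model over byte buffers, ignoring instructions that would need relocation; the function name is invented.)

```python
def build_displaced_buffer(code, patch_table, bp_offset, instr_len):
    """Copy the instruction hidden under an int3 into an auxiliary buffer.

    The original first byte (saved in patch_table when the breakpoint was
    set) replaces the 0xCC in the copy, so the thread can single-step the
    real instruction without the breakpoint ever being removed from 'code'.
    """
    buf = bytearray(code[bp_offset:bp_offset + instr_len])
    buf[0] = patch_table[bp_offset]   # undo the int3 in the copy only
    return buf

# The patched code still holds 0xCC; the buffer holds the real instruction.
code = bytearray([0x55, 0xCC, 0xC0, 0xC3])   # int3 patched over 0x33 at offset 1
patches = {1: 0x33}                           # saved when the breakpoint was set
aux = build_displaced_buffer(code, patches, 1, 2)
print([hex(b) for b in aux], hex(code[1]))
```

Because the original code buffer is never touched, any other thread that reaches the breakpoint during this window still hits the int3, which is exactly the race the auxiliary buffer exists to close.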

 

26)  [Debuggee’s code] The thread executes a single instruction in the auxiliary buffer, and the CPU raises the single-step exception. That goes into the CLR’s SEH filters and notifies the LS (just like the breakpoint exception).

 

27)  [Left-Side] The LS sees that it got a single-step exception on a thread in an auxiliary buffer. It does a bunch of analysis to determine the real address back in the original code at which the thread should be resumed.

 

The thread is now past the breakpoint.

 

 

I think the key takeaway here is that even things that look like they should be really simple may actually be surprisingly complicated.


Comments (19)

  1. Why is the auxiliary buffer necessary? Couldn’t the helper thread keep all the other threads suspended until the current thread has executed the initial single-step after the breakpoint?

  2. jwf says:

    So why does Visual Studio 2003 hang occasionally when "stepping" through a mixed managed/unmanaged debugging session?

    This is the single most frustrating aspect of trying to develop new parts of our smart client applications in managed code while integrating with a huge unmanaged code base that has a ton of UI.

  3. Mike Stall says:

    Pavel – interesting idea, but there’s a problem. Somehow you need to execute the opcode underneath the breakpoint opcode (int3). You can either:

    1) temporarily remove the breakpoint and restore the original opcode. Single step to execute the opcode and then put the breakpoint back in.

    2) use an auxiliary buffer (as described above).

    The problem with #1 is from multi-threaded scenarios. If the other threads are not suspended, then they may race through the breakpoint while we have it removed.

    If they are suspended, then we may deadlock if that single opcode we’re stepping over blocks on another thread for some reason (perhaps it’s a system call).

  4. Mike Stall says:

    jwf – that’s because interop-debugging in v1.1 / VS 2003 is just really unstable in general.

    In some hangs I’ve diagnosed, I’ve counted at least 7 different races that could cause it. There could be 100 reasons why your scenario is hanging.

    FWIW, There was large debate over whether to cut it completely or ship a half working feature that may still provide some value in some cases.

    On the good side, we’ve spent a lot of resources in v2.0 trying to stabilize interop-debugging and we believe it’s a lot better. Check out VS2005 beta 1 and let us know if you still see such hangs (at least use Watson to report them).

  5. I’m having a little trouble figuring out the mechanics of step 10. If the SEH is injected before running managed code, isn’t it possible for someone to add another SEH and be farther up the stack, possibly grabbing the breakpoint exception before Left-Side can see it? And couldn’t a vectored exception handler steal the exception as well before you get it?

    Or am I misinterpreting this and the hardware exception has already been converted to a managed exception? In which case, why aren’t other managed handlers exposed to this exception before it hits your filter?

    I would have thought this would be handled by Right-Side where the debugger can handle the exception before the debuggee even gets to see it. And then communicate to Left-Side that a breakpoint has been hit to stop threads. But I’m guessing there’s a reason this is all done in Left-Side.

  6. Mike Stall says:

    Nicholas – sorry, #10 is poorly worded. You bring up great points:

    Your interpretation is correct. So:

    1) The SEH filter is already injected before the managed code is ever run by the CLR’s EH subsystem.

    2) The CLR’s EH subsystem also owns the vectored exception handler. If you override it, you’ll break things.

    3) While in managed code, only the CLR’s EH subsystem can add SEH filters, so it’s not possible for someone to add another SEH to intercept us.

    4) The CLR’s EH subsystem then cooperates heavily with the CLR’s debugging subsystem. It gives the debugging system first shot at all SEH exceptions with code=0x80000003 (breakpoint) or 0x80000004 (single-step)

    Your last sentence is a great observation.

    This is all done in the Left-side (the portion in-process w/ the debuggee) because managed-only debugging is actually not OS-debugging, and so the RS won’t even get notified of the exceptions.

    When interop-debugging (both managed & native debugging simultaneously), the RS is also an OS debugger and so it will get notified of the exceptions via native debug events. In these cases, it detects that the exceptions are meant for the CLR and resumes the process to let the in-process filters handle them. This is done so that interop and managed-only debugging share a lot of the same logic.

  7. Thank you. I think I understand it now.

    The CLR requires that it controls the OS exception handlers (structured and vectored) for things to work. When the exception is hit, Right-Side ignores it (or never sees it depending on whether you’re doing managed or interop) and lets Left-Side pick it up. Left-Side will always have the first chance because it is the handler of choice for the OS. Left-Side first gives it to the debugging exception filter (which acts like a vectored handler). If it is a debugging related exception, we continue with the algorithm above. Otherwise the debugging filter declines the exception and Left-Side then gives it to the managed structured handler to try.

    One consequence of handling all exceptions in-process is that we have extra transitions between Left-Side and Right-Side when doing interop debugging.

    RS (hit exception, decline) -> LS (hit exception, accept) -> RS (notify of exception) -> LS (suspend debuggee) -> RS (run debugger)

    If we took the native exception it would look like

    RS (hit exception, accept) -> LS (suspend debuggee) -> RS (run debugger)

    Managed debugging always goes

    LS (hit exception, accept) -> RS (notify of exception) -> LS (suspend debuggee) -> RS (run debugger)

  8. Mike Stall says:

    That’s 99% correct.

    Your last sentence is technically wrong. For managed debugging, the RS never sees the raw exception (0x80000003), it only sees the higher level managed debug events (like "Breakpoint"). So your last sentence should be:

    Managed debugging always goes:

    LS (hit exception, accept) -> RS (queue managed debug event) -> LS (suspend debuggee) -> RS (dispatch debug events) -> User responds -> RS (run debuggee)

  9. Daniel Moth says:

    Blog link of the week 53

  10. We’ve made Interop-debugging (aka “mixed-mode debugging”) much faster and more stable in Whidbey (VS8)…

  11. I find people often assume that just being a developer on the CLR means you somehow know everything about…

  12. Eldar says:

    From Tuesday, March 14th, 2006: A friend of mine asked me recently if I had any good books on .NET internals…