Simple example of an API design flaw.

Here’s a simple example of an API design flaw in ICorDebug.
 (ICorDebug is the API that Visual Studio / MDbg and other debuggers use to debug managed code).
 Here’s the background knowledge:
 1) When a thread first executes actual IL (whether jitted or ngenned) code, ICorDebug issues an ICorDebugManagedCallback::CreateThread callback to the debugger. This callback provides the debugger with an ICorDebugThread object which the debugger can then use to inspect the new thread. In other words, when MDbg gets the CreateThread callback, it can take a managed-only callstack and will see exactly 1 frame.
 2) When a thread throws a managed exception, ICorDebug issues an ICorDebugManagedCallback::Exception callback. This callback includes a non-null ICorDebugThread object to tell the debugger which thread the exception occurred on. ICorDebug guarantees that the thread object a) is explicitly non-null and b) was provided by a previous CreateThread callback.
 Some of you are probably already chuckling at how we painted ourselves into a corner here…
 So now what happens when an exception occurs on a managed thread before any IL is run? ICorDebug would have to send an Exception callback for a thread that never got a CreateThread callback, and it can’t do that without violating the guarantees in #2 above.
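 To make the contradiction concrete, here is a toy Python model of the two guarantees. It is only a sketch: the real ICorDebug interfaces are COM interfaces, and every name below (ToyDebugEvents, ContractViolation, run_thread, fails_before_il) is hypothetical and exists only for illustration.

```python
class ContractViolation(Exception):
    """Raised when the toy model would have to break guarantee #2."""


class ToyDebugEvents:
    """Toy stand-in for the callback stream; not the real COM interface."""

    def __init__(self):
        self.announced = set()  # threads the debugger learned about via CreateThread
        self.log = []           # callbacks delivered so far, in order

    def create_thread(self, thread_id):
        # Models guarantee #1: sent when the thread first executes IL.
        self.announced.add(thread_id)
        self.log.append(("CreateThread", thread_id))

    def exception(self, thread_id):
        # Models guarantee #2: the thread is non-null and was already announced.
        if thread_id is None or thread_id not in self.announced:
            raise ContractViolation(
                f"no CreateThread was sent for thread {thread_id!r}")
        self.log.append(("Exception", thread_id))


def run_thread(events, thread_id, fails_before_il):
    """Simulate a thread that either reaches IL or throws before it gets there."""
    if fails_before_il:
        # No IL ever ran, so CreateThread was never sent for this thread.
        events.exception(thread_id)
        return
    events.create_thread(thread_id)  # first IL executes
    events.exception(thread_id)      # a later exception is fine


if __name__ == "__main__":
    events = ToyDebugEvents()
    run_thread(events, 1, fails_before_il=False)
    try:
        run_thread(events, 2, fails_before_il=True)
    except ContractViolation as err:
        print("cannot deliver Exception:", err)
```

 In this model the second thread has no legal callback sequence at all, which is the corner described above.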
 Some may ask “but how could that happen?” The real question is “how could that possibly not happen?” After all, there’s a lot that happens between when a native thread is created and when that native thread is actually executing jitted IL code, including:
 – loading types (could throw a TypeLoadException).
 – jitting the code (could fail, perhaps from running out of memory).
 – loading a corrupt PE image (may throw some sort of InvalidProgram exception).
 – running security checks (may throw some security exception).
 This turns out to be one of our most common Watson buckets.
 To make matters worse, there’s no way to fix it without changing the interface. I think the ideal solution would be to provide some sort of PreCreateThread notification that fires when the thread is first created but well before it runs any managed code. This effectively weakens the guarantees in #2 and thus allows us to fulfill them.
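 As a rough illustration of the PreCreateThread idea, here is a toy Python model. No such callback exists in ICorDebug; pre_create_thread and the class ToyDebugEventsWithPreCreate are hypothetical names, and the exact rules are assumptions made for the sketch.

```python
class ToyDebugEventsWithPreCreate:
    """Toy model of a weakened contract; not a real ICorDebug interface."""

    def __init__(self):
        self.known = set()  # threads announced by the hypothetical PreCreateThread
        self.log = []       # callbacks delivered so far, in order

    def pre_create_thread(self, thread_id):
        # Hypothetical: sent when the thread is created, before any managed code runs.
        self.known.add(thread_id)
        self.log.append(("PreCreateThread", thread_id))

    def create_thread(self, thread_id):
        # Still sent when the thread first executes IL.
        if thread_id not in self.known:
            raise ValueError(f"thread {thread_id!r} was never pre-announced")
        self.log.append(("CreateThread", thread_id))

    def exception(self, thread_id):
        # Weakened guarantee: the thread only needs a prior PreCreateThread.
        if thread_id not in self.known:
            raise ValueError(f"thread {thread_id!r} was never pre-announced")
        self.log.append(("Exception", thread_id))


if __name__ == "__main__":
    events = ToyDebugEventsWithPreCreate()
    events.pre_create_thread(2)  # thread exists, no IL yet
    events.exception(2)          # e.g. a type load failure before any IL ran
    print(events.log)
```

 With the earlier notification in place, an exception before any IL runs has a thread object it can legally refer to.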
 Other non-breaking solutions include:
 1) Fire a managed log message to at least notify the user. LogMessages require a thread object too, but we could actually just pick a random one. (This is the solution we ended up using for V2.0)
 2) Fire an MDA (Managed Debug Assistant, which are basically log messages with rich data). MDAs explicitly don’t require an ICorDebugThread object to avoid this same sort of problem. This is a fancier version of the LogMessage solution.
 3) Ignore the exception event completely. However, this means that the thread will disappear underneath the user and could be very confusing.
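 The LogMessage fallback in option 1 can be sketched in Python as below. This is a toy model, not the V2.0 implementation: report_early_exception, its parameters, and the tuple shapes are all hypothetical, and returning None when no thread has been announced yet is an assumption of the sketch.

```python
import random


def report_early_exception(announced, thread_id, description, choose=random.choice):
    """Decide which toy event to send for an exception on thread_id.

    announced: set of thread ids that already had a CreateThread callback.
    choose:    picks the carrier thread for the log message (random by default).
    """
    if thread_id in announced:
        # Normal case: guarantee #2 holds, so send the real Exception event.
        return ("Exception", thread_id, description)
    if not announced:
        # Assumption: with no announced thread there is nothing to attach to.
        return None
    # Fallback: attach a log message to some already-announced thread.
    carrier = choose(sorted(announced))
    return ("LogMessage", carrier, f"thread {thread_id}: {description}")


if __name__ == "__main__":
    print(report_early_exception({1, 2}, 1, "NullReferenceException"))
    print(report_early_exception({1, 2}, 7, "TypeLoadException"))
```

 The debugger still never gets an Exception event for the failing thread; the log message only gives the user a hint about why the thread went away.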

Comments (3)

  1. David Srbecky says:

    It is a big surprise for me that ICorDebugManagedCallback::CreateThread is not called as soon as the thread is created – I must have missed this in the documentation.

    Is it possible to cheat ICorDebugManagedCallback::CreateThread? I mean, if an exception happens before IL code is executed, could you just handle it by pushing some IL on the callstack, calling ICorDebugManagedCallback::CreateThread and finally calling ICorDebugManagedCallback::Exception?

  2. That’s a tempting suggestion. "Cheating" the API is usually one of the only ways to deal with API flaws. It amounts to using the existing API in some new creative way to dance around issues like this. Usually it’s very fragile, and a headache, and breaks down somewhere.

    In this case, we couldn’t _actually_ push some IL on the stack: that has too many ramifications. It would mean calling IL after the 1st-chance exception but before the exception dispatch, which would break tons of invariants.

    But could ICorDebug just pretend there’s IL on the stack? Even that has problems. For example, what if the user had placed a breakpoint in that IL? Do we now have to pretend to fire those breakpoints? What about the LoadModule callbacks for that IL?

    I’m sure that if we were tenacious enough, we could come up with some scheme that worked 95% of the time; but it would be very ugly, very fragile, and probably cause everybody more grief than it was worth.

  3. David Srbecky says:

    There is one more ‘nice’ solution: change the specification and call ICorDebugManagedCallback::CreateThread as soon as the thread is created if the user is using ICorDebug v2.0. If, for some reason, the user is relying on the fact that ICorDebugManagedCallback::CreateThread is called when managed code is on the stack, he is going to find out as soon as he changes from v1.1 to v2.0.