Examples of goofy bugs

Here’s a sampling of goofy bugs we’ve had to deal with in the CLR debugging services over the years. I mention these so that you can consider whether any of the hot “silver-bullet” testing / software-development techniques would ever catch bugs like this. Some of these are watered down from the original problem to make them easier to understand. (Yes, we’ve fixed all of these.)

The CLR crashed if you stopped at an assembly instruction in a generic method in managed code at a non-sequence point with an IL stack depth of 2 and then took a callstack.

Trying to call Debugger.Log() with exactly 150 characters crashed.

Lots and lots and lots of race conditions. For example, thread 1 is trying to set up a func-eval while thread 2 happens to start a garbage collection (a very random event), which corrupts an object referenced in the func-eval parameters, which causes the CLR to crash two hours later.
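
The flavor of that race can be sketched with a toy model (this is plain Python, not CLR code; the “heap,” “GC,” and “func-eval” here are invented stand-ins, and events force the bad interleaving deterministically just for illustration):

```python
import threading

heap = {"obj": [1, 2, 3]}   # pretend managed-heap slot
captured = {}               # parameters captured for the func-eval
ready = threading.Event()
moved = threading.Event()
log = []

def funceval_setup():
    # Step 1: capture a raw reference to the object (no GC protection).
    captured["arg"] = heap["obj"]
    ready.set()
    moved.wait()            # the "GC" runs in the window before the eval
    # Step 3: the captured reference no longer matches the live object.
    log.append(captured["arg"] is heap["obj"])

def gc_thread():
    ready.wait()
    # Step 2: a compacting "GC" moves the object; the heap slot is updated,
    # but the raw reference captured in step 1 is not.
    heap["obj"] = list(heap["obj"])
    moved.set()

t1 = threading.Thread(target=funceval_setup)
t2 = threading.Thread(target=gc_thread)
t1.start(); t2.start(); t1.join(); t2.join()
print(log[0])   # False: the func-eval is left holding a stale reference
```

In the toy version the stale reference is immediately visible; in the real bug, the corruption sat latent until the CLR finally tripped over it hours later.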

On a checked-only build of the product, on a Japanese-only OS, step-in became a step-over when you tried to step in to a method through an interface call where the interface had a particular attribute on it, the caller was an optimized NGen DLL, and the target was a non-optimized, non-NGenned, unjitted method.

We weren’t handling getting native debug events from suspended threads. (Naively, you’d expect a suspended thread wouldn’t generate debug events.) It turns out the thread dispatched the native debug event to the debugger, and then we suspended it; dispatched native debug events will still be retrieved from WaitForDebugEvent. I briefly alluded to that here. (I could do a whole post on goofy issues with the native debug API.)
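
The ordering trap is easy to see in a toy queue model (again plain Python, not the real Win32 API; the function names are invented analogues of WaitForDebugEvent and SuspendThread):

```python
from collections import deque

# A thread posts its debug event to the kernel's queue *before* the
# debugger suspends it. Suspension prevents future events, but it cannot
# recall one that is already queued.

debug_event_queue = deque()
suspended = set()

def thread_hits_breakpoint(tid):
    # The event is queued at the moment the thread traps...
    debug_event_queue.append(("EXCEPTION_DEBUG_EVENT", tid))

def suspend_thread(tid):
    suspended.add(tid)

def wait_for_debug_event():
    # ...so the debugger still retrieves it, suspended thread or not.
    return debug_event_queue.popleft() if debug_event_queue else None

thread_hits_breakpoint(42)    # thread 42 traps
suspend_thread(42)            # debugger suspends it a moment later
event = wait_for_debug_event()
print(event)                  # ('EXCEPTION_DEBUG_EVENT', 42)
```

The bug was assuming the `suspended` set could be consulted to rule events out; the event from thread 42 arrives anyway.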

Debugging would hang if we got a native breakpoint debug event (0x80000003) from an IP at which there was no int3 instruction. (We didn’t think this was possible.)
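
The defensive fix amounts to a sanity check before trusting the event. Here’s a hypothetical sketch of that check (the function and its parameters are invented for illustration, not the actual CLR code): on a breakpoint exception, read the byte at the faulting IP and confirm it really is an int3 (0xCC) before treating the event as a breakpoint you own.

```python
INT3 = 0xCC  # x86 breakpoint opcode

def classify_breakpoint_event(memory: bytes, ip: int, our_breakpoints: set):
    # A 0x80000003 event with no int3 at the IP is spurious: handle it
    # and continue, rather than waiting forever for a breakpoint we set.
    if ip >= len(memory) or memory[ip] != INT3:
        return "spurious"
    return "ours" if ip in our_breakpoints else "foreign"

code = bytes([0x90, 0xCC, 0x90, 0x90])            # nop, int3, nop, nop
print(classify_breakpoint_event(code, 1, {1}))    # ours
print(classify_breakpoint_event(code, 3, set()))  # spurious
```

The lesson generalizes: a debugger shouldn’t assume every breakpoint event corresponds to a breakpoint it planted.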

On Win98 only, doing an async-break that just happens to land in the fraction of a second when the debuggee is unloading an appdomain, then setting up a func-eval on another thread to evaluate a method from the to-be-unloaded appdomain, and then continuing could crash.

If you attached during a module load, detached, and then reattached, we missed a module-load notification on the reattach (which eventually led to a crash). (Related to the issue here.)

The CLR crashed from races when doing a “detach from debuggee” while simultaneously killing the debuggee from Task Manager. The timing is such that it only repros when the debugger is not itself being debugged.

Perhaps you could claim bugs like the ones above are too much of a corner case. The problem is that once you have millions of people using your product, enough of them hit the corner cases that you still need to think about them.

Comments (4)

  1. Chris Bilson says:

While I agree that there is no silver bullet that would catch these types of bugs, I would have to point out that a tool that helps you set up the conditions for the bug to occur couldn’t possibly hurt. I don’t think silver bullets are any better than regular bullets, but I would be surprised, and a little horrified, frankly, if you were working on problems as difficult as these sound without at least something to help you set up conditions. I would imagine you would want to have some kind of uber-test harness that embeds the CLR and allows you to have lots of control over things like when garbage collections happen, when something gets jit’d, etc. Please tell me that’s the case!

  2. jmstall says:

Chris – you’re absolutely right. Testability is an extremely high priority for us and we work very hard at it.

The CLR has ~a billion internal knobs that we can twiddle (especially around the GC) to help us catch these sorts of things. We have all sorts of "stress" modes we can run the CLR in (e.g., "GC at every instruction") to help catch bugs too.

    This is in addition to more traditional testing techniques like unit tests, scenario-tests, and code-coverage tests.

  3. Chris Bilson says:

I’ve mentioned this to Microsoft folks before (MSDN people, developer evangelists, etc.), but I think it would be really cool to know a little bit more about the testing story at Microsoft. On several occasions I have considered getting a temp SDT job just to take notes. I wonder how much of the testing infrastructure is secret sauce, and how much would be OK to share with the world. I have heard an explanation of how BVTs work and how they are built, etc., several times, so this must be OK to talk about. I also know some other misc. details, but would love to hear a more complete story. Your testing story is orders of magnitude more complicated than my testing story, and I think I could benefit from hearing about yours.

  4. jmstall says:

    I think the vast majority of our testing stuff is public.

Can you give me a list of your top N questions? I’ll use that as future blog fodder.