Interop-stepping is 400 times faster in Whidbey than in Everett

We’ve made Interop-debugging (aka “mixed-mode debugging”) much faster and more stable in Whidbey (VS8) than it was in Everett (VS7).

Interop-stepping was very slow in Everett.
One way to measure this is to step over a line like this:
 int x = GetCount(dwStartMS)
+ (a+b+c)+(a+b+c)+(a+b+c)+(a+b+c)+(a+b+c)+(a+b+c)+(a+b+c)+(a+b+c)+(a+b+c)+(a+b+c)+(a+b+c)
+ GetCount(dwEndMS);

And then look at (dwEndMS - dwStartMS) after the step completes to see how many milliseconds the step took.

Why does that work?
Step-over works by using the CPU's trace flag to single-step over each instruction in a basic block. The CPU executes a single instruction and then raises a single-step exception, which traps to the debugger. The debugger processes that single-step exception and, if the thread is still on the same source line, keeps stepping. This extra exception overhead is why stepping over a statement is slower than simply executing the statement. To step over calls, the debugger places a breakpoint after the call instruction and runs to it, so the call itself executes at normal speed.
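To make that concrete, here's a minimal sketch of how a debugger can arm a single-step with the raw Win32 debug API. This is just an illustration under my own assumptions (the function name ArmSingleStep is hypothetical, and this is not how Visual Studio's stepper is actually structured); the point is simply that the trace flag lives in the thread's context.

#include <windows.h>

// Minimal sketch, not VS's implementation: arm a single-step on a thread
// that is currently stopped at a debug event by setting the x86 trace flag
// (TF, bit 8 of EFlags). After the debugger continues the thread, the CPU
// executes exactly one instruction and then raises a single-step exception.
void ArmSingleStep(HANDLE hThread)
{
    CONTEXT ctx = { 0 };
    ctx.ContextFlags = CONTEXT_CONTROL;   // EFlags is part of the control registers
    GetThreadContext(hThread, &ctx);
    ctx.EFlags |= 0x100;                  // set the trace flag
    SetThreadContext(hThread, &ctx);
}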

This means a step's speed is roughly proportional to the size of the basic block it steps over. All the (a+b+c) operations in the line magnify the step duration, which lets us measure it more accurately. The GetCount() calls provide an accurate way to set dwStartMS and dwEndMS at the start and end of the step. This saves the hassle of trying to coordinate the step-over with a separate timing operation, and it also avoids measuring any work the debugger does after the step, such as refreshing the callstack.
I'd recommend double-checking the disassembly to make sure that everything's there and looks as expected.

Using this, we can measure a few different scenarios:
Not-stepping: How fast does the statement execute at full speed when not stepping over it? Measure this by placing a breakpoint after the line and running to it. This gives us the best-case baseline.
Managed code (managed-only debugging): How fast can we step over the line if it's managed code (e.g., IL, using #pragma managed in MC++, as sketched just after this list) when managed-only debugging?
Managed code (interop-debugging): How fast can we step over the line if it's managed code when interop-debugging?
Native code (native-only debugging): How fast can we step over the line if it's native code (using #pragma unmanaged in MC++) when native-only debugging?
Native code (interop-debugging): How fast can we step over the line if it's native code when interop-debugging?
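For anyone who hasn't used these pragmas, here is a rough sketch of how the same routine can be compiled to either IL or native x86 in an MC++ (/clr) source file. The Test_Managed/Test_Native names and the exact placement of the switch points are my own illustration, not necessarily how the original test file was laid out.

// Compiled with /clr (MC++). #pragma managed / #pragma unmanaged control
// whether the functions that follow are emitted as IL or as native code.
#pragma managed       // functions below compile to IL
void Test_Managed()
{
    // ... same body as the Test() function shown at the end of this post
}

#pragma unmanaged     // functions below compile to native x86 code
void Test_Native()
{
    // ... same body as the Test() function shown at the end of this post
}
#pragma managed       // switch back to IL for the rest of the file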

The measurements:
This table shows the measurements for the various scenarios. All measurements were done on the same machine (1 GHz, dual-proc). I ran each scenario several times, threw out the top and bottom results, and averaged the rest.

Scenario                 | Everett                     | Whidbey  | Ratio (Everett/Whidbey) | Whidbey ratio (current/fastest)
Managed (managed-only)   | 203 ms                      | 78 ms    | 2.60                    | 1.0 (fastest)
Managed (interop)        | 484 ms                      | 390 ms   | 1.24                    | 5.0
Native (native-only)     | 750 ms                      | 734 ms   | 1.02                    | 9.4
Native (interop)         | 633,494 ms (~10.5 minutes)  | 1,343 ms | 471.70                  | 17.2 (slowest)

In all scenarios, not-stepping was < 1 ms, so I couldn't get an accurate measurement. (I suppose I could add a ton more instructions, but I'll save that for another blog entry.) That means all these numbers show the full overhead of stepping and are not tainted with the cost of actually executing the underlying code.
The last column shows how each Whidbey scenario compares to the fastest Whidbey scenario. The Everett/Whidbey ratio (second-to-last column) shows the speedup from Everett to Whidbey.
So you can see that raw interop-stepping through native code has gotten ~400 times faster from Everett to Whidbey. Now I admit this is an exaggerated scenario, and it doesn't measure the work that happens after stepping (like refreshing the debugger windows), but it's still a great improvement.

Explaining the results
The ordering of the different configurations roughly makes sense once you understand what's going on.
Managed-only execution control (like stepping and breakpoints) is all done in-process (see here for details). That means the single-step exceptions are just normal exceptions occurring on a single thread, and there's no additional overhead beyond SEH. Furthermore, all the stepping logic is done in-process inside SEH filters, so no cross-process communication is needed in the middle of the step.
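As a loose illustration of that in-process path (an assumed demonstration only; the InProcessDemo/FilterSingleStep names are mine and this is not the CLR's actual stepping code), an SEH filter inside the same process can observe a single-step exception without any debug event ever leaving the process:

#include <windows.h>
#include <stdio.h>

// Assumed illustration, not the CLR's stepper: an SEH filter in the same
// process sees the single-step exception directly; no cross-process
// debug event round trip is required.
static LONG FilterSingleStep(DWORD dwCode)
{
    return (dwCode == EXCEPTION_SINGLE_STEP)
        ? EXCEPTION_EXECUTE_HANDLER
        : EXCEPTION_CONTINUE_SEARCH;
}

void InProcessDemo()
{
    __try
    {
        // Simulate the exception the trace flag would generate.
        RaiseException(EXCEPTION_SINGLE_STEP, 0, 0, NULL);
    }
    __except (FilterSingleStep(GetExceptionCode()))
    {
        printf("Handled a single-step exception without leaving the process.\n");
    }
}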
Once we're native or interop debugging, each exception becomes a full-blown native debug event. That means each exception stops the entire process (not just a single thread), notifies the debugger, and then waits for the debugger to continue it.
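For contrast, here's a rough sketch of what servicing one of those native debug events looks like with the plain Win32 debug API (the DebugLoop name is hypothetical, and real debuggers handle many more event types). Every single-step exception makes this cross-process round trip, with the whole debuggee frozen until ContinueDebugEvent is called.

#include <windows.h>

// Rough sketch of a native debugger's event loop. Between WaitForDebugEvent
// returning and ContinueDebugEvent being called, every thread in the
// debuggee is frozen, which is why each single-step exception costs far
// more here than in the in-process managed-only case.
void DebugLoop()
{
    DEBUG_EVENT de;
    while (WaitForDebugEvent(&de, INFINITE))
    {
        DWORD dwContinue = DBG_EXCEPTION_NOT_HANDLED;
        if (de.dwDebugEventCode == EXCEPTION_DEBUG_EVENT &&
            de.u.Exception.ExceptionRecord.ExceptionCode == EXCEPTION_SINGLE_STEP)
        {
            // Decide whether to keep stepping (re-arm the trace flag)
            // or report that the step is complete.
            dwContinue = DBG_CONTINUE;
        }
        ContinueDebugEvent(de.dwProcessId, de.dwThreadId, dwContinue);
        if (de.dwDebugEventCode == EXIT_PROCESS_DEBUG_EVENT)
            break;
    }
}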
Interop-debugging is slower than native-only debugging because interop needs to do additional filtering to determine whether each native debug event is for the managed or the native debug engine. Native-only debugging knows the event is for the native debug engine (since that's the only one) and can skip this step. It turns out that in Everett this filtering was extremely expensive and accounted for a huge portion of the slowdown. One single optimization in ICorDebug was responsible for the vast majority of this filtering perf win (yet another blog entry waiting to be written...). The VS guys also did some optimizations.

Here are those functions in more context:
#include <windows.h>
#include <stdio.h>

int GetCount(DWORD & dw);   // forward declaration so Test() compiles

void Test()
{
    DWORD dwStartMS;
    DWORD dwEndMS;
    int a = 5, b = 6, c = 7;

    // GetCount() stamps the tick count at the start and end of the statement,
    // so stepping over this one line measures the duration of the step itself.
    int x = GetCount(dwStartMS)
        + (a+b+c)+(a+b+c)+(a+b+c)+(a+b+c)+(a+b+c)+(a+b+c)+(a+b+c)+(a+b+c)+(a+b+c)+(a+b+c)+(a+b+c)
        + GetCount(dwEndMS);

    DWORD dwDiffMS = (dwEndMS - dwStartMS);
    printf("Time (ms): %d\n", dwDiffMS);
}

int GetCount(DWORD & dw)
{
    dw = GetTickCount();
    return 0;
}