Sometimes the source of a performance problem is not that code is too slow, but that threads switch too frequently. Every time the system switches between threads, it spends cycles on scheduling work, such as selecting the next thread to run and switching to it, instead of executing application code. In general, the less time each thread runs before switching out, the more total time the system spends on scheduling.
The default amount of time that a thread runs before the system switches to another thread is called the default thread quantum, dwDefaultThreadQuantum. For example, if the default thread quantum is 100 milliseconds, a thread runs continuously for 100 milliseconds before another thread runs, unless it blocks or is preempted by a higher-priority thread.
In a system where threads run to the end of a 100-millisecond quantum, there is one thread switch per 100 milliseconds, or 10 thread switches per second. In contrast, in a system where threads run only 50 microseconds before switching, there is one switch per 50 microseconds, or 20,000 thread switches per second. Switching at such an elevated rate has a dramatic performance impact.
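To make the arithmetic above concrete, here is a minimal sketch that computes the switch rate for both cases and estimates how much time goes to switching. The per-switch cost of 5 microseconds is purely an illustrative assumption, not a measured Windows CE figure, and the function names are made up for this example.

```python
def switches_per_second(run_time_us):
    """Switches per second if each thread runs run_time_us microseconds per turn.

    Like the text, this ignores the time taken by the switch itself.
    """
    return 1_000_000 / run_time_us


def scheduling_overhead(run_time_us, switch_cost_us):
    """Fraction of total time spent switching rather than running application code."""
    return switch_cost_us / (run_time_us + switch_cost_us)


# ASSUMPTION: 5 microseconds per switch is a made-up illustrative cost,
# not a measured value for any real system.
SWITCH_COST_US = 5.0

for label, run_time_us in [("100 ms quantum", 100_000), ("50 us runs", 50)]:
    rate = switches_per_second(run_time_us)
    overhead = scheduling_overhead(run_time_us, SWITCH_COST_US)
    print(f"{label}: {rate:,.0f} switches/s, {overhead:.3%} of time spent switching")
```

Under that assumed cost, the full-quantum case loses a negligible fraction of its time to switching, while the 50-microsecond case loses roughly 9 percent. The real cost depends on the hardware and kernel, so treat the percentages as a shape, not a measurement.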
You can use Remote Kernel Tracker to identify these situations. Remote Kernel Tracker depicts thread switches as vertical green lines.
The following screenshot shows an example of an application with excessive thread switching. In this picture, I’ve set two markers in the time scale at the top of the display. The status bar at the bottom right shows the time difference between them: 10 milliseconds from marker 1 to marker 2. As you can see, there are numerous vertical green lines; each represents a switch from one thread to another.
You can also see in the screenshot above that the threads seem to repeat a pattern continuously. The next picture zooms in on that pattern:
As you can see in the preceding picture, the “OpenDeviceKey” thread runs for the longest continuous time, about 0.6 milliseconds. The next longest run in this screenshot, “RndisSendRndisMessage,” lasts 0.16 milliseconds. The greatest switching activity, between “UsbInterruptThreadStub” and “DmacIntrThread,” occurs at a rate of approximately one switch every 50 microseconds. One thread switch per 50 microseconds is a rate of 20,000 thread switches per second. That is about 2000 times higher than the ideal of 10 switches per second, which is what you get when threads run to the end of a 100-millisecond quantum.
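If you count the green lines between two markers, you can turn that count into a rate the same way. The sketch below is only an illustration: the function name is invented, and the count of 200 switches in a 10-millisecond window is a hypothetical value chosen to match the 50-microsecond spacing, not a number read off the trace.

```python
def switch_rate_from_markers(switch_count, window_ms):
    """Return (average microseconds between switches, switches per second)
    for switch_count switches counted between two markers window_ms apart."""
    if switch_count <= 0 or window_ms <= 0:
        raise ValueError("switch_count and window_ms must be positive")
    interval_us = window_ms * 1000 / switch_count
    per_second = switch_count * 1000 / window_ms
    return interval_us, per_second


IDEAL_PER_SECOND = 10  # one switch per 100 ms quantum, as described in the text

# ASSUMPTION: 200 switches in 10 ms is a hypothetical count used for
# illustration; it was not measured from the screenshots.
interval_us, per_second = switch_rate_from_markers(200, 10)
print(f"{interval_us:.0f} us between switches, {per_second:,.0f} switches/s, "
      f"{per_second / IDEAL_PER_SECOND:,.0f}x the full-quantum rate")
```

An average over a whole marker window will understate the worst bursts, so it is worth zooming in on the busiest part of the pattern before counting.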
Let me describe where these screenshots came from:
The problem: Poor USB throughput. On particular hardware we expected to get about 50Mbps throughput, but were getting only about 23Mbps.
The investigation: We recorded a Monte Carlo profiling trace, with these results:
[Table: hits per symbol]
The lines in green are all related to making API calls. The ones in blue are all involved with thread rescheduling. I was a little suspicious about the number of API calls and the amount of thread rescheduling going on here. KCNextThread is the primary kernel function that selects the next thread to run, and having it show up 1.6 percent of the time is pretty significant. Part of the API call time might also be spent calling into the kernel to reschedule.
Then we looked at a Kernel Tracker trace for the same scenario, and saw the screenshots above. I realized how much thread switching was going on, and brought it to the attention of the folks who owned the code involved.
The fix: They changed the design so that the work previously split between the two most active threads (UsbInterruptThreadStub and DmacIntrThread) all ran on one thread. As a result, the USB throughput jumped from 23Mbps to almost 41Mbps.