Recognizing Excessive Thread Switching

Sometimes the source of a performance problem is not that code is too slow, but that threads switch too frequently. When the system switches between threads, the system consumes extra cycles in scheduling activities such as selecting the next thread to run, or switching to the new thread, rather than spending the time executing application code. In general, the less time threads run before switching out, the more total time the system spends scheduling new threads to run.

The default amount of time that a thread runs before switching to another thread is called the Default Thread Quantum, dwDefaultThreadQuantum. For example, if the default thread quantum is 100 milliseconds, then a thread would run continuously for 100 milliseconds before another thread ran, unless preempted by a higher-priority thread.

In a system where threads run to the end of a 100-millisecond quantum, there is one thread switch per 100 milliseconds, or 10 thread switches per second. In contrast, in a system where threads run only 50 microseconds before switching, there is one switch per 50 microseconds, or 20,000 thread switches per second. Switching at such an elevated rate has a dramatic performance impact.

You can use Remote Kernel Tracker to identify these situations. Remote Kernel Tracker depicts thread switches as vertical green lines.

The following screenshot shows an example of an application with excessive thread switching. In this picture, I’ve set two markers in the time scale at the top of the display. The difference in time between the two is displayed in the status bar on the bottom right, so from time marker 1 to time marker 2 is 10 milliseconds. As you can see, there are numerous vertical green lines; each represents a switch from one thread to another.

(scroll down if you can't see anything)

 

You also can see above that the threads seem to be continuously repeating a pattern. The next picture zooms in on that pattern:

(again, scroll down if you can't see anything)

 

As you can see in the preceding picture, the “OpenDeviceKey” thread runs the longest continuous time, about 0.6 milliseconds. The next longest run-time in this screenshot, “RndisSendRndisMessage ,” is 0.16 milliseconds. The greatest switching activity, between “UsbInterruptThreadStub” and “DmacIntrThread ,” occurs at a rate of one switch approximately every 50 microseconds. One thread switch per 50us is a rate of 20,000 thread switches per second. This means that thread switching occurs at a rate that is about 2000 times higher than the ideal.

Let me describe where these screen shots came from:

The problem: Poor USB throughput.  On particular hardware we expected to get about 50Mbps throughput, but were getting only about 23Mbps.

The investigation: We recorded a Monte Carlo profiling trace, with these results:

HITS PER SYMBOL

Hits Percent Address Module Routine
6557 7.3 0x8012C8FC NK.EXE IDLE_STATE
4643 5.2 0x8010E330 NK.EXE INTERRUPTS_ENABLE
2882 3.2 0x02FEA884 tcpstk.dll tcpxsum
2166 2.4 0x80110894 NK.EXE ObjectCall
2123 2.4 0x03FA37B8 coredll.dll memcpybigblk
2064 2.3 0x8010DB7C NK.EXE PrefetchAbort
1732 1.9 0x80110EC0 NK.EXE ServerCallReturn
1518 1.7 0x801491D4 NK.EXE OEMWriteDebugByte
1468 1.6 0x8011A094 NK.EXE SleepOneMore
1460 1.6 0x8011BCDC NK.EXE WaitOneMore
1418 1.6 0x80119360 NK.EXE KCNextThread
1405 1.6 0x8011C154 NK.EXE SC_WaitForMultiple
1404 1.6 0x8010E00C NK.EXE IRQHandler
1255 1.4 0x8010DEC0 NK.EXE SaveAndReschedule
1212 1.4 0x03F73CA0 coredll.dll xxx_CeLogData
1145 1.3 0x03F68258 coredll.dll xxx_GetTickCount
979 1.1 0x8010EF34 NK.EXE KCALLPROFON
908 1.0 0x8010EFE0 NK.EXE KCALLPROFOFF
865 1.0 0x017A4544 pxa27x_dma.dll InterruptDone

The lines in green are all related to making API calls.  The ones in blue are all involved with thread rescheduling.  I was a little suspicious about the number of API calls and the amount of thread rescheduling that was going on here.  KCNextThread is the primary kernel function which selects the next thread to run, and having that show up 1.6 percent of the time is pretty significant.  The API call time might also be partly blamed on calling into the kernel to reschedule.

Then we looked at a Kernel Tracker trace for the same scenario, and saw the screen shots above.  I realized how much thread switching was going on, and brought it to the attention of the folks who owned the code involved.

The fix: They changed the design so that the work done between the two most active threads (UsbInterruptThreadStub and DmacIntrThread) was all on one thread instead of shared between two.  As a result, the USB throughput jumped from 23Mbps to almost 41Mbps.