Recognizing Excessive Thread Switching

Article
05/23/2005

Sometimes the source of a performance problem is not that code is too slow, but that threads switch too frequently. When the system switches between threads, the system consumes extra cycles in scheduling activities such as selecting the next thread to run, or switching to the new thread, rather than spending the time executing application code. In general, the less time threads run before switching out, the more total time the system spends scheduling new threads to run.

The default amount of time that a thread runs before switching to another thread is called the Default Thread Quantum, dwDefaultThreadQuantum. For example, if the default thread quantum is 100 milliseconds, then a thread would run continuously for 100 milliseconds before another thread ran, unless preempted by a higher-priority thread.

In a system where threads run to the end of a 100-millisecond quantum, there is one thread switch per 100 milliseconds, or 10 thread switches per second. In contrast, in a system where threads run only 50 microseconds before switching, there is one switch per 50 microseconds, or 20,000 thread switches per second. Switching at such an elevated rate has a dramatic performance impact.

You can use Remote Kernel Tracker to identify these situations. Remote Kernel Tracker depicts thread switches as vertical green lines.

The following screenshot shows an example of an application with excessive thread switching. In this picture, I’ve set two markers in the time scale at the top of the display. The difference in time between the two is displayed in the status bar on the bottom right, so from time marker 1 to time marker 2 is 10 milliseconds. As you can see, there are numerous vertical green lines; each represents a switch from one thread to another.

(scroll down if you can't see anything)

You also can see above that the threads seem to be continuously repeating a pattern. The next picture zooms in on that pattern:

(again, scroll down if you can't see anything)

As you can see in the preceding picture, the “OpenDeviceKey” thread runs the longest continuous time, about 0.6 milliseconds. The next longest run-time in this screenshot, “RndisSendRndisMessage ,” is 0.16 milliseconds. The greatest switching activity, between “UsbInterruptThreadStub” and “DmacIntrThread ,” occurs at a rate of one switch approximately every 50 microseconds. One thread switch per 50us is a rate of 20,000 thread switches per second. This means that thread switching occurs at a rate that is about 2000 times higher than the ideal.

Let me describe where these screen shots came from:

The problem: Poor USB throughput. On particular hardware we expected to get about 50Mbps throughput, but were getting only about 23Mbps.

The investigation: We recorded a Monte Carlo profiling trace, with these results:

HITS PER SYMBOL

Hits	Percent	Address	Module	Routine
6557	7.3	0x8012C8FC	NK.EXE	IDLE_STATE
4643	5.2	0x8010E330	NK.EXE	INTERRUPTS_ENABLE
2882	3.2	0x02FEA884	tcpstk.dll	tcpxsum
2166	2.4	0x80110894	NK.EXE	ObjectCall
2123	2.4	0x03FA37B8	coredll.dll	memcpybigblk
2064	2.3	0x8010DB7C	NK.EXE	PrefetchAbort
1732	1.9	0x80110EC0	NK.EXE	ServerCallReturn
1518	1.7	0x801491D4	NK.EXE	OEMWriteDebugByte
1468	1.6	0x8011A094	NK.EXE	SleepOneMore
1460	1.6	0x8011BCDC	NK.EXE	WaitOneMore
1418	1.6	0x80119360	NK.EXE	KCNextThread
1405	1.6	0x8011C154	NK.EXE	SC_WaitForMultiple
1404	1.6	0x8010E00C	NK.EXE	IRQHandler
1255	1.4	0x8010DEC0	NK.EXE	SaveAndReschedule
1212	1.4	0x03F73CA0	coredll.dll	xxx_CeLogData
1145	1.3	0x03F68258	coredll.dll	xxx_GetTickCount
979	1.1	0x8010EF34	NK.EXE	KCALLPROFON
908	1.0	0x8010EFE0	NK.EXE	KCALLPROFOFF
865	1.0	0x017A4544	pxa27x_dma.dll	InterruptDone

The lines in green are all related to making API calls. The ones in blue are all involved with thread rescheduling. I was a little suspicious about the number of API calls and the amount of thread rescheduling that was going on here. KCNextThread is the primary kernel function which selects the next thread to run, and having that show up 1.6 percent of the time is pretty significant. The API call time might also be partly blamed on calling into the kernel to reschedule.

Then we looked at a Kernel Tracker trace for the same scenario, and saw the screen shots above. I realized how much thread switching was going on, and brought it to the attention of the folks who owned the code involved.

The fix: They changed the design so that the work done between the two most active threads (UsbInterruptThreadStub and DmacIntrThread) was all on one thread instead of shared between two. As a result, the USB throughput jumped from 23Mbps to almost 41Mbps.

Recognizing Excessive Thread Switching

Additional resources