Diagnosing high CPU by studying profiling results, example


A customer asked for assistance determining why their program demonstrated sporadic high CPU. It occurred only at a client location, so they couldn't set up a full debugging environment. They were able to convince the client to trace one of the spikes in a profiler. During this capture, high CPU was recorded for around 20% of the running time. Here is the drill-down of where most of that time was going.

Stack % Weight
[Root] 19.36
 |- ntdll.dll!RtlUserThreadStart 19.36
 |    kernel32.dll!BaseThreadInitThunk 19.36
 |    contoso.exe!__wmainCRTStartup 19.36
 |    contoso.exe!wWinMain 19.36
 | ⋮
 |    fabrikam.dll!Widget::~Widget 19.36
 |    fabrikam.dll!WidgetProxy::Disconnect 19.36
 |    fabrikam.dll!WidgetProxy::DisconnectAll 19.36
 |    fabrikam.dll!WidgetProxy::TransactCommand 19.36
 |    fabrikam.dll!WidgetProxy::WaitForResponse 19.36
 |    |- user32.dll!MsgWaitForMultipleObjectsEx 18.63
 |    |    |- kernel32.dll!WaitForMultipleObjectsExImplementation 17.37
 |    |    |    |- KernelBase.dll!WaitForMultipleObjectsEx 16.55
 |    |    |    |    |- ntdll.dll!ZwWaitForMultipleObjects 8.69
 |    |    |    |    |    |- ntoskrnl.exe!KiSystemServiceCopyEnd 4.76
 |    |    |    |    |    |    |- ntoskrnl.exe!NtWaitForMultipleObjects 4.71
 |    |    |    |    |    |    |    |- ntoskrnl.exe!ObpWaitForMultipleObjects 4.02
 |    |    |    |    |    |    |    |    |- ntoskrnl.exe!ObpWaitForMultipleObjects 4.02
 |    |    |    |    |    |    |    |    |- ntoskrnl.exe!KiInterruptDispatchNoLock <0.01
 |    |    |    |    |    |    |    |    |- ntoskrnl.exe!KiCheckForKernelApcDelivery <0.01
 |    |    |    |    |    |    |    |    |- ntoskrnl.exe!KiDpcInterrupt <0.01
 |    |    |    |    |    |    |    |    |- ntoskrnl.exe!KiInterruptDispatch <0.01
 |    |    |    |    |    |    |    |- ntoskrnl.exe!NtWaitForMultipleObjects 0.49
 |    |    |    |    |    |    |    |- ntoskrnl.exe!memcpy 0.20
 |    |    |    |    |    |    |    |- ntoskrnl.exe!KiInterruptDispatchNoLock <0.01
 |    |    |    |    |    |    |    |- ntoskrnl.exe!KiDpcInterrupt <0.01
 |    |    |    |    |    |    |- ntoskrnl.exe!KiSystemServiceCopyEnd 0.05
 |    |    |    |    |    |- ntoskrnl.exe!KiSystemCall64 1.84
 |    |    |    |    |    |- ntdll.dll!ZwWaitForMultipleObjects 1.20
 |    |    |    |    |    |- ntoskrnl.exe!KiSystemServiceExit 0.56
 |    |    |    |    |    |- ntoskrnl.exe!KiSystemServiceRepeat 0.16
 |    |    |    |    |    |- ntoskrnl.exe!KiSystemServiceGdiTebAccess 0.13
 |    |    |    |    |    |- ntoskrnl.exe!KiSystemServiceStart 0.03
 |    |    |    |    |    |- ntoskrnl.exe!KiSystemServiceCopyStart 0.01
 |    |    |    |    |    |- ntoskrnl.exe!KiInterruptDispatchNoLock <0.01
 |    |    |    |    |- KernelBase.dll!BaseSetLastNTError 6.57
 |    |    |    |    |    |- ntdll.dll!RtlNtStatusToDosError 6.33
 |    |    |    |    |    |    |- ntdll.dll!RtlNtStatusToDosErrorNoTeb 6.05
 |    |    |    |    |    |    |    |- ntdll.dll!RtlNtStatusToDosErrorNoTeb 6.04
 |    |    |    |    |    |    |    |- ntoskrnl.exe!KiDpcInterrupt <0.01
 |    |    |    |    |    |    |    |- ntoskrnl.exe!KiApcInterrupt <0.01
 |    |    |    |    |    |    |    |- ntoskrnl.exe!KiInterruptDispatchNoLock <0.01
 |    |    |    |    |    |    |- ntdll.dll!RtlNtStatusToDosError 0.29
 |    |    |    |    |    |    |- ntoskrnl.exe!KiInterruptDispatchNoLock <0.01
 |    |    |    |    |    |- KernelBase.dll!BaseSetLastNTError 0.19
 |    |    |    |    |    |- ntdll.dll!RtlSetLastWin32Error 0.05
 |    |    |    |    |- ntdll.dll!memcpy 0.15
 |    |    |    |    |- KernelBase.dll!memcpy 0.01
 |    |    |    |    |- ntoskrnl.exe!KiApcInterrupt <0.01
 |    |    |    |    |- ntoskrnl.exe!KiDpcInterrupt <0.01
 |    |    |    |- kernel32.dll!WaitForMultipleObjectsExImplementation 0.60
 |    |    |    |- ntdll.dll!memcpy 0.16
 |    |    |    |- kernel32.dll!WaitForMultipleObjectsEx 0.04
 |    |    |    |- kernel32.dll!memcpy 0.02
 |    |    |    |- ntoskrnl.exe!KiInterruptDispatchNoLock <0.01
 |    |    |    |- ntoskrnl.exe!KiApcInterrupt <0.01
 |    |    |- ntoskrnl.exe!KiDpcInterrupt <0.01
 |    |    |- ntoskrnl.exe!KiApcInterrupt <0.01
 |    |- KernelBase.dll!GetTickCount 0.24
 |    |- KernelBase.dll!GetLastError 0.20
 |    |- fabrikam.dll!GetLastError 0.09
 |    |- kernel32.dll!GetLastError 0.02

From this chart, you can see that all of the time is consumed in Widget­Proxy::Wait­For­Response, and most of that time is in Msg­Wait­For­Multiple­Objects­Ex. There is a lot of detail inside that function, so let's hide everything that contributes less than one percent.

Stack % Weight
 |    fabrikam.dll!WidgetProxy::WaitForResponse 19.36
 |    |- user32.dll!MsgWaitForMultipleObjectsEx 18.63
 |    |    |- kernel32.dll!WaitForMultipleObjectsExImplementation 17.37
 |    |    |    |- KernelBase.dll!WaitForMultipleObjectsEx 16.55
 |    |    |    |    |- ntdll.dll!ZwWaitForMultipleObjects 8.69
 |    |    |    |    |    |- ntoskrnl.exe!KiSystemServiceCopyEnd 4.76
 |    |    |    |    |    |    |- ntoskrnl.exe!NtWaitForMultipleObjects 4.71
 |    |    |    |    |    |- ntoskrnl.exe!KiSystemCall64 1.84
 |    |    |    |    |- KernelBase.dll!BaseSetLastNTError 6.57

If we look only at the time spent in Wait­For­Multiple­Objects­Ex, 8.69%/16.55% = around half of the time is spent in Zw­Wait­For­Multiple­Objects (the kernel part of the function which does the waiting), and 6.57%/16.55% = around 40% of the time is the time setting the last error code.

That's odd. Why is this function spending nearly half of its time setting the last error code?

My theory: Because the call is failing with an error!

That explains the high CPU. The call to Wait­For­Multiple­Objects­Ex is failing, which means that instead of waiting, it returns immediately with an error code. The Widget­Proxy::Wait­For­Response function doesn't quite know what to do in that case, so it shrugs its shoulders and tries waiting again. Eventually, whatever it's waiting for actually happens, and the call returns, but instead of waiting at low CPU, it accidentally created a CPU spin loop.

Comments (13)
  1. Karellen says:

    This might be a time where having the call count, as well as the timings themselves, can be useful in a profile. If you see that WaitForResponse() has been called 10 times, and it in turn has called MsgWaitForMultipleObjectsEx() 10,000,000 times, then it should immediately ring alarm bells that WFR() is calling MWFMOE() 1,000,000 times per call. Under what circumstances might it do that? Oh, here's a retry loop around the call. Yup, that'd do it.

    I've had that with a program that was spending a lot of time doing string comparisons. Yeah, it was doing quite a few string comparisons, and it's reasonable that string comparisons could take a lot of time. But when I realised *how many* string comparisons each iteration of my code was doing, that's when I had a good idea where I was going wrong.

    [Call counts were not available because the problem was at a client location. The client is probably not going to install a profiled build of your product into production just so you can get a performance trace. -Raymond]
  2. alegr1 says:

    @Karellen: XPERF doesn't count number of passes of a code path. It counts number of timer interrupts that hit the particular path.

  3. Ben Voigt says:

    There are different forms of profiling... passive (sampling) gets call stack when the sample occurs, but can't determine whether the function has exited and reentered, or never left.  Active (instrumented) can deliver call counts and much more precise timing, but degrades performance more and has to be configured at build time.

  4. James Sutherland says:

    Karellen's comment reminded me of an early VB program I tuned. The original turned out to spent more than 99% of the CPU time on ... updating the progress bar. Rather than an actual progress bar control, it was showing/hiding rectangles - and doing this on every pass through a very simple loop which ran thousands of times to perform the full calculation. Simply commenting out the progress bar update made the calculation fast enough it no longer needed a progress indicator - the results were displayed in under a second anyway!

    In this particular case, a very brief sleep, or even just a scheduler yield, before or after each call to WaitForMultipleObjectSex would make a huge difference to CPU load: currently, it will be spending its entire timeslice chanting "Are we there yet? No. Are we there yet? No. Are we there yet? No." before anything else gets to use that CPU core. If this is a single-thread CPU (not unlikely, if the story's a few years old), that'll destroy performance ... in a worst-case scenario, if the other process is I/O bound, that one will be yielding each time it performs any I/O - letting this thread burn up almost all the available CPU, delaying the other by a full timeslice every time! Painful. (From the ~20%, I'd suspect it might just be a four thread system, not quite so bad, but still lousy.)

  5. Yukkuri says:

    >commenting out the progress bar update made the calculation fast enough it no longer needed a progress indicator

    That is genuinely hilarious. I guess the original author didn't ever try it without one first?

  6. Adam Rosenfield says:

    @James Sutherland: I don't think WaitForMultipleObjectSex is an official Win32 API, that probably came from a third-party library ;) .

    On-topic: Relatedly, blogs.msdn.com/.../10014185.aspx tells a similar tale, where the Explorer file copy dialog was optimized in Vista not to display every file name, because it was otherwise spending a lot of unnecessary time doing so (though probably not to quite the same extent as your VB program's task).

  7. McBucket says:

    > That is genuinely hilarious. I guess the original author didn't ever try it without one first?

    Not so far-fetched. You have a progress bar component that generally works well. You have an operation that  appears to require a progress bar, so you use it (or maybe the operation is part of a common collection of operations where the progress bar is applied automatically, even though this operation is actually very fast). Hey, the operation seems slower than you think it ought to be? The last place you're going to look is in the trusted progress bar.

    Alternative to the "solution" of not using the progress bar: monitor the progress percentage yourself, and only update the bar when some threshold differential is reached (1%? 5%?). Our progress bar does this automatically.

  8. M Hotchin says:

    Years ago, an internal product I worked on had a similar problem with its progress bars.  We also disliked the times when the operation completed quickly, so the progress bar was just a flash.

    After a few go-arounds, the solution we eventually came up with was

    - after 1 second, show the progress bar if the operation is less that 50% complete

    - update the progress bar once per second.

    As it turns out, the product was a full call profiler, so determining that we were spending a lot of time updating the progress bar was not as difficult as for others - we'd just run the call profiler *on* the call profiler, and see where we were spending all our time.

  9. Nico says:

    > The original turned out to spent more than 99% of the CPU time on ... updating the progress bar.

    Sadly, this doesn't surprise me at all.  Many years ago I submitted a trivial bugfix for the open-source Audacity project which address the exact same problem (it was ignored for several years, no idea if it ever made it in).  The audio Normalize function would take several minutes to process a two-hour WAV file.  I knew this was absurd, and investigating the source code showed that every time through the processing loop (which runs for *every* audio sample), it updated the progress bar.  Commenting out that one line changed a 4 minute runtime to about 15 seconds.

    Interacting with the UI (graphical or text) has always been an absurd performance killer.  Nowadays I've seen similar problems on websites which manipulate the DOM in a tight loop.  Surprise, surprise, they're slow as frozen silly putty.

  10. Dave says:

    I've found one very effective way to get rid of sudden unexpected spikes in CPU usage.  Run taskschd.msc, open up Task Scheduler Library | Microsoft | Windows | Application Experience, and delete AITAgent, Microsoft Compatibility Appraiser, and ProgramDataUpdater.  Then in ... | Customer Experience Improvement Program, delete all entries.  Suddenly those unexplained, minutes-long spikes in CPU usage that make your laptop fan go crazy won't happen up any more.

  11. Neil says:

    This reminds me of several bugs in the original Firefox indeterminate progress meter implementation - for instance at one point the animation used for indeterminate progress meters was changed from a 10ms to a 20ms interval to reduce the CPU load. (Its determinate progress meters don't update themselves unless the value has changed by at least 4 to avoid problems with over-eager updates.)

  12. DWalker says:

    @Raymond: Karellen's comment is still valid.  True, "Call counts were not available because the problem was at a client location".  But if the profiler showed call counts as well as timings, that certainly would be useful in this, and many other, kinds of traces.

    [But the profiler can't show call counts because the program being profiled wasn't compiled with call counts enabled. "This might be a time where it would be useful to have a time machine." -Raymond]
  13. Scarlet Manuka says:

    Of course the great thing about having a time machine is that it doesn't matter if you have it at the wrong time.

    But this reminded me of something that happened with an algorithm I was building as part of my PhD research. Details aren't important but it generated a lot of (lightweight graphical) objects and then started merging them. Initially, I had it set up so that when it merged two objects it would redraw them and update an on-screen counter. It turned out that in the first phase of merging, the merge calculations were sufficiently light that it was spending most of its time updating the counter and redrawing. After I took that out, it looked odd to see everything freeze for a while and then the count suddenly drop, but it ran a lot faster :) (Subsequent phases were heavier on the calculation and leaving the updates in for the rest of the process was OK.)

Comments are closed.

Skip to main content