Anatomy of a perf investigation [Perf people do dumb stuff too]

Posted by: Sue Loh 

To give you a little glimpse of MEDC preparation: MEDC 2007 is only a month away and talk preparation is nearly complete.  For the people who are on the ball, the slides are already submitted and beautified by professional slide-cleaner-uppers; rehearsal talks have already been given, and they've even gotten training from a professional speaking coach.  My two talks are in that state, but lately I've still been debugging the last weirdness out of the hands-on lab I'm giving.  I hit a tough problem that I knew would turn out to be interesting, and now that I know the answer I thought I'd post it as a reminder of what things can happen.

My hands-on lab is an update of the one I gave last year, about how to look at systemic performance problems using CeLog/Kernel Tracker and the Monte Carlo profiler.  Last year I gave the lab on CE 5.0 x86 emulator images; this year it'll be on the CE 6.0 ARM emulator.  I thought that updating the lab would mostly be a matter of changing the instructions to reflect the new Platform Builder UI as embedded inside Visual Studio.  For the most part that was true, though I did have to change one example -- a thread that outputs a lot of debug messages -- to enumerate the file system instead.  The reason is a little detail about the way 6.0 PB works inside Visual Studio: the VS2005 debug message UI is apparently much slower than our old UI, and it's not under our control, so we can't make it faster.  To keep a CE device that's printing a lot of debug messages from being slowed down, the messages were made asynchronous.  My example sprayed out debug messages so fast that even after you turned it off, Visual Studio would keep printing message after message for many minutes afterward.  Rather than have people in my lab sit for minutes or disconnect to cut off the messages, I had to find a different example.

But even after I sorted that out, I hit one puzzling case I couldn't immediately explain.  My sample program lets you set up some interesting thread interaction -- there are some threads calling memset() and a UI reporting how rapidly they're doing it.  You can choose to have the threads memset their own buffers, or to use a shared buffer that's protected by a critical section.  The idea is that you'll see how much of a slowdown they get when they encounter heavy contention over the critical section.  The memset() rate in the UI will drop dramatically when the two threads contend over the critical section, because they start convoying.  (For info on convoys, see …)

At least, that's what the example is supposed to show.  What actually happened when I ran the sample program was that the contended case was the same speed or even a little faster than the uncontended case.  And I had a lot of variance even in the uncontended case -- the memset() speed would hover at one level, but sometimes it'd spike faster.  It's easy to explain "slow" spikes -- some other thread runs and takes time away from the sample program.  But why would it get faster now and then?

I captured a CeLog trace of the program running, and it showed my application threads taking nearly all the time, as expected.  I captured a Monte Carlo trace, and it showed exactly what I expected: the uncontended case is spending nearly all of its time (93%) inside memset() while the contended case is spending nearly all of its time scheduling threads.  memset() was only 6% of the run time of the contended case.  That was puzzling.  How, then, could the program be completing equal or more memset() calls when it was only spending 6% of its time memsetting, compared to when it was spending 93% of its time memsetting?

Since CeLog and Monte Carlo were only telling me what I already expected to see, I knew it would turn out to be something that the tools can't tell me.  One idea I had, then, was that perhaps the ARM emulator BSP spends a lot of time handling interrupts.  Since Monte Carlo profiling operates by taking an interrupt periodically and seeing what's running at those moments, if you turn off interrupts too much then Monte Carlo will not be able to tell you about the interrupts-off time.  But the main reason you turn off interrupts is while processing an interrupt -- and CeLog logs ISR entry and exit events.  So I collected interrupt data from the two scenarios, and hacked up readlog.exe to add up how much time I spent inside each ISR.  (This isn't in any shipping versions of readlog.exe but you'll probably get it someday in the future.)  No luck; the two scenarios were comparable.

I double-checked that the counter which measured how fast memset() was going was working right, by having the app threads log a CeLog event every time they completed a memset and having the UI log a CeLog event every time it reported the speed.  It was accurate.  From the events that the UI thread logged (once per second), I could see that two adjacent 1-second sections of the log would look identical in Kernel Tracker, but one would be slow (say, 180 memset calls) while the next would be fast (550 memset calls).  So I pulled a trick I added to CeLog at about Windows Mobile 5 time -- I logged a CELID_TIMER_START event before starting a memset call, and a CELID_TIMER_STOP event after it completed.  When you define timers like this, readlog.exe will add up min/max/average run time for the timer and tell you how much total time you spent with the timer running.  [Add a "flag" to the event and readlog will treat the flag as an ID for the timer, so you can make multiple distinct timers.]  Using the timers I could see that in the "fast" sections of the log, memset was running much faster -- about 70us per call, instead of 300us per call in the "slow" sections of the log.

I was talking this over with someone else, who asked me how big the buffers were.  Could it be overflowing the cache?  I said no, the buffers involved were only 4KB.  Then I went to check, and I was wrong; they were 64KB, and that's generally a lot closer to the size of the cache on a Windows CE device.  D'ohh!  I should have thought of that.  I don't know how much cache the ARM emulator has, but it is emulating one of our real BSPs (SMDK).  I changed the test application to use 4KB buffers and that solved the problem: the variance went away, the uncontended case ran faster, and the contended case ran slower.  Since the uncontended case was previously using two 64KB buffers (one for each thread) while the contended case used one common 64KB buffer, I'm guessing that the ARM emulator cache size is right about 128KB.  Most of the time the two threads would overflow the cache, running slowly, but once in a while a thread would get lucky and have its memory buffer stay in cache for a bit longer -- and spike faster.  Meanwhile, the contended case used less RAM and was spending all of its time on contention anyway, so the cache was not a bottleneck.

Why couldn’t I see this in the tools?  When you touch a memory address that is not in the cache, the CPU stalls until the data can be read from memory.  During the stalled time, the current instruction counter does not change.  So if the Monte Carlo profiler recorded a profiler sample during the stall, it would simply show you spending time on an instruction inside memset.  There’s no way to tell many fast memset calls from fewer slow memset calls in the profiler output, or to tell that cache misses were causing the calls to be slower.

Unfortunately, though, I arrived at the answer to this puzzle by getting an idea from a smart person.  How could I make that better for our customers?  Well, the OS isn't involved in cache misses -- they're handled entirely by the CPU.  So the only way to find out about your cache efficiency is if the CPU has a way to let you query it.  Luckily, many CPUs today have performance counters you can use to measure such things.  Unluckily, the way to do so varies not only between CPU families but between CPU models and manufacturers, so any tool to query cache efficiency has to be written in a CPU-specific fashion.  I've done it before, but it's always a quick one-off to handle that day's problem, and there's never an easy way to release code like that to customers.  Testing and documenting and maintaining that kind of thing quickly turns into a nightmare.  🙁  The best tools out there come from the CPU manufacturers -- notably, Intel VTune and AMD CodeAnalyst will get such information out of their CPUs.  I'm hopeful that ARM will soon set a standard for all ARM chips (they may have already), so look forward to tools that work across ARM chips.  Don't expect something from Microsoft soon; we have too many macro-sized fish to fry here on the tools team right now to take on micro-sized projects like getting performance counters from every flavor of CPU...  Though tell me if you think I'm making a big mistake by saying that.  Maybe I can convince our BSP team to work with me to define a standard CPU counter interface and implement counter code for each CPU they work on.

But in general if CeLog and Monte Carlo don't help, then the problem is likely to be one of these:

  • Interrupts are off

  • Cache misses

  • TLB misses

Happy investigating...
