QueryPerformanceCounter counts elapsed time, not CPU cycles


An anonymous coward asks whether the QueryPerformanceCounter function counts elapsed time or CPU cycles.

It counts elapsed time. It has to, since its value is governed by the QueryPerformanceFrequency function, which returns the number of counter units per second, and the frequency is spec'd as not changing while the system is running.
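
Here's a minimal sketch of the intended usage pattern, making no assumptions about what the frequency actually is: take the difference of two counter readings and divide by whatever QueryPerformanceFrequency reports at run time.

    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        LARGE_INTEGER freq, start, end;
        QueryPerformanceFrequency(&freq);   // counts per second; fixed while the system runs
        QueryPerformanceCounter(&start);

        Sleep(100);                         // stand-in for the work being timed

        QueryPerformanceCounter(&end);
        double seconds = (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart;
        printf("elapsed: %.6f seconds\n", seconds);
        return 0;
    }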

For CPUs that can run at variable speed, this means that the HAL cannot use an instruction like RDTSC, since that does not correlate with elapsed time. Commenter "A" appears to have found a buggy HAL that fails to take this into account and returns values that do not correlate with elapsed time.

What would it take to create a counter that was tied to CPU cycles? Well, first, you'd have to come up with some definition of "CPU cycles" that is architecture-neutral. Maybe you'd say that it's a 64-bit value that increments at a rate proportional to the amount of work the CPU has done. And then you have to come up with some sort of definition of what should happen on multi-processor machines. What if you have two CPUs, one of which has gone into a HLT state (not running), while the other is busy doing work? Should the "cycle counter" run at half speed, since only half of the CPUs are running at full speed? What about hyperthreaded processors? It's all so confusing.

As a final remark, commenter Ulric wanted to know what I meant when I wrote, "Throw in a handful of workarounds for known buggy hardware." What I meant was that the people who write HALs are aware of various types of buggy hardware and added code to detect that buggy hardware and work around the problems.

Comments (29)
  1. Eff Five says:

    The following is from Code Complete Chapter 25: Code-Tuning Strategies in a section “Measurements Need to Be Precise”

    "…make sure that you’re measuring only the execution time of the code you’re tuning. Use the number of CPU clock ticks allocated to your program rather than the time of day. Otherwise when the system switches from your program to another program one of your routines will be penalized for the time spent executing another program"

    I’ve always wondered what was meant by this and how to accomplish it. I guess I don’t have to ponder it anymore.

    Thanks

  2. Mike Dimmick says:

    @Eff Five: but the system is doing many things (chiefly paging or other disk I/O) that may not be credited as CPU time to your process yet can dominate your actual execution time.

    It’s not as if your users are able to run your program without running the other background tasks, anyway.

    You should turn off any CPU hogs – no Folding@Home, for example – but that’s really about it.

    As for finding a buggy HAL, I reckon it was the standard multiprocessor ACPI HAL. That used to use the RDTSC instruction until it was found that AMD had tied it to the actual processor clock speed, which could vary due to power management.

  3. Yuhong Bao says:

    From http://www.virtualdub.org/blog/pivot/entry.php?id=106:

    ""So, realistically, using QPC() actually exposes you to all of the existing problems of the time stamp counter AND some other bugs."

    Not all at the same time. It is like claiming Netscape 8 is insecure because it expose you to the bugs of both IE and Firefox."

    The problem is, how do you determine the counter used from user mode?

  4. Jess Askin says:

    Would running in Safe Mode help get raw numbers for an app?  

  5. Yuhong Bao says:

    "There is still more than one process running in safe mode, so not really, no."

    Indeed, you can’t turn processes off for the same reason you can’t turn security off in any modern OS.

  6. Most times QPC is mentioned, there’s one very, very important thing missing.

    The fact that QPC is grossly unreliable. Depending on the actual hardware and CPU load, QPC will randomly jump backward and/or forward by several seconds (in my experience, usually ~4 seconds).

    You can work around that by tracking wall-clock time too and trying to detect those leaps.
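
    A rough sketch of that kind of cross-check (the 200ms tolerance is an arbitrary pick, and GetTickCount stands in for the wall clock; qpcStart and tickStart are captured together at the start of the interval):

        #include <windows.h>
        #include <math.h>

        // Hypothetical sketch: sanity-check QPC against GetTickCount and fall
        // back to the tick count when QPC has visibly leapt.
        double ElapsedSeconds(LARGE_INTEGER qpcStart, DWORD tickStart)
        {
            LARGE_INTEGER freq, now;
            QueryPerformanceFrequency(&freq);
            QueryPerformanceCounter(&now);

            double qpcSec  = (double)(now.QuadPart - qpcStart.QuadPart) / (double)freq.QuadPart;
            double tickSec = (GetTickCount() - tickStart) / 1000.0;

            // If the two disagree by more than an arbitrary 200ms, assume QPC
            // leapt and trust the (coarser) tick count instead.
            if (fabs(qpcSec - tickSec) > 0.2)
                return tickSec;
            return qpcSec;
        }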

  7. Narr says:

    >I’ve always wondered what was meant by this and how to accomplish it.

    >I guess I don’t have to ponder it anymore.

    But you still don’t know what is meant nor how to accomplish it. What it means is to use something like the Windows performance counter that tracks your particular process’s CPU time[1]*.

    [1] http://www.codeproject.com/kb/system/cpuusageByDudiAvramov.aspx

    * I think the linked article has the right counter, but I might be wrong, or there might be a better one.

  8. Csaboka says:

    I think the right function to get the CPU time spent by your process is GetProcessTimes, available since Windows 2000. The only problem is, it is flawed – it charges the whole CPU quantum to your process even if it’s interrupted in the middle of it. It still reported more CPU time for the exact same code when Winamp was running in the background, so I just gave up trying and went back to wall-clock time (with as few processes running as possible).
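
    For reference, a bare-bones sketch of the GetProcessTimes call being discussed (FILETIME values are in 100-nanosecond units):

        #include <windows.h>
        #include <stdio.h>

        int main()
        {
            FILETIME creationTime, exitTime, kernelTime, userTime;
            if (GetProcessTimes(GetCurrentProcess(),
                                &creationTime, &exitTime, &kernelTime, &userTime))
            {
                ULARGE_INTEGER k, u;
                k.LowPart = kernelTime.dwLowDateTime; k.HighPart = kernelTime.dwHighDateTime;
                u.LowPart = userTime.dwLowDateTime;   u.HighPart = userTime.dwHighDateTime;
                // FILETIME counts 100-nanosecond units.
                printf("kernel: %.3f s, user: %.3f s\n",
                       k.QuadPart / 1e7, u.QuadPart / 1e7);
            }
            return 0;
        }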

  9. Worf says:

    As an implementer of QPC and QPF (on Windows CE), it’s always quite interesting. Firstly, the frequency I end up using for QPF is usually based on the crystal frequency – I’ve seen 32.768kHz (RTC timekeeping), 3.58MHz (used because NTSC crystals are cheap), and 3.25MHz (used because it’s multiplied by 4 to get a standard 13MHz clock).

    Some hardware is nice and provides an up-counting register for that; other hardware is not so nice and provides only a limited register. Either way, the fun part has always been that QPC has to return a 64-bit value, while, if we’re lucky, the counter is a 32-bit register, which can overflow anywhere from under 20 minutes to several hours, and there’s usually no interrupt you can use to capture the high bit, either, so QPC ends up wrapping quickly.
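
    The usual software band-aid looks roughly like this sketch, assuming a hypothetical ReadHardwareCounter32() and assuming the function gets called at least once per wrap period:

        #include <windows.h>

        DWORD ReadHardwareCounter32(void);   // hypothetical: reads the raw 32-bit register

        // Widen a free-running 32-bit up-counter to 64 bits in software.
        // Only works if this is called more often than the counter wraps,
        // and calls must be serialized (no locking shown here).
        ULONGLONG ReadCounter64(void)
        {
            static DWORD     lastLow  = 0;
            static ULONGLONG highBits = 0;

            DWORD low = ReadHardwareCounter32();
            if (low < lastLow)                   // the 32-bit register wrapped around
                highBits += 0x100000000ULL;      // carry into the upper 32 bits
            lastLow = low;
            return highBits + low;
        }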

    On a PC, from what I’ve seen, QPF/QPC is based on one of several sources of high-performance timing. At the low end, you’ve got the lame old PC programmable interval timer, at approximately 1.19MHz. On modern PCs, though, QPF/QPC get their values from the ACPI programmable timer. This can mean the source of the timing information comes from a chipset crystal that ticks away at an internal counter, or some other source of timing information. Unfortunately, it seems on an AMD system, the ACPI BIOS is broken in that the "high performance counter" used is the time-stamp counter (which isn’t synchronized across multiple processors necessarily), and as people have found out, the TSC beats in time with the clock frequency, which violates the requirement of the timer in that it’s user controllable and predictable. Which is why QPF doesn’t work well with a TSC-based timer: QPF reports a rate determined at startup, but the TSC’s actual rate wanders afterward.

    Nothing the HAL can do if the ACPI BIOS returns the wrong timer to use… Probably happens in Linux too, which has support for the ACPI programmable timers.

    Trivia: The purpose of the ACPI timers is to provide a nice source of timing interrupts for the scheduler, so you don’t have to rely on the old programmable timer. The latter has issues in that the scheduler goes nuts if you reprogram it accidentally. The ACPI timer is controlled only by the OS itself, and is much easier to protect against upset than the PIT.

  10. Phaeron says:

    QPC/QPF is, in my opinion, a perfectly fine API that is just marred by implementation bugs. The interface is simple, and it has enough precision and range for most purposes. It’s just too bad that it’s broken on enough systems to be problematic. 500ns of jitter I can deal with. I can’t work around 30% variances in counting rate, or even worse, >100ms jumps forward and backward due to badly synchronized TSCs between cores.

    This is one of those cases where I think it would help to reveal something about the implementation. On more than one occasion I’ve seen people test QPF(), see the CPU frequency, test it on another system, see the CPU frequency again, and then write code that depends on it… only to have to fix it when I run it and it runs three orders of magnitude too slowly. If the documentation noted some of the sources that QPF() commonly uses and the huge range in counting rates (1MHz-3GHz), then I think fewer people would fall into this trap. The knowledge base articles about QPC problems, like 274323, also contain information that should really be noted in the docs. It’s not part of the API and it’s not intended behavior, but for better or worse, it’s what people have to deal with in the wild. IMO, QPC() should be treated like PulseEvent().

  11. KristofU says:

    I find it funny that something a computer should be great at, counting up at a fixed rate, creates so many problems.

    Timers on PCs have always been a bit of a mess, and QPC/QPF worked great until the buggy hardware appeared, at which point we couldn’t use it anymore and had to rework our timekeeping classes around GetSystemTimeAsFileTime.
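
    For comparison, a minimal sketch of that approach; the resolution is only the system timer tick (typically 10-15ms) and it moves if the wall clock is adjusted, but it behaves the same on every machine:

        #include <windows.h>
        #include <stdio.h>

        // Current wall-clock time in milliseconds since the FILETIME epoch (1601).
        double NowMilliseconds(void)
        {
            FILETIME ft;
            GetSystemTimeAsFileTime(&ft);
            ULARGE_INTEGER t;
            t.LowPart  = ft.dwLowDateTime;
            t.HighPart = ft.dwHighDateTime;
            return t.QuadPart / 10000.0;        // 100ns units -> milliseconds
        }

        int main()
        {
            double start = NowMilliseconds();
            Sleep(50);                          // stand-in for the work being timed
            printf("elapsed: ~%.0f ms\n", NowMilliseconds() - start);
            return 0;
        }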

  12. Dean Harding says:

    Jess Askin: There is still more than one process running in safe mode, so not really, no.

    As long as you do your performance test over a long enough period of time, the effects of other processes should be statistically reduced to the point where they’re insignificant anyway. As long as you’re being sensible about it…

    Plus, there are plenty of applications where you *want* the wall-clock time anyway. For example, when doing frame-rate independent animation in games, etc.
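
    A sketch of that pattern, with hypothetical Update/Render/GameRunning functions standing in for the actual game:

        #include <windows.h>

        void Update(double dtSeconds);   // hypothetical: advance the simulation
        void Render(void);               // hypothetical: draw a frame
        bool GameRunning(void);          // hypothetical: loop condition

        void GameLoop(void)
        {
            LARGE_INTEGER freq, prev, now;
            QueryPerformanceFrequency(&freq);
            QueryPerformanceCounter(&prev);

            while (GameRunning())
            {
                QueryPerformanceCounter(&now);
                double dt = (double)(now.QuadPart - prev.QuadPart) / (double)freq.QuadPart;
                prev = now;

                Update(dt);   // advance by elapsed wall-clock time, not by frame count
                Render();
            }
        }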

  13. MadQ says:

    There is another issue with AMD CPUs (at least up to the Athlon XP, as far as I know). The TSC isn’t always up to date. It’s possible to get the same number more than once if it’s read in quick succession. As far as I can tell, it has something to do with the state of the CPU’s pipelines. Usually, it helps to keep the pipelines (three on the Athlon XP) filled so that reading the TSC is done in parallel with instructions in the other pipelines. For example:

    xor ecx, ecx   ; filler that doesn’t touch eax or edx,
    xor esi, esi   ; keeping the other pipelines busy
    rdtsc          ; writes the TSC into edx:eax

    Avoid using the eax and edx registers, otherwise the pipelines would stall, because the rdtsc instruction writes the TSC into those.

    Intel CPUs don’t seem to have this problem.

    If an application needs precision timing that is consistent across hardware platforms, it’s usually best to give up a little accuracy and use a timing algorithm with hysteresis. Kinda like a software version of a phase-locked loop (see http://en.wikipedia.org/wiki/PLL ), but this is difficult to get right (rough sketch at the end of this comment).

    Whatever you do, PUHLEASE don’t use timeBeginPeriod(). It changes thread scheduling, causes a significant increase in interrupts, and increases latency system-wide. Abuse and misuse of timeBeginPeriod() has annoyed me to the point that I finally broke down and wrote a DLL that’s loaded via the evil AppInit_DLLs, intercepts all timeBeginPeriod() calls, and simply returns success without doing anything.
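
    Roughly, the shape of the idea is something like this toy sketch (my own, not a tested algorithm): keep your own clock, advance it by a trusted coarse step, and slew it toward the raw QPC reading a little at a time.

        #include <windows.h>

        static double RawSeconds(void)           // the raw, possibly glitchy source
        {
            LARGE_INTEGER f, c;
            QueryPerformanceFrequency(&f);
            QueryPerformanceCounter(&c);
            return (double)c.QuadPart / (double)f.QuadPart;
        }

        // Call regularly (say, once per frame). dtNominal is how much time we
        // believe has passed since the last call, taken from an independent
        // coarse source such as GetTickCount().
        double SmoothedSeconds(double dtNominal)
        {
            static double smoothed = RawSeconds();

            double error = RawSeconds() - (smoothed + dtNominal);

            // Advance by the nominal step, then correct at most 10% of the
            // disagreement per call; never let the correction push time backwards.
            double correction = 0.1 * error;
            if (correction < -dtNominal)
                correction = -dtNominal;
            smoothed += dtNominal + correction;
            return smoothed;
        }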

  14. Eff Five says:

    @Narr: >you still don’t know what is meant nor how to accomplish it

    That’s actually my point. Did McConnell mean “turn off anything you don’t need when you are perf testing”? I thought that it seemed so obvious that no one would actually bother to write that down. Clearly I was wrong on that score.

    Perhaps he meant “avoid the problems of context switching by counting how many cycles/ticks/clocks your code uses.” I take it that Mr. Chen believes that I might as well count how many happy or sad or beautiful cycles it takes, because that’s just as well defined. This is what I was thinking when I wrote that I don’t have to ponder it anymore.

    Lastly when doing perf analysis I try to follow the advice in this blog entry http://blogs.msdn.com/ricom/archive/2005/05/23/421205.aspx

  15. Yuhong Bao says:

    "Unfortunately, it seems on an AMD system, the ACPI BIOS is broken in that the "high performance counter" used is the time-stamp counter (which isn’t synchronized across multiple processors necessarily), and as people have found out, the TSC beats in time with the clock frequency, which violates the requirement of the timer in that it’s user controllable and predictable. "

    Well, in this case, the OS is using rdtsc, otherwise /usepmtimer won’t fix it.

    http://support.microsoft.com/kb/938448/

  16. Yuhong Bao says:

    Phaeron: Your post is great, except for the last sentence, which I’d delete.

  17. Yuhong Bao says:

    For those who are having problems with QPC() now, the only way to determine the exact timer used is to attach a kernel debugger and step through the hal!KeQueryPerformanceCounter function. I hope this can be changed, but right now…

  18. SuperKoko says:

    On my old K6-2 550MHz computer, QPF returns 1193181. I guess it uses the 8253-compatible PIT counter.

  19. Alexander Grigoriev says:

    "Whatever you do, PUHLEASE don’t use timeBeginPeriod(). It changes thread scheduling, causes a significant increase in interrupts, and increases latency system-wide."

    Oh no, no more stupid "evil timeBeginPeriod" bitching, please.

    Do you honestly believe a modern CPU has a problem with 1000 interrupts per second? It can easily handle 100,000 ips. And no, changing timer resolution doesn’t change dispatching time slice. These are different things.

  20. Yuhong Bao says:

    "Do you honestly believe a modern CPU has a problem with 1000 interrupts per second? It can easily handle 100,000 ips."

    Except that 1000 interrupts per second hurts power management, because the CPU gets fewer chances to sleep.

  21. ton says:

    Raymond, I must say this

    http://blogs.msdn.com/oldnewthing/archive/2005/09/02/459952.aspx#460003

    is the clearest example I’ve ever read explaining the difference between accuracy and precision. It’s scared into my brain now; it was definitely confusing for me in the past. Keep up the good work.

  22. SuperKoko says:

    "Whatever you do, PUHLEASE don’t use timeBeginPeriod()."

    Some applications do need accurate timing. The standard Windows XP time slice is too long for video games.

    Consequence: they either have to use timeBeginPeriod() or a CPU-intensive waiting loop. I would prefer the former. In my experience, video games use the latter.

    Is it better?
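
    For what it’s worth, a sketch of the timeBeginPeriod route, scoped so the resolution is given back afterwards; the busy-wait alternative would simply spin on QueryPerformanceCounter instead of calling Sleep:

        #include <windows.h>
        #include <mmsystem.h>
        #pragma comment(lib, "winmm.lib")

        void RunFrames(int frames)
        {
            timeBeginPeriod(1);          // ask for ~1ms timer resolution...
            for (int i = 0; i < frames; i++)
            {
                // ...render one frame here...
                Sleep(15);               // now wakes close to 15ms instead of rounding
                                         // up to the next default ~15.6ms tick
            }
            timeEndPeriod(1);            // ...and give the resolution back when done
        }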

  23. MadQ says:

    @Alexander Grigoriev: It must be nice to have a PC with the latest and greatest multi-core, hyper-threaded, etc. CPU, where you don’t notice the difference. Unfortunately, I’m still limping along on my old Athlon XP 2500+, and I do notice the difference. Multitasking suffers a lot.

    For example, I use Indexing Service a lot; if some application uses timeBeginPeriod() and then uses a fair bit of CPU time while Indexing Service is in the middle of a master merge, the PC seems to slow down to a crawl. Even a simple alt-tab can take a few seconds to respond. I/O is not an issue here.

  24. MadQ says:

    @Alexander Grigoriev: It must be nice to have a PC with the latest and greatest multi-core, hyper-threaded, etc. CPU, where you don’t notice the difference. Unfortunately, I’m still limping along on my old Athlon XP 2500+, and I do notice the difference. Multitasking suffers a lot.

    For example, I use Indexing Service a lot; if some application uses timeBeginPeriod() and then uses a fair bit of CPU time while Indexing Service is in the middle of a master merge, the PC seems to slow down to a crawl. Even a simple alt-tab can take a few seconds to respond. I/O is not an issue here.

  25. MadQ says:

    [My bad. I fat-fingered some keyboard shortcuts.]

    @SuperKoko: It’s been my experience that games run just fine, unless some other application has a higher base priority, and does a lot of polling. It’s very obvious when I accidentally leave Process Explorer running during a game.

  26. ton says:

    Oops meant ‘scarred’ not ‘scared’… :-)

  27. MadQ says:

    @Alexander Grigoriev: "And no, changing timer resolution doesn’t change dispatching time slice. These are different things."

    I beg to differ, BTW. Just write a little app that’s nothing more than an infinite empty loop. Run two instances of it, and add the Context Switches/sec performance counters for their threads in PerfMon. Watch those numbers for a while. Now run a program that does timeBeginPeriod(1). I just did this, and the sum of the Context Switches/sec of both apps just about doubled.
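
    The test programs are nothing fancier than roughly this:

        // spin.cpp - run two instances and watch their Thread\Context Switches/sec
        // counters in PerfMon.
        int main()
        {
            for (;;) { /* infinite empty loop */ }
        }

        // res.cpp - run alongside them to raise the system timer resolution.
        #include <windows.h>
        #include <mmsystem.h>
        #pragma comment(lib, "winmm.lib")
        int main()
        {
            timeBeginPeriod(1);   // request ~1ms resolution
            Sleep(INFINITE);      // keep it in effect until the process is killed
        }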

  28. alegr says:

    "Now run a program that does timeBeginPeriod(1). I just did this, and the sum of the Context Switches/sec of both apps just about doubled."

    If the time slice were equal to the timer resolution, the number of context switches would increase tenfold.

    I suppose you have a single processor or you’ve limited the test apps’ affinity to the same processor?

    "Unfortunately, I’m still limping along on my old Athlon XP 2500+, and I do notice the difference. Multitasking suffers a lot.

    For example, I use Indexing Service a lot; if some application uses timeBeginPeriod() and then uses a fair bit of CPU time while Indexing Service is in the middle of a master merge"

    Indexing Service is a WTF by itself. It looks like it was only tested on machines with at least two processors and 4 GB of memory. Those poor souls with a single processor and 1 GB just have to disable that spawn of evil.

  29. My timekeeping rant…

    1. Time should never go backwards.  It just shouldn’t.  You can slow down time if you want to fix a clock that’s going too fast, but you shouldn’t have it skip backwards.  This breaks subtraction of times, comparisons of times, etc.  Suppose you have an event log of some sort and want to do a binary search to find the line in the log corresponding to a particular time… good luck if times can skip backwards whenever they feel like it.  Or, it’s always fun to get negative benchmark results when you divide by a negative "end-start".  (The usual band-aid is a clamping wrapper; see the sketch after this list.)

    2. Handling of leap seconds.  See (1).  Apparently whoever designed "Unix time" and its interaction with leap seconds decided that time should skip 1 second backwards at each leap second, so that a year is always 365*86400 or 366*86400 seconds long, even though really sometimes it’s 365*86400+1 or 366*86400+1 seconds long?

    3. Time APIs with millisecond or worse precision.  Come on, at least design the API so it’s *possible* to give me microseconds or better, even if currently you only implement seconds or milliseconds.

    4. Time APIs that return 32-bit values.  You’re just asking for overflow problems.  Come on, we can afford 64 bits to store times these days.

    5. Why should my CPU have to take interrupts regularly just to keep the system time up to date?  Even 100 interrupts per second is wasteful, especially for power consumption.  It’s pretty easy to build a 64-bit counter in hardware that increments the time automatically (with arbitrarily high precision), and then if the CPU wants to read the current time, it can read the register.  If the CPU needs an interrupt to wake up at some point (a thread is doing a wait with a timeout), that’s fine, but that doesn’t require regular interrupts — it only requires an interrupt when the timeout expires.

    6. All the problems with SMP and variable clock frequencies.  Come on, is it so hard to build your main hardware timekeeping facility so that (1) it isn’t tied to a particular CPU and (2) it is in a clock domain whose frequency doesn’t change?

    7. Interactions with standby/hibernate.  This one always scares me… if the system is in standby or hibernate for a week, will my clock tick forward by a week, or will it look like no time at all has elapsed?  Hopefully it will at least not reset to zero?  Could you please at least document the behavior? :)  (And let’s not even ask what happens if, during a hibernate session, you remove your CPU and plug in a new one with a different clock speed.)

    8. Network protocols that force me to parse YYYYMMDDHHMMSS data and run it through a calendar algorithm (plus some time zone chicanery) just to extract what time they’re talking about.  Would it be so bad to send the time across the wire as a Unix time or something instead of going through all that calendar stuff?  The only place you should have to do time->date conversions should be at the "final user output" step, and date->time conversions only at the "user input" step.
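
    As for #1, a hypothetical sketch of the usual clamping band-aid (not thread-safe as written; serialize as needed):

        #include <windows.h>

        // Wrap the underlying clock so callers never observe time moving backwards.
        ULONGLONG MonotonicMilliseconds(void)
        {
            static ULONGLONG last = 0;

            FILETIME ft;
            GetSystemTimeAsFileTime(&ft);
            ULARGE_INTEGER t;
            t.LowPart  = ft.dwLowDateTime;
            t.HighPart = ft.dwHighDateTime;
            ULONGLONG now = t.QuadPart / 10000;   // 100ns units -> milliseconds

            if (now < last)                       // clock stepped backwards: hold the line
                now = last;
            last = now;
            return now;
        }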

Comments are closed.