GetTickCount – Truth and Fiction


I was surprised and dismayed to read a recent article in Embedded Systems Programming (http://www.embedded.com/showArticle.jhtml?articleID=159902113) that gets so many things wrong about the GetTickCount API and portrays Windows CE so negatively.


Apparently I’m behind on my reading, since the same author also wrote a previous article on the same subject, to which Mike Hall wrote a rebuttal on his own blog: http://blogs.msdn.com/mikehall/archive/2005/01/28/362498.aspx  Reading the more recent article makes me want to write my own rebuttal.


GetTickCount is a pretty simple API from the caller’s perspective.  Every millisecond the tick count increments, and you can use GetTickCount to retrieve the number of milliseconds since boot.  Since GetTickCount returns only a 32-bit number, the counter wraps after about 49.7 days.  This is documented behavior, and to use GetTickCount properly you have to understand that.  Typically GetTickCount is used to time the duration between two events, in which case you’re generally safe if you subtract two values; unsigned subtraction is safe in the presence of rollover.  (E.g., if you get a tick count of 0xFFFFFF00 before the rollover and a tick count of 0x200 after the rollover, subtraction gives you a difference of 0x300 as expected.)  The only time subtraction can get you into hot water is if there’s a chance the time delta will exceed 49.7 days, because you would need a difference that’s larger than you can represent in 32 bits.  In that case GetTickCount is the wrong API for you, and you would probably need to implement something using GetSystemTime and SystemTimeToFileTime.  Applications that use GetTickCount get into trouble if they are subtracting over such a long time period, or if they are doing something besides subtraction with the tick count.  For example,


   if (GetTickCount() > MyTickValue) { … }



will also get you into trouble.  If I were to guess, that’s where I’d say most applications probably go wrong using this API.
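
To make the difference concrete, here is a minimal sketch in portable C.  Plain uint32_t values stand in for GetTickCount results, and the function names (elapsed_ms, timed_out, timed_out_broken) are made up for illustration; they are not part of any Windows API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Elapsed milliseconds between two tick readings.  Unsigned 32-bit
   subtraction wraps modulo 2^32, so the result is correct across one
   rollover as long as the true interval is under about 49.7 days. */
uint32_t elapsed_ms(uint32_t start, uint32_t now)
{
    return (uint32_t)(now - start);
}

/* Rollover-safe timeout check: compare an elapsed time, not absolute ticks. */
bool timed_out(uint32_t start, uint32_t now, uint32_t timeout_ms)
{
    return elapsed_ms(start, now) >= timeout_ms;
}

/* Broken variant, in the spirit of "GetTickCount() > MyTickValue".
   The deadline can wrap to a small number, so the comparison fires early. */
bool timed_out_broken(uint32_t start, uint32_t now, uint32_t timeout_ms)
{
    uint32_t deadline = (uint32_t)(start + timeout_ms); /* may wrap */
    return now > deadline;
}

/* The worked example from the text: 0x200 - 0xFFFFFF00 == 0x300. */
void check_rollover_example(void)
{
    assert(elapsed_ms(0xFFFFFF00u, 0x00000200u) == 0x300u);
}
```

With a start of 0xFFFFFF00 and a 0x200 ms timeout, the broken version computes a deadline of 0x100 and reports a timeout only 0x10 ms later, while the subtraction-based version waits the full interval.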


To help catch such errors in applications and drivers, Windows CE does a little thing on debug builds – it initializes the tick count such that it rolls over 3 minutes after boot.  The author talks about our 3-minute rollover as if it’s indicative of a problem in the OS, when really it’s just a meager attempt to help catch bugs in applications.  It’s not much help really, if you ask me, but it might help catch a bug or two.  I’d love to improve on it, but it’s tough to arrange for the timer to roll over at a really useful time for testing your application.  For example you might think we could create an IOCTL you could call, to set the timer at run-time.  But making the timer jump at run-time could mess things up.  Suddenly drivers and applications would think 47 days have passed, maybe network drivers time out, all your appointments fire, who knows…  I’m making things up since I don’t really know how networking or appointments are implemented, but you get the idea.  If you really want to be careful about testing your application or driver that uses GetTickCount, probably the best thing to do is create a wrapper that your code uses to call GetTickCount, and arrange for your wrapper to manipulate the times.  I’d love to see suggestions, and if you can come up with something good we can do in the OS, hey, maybe we will take your advice.
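
One possible shape for such a wrapper, sketched in portable C.  Everything here is hypothetical: the tick_wrapper type, its functions, and the fake tick source exist only for this example.  In a real build you would pass a small adapter that calls GetTickCount as the source; the fake source just lets a test control time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A 32-bit millisecond tick source, e.g. an adapter around GetTickCount. */
typedef uint32_t (*tick_source_fn)(void);

/* Hypothetical wrapper: the real tick count plus a test-controlled offset. */
typedef struct {
    tick_source_fn source;
    uint32_t offset;
} tick_wrapper;

void tick_wrapper_init(tick_wrapper *w, tick_source_fn source)
{
    assert(w != NULL && source != NULL);
    w->source = source;
    w->offset = 0;
}

/* Arrange for the wrapped count to roll over ms_until_rollover ms from now. */
void tick_wrapper_schedule_rollover(tick_wrapper *w, uint32_t ms_until_rollover)
{
    uint32_t real_now = w->source();
    uint32_t wanted_now = (uint32_t)(0u - ms_until_rollover); /* 2^32 - ms */
    w->offset = (uint32_t)(wanted_now - real_now);
}

/* What application code calls instead of the raw tick source. */
uint32_t tick_wrapper_now(const tick_wrapper *w)
{
    return (uint32_t)(w->source() + w->offset);
}

/* Fake tick source for tests: time advances only when the test says so. */
uint32_t fake_now_ms = 0;
uint32_t fake_tick_source(void)
{
    return fake_now_ms;
}
```

Because only your own code sees the shifted time, the rest of the system is unaffected, which avoids the problems of making the real timer jump at run-time.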


So now I’ll describe how GetTickCount is implemented internally.  First off, the function is technically owned completely by the OEM, though we provide as many implementations as we can manage, so that OEMs can use those.  The implementation varies per CPU and per OAL, but in general, there’s a 32-bit counter, CurMSec, that is incremented once per millisecond with a timer interrupt.  That millisecond timer interrupt is also used for other things, like scheduling threads.


To conserve power, when the kernel has no threads to schedule, the system goes into an idle state (implemented by the OEM’s OEMIdle function) where the timer interrupt is extended to a longer period.  That allows the CPU to spend a longer time in a low-power state.  The idle period is ended when an interrupt fires (the extended timer interrupt or some other interrupt).  When the system leaves the idle state, OEMIdle updates CurMSec with the amount of time that the system was idle, using whatever timer the hardware has.
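
The bookkeeping described above can be sketched as a small simulation in portable C.  This is not OAL code: the sim_timer type and its functions are invented for illustration, cur_msec merely stands in for CurMSec, and carrying leftover hardware ticks between idle periods is my own assumption about how one might avoid losing fractions of a millisecond.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated timer state (all names hypothetical). */
typedef struct {
    uint32_t cur_msec;          /* stands in for CurMSec */
    uint32_t hw_ticks_per_ms;   /* rate of the simulated hardware timer */
    uint32_t leftover_hw_ticks; /* idle time not yet worth a whole ms */
} sim_timer;

void sim_timer_init(sim_timer *t, uint32_t hw_ticks_per_ms)
{
    assert(hw_ticks_per_ms != 0);
    t->cur_msec = 0;
    t->hw_ticks_per_ms = hw_ticks_per_ms;
    t->leftover_hw_ticks = 0;
}

/* Normal operation: the 1 ms timer interrupt bumps the counter by one. */
void sim_timer_tick(sim_timer *t)
{
    t->cur_msec += 1;
}

/* Leaving idle: credit the counter with the time spent idle, measured in
   hardware ticks.  The remainder is carried forward (an assumption of this
   sketch) so repeated short idle periods do not drop time. */
void sim_timer_leave_idle(sim_timer *t, uint32_t hw_ticks_idle)
{
    uint64_t total = (uint64_t)t->leftover_hw_ticks + hw_ticks_idle;
    t->cur_msec += (uint32_t)(total / t->hw_ticks_per_ms);
    t->leftover_hw_ticks = (uint32_t)(total % t->hw_ticks_per_ms);
}
```

The counter only ever moves forward in this model; an implementation that gets the idle accounting wrong is one way a tick count could appear to jump backwards.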


All the magic is really in the OEM’s implementation of OEMIdle.  If you want some code to look at, see the CE 5.0 help article.


Now, back to the article.  Here are my responses:



  • One of the things that floors me about this article is that it claims that GetTickCount counts downward, not upward.  That is just plain wrong.  I don’t know what gave the author that impression but the counter starts out at 0 and goes up.  If you see any documentation or code that claims otherwise, please tell me and we’ll get it corrected.
  • The author also asks whether the counter “sticks” at the same value after rollover, which it doesn’t.  He seems to imply that it does though.
  • GetTickCount also works just fine on 16-bit and 64-bit CPUs, though the author implies that it doesn’t.
  • The article brings up cases where counters are non-monotonic, where the counter jumps backwards by a few ticks.  If that happens, it means the timer is not implemented correctly by the OEM.  Most likely something in OEMIdle is not right.  We do provide documentation of how to implement a timer (here’s some), standard implementations for different CPUs, and tests to verify that the timer is implemented correctly.  But I’m sure some people would argue that we don’t do enough to help OEMs get this right, and maybe they’re right.  Let’s discuss it.  What else would you like to see?  What trouble have you had?
  • All the other examples of badness that the author uses come from desktop Windows as far as I can tell, and they are problems with applications that use GetTickCount improperly, not with Windows itself.
  • You could argue that this is too complicated, that GetTickCount should just read the timer hardware straight, and scale to milliseconds.  I think the main reason it was done this way was to standardize between hardware with a count-compare type timer and hardware with a count-down timer.  It also avoids doing division to convert to milliseconds, especially 64-bit division since 64-bit timers are common.
  • The author actually suggests using a stopwatch to measure elapsed time during a performance test.  Even if that were accurate enough, it’s a pain and it’s not automatable.  How could you run tests regularly to make sure that nothing got worse?  I fully support the author’s idea that you should loop many times, so that the performance test runs long enough that you could time it with a stopwatch.  But that’s for reasons of repeatability, to get rid of variance.  I don’t believe a stopwatch is the right answer.  For timing code run-times, GetTickCount and QueryPerformanceCounter have satisfied every need I’ve seen so far.
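
As a rough sketch of the automated alternative to a stopwatch, the loop below times many iterations of a workload with a millisecond tick source.  It is portable C with hypothetical names (time_workload_ms and the demo_ helpers); on Windows the tick source would be a small adapter around GetTickCount, while the demo clock here only exists so the example can be checked deterministically.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*ms_clock_fn)(void);
typedef void (*workload_fn)(void);

/* Run the workload `iterations` times and return the total elapsed
   milliseconds.  Looping many times makes the run long relative to the
   1 ms tick resolution and averages out run-to-run variance. */
uint32_t time_workload_ms(ms_clock_fn now, workload_fn work, uint32_t iterations)
{
    assert(now != NULL && work != NULL);
    uint32_t start = now();
    for (uint32_t i = 0; i < iterations; ++i) {
        work();
    }
    return (uint32_t)(now() - start); /* subtraction is rollover-safe */
}

/* Demo clock and workload: each call to the workload "takes" 3 ms. */
uint32_t demo_clock_ms = 0;
uint32_t demo_now(void)
{
    return demo_clock_ms;
}
void demo_work(void)
{
    demo_clock_ms += 3u;
}
```

Dividing the returned total by the iteration count gives an average per-iteration time that a script can log and compare from build to build.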

I guess the thing I disliked the most was that the author took a list of implementation complications, application bugs from desktop Windows, incorrect information and “what if’s,” and turned it into a negative portrayal of Windows CE.  Maybe I am just too sensitive about the product I pour so much energy into.  Too many people assume that Microsoft developers don’t care, when the truth is very much the opposite.


Oh man has this discussion gotten long.  Have I ever got just a few words to say about something?  Oh well.  Write back if you have opinions to add or if you think I got any technical details wrong.


Sue


Comments (10)

  1. Jack Crenshaw is at it again… He’s back on the "There’s something wrong with GetTickCount" and therefore…

  2. Jack Crenshaw says:

    Sue, I’m sorry you found so many things wrong in my article in ESP — especially since we seem to agree on virtually every point. You said "Typically GetTickCount is used to time the duration between two events, in which case you’re generally safe if you subtract two values; subtraction is safe in the presence of rollover." I said the same thing. You said, "The only time subtraction can get you into hot water is if there’s a chance the time delta will exceed 49 days, because you may end up needing a difference that’s larger than you can represent in 32 bits. In which case GetTickCount is the wrong API for you." I said the same thing. You said, "as far as I can tell, and they are problems with applications that use GetTickCount improperly, not with Windows itself." I said the same thing.

    As I was very careful to state in my first article, when I first heard of this problem, I was ready to dismiss it as someone who simply didn’t understand how to use a timer. But when I saw that Microsoft themselves refer to the problem occurring in software _WRITTEN BY MICROSOFT_, and when I saw that the FAA requires Windows systems to be reset every 30 days, I had to consider the possibility that there was more going on.

    You guys seem to make a lot out of the fact that I said the counter counted down instead of up. That’s what I was told by someone who uses CE, and has run into the problem. I also have readers who have reported the clock running backwards, who have had GetTickCount return an error condition after the rollover, and Microsoft’s own reports refer to a system burning many more CPU cycles after the 49.7 days.

    Let’s review what we know: (1) A 32-bit integer clock ticking at 1 kHz rolls over in 49.7 days. (2) That should normally not be a problem. Anyone who thinks it is doesn’t know how to use a real-time clock. (3) _SOMEONE_ writing software using the function GetTickCount has written software that breaks at the rollover point.

    I thought I made extremely plain in my column, that if any software is creating a problem, it’s because someone took a non-issue and turned it into one. The only issue, to me, is who that person is. Is it inside Microsoft, outside Microsoft, or both?

    It seems to me that you have reacted very wrongly about this article. You have let your knee-jerk defense against what you perceive to be a criticism of Microsoft blind you to the words I said. Please go back and read the articles again, and show me where I said anything suggesting that it was all CE’s fault.

    For the record, I also wish that, instead of bashing me in absentia on your blog, you had had the courtesy to post your opinions on the Embedded Systems web site, where other readers could see it.

    Write a criticism on the web site; write a letter to the editor. All would have gotten due consideration. Talking about people behind their back isn’t nice.

    Jack

  3. Not a chance says:

    It’s always fun to watch a clash of egos…which is what seems to be happening here…=) To be honest, I’m relatively new to the embedded world and thus I really have no idea who Sue Loh and Jack Crenshaw really are – aside from hearing the names tossed around in various corners…

    After reading all the articles involved, I’m going to have to say both parties are guilty here.

    Mr. Crenshaw started out avoiding pointing fingers at anyone in his first article, but seemed to fall to the temptation to do so to a degree with the second…And it appears to be based somewhat on "he said, she said" info. I don’t feel he came across as strongly as Ms. Loh claims he did, but the insinuations are there. If only slightly and apparently not based on personal experience…

    Ms. Loh seems to be partially guilty of what Mr. Crenshaw claims – a knee-jerk reaction to something negative against MS/WinCE. I don’t feel Mr. Crenshaw’s article is particularly negative towards MS. Although there are those out in the world who will snap at every scrap they can find that justifies an anti-MS stand, and Mr. Crenshaw’s second article would meet the requirements for that. Anyways, I feel Ms. Loh does have some valid points. Any person who is serious about embedded programming with WinCE has ample documentation to reference regarding GetTickCount() and counters. And she is correct by stating that the counter increments rather than decrements. I also feel she is correct to call Mr. Crenshaw on the carpet on how he arrives at some of his conclusions (i.e – "…the author took a list of implementation complications, application bugs from desktop Windows, incorrect information and “what if’s,”…" ~~S. Loh)

    I have run into GetTickCount problems, but they have had nothing to do with MS programming or documentation – they were all implementation errors by the BSP. Our main problem was the GetTickCount() function implemented by the BSP "incrementing" the counter the _wrong direction_ (i.e. – decrementing the counter…) under certain conditions….=)

    From my relatively little experience, both Ms. Loh and Mr. Crenshaw are correct in their view points. Now if they can just bottle back up the evil genies of their egos to see it…=)

    Cheers!

  4. I have been in embedded systems development for 25+ years. I do not have personal experience with Windows CE or the problems with GetTickCount() that Jack, Mike and Sue have discussed. But as an embedded, real-time system developer for that long, I am well aware of the potential problems associated with finite-resolution integer representations and counter rollover. From my perspective, this is pretty basic stuff that you must get right for any embedded system to be reliable.

    From her description of the internals, Sue can accurately say that Jack is wrong to blame Win CE for the bug, since it comes directly from the OEM’s implementation of the BSP. But then it is a bit of a stretch for her to categorically deny that it counts down, or has non-atomic or non-monotonic behavior. It depends on the OEM’s implementation, doesn’t it? Sure enough, the next comment verifies that there are implementations out there that count down, not up.

    From the developer’s perspective, it doesn’t really matter who gets the blame. The fact remains, these bugs exist and they greatly impact overall system reliability. Worse yet, Sue and Mike’s attitude seems to imply that these bugs are not significant because they are not inherent to the OS, but instead due to developers not being careful enough when using the API or writing the BSP. They fail to acknowledge that several key MS products (e.g. rpcss) show exactly the same carelessness.

    When I deliver embedded solutions to my customers, the biggest problem I face is always with the software that I inherit, be it the OS or development tools. I can find and fix my own bugs, but invariably I spend WAY too much time chasing down idiosyncratic behavior that is due to OS or firmware anomalies that are not obvious or documented, so I don’t find out about them until very late in the game. I suspect this is exactly what happened with the FAA’s com system.

    I think Jack’s purpose was not to trash Windows or MS per se, but more to make developers aware that these details are critically important to making reliable systems, whether you work for yourself or Microsoft. They are not difficult to solve, but there are definitely only a very few "right" ways to solve them.

    The defensive postures of both Mike and Sue do not make me feel any more confident about using Win CE (or any other flavor) in my next embedded project. To the contrary, all I foresee is more band-aids like "reset the computer at least every 30 days."

    Sue asks for suggestions. My suggestion is that she (and MS) treat this type of bug as a major no-no, and work hard to remove it from all MS products. Don’t point fingers, make excuses or issue work-arounds for the end user. FIX THE CODE AND RELEASE IT. Remember that RELIABILITY is king in the embedded market.

    Start with the BSPs. Make sure there is no way a BSP can pass certification if it counts down, or has non-atomic or non-monotonic behavior. Help your OEMs understand and test for this. Make them aware of the issue in your sample timer code. I can’t believe that MS does not have the resources to make these things happen.

    The operative word must be SUPPORT, not BLAME. When it is SUPPORT, we all win. When it isn’t, we all look bad.

    -David Holland-Moritz

    Owner, Advanced Systems Design

  5. sloh says:

    Jack, first off, I’m sorry if you took my message to be “bashing” or “behind the back.” I didn’t even think I could post a criticism on the ESP web site; I just wrote in the forum I have at hand. I didn’t want to make a big deal about it either (big deals never turn out leaving Microsoft looking good, no matter who is right and who is wrong).

    All the problems you noted, except for the non-monotonic behavior, were cases where the people used GetTickCount wrong. (The non-monotonic case comes from incorrect BSP implementations.) It is true that even people inside Microsoft have made errors by using GetTickCount wrong, and your examples show that. There were no problems with the Microsoft implementation of GetTickCount itself. That’s all I was trying to say.

    You used a lot of phrases like “where there’s smoke, there’s fire” and “adding fuel to the fire” as though you were building up evidence that the OS was in the wrong here. And you said things like “does the software do X? If so it is thoroughly broken,” which certainly doesn’t cast us in a good light. In the other extreme you could have said, “look at this tough problem, and I expect the software has solved it!” which states just as many “facts” with a completely different insinuation. Those things were what I was reacting to. Yeah, I was probably reacting in knee-jerk fashion. I love my product, I work with very intelligent people, and I felt a need to defend my territory. I apologize again if you felt I was being unfair.

    Sue

  6. David Vescovi says:

    Being one who has experienced the problem first hand (the incorrectly implemented BSP one) I was very interested in reading Jack’s article when I saw it in the table of contents. I have been a fan of Jack’s and followed his writings for some time. I was disappointed as I felt the title of the article did not match the content at all. I held him and his writing skills in higher regard.

    I think the bigger problem is that there does not seem to be as much emphasis by MS on keeping the BSPs updated as on PB itself. This seems to be left to the third parties who, more often than not, like to play “I’ve got a secret” and for $$ I’ll let you in on it.