Data in crash dumps are not a matter of opinion


A customer reported a problem with the System­Time­To­Tz­Specific­Local­Time function. (Gosh, why couldn't they have reported a problem with a function with a shorter name! Now I have to type that thing over and over again.)

We're having a problem with the System­Time­To­Tz­Specific­Local­Time function. We call it like this:

s_pTimeZones->SystemTimeToTzSpecificLocalTime((BYTE)timeZoneId,
                                 &sysTime, &localTime);

On some but not all of our machines, our program crashes with the following call stack:

ExceptionAddress: 77d4a0d0 (kernel32!SystemTimeToTzSpecificLocalTime+0x49)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 00000000
   Parameter[1]: 000000ac
Attempt to read from address 000000ac
 
kernel32!SystemTimeToTzSpecificLocalTime+0x49
Contoso!CTimeZones::SystemTimeToTzSpecificLocalTime+0x26
Contoso!CContoso::ResetTimeZone+0xc0
Contoso!ResetTimeZoneThreadProc+0x32

This problem appears to occur only with the release build; the debug build does not have this problem. Any ideas?

Notice that in the line of code the customer provided, they are not calling System­Time­To­Tz­Specific­Local­Time; they are instead calling some application-defined method with the same name, which takes different parameters from the system function.

The customer apologized and included the source file they were using, as well as a crash dump file.

void CContoso::ResetTimeZone()
{
 SYSTEMTIME sysTime, localTime;
 GetLastModifiedTime(&sysTime);

 for (int timeZoneId = 1;
      timeZoneId < MAX_TIME_ZONES;
      timeZoneId++) {
  if (!s_pTimeZones->SystemTimeToTzSpecificLocalTime((BYTE)timeZoneId,
                                  &sysTime, &localTime)) {
    LOG_ERROR("...");
    return;
  }
  ... do something with localTime ...
 }
}

BOOL CTimeZones::SystemTimeToTzSpecificLocalTime(
    BYTE bTimeZoneID,
    LPSYSTEMTIME lpUniversalTime,
    LPSYSTEMTIME lpLocalTime)
{
    return ::SystemTimeToTzSpecificLocalTime(
        &m_pTimeZoneInfo[bTimeZoneID],
        lpUniversalTime, lpLocalTime);
}

According to the crash dump, the first parameter passed to CTime­Zones::System­Time­To­Tz­Specific­Local­Time was 1, and the m_pTimeZoneInfo member was nullptr. As a result, a bogus non-null pointer was passed as the first parameter to System­Time­To­Tz­Specific­Local­Time, which resulted in a crash when the function tried to dereference it.

This didn't require any special secret kernel knowledge; all I did was look at the stack trace and the value of the member variable.

So far, it was just a case of a lazy developer who didn't know how to debug their own code. But the reply from the customer was most strange:

I don't think so, for two reasons.

  1. The exact same build on another machine does not crash, so it must be a machine-specific or OS-specific bug.

  2. The code in question has not changed in several months, so if the problem were in that code, we would have encountered it much earlier.

I was momentarily left speechless by this response. It sounds like the customer simply refuses to believe the information that's right there in the crash dump. "La la la, I can't hear you."

Memory values are not a matter of opinion. If you look in memory and find that the value 5 is on the stack, then the value 5 is on the stack. You can't say, "No it isn't; it's 6." You can have different opinions on how the value 5 ended up on the stack, but the fact that the value is 5 is beyond dispute.

It's like a box of cereal that has been spilled on the floor. People may argue over who spilled the cereal, or who placed the box in such a precarious position in the first place, but to take the position "There is no cereal on the floor" is a pretty audacious move.

Whether you like it or not, the value is not correct. You can't deny what's right there in the dump file. (Well, unless you think the dump file itself is incorrect.)

A colleague studied the customer's code more closely and pointed out a race condition where the thread which calls CContoso::ResetTimeZone may do so before the CTimeZone object has allocated the m_pTimeZoneInfo array. And it wasn't anything particularly subtle either. It went like this, in pseudocode:

CreateThread(ResetTimeZoneThreadProc);

s_pTimeZones = new CTimeZones;
s_pTimeZones->Initialize();

// the CTimeZones::Initialize method allocates m_pTimeZoneInfo
// among other things

The customer never wrote back once the bug was identified. Perhaps the sheer number of impossible things all happening at once caused their head to explode.

I discussed this incident later with another colleague, who remarked

Frequently, some problem X will occur, and the people debugging it will say, "The only way that problem X to occur is if we are in situation Y, but we know that situation Y is impossible, so we didn't bother investigating that possibility. Can you suggest another idea?"

Yeah, I can suggest another idea. "The computer is always right." You already saw that problem X occurred. If the only way that problem X can occur is if you are in situation Y, then the first thing you should do is assume that you are in situation Y and work from there."

Teaching people to follow this simple axiom has avoid a lot of fruitless misdirected speculative debugging. People seem hard-wired to prefer speculation, though, and it's common to slip back into forgetting simple logic.

To put it another way:

  • If X, then Y.
  • X is true.
  • Y cannot possible be true.

In order for these three statements to hold simultaneously, you must have found a fundamental flaw in the underlying axioms of logic as they have been understood for thousands of years.

This is unlikely to be the case.

Given that you have X right in front of you, X is true by observation. That leaves the other two statements. Maybe there's a case where X does not guarantee Y. Maybe Y is true after all.

As Sherlock Holmes is famous for saying, "When you have eliminated the impossible, whatever remains, however improbable, must be the truth." But before you rule out the impossible, make sure it's actually impossible.

Bonus chatter: Now that I've told you that the debugger never lies, I get to confuse you in a future entry by debugging a crash where the debugger lied. (Or at least wasn't telling the whole truth.)

Comments (24)
  1. In the code I wrote, the quality assurance guy with his super-mega-nuclear test (yes, that was the real name of the test script) was able to hit race conditions just a couple instructions wide. This confirms that whatever is mathematically possible, will happen. In the worst case, after it's deployed on many machines in the field.

  2. Adam Rosenfield says:

    The debugger could lie to you if you have a rootkit (blogs.msdn.com/…/580608.aspx).  Or possibly overclocking, I suppose.

  3. I have this problem all the time when dealing with people who do not fully grasp what "undefined behavior" means or what can happen when multi-threading goes wrong.  "Component ABC works on U, V, W, and X computers but not on Y computer.  Therefore, Y computer must be defective / have a problem with its software environment / not compatible with some 3rd-party software on computer Y due to defect in the other software."  Actually, that means nothing.  Component ABC is guilty until proven innocent.  The definition of undefined behavior includes cases where undesirable behavior only occurs 0.1% of the time.

    In general – even outside of computers – I find I am a lot less willing to jump to conclusions about anything – compared to some people, anyway.  Just never know about that 1% chance…

  4. kog999 says:

    As a Systems Administrator I tend to find that many developers will blame literally anything else besides there code whenever there is an issue. Often it takes the form of them asking “is there a problem with the networkVMWareServersOperating System.Net Framework”, with no information as to why they think there is a problem. This always receives the same answer

    Me: “I’m not aware of any problems with the network”.

    A brief moment of silence ensues

    Me: “No one has reported any problems with the network I’ve been using it all day and can even get to the old new thing, why do you think there is a network problem”

    Them: “My code connects to database1 to get some info blah, blah, blah, and the webpage I wrote is displaying 5 instead of 6.”

    Me: “so you can get to the server database1, sounds like the network is fine to me.”

    Them: “Well I need you to check harder MY CODE HASN’T CHANGED!”

  5. Joshua says:

    • If X, then Y.

    • X is true.

    • Y cannot possible be true.

    ∴ Heap corruption.

    Been there.

  6. wma says:

    This is when you're thinking you sound like a teacher from a Charlie Brown TV episode ;)

  7. Ivo says:

    "There is no bug here, the program works as coded"

  8. Gabe says:

    When your code works on Machine X but not Machine Y, you could be running into CPU errata. In this case it might be the machine lying to you rather than the debugger.

  9. Adam Rosenfield says:

    @Gabe: True, it's certainly possible, like the infamous Pentium FDIV bug.  But the majority of the time, if it works on Machine X but not Machine Y, you probably have a bug in your code (race condition, undefined behavior, or the like) which only manifests on X.

  10. Maurits says:

    When you have eliminated the impossible, then whatever remains, *however improbable*, must be the truth.

    — Sherlock Holmes, The Sign of the Four

  11. Jonathan Wilson says:

    Occam's Razor says its far more likely to be a bug in your code than a hardware fault.

  12. Matt says:

    If in doubt, it's always your code that screwed up. Blaming someone else should always be a last resort.

  13. Damien says:

    @Matt – if only every compiler, interpreter, macro language, etc would display that message by default every time it was invoked. (Maybe I'm wrong, but) it always seems to be the beginner programmers who assume that they're writing perfect, pristine code every time. (Of course, some beginners never grow out of this attitude, I'm not sure how to fix those people, other than recommend they pick a different career)

  14. LaQuisha Redfern says:

    You went from 'customer' to 'a lazy developer who didn't know how to debug their own code' pretty quickly.

    Don't let the marketing department see! They'll have kittens.

  15. Samuel222 says:

    I'm sure there is a huge religious/political can of worms somewhere in this argument.

  16. Adrian Ratnapala says:

    <em>

       If X, then Y.

       X is true.

       Y cannot possible be true.

    </em>

    Well that's a one up on en.wikipedia.org/…/Paraconsistent_logic

    But still I am amused that the article mention using it as a way of tackling sofware systems.

  17. Terry A. Davis says:

    AdamLog(fmt_st,…) has variable number of args.  The first on the stack is fmt_st, then #args, then args.

    PrintF("%h5D",dt);  That will print in date format 5.  

    It is faster execution time to number date formats than to use strings like "mm:dd:yy".  I learned this at Ticketmaster.

    I did a quick solution to International — a level of indirection.  It looks in an array to map date format numbers to new date format numbers.

    time zone is set with a simple system-wide global variable.

  18. "Gosh, why couldn't they have reported a problem with a function with a shorter name! Now I have to type that thing over and over again."

    On the bright side, just imagine what Klingon epithet the equivalent function would be called on a Unix platform…

    I recall one painful debugging session where a particular function call (NtClose), from userspace, was giving me a BugCheck. Every time. Now, even if I'd been passing in random integers, that would have been a bad thing, but I was just closing the handle I'd received earlier from NtCreateFile! After much probing of my own logic, my soul and the kernel crash dump, I found the culprit: buggy anti-virus filter driver. Turn it off, or switch to another product, everything worked fine. "The AV filter does something stupid when NtCreateFile is called in certain ways, which then causes it to break later on when that handle is closed" wasn't really a conclusion I'd leap to readily.

  19. @jas88: Actually, I'd argue that a bugcheck / BSOD is a good reason to suddenly jump to "no way it's my code", unless you're actually working kernel code as part of your system.  Especially if the bugcheck happens when running as a normal user.  But even if you have administrator access, it should take an explicit effort to cause a bugcheck.  User-mode code should never bugcheck the system.  It can crash itself all it wants, but it should never cause a bugcheck.

  20. Joshua says:

    @JamesJohnson: If I'd found that from unprivileged code, I'd immediately be contacting Microsoft with a potential DOS exploit. They'd probably be able to narrow it down to the screwed-up antivirus easily enough.

  21. JM says:

    Whenever people ask me for my professional inspiration, I'll say "Sherlock Holmes". And quote this line at them. Actually, I quote that line all the time, but that's just because people don't appreciate it enough.

    It's soooo tempting to assume that you *know* what's going on and ignore the facts as they're staring you right in the face. After all, our software is whatever we want it to be, right? And computers just do what they're told, right? So obviously we have complete knowledge of the situation, right? Having the knowledge available and acting on it correctly are two very different things. Experience helps but is no absolute antidote against this (you'll just start letting more elaborate assumptions slip in, because "you've seen something like this before").

    My standard reply to "my code hasn't changed" is "then the circumstances on the machine have changed ever so slightly, exposing a subtle bug that you've never noticed before". The important part is to never stop believing that the code has a bug, until you have solid, iron-clad evidence that something else is to blame. And even *then*, when you fix that you *still* assume your code has a bug by verifying that the fix does what it's supposed to do! Per Hickam's dictum ("Patients can have as many diseases as they damn well please") when something goes wrong, as many things can be wrong at the same time as the universe damn well pleases, but it's also very probable that only one thing is actually causing the bug to appear.

  22. <i>When you have eliminated the impossible, then whatever remains, *however improbable*, must be the truth.</i>

    Ah, it brings back fond memories of "this bug cannot happen" bugs.

    The bugs that, after days, you finally get down to a state where you can decisively declare: "Since this obviously can not be happening, the customer must have been hallucinating, and now we've been infected as well."

    The last one I had I got down to three lines of code as long as they were placed in a particular place in the code:

    var x : int

    x = 2

    print x

    > 0

    Turned out it was an obscure bug in the p-code generation…

    The one before that was a buffer being filled in 5 minutes after the system reported a network read failure and we assumed it was safe to re-allocate the buffer.  Apparently the call had given up, but somewhere deep inside the system, a brave routine was waiting all alone in the darkness for a packet that did, finally, come, allowing it to fulfill its mission (and corrupt our memory).  We had no threads, and we could loop, outputting a memory location, and watch it spontaneously change by itself…

    After those two, whenever we hit an non-obvious bug, it really did take effort to assume it was our code (which it was, of course, 4 times out of 5).

  23. Cesar says:

    An interesting one: mjg59.dreamwidth.org/11235.html

    The firmware set up the wireless card, and it was DMA-ing data to a region of memory every time it received a wireless beacon packet. The firmware did not reset the card before booting the operating system, and the region of memory it used for the DMA was marked as available for use by the operating system. Memory corruption ensues.

  24. > The firmware set up the wireless card, and it was DMA-ing data to a region of memory

    > every time it received a wireless beacon packet. The firmware did not reset the card

    > before booting the operating system, and the region of memory it used for the DMA was

    > marked as available for use by the operating system. Memory corruption ensues.

    Ah. more proof that no matter how bad a debugging scenario I've encountered, someone's had it *much*, *much* worse :-).

Comments are closed.