The debugger lied to you because the CPU was still juggling data in the air

A colleague was studying a very strange failure, which I've simplified for expository purposes.

The component in question has the following basic shape, ignoring error checking:

// This is a multithreaded object
class Foo
{
public:
 void BeginUpdate();
 void EndUpdate();

 // These methods can be called at any time
 int GetSomething(int x);

 // These methods can be called only between
 // BeginUpdate/EndUpdate.
 void UpdateSomething(int x);

 Foo() : m_cUpdateClients(0), m_pUpdater(nullptr) { ... }

private:
 LONG m_cUpdateClients;

 Updater *m_pUpdater;
};

There are two parts of the Foo object. One part that is essential to the object's task, and another part that is needed only when updating. The parts related to updating are expensive, so the Foo object sets them up only when an update is active. You indicate that an update is active by calling Begin­Update, and you indicate that you are finished updating by calling End­Update.

// Code in italics is wrong
void Foo::BeginUpdate()
{
 LONG cClients = InterlockedIncrement(&m_cUpdateClients);
 if (cClients == 1) {
  // remember, error checking has been elided
  m_pUpdater = new Updater();
 }
 // else, we are already initialized for updating,
 // so nothing to do
}

void Foo::EndUpdate()
{
 LONG cClients = InterlockedDecrement(&m_cUpdateClients);
 if (cClients == 0) {
  // last update client has disconnected
  delete m_pUpdater;
  m_pUpdater = nullptr;
 }
}

There are a few race conditions here, and one of them manifested itself in a crash. (If two threads call Begin­Update at the same time, one of them will increment the client count to 1 and the other will increment it to 2. The one which increments it to 1 gets to work initializing m_pUpdater, whereas the second one runs ahead on the assumption that the updater is fully initialized.)
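
To make the interleaving concrete, here is a hypothetical repro sketch (an illustration invented for this writeup, not code from the original report). Two threads racing through this client function can crash exactly as described:

#include <functional>
#include <thread>

// Hypothetical repro: both threads race into BeginUpdate. The thread that
// increments the count to 1 starts the slow "new Updater()"; the thread that
// increments it to 2 skips initialization and may dereference m_pUpdater
// while it is still null.
void UpdateClient(Foo& foo)
{
 foo.BeginUpdate();       // both threads race here
 foo.UpdateSomething(42); // loser of the race may see m_pUpdater == nullptr
 foo.EndUpdate();
}

void Repro(Foo& foo)
{
 std::thread a(UpdateClient, std::ref(foo));
 std::thread b(UpdateClient, std::ref(foo));
 a.join();
 b.join();
}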

What we saw in the crash dump was that Update­Something tried to use m_pUpdater and crashed on a null pointer. What made the crash dump strange was that if you actually looked at the Foo object in memory, m_pUpdater was non-null!

    mov ecx, [esi+8] ; load m_pUpdater
    mov eax, [ecx]   ; load vtable -- crash here

If you actually looked at the memory pointed to by ESI+8, the value there was not null, yet in the register dump, ECX was zero.

Was the CPU hallucinating? The value in memory was nonzero. The CPU loaded a value from memory. But the value it read was zero.

The CPU wasn't hallucinating. The value it read from memory was in fact zero. The reason you saw the nonzero value in memory was that in the time it took the null pointer exception to be raised and then caught by the debugger, the other thread managed to finish calling new Updater(), store the result back into memory, and then return to its caller and proceed as if everything were just fine. Thus, when the debugger went to capture the memory dump, it captured a non-zero value, and the code which updated m_pUpdater was long gone.

This type of race condition is more likely to manifest on multi-core machines, because on those types of machines, the two CPUs can have different views of memory. The thread doing the initialization can update m_pUpdater in memory, and other CPUs may not find out about it until some time later. The updated value was still in flight when the crash occurred. Before the debugger can get around to capturing the m_pUpdater member in the crash dump, the in-flight value lands, and what you see in the crash dump does not match what the crashing CPU saw.
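
For completeness, the standard way to publish a pointer that another thread will read without a lock is a release store paired with an acquire load, so the pointed-to object is fully constructed before any other core can observe the pointer. Here is a minimal C++11 sketch (the global names are invented for illustration, and this addresses only the visibility problem, not the reference-counting race above):

#include <atomic>

std::atomic<Updater*> g_pUpdater{nullptr};

void Publish()
{
 // The release store guarantees the Updater's construction is visible
 // to any thread whose acquire load sees this pointer.
 g_pUpdater.store(new Updater(), std::memory_order_release);
}

Updater* Observe()
{
 // The acquire load sees either null or a fully constructed object.
 return g_pUpdater.load(std::memory_order_acquire);
}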

Comments (28)
  1. SimonRev says:

    So are you trying to say that the entire pair of functions is wrong, or did the italics flag get stuck on more than you intended? Surely the names of the functions and associated curly braces are at least correct.

    [I'm trying to say, "Be aware that the value in the crash dump is the value that existed in memory a split second after the crash occurred. During that split second, other things may have happened. -Raymond]
  2. ipoverscsi says:

    @SimonRev: I'm pretty sure all the code in italics is wrong because of the race condition mentioned in the article.

  3. Jim says:

    I think SimonRev's point is that the idea of having a BeginUpdate and an EndUpdate method isn't wrong; it's just their implementation that's wrong. But I'm not sure that making their method names upright and leaving their implementations in italics would make things any clearer (especially since their declarations in the class are upright).

  4. Brian_EE says:

    So…. create a flag variable in the class private section that indicates if Updater() has completed, and add a check in the else clause? Then something similar in the EndUpdate() that indicates the pointer is being deleted.

    Of course, what happens if the check of the flag happens before everything is coherent again. Hmmm.

    So what is the foolproof solution?

  5. The foolproof solution... says:

    The foolproof solution is to wrap the contents of BeginUpdate and EndUpdate in a critical section or mutex. In essence, you need to prevent two (or more) threads from actually being in either of these functions at the same time, including not allowing a call to EndUpdate at the same time as a call to BeginUpdate… (Scenario: In thread one, EU decrements the counter to zero, and gets timesliced before even beginning to release the updater object. Now in thread two, BU increments the counter, sees that it is one, and creates a new updater object, and then gets timesliced. Thread one resumes, grabs the new object pointer and releases it. Result: thread two crashes on a NULL, and the old updater object is leaked. The critical section / mutex solution prevents thread two from actually entering BU before thread one has finished its work. Put the CRITICAL_SECTION structure in the host object!)
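
    A minimal sketch of this suggestion (an illustration, not the commenter's actual code; the m_updateLock member is invented, with std::mutex standing in for the Win32 CRITICAL_SECTION):

    #include <mutex>

    void Foo::BeginUpdate()
    {
     std::lock_guard<std::mutex> guard(m_updateLock); // hypothetical member
     if (++m_cUpdateClients == 1) {  // plain increment is fine under the lock
      m_pUpdater = new Updater();    // error checking still elided
     }
    }

    void Foo::EndUpdate()
    {
     std::lock_guard<std::mutex> guard(m_updateLock);
     if (--m_cUpdateClients == 0) {
      // last client: tear down under the same lock, closing the
      // EndUpdate/BeginUpdate race described in the scenario above
      delete m_pUpdater;
      m_pUpdater = nullptr;
     }
    }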

  6. Joshua says:

    Nice! An occasional demonstration that the debugger is not magic comes in handy to teach people.

  7. Carl D says:

    Ahh, memories of when the first dual-socket Pentium Pro machines appeared and desktop programmers everywhere discovered that this type of bug is (at least) thousands of times more likely to be observed on a true multi-processor machine than due to simple time slicing on a single core.

  8. voo says:

    @The foolproof solution…: Actually you only have to use the mutex (or some kind of synchronization) when initializing. So you can actually do double-checked locking to avoid the synchronization cost. The problem with that is obviously that getting double-checked locking right is quite… interesting, and I wouldn't want to come up with one for C++11 myself.

  9. Douglas Adams says:

    So what is the foolproof solution?

    A common mistake that people make when trying to design something completely foolproof is to underestimate the ingenuity of complete fools.

  10. Mordachai says:

    @Mike Dimmick – nice, thanks for the additional info!

  11. Joshua says:

    @Mike Dimmick: Yes, but not the Windows CriticalSection API which is dog slow.

  12. Ishai says:

    This type of race condition is more likely to manifest on multi-core machines

    If you are dealing with a JIT debugger / crash dump, this is common on single-core machines too. There is plenty of time for ready threads to modify memory between the time the unhandled exception filter launches the debugger and the time the debugger suspends all threads.

  13. Joshua says:

    @voo: The double-checked protocol, when done right, is pessimistic in that it takes the lock when it can't decide without taking the lock. The only hidden assumptions are that the compiler cannot optimize across module boundaries and that cache flushes cannot happen out of order.

  14. Mike Dimmick says:

    @voo: A correct implementation of double-checked locking requires you to use memory fences to ensure that the processor actually flushes its write queue out, and that the caches are kept properly coherent. If you do it properly, the result is actually about the same code as you would execute if you just used a critical section. (…/DDJ_Jul_Aug_2004_revised.pdf)

    See also Herb Sutter's presentation to the Northwest C++ Users' Group from 2007, titled "Machine Architecture: Things Your Programming Language Never Told You".

    You might get away with it on x86, which has a strong memory model, but ARM has a *weak* memory model and you may well run into trouble. What do I mean by memory model? See …/51445.aspx.

    .NET's memory model does, I believe, ensure that double-checked locking implemented naively by the programmer runs correctly on the hardware. (…/double-checked-locking-in-net)
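
    A sketch of the pattern these comments describe, using C++11 atomics (an illustration; the names are invented). The acquire load makes the lock-free fast path safe, and the mutex serializes the slow path:

    #include <atomic>
    #include <mutex>

    std::atomic<Updater*> g_updater{nullptr};
    std::mutex g_updaterLock;

    Updater* GetUpdater()
    {
     Updater* p = g_updater.load(std::memory_order_acquire); // fast path, no lock
     if (!p) {
      std::lock_guard<std::mutex> guard(g_updaterLock);      // slow path
      p = g_updater.load(std::memory_order_relaxed);         // re-check under the lock
      if (!p) {
       p = new Updater();
       g_updater.store(p, std::memory_order_release);        // publish fully-built object
      }
     }
     return p;
    }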

  15. voo says:

    @Mike Dimmick: You can implement mutexes without bidirectional memory fences? No idea about ARM, but on x86 that *should* make a difference, especially since you also need some further bookkeeping for mutexes (ok, you get around the kernel object for CriticalSections afaik, so probably not too bad). But yes, double-checked locking is one of those cute tricks that doesn't make any noticeable difference 99% of the time and is still easy to get wrong (and even if programmers get it right, they probably have no idea why it works), so I'm not really advocating it for general use.

    Anyway, yes, I know what a memory model is, but I haven't really studied the new C++11 one in any detail. Java 5+ as well as the C# memory model for volatiles does guarantee that the simple double-checked locking implementation works, btw.

  16. Smitty says:

    To this experienced multi-threaded developer the example code looks naive, so it is no wonder it failed.

    Furthermore, the amazement that the crash-dumper or debugger can lie to you (especially in a multi-threaded application) is laughable. Anyone who has ever tried to debug a SEGV crash dump knows that what gets written to the crash-dump (aka 'core') file is little more than a hint at what was going on in the process address space at the time of the crash. Multi-threaded crash dumps are similarly unreliable for similar reasons.

    Why do I get the impression that nobody knows how to code anymore?

  17. Jim says:

    @Smitty: Perhaps you get that impression because you now have quite a bit of experience, and there exist people with less experience, all of whom presumably fit into your "do not know how to code" category.

  18. RicarDog says:

    "If you actually looked at the memory pointed-to by ESI+8, the value there was not null, yet in the register dump, EAX was zero."

    I think you meant ECX instead of EAX here, right?

    [Fixed, thanks. -Raymond]
  19. Zan Lynx' says:

    @Smitty: In my experience you can trust the register state for the crashed thread from a crash dump. That means you can trust the instruction pointer and the values in the registers.

    All else, such as the stack trace and other threads, is, yes, just a hint.

    But my point was that there *are* things that are trustworthy in the crash dump.

  20. alegr1 says:


    @Mike Dimmick: Yes, but not the Windows CriticalSection API which is dog slow.

    An equivalent of a single interlocked operation (in the case of non-contention) is slow? Are you confusing it with a mutex?

  21. Joshua says:

    @alegr1: It will contend relatively frequently and most of the time for no good reason. I use the critical section once inside the first check of the double-check and it's no issue there.

  22. alegr1 says:


    @alegr1: It will contend relatively frequently and most of the time for no good reason.

    Why would a critical section have contention if there is no good reason (a good reason being that the CS is already owned)?

  23. Kyle says:


    Precisely. Essentially you're balancing the cost of spins vs. kernel transitions (and really more, with getting the scheduler involved). And as you say, minimizing the use of synchronization primitives is the hallmark of threaded development, although it takes care for obvious and many non-obvious reasons.

  24. hacksoncode says:

    Worrying about the expense of a CriticalSection in this instance is one of the more severe cases of premature optimization I've encountered. It's stated that updating is an expensive operation, and from the shape of the code it appears to be an infrequent one as well.

    If ever there were a time to not worry about that cost, this would be it. Worry about it after profiling, if necessary.

  25. Joshua says:

    @hacksoncode: 80 threads, one singleton initialized once, fetched every RPC call. That's 200 times per minute. The code is /almost/ a candidate for InitOnceExecuteOnce.
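
    For reference, a sketch of the InitOnceExecuteOnce approach (an illustration; error handling elided as in the article). Windows runs the callback exactly once; concurrent callers block until it completes and then all receive the same context pointer:

    #include <windows.h>

    INIT_ONCE g_initOnce = INIT_ONCE_STATIC_INIT;

    BOOL CALLBACK InitUpdaterOnce(PINIT_ONCE, PVOID, PVOID* ppContext)
    {
     *ppContext = new Updater(); // runs on exactly one thread
     return TRUE;
    }

    Updater* GetSingletonUpdater()
    {
     PVOID context = nullptr;
     InitOnceExecuteOnce(&g_initOnce, InitUpdaterOnce, nullptr, &context);
     return static_cast<Updater*>(context);
    }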

  26. Kyle says:


    The only downside to critical sections is if they're used in areas of continuously high contention…then they spin for a while before sleeping, thus wasting CPU cycles.  But, as alegr1 stated previously, under low contention, there should be significant speed improvements.

    It's not like critical sections cause *more* contention than mutexes, they just tend to waste CPU resources when there is significant contention, whereas a mutex will sleep until contention subsides.  If anything, avoiding a user-kernel-user round trip is usually worthwhile.

    [You can disable spinning with Initialize­Critical­Section­And­Spin­Count, but spinning is a win for low-duration locks since it lets the waiter grab the critical section as soon as the owner exits, rather than waiting for the scheduler. If you see lots of wasted spins, then your critical section is too broad. -Raymond]
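
    For what it's worth, a sketch of the knob Raymond mentions (an illustration; 4000 is an arbitrary example value):

    #include <windows.h>

    CRITICAL_SECTION g_cs;

    void InitLocks()
    {
     // Spin up to 4000 times before sleeping on contention;
     // a spin count of 0 disables spinning entirely.
     InitializeCriticalSectionAndSpinCount(&g_cs, 4000);
    }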
  27. alegr1 says:


    Fetching a pointer to a singleton 200 times per minute (once every 300 ms, or every 1,000,000,000 CPU clock cycles) is about as likely to cause contention as you are to collide with one randomly selected US citizen passing through a door on any given day.

    My guess is you made some mistake in the implementation.

  28. Joshua says:

    @alegr1: I got tired of the critical section showing up in the profiler.

Comments are closed.