Things I’ve written that have amused other people, Episode 8

In a technical discussion, I opened a reply with

Bob's paper which I haven't yet read points out that...

Some people wrote to me to say that they've added this quote to their quote files in the hopes of being able to use it themselves someday.

For those who are curious, I found the technical discussion in question. It had to do with whether the following code is thread-safe:

// initial conditions
int x = 1, y = 0;
int *p = &x;

// Thread 1 executes this
y = 1;
p = &y;

// Thread 2 executes this
printf("%d\n", *p);

Question: Can this code legitimately print 0?

Surprisingly, the answer is yes!
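
For readers who want to see the flip side, here is a minimal C++11 sketch (my own illustration, not code from the original discussion) of how the hand-off can be written so that printing 0 is no longer a legal outcome: publish the pointer with a release store and read it with an acquire load, so the write to y is guaranteed to become visible before the new value of p does.

// Illustrative only: a C++11 version of the hand-off that cannot print 0.
#include <atomic>
#include <cstdio>

int x = 1, y = 0;
std::atomic<int*> p{&x};

// Thread 1 executes this
void thread1()
{
    y = 1;
    p.store(&y, std::memory_order_release); // publishes y = 1 along with the pointer
}

// Thread 2 executes this
void thread2()
{
    int* q = p.load(std::memory_order_acquire); // pairs with the release store
    std::printf("%d\n", *q); // prints 1 whether q is &x or &y; never 0
}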

Comments (26)
  1. Adam Rosenfield says:

    Moral: reasoning about multithreaded code is hard.  I hope I never have to write code that will run on the Alpha 21264.  x86, x86-64, and PPC are enough for me for now.

  2. Vilx- says:

    This just goes to show: Multithreading is the root of all evil! XD

  3. tony r says:

    that hurts just looking at it.

  4. Zarat says:

    Nah, it just shows our tools (both the theory and software) are insufficient ;)

  5. Zarat says:

    Although I'm a bit annoyed by the fact that everyone points at the same example and then just stops. It would be much more interesting to know what problems occur on other processors, if any. If it's just the Alpha that's problematic, that's OK, since you need to know your target architecture anyway when writing lockless code.

  6. Frank Schwab says:

    No, it shows that there is broken hardware in the world.

    A multi-processor system that "defers" processing of messages to invalidate cache is broken. Plain and simple.  The hardware designers may have thought they were making decisions to optimize performance, but they failed in their duty to build usable hardware.  At the very least, the compiler for this architecture needs to understand the failure model here and work around it by inserting the appropriate memory barriers automatically (which, of course, greatly slows the rest of the code in the application).  

  7. Jeroen Mostert says:

    Memory models are one area where hardware optimization still trumps programmer expectation. The engineers will build what's fast, not what's intuitive — instead of the processor being there to execute your program, it's patiently sitting there waiting for test cases, of which your program happens to be just one. So now you get to figure out two programs: the one you wrote and the one the processor engineers wrote. If you're not lucky or smart enough to get one memory model that appropriately covers all processors you're targeting, Turing help you — and even if you do, don't expect that model to be comprehensible to mere mortals.

  8. Me says:

    IMO, this doesn't just show that "reasoning about multithreaded code is hard". It shows that it is almost impossible for anyone who is not a dedicated expert in the field. Even if you think you know enough about this, chances are pretty high that you are wrong.

    It's very similar to crypto in this regard.

  9. Joshua says:

    I must agree with Frank Schwab. The hardware is broken.

  10. NB says:

    I thought the Alpha was long buried?

  11. Torgo says:

    I had to re-read that quote a few times to understand it. I'll add it to my quote file after adding a couple of commas: "Bob's paper, which I haven't yet read, points out that…"

    Of course, that confusion was nothing compared to understanding how the code could print 0 (at least prior to reading the link).

  12. Zan Lynx says:

    The hardware is not broken. It doesn't work the way that you would like it to, but there was a reason Alpha CPUs were fast. This was one of the reasons.

    I believe Itanium can show the same sort of behavior, depending on what sort of speculative stores and loads are used.

    And if you think Alphas are broken, you should try some cell phones where there's no cache coherency at all. All inter-CPU communication has to be done through queues and cache lines are synchronized only on particular interrupts.

  13. EMB says:

    Foreign language programmer here… I'm not sure I understood what you mean by that. How can you know what Bob's paper points out without having read it yet?

    Did you already build your time machine and talk to yourself from the future? Yes, I'm serious. Can someone explain?

  14. Sunil Joshi says:

    Presumably he read the abstract but not the full paper.

  15. Joshua says:

    @Zan Lynx: then that had better not be SMP, as no code could survive being moved from one processor to another.

  16. Jeff says:

    EMB: Perhaps Raymond follows the Alpha memory model. If he didn't put a memory barrier between reading the paper and writing his reply, the two could be reordered so that, if Bob changed the paper, Raymond could have written a reply based on the old paper despite having (later) read the new one. Hence the warning.

  17. Michael Grier [MSFT] says:

    Several interesting layers here.

    First, the hardware wasn't broken.  It was implementing the memory coherency spec for Alpha correctly.  The Alpha ARM (Architecture Reference Manual, not the ARM CPU architecture…) was very detailed in allowing for loose memory coherency models.  You may not like the architecture for that reason, and you may feel that the design made coding more complex than it needed to be, but as was pointed out, the coherency design allowed for a significant reduction in pipeline blockages on MP systems.

    Second, people do this because locks are considered bad.  There are really two problems at hand here:

    2a. People hold primitive locks over inappropriate amounts of processing.  Holding a lock while performing a compare-and-store operation is just fine.  The problem that people usually have is a primitive lock held over a lengthy operation like I/O.  If you're in a hot path, it's a good idea to ensure that the cache lines you will touch under the primitive lock are already in the cache and the pages in the TLB.  This will avoid any cases where you accidentally hold the lock for an excessive amount of time in the first place.

    2b. A lot of lock implementations were bad.  A great engineer on the Windows team fixed most of the primitive locks in Windows a few years ago, which changed the equation dramatically.  For a long time it was "common knowledge" to use the double-checked locking pattern; after he fixed critical sections, there was very little reason to still do so.
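
    For anyone who doesn't remember the pattern, here is a rough, illustrative sketch of double-checked locking in modern C++ terms (the Widget/GetWidget names are made up for the example; the classic pre-C++11 version, which omitted the atomic acquire/release pieces, is the famously broken one):

    // Illustrative double-checked locking sketch (names are hypothetical).
    #include <atomic>
    #include <mutex>

    struct Widget { /* expensive to construct */ };

    std::atomic<Widget*> g_widget{nullptr};
    std::mutex g_widgetLock;

    Widget* GetWidget()
    {
        Widget* w = g_widget.load(std::memory_order_acquire); // first, cheap check
        if (!w) {
            std::lock_guard<std::mutex> guard(g_widgetLock);
            w = g_widget.load(std::memory_order_relaxed);      // second check, under the lock
            if (!w) {
                w = new Widget();
                g_widget.store(w, std::memory_order_release);   // publish the constructed object
            }
        }
        return w;
    }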

    Sometimes you really do need to do something like this.  But it's pretty darned rare; it's mostly something that people who work on things like the scheduler or page fault handler in the kernel worry about.  If you're working at that layer of the system you had better understand the memory model in detail.

    Third, this is exacerbated by the C/C++ languages not acknowledging the existence of multithreaded programming for an extended period of time.  There was no mandated abstract memory model (which might not have exposed the same sharp edges implied here), so it is/was extremely difficult to write precise code that behaved the same on all processors.  When your mental model is that C is a high-level assembly language, you would be surprised if normal reads or writes to shared values implied barriers of any kind.  The CLR memory model is very explicit on these points; I assume the JVM's is also, but I haven't verified.

    Finally, x86 and ARM also have similar sharp edges in their memory models which really just gets you back to the original point of "just use locks correctly and you won't have a problem".  (This of course says that perhaps it wasn't fruitful for the Alpha ARM to specify such a loose model but the designers were looking for an architecture that would be able to flourish over the coming 20-30 years.)  It's surprising how many similar misunderstandings of the x86 MP cache coherency models existed for such a long time.

  18. voo says:

    And that's why a clearly defined memory model such as Java's is such a helpful – and I'd even say necessary – thing for writing correct(!) non-blocking multithreaded code. I haven't looked at the MM of the new C++ standard, but before that you basically had to write your C++ algorithms for one specific architecture, and most of the time even for one specific compiler, if you wanted to be clever.

  19. Deduplicator says:

    @Michael Grier:

    If it doesn't violate the principle of least surprise, it's not nearly as good a pit.

    Next, you really want to lambast C and C++ for not providing a comprehensive definition of multithreaded/multiprocess behavior? Doesn't UB cover it all beautifully, especially considering the amount of resources that have since been (and especially now are) poured into just this crack, regularly bending or breaking previous best practices and existing design rules? As far as I know, that's true for all architectures/programming languages allowing/extensible to multiprocessing.

    And do Java/.Net really solve all the hassles and corner-cases? I seem to recall difficulties with timed locks and exceptions.

    @voo: Let's celebrate raising things to the lowest common denominator. Not? Then there's really expensive machinery faking things for those bad nonconformists. But rest assured: If you don't want to be more clever in C++ than in C#, nobody forces you to be, even though you drop a bit of the potential extra performance, writing to said lcd.

  20. Ray Trent says:

    An obscure question that I'm not up to spending the time to figure out, but someone may know: is it possible to "fix" this problem by declaring any or all of the variables "volatile"?

    Note that I don't mean "fix it as a practical matter in the existing implementations of the C/C++ compilers on various platforms" but "make it a violation of the specification to misbehave this way, and therefore answer the question 'is this legitimate?'".

  21. Cesar says:

    And this is why you should always pair your memory barriers.

    For people who want to read more about it, there is a *very* detailed explanation in the Linux kernel's "Documentation/memory-barriers.txt" file. It is long, but worth it.
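
    To illustrate what "pairing" means, here is a small sketch of my own in C++11 terms (the variable names are made up): a release fence on the writing side only helps if the reading side has the matching acquire fence.

    // Illustrative only: the writer's barrier must be paired with the reader's.
    #include <atomic>

    int data = 0;
    std::atomic<bool> ready{false};

    void writer()
    {
        data = 42;
        std::atomic_thread_fence(std::memory_order_release); // "write barrier"
        ready.store(true, std::memory_order_relaxed);
    }

    void reader()
    {
        if (ready.load(std::memory_order_relaxed)) {
            std::atomic_thread_fence(std::memory_order_acquire); // matching "read barrier"
            // Only with both fences present is data guaranteed to be 42 here.
        }
    }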

  22. Michael Grier [MSFT] says:

    I don't get the reference, but looseness of specification is what allows innovation to occur without excessive costs being paid.  We're fortunate that Intel has made a mint or two on the x86 processor and has reduced what would seem to be a poorly designed processor architecture to practice to such a degree that a competitor really has to operate on a fundamentally different premise (e.g. the extreme parallelization of GPUs or the extreme low power of ARM SOCs) to compete.  It was always designed for automatic compilation rather than hand-written assembly, so some of these tricky things could become the purview of compiler or code-generation writers.

    Alpha's failure wasn't anything technical (although some specific attributes, like the lack of byte addressing, surprised people); it was Robert Palmer's disassembly of DEC leading to the assets going to Compaq, which was already in bed with H-P and didn't want to rock the boat for the VLIW PA-RISC follow-on, Itanium.

    I hardly call pointing out that the problem is exacerbated by the languages' lack of a stance "lambasting" them.  I was trying to be open here about how to have safe portable code, but even just within the Microsoft C/C++ toolset, it wasn't until fairly recently that the whole relationship between volatile and acquire/release memory coherency semantics was worked out.  It is certainly possible for any given compiler implementation to take a stance; for all I know, gcc is very concrete on this topic.

    Timed locks and exceptions seem to be a higher level problem than the basic memory coherency model.  The CLR takes a stance on this topic.  Memory coherency doesn't solve all problems by any means but at least it means that teams which implement higher level synchronization objects like the .net frameworks have a basis on which they can judge the correctness of their implementation.  As I mentioned, I have no idea what levels of guarantees the JVM provides or what synchronization primitives in the common class libraries are implemented as native call-outs vs. on top of the JVM.  (Note that the runtime providing a guarantee isn't sufficient – optimizing compilers must also have rules to follow.  Eric Lippert has a great blog and has covered some of these issues with respect to C# in the past.…/ericlippert)

    As soon as you move up the stack such that you're going to have timeouts and alternate code paths dealing with those timeouts, it gets much harder to reason about correctness so I won't try to for the sake of this discussion.

    @Ray Trent:

    It depends on how the compiler decided to define volatile.  This was a topic of much debate some years ago when people realized that the x86 memory model wasn't as "obvious" as people first thought.  Volatile came into existence to solve the problems of driver writers dealing with memory-mapped I/O, so the expectations were around forcing reads and writes to occur in the same order that they appear in the source code.  The fact that these reads and writes could have side effects on other shared values wasn't immediately obvious when the feature was implemented.

    Worthwhile reads: …/Volatile_variable …/Memory_ordering

  23. voo says:

    @Deduplicator Well, go on and show me all those amazing cross-platform, cross-compiler non-blocking data structures in C++. You won't find many. Getting non-blocking code in C++ right is almost impossible (sure, you can do it if you specify which compiler [and which major generation/version of it] and which architecture you're using), so it's more a question of maybe giving up some performance (the guarantees Java wants aren't that problematic on most modern CPUs, it seems to me) vs. getting a probably buggy implementation, or just giving up and using locks. In practice most people went with locking in their C++ programs, for a good reason.

    You can abstract the differences between CAS and LL/SC away without losing too much, and it certainly makes life a whole lot easier for the rest of us.

    @Michael Grier: In Java, volatile basically guarantees a memory barrier, but the usual low-level intrinsics are also available (if you're interested, have a look at the "Unsafe" class).

  24. Nawak says:

    A lot of blog commenters use this memory model and post before reading the article, so perhaps Raymond should add a MemoryBarrier somewhere in his posts… or… maybe he already did and that's why we have to "double-post" to get our ignorance in front of the internet!! It's the cache-invalidate in action!!

    Now to answer the question:

    "Is this code wrong by sometimes printing 0?", no, but "Is it an a**hole for doing so?", yes, definitely!

  25. Michiel says:

    C++11 now covers multithreaded code, so the legitimacy of an equivalent C++11 fragment can now be judged. Clearly, all the instructions in each thread are fully sequenced, i.e. p = &y; is sequenced after y = 1. This might be an issue for Alphas, but C++11 is clear: the execution of a program contains a data race if it contains two conflicting actions in *different threads*. Since there's no conflicting action in the second thread, there's no data race.

    [There is at least one, possibly two conflicting actions in the second thread: First, the read of p conflicts with the write of p in the first thread. And if the read of p reads &y, then there is a second conflict: The read of y conflicts with the write of y in the first thread. -Raymond]
  26. Michael Kuperstein says:

    Java is a really bad example. The original Java memory model was completely broken. The new one is not (too) broken, but is so complicated the number of people who can reasonably claim to understand it can be counted on one hand. I can think of Doug Lea, and possibly a few more. The current consensus* among experts is that Java-like models are broken on a foundational level.


    The authors are Sarita Adve and Hans Boehm. Hans is the principal designer of the C++11 memory model, and Sarita participated in the design of both the C++11 MM and the (fixed) JMM.

    The relevant quote is: "While it may be possible to fix the Java model, it seems undesirable that our specification of multithreaded program behavior would rest on such a complex and fragile foundation. Instead, the section entitled "Implications for Languages" advocates a fundamental rethinking of our approach."

Comments are closed.