The Joys of Compiler and Processor Reordering: Why You Technically Need Read-Side Barriers


In a previous post on compiler and processor reordering, I said that for multi-threaded, lock-free code to be totally correct you need a barrier on the read side of the code – but that it was pretty complicated and wasn't required on any processor that Windows currently runs on. I also said that if there were individuals as sick-minded as myself who wanted to dive into the specifics of why, I would post about it. There are – as it turns out – a lot of sickos out there, and this is that post. 🙂


Consider this code:


//
// Global data definition and initialization
//

ULONG x = 1;
ULONG* p = &x;
ULONG y = 0;

VOID
T1(
    VOID
    )
{
    //
    // Set y to 1
    //

    y = 1;

    //
    // Compiler and processor barrier that prevents reordering
    // and forces an “invalidate” message to be sent to all
    // processors caching locations that have been written to by
    // this processor since the last barrier and before this
    // one – in other words the variable y
    //

    MemoryBarrier();

    //
    // Set p to point to y
    //

    p = &y;
}

VOID
T2(
    VOID
    )
{
    ULONG i = 1;

    //
    // Read the contents of the memory location pointed
    // to by p and store its contents in i
    //

    i = *p;

    ASSERT(i != 0);
}


 


 


In this code T1() is the producer – it populates p with a pointer to some variable. T2() consumes p. From reading this code we can observe the following facts:


 


- p will only ever point to x or y
- x will always have a value of 1
- y will initially be set to 0, then set to 1
- a memory barrier is issued by T1() after y is set to 1 and before p is set to point to y


 


Given these facts we should be able to reason that we could never see i with the value 0. However, there exists at least one processor where this is possible – the Alpha 21264. It is a narrow race window – but it is real – and if you wrote code that ran on that processor you would have to code within its memory model and worry about this exact problem. So how exactly could this occur on this processor? Let’s explore.


 


First – before T1() or T2() start to run – x, y and p are all initialized. Then T1() starts running on processor P1 and T2() starts running on processor P2. At this point any dereference of p will access the location containing the variable x. T1() then sets y to the value 1, but this does not mean that the outside world knows about this update to y yet. That comes when we issue the call to MemoryBarrier(). When MemoryBarrier() is called, a message is sent to all the other processors’ probe queues. A probe queue is a per-processor queue of addresses that have been invalidated by other processors. Each processor has its own queue and processes it independently of the other processors. So in this case – when T1() calls MemoryBarrier() – a message is sent to P2 indicating that the memory location holding the variable y is no longer valid – it contains stale data and should not be used. Excellent! This is exactly what we want. The next thing T1() does is set p to hold the address of y. This is perfectly correct from the point of view of T1() and P1: y is fully initialized before it is published via p. This is good.


 


So how does T2() ever see i as 0? Let’s follow the flow of T2() (a very important piece of information is that P2 currently has y in its cache with the value 0). The first thing that T2() does is define and initialize i to the value 1. Next it dereferences the address contained in p. What address will that be? We don’t know – it is a race condition. What we do know is that it will either be the address of x or the address of y. First let’s look at the case where p is pointing to x. In this case we know the value of i will be 1 because x has been 1 ever since it was initialized. Not too interesting a case. So let’s consider the more interesting case where p is pointing to y. If p is pointing to y then we know the following: y was initialized to 0, then changed to 1 by T1() before T1() issued its MemoryBarrier() call. Now we need to know what a MemoryBarrier() call does on this processor. Here is what the manual says:


 


If Cbox CSR SYSBUS_MB_ENABLE is clear, the Cbox waits until the integer queue is empty and then performs the following actions:

a. Sends all pending MAF and IOWB entries to the system port.

b. Monitors Cbox CSR MB_CNT[3:0], a 4-bit counter of outstanding committed events. When the counter decrements from one to zero, the Cbox marks the youngest probe queue entry.

c. Waits until the MAF contains no more Dstream references and the SQ, LQ, and IOWB are empty.

When all of the above have occurred and a probe response has been sent to the system for the marked probe queue entry, instruction execution continues with the instruction after the MB.


 


The most important parts are the probe-queue steps – basically that text says: “When a processor issues a memory barrier, it stalls instruction execution on the local processor, sends a message to all other processors letting them know to flush their caches because they now have invalid entries in them, flushes its own probe queue and continues executing instructions.” Armed with that knowledge we can finally complete this long, weird sojourn through memory-system trivia. When T2() reads *p it is a two-step process: first it loads the value of p – in this case the address of y – and then reads from that address. Because P2 has a cache entry for y, the read gets a cache hit. WAIT! T1() issued a MemoryBarrier() before setting p to hold the address of y, so how could P2 get a cache hit on y when T1() running on P1 just invalidated it?! Well – as it turns out – on the 21264 processor, local cache hits for reads are allowed to bypass the probe queue on the local processor! That’s right – the cache hit for the read of y on P2 bypasses the probe-queue message telling P2 that y is not valid! Craziness abounds!! So why would you ever design a processor this way? Because it is efficient, fast and beautiful – no blocking on cached reads. And any real programmer would know to put memory barriers before shared reads! Duh!! 🙂


 


It would be an outright nightmare to program to this model. I have never had to do it – thank GOD! But assuming you did have to – how would you fix the code so that it worked correctly? Like this:


 


VOID
T2(
    VOID
    )
{
    ULONG i = 1;

    //
    // Drain this processor's probe queue, then read the
    // contents of the memory location pointed to by p and
    // store its contents in i
    //

    MemoryBarrier();

    i = *p;

    ASSERT(i != 0);
}


 


The MemoryBarrier() in T2() forces P2 to drain its probe queue before continuing to execute instructions. This fixes the problem by not allowing the cache hit for the read to proceed. This is also really weird because the memory barriers on this processor have local effects that are different from their global effects. And in some cases local memory barriers depend on remote barriers to complete their logical semantics – here the barrier local to P1 depends on the remote barrier (local to P2, but remote to P1) to complete the invalidate enforcement.


 


Well – I hope that I did a decent job explaining this problem and that it convinces you how hard it is to write correct lock-free code – especially if it has to be portable to new processors that may have different memory-barrier semantics than the processors you currently write code for. If I didn’t – or it didn’t – I hope it at least gave you something horrible and weird to think about for a few minutes. 😀


Comments (29)

  1. budsbd@gmail.com says:

    Very interesting – and a hell of a shit** processor design, if I may. Keep up the good work!

  2. ndiamond says:

    From my reading of your explanation, adding that read barrier to T2 does not fix the problem.  Consider this sequence:

    T2 executes T2’s memory barrier.

    P2 fetches y.

    T1 updates y and P1 finishes the update.

    T1 executes T1’s memory barrier.

    T1 updates p and P1 finishes the update.

    T2 fetches p.

    T2 tries to fetch y but defectively designed CPU fetches its obsolete cached value.

    This CPU void where prohibited.  Cached value one-twentieth of one yen.

  3. theelvez says:

    I don’t see why P2 is fetching y in step 2. If it is a pre-fetch of *p it would fetch x. I am probably missing your point. Can you elaborate?

  4. Norman Diamond says:

    > I don’t see why P2 is fetching y in step 2.

    It doesn’t need a reason.  It’s just a possibility.  The possibility is the same as it was in this part of the original posting:  "Let’s follow the flow of T2() (a very important piece of information is that P2 currently has y cached in its processor cache with the value 0)."

    > If it is a pre-fetch of *p

    It isn’t a pre-fetch of *p.  It’s a pre-fetch of y for unknown reasons.

  5. eranb says:

    Thanks for your elaborate explaination.

    While I perfectly understand your analysis and your solution, I think this example is somewhat different than the one you posted in an earlier post.

    In the earlier sample, we had in the reader:

       if (f.Initialized) {

           AVal = f.A;

           // Use AVal here

       }

    And I claimed that f.A can be retrieved even before the if statement is executed, and therefore can fetch stale data. Cannot this actually happen on modern CPUs like IA64 (And not on esoteric ones like the ALPHA)?

    Thanks,

    Eran.

  6. Norman Diamond says:

    > Can you elaborate?

    I think I did.  Can you undelete the comment where I think I posted an elaboration?

  7. theelvez says:

    For the record – I didn’t and don’t delete comments unless they are spam. Your comments automatically go to an "approve" folder for some reason. Maybe you are on some kind of blogger watch list! 😀

    Just kidding – I have no clue why that happens – but if you notice I did approve it once I saw it was in a holding cell. I am not sure how to just let the posts go up automatically – I am not a blog expert by any means.

    So on to your elaboration. This is an interesting case to consider. In my case y was cached on P2 before T2() ran. In your case T2() starts to run, does its memory barrier then fetches y ahead of the *p. This means that either a context swap occurred (in which case there would be a memory barrier) or T2() got inlined and the code after it is using y and this is a prefetch of that read of y. If that is the case then I hadn’t thought about that case before. And in fact – if that occurred – then you would likely need a memory barrier after the "i = *p;" as well to prevent a prefetch from screwing you up. I would like to think about this a little more. I had made the implicit assumption when looking at this example – that this was an actual call and not inlined code. Otherwise we would have to analyze all of the inlined code together as one unit to make a correct decision about its validity. Thanks for the feedback. And I’ll try to figure out how to just let your future posts go live.

  8. theelvez says:

    Eran – you might be correct on that fact. I am researching it now. From my reading of the IA64 memory model – the write side barrier should be enough – but the doc is very confusing and seems to potentially contradict itself in places. I’ll reply back after a little more – re-reading. If you have specific text that leads you to believe you are right – please point me to it. Thanks!

  9. Norman Diamond says:

    "if you notice I did approve it once I saw it was in a holding cell"

    Thank you.  I noticed.  Some of your colleagues do the opposite though.

    "In my case y was cached on P2 before T2() ran. In your case T2() starts to run, does its memory barrier then fetches y ahead of the *p."

    OK but so what?  In both cases there is no presumed reason for P2 to prefetch y.  In both cases we only recognize that P2 might prefetch y (and that P2 is defectively designed to use that prefetched value even after we know it’s stale).

    "This means that either a context swap occured (in which case there would be a memory barrier) or"

    I guess you mean that the kernel guarantees to execute a memory barrier when switching back from some other context to T2 (and maybe for all context switches).  OK, but even this added ground rule doesn’t prevent P2 from doing another prefetch of y for unknowable reasons (quantum mechanics, religion, whatever).

    "or T2() got inlined"

    I think that’s irrelevant.  Some CPUs detect unconditional jumps and continue prefetching.  (Also some CPUs detect conditional jumps and continue prefetching speculatively.  Whether the cached value ends up getting used or not is independent of whether the cached value should end up getting invalidated or not.)

    "then you would likely need a memory barrier after the "i = *p;" as well"

    It’s useless.  That CPU is broken.

  10. Norman Diamond says:

    To clarify, that CPU as described is broken in multiprocessor configurations.  It sounds like it would still be usable when it’s the only CPU in a computer.

  11. eranb says:

    Hi,

    I searched the manual, but I must admit I got lost. And then it hits me that I might be searching in the wrong place…

    We are checking what the CPU does, but we forgot that there is another culprit: The compiler!!!

    The compiler can rearrange the commands to load

    f.A before checking the initialized flag. AFAK,

    the compiler can do it since it perceives a single-threaded program unless told otherwise (using volatile is one way of doing that).

    This can happen on any CPU supported by windows…

    This issue is very serious in asynchronous programming model when you deal with context objects (Like IRP in windows). One, must make sure that after it stores stuff in the context object, and there is a chance for the context to be grabbed and processed by other thread that a write-barrier is issued, and the other thread must (I think) issue a read barrier.

    Knowing how windows code works, I doubt that those barriers are issued. Most probably it works because there is usage of other primitives (Locks/Writing to device registers) that ensures barriers.

    Does that sounds right or am I over paranoid?

    Thanks,

    Eran.

  12. Norman Diamond says:

    "I must admit I got lost"

    Yeah I’ll say.  This blog goes to eleven and yours was twelve.  Now I’m lost too, thirteen.  OK enough joking, let’s lose count and continue.

    "The compiler can rearrange the commands to load f.A before checking the initialized flag. AFAK, the compiler can do it since it perceives a single-threaded program unless told otherwise (using volatile is one way of doing that)."

    Using volatile is one way of achieving that effect and it doesn’t matter if the reason involves threading or not.  But I thought I read somewhere that Microsoft’s compilers recognize Microsoft’s MemoryBarrier instructions, so volatile wouldn’t be needed in this particular case.

    [Regarding drivers]

    "Knowing how windows code works, I doubt that those barriers are issued. Most probably it works because there is usage of other primitives (Locks/Writing to device registers) that ensures barriers."

    From my understanding, barriers might still be used in coordinating CPUs (e.g. so that two CPUs don’t think they obtained a lock on the same object).  But sure I don’t think device controllers are expected to interact with memory barrier instructions.  From what Intel told me a few decades ago and my understanding of some manuals, I think that device controllers have to understand the effects of the LOCK prefix.

  13. theelvez says:

    This is so exciting to find other people that love this low level stuff as much as me. The other day I was talking to this guy about reordering and he said "Dad – can you please just drive me to school now!?" 🙂

    >> The compiler can rearrange the commands to load f.A before checking the initialized flag.

    Norman is correct – the MS compilers treat any barrier intrinsic as a full optimization barrier. Also – this is a dependent load – the compiler will not reorder dependent loads.

    >> Regarding drivers

    Eran is correct – this is a huge problem! The kernel has to ensure that it uses proper barriers or it could publish something before it is initialized. Typically interlocked operations are used for pointer-size data and locks are used for larger data. Also as Norman pointed out – for devices either locks are used or serializing instructions if available. For instance – the x86 has ‘in’ and ‘out’ instructions that read from IO space and are always fully ordered by the processor.

  14. eranb says:

    Hi all,

    Sorry for answering so lately. (Swamped by work…).

    I have several comments/questions:

    1.This BLOG is great!!! I am very interested in all the low-level stuff. Keep doing the good work. (My son usually likes to reorder my house and not my code…)

    2.My comment on barriers by access to device register: Looking at the disassmbly of WRITE_REGISTER_ULONG it is clear that it issues an explicit lock instruction. Thus barriers are ensured since every IRP eventuallly leads to someone writing to a device register.

    3.Regarding dependant loads:

     A.I understand that with barriers the compiler won’t do load before the "if". The entire issue is to prove that we must have the barrier at the first place and not only on esoteric cpus.

     B.I am not a compiler expert, so I don’t know about dependant loads.

     Looking at this code:

       if (f.Initialized) {

           AVal = f.A;

       }

     Does the fact that f.A is read after checking the initialized flag means that it is dependant? How can the compiler tell? I think that from the point of view of the compiler, no one accesses f.A (Again, the compiler assumes single threaded environment), and therefore it can think it is safe to pre-load it even before the "if".

    Many thanks,

    Eran.

  15. Norman Diamond says:

    "(My son usually likes to reorder my house and not my code…)"

    (You need to insert some barriers.)

    "every IRP eventuallly leads to someone writing to a device register"

    s/every/most/  ^_^  especially if your device has been removed.

    s/eventuallly/eventuallly/  ~_~  because you used the same spelling checker that Visual Studio used (it gives programmers an acccept button).

    "Regarding dependant loads"

    I’m trying to figure out what dependant load means.  A dependant barrier might mean this:

    ‘And in some cases local memory barriers depend on remote barriers to complete their logical semantics.’

    because processors have to tell each other to invalidate their locally cached copies of values which are being changed (and non brain dead processors have to listen).  Does a dependant load mean that it depends on all preceding barriers having been executed properly?  Doesn’t that mean that all loads are dependant?

    "I think that from the point of view of the compiler, no one accesses f.A (Again, the compiler assumes single threaded environment), and therefore it can think it is safe to pre-load it even before the "if"."

    I agree, both the compiler and the CPU can preload f.A until someone puts a barrier between the if and the load.  The C and C++ standards do not define any meanings for multithreaded programs.

  16. eranb says:

    Hi Norman,

    Thanks for your response.

    Forgive my english mistakes. Since it is not my natural/first language, I tend to include typos and grammar errors in everything I write.

    Best regards,

    Eran.

  17. theelvez says:

    Even barriers don’t help in my house! 🙂

    So a dependent load in the compiler is something like:

    if (p) {

      p->A;

    }

    It is illegal for the compiler to read p->A unless the compiler has already figured out that the "if (p)" test is true. This is from the point of view of a single thread of course. In our case however, "f" was already proven to be correct so I was wrong when I called it a dependent load – I had it in my mind that we were checking "f" not a field of "f". So you would need a barrier between them if you needed ordering semantics, and as Norman pointed out – that barrier would also be recognized as a compiler barrier as well. If the processor didn’t allow such reordering then you would just need _ReadWriteBarrier to prevent the compiler from reordering. Thanks!

  18. Norman Diamond says:

    "Forgive my english mistakes. Since it is not my natural/first language"

    OK, then I have to revise one of my previous corrections.

    "every IRP eventuallly leads to someone writing to a device register"

    Previously I said:

    s/every/most/  ^_^  especially if your device has been removed.

    A full correction is:

    Most IRPs eventually lead to someone writing to a device register.

    When you’re programming, if Visual Studio offers you an acccept button, don’t accept its word for it  ^_^

    Actually we see that computer instructions are your first/natural language, just as they are for me  ^_^

    Now for the technical issue, "dependant load" in English or "dependent load" in American.  You didn’t answer but you left it for the blog owner to confuse me further:

    ‘It is illegal for the compiler to read p->A unless the compiler has already figured out that the "if (p)" test is true. This is from the point of view of a single thread of course.’

    No, even in the point of view of a single thread it is not illegal.  The standard notices that you didn’t declare P to point to a volatile target.  A compiler with its extensions notices that there is no read barrier between the "if" test’s expression and the subsequent fetch of p->A.

    ‘So you would need a barrier between them if you needed ordering semantics,’

    I agree.  But this seems to contradict the preceding sentences, so I’m still wondering what you mean.

  19. theelvez says:

    Note I said "unless the compiler has already figured out that the ‘if (p)’ test is true"

    There is a data dependence between p and p->A. p->A is unreachable without p. So a compiler would have to prove to itself that the prefetch would not have an exception or the compiler would have to generate code that would not cause an exception if the address is invalid – like "speculative loading" on IA64.

    But in any case we still need the barrier there to preserve ordering both to a single thread or to outside observers. If p is 0 then the speculative read is thrown away so it is effectively never executed.

    Thanks!

  20. eranb says:

    Hi,

    I am now writing a piece of code, and these series of blogs about code reordering made me think of whether my code is really OK. Note that the example might be a little extreme but this blog really started making me worry…

    Intro:

    I have a context object. In a function, I need to determine some condition according to fields inside the context object.

    Here is a pseudo code snippet:

    void foo(pContext)

    {

     BOOL bDoSomething;

     ….

     bDoSomething = /*Compute according to   pContext fields*/

     ….

     /*Here, pContext is carried out asynchronously (Similar to iocalldriver. After startAsyncActivity finishes, I am not allowed to touch pContext as it may be invalid).*/

     startAsyncActivity(pContext);

     ….

     /*Here I finally decide to do something according to bDoSomething*/

     if (bDoSomething){

       /*Do something*/

     }

    My question is:

    Is it possible for the compiler to move the code that evaluates bDoSomething to be carried AFTER startAsyncActivity, and thus actually touch invalid memory?

    I guess that if startAsyncActivity is a function, the compiler won’t dare to do that. But what if this function is inlined or macroed and the compiler can determine that pContext fields didn’t change?

    My gut feeling is that I am overly paranoid (Jonathan, that’s your fault…), and no sane compiler will dare doing that. But I think that theoratically, an optimizing compiler might decide to do that if it decides that this can actually speed up code execution.

    Best regards,

    Eran.

  21. theelvez says:

    “They’re coming to get – HA HA – HA HA!” – sorry about the paranoia. 🙂

    So – to really analyze this – you need to expand everything and then analyze it according to the processor rules and the compiler rules.

    If you can expand the code a bit we can try to analyze it.

    Thanks.

  22. Norman Diamond says:

    >>>It is illegal for the compiler to read p->A unless the compiler has already figured out that the "if (p)" test is true. This is from the point of view of a single thread of course.

    >> No, even in the point of view of a single thread it is not illegal.  The standard notices that you didn’t declare P to point to a volatile target.

    > Note I said "unless the compiler has already figured out that the ‘if (p)’ test is true"

    I noticed that you stated an irrelevant condition, but ignored it because it’s irrelevant.  If p were declared to point to a volatile target then your condition would be relevant, but in the case here it is not.

    > There is a data dependence between p and p->A. p->A is unreachable without p. So a compiler would have to prove to itself that the prefetch would not have an exception or the compiler would have to generate code that would not cause an exception if the address is invalid – like "speculative loading" on IA64.

    OK, I agree that the compiler would have to make sure that no exception is generated (or that execution will continue after ignoring an exception, which would be an ugly solution for performance reasons but would still be legal).  But it looked like you were saying the compiler would have to make sure p->A would not be prefectched, and I disagree, because if p->A accidentally turns out to be next to some other variable V and they get loaded together into the cache then nothing bad happens.  If you declared p to point to a volatile target then the compiler would have to do whatever work necessary to make sure that p->A doesn’t get fetched unless/until it should be.

  23. Norman Diamond says:

    > BOOL bDoSomething;

    >  bDoSomething = /*Compute according to   pContext fields*/

    >  startAsyncActivity(pContext);

    If startAsyncActivity starts another thread and the other thread accesses the same variable bDoSomething, then you’d better make it

     volatile BOOL bDoSomething;

    Even just for plain old VC++6 MFC applications that was true.  Read some of Dr. Joseph Newcomer’s essays at flounder.com.

  24. eranb says:

    Hi Norman,

    I understand your comment regarding other thread. However, in my example, bDoSomething is needed only in the scope of the foo function, and might even be a register. I am afraid that the computation of bDoSomething is done after startAsyncActivity and will thus access bad memory.

    Thanks,

    Eran.

  25. Norman Diamond says:

    Oh I see.  You’re not concerned about the store into bDoSomething, you’re concerned about fetches from some variables in the /*Compute according to   pContext fields*/ expression.  You want to be sure that the compiler will not reorder some of those fetches to come after the call starts to startAsyncActivity.

    The compiler is allowed to assume single-threading because the C language doesn’t know about tricks to run other functions in separate threads.  If the compiler wants to think about this reordering then it has to examine startAsyncActivity and any functions that might be called by startAsyncActivity in a single thread.  But if startAsyncActivity uses tricks that the C language doesn’t know about, in order to make other threads change some of those variables, then the compiler doesn’t have to check such things.  I’m not sure if there are compilers that are sophisticated enough to do this analysis and make some reordering optimization that is legal but kills your program.

  26. Pedro Teixeira says:

    > OK, I agree that the compiler would have to make sure that no exception is generated (…)

    The compiler only does part of the work. IA-64 does the real most of it. IA-64 includes the notion of speculative loads. These loads will not fail if the mem is invalid. Instead the target register is marked with a special NaT (Not a Thing) bit. After is has been asserted that "p" is valid, the prefetched value should be checked for NaT. If NaT, reload without the speculative attribute – if "p" is still invalid, a fault will be raised at this moment.

    f:   ld8.s   r33 = [r32] // i = *p (speculative)

        (more stuff)

        cmp.eq  p6 = r0, r32 // if (p)

    (p6) chk.s   r33

    l2:

    (p6) (use the value of r32)

        br.ret

    l1:  ld8     r33 = [r32] // i = *p (non-speculative)

        br.sptk l2

  27. Pedro Teixeira says:

    BTW, i believe that T2() can be fixed with:

    {

       int i;

       register int *k;

       k = p;

       MemoryFence();

       i = *k;

    }

    PS: The above (pseude)code for IA64 has 2 (pseudo)bugs:

    cmp.eq -> cmp.ne

    chk.s r33 -> chk.s r33, l1

  28. David says:

    Your description of the memory system on the alpha looks good, but I think you have an error in your solution — a memory barrier inserted at the start of the code does not help.  I think this is what you need:

    VOID T2(VOID)

    {

       ULONG i = 1;

       ULONG* p2 = p;

       MemoryBarrier();  

       i = *p2;

       ASSERT(i != 0);

    }

    Similar problem in your previous post:

       if (f.Initialized) {

           MemoryBarrier();

           AVal = f.A;
