volatile, acquire/release, memory fences, and VC2005


One of the more common questions I get about VC2005 code generation relates to the code generation of volatile on x86/x64. If we take a look at MSDN, we see that it defines the semantics for volatile in VC2005 as:

- A write to a volatile object (volatile write) has Release semantics; a reference to a global or static object that occurs before a write to a volatile object in the instruction sequence will occur before that volatile write in the compiled binary.

- A read of a volatile object (volatile read) has Acquire semantics; a reference to a global or static object that occurs after a read of volatile memory in the instruction sequence will occur after that volatile read in the compiled binary.

So, what does this mean for code that you might write? Let's look at the Read Acquire semantics in an example. In this example, the volatile variable is named 'V'.

Read Acquire Semantics:

Store A
Load B
Load V
Store C
Load D

The Read Acquire semantics say that Store C and Load D must remain below Load V. Store A and Load B are not constrained by Load V (at least, they have no constraint as a result of the read acquire semantics, but other hardware constraints may constrain their movement).
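
In C++ terms, that shape might look something like the sketch below (the names just mirror the listing above):

    int A, B, C, D;        // ordinary globals
    volatile int V;        // the volatile variable

    void ReadAcquireShape()
    {
        A = 1;             // Store A: free to be observed after Load V
        int b = B;         // Load B:  likewise unconstrained
        int v = V;         // Load V:  the acquire point
        C = v;             // Store C: must stay below Load V
        int d = D;         // Load D:  must stay below Load V
        (void)b; (void)d;  // silence unused-variable warnings
    }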


 


Now let's look at Store Release semantics. Again, the volatile variable is 'V':

Store Release Semantics:

Store A
Load B
Store V
Store C
Load D

The store release semantics state that Store A and Load B must remain above Store V. In this case Store C and Load D are not constrained by Store V (again, at least not with respect to the store release semantics).
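
And the store release shape in C++, reusing the globals from the previous sketch:

    void StoreReleaseShape()
    {
        A = 1;             // Store A: must stay above Store V
        int b = B;         // Load B:  must stay above Store V
        V = 1;             // Store V: the release point
        C = 2;             // Store C: not constrained by the release
        int d = D;         // Load D:  not constrained by the release
        (void)b; (void)d;  // silence unused-variable warnings
    }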


 


OK, this behavior is exactly what many people want. So what they often do at this point is use volatile in their code and then look at the generated assembly to see what type of synchronization the compiler introduces to ensure that the acquire/release semantics are preserved. (Note that the compiler has internal constraints which ensure that it does not violate these semantics when it generates code.) Many people are surprised to see that there are no synchronization primitives used. Wait, this can't be right?! How do we keep the CPU from violating these semantics without some type of lfence or sfence or something? Well, let's talk about what the hardware might do to our instruction sequence.

With respect to a single core, all loads and stores are perceived by the programmer to occur in program order (note that when I say program order, at this point I mean the assembly program). There is no reordering that occurs. OK, that makes things easy, but again, that's just a single core looking at its own instruction sequence.

But things get more interesting when you have more than one processor/core (doesn't it always?). Across processors, one "might" see different ordering, i.e., Processor 1 might observe loads/stores retiring in a different order than Processor 0 has in its program order (note: this is probably the weakest memory model you will see on x86/x64, so if our code works here, it will work under anything stronger). Hmmm… that may cause problems for our volatile (or will it?). Let's dig into what reordering Processor 1 might observe.

The possible reordering that Processor 1 might observe is that Loads can pass Stores (as long as the Store is non-conflicting, that is, to a different location). But Loads with respect to other Loads will remain ordered. And Stores with respect to other Stores will remain ordered. Let's see an example:

Original Program Order on Processor 0:

Load A
Store B
Load C
Load D
Store E
Load F

Possible Reordering Processor 1 Might See:

Load A
Load C
Load D
Store B
Load F
Store E

or

Load A
Load C
Load D
Load F
Store B
Store E

If you look at this, you see that a Load can "float" upward past Stores (again, as long as the Store is non-conflicting), and can continue to float upward until it hits another Load.
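
One classic consequence of "Loads can pass Stores" is the store-buffer litmus test. Here is a sketch (the variable and thread names are mine): nothing above constrains a Store followed by a Load of a different location, even when both are volatile, because release semantics only look upward from the Store and acquire semantics only look downward from the Load.

    #include <windows.h>
    #include <stdio.h>

    volatile int x = 0, y = 0;   // volatile keeps the compiler from reordering
    int r1 = 0, r2 = 0;

    DWORD WINAPI Thread0(LPVOID)
    {
        x = 1;      // Store x
        r1 = y;     // Load y: may float above the store to x
        return 0;
    }

    DWORD WINAPI Thread1(LPVOID)
    {
        y = 1;      // Store y
        r2 = x;     // Load x: may float above the store to y
        return 0;
    }

    int main()
    {
        HANDLE h[2];
        h[0] = CreateThread(NULL, 0, Thread0, NULL, 0, NULL);
        h[1] = CreateThread(NULL, 0, Thread1, NULL, 0, NULL);
        WaitForMultipleObjects(2, h, TRUE, INFINITE);
        // Because each Load may pass the (non-conflicting) Store before it,
        // r1 == 0 && r2 == 0 is a permitted outcome on x86/x64.
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }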


 


So how does this affect our volatile semantics? Let's start with the Read Acquire semantics example (copied from above):

Read Acquire Semantics:

Store A
Load B
Load V
Store C
Load D

Another processor observing this instruction sequence may see Load B float above Store A, which is fine (no violation). But since Stores don't float upward, Store C must remain below Load V. Load D can float upward, but it can't pass another Load, so it can't pass Load V. Thus any instruction originally below Load V will be observed by another processor to execute after Load V. Good.

Now let's look at the Store Release Semantics (again, copied from above):

Store Release Semantics:

Store A
Load B
Store V
Store C
Load D

In this case Load D can float past Store C and Store V, but Store Release semantics don't care about instructions that occur below Store V, so no violation here. Loads can float upward, but not downward, so Load B cannot be observed to execute after Store V. And Stores are always observed in program order. Again we're good. So our volatile model is preserved, even with this hardware reordering.
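
Putting the two halves together, this is why the classic publication pattern needs nothing beyond volatile on VC2005/x86. A minimal sketch (g_data and g_ready are hypothetical names):

    int g_data = 0;
    volatile bool g_ready = false;

    // Producer thread:
    void Publish()
    {
        g_data = 42;       // ordinary store
        g_ready = true;    // volatile write (release): the store to g_data
                           // cannot be observed after it
    }

    // Consumer thread:
    int Consume()
    {
        while (!g_ready)   // volatile read (acquire): the load of g_data
            ;              // cannot be observed before it
        return g_data;     // sees 42 once the loop exits
    }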


 


Last thing… these rules don't apply to SSE streaming instructions or fast string operations, so if you are using weakly ordered instructions, then you'll need to use lfence, sfence, mfence, etc.
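
For example, if a buffer is filled with non-temporal stores before being published, a fence belongs between the streaming stores and the flag. A sketch using the SSE2 intrinsics (the buffer and flag names are mine):

    #include <emmintrin.h>   // _mm_stream_si32; also pulls in _mm_sfence

    int g_buffer[1024];
    volatile bool g_buffer_ready = false;

    void PublishBuffer()
    {
        for (int i = 0; i < 1024; ++i)
            _mm_stream_si32(&g_buffer[i], i);  // weakly ordered streaming store

        _mm_sfence();            // order the streaming stores before the flag
        g_buffer_ready = true;   // now safe to publish
    }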


 


PS – On Itanium, with its weaker memory model, we generate ld.acq and st.rel instructions explicitly.


Comments (8)

  1. Peter Ritchie says:

    http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=288218 seems to suggest that ld.acq/st.rel aren’t always generated.

  2. lbargaoanu says:

    Typo 🙂

    "must remain about Store V" is

    "must remain above Store V"

  3. MSDNArchive says:

    Peter, you’re right.  On Itanium there is a bug…

  4. JAG says:

    From the original post "But Loads with respect to other Loads will remain ordered." – I don’t see how to resolve this assertion with Intel 3a 7.2.2 "Memory Ordering in P6 and More Recent Processor Families" which states "1. Reads can be carried out speculatively and in any order."

  5. MSDNArchive says:

    JAG, that’s a good question.  You can think of it this way: if you were following the traffic of memory reads you may see the read (for example for validation of the hardware), but read reordering should not be made apparent to the programmer.

  6. JAG says:

    Mmm, this seems to imply that Intel’s Rule 1 (cited above) is referring to load operations as they might appear on the bus, and not to actual instructions (and their dependents).  That these can be done out-of-order, but always appear to the program in-order, seems to imply an implicit attachment to any other program/CPU which might be storing.  Possibly this is attached to the cache coherence mechanism?  That is, assuming “speculative” refers to any instruction beyond the one at “current EIP” then it would mean that none of these instructions are committed (retired) until they are the target of the “current EIP” AND no intervening cache invalidate has arrived.  Presumably if an invalidate arrives then the speculated instruction (and its dependents) would be discarded/re-executed.  As in (putting aside compiler actions for the moment):

    CPU0          CPU1

    a=0;          …

    b=0;          …

    …           …

    a=1;          while (b!=2);

    b=2;          c=a;

    …           e=c+1;

    CPU1 could speculatively process c=a, and possibly e=c+1 (etc), but could not retire c=a (etc) until it became the target of “current EIP”.  Assume this speculation observes a==0.  At a later time, CPU1 observes b==2.  Because stores are ordered, this will always have been preceded by CPU0 making an exclusive fetch for the line containing a, which will be observed as an “invalidate” by CPU1, which would discard the observation a==0.

    Is this the kind of play that allows loads to be unordered and ordered at the same time??

  7. MSDNArchive says:

    Jag, it looks something like that.  There is something by Frey that describes the general algorithm that is often cited.  For specific details though, you’d need to contact each hardware vendor.

  8. JAG says:

    Okay, thanks.  I found some words from Andy Glew on the topic, reinforcing your comments.

    http://groups.google.com/group/comp.arch/browse_frm/thread/64bc52823607b2c0/e120cb283153d58a?lnk=st&q=load+order+x86+fence+glew&rnum=3&hl=en#e120cb283153d58a

    I’ve another related question, this one about the xxxBarrier intrinsics (as in 7.1).  The way I read it (which could be whacky of course), given

      load a

      __ReadBarrier

      load b

    the "load a" is not permitted to be moved below the barrier, but there are no constraints on moving the "load b" above the barrier.  Similarly, given

      store a

      __WriteBarrier

      store b

    the "store a" may not move below the barrier, but the "store b" is not constrained.

    This seems odd to me (eg, it doesn’t match acquire/release semantics), so I’m guessing I’m misreading somewhere…

    TIA

    JAG