volatile, acquire/release, memory fences, and VC2005

One of the more common questions I get about VC2005 code generation is how volatile is handled on x86/x64. If we take a look at MSDN, we see that it defines the semantics for volatile in VC2005 as:

 

o A write to a volatile object (volatile write) has Release semantics; a reference to a global or static object that occurs before a write to a volatile object in the instruction sequence will occur before that volatile write in the compiled binary.

 

o A read of a volatile object (volatile read) has Acquire semantics; a reference to a global or static object that occurs after a read of volatile memory in the instruction sequence will occur after that volatile read in the compiled binary.

 

So, what does this mean for code that you might write? Let's look at the Read Acquire semantics in an example. In this example the volatile variable has the name 'V'.

 

Read Acquire Semantics:

Store A

Load B

Load V

Store C

Load D

 

The Read Acquire semantics say that Store C and Load D must remain below Load V. Store A and Load B are not constrained by Load V (at least, they have no constraint as a result of the load acquire semantics, but other hardware constraints may constrain their movement).

 

Now let’s look at Store Release semantics. Again, the volatile variable is 'V':

 

Store Release Semantics:

Store A

Load B

Store V

Store C

Load D

 

The Store Release semantics state that Store A and Load B must remain above Store V. In this case Store C and Load D are not constrained by Store V (again, at least not with respect to the Store Release semantics).
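
To make the two patterns above concrete, here is a minimal sketch of the usual "publish a value, then set a flag" idiom. It is an illustration only: it uses the standard C++ std::atomic with explicit acquire/release ordering as a portable stand-in for the VC2005 volatile semantics described here, and the names g_payload, g_ready, publish and try_consume are made up for this example.

```cpp
#include <atomic>

int g_payload = 0;                 // ordinary data, neither volatile nor atomic
std::atomic<bool> g_ready(false);  // plays the role of the volatile 'V'

// Writer side: the payload store must stay above the flag store.
void publish(int value) {
    g_payload = value;                               // "Store A"
    g_ready.store(true, std::memory_order_release);  // "Store V" (release)
}

// Reader side: the payload load must stay below the flag load.
bool try_consume(int& out) {
    if (!g_ready.load(std::memory_order_acquire))    // "Load V" (acquire)
        return false;                                // nothing published yet
    out = g_payload;                                 // "Load D"
    return true;
}
```

In real use publish would run on one thread and try_consume on another. The release on the flag keeps the payload store from sinking below it, and the acquire on the flag keeps the payload load from rising above it, which is the pairing the two diagrams describe.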

 

OK, this behavior is exactly what many people want. So at this point they often use volatile in their code and then look at the generated assembly to see what type of synchronization the compiler introduces to preserve the acquire/release semantics. (Note that the compiler has internal constraints which ensure that it does not violate these semantics itself when it generates code.) Many people are surprised to see that no synchronization primitives are used. Wait, this can't be right?! How do we keep the CPU from violating these semantics without some type of lfence or sfence or something? Well, let's talk about what the hardware might do to our instruction sequence.

 

With respect to a single core, all loads and stores are perceived by the programmer to occur in program order (note that when I say program order at this point, I mean the order of the assembly program). No reordering is visible. OK, that makes things easy, but again that's just a single core looking at its own instruction sequence.

 

But things get more interesting when you have more than one processor/core (don't they always?). Across processors, one "might" see a different ordering, i.e., processor 1 might observe loads/stores retired in a different order than processor 0 has in its program order (note that this is probably the weakest memory model you will see on x86/x64, so if our reasoning works here, it will also work for anything stronger). Hmmm… that may cause problems for our volatile (or will it?). Let's dig into what this reordering that Processor 1 might observe actually is.

 

The possible reordering that Processor 1 might observe is that Loads can pass Stores (as long as the Store is non-conflicting, i.e., to a different location). But Loads will remain ordered with respect to other Loads, and Stores will remain ordered with respect to other Stores. Let's look at an example:

 

Original Program Order on Processor 0:

Load A

Store B

Load C

Load D

Store E

Load F

 

Possible Reordering Processor 1 Might See:

Load A

Load C

Load D

Store B

Load F

Store E

 

or

 

Load A

Load C

Load D

Load F

Store B

Store E

 

If you look at this, you see that Loads can "float" upwards past Stores (again, as long as the Store is non-conflicting), and "can" continue to float upwards until they hit another Load.
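
The rule above is simple enough to write down as a small checker. The following is a sketch of the model as described in this post, not of any real hardware interface: the Op type and the is_allowed_reordering function are hypothetical names, "non-conflicting" is taken to mean "different address", and each operation is assumed to appear only once in the sequence.

```cpp
#include <cstddef>
#include <vector>

enum class Kind { Load, Store };

struct Op {
    Kind kind;
    char addr;  // name of the memory location, e.g. 'A'
};

Op load(char addr)  { return Op{Kind::Load, addr}; }
Op store(char addr) { return Op{Kind::Store, addr}; }

// Position of `op` in `seq`, or seq.size() if it is not there.
std::size_t position_of(const std::vector<Op>& seq, const Op& op) {
    for (std::size_t i = 0; i < seq.size(); ++i)
        if (seq[i].kind == op.kind && seq[i].addr == op.addr) return i;
    return seq.size();
}

// True if `observed` can be reached from `program` using only the one
// reordering described above: a later Load moving above an earlier Store
// to a different address.
bool is_allowed_reordering(const std::vector<Op>& program,
                           const std::vector<Op>& observed) {
    if (program.size() != observed.size()) return false;
    for (std::size_t i = 0; i < program.size(); ++i) {
        for (std::size_t j = i + 1; j < program.size(); ++j) {
            const std::size_t pi = position_of(observed, program[i]);
            const std::size_t pj = position_of(observed, program[j]);
            if (pi == observed.size() || pj == observed.size()) return false;
            if (pi < pj) continue;  // this pair kept its program order
            // The pair was swapped; only "Load passes earlier Store" is legal.
            const bool load_passed_store =
                program[i].kind == Kind::Store &&
                program[j].kind == Kind::Load &&
                program[i].addr != program[j].addr;
            if (!load_passed_store) return false;
        }
    }
    return true;
}
```

Feeding it the program order above together with either of the two reorderings returns true. Swapping Store B and Store E, or Load A and Load C, returns false.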

 

So how does this affect our volatile semantics? Let’s start with the Read Acquire semantics example (example copied from above):

 

Read Acquire Semantics:

Store A

Load B

Load V

Store C

Load D

 

Another processor observing this instruction sequence may see Load B float above Store A, which is fine (no violation). But since Stores don't float upward, Store C must remain below Load V. Load D can float upward, but it can't go past another Load, so it can't pass Load V. Thus any instruction originally below Load V will be observed by another processor to execute after Load V. Good.

 

Now let’s look at the Store Release Semantics (again, copied from above):

 

Store Release Semantics:

Store A

Load B

Store V

Store C

Load D

 

In this case Load D can float past Store C and Store V, but Store Release semantics don't care about instructions that occur below Store V, so no violation here. Loads can float upward, but not downward, so Load B cannot be observed to execute after Store V. And Stores are always observed in program order, so Store A stays above Store V. Again we're good. So our volatile model is preserved, even with these reordering semantics.
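
The two arguments above can also be checked by brute force. The sketch below enumerates every permutation of a five-instruction sequence, keeps only those permitted by the "Loads may pass earlier non-conflicting Stores" rule, and counts how many of them break the acquire or release constraint around V. All names here (Op, allowed, check, Result) are hypothetical, and the model is only the simplified one used in this post.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

enum class Kind { Load, Store };
struct Op { Kind kind; char addr; };

// pos[k] = slot at which program op k appears in the observed order.
std::vector<std::size_t> positions(const std::vector<std::size_t>& order) {
    std::vector<std::size_t> pos(order.size());
    for (std::size_t slot = 0; slot < order.size(); ++slot) pos[order[slot]] = slot;
    return pos;
}

// True if the only swapped pairs are "later Load above earlier Store"
// with different addresses.
bool allowed(const std::vector<Op>& program, const std::vector<std::size_t>& pos) {
    for (std::size_t i = 0; i < program.size(); ++i)
        for (std::size_t j = i + 1; j < program.size(); ++j)
            if (pos[i] > pos[j] &&
                !(program[i].kind == Kind::Store && program[j].kind == Kind::Load &&
                  program[i].addr != program[j].addr))
                return false;
    return true;
}

struct Result { int allowed_orders; int violations; };

// acquire == true : every op after index v must stay after v.
// acquire == false: every op before index v must stay before v (release).
Result check(const std::vector<Op>& program, std::size_t v, bool acquire) {
    std::vector<std::size_t> order(program.size());
    for (std::size_t k = 0; k < order.size(); ++k) order[k] = k;
    Result result{0, 0};
    do {
        const std::vector<std::size_t> pos = positions(order);
        if (!allowed(program, pos)) continue;  // hardware model forbids this order
        ++result.allowed_orders;
        bool broken = false;
        for (std::size_t k = 0; k < program.size(); ++k) {
            if (acquire && k > v && pos[k] < pos[v]) broken = true;
            if (!acquire && k < v && pos[k] > pos[v]) broken = true;
        }
        if (broken) ++result.violations;
    } while (std::next_permutation(order.begin(), order.end()));
    return result;
}
```

Calling check on the Read Acquire sequence with acquire set to true, or on the Store Release sequence with acquire set to false, reports zero violations. Asking for acquire behavior around Store V in the second sequence does report violations, because Load D is free to float above Store V.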

 

Last thing… these rules don't apply to SSE streaming instructions or fast string operations; so if you are using weakly ordered instructions, then you'll need to use lfence, sfence, mfence, etc...
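
As a rough illustration of that last point, here is a sketch that fills a buffer with SSE2 streaming (non-temporal) stores and then issues an sfence before setting a ready flag. The intrinsics _mm_stream_si32 and _mm_sfence are real; everything else (g_buffer, g_ready, publish_streamed, try_read, the HAVE_SSE2_STREAMING macro) is made up for the example, the flag uses std::atomic as a portable stand-in for a volatile, and the non-SSE2 fallback exists only so the sketch still compiles elsewhere.

```cpp
#include <atomic>

#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#include <emmintrin.h>  // _mm_stream_si32; also brings in _mm_sfence
#define HAVE_SSE2_STREAMING 1
#else
#define HAVE_SSE2_STREAMING 0
#endif

int g_buffer[4];
std::atomic<bool> g_ready(false);  // stands in for the volatile flag

void publish_streamed(int value) {
#if HAVE_SSE2_STREAMING
    // Streaming stores are weakly ordered: they are not covered by the
    // "Stores stay in order" rule that ordinary stores follow.
    for (int& slot : g_buffer) _mm_stream_si32(&slot, value);
    // Explicit fence so the streaming stores are ordered before the flag store.
    _mm_sfence();
#else
    for (int& slot : g_buffer) slot = value;  // fallback: ordinary stores
#endif
    g_ready.store(true, std::memory_order_release);
}

bool try_read(int& out) {
    if (!g_ready.load(std::memory_order_acquire)) return false;
    out = g_buffer[3];
    return true;
}
```

Without the sfence, another processor could in principle see the flag set before the streamed data arrives, which is exactly the case the normal store ordering would otherwise rule out.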

 

PS - On Itanium, with its weaker memory model, we generate ld.acq (load acquire) and st.rel (store release) instructions explicitly.