The Joys of Compiler and Processor Reordering: Why You Technically Need Read-Side Barriers

In a previous post on compiler and processor reordering, I said that for multi-threaded, lock-free code to be totally correct you need a barrier on the read side of the code – but that it was pretty complicated and wasn't required on any processor that Windows currently runs on. I also said that if there were individuals as sick-minded as myself who wanted to dive into the specifics of why, I would post about it. There are – as it turns out – a lot of sickos out there, and this is that post. :)

Consider this code:

//
// Global data definition and initialization
//

ULONG x = 1;
ULONG* p = &x;
ULONG y = 0;

VOID
T1(
    VOID
    )
{
    //
    // Set y to 1
    //

    y = 1;

    //
    // Compiler and processor barrier that prevents reordering
    // and forces an "invalidate" message to be sent to all
    // processors caching locations that have been written to by
    // this processor since the last barrier and before this
    // one - in other words the variable y
    //

    MemoryBarrier();

    //
    // Set p to point to y
    //

    p = &y;
}

VOID
T2(
    VOID
    )
{
    ULONG i = 1;

    //
    // Read the contents of the memory location pointed
    // to by p and store its contents in i
    //

    i = *p;

    ASSERT(i != 0);
}

In this code T1() is the producer – it publishes a pointer to some variable through p. T2() consumes p. From reading this code we can observe the following facts:

- p will only ever point to x or y

- x will always have a value of 1

- y will initially be set to 0, then set to 1

- a memory barrier is issued by T1() after y is set to 1 and before p is set to point to y

Given these facts we should be able to reason that we could never see i with the value 0. However, there exists at least one processor where this is possible – the Alpha 21264. It is a narrow-windowed race condition – but it is real – and if you wrote code that ran on that processor you would have to code within its memory model and worry about this exact problem. So how exactly could this occur on this processor? Let's explore.

First – before T1() or T2() starts to run – x, y and p are all initialized. Then T1() starts running on processor P1 and T2() starts running on processor P2. At this point any dereference of p will access the location containing the variable x. T1() then sets y to the value 1, but this does not mean that the outside world knows about this update to y yet. That comes when we issue the call to MemoryBarrier(). When MemoryBarrier() is called, a message is sent to all the other processors' probe queues. A probe queue is a per-processor queue of addresses that have been invalidated by other processors. Each processor has its own queue and processes it independently of the others. So in this case – when T1() calls MemoryBarrier() – a message is sent to P2 indicating that the memory location holding the variable y is no longer valid – it contains stale data and should not be used. Excellent! This is exactly what we want. The next thing T1() does is set p to hold the address of y. This is perfectly correct from the point of view of T1() and P1: y is fully initialized before it is published via p. This is good.
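As an aside, on a compiler with C11 atomics the producer half of this pattern is usually written as a release store instead of an explicit MemoryBarrier() plus a plain store. A minimal sketch, with standard-C types standing in for the Windows ones (t1 here is just an illustrative name, not part of the original code):

```c
#include <stdatomic.h>

static unsigned long x = 1;
static unsigned long y = 0;
static _Atomic(unsigned long *) p = &x;

void t1(void)
{
    /* Plain store, as in T1() above. */
    y = 1;

    /* Release store: orders the write to y before the publication
       of &y through p - the same job MemoryBarrier() does in T1(). */
    atomic_store_explicit(&p, &y, memory_order_release);
}
```

The compiler then emits whatever barrier instruction the target architecture actually needs for the publish side, rather than an unconditional full barrier.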

So how does T2() ever see i as 0? Let's follow the flow of T2() (a very important piece of information is that P2 currently has y cached in its processor cache with the value 0). The first thing T2() does is define and initialize i to the value 1. Next it dereferences the address contained in p. What address will that be? We don't know – it is a race condition. What we do know is that it will be either the address of x or the address of y. First let's look at the case where p is pointing to x. In this case we know the value of i will be 1, because x has been 1 ever since it was initialized. Not too interesting a case. So let's consider the more interesting case where p is pointing to y. If p is pointing to y then we know the following: y was initialized to 0, and was then changed to 1 by T1() before T1() issued its MemoryBarrier() call. Now we need to know what a MemoryBarrier() call does on this processor. Here is what the manual says:

If Cbox CSR SYSBUS_MB_ENABLE is clear, the Cbox waits until the integer queue is empty and then performs the following actions:

a. Sends all pending MAF and IOWB entries to the system port.

b. Monitors Cbox CSR MB_CNT[3:0], a 4-bit counter of outstanding committed events. When the counter decrements from one to zero, the Cbox marks the youngest probe queue entry.

c. Waits until the MAF contains no more Dstream references and the SQ, LQ, and IOWB are empty.

When all of the above have occurred and a probe response has been sent to the system for the marked probe queue entry, instruction execution continues with the instruction after the MB.

The most important parts are the probe queue steps – basically that text says: "When a processor issues a memory barrier, it stalls instruction execution on the local processor, sends a message to all other processors letting them know they now have invalid entries in their caches, flushes its own probe queue, and then continues executing instructions." Armed with that knowledge we can finally complete this long, weird, memory-system Trivial Pursuit sojourn. When T2() reads *p it is a two-step process: first it loads the value of p – in this case the address of y – and then it reads from that address. Because P2 has a cache entry for y, the read gets a cache hit. WAIT! T1() issued a MemoryBarrier() before setting p to hold the address of y, so how could P2 get a cache hit on y when T1() running on P1 just invalidated it?! Well – as it turns out – on the 21264 processor, local cache hits for reads are allowed to bypass the probe queue on the local processor! That's right – the cache hit for the read of y on P2 bypasses the probe queue message telling P2 that y is not valid! Craziness abounds!! So why would you ever design a processor this way? Because it is efficient, fast and beautiful – no blocking on cached reads. And any real programmer would know to put memory barriers before shared reads! Duh!! :)
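One way to internalize the bypass is with a toy model – purely illustrative, and nothing like how the 21264 is actually built – where each processor has a private cache plus a probe queue of pending invalidates, and a read that hits the cache simply never consults the queue (all names here are made up for the sketch):

```c
#include <stdbool.h>

/* Toy model of one processor's view of the variable y. */
struct cpu {
    bool cached;               /* does this CPU have y in its cache?        */
    unsigned long cached_val;  /* the (possibly stale) cached value         */
    bool invalidate_pending;   /* probe-queue entry for y, not yet processed */
};

static unsigned long memory_y = 1;  /* T1 has already stored 1 to memory */

/* A read that is allowed to bypass the probe queue on a cache hit,
   as on the 21264. */
unsigned long read_y_bypassing(struct cpu *c)
{
    if (c->cached)
        return c->cached_val;  /* hit: stale data, queue never consulted */
    return memory_y;           /* miss: fetch the up-to-date value       */
}

/* A read preceded by a read-side barrier: drain the probe queue first. */
unsigned long read_y_with_barrier(struct cpu *c)
{
    if (c->invalidate_pending) {
        c->cached = false;     /* process the invalidate: drop the stale line */
        c->invalidate_pending = false;
    }
    return read_y_bypassing(c);
}
```

In this model, a P2 that cached y back when it was 0 and has an unprocessed invalidate queued returns the stale 0 from read_y_bypassing(), while read_y_with_barrier() drains the queue, misses the cache, and reads the current value 1 – exactly the difference between the broken and fixed T2() below.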

It would be an outright nightmare to program to this model. I have never had to do it – thank GOD! But assuming you did have to – how would you fix the code so that it worked correctly? Like this:

VOID
T2(
    VOID
    )
{
    ULONG i = 1;

    //
    // Drain this processor's probe queue so the read below
    // cannot hit a stale cache entry.
    //

    MemoryBarrier();

    //
    // Read the contents of the memory location pointed
    // to by p and store its contents in i
    //

    i = *p;

    ASSERT(i != 0);
}

The MemoryBarrier() in T2() forces P2 to drain its probe queue before continuing to execute instructions. This fixes the problem by not allowing the cache hit for the read to proceed until the pending invalidate has been processed. This is also really weird because the memory barriers on this processor have local effects that are different from their global effects. And in some cases a local memory barrier depends on a remote barrier to complete its logical semantics – like the barrier in T1() (local to P1), which depends on the barrier in T2() (local to P2 but remote to P1) to complete the invalidate enforcement.
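For what it's worth, portable C11 code expresses this whole pairing without any explicit barrier calls: a release store on the producer side pairs with an acquire load on the consumer side, and the compiler emits the read-side barrier only on targets that need one. A minimal sketch, again with standard-C types and made-up function names standing in for the originals:

```c
#include <assert.h>
#include <stdatomic.h>

static unsigned long x = 1;
static unsigned long y = 0;
static _Atomic(unsigned long *) p = &x;

/* Producer: publish &y only after y is fully initialized. */
void t1(void)
{
    y = 1;
    atomic_store_explicit(&p, &y, memory_order_release);
}

/* Consumer: the acquire load is the portable read-side barrier. */
void t2(void)
{
    unsigned long *q = atomic_load_explicit(&p, memory_order_acquire);
    unsigned long i = *q;

    /* With acquire/release pairing, i can be 1 (via x, or via an
       up-to-date y) but never a stale 0. */
    assert(i != 0);
}
```

On an Alpha-like machine the acquire load would compile down to exactly the probe-queue-draining barrier described above; on x86 it costs nothing extra. That is the real argument for writing to the language's memory model instead of a particular processor's.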

Well – I hope that I did a decent job explaining this problem and hope that it convinces you how hard it is to write correct lock-free code. Especially if it has to be portable to new processors that may have different memory barrier semantics than the processors you currently write code for. If I didn’t – or it didn’t – I hope it at least gave you something horrible and weird to think about for a few minutes. :D