The Joys of Compiler and Processor Reordering

So I thought that a good first technical blog entry would be one about a common – but rarely thought about – problem called “reordering”. It is a subtle problem, but very important to understand if you write lock-less multithreaded code or code that directly reads and writes memory-mapped hardware registers.

First we need to define reordering in concrete terms. I define it this way: reordering occurs when a program executes load and/or store instructions (load = read and store = write) in an order that is different than the order specified in the program’s source code. There are 2 flavors of reordering: compiler reordering and processor reordering. Reordering problems are specific to multi-threaded code running on single-processor or multi-processor systems. Reordering is not a problem for single threaded code. Also, reordering is not specific to kernel mode code although it has some interesting ramifications for kernel mode developers that I'll talk about in a future post. Let’s look at some code:

int
AddEm(
    void
    )
{
    int a, b, c, z;

    a = 1; // (1)
    b = 2; // (2)
    c = 3; // (3)
    z = a + b + c; // (4)

    return z;
}

In what order do the operations happen in this code (notice that I numbered the operations in the code)? The fact of the matter is – we don’t know! It could be 1,2,3,4 or 3,2,1,4 or 1,3,2,4, etc. These are all legal optimizations that the compiler or processor can make. Why? Because to a single thread they are unobservable changes. A single thread can’t observe the intermediate values of a, b and c – it can only see their values at the point where it sums them and stores the result in z – operation 4. The only thing we know about the order of the operations in the above code is that the sequence will end with operation 4. If the sequence didn’t end with operation 4, then we would be able to observe the reordering from a single thread of execution – and that would be really bad. Why would a compiler or processor want to do reordering anyway? There are many reasons (one example: combining memory accesses that hit the same processor cache line yet are disparate in the source code), but let’s not focus on the reasons for the optimizations – let’s focus on how to deal with them in our code. Now let's look at some multi-threaded code:

typedef struct _FOO {
    ULONG A;
    ULONG B;
    ULONG C;
    ULONG D;
    BOOLEAN Initialized;
} FOO;

VOID
InitAFoo(FOO* f)
{
    f->A = 4;
    f->B = 8;
    f->C = 15;
    f->D = 16;
    f->Initialized = TRUE;
}

//
// A global variable of type FOO
//
FOO f;

VOID
T1(VOID)
{
    //
    // Init the global FOO
    //
    InitAFoo(&f);

    return;
}

VOID
T2(VOID)
{
    ULONG AVal;

    //
    // See if our FOO is initialized
    //
    if (f.Initialized) {

        AVal = f.A;

        // Use AVal here
    }

    return;
}

This code is a little more interesting as we have 2 threads of execution that both use the same global data. This means that they can observe intermediate states in the global data as the other execution path modifies it. Let’s assume that the routine InitAFoo() gets reordered so that its execution order is like this:

VOID
InitAFoo(FOO* f)
{
    f->A = 4;              // (4)
    f->B = 8;              // (3)
    f->C = 15;             // (1)
    f->D = 16;             // (5)
    f->Initialized = TRUE; // (2)
}

If T1() and T2() run in parallel then there exists a window of time where T2() can see f.Initialized equal to TRUE before f.A has been initialized. This is really bad if f.A is a pointer – or, for that matter, anything that we use at all.

I should point out here that T1() and T2() are sharing data without a lock. This type of programming raises the "difficult to get right" bar quite a bit. In fact, if you use locks around your shared data then you never have to worry about reordering problems (more on why in a moment). T1() and T2() are (sort of) an example of lock-free code. They are not technically “lock free” under the strictest definition – but the same principles apply – they modify global data without a shared locking mechanism. Lock-free code counts on state transitions for synchronization. In this case, the intended synchronizing state transition is when f.Initialized gets set to TRUE. The idea is that T1() is a producer (it produces initialized FOOs) and T2() is a consumer of FOOs. T2() knows that a FOO is ready for consumption when its Initialized field is set to TRUE.

The easiest way to fix this code is to associate a lock with f – and then acquire the lock every time the code reads or writes f. In other words – T1() would call InitAFoo() with the lock held and T2() would read f under the lock like so:

VOID
T1(VOID)
{
    //
    // Init the global FOO
    //
    Lock(fLock);

    InitAFoo(&f);

    Unlock(fLock);

    return;
}

VOID
T2(VOID)
{
    ULONG AVal;

    //
    // See if our FOO is initialized
    //
    Lock(fLock);

    if (f.Initialized) {

        AVal = f.A;

        // Use AVal here
    }

    Unlock(fLock);

    return;
}

Life is so much easier when you use locks. But just for the sake of this topic, let’s say we have a really good reason not to use locks – and decide to continue down this lock-less path.

So what is it about the code that messes us up here? It is the fact that there is no reliable order of initialization of f as observed by T2(). What we really need is a way to make sure that f.Initialized is always set to TRUE after the other fields of f have been initialized. How do we do that? Well – on most platforms there is a thing called a “memory barrier”. In the generic sense, a memory barrier is a construct that prevents a compiler or processor from reordering instructions around it. Windows has a routine called MemoryBarrier() that does just this. But where exactly would we put this in our code? Well – where do we need the strict order of operations to occur? After the initialization of the fields of f and before the write to f.Initialized. Basically stated – we don’t care what order the fields of f are initialized in, as long as they are all initialized before f.Initialized is set to TRUE. So we could modify our code to be:

VOID
InitAFoo(FOO* f)
{
    f->A = 4;
    f->B = 8;
    f->C = 15;
    f->D = 16;

    MemoryBarrier();

    f->Initialized = TRUE;
}

By changing the code in this way we force the order of operations as observed by T2(). With the code written as above, T2() always knows that if f.Initialized is TRUE then the fields of f are initialized. Note that MemoryBarrier() acts as both a compiler and a processor memory barrier. There are many flavors of barriers and they vary by platform. Some barriers only prevent writes from moving across them, some only reads. MemoryBarrier() is what is called a “full barrier” – it blocks both reads and writes from moving around it. There are also variable decorations like "volatile" that have barrier implications. If the specific barrier types and volatile are interesting to you, I can cover them in another post.

 

Note: For this code to be explicitly correct we would also have to do this in T2():

 

VOID
T2(VOID)
{
    ULONG AVal;

    //
    // See if our FOO is initialized
    //
    MemoryBarrier();

    if (f.Initialized) {

        AVal = f.A;

        // Use AVal here
    }

    return;
}

The reason we need this barrier on the "read side" is very complicated and will not happen on any current processor that Windows supports. However, if you are a sick individual and want the whole story let me know and I'll post about it.

 

So why do locks prevent you from worrying about this? You guessed it – all locks have an embedded memory barrier! If they didn’t, they wouldn’t be locks. My personal advice is to always write your code with locks first. If you can prove to yourself that the locked paths are too slow – make sure you are only locking around the required elements of your code (don’t lock around a whole function if you just need to lock around a single assignment). If your perf is still suffering – then and only then should you put yourself through the torture of writing correct lock-free code. By the way – did I mention that every compiler and platform has different rules about how things can get reordered? More to come on this later.

Well – I hope I haven’t bored you to tears with this. If it is not interesting – please let me know and I will start an entry about Lost!