Fun programming problem: a simple lock-free algorithm

Can the program below ever print “oops”?

 

#include <stdio.h>
#include <process.h>

struct Globals {
    volatile int start;
    int a;
    int b;
    volatile int end;
};

Globals globals;

void WriterThread(void*) {
    int i = 0;
    while (true) {
        globals.start = i;
        globals.a = i;
        globals.b = i;
        globals.end = i;
        i++;
    }
}

void ReaderThread(void*) {
    while (true) {
        int end = globals.end;
        int a = globals.a;
        int b = globals.b;
        int start = globals.start;
        if (start == end)
            if (a != b)
                printf("oops! %d != %d\n", a, b);
    }
}

void main() {
    _beginthread(WriterThread, 0, NULL);
    _beginthread(ReaderThread, 0, NULL);
    getchar();
}

 

 

Answer below....

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 Keep going....

 

 

 

 

 

 

 

 

 

 

 

 

 

The code does indeed print “oops” fairly reliably.

 

So what exactly is going on with this code? First, let’s see what it’s trying to do. Here’s the code again – I’ve annotated some important lines with “statement numbers” to make them easier to discuss.

#include <stdio.h>
#include <process.h>

struct Globals {
    volatile int start;
    int a;
    int b;
    volatile int end;
};

Globals globals;

void WriterThread(void*) {
    int i = 0;
    while (true) {
        globals.start = i; //S1
        globals.a = i;     //S2
        globals.b = i;     //S3
        globals.end = i;   //S4
        i++;
    }
}

void ReaderThread(void*) {
    while (true) {
        int end = globals.end;     //S5
        int a = globals.a;         //S6
        int b = globals.b;         //S7
        int start = globals.start; //S8
        if (start == end)
            if (a != b)
                printf("oops! %d != %d\n", a, b);
    }
}

void main() {
    _beginthread(WriterThread, 0, NULL);
    _beginthread(ReaderThread, 0, NULL);
    getchar();
}

We have one thread that is writing values to globals.a and globals.b, and another thread that is reading these values. The reader expects to see the values in a consistent state – which here means that they satisfy the invariant that a == b, but in real life would probably be much more complex. The simple way to do this, of course, would be to protect these variables with a lock – but presumably the author decided that this would not perform adequately, and so chose to go with a hand-rolled synchronization mechanism.
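For comparison, the lock-based version would look something like the following. This is just a minimal sketch using a Win32 CRITICAL_SECTION (the Locked* names are mine, and initialization of the critical section is only hinted at in a comment); it is not code from the original problem:

#include <windows.h>

CRITICAL_SECTION globalsLock; // protects globals.a and globals.b

void LockedWriterThread(void*) {
    int i = 0;
    while (true) {
        EnterCriticalSection(&globalsLock);
        globals.a = i;
        globals.b = i;
        LeaveCriticalSection(&globalsLock);
        i++;
    }
}

void LockedReaderThread(void*) {
    while (true) {
        EnterCriticalSection(&globalsLock);
        int a = globals.a;
        int b = globals.b;
        LeaveCriticalSection(&globalsLock);
        if (a != b)
            printf("oops! %d != %d\n", a, b);
    }
}

// main() would call InitializeCriticalSection(&globalsLock) before starting
// the threads (and DeleteCriticalSection when done).

With the lock in place, the start/end counters are not needed at all; the critical section provides both the mutual exclusion and the ordering. The hand-rolled scheme tries to get the same effect without ever blocking.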

The idea is that the writer will increment “start” just before updating “a” and “b”, and then after making the updates it will increment “end” to match “start.” The reader starts by reading “end”, then goes ahead and reads “a” and “b”. Having done that, it reads “start”. If start does not equal the previous value of end, then that means the writer might have written to a or b while we were reading them, so we might have read these variables in an inconsistent state – so we throw out the values and try again.

Assuming everything happens as described, this should avoid the race that would otherwise exist between the reader and the writer.

(Yes, this can be foiled by the extremely unlikely case where the reader thread is interrupted for exactly 2^32 iterations of the writer, so that the counter wraps all the way around. Let’s ignore that problem for now.)

Our algorithm requires that statement S1 take place before S2, S3, and S4. S4 must take place after S1, S2, and S3. The order of S2 and S3 does not matter. Similarly, S5 must take place before S6, S7, and S8; and S8 must take place after S5, S6, and S7. These sorts of temporal relationships are sometimes notated as “A->B”, meaning that A must precede B. So we can state all of the required temporal relationships in our algorithm as:

                S1->S2
                S1->S3
                S1->S4
                S2->S4
                S3->S4
                S5->S6
                S5->S7
                S5->S8
                S6->S8
                S7->S8

C++ generally permits a program to be executed in any order that would produce the same results as if it were executed in “program order,” on a single thread. In fact, the C++ standard says absolutely nothing about threading, and this is one practical consequence of that fact. So without any special treatment, none of our temporal relationships will be enforced. We might get lucky, or we might not. Guess which one is more likely.
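To make that concrete: if “start” and “end” were not volatile, the compiler would be entitled to rewrite the writer loop like this (a sketch of one legal transformation, not the output of any particular compiler):

void WriterThread(void*) {
    int i = 0;
    while (true) {
        // In a single-threaded view these four stores are independent, so
        // the compiler may issue them in any order it likes (or keep the
        // values in registers and write them out whenever it pleases).
        globals.b = i;
        globals.end = i;
        globals.a = i;
        globals.start = i;
        i++;
    }
}

Any such reordering is invisible to a single-threaded observer, but it completely breaks the protocol the reader is relying on.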

The only “escape hatch” that C++ provides for controlling the order of execution is this thing called “volatile,” which our program does use on the “start” and “end” variables. So what does “volatile” do?

The C++ standard has little to say about volatile. It was originally intended for the very specific application of writing device drivers that communicate with hardware via memory-mapped I/O. To make that work, compilers must follow these rules:

  • A read or write to a volatile variable must not be eliminated by the compiler.
  • Reads and writes to volatile variables must not be reordered by the compiler relative to one another.

Note that this does not place any restrictions on non-volatile accesses, nor does it say anything about what the hardware is allowed to do. It is assumed that if you’re reading and writing memory-mapped I/O ports, then the hardware will know better than to reorder things.
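To see why those two rules are exactly what memory-mapped I/O needs, here is the sort of driver-style code they were designed for. The device, addresses, and register layout are entirely made up for illustration:

// Hypothetical device registers exposed through memory-mapped I/O.
volatile int* const statusReg = (volatile int*)0xC0001000; // made-up address
volatile int* const dataReg   = (volatile int*)0xC0001004; // made-up address

void WriteToDevice(int value) {
    // These reads cannot be eliminated or hoisted out of the loop, even
    // though no C++ code ever writes to *statusReg.
    while ((*statusReg & 1) == 0) {
        // wait for the device to signal that it is ready
    }
    // This volatile write cannot be eliminated, and it cannot be reordered
    // relative to the volatile reads above.
    *dataReg = value;
}

Without volatile, an optimizer would be perfectly justified in reading the status register once, spinning forever on the cached value, or throwing the “dead” store away.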

Visual C++ has an augmented definition of volatile. It adds the following rules:

  • Reads and writes to any global memory (even if not volatile) must never be moved before a volatile read.
  • Reads and writes to any global memory must never be moved after a volatile write.
  • These rules will be enforced at the hardware level as well.
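As a quick illustration of what those extended rules buy you (my example, not from the quiz), they make a volatile flag usable for publishing ordinary data:

int payload;                 // ordinary global
volatile bool ready = false; // volatile flag

void Publisher() {
    payload = 42;  // may not be moved after the volatile write below
    ready = true;  // volatile write
}

void Consumer() {
    if (ready) {           // volatile read
        int v = payload;   // may not be moved before the volatile read above
        printf("%d\n", v);
    }
}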

Back to the quiz problem: this code attempts to enforce its required temporal relationships by tagging the “start” and “end” variables with the “volatile” keyword. Given the VC++ rules for volatile variables, this enforces the following:

                S1->S4
                S2->S4
                S3->S4
                S5->S6
                S5->S7
                S5->S8

In other words, nothing may move after the volatile write in S4, and nothing may move before the volatile read in S5.

Notice that we are missing some of our required temporal relationships. This is because the volatile write in S1 does not affect anything occurring after it in program order, nor does the volatile read in S8 affect anything coming before it.

(Careful readers may notice that there are two additional temporal relationships that must be enforced for this algorithm to work. As we go around each loop, the last statement must come before the first one (i.e., a previous iteration of S4 must occur before a subsequent iteration of S1). I didn’t know how to notate this, so I left it out. :) Luckily, these relationships are satisfied by the original program, since S1, S4, S5, and S8 are all volatile accesses, which cannot be reordered relative to each other.)

So this program is not correct. But do we get lucky? Let’s look at how VC++ compiles this code for x86:

void WriterThread(void*) {
    int i = 0;
00401000 xor eax,eax
00401002 jmp WriterThread+10h (401010h)
00401004 lea esp,[esp]
0040100B jmp WriterThread+10h (401010h)
0040100D lea ecx,[ecx]
    while (true) {
        globals.start = i; //S1
00401010 mov dword ptr [globals (403370h)],eax
        globals.a = i; //S2
00401015 mov dword ptr [globals+4 (403374h)],eax
        globals.b = i; //S3
0040101A mov dword ptr [globals+8 (403378h)],eax
        globals.end = i; //S4
0040101F mov dword ptr [globals+0Ch (40337Ch)],eax
        i++;
00401024 add eax,1
    }
00401027 jmp WriterThread+10h (401010h)

void ReaderThread(void*) {
00401030 push esi
00401031 mov esi,dword ptr [__imp__printf (4020A0h)]
00401037 jmp ReaderThread+10h (401040h)
00401039 lea esp,[esp]
    while (true) {
        int end = globals.end; //S5
00401040 mov edx,dword ptr [globals+0Ch (40337Ch)]
        int a = globals.a; //S6
        int b = globals.b; //S7
        int start = globals.start; //S8
        if (start == end)
00401046 cmp dword ptr [globals (403370h)],edx
0040104C mov eax,dword ptr [globals+4 (403374h)]
00401051 mov ecx,dword ptr [globals+8 (403378h)]
00401057 jne ReaderThread+10h (401040h)
            if (a != b)
00401059 cmp eax,ecx
0040105B je ReaderThread+10h (401040h)
                printf("oops! %d != %d\n", a, b);
0040105D push ecx
0040105E push eax
0040105F push offset string "oops! %d != %d\n" (4020F4h)
00401064 call esi
00401066 add esp,0Ch
    }
00401069 jmp ReaderThread+10h (401040h)

Notice that we did get lucky in WriterThread. The compiler did not reorder any of our global variable accesses. Another compiler, or a new version of this one, might decide to compile this code differently.

ReaderThread is a different story. S5 precedes S6, S7, and S8, as required by the semantics of volatile reads, but S8 has been moved before S6 and S7! That is exactly the window we were trying to close: the reader now samples start (and compares it against end) before it loads a and b, so the writer can update a in the gap, and the reader can see a != b even though the start == end check passed.

When I run this program on my machine, it does in fact display “oops.”

Now here’s a surprise: on my machine it doesn’t print “oops” until approximately 5 million iterations have passed. Even though the code emitted by the compiler is clearly not a correct implementation of this synchronization algorithm, it has only about a 0.00002% chance of failing on any given iteration. Imagine this were just a small piece of a much larger system; what are the chances that anyone would find this bug in testing?

So how do we fix this? I know you all think I’m going to say, “use a lock.” And I am. Later. For now, let’s figure out how to make the original algorithm work.

Recall our required temporal relationships:

                S1->S2
                S1->S3
                S1->S4
                S2->S4
                S3->S4
                S5->S6
                S5->S7
                S5->S8
                S6->S8
                S7->S8

The ones that were not satisfied by our use of volatile are S1->S2, S1->S3, S6->S8, and S7->S8. We need a mechanism to prevent S2 and S3 from moving before S1, and S6 and S7 from moving after S8. We can do this with explicit “memory barriers,” which are instructions to the compiler and the hardware to restrict reordering in a specific way. There is nothing in the C++ standard that provides for memory barriers, but luckily most compilers step outside the standard and provide some sort of support for them.

In VC++, we have a function called MemoryBarrier() which implements a “full memory barrier,” meaning that nothing will be moved in either direction across a call to MemoryBarrier(). In our case, if we want to prevent S2 and S3 from moving before S1, we can insert a memory barrier after S1:

void WriterThread(void*) {
    int i = 0;
    while (true) {
        globals.start = i; //S1
        MemoryBarrier();
        globals.a = i;     //S2
        globals.b = i;     //S3
        globals.end = i;   //S4
        i++;
    }
}

Similarly, we need a memory barrier before S8:

void ReaderThread(void*) {
    while (true) {
        int end = globals.end; //S5
        int a = globals.a;     //S6
        int b = globals.b;     //S7
        MemoryBarrier();
        int start = globals.start; //S8
        if (start == end)
            if (a != b)
                printf("oops! %d != %d\n", a, b);
    }
}

Now all of our required temporal relationships are satisfied, and the program no longer prints “oops.”

Now, to say the least, that was an awful lot of mental effort to arrive at a working implementation of a very simple algorithm. You certainly wouldn’t want to have to do this kind of analysis for every line of your code! Hopefully the end product is worth it; it had better be much faster than it would have been with a lock, or we just wasted a lot of time (not to mention that in real-life code, we’d probably get it wrong even with all this analysis).

Attached you will find two versions of the original problem, one using the original custom synchronization mechanism, and the other implemented with a basic spin lock. I’ve added some simple timing code so we can see how fast each one goes.

(Please, please, please don’t take the spin lock implementation from the attached code and use it in production code. It is meant as an illustration only).
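For readers who don’t want to dig through the attachment, a basic spin lock of the kind being compared is along these lines. This is only a sketch (the names are mine, and it is not the attached code); like that version, it is for illustration only:

#include <windows.h>

volatile LONG g_lock = 0; // 0 = free, 1 = held

void AcquireSpinLock() {
    // InterlockedExchange atomically swaps in 1 and returns the previous
    // value; it also acts as a full barrier, so the reads and writes inside
    // the critical section cannot move above the acquire.
    while (InterlockedExchange(&g_lock, 1) != 0) {
        // spin until the current owner releases the lock
    }
}

void ReleaseSpinLock() {
    // A volatile write; under the VC++ rules described earlier, the writes
    // made inside the critical section cannot move below this release.
    g_lock = 0;
}

The writer holds the lock while it updates a and b, and the reader holds it while it reads them; no start/end counters are needed.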

On my machine, with maximum contention (meaning each thread is running full-speed-ahead on its own CPU), the lock-free implementation does about 0.3 successful iterations of the reader per millisecond. The spin lock version does about 3500 iterations/ms. So in this scenario, all our hard work bought us a 12,000-fold decrease in performance.

 

Of course, this is unfair - the algorithm is clearly designed to perform best with little contention - i.e., the writer only writes every so often, while the reader reads very frequently. 

 

We can simulate minimum contention by simply not running the writer thread. In this scenario, the lock-free reader does 38,000 iterations/ms, while the spin lock implementation does 33,000 iterations/ms. So the lock-free implementation is only about 15% faster.

So was all this work worth a best-case perf improvement of about 15%, with a risk of such horrible worst-case performance? And the very real risk that we might have gotten it wrong? Well, it depends on the situation. :) What I can say for sure is that this sort of code should not be entered into lightly. Getting it right requires the sort of rigorous, mind-numbing analysis we had to do here, but typically on a larger scale. And it usually has very little, if any, benefit.

TestCode.zip