Concurrency, part 5 – What happens when you can’t avoid lock misordering?


Yesterday, I talked about simple deadlocks.  In general, deadlocks are caused by violating lock ordering.  But what do you do if you can't avoid violating lock ordering?

It does happen - in my experience, the most common place where this can occur is when you have a list of objects, where the list is protected by a different lock from the objects.

So you have (yeah, I know it's a stupid example, but...):

class Thing1
{
    CRITICAL_SECTION _Thing1Lock;
    Thing1Container *_MyContainer;
public:
    void Lock() { EnterCriticalSection(&_Thing1Lock); }
    void Unlock() { LeaveCriticalSection(&_Thing1Lock); }

    void Thing1Method1()
    {
        Lock();
        ...
        Unlock();
    }
    ~Thing1()
    {
        Lock();
        _MyContainer->RemoveThing1FromList(this);
        Unlock();
    }
};

class Thing1Container
{
    CAtlList<Thing1 *> _Thing1List;
    CRITICAL_SECTION _Thing1ListLock;

public:
    void EnumerateThing1()
    {
        POSITION pos;
        EnterCriticalSection(&_Thing1ListLock);
        pos = _Thing1List.GetHeadPosition();
        while (pos != NULL)
        {
            Thing1 *thing1 = _Thing1List.GetNext(pos);
            thing1->DoSomeWork();
        }
        LeaveCriticalSection(&_Thing1ListLock);
    }
    void AddThing1ToList(Thing1 *Thing1)
    {
        EnterCriticalSection(&_Thing1ListLock);
        _Thing1List.AddTail(Thing1);
        LeaveCriticalSection(&_Thing1ListLock);
    }
    void RemoveThing1FromList(Thing1 *Thing1)
    {
        EnterCriticalSection(&_Thing1ListLock);
        POSITION pos = _Thing1List.Find(Thing1);
        _Thing1List.RemoveAt(pos);
        LeaveCriticalSection(&_Thing1ListLock);
    }

};

If the Thing1 class also has its own lock, or if the call to thing1->DoSomeWork() may take a "long" time (for some value of long that depends on your application), then holding the lock across the call to DoSomeWork can be a major issue.
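
To see why, here's a sketch of what DoSomeWork might look like - the body is an assumption on my part (DoSomeWork isn't shown above), but anything that takes the Thing1 lock, directly or indirectly, has the same effect:

    // Hypothetical implementation - assumes DoSomeWork needs the Thing1 lock.
    void Thing1::DoSomeWork()
    {
        Lock();        // the enumerating thread arrives here holding _Thing1ListLock
        // ... the actual work ...
        Unlock();
    }

Now compare the lock order on each thread.  The enumerating thread takes _Thing1ListLock and then _Thing1Lock; a thread destroying the object takes _Thing1Lock (in the destructor) and then _Thing1ListLock (in RemoveThing1FromList).  Each thread can end up holding the lock the other one needs, and you're deadlocked.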

So the obvious change is to release the lock across the call to thing1->DoSomeWork().

    void EnumerateThing1()
    {
        POSITION pos;
        EnterCriticalSection(&_Thing1ListLock);
        pos = _Thing1List.GetHeadPosition();
        while (pos != NULL)
        {
            Thing1 *thing1 = _Thing1List.GetNext(pos);
            LeaveCriticalSection(&_Thing1ListLock);
            thing1->DoSomeWork();
            EnterCriticalSection(&_Thing1ListLock);
        }
        LeaveCriticalSection(&_Thing1ListLock);
    }
 

Unfortunately, now you have an even bigger problem.  The instant you left the critical section, another thread could have come along and destroyed thing1.  Normally, the call to destroy thing1 would have blocked on the _Thing1ListLock, but once you exited the critical section, the other thread could run and destroy the object.
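
To make the race concrete, here's a sketch of the destroying side - the function and its caller are hypothetical, they're just whatever code decides that thing1 is no longer needed:

    // Hypothetical "owner" code running on another thread:
    void DiscardThing1(Thing1 *thing1)
    {
        // ~Thing1 takes the Thing1 lock, then RemoveThing1FromList waits for
        // _Thing1ListLock.  That wait only lasts as long as EnumerateThing1 is
        // inside the critical section - the moment it leaves, this runs to
        // completion and the object is gone.
        delete thing1;
    }

Once that delete finishes, the enumerating thread's call to thing1->DoSomeWork() is touching freed memory.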

The immediate solution to this problem is to add reference counting to thing1. 

class Thing1
{
    CRITICAL_SECTION _Thing1Lock;
    LONG _ReferenceCount;
    Thing1Container *_MyContainer;
public:
    LONG AddRef() { return InterlockedIncrement(&_ReferenceCount); }
    LONG Release()
    {
        LONG lRet = InterlockedDecrement(&_ReferenceCount);
        if (lRet == 0)
        {
            delete this;
        }
        return lRet;   
    }
    void Lock() { EnterCriticalSection(&_Thing1Lock); }
    void Unlock() { LeaveCriticalSection(&_Thing1Lock); }

    void Thing1Method1()
    {
        Lock();
        ...
        Unlock();
    }
    ~Thing1()
    {
        Lock();
        _MyContainer->RemoveThing1FromList(this);
        Unlock();
    }
};

Next, you want to change EnumerateThing1 to:

    void EnumerateThing1()
    {
        POSITION pos;
        EnterCriticalSection(&_Thing1ListLock);
        pos = _Thing1List.GetHeadPosition();
        while (pos != NULL)
        {
            Thing1 *thing1 = _Thing1List.GetNext(pos);
            thing1->AddRef();
            LeaveCriticalSection(&_Thing1ListLock);
            thing1->DoSomeWork();
            EnterCriticalSection(&_Thing1ListLock);
            thing1->Release();
        }
        LeaveCriticalSection(&_Thing1ListLock);
    }

You also need to change AddThing1ToList:

    void AddThing1ToList(Thing1 *Thing1)
    {
        EnterCriticalSection(&_Thing1ListLock);
        Thing1->AddRef();
        _Thing1List.AddTail(Thing1);
        LeaveCriticalSection(&_Thing1ListLock);
    }
    void RemoveThing1FromList(Thing1 *Thing1)
    {
        EnterCriticalSection(&_Thing1ListLock);
        POSITION pos = _Thing1List.Find(Thing1);
        _Thing1List.RemoveAt(pos);
        Thing1->Release();
        LeaveCriticalSection(&_Thing1ListLock);
    }

Note that I'm using InterlockedIncrement and InterlockedDecrement for the reference counting.  That's because they don't require an additional lock - they atomically increment or decrement the variable using low-level CPU instructions, so no separate lock is needed to protect the count.  Also, when adding reference counting to an object, don't forget to make the destructor of that object private (or protected, if it's virtual) - that way callers can't accidentally delete the object directly instead of going through Release().
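
Here's a minimal sketch of that private-destructor pattern.  The constructor, the initial reference count of 1, and the critical section initialization are my own additions for illustration - they aren't part of the listings above:

class Thing1
{
    CRITICAL_SECTION _Thing1Lock;
    LONG _ReferenceCount;
public:
    Thing1() : _ReferenceCount(1) { InitializeCriticalSection(&_Thing1Lock); }
    LONG AddRef() { return InterlockedIncrement(&_ReferenceCount); }
    LONG Release()
    {
        LONG lRet = InterlockedDecrement(&_ReferenceCount);
        if (lRet == 0)
        {
            delete this;        // only Release() can destroy the object
        }
        return lRet;
    }
private:
    // Private destructor: "delete thing1" no longer compiles outside the class,
    // so the only way to get rid of the object is to call Release().
    ~Thing1() { DeleteCriticalSection(&_Thing1Lock); }
};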

I was also careful to add a reference to Thing1 when it was put on the list.  This guarantees that the thing1->Release() call in EnumerateThing1 can never be the final release.  If it were, the destructor would run while we're still holding _Thing1ListLock, acquiring the Thing1 lock with the list lock held - the opposite of the order the destructor path normally takes - and we'd be right back to a lock inversion.

Now there's still a problem with the enumeration function above - it's related to the use of the "pos" variable.  If another thread calls AddThing1ToList (or, even worse, RemoveThing1FromList) while the lock is released, the "pos" variable is invalidated!  It doesn't even have to be a removal of the current item in the enumeration - the "pos" variable in EnumerateThing1 is only valid as long as no thread has modified the list.  So by releasing the lock, you've just introduced data corruption into your application, and you need some way of making sure the list can't be corrupted out from under the enumeration.

There are several ways of handling that - my current favorite is to have a temporary list and stage the thing1's in the temporary list:

    void EnumerateThing1()
    {
        POSITION pos;
        CAtlList<Thing1 *> callbackList;
        EnterCriticalSection(&_Thing1ListLock);
        pos = _Thing1List.GetHeadPosition();
        while (pos != NULL)
        {
            Thing1 *thing1 = _Thing1List.GetNext(pos);
            thing1->AddRef();
            callbackList.AddTail(thing1);
        }
        LeaveCriticalSection(&_Thing1ListLock);
        while (!callbackList.IsEmpty())
        {
            Thing1 *thing1 = callbackList.RemoveHead();
            thing1->DoSomeWork();
            thing1->Release();
        }
    }

This solution applies my first principle: I ensured that the temporary list is never accessed on more than one thread.  There are other solutions; this is just the easiest.  It has its own issues, since it builds a temporary copy of the list (and if the list is long, that may involve a lot of memory).  Btw, this solution is essentially the one that C# uses with the foreach construct - a temporary, immutable copy of the list is made and the caller iterates over that temporary copy (CLR purists will disagree with me on this one, and they're right, but at a 20,000 foot level, that's essentially what happens).

By now, my fourth principle of concurrency should be pretty clear: "Don't call into other objects with locks being held, and make sure that you reference count all your objects".

In general, callbacks can be VERY tricky, and usually require that the callback function's semantics be clearly documented.  For example, the waveOutProc function passed in to waveOutOpen has the following caveat:

Applications should not call any system-defined functions from inside a callback function, except for EnterCriticalSection, LeaveCriticalSection, midiOutLongMsg, midiOutShortMsg, OutputDebugString, PostMessage, PostThreadMessage, SetEvent, timeGetSystemTime, timeGetTime, timeKillEvent, and timeSetEvent. Calling other wave functions will cause deadlock.

This caveat exists because winmm doesn't release some internal locks across callback functions and thus can deadlock easily.
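
One common way of living within a restriction like that is to do as little as possible in the callback - just signal an event (SetEvent is on the allowed list) and let a worker thread make the actual waveOut calls.  The sketch below assumes an event and a worker thread that aren't part of anything above; they're purely for illustration:

#include <windows.h>
#include <mmsystem.h>

static HANDLE g_hBufferDone;        // assumed: an auto-reset event created at startup

void CALLBACK MyWaveOutProc(HWAVEOUT hwo, UINT uMsg, DWORD_PTR dwInstance,
                            DWORD_PTR dwParam1, DWORD_PTR dwParam2)
{
    if (uMsg == WOM_DONE)
    {
        // Don't call waveOut* from here - winmm may still be holding its locks.
        SetEvent(g_hBufferDone);
    }
}

DWORD WINAPI WaveWorkerThread(LPVOID context)
{
    for (;;)
    {
        WaitForSingleObject(g_hBufferDone, INFINITE);
        // Safe to call waveOutWrite/waveOutUnprepareHeader from here - we're
        // outside the winmm callback, so its internal locks aren't held.
    }
    return 0;
}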

Tomorrow, I'll talk a bit about some of the other issues associated with reference counting - there are some really subtle bugs that can happen if you're not careful when using reference counts.

Edit: Don't call DoSomeWork in the loop 🙂

Comments (11)

  1. Anonymous says:

    ~Thing1()
    {
        Lock();
        _MyContainer->RemoveThing1FromList(this);
        Unlock();
    }

    I don’t understand this code. I don’t see why you would want to lock the Thing1 instance in its destructor, while you remove it from the list?

    Is some other thread supposed to execute methods on the Thing1 instance while it is being destructed?!

  2. Anonymous says:

    Adi, it IS a stupid example 🙂

    But sometimes stuff like that happens (I’ve seen it more times than I want to). Usually, there’s a local variable in Thing1 that flags that Thing1’s been put into _MyContainer – you take the Thing1 lock to check the "I’m in _MyContainer" condition, then you remove Thing1 from _MyContainer.

    I did say I was winging this 🙂 I’m sure that if I spent enough time, I could come up with a real world example, but…

  3. Anonymous says:

    Starting to get interesting. Thanks for taking the time to put this up.

  4. Anonymous says:

    Larry,

    In the last version of EnumerateThing1(), did you mean to call thing1->DoSomeWork(); in the first while loop as well? I think it should be only in the second while loop.

    (Also, you managed to confuse "principle" and "principal" again! 😉)

    -K

  5. Anonymous says:

    Did you mean to call thing1->DoSomeWork() in the last listing while building callbackList?

  6. Anonymous says:

    Well… OK.

    My (indirect) point is that in a real world example, you should never have a situation when the lock hierarchy is not well defined. If you have a mixup, then you have a bug in the design of the original code.

    So, the right way to fix it is to examine the locking dependencies, and what introduced them in the first place.

    That said, you might have "strong" dependencies and "weak" dependencies, but you should always have a lock hierarchy…

    I would add one more point – when you design a lifecycle management for certain objects, usually the simplest approach is to define some sort of reference counters to simplify the memory management. It is a similar strategy as with COM objects – each thread should have its own reference. But there are subtleties there too – you again need to distinguish between strong and weak references when you have cross-reference between two objects…

  7. Anonymous says:

    Did you mean to call thing1->DoSomeWork() while building callbackList?

  8. Anonymous says:

    Did you mean to call thing1->DoSomeWork() while building callbackList?

  9. Anonymous says:

    Adi,

    In the real world, you don’t get to control your lock ordering. At least if you’re doing systems programming. The problem especially occurs when one component (say winmm) calls into another component (say your application), which calls back into the first component (winmm).

    Each has their own well defined lock hierarchy, but due to their internal architectures, they get a lock mis-ordering.

    I ran into a deadlock yesterday morning that was caused by just this.

    This kind of thing happens with any reasonably complex system, especially if your component interacts with other components.

  10. Anonymous says:

    Larry, supposing that for some reason a kernel object was involved instead of a critical section, would you use explicit reference counting or DuplicateHandle?

  11. Anonymous says:

    James,

    It doesn’t matter – the critical section is there to protect some data.

    Now, if the data in question was a kernel object, duplicating the handle doesn’t change the issue – a handle is a pointer to some kernel data structure, which has its own protection structures.
