Understanding the consequences of WAIT_ABANDONED


One of the important distinctions between mutexes and the other synchronization objects is that mutexes have owners. If the thread that owns a mutex exits without releasing the mutex, the mutex is automatically released on the thread's behalf.

But if this happens, you're in big trouble.

One thing many people gloss over is the WAIT_ABANDONED return value from the synchronization functions such as WaitForSingleObject. They typically treat this as a successful wait, because it does mean that the object was obtained, but it also tells you that the previous owner left the mutex abandoned and that the system had to release it on the owner's behalf.

Why are you in big trouble when this happens?

Presumably, you created that mutex to prevent multiple threads from accessing a shared object while it is in an unstable state. Code enters the mutex, then starts manipulating the object, temporarily making it unstable, but eventually restoring it to stability and then releasing the mutex so that the next person can access the object.

For example, you might have code that manages an anchored doubly-linked list in shared memory that goes like this:

void MyClass::ReverseList()
{
 WaitForSingleObject(hMutex, INFINITE);
 int i = 0; // anchor
 do {
  int next = m_items[i].m_next;
  m_items[i].m_next = m_items[i].m_prev;
  m_items[i].m_prev = next;
  i = next;
 } while (i != 0);
 ReleaseMutex(hMutex);
}

There is nothing particularly exciting going on. Basic stuff, right?

But what if the program crashes while holding the mutex? (If you believe that your programs are bug-free, consider the possibility that the program is running over the network and the network goes down, leading to an in-page exception. Or simply that the user went to Task Manager and terminated your program while this function is running.)

In that case, the mutex is automatically released by the operating system, leaving the linked list in a corrupted state. The next program to claim the mutex will receive WAIT_ABANDONED as the status code. If you ignore that status code, you end up operating on a corrupted linked list. Depending on how that linked list is used, it may result in a resource leak or the system creating an unintended second copy of something, or perhaps even a crash. The unfortunate demise of one program causes other programs to start behaving strangely.

Then again, the question remains, "What do you do, then, if you get WAIT_ABANDONED?" The answer is, "Good question."

You might try to repair the corruption, if you keep enough auxiliary information around to recover a consistent state. You might even design your data structures to be transactional, so that the death of a thread manipulating the data structures does not leave them in a corrupted state. Or you might just decide that since things are corrupted, you should throw away everything and start over, losing the state of work in progress, but at least allowing new work to proceed unhindered.

Or you might simply choose to ignore the error and continue onwards with a corrupt data structure, hoping that whatever went wrong won't result in cascade failures down the line. This is what most people do, though usually without even being aware that they're doing it. And it's really hard to debug the crashes that result from this approach.

Exercise: Why did we use indices instead of pointers in our linked list data structure?

[Raymond is currently away; this message was pre-recorded.]

Comments (27)
  1. Tim Smith says:

    When using shared memory in user space, the address of the section will be different for each process. Thus using absolute addressing doesn’t work. Using indices is a form of relative addressing.

  2. Somebody says:

    Tim Smith wrote: "When using shared memory in user space, the address of the section will be different for each process".

    That is almost correct. The only nitpick is that ‘will’ is not quite right; it should read "When using shared memory in user space, the address of the section can be different for each process."

  3. Tim Smith says:

    I use "will" because even though it is technically true that "can" is more correct, as far as programming goes, "can" holds no functional weight. It is best to assume from the start that they won’t be the same… ever.

  4. Tim Smith wrote: "Thus using absolute addressing doesn’t work."

    Well, there’s always __based pointers (only half-joking).

  5. grantri says:

    This may not be what Raymond is looking for, but using 32-bit integer indices will use less memory for the data structure than using pointers when run on a 64-bit processor. Also on a 32-bit processor you’re going to be ‘wasting’ the low order 2 or even 3 bits (depending on the size of each element) of each pointer. By using integer indices you could bit-pack in a few boolean fields for each element.

    The more I think about it, I think I’ve got the real answer: a relocatable linked list (without using __based pointers). The list is allocated in an array (for good data locality), but the array must be moved occasionally if it is to grow. When it moves, you either have to move each element individually and update all those pointers, or just memmove the whole thing including indices and be done with it.

    Lastly it’s a correctness issue. You can easily validate or range check an index, unlike a pointer.

  6. josh says:

    "It is best to assume from the start that they won’t be the same… ever."

    Hm, I’d say it’s best to assume that you don’t know. Otherwise you might be tempted to do something like testing the value of an address against a stored one to see if you’re in the same process. (still, it’s safer than assuming they’ll be the same by far)

    If you know something is always true, there’s usually a way to exploit it. You want to keep track of what things you know and don’t know.

  7. microbe says:

    So… a thread crashes, probably due to a software bug, and other threads are now so smart that they try to "fix" whatever the thread has corrupted.

    I’m a bit suspicious of this "software fixes itself" approach. If you’ve got a corruption, you’ve got it anyway. Nothing else in the same process is reliable anyway.

  8. Zahical says:

    I don’t think shared memory has something to do with the use of subscript vs. pointers in this case.

    First, WAIT_ABANDONED is returned when the owning _thread_ (which is not necessarily in another process) exits or is terminated without releasing the mutex. So this scenario can happen even within a single multithreaded process.

    And second, as the MSDN C++ specs state ‘the result of a subscript expression e1[ e2 ] is given by: *( (e2) + (e1) )’ so no matter if you write m_items[i] or *(m_items+i) you are still dealing with pointers.

    Unfortunately, though, I’m not sure what the answer to Raymond’s puzzle is…

  9. kbiel says:

    ≠ is supposed to be the not equal symbol, but blogs.msdn.com lack a certain functionality.

  10. Ray Trent says:

    > And second, as the MSDN C++ specs state ‘the result of a subscript expression e1[ e2 ] is given by: *( (e2) + (e1) )’ so no matter

    > if you write m_items[i] or *(m_items+i) you are still dealing with pointers.

    The difference is that, using the most typical thread local storage method of mapping shared memory objects, e1 will be different in the 2 processes, and so the index version *will* work (at least it *can* be made to work if you’re reasonably careful).

    The pointer version can’t be made to work across process boundaries, period.

  11. Dean Harding says:

    While the shared memory thing is valid, I think Raymond was probably looking for a simpler answer than that (although, we’re talking about Raymond [I can dig it] so maybe not).

    Anyway, to that end, I’d go with what grantri said, in that you’d use arrays to take advantage of data locality.

  12. anonymous says:

    There’s another thing you can do if you get WAIT_ABANDONED: pack up your toys and go home. If you’re writing, say, a filesystem or a DB server, and you’re forced to handle non-transactional structures at a particular point in the app, and they’ve been left inconsistent, then you’d probably rather have the server grind to a halt than blunder ahead corrupting customers’ data.

    (With some qualifications: (1) you might perform some kind of consistency checks after the WAIT_ABANDONED and decide it’s safe to go on if they pass, (2) it might be that only specific threads have to pack up their toys and others can be assumed safe.)

    The shared memory/thread local storage thing sounds like a non-starter — we’re talking about multithreaded access to common memory, so I can’t imagine there are any pointer-base issues involved here, or I’ve been very wrong about pointers and multithreaded access.

    So, thinking of other possibilities: is it something to do with PAE or extended memory support? By using indices you’re effectively using old-school segments, each the size of your object. Indices can be positive or negative, and don’t have to refer to contiguous blocks of memory.

  14. Aaron says:

    >>>>

    The difference is that, using the most typical thread local storage method of mapping shared memory objects, e1 will be different in the 2 processes, and so the index version *will* work (at least it *can* be made to work if you’re reasonably careful).

    >>>>

    You’re saying e1[ e2 ] and *( (e2) + (e1) ) are NOT always semantically equivalent for basic types?

    Can you give an example of where two different values would be returned? I’m cornfused.

  15. Bryan says:

    Aaron — it’s not a difference between e1[ e2 ] and *( (e2) + (e1) ). It’s a difference between e1[ e2 ] and *ptr.

    With base+offset, you can change the base in different processes. With *ptr (and the pointers held in shared memory), you can’t change the base ever.

  16. kbiel says:

    "Your saying e1[ e2 ] and *( (e2) + (e1) ) are NOT always semantically equivalent for basic types? Can you give an example of where two different values would be returned?"

    Imagine we have two processes, process 1 (p1) and process 2 (p2), and e1 and e2 in shared memory.

    p1: *( (e2) + (e1) ) ≠ p2: *( (e2) + (e1) )

    So,

    p1: *( (e2) + (e1) ) ≠ p2: e1[e2]

    and vice versa, but since e2 is relative to e1,

    p1: e1[e2] = p2: e1[e2]

  17. Bryan says:

    To be more concrete than my last reply: if your list is declared like so:

        struct list_item {
            struct list_item *next;
            struct list_item *prev;
            int val;
        };

    and the head of the list is declared like so:

        struct list_item *head;

    then imagine p1 sets up the shared memory area. For simplicity, assume that head points to the base address of the shared memory area, also, so it doesn’t have to be stored in the area anywhere. p1 creates the area, and sticks a few list_items in it, linked in order. If the base address is 0x10000000 in p1’s address space, then perhaps head->next’s value will be 0x1000000C. This value is stored in the shared memory area (since it’s part of *head).

    But now p2 maps the same area in at 0x20000000 (so that’s the value of head in p2), and looks at head->next. Its value is still 0x1000000C, because that’s what p1 set it to. But if p2 tries to dereference that (head = head->next; printf("%d\n", head->val);), it’ll either get an access violation, or a different random piece of memory that it owns.

    Whereas, if you store offsets into the shared memory area, then *((head)+(8)) or head[8] in p1 is the same item as *((head)+(8)) or head[8] in p2 (the addresses are different in each process, unless the shared memory area was mapped at the same address in each). So you never get an access violation or random memory bytes.

  18. TC says:

    Tim Smith wrote:

    > I use "will" because even thought it is

    > technically true that "can" is more correct,

    > as far as programming goes, "can" holds no

    > functional weight. It is best to assume from

    > the start that they won’t be the same…

    > ever.

    Huh? Why assume a demonstrable falsehood?

    From a coding viewpoint, their equality (or inequality) either /matters/, or it does not. If it matters, you should get it right! If it does not matter, you do not care one way or the other.

  19. DL says:

    Saving indices instead of pointers allows the linked list to be serialized "as-is" (e.g. saved to a file).

  20. Ahh…you know I really didn’t absorb the whole "killing the process" thing, or what it implied. Okay, disregard my post.

  21. Tim Smith says:

    TC, coding for the case where the pages map at the same location is functionally a subset of coding for when the pages map at different locations. By coding for the more generic case, we fully cover the more specialized case.

    Thus, "can" has no functional weight. That case buys you nothing that isn’t already covered by the more generic case.

    (I am sure there are some obscure cases where you can do some really uber neato l33t stuff when the pages map to the same location, but …)

  22. Nick Lamb says:

    Back to school…

    MAY is the correct word here in a technical document. Programmers need to be aware that

    #1 The addresses probably won’t be the same, so they can’t use an address from one context in a context which may have a different address map.

    AND

    #2 On the other hand the addresses could be the same anyway either by chance or because of some COW tricks, so they can’t rely on the addresses being different to trigger some test somewhere (e.g. trying to detect the context in which your function was called by looking at the addresses used…)

    This is what is implied by the MAY keyword in technical documents, as an implementor you are free to do either, and therefore as a user of an implementation you mustn’t assume anything.

  23. Bryan says:

    Dennis:

    > we’re talking about multithreading access to common memory

    I don’t think we necessarily are. Raymond said:

    > But what if the program crashes while holding

    > the mutex? (If you believe that your programs

    > are bug-free, consider the possiblity that

    > the program is running over the network and

    > the network goes down, leading to an in-page

    > exception. Or simply that the user went to

    > Task Manager and terminated your program

    > while this function is running.)

    >

    > In that case, the mutex is automatically

    > released by the operating system, leaving the

    > linked list in a corrupted state. *The next

    > program to claim the mutex* will receive

    > WAIT_ABANDONED…

    He’s talking about multiple processes, not multiple threads. (Killing a process from task manager ends all its threads.)

    But even if Raymond’s specific example used only one process, the APIs in question (to create and map shared memory, and to create and manipulate mutexes) all work across process boundaries too. So it’s a good idea to not discount multi-processing, even if the program as it’s written today only uses one process.

  24. TC says:

    Tim Smith, I beg to disagree still. If you assume that two things A and B "won’t be the same… ever", then you have the right to write code such as this:

    if A = B then
        ' we have two copies of the same thing;
        ' we DO NOT have two different things!

    But if your assumption is incorrect, and the two things /can/ be the same, sometimes, then the code above is incorrect. The bad assumption has led to a bug.

    It’s a general logical principle that you can prove anything, if you start with a falsehood. Assuming a demonstrable falsehood, in a programming context, is asking for trouble, IMHO.

  25. Ray Trent says:

    "You’re saying e1[ e2 ] and *( (e2) + (e1) ) are NOT always semantically equivalent for basic types?

    Can you give an example of where two different values would be returned? I’m cornfused."

    No, I’m not saying that at all. Your statement will still be true, assuming you interpret it correctly. I think the easiest way of stating this is:

    I’m saying that typically e1 is just a pointer stored somewhere in the process, not in the shared memory. Since the base address of a shared memory block can be different in different processes, it’s important that the pointer stored in e1 can have a value that differs in different processes.

    If, on the other hand, you use pointers that are themselves stored *in* the shared memory (i.e. in the linked list itself rather than in a per-process location), they can’t have different values in different processes…

  26. You can choose the consequences, but you need to know that there are consequences.

Comments are closed.