If you don’t want to try to repair the data, then don’t, but you should at least know that you have corrupted data


When I wrote about understanding the consequences of WAIT_ABANDONED, I mentioned that one of the possible responses was to try to repair the damage, but some people are suspicious of this approach.

Mind you, I'm suspicious of it, too. Repairing corruption is hard. You have to anticipate the possibility, create enough of a trail to be able to reconstruct the original data once the corruption is recognized, and then be able to restore the data to some semblance of consistency. I didn't say that this was mandatory; I didn't even say that it was recommended. I just listed it as one of the options, an option for the over-achievers out there.

For most cases, attempting repair is overkill. But you still have to know that something went wrong; otherwise, one crashed program will lead to more crashed programs as they try to operate on inconsistent data. The purpose of the article was to raise awareness of the issue, based on my observation that most people blindly ignore the possibility that the mutex was abandoned.
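To make "know that something went wrong" concrete, here is a minimal sketch of telling a clean acquisition apart from an abandoned one. `WaitForSingleObject` and the `WAIT_*` codes are the real Win32 names; `AcquireResult`, `ClassifyWait`, and `AcquireSharedData` are hypothetical names invented for this illustration, and the non-Windows stand-ins exist only so the classification logic compiles elsewhere.

```c
#ifdef _WIN32
#include <windows.h>
#else
/* Stand-ins so the sketch compiles off Windows; values match the Windows SDK. */
typedef unsigned long DWORD;
#define WAIT_OBJECT_0  0x00000000UL
#define WAIT_ABANDONED 0x00000080UL
#define WAIT_TIMEOUT   0x00000102UL
#define WAIT_FAILED    0xFFFFFFFFUL
#endif

/* Hypothetical result type for this sketch; not part of any API. */
typedef enum {
    ACQUIRED_CLEAN,   /* we own the mutex and the last owner released it normally */
    ACQUIRED_SUSPECT, /* we own the mutex, but the last owner died while holding it */
    NOT_ACQUIRED      /* timeout or failure: we do not own the mutex */
} AcquireResult;

/* Map a wait result onto what the caller may assume about the shared data. */
static AcquireResult ClassifyWait(DWORD waitResult)
{
    switch (waitResult) {
    case WAIT_OBJECT_0:  return ACQUIRED_CLEAN;
    case WAIT_ABANDONED: return ACQUIRED_SUSPECT;
    default:             return NOT_ACQUIRED;
    }
}

#ifdef _WIN32
/* Wait for the mutex and report whether the protected data can be trusted. */
static AcquireResult AcquireSharedData(HANDLE mutex)
{
    return ClassifyWait(WaitForSingleObject(mutex, INFINITE));
}
#endif
```

In the `ACQUIRED_SUSPECT` case the caller does own the mutex and still has to call `ReleaseMutex` eventually. What it does in between is the judgment call: refusing to touch the data and reporting the problem is a perfectly reasonable response, and it is the one this sketch is meant to enable.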

Comments (2)
  1. Yuhong Bao says:

    Yep, one of the problems is that a process termination can happen at any instruction, so you’d have to disassemble the code to see the exact combination of states that the structure can get into. That is why atomic instructions are so important.

  2. strik says:

    You could also use some transactional system like the following ad-hoc solution:

    1. Get the mutex
    2. If WAIT_ABANDONED, check for any "left over" data structure and repair it

    3. Before making any changes, record what you want to do. You have to record either the state before the change or the state after it. Once this record has been written successfully, mark it as complete (setting one bit might suffice, or storing a pointer to the record in the data structure).

    4. Now, perform your changes.

    5. Release the mutex.

    If you get WAIT_ABANDONED, you know that the previous owner terminated while holding the mutex. Thus, in step 2, you can either restore what was there (if the stored state is the state before the change), or perform the changes "the other one" wanted to make (if the stored state is the state after the change). After doing so, mark the record as "incomplete" before proceeding with step 3.

    Note, however, that this is an ad-hoc solution; there might still be some race I have not thought of yet. Note also that you have to make sure the compiler does not interfere through its optimisations or by reordering the operations.

    Note also that undoing the anticipated changes might be easier than completing them.

    This is one solution of the form Raymond named as "transactional" above.
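The "record the state before" variant of the steps above can be sketched in portable C. Everything here is an assumption made for illustration: the `SharedState` layout, the `Transfer` and `Recover` names, and the `crashMidway` flag that simulates the owner dying between two writes. The mutex itself is left out; `Recover` stands for what you would run in step 2 after the wait reports `WAIT_ABANDONED`. Real shared memory would also need memory barriers so that the record is visibly complete before the first modification, which this single-threaded sketch does not attempt.

```c
#include <stdbool.h>

/* Hypothetical shared structure: two values that must change together,
   plus an undo record that lives alongside them. */
typedef struct {
    int  balanceA;
    int  balanceB;
    bool logValid;  /* true once the undo record is completely written */
    int  savedA;    /* state before the change */
    int  savedB;
} SharedState;

/* Step 2: if a complete undo record was left behind, roll back to it.
   The record is cleared last, so a crash during recovery just means the
   next owner repeats the same rollback. */
static void Recover(SharedState *s)
{
    if (s->logValid) {
        s->balanceA = s->savedA;
        s->balanceB = s->savedB;
        s->logValid = false;
    }
}

/* Steps 3 and 4: write the undo record, mark it complete, then modify.
   crashMidway simulates termination between the two writes. */
static bool Transfer(SharedState *s, int amount, bool crashMidway)
{
    s->savedA = s->balanceA;
    s->savedB = s->balanceB;
    s->logValid = true;        /* record is complete from here on */

    s->balanceA -= amount;
    if (crashMidway) {
        return false;          /* owner "dies" with the data half-updated */
    }
    s->balanceB += amount;

    s->logValid = false;       /* change finished; record no longer needed */
    return true;
}
```

Rolling back was chosen over rolling forward because, as noted above, undoing tends to be the easier of the two: the record only has to hold the old values, not a description of the intended operation.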
