If a process crashes while holding a mutex, why is its ownership magically transferred to another process?


A customer was observing strange mutex ownership behavior. They had two processes that used a mutex to coordinate access to some shared resource. When the first process crashed while owning the mutex, they found that the second process somehow magically gained ownership of that mutex. Specifically, when the first process crashed, the second process could take the mutex, but when it released the mutex, the mutex was still not released. They discovered that in order to release the mutex, the second process had to call ReleaseMutex twice. It's as if the claim on the mutex from the crashed process was secretly transferred to the second process.

My psychic powers told me that that's not what was happening. I guessed that their code went something like this:

// code in italics is wrong
bool TryToTakeTheMutex()
{
    return WaitForSingleObject(TheMutex, TimeOut) == WAIT_OBJECT_0;
}

The code failed to understand the consequences of WAIT_ABANDONED.

In the case where the mutex was held by the first process when it crashed, the second process will attempt to claim the mutex, and it will succeed, and the return code from WaitForSingleObject will be WAIT_ABANDONED. Their code treated that value as a failure code rather than a modified success code.

The second program therefore claimed the mutex without realizing it. That is what led the customer to believe that ownership was being magically transferred to the second program. It wasn't magic. The second program misinterpreted the return code.
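Here's a minimal sketch of a wrapper that doesn't lose track of ownership. This is my guess at a fix, not the customer's code: the Acquire enum, ClassifyWaitResult, and OwnsMutex are hypothetical names, and the constants are written out by hand on non-Windows builds (with the values from the Windows headers) only so the sketch compiles anywhere.

```cpp
#include <cassert>
#include <cstdint>

#ifdef _WIN32
#include <windows.h>
#else
// Stand-ins so the sketch builds off Windows; values match the Windows headers.
using DWORD = std::uint32_t;
constexpr DWORD WAIT_OBJECT_0  = 0x00000000;
constexpr DWORD WAIT_ABANDONED = 0x00000080;
constexpr DWORD WAIT_TIMEOUT   = 0x00000102;
constexpr DWORD WAIT_FAILED    = 0xFFFFFFFF;
#endif

// Hypothetical three-way outcome. The caller owns the mutex in the first
// two cases; only the third means the mutex was not acquired.
enum class Acquire { Owned, OwnedButAbandoned, NotOwned };

// Maps a wait result to an ownership outcome instead of a plain bool.
Acquire ClassifyWaitResult(DWORD waitResult)
{
    switch (waitResult) {
    case WAIT_OBJECT_0:
        return Acquire::Owned;
    case WAIT_ABANDONED:
        // Acquired, but the previous owner died while holding the mutex,
        // so the protected state may be inconsistent.
        return Acquire::OwnedButAbandoned;
    default:
        // WAIT_TIMEOUT, WAIT_FAILED, or anything unexpected.
        return Acquire::NotOwned;
    }
}

// True whenever a matching ReleaseMutex call is owed.
bool OwnsMutex(Acquire outcome)
{
    return outcome != Acquire::NotOwned;
}
```

A caller would pass the result of WaitForSingleObject(TheMutex, TimeOut) to ClassifyWaitResult. On OwnedButAbandoned it can repair the shared state or call ReleaseMutex and give up, but either way it knows that it owns the mutex.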

The second program saw that TryToTakeTheMutex "failed", and it went off and did something else for a while. Then the next time it called TryToTakeTheMutex, the function succeeded: It was a successful recursive acquisition, but the program thought it was the initial acquisition.
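The double release falls out of the recursion count. The toy model below is not how Windows implements mutexes; ToyMutex and everything in it are hypothetical, and it tracks only an owner, a recursion count, and an abandoned flag. That is enough to replay the sequence I guessed at.

```cpp
#include <cassert>

// Possible results of a wait in the toy model (hypothetical names).
enum class Wait { Object0, Abandoned, Timeout };

// Toy single-threaded model of a recursive mutex. It mirrors only the
// ownership bookkeeping; there is no real blocking and no Win32 call.
struct ToyMutex {
    int owner = 0;          // 0 means unowned
    int recursion = 0;      // successful acquisitions by the current owner
    bool abandoned = false; // set when an owner dies while holding the mutex

    void OwnerDied() {
        if (owner != 0) { owner = 0; recursion = 0; abandoned = true; }
    }

    Wait Take(int who) {
        if (owner != 0 && owner != who) return Wait::Timeout;
        bool wasAbandoned = abandoned;
        abandoned = false;
        owner = who;
        ++recursion;
        // Either way the caller now owns the mutex.
        return wasAbandoned ? Wait::Abandoned : Wait::Object0;
    }

    void Release(int who) {
        assert(owner == who && recursion > 0);
        if (--recursion == 0) owner = 0;
    }
};

// The buggy helper from above, expressed against the toy model.
bool BuggyTryToTake(ToyMutex& m, int who) {
    return m.Take(who) == Wait::Object0;
}

// Replays the scenario and returns how many releases process 2 needed.
int ReleasesNeededAfterCrash() {
    ToyMutex m;
    m.Take(1);                           // process 1 takes the mutex
    m.OwnerDied();                       // process 1 crashes while holding it
    bool first = BuggyTryToTake(m, 2);   // acquired, but reported as failure
    bool second = BuggyTryToTake(m, 2);  // recursive acquisition, reported as success
    assert(!first && second);
    int releases = 0;
    while (m.owner == 2) { m.Release(2); ++releases; }
    return releases;
}
```

In this model, process 2 believes it acquired the mutex once, yet it takes two releases before the mutex becomes unowned, which matches what the customer reported.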

The customer never replied, so we never found out whether that was the actual problem, but I suspect it was.

Comments (29)
  1. Joshua says:

    Hmmm I made abandoned mutex an asserted condition. Then again, I don't use cross-process mutexes except for no duplicate process checks.

  2. Adam Rosenfield says:

    My psychic powers appear to be working today as well — my first thought was "they're probably not checking for a WAIT_ABANDONED error".

  3. Zarat says:

    @Joshua

    Yep, it's a good idea to assert on WAIT_ABANDONED and terminate yourself, unless you explicitly want to recover from crashed processes (most people won't want to, but some do).

    Unfortunately this requires reading the docs and understanding the return codes.

    I made it a habit to terminate the process on any return code I don't expect, unless it's a failure HRESULT and I can handle generic failures without knowing why it failed. This way I can at least get a crash dump and examine what happened. Has served me well quite a few times to fix cases which weren't observed during normal development.

  4. Count Zero says:

    I never really understood the necessity of WAIT_ABANDONED or the "abandoned" status of the synchronisation objects. I mean it comes with really few benefits, but causes a lot of code to break this way. I know they all are actually wrong but they are wrong in a very special and hard-to-detect way. You can force extensive automated and manual testing of those sources and there is a good chance the error will never appear – since it requires a certain process failing at a certain point to be detectable in another un-(or loosely) related process. Why don't they simply become released?

    Joshua – Then what do you use them for? For intraprocess synchronisation you can use Critical Sections and for communicating a status there are Events and Semaphores. Using a Mutex is a waste of resources (time mainly) if you use it within a process.

  5. Henke37 says:

    Crashes: they cause trouble for following code.

  6. Joshua says:

    @Count Zero: You're right, the specific object I am using is almost always some kind of event. I tend to use "mutex" as the non-specific object term and interpret it as such on encountering it. I still check for WAIT_ABANDONED and assert.

  7. dave says:

    >I never really understood the necessity of WAIT_ABANDONED or the "abandoned" status of the synchronisation objects.

    A mutex is there to establish consistency guarantees: it ensures that an object can never be used in an inconsistent state.

    Code which wants to modify the state of the object, from one consistent state to another consistent state, may have to proceed through some inconsistent state.  But that's okay while it is holding the mutex; the mutex guarantees that no-one else (that follows the locking rules) will see the inconsistent states.

    But if the code doing the modifications crashes holding the mutex, the state may be inconsistent. Thus there are two main choices on mutex-holder-crashes:

    1.  Leave the mutex held forever and the object will be inaccessible forever.

    2.  Release the mutex but tell the next guy to claim the mutex that the consistency warranty is null and void.

    Case #2 is "mutex abandoned".

  8. Crescens2k says:

    @Count Zero:

    For a bit of an addition to what dave said, consider the following sequence of events.

    Process A and Process B use a mutex to control access to a shared file. Process A needs to update the file, but due to a bug causing memory corruption, it has an access violation trying to access some memory address. When the process terminates, it has only managed to update part of the data in the file.

    Process B finds that it needs to update the file and tries to obtain the mutex. It gets notified of the wait abandoned state and then goes about updating the file. What will it find the state of the file to be? It could find references to an element that has been removed, or values that are not in sync.

  9. Goran says:

    @Joshua: you should perhaps revisit that. Abandoned mutexes are about threads, not processes. It's a dead thread that causes "abandoned" case (here, it was merely a thread in another process, which is immaterial).

    It is possible to build your code to never abandon a mutex held by a thread (always do RAII, /EHa, don't TerminateThread), in which case you can't ever abandon an in-process mutex. I would never use /EHa… unless c++/cli

  10. Joshua says:

    @Goran: What's to revisit? I assert on abandoned mutex. I don't ever choose to abandon one. The only way I know to make that code reachable is somebody else calling TerminateThread on one of my threads.

  11. Ben Voigt says:

    @Goran: So now in case of exception you still break the consistency guarantee, but without having the mutex marked as abandoned? You've quite possibly done the only thing than corrupting data…. doing so silently.

  12. Mark VY says:

    @Ben: typo alert: "done the only thing than" –> "done the only thing worse than"??

  13. DebugErr says:

    It's always good to read "code in italics is wrong" and then the whole code block is in italics :)

  14. Count Zero says:

    @dave – Theoretically I agree. Only I don't see any practical use. If we are talking about production code and the IPC data is vital, my program will check the consistency of all input data especially data received through some sort of IPC (to close an obvious attack vector) and will fail gracefully if it is inconsistent. If the IPC data is not vital, my program will happily ignore it.

    In the best case scenario the abandoned mutex is only a clue that the data might be inconsistent, but not a guarantee. So it (the abandoned state of the mutex) is only useful if the code is not so badly written (God, how I miss the opportunity to use italic text!), which does no input validation but checks the (extra) state of the sync object.

  15. Dave says:

    I've always considered WAIT_ABANDONED to be another one of those "In order to demonstrate our superior intellect, we will now report to you an error condition you cannot handle" cases.

  16. Count Zero says:

    @Joshua – Three objections:

    1) As I already said using a Mutex to sync threads in a single process is a waste of resources. You should use CriticalSections, Semaphores or Events for that purpose. You answered "I tend to use "mutex" as the non-specific object term and interpret it as such on encountering it.". Let's assume you are using the wrong terminology but the right methods. Then your coding practice is still "a bit" wrong. Let's see why…

    2) Are you aware that (native/built-in) assertions are only raised in debug builds? So even if you assert on WAIT_ABANDONED, your production code could still break in case of a WAIT_ABANDONED return code. (But it won't… see my next point about the why.)

    3) According to the documentation of the WaitForSingleObject() API function the WAIT_ABANDONED return code can only be returned if the object you are waiting for is a Mutex (and the process that owns it is terminated without releasing it), so an assertion checking the aforementioned state is kind of pointless.

  17. manuell says:

    @Dave You CAN handle the case! That is: 1) aborting or 2) reset state to "consistent".

  18. dave says:

    Well, of course 'you' (the app programmer) can handle it – but only if you're told it needs handling, which is why there is a mutex-abandoned indication.

    In fact you pretty much have to handle it: pick either of the two methods you mention.  (Most cases I'd code as 'abort', only making the effort to find and fix inconsistencies in situations that warranted that extra work).

  19. dave says:

    Uh-oh: too many daves.  I now see that I replied to a reply to capital-D Dave.  Sorry for the confusion.

    http://www.poetryfoundation.org/…/171612

  20. Joshua says:

    @Count Zero: I ship with assertions on. This particular one is in the default case of the select decoding WaitForMultipleObjects. It would also assert on an event signalled off the array (WAIT_OBJECT_0 + 4 on a size 3 array for example).

  21. Count Zero says:

    @Joshua – Still there is the problem that you said "You're right, the specific object I am using is almost always some kind of event." if by "some kind of event" you mean a windows event, that won't ever return WAIT_ABANDONED since – according to the documentation of WaitForSingleObject(), WAIT_ABANDONED can only be returned if the object you are waiting for is a Mutex – so an assertion checking that state is pointless.

  22. dave says:

    Logically speaking, you can only have an 'abandoned' state for objects with ownership semantics, which doesn't include events.

    I think the mutex is the only such synch object, at least of those exposed to user mode.

  23. Joker_vD says:

    @Count Zero: On the contrary, ASSERTing against an "impossible" thing that may only happen if your OS has glitched, for example, makes sense. Assertions are for checking that the contracts are followed, exceptions are for checking that the work is being done. If a function promises to return only non-negative numbers, you should ASSERT(f() >= 0). If it instead promises to return negative numbers if there was an error, then you may "if (f() < 0) throw unexpected_error("I'm lazy lol");" or something.

  24. dave says:

    @Joker_vD

    … which usage is exactly in keeping with the meaning of the word "assert".

  25. John Doe says:

    @Joshua, so you don't use mutexes except for the top-most reason almost everyone else uses mutexes.

    How quaint…

  26. Lars says:

    @dave: Which, ironically, is proof that you can't have the 'case'. *rimshot*

  27. Dave (the other Dave) says:

    @manuell:

    >You CAN handle the case! That is: 1) aborting

    Which is basically just crashing, i.e. not handling it at all.

    >or 2) reset state to "consistent".

    That's like saying that you can handle a crashed hard drive by resetting its state to non-crashed. If you're doing something sufficiently complex and critical that you need to protect it with a mutex to make sure no-one else disturbs it then it's not just a case of calling the WinMakeConsistentOnWaitAbandoned() (documented in the appendix to the apocrypha to MSDN) function to fix things.

    [True in general, but you may be able to roll back to a previous checkpoint or something like that. -Raymond]

  28. Count Zero says:

    @Joker_vD – I also use assertions to handle "impossible" conditions, but in my case "impossible" means "very unlikely". I mainly use assertions for "premature" parameter checking at the beginning of subroutines. Input parameters might be wrong (type/value/missing/some sort of null) during development (mostly they are not), but can not be wrong in production code in real life environment (Since that code has gone through extensive testing.), so they can be checked in assertions.

    Invalidity of data received through IPC communication does not fall in the area of "impossible" (or "very unlikely"), since it is received from another module(/process maybe running in another context or on another machine) which can be impersonated, or the actual communication might be corrupted. Even in the heavily tested and (to the boundary of possibilities) bug free production code, so an assertion here seems to be a failure.

    @Joshua – Shipping with assertions on seems to be a bad practice. It is like shipping debug code. It actually cries for a D.O.S. attack (by leaving obvious attack vectors opened) – not to mention the unnecessary code compiled into your executable and the CPU time eaten by the unnecessary code running on all assertion checking for "impossible" conditions. Oh, and did I mention the (mostly similar) structures built in your executable that can be detected and hijacked for injecting malicious code? I'm not saying you can't ship stable, solid and working code with assertions on if your entire coding practice is built around the "ship with assertions on" principle, but there are only two ways for that: You can either have special assertions for "possible impossible" conditions and only include those (which is still misused terminology which can mislead the maintenance programmer), or you can prepare your program to handle a lot of "impossible" situations and fail gracefully over them (which enlarges source code, produces unnecessarily huge executables and eats up a lot of resources AND still causes confusion for the maintenance guy).

  29. Anonymous Coward says:

    Perhaps the best decision to appease both sides would be:

    * Have WaitForSingleObject/etc return WAIT_OBJECT_0 (or whatever the usual return value would be) instead of WAIT_ABANDONED, but:

    * Also have a WasMutexAbandoned function which returns TRUE if the mutex was abandoned since the function was last called (and the function must be called while holding the mutex).

    That way, programs that aren't written to expect abandoned mutexes don't cause additional bugs (other than the ones caused by inconsistent data), and carefully-written programs can still avoid propagating corrupted data.

    Of course, there's a third side which will say that developers should be forced to deal with the abandoned mutex case (as WAIT_ABANDONED does), and also it's too late to change anything now.

Comments are closed.