Cleaning up shared resources when a process is abnormally terminated

This post came into my suggestion box yesterday from Darren Cherneski:

We have a system that has an in-memory SQL database running in shared memory that is created with CreateFileMapping(). Processes start up, attach to it via a DLL, do some queries, and shut down. The problem we keep running into during development is that when a developer starts up a process in the debugger, performs a query (which gets a lock on a database table), and then hits Shift-F5 to stop debugging, the lock on the database table doesn't get released. We've put code in the DllMain() function of the DLL to perform proper cleanup when a process crashes, but DllMain() doesn't seem to get called when a developer stops a process in the debugger.

Windows has hundreds of system DLLs where a process can get a handle to a resource (Mutex, file, socket, GDI, etc). How do these DLLs know to clean up when a developer hits Stop in the debugger?

It's a great question which comes up on our internal Win32 programming alias once a month or so, and it illustrates one of the key issues with resource ownership.

The interesting thing is that this issue only occurs with named synchronization objects.  Unnamed synchronization objects are always private, so the effects of a process abnormally terminating are restricted to that process.  The other resources mentioned above (files, sockets, GDI, etc) don't have this problem, because when the process is terminated, the handle to the resource is closed, and closing that handle causes all the per-process state (locks on the file, etc) to be flushed.  The problem with synchronization objects is that, with the exception of mutexes, they have state (the signaled state) that's not tied to a process or thread.  The system has no way of knowing what to do when a handle is closed while an event is set to the signaled state, because there is no way of knowing what the user intended.

Having said that, a mutex DOES have the concept of an owning thread, and if the thread that owns a mutex terminates, then one of the threads blocked waiting on the mutex will be awoken with a return code of WAIT_ABANDONED.  That allows the caller to realize that the owning thread was terminated, and perform whatever cleanup is necessary.

Putting code in the DllMain doesn't work, because, as Darren observed, the DllMain won't be called when the process is terminated abruptly (like when exiting the debugger).

To me, the right solution is to use a mutex to protect the shared memory region, and if any of the people waiting on the mutex get woken up with WAIT_ABANDONED, they need to recognize that the owner of the mutex terminated without releasing the resource and clean up.

Oh, and I looked: Windows doesn't have "hundreds of system DLLs where a process can get a handle to a resource".  There are actually relatively few cases in the Windows code base where a named shared synchronization object is created (for just this reason).  And all of the cases I looked at either use a mutex and handle the WAIT_ABANDONED return code, or they use manual reset events (which don't have this problem), or they have implemented some form of alternative logic to manage this issue (waiting with a timeout, registering the owner in a shared memory region, and if the timeout occurs, checking whether the owner process still exists).

The reason that manual reset events aren't vulnerable to this issue, btw, is that they don't have the concept of "ownership".  Instead, manual reset events are typically used to notify multiple listeners that some event has occurred (or that some state has changed).  In fact, internally in the kernel, manual reset events are known as NotificationEvents for just this reason (auto-reset events are known as SynchronizationEvents).  Oh, and mutexes are known as Mutants internally (you can see this if you double click on a mutex object using the WinObj tool).  Why are they called mutants?  Well, it's sort of an in-joke.  As Helen Custer put it in "Inside Windows NT":

The name mutant has a colorful history.  Early in Windows NT's development, Dave Cutler created a kernel mutex object that implemented low-level mutual exclusion.  Later he discovered that OS/2 required a version of the mutual exclusion semaphore with additional semantics, which Dave considered "brain-damaged" and which was incompatible with the original object. (Specifically, a thread could abandon the object and leave it inaccessible.)  So he created an OS/2 version of the mutex and gave it the name mutant.  Later Dave modified the mutant object to remove the OS/2 semantics, allowing the Win32 subsystem to use the object.  The Win32 API calls the modified object mutex, but the native services retain the name mutant.

Edit: Cleaned up newsgator mess.