A process shutdown puzzle, Episode 2


A customer reported that their program would very sporadically crash in the function CloseThreadpoolCleanupGroupMembers. The customer was kind enough to provide a stack trace at the point of the crash:

ntdll!RtlUnhandledExceptionFilter2+0x31e
KERNELBASE!UnhandledExceptionFilter+0x175
ntdll!RtlUserThreadStart$filt$0+0x3f
ntdll!__C_specific_handler+0x8f
ntdll!RtlpExecuteHandlerForException+0xd
ntdll!RtlDispatchException+0x3a6
ntdll!RtlRaiseException+0x223
ntdll!TppRaiseInvalidParameter+0x48
ntdll!TpReleaseCleanupGroupMembers+0x246
litware!CThreadPool::UnInitialize+0x22
litware!_CRT_INIT+0xbf
litware!__DllMainCRTStartup+0x18b
ntdll!LdrpCallInitRoutine+0x3f
ntdll!LdrShutdownProcess+0x205
ntdll!RtlExitUserProcess+0x90
kernel32!ExitProcessImplementation+0xa
contoso!wmain+0x193
contoso!__wmainCRTStartup+0x13d
kernel32!BaseThreadInitThunk+0xd
ntdll!RtlUserThreadStart+0x1d

The customer wondered, "Could the problem be that my cleanup group does not have a callback? MSDN seems to suggest that this is okay."

The exception being thrown is STATUS_INVALID_PARAMETER, but that doesn't really say much.

But that's okay, because the smoking gun isn't the exception being raised. It's in the stack.

Do you see it?

The code is calling CloseThreadpoolCleanupGroupMembers from inside DllMain while handling the DLL_PROCESS_DETACH notification. Looking further up the stack, you can see this was triggered by a call to ExitProcess, and now all the stuff you know about how processes exit kicks in.

For example, the first thing that happens is that all threads other than the one calling ExitProcess are forcibly terminated.

That's your next clue.

Observe that the customer's DLL is trying to communicate with the thread pool during process termination. But wait, all the threads have already been terminated. It's trying to communicate with a nonexistent thread pool.

The thread pool realizes, "Hey, like I've already been destroyed. I can't do what you ask because there is no thread pool any more. You want me to block until all currently executing callback functions finish, but those callback functions will never finish (if they even exist at all) because the threads hosting their thread pool got destroyed. Not that I can tell whether they are executing or not, because I am already destroyed. The only options are to hang or crash. I think I'll crash."

The customer needs to restructure the program so that it either cleans up its thread pool work before calling ExitProcess or simply skips all thread pool operations when the reason for the DLL_PROCESS_DETACH is process termination.
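
Here is a rough sketch of the second option. It is not the customer's actual code. It relies on the documented DllMain contract: for DLL_PROCESS_DETACH, the lpvReserved parameter is non-NULL when the process is terminating and NULL when the DLL is being unloaded by FreeLibrary. The UninitializeThreadPool function and the g_poolClosed flag are hypothetical stand-ins for the customer's CThreadPool::UnInitialize, and the non-Windows declarations exist only so that the sketch compiles anywhere.

```cpp
#include <cassert>

#ifdef _WIN32
#include <windows.h>
#else
// Stand-ins so the sketch compiles off Windows (assumption: illustration only).
typedef int BOOL;
typedef unsigned long DWORD;
typedef void* LPVOID;
typedef void* HINSTANCE;
#define TRUE 1
#define WINAPI
#define DLL_PROCESS_DETACH 0
#endif

// Hypothetical flag recording whether the thread pool cleanup has run.
static bool g_poolClosed = false;

// Hypothetical stand-in for CThreadPool::UnInitialize. The real function
// would call CloseThreadpoolCleanupGroupMembers and friends.
static void UninitializeThreadPool()
{
    assert(!g_poolClosed && "cleanup must run at most once");
    g_poolClosed = true;
}

BOOL WINAPI DllMain(HINSTANCE, DWORD reason, LPVOID reserved)
{
    if (reason == DLL_PROCESS_DETACH) {
        if (reserved != nullptr) {
            // Process termination: the other threads are already gone,
            // so do not touch the thread pool at all.
            return TRUE;
        }
        // Dynamic unload via FreeLibrary: the other threads still exist.
        UninitializeThreadPool();
    }
    return TRUE;
}
```

In the process termination case, the cleanup is skipped and the operating system reclaims the resources when the process goes away. Whether it is wise to wait for thread pool callbacks inside DllMain even in the FreeLibrary case is a separate question; a dedicated shutdown function that the host calls before exiting is likely the more robust design.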

Comments (25)
  1. This makes me wonder why someone was calling ExitProcess.

    [Perhaps because they wanted to exit the process? The alternative is, what, Sleep(INFINITE)? -Raymond]
  2. Anonymous says:

    "The alternative is, what, Sleep(INFINITE)?"

    The alternative is to return from main(). Which, AFAIK, will also call ExitProcess for you.

  3. Anonymous says:

    [Perhaps because they wanted to exit the process? The alternative is, what, Sleep(INFINITE)? -Raymond]

    TerminateProcess().

  4. Anonymous says:

    @Cesar:  The code tearing down the thread pool is in a DLL, as indicated by __DllMainCRTStartup present on the call stack.

    @Joshua:  TerminateProcess is not an acceptable API for normal process shutdown.  And it's certainly not an acceptable API for a DLL to use to terminate its host process under normal circumstances.  If you believe either of those statements is false then I feel for your users.

    blogs.msdn.com/…/9921676.aspx

  5. Anonymous says:

    Order of operations.

    Who would have thought those high school algebra classes were laying such an important foundation?

    JamesNT

  6. Anonymous says:

    Hang or crash only? How about just returning back to the caller with an error code.

    [How do you clean up from a failed clean-up? -Raymond]
  7. Anonymous says:

    [But this helper DLL had a pending threadpool work item that the app didn't know about. -Raymond]

    Creating a threadpool in a DLL sounds like a problem waiting to happen.

    [How do you clean up from a failed clean-up? -Raymond]

    You don't, but you try to keep going so the next cleanup handler can run. Maybe it has a buffer that needs flushing.

    [But what if the next cleanup handler assumed that the previous one successfully cleaned up? -Raymond]
  8. Anonymous says:

    [But what if the next cleanup handler assumed that the previous one successfully cleaned up? -Raymond]

    This is why I design my software to be TerminateProcess() safe by having well-defined COMMIT points and ROLLBACK on startup.

    [And when ROLLBACK fails? -Raymond]
  9. Anonymous says:

    BTW where was that (your) SEH filter that usually catches all exceptions raised inside DllMain? I think it's the only place where it's suitable, actually…

  10. Anonymous says:

    One of the comments over in the "Clean-up functions can't fail because…" post says to tell the user when something bad happens, since maybe he can do something.  Such as, "Out of memory?  Close some other application."

    Um, that won't help.  Virtual memory, anyone?

  11. Anonymous says:

    @ChrisR: It isn't the DLL exiting the process (if it were, it could do this cleanup BEFORE calling ExitProcess).

    Note in the stack trace, you have two modules which are not part of Windows: the main application "contoso" and the library "litware".

  12. Anonymous says:

    [And when ROLLBACK fails? -Raymond]

    Same basic story as when ROLLBACK fails on SQL Server and about as unlikely.

    [Well, this library was trying to do a rollback (canceling the stuff it had started) and it failed. Maybe this library should have been written in SQL. -Raymond]
  13. I suppose I've opened a bit of a can of worms, but I'm interested in whether there's more to the story.  For example, how did litware.dll get loaded?  If contoso.exe called CoCreateInstance(ILitWareInterface) and then leaked a reference (perhaps not even calling CoUninitialize()) then the appropriate thing to do is fix the bug in contoso.exe.

    Contrariwise, if litware.dll is being loaded by another process calling CreateRemoteThreadEx I am not sure how it's supposed to work, and I am willing to believe that your recommendation of "skip all cleanup if you're in process termination" is all that needs to be fixed.

    [Contoso called LitWare_DoSomethingAwesome, and as part of its work, LitWare_DoSomethingAwesome queued up some background tasks. Then Contoso decided that it was done and exited. It had no idea that LitWare_DoSomethingAwesome was going to schedule additional work onto the thread pool. (For example, maybe the LitWare folks decided that to improve performance, the DoSomethingAwesome function would return immediately and finish doing the awesome stuff in the background.) -Raymond]
  14. Anonymous says:

    I guess my point was lost. I avoid the problem of trying a rollback from a weird state largely by doing the rollback at startup rather than shutdown.

    [I guess I don't understand your point, then. If you roll back at startup, then your service is unavailable for the entire lifetime of the process! -Raymond]
  15. [I guess I don't understand your point, then. If you roll back at startup, then your service is unavailable for the entire lifetime of the process! -Raymond]

    Easy. One creates checkpoints (committed transaction points) periodically, and if the process crashes, it's restored to the saved checkpoint by rolling back all uncommitted transactions.

    [I think we're talking about different things. You're talking about recovering from an app that crashes because it can't clean up properly. I'm talking about why the app is crashing at cleanup in the first place. -Raymond]
  16. Anonymous says:

    @Ben Voight:  Indeed I saw that.  I was illustrating to Joshua and Cesar that their suggestions would not work, since the DLL does not (and should not) control process shutdown.

  17. I see.  So Litware assumed that their work items would complete before the process exited, and did not provide any way for Contoso to wait on the awesome thing being completed, or to abort doing the awesome thing because it's time to shut down now.

    [Sure, that's one scenario. I'm sure you can be creative and come up with others. Just look at all the people who want to create a worker thread in their DLL_PROCESS_ATTACH. -Raymond]
  18. Anonymous says:

    Moral of the story: it would really really help if the error messages explained what went wrong. Just allocate that error condition one unique code, that you never, not *ever* reuse anywhere else. Then document what causes it. Much better than STATUS_INVALID_PARAMETER, isn't it?

    You'll probably counter this by saying that back then the value was only 16 bits and you wouldn't have enough values for all the distinct error conditions. Let alone consistently generate non-colliding values. That's fair enough, I suppose.

    The idea still holds though. If this were an exception-based API, I'd love to get an exception saying "Cannot wait for callbacks to finish because thread pool is already destroyed, and the callbacks will never return" instead of "An argument is invalid". The customer (who seems to be one of the smarter ones) would not have to ask for support then.

  19. The alternative is a clean shutdown path.

    I'm basically wondering why clean shutdown failed.

    [This *is* the clean shutdown path. The app is exiting normally. But this helper DLL had a pending threadpool work item that the app didn't know about. -Raymond]
  20. Anonymous says:

    Hi Raymond,

    I'm ordinarily a .NET programmer who doesn't normally have to deal with DllMain or any of that sort of thing, so please correct me if I'm wrong, but it seems that the best way of doing this sort of thing is as follows:

    1. The external library should have some sort of "Init" function (e.g. "InitLitware") that the host process should call from its own "main"/"WinMain" function (or anywhere really), rather than the library doing any sort of heavy initialisation in its "DllMain" function.
    2. The host process should then be able to call some of the library's functions (e.g. "DoSomethingAwesome").

    3. Lastly, the external library should also have some sort of "Cleanup" function (e.g. "CleanupLitware") that the host process should call when it is ready (e.g. before exiting from "main"/"WinMain", or calling "ExitProcess" or "TerminateProcess"), rather than the library doing any sort of cleanup in its "DllMain" function. This function should block the host process until it is complete (which includes waiting for any pending operations), so that the problem above can't occur.

    Is that about right?

  21. Anonymous says:

    The MSDN article doesn't seem to say what happens if you try to close the threadpool twice. Does this also crash?

  22. @djhayman:

    Unfortunately "best" is open to many different interpretations, and some programmers think that it is best for the people who use their library not to worry about initialisation and cleanup, and this is done in DllMain. In some ways you can understand this because if a library has some form of persistent state and the developer forgets to signal to that library that it is shutting down, then that persistent state may become corrupt.

    Well anyway, I agree with you, the executable knows better about the state of the program. It is also documented in the DllMain documentation that when the process is terminating, all threads besides the one you are currently in have been terminated. It is also documented that you shouldn't do anything too complex in DllMain, like communicate with other threads. But people don't read, people don't listen, they just do what they want and then get confused when the program crashes.

    @Neil:

    What motivated you to ask that question? Since what you get back from CreateThreadpool is a pointer to a user mode structure, can you give any guarantee that, after you close the thread pool, it doesn't clean up the memory right away? Accessing anything after you said you don't want to use it any more is a bug.

  23. cheong00 says:

    Say I have such a litware, whose vendor is long gone, that dares to do something with inappropriate timing in DllMain, but I still have to use the library. What advice can be provided to minimize the damage?

  24. Anonymous says:

    For the kind of thing that Raymond described, it might be good enough to load the DLL with LoadLibrary and unload it with FreeLibrary before you exit the process. That way the threads are still alive when the library tries to destroy the thread pool.

  25. cheong00:

    If a DLL creates threads, it's IMPOSSIBLE to stop them safely from DllMain. It HAS to be a dedicated DLL function to shut it down properly.
