Why am I getting a crash at shutdown inside the thread pool?

A customer reported a crash in WinHTTP when their application shuts down a WebSocket. Specifically, it occurs when one of their DLL's global objects is being destructed.

The customer sent us a redacted call stack:

00a5e11c 7753ebbe ntdll!KiFastSystemCallRet
00a5e120 77581174 ntdll!NtAlpcSendWaitReceivePort+0xa
00a5e1d0 7758078a ntdll!SendMessageToWERService+0x14d
00a5ecc0 77580c10 ntdll!ReportExceptionInternal+0xde
00a5f118 7758085b ntdll!RtlReportExceptionEx+0x379
00a5f170 775a74dc ntdll!RtlReportException+0x9b
00a5f180 77541454 ntdll!TppRaiseInvalidParameter+0x51
00a5f194 77540ddd ntdll!_EH4_CallFilterFunc+0x12
00a5f1bc 77544d33 ntdll!_except_handler4_common+0x8d
00a5f1dc 775508d2 ntdll!_except_handler4+0x20
00a5f200 775508a4 ntdll!ExecuteHandler2+0x26
00a5f2c8 7753f477 ntdll!ExecuteHandler+0x24
00a5f2c8 775a74c2 ntdll!KiUserExceptionDispatcher+0xf
00a5f660 7755ddb0 ntdll!TppRaiseInvalidParameter+0x37
00a5f66c 774ecdd2 ntdll!TppTimerpValidateTimer+0x6e1a2
00a5f690 757ddadb ntdll!TpSetTimerEx+0x1b
00a5f6b8 757c646d WINHTTP!HTTP_THREAD_POOL::SetTimer+0x42
00a5f6f0 757c6070 WINHTTP!WEB_SOCKET_HANDLE_OBJECT::Close+0x1bb
00a5f754 69699832 WINHTTP!WinHttpWebSocketClose+0x9c
 global atexit call being made here
00a5f814 696d1f7d XXXXXX!_CRT_INIT+0xaa
00a5f874 7753cd4e XXXXXX!__DllMainCRTStartup+0x1ee
00a5f894 77505525 ntdll!LdrxCallInitRoutine+0x16
00a5f8e4 775057cb ntdll!LdrpCallInitRoutine+0x43
00a5f97c 77518e3f ntdll!LdrShutdownProcess+0x101
00a5f990 77065736 ntdll!RtlExitUserProcess+0x63
00a5f99c 77065471 msvcrt!__crtExitProcess+0x17
00a5f9e0 77065715 msvcrt!doexit+0x10a
00a5f9f4 00be2369 msvcrt!exit+0x11
00a5fa2c 7752b2dd contoso!__wmainCRTStartup+0x114
00a5fa70 7752b2a7 ntdll!__RtlUserThreadStart+0x2f
00a5fa80 00000000 ntdll!_RtlUserThreadStart+0x1b

The customer concluded, "We have some ideas that may work around the issue by using WINHTTP_OPTION_WEB_SOCKET_CLOSE_TIMEOUT to avoid the close timeout, but we'd like confirmation as to whether this will actually solve the problem."

Okay, first let's understand the problem, then we can look at possible solutions.

The customer has a DLL with a global object, and as we learned some time ago, global objects in DLLs are destructed as part of DLL_PROCESS_DETACH. The problem is that the thread pool has already shut down by the time this DLL gets around to destroying global objects. We know this because one of the first steps in process termination is terminating all but one of the threads. A thread pool without any threads is not really a thread pool any more.

At process termination, the thread pool is electrified. Any attempt to schedule new work on the thread pool will result in an immediate crash. In this case, the problem is that the customer's DLL is closing a WinHTTP WebSocket, and one of the things that WinHTTP does when it closes a WebSocket is to schedule a thread pool timer so it can abort the close handshake if it takes too long.

Okay, so the chain of events goes like this: Thread pool gets electrified, then the DLL starts destructing its objects, and one of the objects tries to close a WebSocket, and closing the WebSocket creates a thread pool timer, but the thread pool is electrified, so the process crashes.
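
To make that chain of events concrete, here is a toy model of it. This is a sketch for illustration only: ToyThreadPool and ToyWebSocket are hypothetical stand-ins, not the real thread pool or WinHTTP, and the sketch throws an exception where the real thread pool would crash the process.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Toy stand-in for the system thread pool (hypothetical, not the real one).
class ToyThreadPool {
public:
    // Models process termination: from here on, no new work is accepted.
    void Electrify() { electrified_ = true; }

    // Models arming a timer. The real thread pool crashes the process;
    // this sketch throws instead so the failure can be observed.
    void SetTimer(std::function<void()> callback) {
        if (electrified_) {
            throw std::logic_error("thread pool used during process termination");
        }
        timers_.push_back(std::move(callback));
    }

    std::size_t PendingTimers() const { return timers_.size(); }

private:
    bool electrified_ = false;
    std::vector<std::function<void()>> timers_;
};

// Toy stand-in for a WebSocket whose close path arms a timeout timer.
class ToyWebSocket {
public:
    explicit ToyWebSocket(ToyThreadPool& pool) : pool_(pool) {}

    // Closing schedules a timer that would abort a slow close handshake.
    void Close() {
        pool_.SetTimer([] { /* would abort the close handshake */ });
    }

private:
    ToyThreadPool& pool_;
};
```

Closing a ToyWebSocket before Electrify() simply arms a timer. Closing one afterward fails inside SetTimer, which is the position the customer's global destructor finds itself in.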

Okay, now that we understand the problem, let's look for solutions.

The customer's proposed workaround is to use WINHTTP_OPTION_WEB_SOCKET_CLOSE_TIMEOUT to set the timeout to INFINITE. This tells WinHTTP to let the close operation take as long as it wants, which means that it doesn't bother creating a thread pool timer to abort a close operation that is taking too long (because you said that there's no such thing as "too long").

That solves the proximate problem, but really this is just playing whack-a-mole. You may be able to get rid of this crash caused by closing a WinHTTP WebSocket, but this may merely expose some other object that is also using the thread pool at destruction, and you're going to have to go through all this analysis again and look for a way to get that other object to avoid the thread pool at process termination.

The best solution is to try to get rid of the global variables in the first place. If you can't do that, then you at least want to avoid running the destructors at process termination. There are a few ways of accomplishing this:

  • Clean up the global variables explicitly prior to process termination. The destructors will run at DLL_PROCESS_DETACH, but since you already released the resources, the destructors won't do anything.
  • Neuter the global variables in DLL_PROCESS_DETACH if the reason for the notification is that the process is terminating. That way, when their destructors run, they won't do anything.
  • A special case of the previous item is to set a flag in DLL_PROCESS_DETACH if the reason for the notification is that the process is terminating. Have the destructors check the flag and do nothing if the flag is set.
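
Here is a minimal sketch of how the first and third options might look together. The names g_processTerminating, Connection, and closeCounter are hypothetical and exist only for illustration; the counter stands in for whatever real cleanup call the object would make. The sketch relies on the documented behavior that the reserved parameter of DllMain is non-null for DLL_PROCESS_DETACH when the process is terminating.

```cpp
#include <atomic>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical flag: set once we learn that the process is terminating.
std::atomic<bool> g_processTerminating{false};

// Hypothetical wrapper around a resource such as a WebSocket handle.
class Connection {
public:
    explicit Connection(int* closeCounter) : closeCounter_(closeCounter) {}
    ~Connection() { Close(); }

    Connection(const Connection&) = delete;
    Connection& operator=(const Connection&) = delete;

    // Explicit cleanup. Safe to call more than once, so calling it before
    // process termination turns the destructor into a no-op.
    void Close() {
        if (closeCounter_ == nullptr) {
            return;  // already closed or abandoned
        }
        if (!g_processTerminating.load()) {
            // Real code would release the resource here (for example,
            // by closing the WebSocket). The sketch just counts the call.
            ++*closeCounter_;
        }
        // If the process is terminating, abandon the resource instead.
        closeCounter_ = nullptr;
    }

private:
    int* closeCounter_;
};

#ifdef _WIN32
// The DLL's entry point records that the process is terminating
// before the global destructors get a chance to run.
BOOL WINAPI DllMain(HINSTANCE, DWORD reason, LPVOID reserved) {
    if (reason == DLL_PROCESS_DETACH && reserved != nullptr) {
        g_processTerminating.store(true);
    }
    return TRUE;
}
#endif
```

With this shape, a Connection that is closed explicitly ahead of time does nothing in its destructor, and one that is still open at process termination abandons its resource rather than touching the thread pool.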

The point is that you don't want to do any cleanup at process termination, because the process has already stopped providing services, and lots of things may be electrified. You just want to let the process terminate and stay out of its way.

Exercise: By a startling coincidence, the day I wrote this blog entry, this question arrived from another customer. Use what you know to diagnose the customer's problem. (In particular, why is the problem sporadic?)

We are using a C++ wrapper around Win32 timers. During object destruction, we deactivate the timer by following the recommended pattern: ::SetThreadpoolTimer(this->GetHandle(), nullptr, 0, 0); This works fine, but in some rare scenarios, we encounter this crash.

contoso!std::unique_ptr<WinAPI::ThreadPool::Timer<...>, ...>::reset+0x23
contoso!Contoso::SharedMemoryCache::`scalar deleting destructor'+0x14
contoso!`dynamic atexit destructor for 'Extension::s_extension''+0x23

Any pointers would be appreciated.

Comments (6)
  1. Joshua says:

    Well, you managed to pick a case where it doesn’t matter, but in general closing sockets from DLL_PROCESS_DETACH is a lot better than letting them fall off the process, because this generates a graceful shutdown that the other side can recognize as graceful. In particular, this causes the reader on the other side to get EOF rather than an error condition on read, and so know it reached the end and handle it accordingly.

    1. Kevin says:

      Network considerations require the other side to handle an error condition gracefully. The network can go down at any time. Ergo, it should be reasonably safe (if perhaps a bit impolite) to let the OS close the connection.

      You should probably also have an explicit in-band indication that the connection is about to close (e.g. the Connection header in HTTP). That way, you don’t need to depend on the TCP FIN handshake. But, fundamentally, if the other side is no longer talking to you, there’s little you can do about it, other than trying to reconnect or going away and finding someone else to talk to.

      1. Joshua says:

        Handle error condition gracefully != never have a success condition.

  2. Alois Kraus says:

    I guess the customer is not always keeping correctly track of its shared_ptr which on exit may or may not trigger the dtor of the SharedMemoryCache. Another cause might be that the Extension::s_extension is not always loaded but only under specific circumstances which might also be the cause for sporadic crashes in specific process environments.

  3. David Haim says:

    Speaking of WinHTTP and the thread pool, why does WinHTTP’s thread pool open so many threads for asynchronous requests? I remember when I tried to make about 10,000 asynchronous connections, the thread pool hit the worker count limit (512), and the application pretty much froze. No matter how much I tried to optimize the code, it appeared that WinHTTP didn’t really think it over before asking for a new thread from the thread pool.

    Libcurl in this case was an order of magnitude better, and using the “multi” interface with just one thread I was able to receive the entire responses without having any of them freeze or die in the middle.

    I would definitely want a winhttp internals guide.

  4. Dave Bacher says:

    There’s some program — contososerver. It has loaded an Extension named contoso, which has some tuple based cache. Some of the tuples reference the same SharedMemoryCache object, and when the last reference goes stale, it is releasing the SharedMemoryCache. That uses a ThreadPool that is also a static / global object. The ThreadPool has already been reclaimed by the C runtime.

    Since you can’t really determine the order here, barring whatever interface contososerver is using to talk to contoso having some explicit destroy call, your best bet may be to actually handle this exception. Ideally, if you control both (the names imply that), you’d want to have your extension interface perform the cleanup after WM_CLOSE but before WM_DESTROY (so likely the WM_CLOSE handler after you’ve made the determination you’re not going to veto), so that you have a working message pump if your extension or a dependency needs it.

    At any rate, waiting until detach would generally be the wrong thing.  In this specific case, if you didn’t control the interface (but it looks like they did), it would probably be OK to catch the exception, ignore it and continue. Probably.

