Partially eliminating the need for SetThreadpoolCallbackLibrary and reducing the cost of FreeLibraryAndExitThread

Update: Daniel points out that there is still a race condition here, so this trick won't work. Rats.

The documentation for the Set­Threadpool­Callback­Library function says

This prevents a deadlock from occurring when one thread in DllMain is waiting for the callback to end, and another thread that is executing the callback attempts to acquire the loader lock.

If the DLL containing the callback might be unloaded, the cleanup code in DllMain must cancel outstanding callbacks before releasing the object.

Managing callbacks created with a TP_CALLBACK_ENVIRON that specifies a callback library is somewhat processor-intensive. You should consider other options for ensuring that the library is not unloaded while callbacks are executing, or to guarantee that callbacks which may be executing do not acquire the loader lock.

I'm not going to help you with the DllMain cleanup issues. (My plan is to simply avoid the issue by preventing the DLL from unloading while a callback is still pending. That way, you never have to cancel the callback from DllMain.) But I am going to help with the "consider other options for ensuring that the library is not unloaded while callbacks are executing."

The first-pass solution is to use the same trick we use when creating worker threads: We bump the DLL reference count when queueing the work item and use Free­Library­When­Callback­Returns to decrement the reference count after the callback finishes. (We can't use Free­Library­And­Exit­Thread, of course, since we're running on a thread on loan to us from the thread pool. Exiting the thread from a thread pool callback is like demolishing the house you're renting.)
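
As a concrete sketch of that first pass (QueueWork is a made-up example function, g_hinstSelf is the DLL's own instance handle as in the code further down, and error handling is abbreviated):

void CALLBACK WorkCallback(PTP_CALLBACK_INSTANCE pci, void *context)
{
 // ... do the actual work with context ...

 // Let the thread pool drop our DLL reference after this callback
 // returns, so the FreeLibrary doesn't happen inside our own code.
 FreeLibraryWhenCallbackReturns(pci, g_hinstSelf);
}

BOOL QueueWork(void *context)
{
 // Bump the DLL reference count so the DLL cannot unload while
 // the callback is still pending.
 HMODULE hmod;
 if (!GetModuleHandleEx(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS,
                        reinterpret_cast<LPCTSTR>(g_hinstSelf),
                        &hmod)) return FALSE;

 if (!TrySubmitThreadpoolCallback(WorkCallback, context, NULL)) {
  FreeLibrary(g_hinstSelf); // nothing was queued; undo the bump
  return FALSE;
 }
 return TRUE;
}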

The second-pass solution is to manage the DLL reference count manually. (Don't go down this route unless your profiling suggests that DLL reference count management is a performance bottleneck.) The rule is still that the DLL reference count is prevented from dropping to zero while a callback is pending, but instead of incrementing the reference count each time we schedule a callback, we'll increment it only when the number of callbacks goes from zero to nonzero. Conversely, we decrement the reference count only when the number of callbacks drops from nonzero to zero.

You can think of this as proxying the reference count, similar to how COM creates proxies that collapse Add­Ref and Release calls and signal the remote object only when the reference count transitions from zero to nonzero or vice versa.

This optimization works for Free­Library­And­Exit­Thread, too, so let's fold that in while we're there.

LONG g_lProxyRefCount = 0;

BOOL ProxyAddRefThisDll()
{
 if (InterlockedIncrement(&g_lProxyRefCount) == 1) {
  // First outstanding callback: take a real reference on the DLL.
  HMODULE hmod;
  return GetModuleHandleEx(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS,
                           reinterpret_cast<LPCTSTR>(g_hinstSelf),
                           &hmod);
 }
 return TRUE;
}

// Note: as the update at the top says, this function has a race
// between the decrement and the code that runs after it.
void ProxyFreeLibraryAndExitThread(DWORD dwExitCode)
{
 if (InterlockedDecrement(&g_lProxyRefCount) == 0) {
  FreeLibraryAndExitThread(g_hinstSelf, dwExitCode);
 } else {
  ExitThread(dwExitCode);
 }
}

void ProxyFreeLibraryWhenCallbackReturns(PTP_CALLBACK_INSTANCE pci)
{
 if (InterlockedDecrement(&g_lProxyRefCount) == 0) {
  FreeLibraryWhenCallbackReturns(pci, g_hinstSelf);
 }
}
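
Here is a sketch of how a caller might pair these helpers (same made-up QueueWork/WorkCallback names as before; the submission-failure path is omitted for brevity):

void CALLBACK WorkCallback(PTP_CALLBACK_INSTANCE pci, void *context)
{
 // ... do the actual work with context ...

 // Only the transition from one outstanding callback to zero asks
 // the thread pool to release the real DLL reference.
 ProxyFreeLibraryWhenCallbackReturns(pci);
}

BOOL QueueWork(void *context)
{
 // Only the transition from zero outstanding callbacks to one takes
 // a real DLL reference (inside ProxyAddRefThisDll).
 if (!ProxyAddRefThisDll()) return FALSE;

 return TrySubmitThreadpoolCallback(WorkCallback, context, NULL);
}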
Comments (23)
  1. Joshua says:

    Better solution: don't get the problem in the first place. In my experience, thread pools are obsolete.

  2. alegr1 says:

    What if you need to have a thread (or other thread-related constructs) in a DLL which is managed only by LoadLibrary/FreeLibrary, and doesn't have a dedicated export to shut it down outside of the loader lock? For example, you need to add a thread to a DLL used by applications, and you cannot change the calling application?

    The solution is to have the first (legacy interface) DLL load a secondary DLL which actually contains the code executed by the threads. On the first DLL's DLL_PROCESS_DETACH (unload), it should call a function in the secondary DLL to signal the shutdown of the threads (only signal, not wait!). The threads in the second DLL would also wait for all thread pool callbacks and other asynchronous rundown to complete. Then the threads would exit via FreeLibraryAndExitThread. Bingo! Both DLLs are now unloaded.
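
    A rough sketch of that arrangement, with made-up names (Helper.dll owns the worker threads, each thread holds its own reference to it, and Legacy.dll only signals):

    // In Helper.dll:
    HINSTANCE g_hinstHelper;  // saved at DLL_PROCESS_ATTACH
    HANDLE g_hShutdownEvent;  // manual-reset event, created at attach

    DWORD WINAPI WorkerThread(void *)
    {
     // ... do work until shutdown is signaled ...
     WaitForSingleObject(g_hShutdownEvent, INFINITE);
     // ... wait for outstanding callbacks and other asynchronous rundown ...

     // Release this thread's reference to Helper.dll and exit without
     // ever returning into code that may already be unloaded.
     FreeLibraryAndExitThread(g_hinstHelper, 0);
    }

    extern "C" void SignalShutdown()
    {
     SetEvent(g_hShutdownEvent);  // signal only; never wait in DllMain
    }

    // In Legacy.dll's DllMain, on DLL_PROCESS_DETACH:
    //  SignalShutdown();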

  3. RaceProUK says:

    @Joshua – Except in web servers, RDBMSs, in fact anything that needs to handle an arbitrary number of parallel requests.

  4. Joshua says:

    @RaceProUK: My benchmarks show that one thread per connection is the right way to do it: the thread startup cost is negligible, and memory is now cheap enough that the thread stacks cost less than the complexity of using a thread pool.

    [Good luck scaling to more than a few thousand connections. (See also: The C10K problem.) -Raymond]
  5. Daniel says:

    The code for ProxyFreeLibraryAndExitThread actually contains a serious race condition:

    If the last two threads call it at the same time, it's possible that:

    1. Thread A calls InterlockedDecrement first (receives 1), so it takes the ExitThread branch.

    2. Before thread A gets any further, thread B calls InterlockedDecrement, receives 0, and calls FreeLibraryAndExitThread.

    3. Thread A now would like to continue, but its code has been dropped in the meantime…

    [You're right. Rats. -Raymond]
  6. Dan Bugglin says:

    I don't have too much experience with thread pools, but when I had a task I needed to run 2500 times, in parallel, but obviously only a handful at once, just assigning a thread pool to handle them all was like delicious MAGIC.  It worked very nicely.

    (Specifically, the task was sending a network packet to a game server to query the game information and waiting for the response, then parsing it, then invoking the UI thread to update with the game information in a server list.)

  7. Dan Bugglin says:

    Addendum: I tried launching 2500 threads at once on my first attempt.  It did not end well.

  8. Jim Lyon says:

    There is a far, far simpler solution that works for many use cases (many, but not all):

    During initialization, increase the reference count on your DLL. Never decrease it. You are guaranteed to never get unloaded until the process terminates. If this is OK for your situation, then all other bookkeeping is moot.
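
    A minimal illustration of that approach (the function name is made up):

    void KeepThisDllLoadedForever()
    {
     HMODULE unused;
     GetModuleHandleEx(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS,
                       reinterpret_cast<LPCTSTR>(&KeepThisDllLoadedForever),
                       &unused);  // the matching FreeLibrary is never called
    }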

  9. alegr1 says:


    If every workitem was also waiting for the response, you've been doing it wrong. You should have used completion ports to receive responses.

  10. Joshua says:

    [Good luck scaling to more than a few thousand connections. (See also: The C10K problem.) -Raymond]

    I'd hit database overloaded and killed first.

    If actually facing the C10K problem on the frontend, I'd move the whole thing to the UNIX world so each thread could do 500 connections by using select(). (Select caps out at 64 in the Windows world.)

  11. DWalker says:

    Re the C10K problem:  I used to work for an airline.  In the 1970s, 1980s, and 1990s, airline reservations systems had 80,000 terminals connected at once, and could handle 5,000 transactions per second.  Now, all of those travel agents are likely using PCs instead of directly-connected terminals.  I don't work for an airline anymore, so I don't know.

    Today's PCs are likely more powerful than mainframes of that era, and they certainly have more memory.  I do know that the TPF operating system is a fascinating thing.  

  12. Cesar says:

    @Joshua: the correct solution for Windows is not select; it is something more complicated. Take a look at libevent2, which does the right thing on whatever operating system you are using.

  13. Matt says:


    If you want to pin the library so it never exits, then pin it instead of just messing with the reference count:

    void PinThisDll()
    {
     HMODULE _ignore;
     GetModuleHandleEx(GET_MODULE_HANDLE_EX_FLAG_PIN |
                       GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS,
                       reinterpret_cast<LPCTSTR>(&PinThisDll),
                       &_ignore);
    }
  14. Gabe says:

    Joshua: In my experience, select() is obsolete. In fact, the C10k problem exists partially due to the poor scalability of select() — the same list of 500 sockets has to be traversed by the kernel for every single call. If you are stuck using a select-based framework, use WSAAsyncSelect which has no limit on sockets (it's more like Linux's epoll than select). However, the proper way to do it is to use a completion port (and implicitly a thread pool).

  15. voo says:

    Haven't we progressed a bit since the old days of "use one thread per connection" by now? I'd at least hope so; all those asynchronous event-driven architectures exist for a reason, after all. You still want thread pools, obviously, but only with about as many threads as there are cores available to the machine.

  16. Deduplicator says:

    About "use one thread per connection is obsolete": Better "use the right tool for the job".

    If any client interaction is dominated by waiting, not by frantic processing, and you really have to serve many connections, one thread/process per connection is too heavyweight.

    If there are only ever a handful of connections, or you must do heavy processing for each client, the resource overhead is negligible, while the conceptual overhead of keeping things asynchronous might be telling.

    One size does not fit all.

  17. Joker_vD says:

    All this reference counting stuff is a bit ominous, wouldn't you agree? The reference counting has to be done by both the thing being referenced AND the things that reference it, and may God help you if someone, just someONE, slips a +1 or -1.

  18. Henke37 says:

    So what does the thread pool do when someone exits the thread? Most likely it just shrugs and moves on, removing the old thread from whatever internal lists it keeps.

  19. Crescens2k says:


    I would disagree that the reference counting needs to be done by the thing referencing it, too.

    For the COM reference counting, I normally go along the lines of one reference = one variable. This is along similar lines, one reference = one thread.

    While it is still possible to slip up, like doing a double Release on a single pointer or missing a Release and leaving an outstanding reference, I have noticed that most of the problems I have had with reference counting come from the server side, and even that is rare and easy to fix.

  20. Adam Rosenfield says:

    @Joshua: select() caps out at 64 sockets by default, but according to the docs, you can actually #define FD_SETSIZE to whatever value you want before #include'ing <Winsock2.h> (…/ms739169%28v=vs.85%29.aspx).

    That said, as others have said, select() doesn't scale well to large numbers of sockets due to the need to iterate through the entire FD_SET.  A callback-oriented API like WSAAsyncSelect() or epoll() scales much better.  Particularly on Linux, where select() is O(largest FD) but epoll() is O(number of FDs).
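
    For reference, the override looks like this (1024 is just an example value):

    #define FD_SETSIZE 1024  // must appear before the Winsock header
    #include <winsock2.h>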

  21. Joker_vD says:

    @Crescens2k: One reference = one variable? That reminds me of a story about how the stack could have been implemented: …/olerant.html — scroll to the end, to the box starting with "Long, long time ago…".

  22. Anonymous Coward says:

    Joker_vD: I don't think someone who doesn't understand the point of thinking in terms of interfaces, or what problem reference counting solves, is in any position to criticise OLE/COM.

  23. alegr1 says:


    The writer of the "OLE rant" doesn't get OLE at all. Here is a hint: object ownership/lifetime tracking is a VERY difficult problem. He thinks that instead of reference counting you should just delete the object when you're done with it: "there is very little need for refcounting as long as you agree not to destroy the object while you are using its interfaces". But knowing when you're done using its interfaces is the hardest part. And that's what reference counting makes extremely easy.

Comments are closed.