The case of the orphaned LpdrLoaderLock critical section

A customer found that under heavy load, their application would occasionally hang. Debugging the hang shows that everybody was waiting for the Lpdr­Loader­Lock critical section. The catch was that the critical section was reported as locked, but the owner thread did not exist.

0:000> !critsec ntdll!LdrpLoaderLock

CritSec ntdll!LdrpLoaderLock+0 at 7758c0a0
WaiterWoken        No
LockCount          4
RecursionCount     2
OwningThread       150c
EntryCount         0
ContentionCount    4
*** Locked

0:000> ~~[150c]k
              ^ Illegal thread error in '~~[150c]k'

How can a critical section be owned by thread that no longer exists?

There are two ways this can happen. One is that there is a bug in the code that manages the critical section such that there is some code path that takes the critical section but forgets to release it. This is unlikely to be the case for the loader lock (since a lot of really smart people are in charge of the loader lock), but it's a theoretical possibility. We'll keep that one in our pocket.

Another possibility is that the code to exit the critical section never got a chance to run. For example, somebody may have thrown an exception across the stack frames which manage the critical section, or somebody may have tried to exit the thread from inside the critical section, or (gasp) somebody may have called Terminate­Thread to nuke the thread from orbit.

I suggested that the Terminate­Thread theory was a good one to start with, because even if it wasn't the case, the investigation should go quickly because the light is better. You're not so much interested in finding it as you are in ruling it out quickly.¹

The customer replied, "We had one call to Terminate­Thread in our application, and we removed it, but the problem still occurs. Is it worth also checking the source code of the DLLs our application links to?"

Okay, the fact that they found one at all means that there's a good chance others are lurking.

Before we could say, "Yes, please continue your search," the customer followed up. "We found a call to Terminate­Thread in a component provided by another company. The code created a worker thread, and decided to clean up the worker thread by terminating it. We commented out the call just as a test, and it seems to fix the problem. So at least now we know where the problem is and we can try to fix it properly."

¹ In the medical profession, there's the term ROMI, which stands for rule out myocardial infarction. It says that if a patient comes to you with anything that could possibly remotely be the symptom of a heart attack, like numbness in the arm or chest pain, you should perform a test to make sure. Even though the test is probably going to come back negative, you have to check just to be safe. Here, we're ruling out Terminate­Thread, which I guess could go by the dorky acronym ROTT.

Comments (11)
  1. Joshua says:

    Ah yes ROMI they decided to not believe me when I told them straight up it wasn't one. There's no such thing as a slow (weeks) heart attack.

    While there valid uses of TerminateThread, any uses of that pattern where the program is a whole-prohram transform (no COM for you whatsoever and most of GDI is walking on eggshells after that.)

  2. alegr1 says:

    The checklist:

    1. TerminateThread, SuspendThread.

    2. _exit.

    3. CreateThread, _beginthread (instead of proper _beginthreadex).

  3. Maurits says:

    TIL ~~[thread-id] syntax.

  4. JM says:

    @Joshua: haven't you ever watched House? Everybody lies. Even if they seemingly have no reason to. :-)

    Besides, can you imagine the headlines if a patient dies of a heart attack and the doctors have to defend themselves with "he told us it couldn't be one"? If nothing else, who says the symptoms you were feeling for weeks couldn't be unrelated and you happened to have a heart attack at that very moment? As unlikely as it is, if the test is sufficiently cheap, there's no reason not to do it. Hickam's dictum: patients can have as many diseases as they damn well please.

  5. Dan Bugglin says:

    @JM "OF COURSE I TRIED REBOOTING ARE YOU CALLING ME A LIAR oh hey it works now thanks"

  6. Ken Hagan says:

    ROMI is actually different from "the light is better". The logic behind ROMI is that the cost of the condition is so high that you need to test for that one first. The logic of the lamp-post is that the cost of the test is so low that it doesn't hurt to do that one first. (The two could actually conflict, if testing for MI was greater than testing for TT.)

  7. Gabe says:

    In order to be able to reproduce this problem reliably, I'm going to guess that they start a worker thread, determine the work item is unnecessary, and cancel it via TerminateThread. Since the system is under heavy load, the thread has not finished initializing before being terminated, leaving the loader lock abandoned.

    So I guess if you want to make sure you can safely terminate a thread, do it from the loader lock?

  8. Joshua says:

    @Gabe: HeapAlloc! I wasn't kidding about whole program transform.

  9. immibis says:

    @Gabe: Nothing makes the loader lock special. The thread could be holding any other lock.

    You could try and terminate the thread while holding every lock, but good luck with that.

  10. Cheong says:

    @Joshua: Although heart attacks usually comes quick and sudden, blockage of blood vassals can take months to develop.

    There could be case that the reduced blood flow won't affect daily activities (for now), but will case problem when he's doing exercise.

  11. Marc K says:

    @The MAZZTer: "Yes, I could reboot and that would probably temporarily resolve the problem.  But, then we would lose this opportunity for you to find the bug in your product and develop a permanent fix." is something I find myself telling vendors these days.  To me, rebooting is no longer a suitable resolution unless you already have enough information to produce a root cause analysis.

Comments are closed.