The Debug Ninja speaks: Debugging a stop 0x20

Hello, I am the Debug Ninja. Recently Jeff approached me about contributing to this debugging blog, and as the Debug Ninja I felt an obligation to share at least a small amount of Ninja knowledge with the world. Today I will start by explaining how to debug stop 20 blue screens. Unlike typical blue screens where debugging starts with stack analysis, a stop 20 requires a different approach.

 

Now you are probably wondering, “Great Debug Ninja, what is a stop 20 blue screen?” A stop 20’s literal translation is KERNEL_APC_PENDING_DURING_EXIT. In common language, that means we attempted to terminate a thread while Asynchronous Procedure Calls (APCs) were disabled for that thread. The operating system forces a bugcheck under these conditions because APCs still being disabled at thread termination means a driver disabled APCs more times than it enabled them. Left alone, such bugs usually surface later as crashes or hangs that are difficult to debug, so we stop the system at thread termination to make debugging easier.

 

Perhaps you are now asking, “How might a driver disable APCs more times than it enables them?” Good question, Grasshopper. As described in the WDK, a driver can disable APCs by entering a critical region, entering a guarded region, or raising the IRQL to APC_LEVEL or higher. However, not all of those methods will result in a stop 20 bugcheck. Only calls that change the APC disable count in the KTHREAD structure can result in a stop 20. The APIs KeEnterCriticalRegion and FsRtlEnterFileSystem decrement the APC disable count, as do KeWaitForSingleObject, KeWaitForMultipleObjects, and KeWaitForMutexObject when the wait acquires a mutex. A driver should then call KeLeaveCriticalRegion, FsRtlExitFileSystem, or KeReleaseMutex, respectively, to re-enable APCs; these calls increment the APC disable count in the KTHREAD structure.
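
Grasshopper, the bookkeeping can be pictured with a small user-mode model. This is only a sketch and not the kernel's implementation: the model_thread type and the model_* function names are invented for illustration, and the real counter lives inside KTHREAD where drivers never touch it directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the one piece of thread state we care about. */
typedef struct {
    short kernel_apc_disable;  /* 0 = APCs enabled, negative = disabled */
} model_thread;

/* Models KeEnterCriticalRegion, FsRtlEnterFileSystem, or acquiring a mutex. */
void model_disable_apcs(model_thread *t) {
    assert(t != NULL);
    t->kernel_apc_disable--;
}

/* Models KeLeaveCriticalRegion, FsRtlExitFileSystem, or KeReleaseMutex. */
void model_enable_apcs(model_thread *t) {
    assert(t != NULL);
    t->kernel_apc_disable++;
}

/* Models the check at thread exit: true means a stop 0x20 would be raised. */
bool model_exit_would_bugcheck(const model_thread *t) {
    assert(t != NULL);
    return t->kernel_apc_disable != 0;
}
```

In this model, two disables followed by one enable leave the count at -1, so the exit check reports a bugcheck until the second enable arrives.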

 

As you review the APIs mentioned above you will see that there are several ways for a driver writer to get into a situation where APCs are disabled and not re-enabled. Many of the ways we get into this situation are difficult to debug and require instrumentation that is beyond the scope of this blog. In this blog we are going to focus on the most common cause of a stop 20 blue screen: an orphaned ERESOURCE. A brief review of the WDK documentation for ExAcquireResourceExclusiveLite and ExAcquireResourceSharedLite will reveal that before you can acquire an ERESOURCE you must first disable normal kernel APC delivery by calling KeEnterCriticalRegion. This means that if you orphan an ERESOURCE you will leave the APC disable count decremented, and when the thread is terminated the system will bugcheck.

 

Now you certainly want to ask “Kind Ninja, will you show me how to debug such a problem?” Absolutely Grasshopper!

 

We start by opening the dump and checking the cause of the crash.

 

1: kd> .bugcheck
Bugcheck code 00000020
Arguments 00000000 0000fffc 00000000 00000001
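
As I understand the bug check reference, the second argument of a stop 0x20 is the thread's APC disable count. The helper below is a minimal sketch for reading it, assuming the count is a signed 16-bit value stored in the low half of the argument. That width is my assumption for this dump, and the function name is made up for illustration.

```c
#include <stdint.h>

/* Assumption: the low 16 bits of the argument hold a signed disable count. */
int apc_disable_count_from_arg(uint32_t bugcheck_arg2) {
    uint16_t low = (uint16_t)(bugcheck_arg2 & 0xFFFFu);
    /* Sign-extend by hand so the result does not rely on
       implementation-defined narrowing conversions. */
    return (low & 0x8000u) ? (int)low - 0x10000 : (int)low;
}
```

Read that way, 0000fffc is -4: the count sits below zero, which is what we expect when APCs were disabled more times than they were enabled.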

 

Next we check which thread was being terminated. We can see this in the call stack as the first parameter to PspTerminateThreadByPointer, here 8bf99330.

 

1: kd> kb
ChildEBP RetAddr Args to Child
b5e57c80 8094c546 00000020 00000000 0000fffc nt!KeBugCheckEx+0x1b
b5e57d18 8094c63f 00000000 00000000 8bf99330 nt!PspExitThread+0x64c
b5e57d30 8094c991 8bf99330 00000000 00000001 nt!PspTerminateThreadByPointer+0x4b
b5e57d54 8088978c 00000000 00000000 05c2ffb8 nt!NtTerminateThread+0x71
b5e57d54 7c8285ec 00000000 00000000 05c2ffb8 nt!KiFastCallEntry+0xfc

Finally we can look at the list of ERESOURCE structures with !locks to see if our thread owns any of these locks.

 

1: kd> !locks
**** DUMP OF ALL RESOURCE OBJECTS ****
KD: Scanning for held locks....

Resource @ Ninja!NinjaLock (0x808a48c0) Shared 2 owning threads
    Contention Count = 35
     Threads: 8bf99330-02<*> 8c1d19f0-01<*>
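
When the lock list is long, matching the terminating thread against each Threads line by eye gets tedious. Here is a small sketch of a helper that does the matching on one line of saved output. The function name is hypothetical, and it assumes each owner is printed as a hexadecimal thread address, a dash, and a decimal count, as in the output above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns the count printed after "<thread>-" on a !locks "Threads:" line,
   or -1 if the thread is not listed on that line. */
int owner_count_on_line(const char *line, unsigned long thread) {
    assert(line != NULL);
    const char *p = strstr(line, "Threads:");
    if (p == NULL) {
        return -1;
    }
    p += strlen("Threads:");
    while (*p != '\0') {
        char *end;
        unsigned long addr = strtoul(p, &end, 16);
        if (end == p) {          /* not a hex number here, skip one character */
            p++;
            continue;
        }
        if (*end != '-') {       /* a number, but not an "address-count" pair */
            p = end;
            continue;
        }
        char *count_end;
        unsigned long count = strtoul(end + 1, &count_end, 10);
        if (count_end != end + 1 && addr == thread) {
            return (int)count;
        }
        p = count_end;
    }
    return -1;
}
```

Fed the Threads line above and the thread address 8bf99330 from the call stack, the helper returns 2. A thread that does not appear on the line yields -1.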

!locks shows us that the thread in question is a shared owner of the Ninja driver’s NinjaLock. The author of the Ninja driver needs to look at how the driver uses this ERESOURCE and determine why the lock was orphaned. Unfortunately, that means I need to do more work. To find the bug that caused this problem, I reviewed the code that uses NinjaLock. That code was acquiring NinjaLock inside a try-except block, and I forgot to release the lock in the exception handler, resulting in the orphaned lock that we see here. I guess that’s why I’m the Debug Ninja, and not the Code Writing Ninja.
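
The Ninja driver's source is not shown here, so the following is only a user-mode analogy of the mistake, not the driver code. It uses setjmp and longjmp to stand in for a try-except block, and every name in it (model_lock_holds, run_guarded, and so on) is invented for illustration.

```c
#include <setjmp.h>
#include <stdbool.h>

/* Hypothetical stand-in for the number of times the lock is held. */
static int model_lock_holds = 0;
static jmp_buf model_handler;

void model_acquire(void) { model_lock_holds++; }
void model_release(void) { model_lock_holds--; }

/* Work that may "raise an exception" by jumping to the handler. */
static void model_work(bool fail) {
    if (fail) {
        longjmp(model_handler, 1);
    }
}

/* Returns the number of holds left over; 0 means nothing was orphaned. */
int run_guarded(bool fail, bool release_in_handler) {
    model_lock_holds = 0;
    model_acquire();
    if (setjmp(model_handler) == 0) {
        model_work(fail);        /* the "try" body */
        model_release();         /* the normal path releases the lock */
    } else {
        /* the "except" handler: the release is easy to forget here */
        if (release_in_handler) {
            model_release();
        }
    }
    return model_lock_holds;
}
```

With fail set and release_in_handler clear, the function returns 1: one hold is orphaned, which is the situation !locks revealed. Releasing in the handler brings the result back to 0. In a real driver, a try-finally block around the acquire is another way to make sure the release runs on every path.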