Crash in Unloaded_mydll.dll – how to find out who called unloaded dll.

The other day I happened to came across an interesting crash dump. It was clear that something called our module which was already unloaded and so whole application crashed. On the first look it was however very unclear what code in what module called our dll. This case was closed as external cause, but I wanted to dig deeper to understand what actually happened.

Symptoms

The crash dump did have following confusing stack trace:

 0:001> k
Child-SP          RetAddr           Call Site
0000007b`cee4fe60 0000007b`d091e970 <Unloaded_myproductiondll.dll>+0x18361
0000007b`cee4fe68 00000000`00000000 0x7b`d091e970

The exception record didn't contain much more helpful info. The last value of instruction pointer was really pointing to address within our module (as we'll see later):

 0:001> .ecxr
rax=0000000000000001 rbx=0000007bd091e970 rcx=000007feed8f33aa
rdx=0000000000000000 rsi=0000000000000001 rdi=0000000000000000
rip=000007fee6528361 rsp=0000007bcee4fe60 rbp=0000000000000000
r8=0000007bcee4fe28  r9=0000000000000000 r10=0000000000000000
r11=0000000000000246 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl nz na pe nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010200
<Unloaded_myproductiondll.dll>+0x18361:
000007fe`e6528361 ??              ???

Investigation steps

First I tried to have a look what modules are in the address space and where does the instruction pointer point to:

 0:001> lm
(…)

Unloaded modules:
000007fe`e6340000 000007fe`e6506000   msxml3.dll
000007fe`ec700000 000007fe`ec8f3000   WININET.dll
000007fe`ed520000 000007fe`ed65d000   urlmon.dll
000007fe`eb0c0000 000007fe`eb0c5000   Normaliz.dll
000007fe`eae70000 000007fe`eb0bc000   iertutil.dll

000007fe`e6510000 000007fe`e6548000 myproductiondll.dll 
000007fe`ed890000 000007fe`ed8e4000   WS2_32.dll
000007fe`ed370000 000007fe`ed379000   NSI.dll 

We can see that our module (myproductiondll) is in address space, last instruction pointer value is in it’s ranges. Also it’s not currently loaded. So we can try to reload it the standard way:

 0:001> !sym noisy
noisy mode - symbol prompts on

0:001> .reload /f myproductiondll*

0:001> lmvm myproductiondll.dll
start             end                 module name

Unloaded modules:
000007fe`e6510000 000007fe`e6548000   myproductiondll.dll
    Timestamp: Fri Oct 14 12:00:00 2011 (4E98E6E2)
    Checksum:  0003E937
    ImageSize:  00038000

That didn’t work (as it doesn’t load modules that were already unloaded from address space). However verbose output of module list (lmvm) gives us little more info about the binary – including the timestamp. This should be enough to locate the actual binary on the release share location. Let’s point executable image path to this location (you can read more on executable image path in my blog post about setting up windows debugging):

 0:001> .exepath+ \\foo\bar\Vxx.x\productionbinaries
DBGHELP: Symbol Search Path: srv*c:\mysymcache*https://symbolserver
Executable image search path is: \\foo\bar\Vxx.x\productionbinaries
Expanded Executable image search path is: \\foo\bar\Vxx.x\productionbinaries

The module output gave us the memory ranges where the module was loaded – we can compute size of the module and then force reload the module by supplying base address and size of the module:

 0:001> ? 000007fe`e6548000 - 000007fe`e6510000
Evaluate expression: 229376 = 00000000`00038000

0:001> .reload myproductiondll.dll=000007fe`e6510000,00000000`00038000
DBGHELP: \\foo\bar\Vxx.x\productionbinaries\myproductiondll.dll - OK
DBGENG:  Partial symbol load found image \\foo\bar\Vxx.x\productionbinaries\myproductiondll.dll.
SYMSRV:  https://symbolserver/myproductiondll.pdb/F08FC95575DXXYXX9YYY8CDE0EAEBB581/file.ptr
SYMSRV:  myproductiondll.pdb from \\symbols\files: 2165760 bytes - copied         
DBGHELP: myproductiondll - private symbols & lines 
        c:\mysymcache\myproductiondll.pdb\F08FC95575DXXXXYXXF8CDE0EAEBB581\myproductiondll.pdb

Now once the module is fully loaded, we can have debugger to show us the stack trace once again:

 0:001> k
SYMSRV:    No longer indexed.
SYMSRV:  c:\mysymcache\kernel32.pdb\492C273DC1724466A04XXYXX91ACC1722\kernel32.pdb not found
SYMSRV:  https://symbolserver/kernel32.pdb/492C273DC1724466A04XXYXX91ACC1722/kernel32.pdb not found
DBGHELP: kernel32.pdb - file not found
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for kernel32.dll - 
DBGHELP: kernel32 - export symbols
Child-SP          RetAddr           Call Site
0000007b`cee4fe60 000007fe`ed241668 myproductiondll!ThreadWriteFunc+0x145 [x:\foo\bar.cxx @ 118]
0000007b`cee4fe90 000007fe`ed906e5d kernel32!BaseThreadInitThunk+0x18
0000007b`cee4fec0 00000000`00000000 ntdll!RtlUserThreadStart+0x21

So now we have a caller (it’s the same dll – so no external fault by the way), also if symbols were properly source indexed (you can read more on source indexing in my blog post about setting up windows debugging), then we also have code location

Root cause

In this particular case, the dll has a worker thread that is being spawned during construction and which exits during destruction. So destructor code needs to synchronize with the worker thread and this was the attempt to do that:

 (...)

        fOK = ::GetExitCodeThread
                (
                m_hWorkerThread,
                &dwRet
                );

        //
        //  If we get a STILL_ALIVE code back, we wait for worker to finish
        //

        if (STILL_ACTIVE == dwRet)
        {
            ::WaitForSingleObject
                    (
                    m_hWorkerExitEvent,
                    100000      // time out after 100 seconds
                    );
        }

(...)

Basically destructor is waiting on an event that the worker thread code is setting BEFORE exiting:

 }
(...)

    if ((NULL != pLocalPub) &&
        (NULL != pLocalPub->m_hEntryExitEvent))
    {
        ::SetEvent
                (
                pLocalPub->m_hEntryExitEvent
                );
    }

    ::ExitThread(0);  <-------------------- thread can be swapped here and never exit 
                                                       before the destructor finishes!!!

    return 0;         <-------------------- location of the crash
}

This can became problematic when the destructor is automatically being called by loader when unloading the dll (e.g. when member of this dll is being used as global variable on dllmain level of a different dll, which is even more dangerous – I’ll try to create different blog post discussing how global loader lock in such a case prevents from proper synchronization). Worker thread can be swapped right after call to ExitThread and the destructor code might be able to finish and system loader might be able to finish it’s work and unload the dll. Then once the scheduling mechanism swaps the worker thread back it is already executing code from unloaded module.

Resolution

In this particular case the resolution is very easy – when waiting for thread to finish one should always wait on handle of the thread which is automatically signaled by the OS once the thread exits (instead of relying on the thread code to signal some handle during it’s cleanup logic).

More generically – one should carefully synchronize between the worker threads and make sure the reference counting is always properly accomplished so that module is not being unloaded until all outstanding worker threads are done with their work.