Debugging a hang: Chasing the wait chain inside a process


Today we're going to debug a hang. Here are some of the (redacted) stacks of the process. I left some red herrings and other frustrations.

0: kd> !process ffffe000045ef940 7
PROCESS ffffe000045ef940
    SessionId: 1  Cid: 0a50    Peb: 7ff6b661f000  ParentCid: 0a0c
    DirBase: 12e5c6000  ObjectTable: ffffc0000288ae80  HandleCount: 1742.
    Image: contoso.exe

        THREAD ffffe000018d68c0  Cid 0a50.0a54  Teb: 00007ff6b661d000 Win32Thread: fffff90143635a90 WAIT: (WrUserRequest) UserMode Non-Alertable
            ffffe000046192c0  SynchronizationEvent

        nt!KiSwapContext+0x76
        nt!KiSwapThread+0x14c
        nt!KiCommitThreadWait+0x126
        nt!KeWaitForSingleObject+0x1cc
        nt!KeWaitForMultipleObjects+0x44e
        0xfffff960`0038bed0
        0x1
        0xffffd000`24257b80
        0xfffff901`43635a90
        0xd
        0xffffe000`00000001
        0xfffff803`ffffff00

        THREAD ffffe000045f88c0  Cid 0a50.0a8c  Teb: 00007ff6b64ea000 Win32Thread: 0000000000000000 WAIT: (UserRequest) UserMode Non-Alertable
            ffffe000041c1830  SynchronizationEvent

        nt!KiSwapContext+0x76
        nt!KiSwapThread+0x14c
        nt!KiCommitThreadWait+0x126
        nt!KeWaitForSingleObject+0x1cc
        nt!NtWaitForSingleObject+0xb1
        nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ ffffd000`248ebc40)
        ntdll!ZwWaitForSingleObject+0xa
        ntdll!RtlpWaitOnCriticalSection+0xe1
        ntdll!RtlEnterCriticalSection+0x94
        ntdll!LdrpAcquireLoaderLock+0x2c
        ntdll!LdrShutdownThread+0x64
        ntdll!RtlExitUserThread+0x3e
        KERNELBASE!FreeLibraryAndExitThread+0x4c
        combase!CRpcThreadCache::RpcWorkerThreadEntry+0x62
        KERNEL32!BaseThreadInitThunk+0x30
        ntdll!RtlUserThreadStart+0x42

        THREAD ffffe00003c46080  Cid 0a50.0a9c  Teb: 00007ff6b64e6000 Win32Thread: fffff90143713a90 WAIT: (UserRequest) UserMode Non-Alertable
            ffffe000041c1830  SynchronizationEvent

        nt!KiSwapContext+0x76
        nt!KiSwapThread+0x14c
        nt!KiCommitThreadWait+0x126
        nt!KeWaitForSingleObject+0x1cc
        nt!NtWaitForSingleObject+0xb1
        nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ ffffd000`367ece40)
        ntdll!ZwWaitForSingleObject+0xa
        ntdll!RtlpWaitOnCriticalSection+0xe1
        ntdll!RtlEnterCriticalSection+0x94
        ntdll!LdrpAcquireLoaderLock+0x4c
        ntdll!LdrpFindOrMapDll+0x75d
        ntdll!LdrpLoadDll+0x394
        ntdll!LdrLoadDll+0xc6
        kernelbase!LoadLibraryExW+0x142
        kernelbase!LoadLibraryExA+0x26
        contoso!__delayLoadHelper2+0x2b
        contoso!_tailMerge_Winmm_dll+0x3f
        contoso!PolarityReverser::OnCompleted+0x28
        contoso!PolarityReverser::Reverse+0xf4
        contoso!ListItem::ReversePolarity+0x7e
        contoso!View::OnContextMenu+0x8
        contoso!View::WndProc+0x25e
        user32!UserCallWinProcCheckWow+0x13a
        user32!DispatchClientMessage+0xf8
        user32!__fnEMPTY+0x2d
        ntdll!KiUserCallbackDispatcherContinue
        user32!ZwUserMessageCall+0xa
        user32!RealDefWindowProcWorker+0x1e2
        user32!RealDefWindowProcW+0x52
        uxtheme!_ThemeDefWindowProc+0x33e
        uxtheme!ThemeDefWindowProcW+0x11
        user32!DefWindowProcW+0x1b6
        comctl32!CListView::WndProc+0x25e
        comctl32!CListView::s_WndProc+0x52
        user32!UserCallWinProcCheckWow+0x13a
        user32!SendMessageWorker+0xa72
        user32!SendMessageW+0x10a
        comctl32!CLVMouseManager::HandleMouse+0xd10
        comctl32!CLVMouseManager::OnButtonDown+0x27
        comctl32!CListView::WndProc+0x1a4186
        comctl32!CListView::s_WndProc+0x52
        user32!UserCallWinProcCheckWow+0x13a
        user32!DispatchMessageWorker+0x1a7

        THREAD ffffe0000462b8c0  Cid 0a50.0ac0  Teb: 00007ff6b64dc000 Win32Thread: 0000000000000000 WAIT: (UserRequest) UserMode Non-Alertable
            ffffe0000462c980  NotificationEvent

        nt!KiSwapContext+0x76
        nt!KiSwapThread+0x14c
        nt!KiCommitThreadWait+0x126
        nt!KeWaitForSingleObject+0x1cc
        nt!NtWaitForSingleObject+0xb1
        nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ ffffd000`201e9c40)
        ntdll!ZwWaitForSingleObject+0xa
        KERNELBASE!WaitForSingleObjectEx+0xa5
        contoso!CNetworkManager::ThreadProc+0x94
        KERNEL32!BaseThreadInitThunk+0x30
        ntdll!RtlUserThreadStart+0x42

        THREAD ffffe000046ad340  Cid 0a50.0b38  Teb: 00007ff6b64b6000 Win32Thread: 0000000000000000 WAIT: (UserRequest) UserMode Non-Alertable
            ffffe000049108c0  Thread

        nt!KiSwapContext+0x76
        nt!KiSwapThread+0x14c
        nt!KiCommitThreadWait+0x126
        nt!KeWaitForSingleObject+0x1cc
        nt!NtWaitForSingleObject+0xb1
        nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ ffffd000`2563bc40)
        ntdll!ZwWaitForSingleObject+0xa
        KERNELBASE!WaitForSingleObjectEx+0xa5
        litware!CDiscovery::Uninitialize+0x8c
        litware!CApiInstance::~CApiInstance+0x48
        litware!CApiInstance::`scalar deleting destructor'+0x14
        litware!std::tr1::_Ref_count_obj<CApiInstance>::_Destroy+0x31
        litware!std::tr1::_Ref_count_base::_Decref+0x1b
        litware!std::tr1::_Ptr_base<CApiInstance>::_Decref+0x20
        litware!std::tr1::shared_ptr<CApiInstance>::{dtor}+0x20
        litware!std::tr1::shared_ptr<CApiInstance>::reset+0x3c
        litware!CSingleton<CApiInstance>::ReleaseRef+0x97
        litware!LitWareUninitialize+0xed
        fabrikam!CDoodadHelper::~CDoodadHelper+0x67
        fabrikam!_CRT_INIT+0xda
        fabrikam!__DllMainCRTStartup+0x1e5
        ntdll!LdrpCallInitRoutine+0x57
        ntdll!LdrpProcessDetachNode+0xfe
        ntdll!LdrpUnloadNode+0x77
        ntdll!LdrpDecrementNodeLoadCount+0xd0
        ntdll!LdrUnloadDll+0x34
        KERNELBASE!FreeLibrary+0x22
        combase!CClassCache::CDllPathEntry::CFinishObject::Finish+0x28
        combase!CClassCache::CFinishComposite::Finish+0x80
        combase!CClassCache::FreeUnused+0xda
        combase!CoFreeUnusedLibrariesEx+0x2c
        combase!CDllHost::MTAWorkerLoop+0x7d
        combase!CDllHost::WorkerThread+0x122
        combase!CRpcThread::WorkerLoop+0x4e
        combase!CRpcThreadCache::RpcWorkerThreadEntry+0x46
        KERNEL32!BaseThreadInitThunk+0x30
        ntdll!RtlUserThreadStart+0x42

        THREAD ffffe000046db8c0  Cid 0a50.0b50  Teb: 00007ff6b64aa000 Win32Thread: fffff9014370da90 WAIT: (UserRequest) UserMode Non-Alertable
            ffffe000046dcae0  NotificationEvent
            ffffe000046dd3c0  SynchronizationEvent

        nt!KiSwapContext+0x76
        nt!KiSwapThread+0x14c
        nt!KiCommitThreadWait+0x126
        nt!KeWaitForMultipleObjects+0x22b
        nt!ObWaitForMultipleObjects+0x1f8
        nt!NtWaitForMultipleObjects+0xde
        nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ ffffd000`21801c40)
        ntdll!ZwWaitForMultipleObjects+0xa
        KERNELBASE!WaitForMultipleObjectsEx+0xe1
        USER32!MsgWaitForMultipleObjectsEx+0x14e
        contoso!EventManagerImpl::MessageLoop+0x32
        contoso!EventManagerImpl::BackgroundProcessing+0x134
        ntdll!TppWorkpExecuteCallback+0x2eb
        ntdll!TppWorkerThread+0xa12
        KERNEL32!BaseThreadInitThunk+0x30
        ntdll!RtlUserThreadStart+0x42

        THREAD ffffe000049108c0  Cid 0a50.06cc  Teb: 00007ff6b6470000 Win32Thread: 0000000000000000 WAIT: (UserRequest) UserMode Non-Alertable
            ffffe000041c1830  SynchronizationEvent

        nt!KiSwapContext+0x76
        nt!KiSwapThread+0x14c
        nt!KiCommitThreadWait+0x126
        nt!KeWaitForSingleObject+0x1cc
        nt!NtWaitForSingleObject+0xb1
        nt!KiSystemServiceCopyEnd+0x13
        ntdll!ZwWaitForSingleObject+0xa
        ntdll!RtlpWaitOnCriticalSection+0xe1
        ntdll!RtlEnterCriticalSectionContended+0x94
        ntdll!LdrpAcquireLoaderLock+0x2c
        ntdll!LdrShutdownThread+0x64
        ntdll!RtlExitUserThread+0x3e
        KERNEL32!BaseThreadInitThunk+0x38
        ntdll!RtlUserThreadStart+0x42

Since debugging is an exercise in optimism, let's ignore the stacks that didn't come out properly. If we can't make any headway, we can try to fix them, but let's be hopeful that the stacks that are good will provide enough information.

Generally speaking, the deeper the stack, the more interesting it is, because uninteresting threads tend to be hanging out in their message loop or event loop, whereas interesting threads are busy doing something and have a complex stack trace to show for it.

Indeed, one of the deep stacks belongs to thread 0a9c, and it also has a very telling section:

        ntdll!RtlpWaitOnCriticalSection+0xe1
        ntdll!RtlEnterCriticalSection+0x94
        ntdll!LdrpAcquireLoaderLock+0x4c
        ntdll!LdrpFindOrMapDll+0x75d
        ntdll!LdrpLoadDll+0x394
        ntdll!LdrLoadDll+0xc6
        kernelbase!LoadLibraryExW+0x142
        kernelbase!LoadLibraryExA+0x26
        contoso!__delayLoadHelper2+0x2b
        contoso!_tailMerge_Winmm_dll+0x3f

The polarity reverser's completion handler is trying to load winmm via delay-load. That load request is waiting on a critical section, and it should be clear both from the scenario and the function names that the critical section it is trying to claim is the loader lock. In real life, I just proceeded with that conclusion, but just for demonstration purposes, here's how we can double-check:

0: kd> .thread ffffe00003c46080
0: kd> kn
  *** Stack trace for last set context - .thread/.cxr resets it
 # Call Site
00 nt!KiSwapContext+0x76
01 nt!KiSwapThread+0x14c
02 nt!KiCommitThreadWait+0x126
03 nt!KeWaitForSingleObject+0x1cc
04 nt!NtWaitForSingleObject+0xb1
05 nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ ffffd000`367ece40)
06 ntdll!ZwWaitForSingleObject+0xa
07 ntdll!RtlpWaitOnCriticalSection+0xe1
08 ntdll!RtlEnterCriticalSection+0x94
09 ntdll!LdrpAcquireLoaderLock+0x4c
0a ntdll!LdrpFindOrMapDll+0x75d
0b ntdll!LdrpLoadDll+0x394
0c ntdll!LdrLoadDll+0xc6
0d kernelbase!LoadLibraryExW+0x142
0e kernelbase!LoadLibraryExA+0x26
0f contoso!__delayLoadHelper2+0x2b
10 contoso!_tailMerge_Winmm_dll+0x3f

We need to grab the critical section passed to Rtl­Enter­Critical­Section, but since this is an x64 machine, the parameter was passed in registers, not on the stack, so we need to figure out where the rcx register got stashed.

I'm going to assume that the same critical section is the first (only?) parameter to Rtlp­Wait­On­CriticalSection. I don't know this for a fact, but it seems like a reasonable guess. The guess might be wrong; we'll see.

We disassemble the function and look to see where it stashes rcx.

0: kd> u ntdll!RtlpWaitOnCriticalSection
    mov     qword ptr [rsp+18h],rbx
    push    rbp
    push    rsi
    push    rdi
    push    r12
    push    r13
    push    r14
    push    r15
    mov     rax,qword ptr [ntdll!__security_cookie (000007ff`3099d020)]
    xor     rax,rsp
    mov     qword ptr [rsp+80h],rax
    mov     r14,qword ptr gs:[30h]
    xor     r12d,r12d
    lea     rax,[ntdll!LdrpLoaderLock (00007fff`d4f51cb8)]
    mov     r15d,r12d
    cmp     rcx,rax
    mov     ebp,edx
    sete    r15b
    mov     rbx,rcx // ⇐ Bingo

Awesome, we can suck rbx out of the trap frame.

0: kd> .trap ffffd000`367ece40
rax=0000000000000000 rbx=00007fffd4f51cb8 rcx=000007f8136f2c2a
rdx=0000000000000000 rsi=00000000000001e8 rdi=0000000000000000
rip=000007f8136f2c2a rsp=000000000cf7f798 rbp=0000000000000000
 r8=000000000cf7f798  r9=0000000000000000 r10=0000000000000000
r11=0000000000000344 r12=0000000000000000 r13=0000000000000000
r14=000007f696870000 r15=000000007ffe0382
iopl=0         nv up ei pl zr na po nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246
ntdll!ZwWaitForSingleObject+0xa:
000007f8`136f2c2a c3              ret

Okay, let's see if that value in rbx pans out.

0: kd> !cs 0x00007fff`d4f51cb8
-----------------------------------------
Critical section   = 0x00007fffd4f51cb8 (ntdll!LdrpLoaderLock+0x0)
DebugInfo          = 0x00007fffd4f55228
LOCKED
LockCount          = 0x8
WaiterWoken        = No
OwningThread       = 0x0000000000000b38
RecursionCount     = 0x1
LockSemaphore      = 0x1A8
SpinCount          = 0x0000000004000000

Hooray, we confirmed that this is indeed the loader lock. I would have been surprised if it had been anything else! (If you had been paying attention, you would have noticed the lea rax,[ntdll!LdrpLoaderLock (00007fff`d4f51cb8)] in the disassembly which already confirms the value.)

We also see that the owning thread is 0xb38. Here's its stack again:

        nt!KiSwapContext+0x76
        nt!KiSwapThread+0x14c
        nt!KiCommitThreadWait+0x126
        nt!KeWaitForSingleObject+0x1cc
        nt!NtWaitForSingleObject+0xb1
        nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ ffffd000`2563bc40)
        ntdll!ZwWaitForSingleObject+0xa
        KERNELBASE!WaitForSingleObjectEx+0xa5
        litware!CDiscovery::Uninitialize+0x8c
        litware!CApiInstance::~CApiInstance+0x48
        litware!CApiInstance::`scalar deleting destructor'+0x14
        litware!std::tr1::_Ref_count_obj<CApiInstance>::_Destroy+0x31
        litware!std::tr1::_Ref_count_base::_Decref+0x1b
        litware!std::tr1::_Ptr_base<CApiInstance>::_Decref+0x20
        litware!std::tr1::shared_ptr<CApiInstance>::{dtor}+0x20
        litware!std::tr1::shared_ptr<CApiInstance>::reset+0x3c
        litware!CSingleton<CApiInstance>::ReleaseRef+0x97
        litware!LitWareUninitialize+0xed
        fabrikam!CDoodadHelper::~CDoodadHelper+0x67
        fabrikam!_CRT_INIT+0xda
        fabrikam!__DllMainCRTStartup+0x1e5
        ntdll!LdrpCallInitRoutine+0x57
        ntdll!LdrpProcessDetachNode+0xfe
        ntdll!LdrpUnloadNode+0x77
        ntdll!LdrpDecrementNodeLoadCount+0xd0
        ntdll!LdrUnloadDll+0x34
        KERNELBASE!FreeLibrary+0x22
        combase!CClassCache::CDllPathEntry::CFinishObject::Finish+0x28
        combase!CClassCache::CFinishComposite::Finish+0x80
        combase!CClassCache::FreeUnused+0xda
        combase!CoFreeUnusedLibrariesEx+0x2c
        combase!CDllHost::MTAWorkerLoop+0x7d
        combase!CDllHost::WorkerThread+0x122
        combase!CRpcThread::WorkerLoop+0x4e
        combase!CRpcThreadCache::RpcWorkerThreadEntry+0x46
        KERNEL32!BaseThreadInitThunk+0x30
        ntdll!RtlUserThreadStart+0x42

Reading from the bottom up, we see that this thread is doing some work on behalf of COM; specifically, it is freeing unused libraries. The fabrikam library presumably responded S_OK to Dll­Can­Unload­Now, so COM says, "Okay, then out you go."

As part of DLL_PROCESS_DETACH processing, the C++ runtime library runs global destructors. The CDoodadHelper destructor calls into the LitWareUninitialize function in litware.dll. That function decrements a reference count, and it appears that the reference count went to zero because it's destructing the CApiInstance object. The destructor for that object calls CDiscovery::Uninitialize, and that function waits on a kernel object.

The debugger was kind enough to tell us what the object is:

        THREAD ffffe000046ad340  Cid 0a50.0b38  Teb: 00007ff6b64b6000 Win32Thread: 0000000000000000 WAIT: (UserRequest) UserMode Non-Alertable
            ffffe000049108c0  Thread

It's a thread.

Going back to the thread dump at the start, we also can see what thread ffffe000049108c0 is doing. Here it is again:

        nt!KiSwapContext+0x76
        nt!KiSwapThread+0x14c
        nt!KiCommitThreadWait+0x126
        nt!KeWaitForSingleObject+0x1cc
        nt!NtWaitForSingleObject+0xb1
        nt!KiSystemServiceCopyEnd+0x13
        ntdll!ZwWaitForSingleObject+0xa
        ntdll!RtlpWaitOnCriticalSection+0xe1
        ntdll!RtlEnterCriticalSectionContended+0x94
        ntdll!LdrpAcquireLoaderLock+0x2c
        ntdll!LdrShutdownThread+0x64
        ntdll!RtlExitUserThread+0x3e
        KERNEL32!BaseThreadInitThunk+0x38
        ntdll!RtlUserThreadStart+0x42

That thread is trying to acquire the loader lock so it can send DLL_THREAD_DETACH notifications. But the loader lock is held by thread 0xb38, the one inside FreeLibrary, and that thread is waiting for this thread to exit. Result: Deadlock, as the two threads are waiting for each other. (You can also see that thread 0xa8c is stuck in the same place because it too is trying to exit.)
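
The chase we just did by hand can be written down as a tiny wait graph. This is only a sketch for illustration, not a debugger feature: the WaitGraph type and the FindWaitCycle and WaitsFromThisDump helpers are made-up names, and the edges are simply the facts read off the dump above (a thread waiting on the loader lock is treated as waiting for the lock's owner).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Thread IDs as they appear after the dot in the Cid column of the dump.
using ThreadId = std::uint32_t;

// Hypothetical model: each blocked thread maps to the thread it is
// ultimately waiting for. For a critical section that is the owner;
// for a wait on a thread object it is that thread.
using WaitGraph = std::map<ThreadId, ThreadId>;

// Follow the chain from `start`. Returns the threads that form a cycle,
// or an empty vector if the chain ends at a thread that is not blocked.
std::vector<ThreadId> FindWaitCycle(const WaitGraph& waits, ThreadId start)
{
    std::vector<ThreadId> path;
    ThreadId current = start;
    for (;;) {
        // If we have already visited this thread, the chain has looped.
        for (std::size_t i = 0; i < path.size(); ++i) {
            if (path[i] == current) {
                return std::vector<ThreadId>(path.begin() + i, path.end());
            }
        }
        path.push_back(current);
        auto it = waits.find(current);
        if (it == waits.end()) {
            return {};  // chain ends at a thread that is not waiting
        }
        current = it->second;
    }
}

// The edges read off this particular dump.
WaitGraph WaitsFromThisDump()
{
    WaitGraph waits;
    waits[0xa9c] = 0xb38;  // UI thread: LoadLibrary wants the loader lock (owner 0xb38)
    waits[0xb38] = 0x6cc;  // lock owner: waiting on thread object ffffe000049108c0
    waits[0x6cc] = 0xb38;  // exiting thread: wants the loader lock (owner 0xb38)
    waits[0xa8c] = 0xb38;  // another exiting thread, same story
    return waits;
}
```

Starting from the hung UI thread 0xa9c, the chain runs 0xa9c, 0xb38, 0x6cc and then back to 0xb38, so the cycle is the pair 0xb38 and 0x6cc. The UI thread is not part of the cycle; it is merely stuck behind it.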

The underlying problem is that the Fabrikam DLL is waiting on a thread (indirectly via LitWare) while inside its own Dll­Main.
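
Here is a portable sketch of that shape, with the operating system taken out of the picture. Everything in it is a stand-in: a std::timed_mutex plays the loader lock, a std::thread plays the LitWare worker, and WorkerStarvedWhileOwnerWaits is an invented name. The real exiting thread waits forever; the sketch uses a bounded wait so that it terminates instead of reproducing the hang.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Returns true if the worker could not get the lock while the "unloading"
// thread held it and waited for the worker to finish.
bool WorkerStarvedWhileOwnerWaits()
{
    std::timed_mutex loaderLock;  // stand-in for the loader lock
    bool workerGotLock = false;

    // "FreeLibrary" path: take the lock and keep holding it.
    std::unique_lock<std::timed_mutex> owner(loaderLock);

    std::thread worker([&] {
        // "Thread exit" path: needs the lock before it can finish.
        // The real wait is unbounded; 50 ms keeps the demo finite.
        if (loaderLock.try_lock_for(std::chrono::milliseconds(50))) {
            workerGotLock = true;
            loaderLock.unlock();
        }
    });

    // Waiting for the thread while still holding the lock it needs.
    worker.join();

    return !workerGotLock;
}
```

The worker can never succeed, because the lock is not released until after the join returns. Replace the bounded wait with an unbounded one and you have the hang from the dump.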

The Fabrikam code could avoid this problem by calling Lit­Ware­Uninitialize when its last object is destroyed rather than when the DLL is unloaded. (Of course, it also has to remember to call Lit­Ware­Initialize when its first object is created.)
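
A minimal sketch of that idea, under loud assumptions: the article does not show the real signatures of LitWareInitialize and LitWareUninitialize, so the two functions below are counting stubs, and the LitWareLifetime and Doodad classes are hypothetical names invented for the example. This is not Fabrikam's actual code.

```cpp
#include <cassert>
#include <mutex>

// Counting stubs standing in for the real LitWare entry points
// (assumed signatures; the article does not show them).
static int g_litwareInitCalls = 0;
static int g_litwareUninitCalls = 0;
void LitWareInitialize()   { ++g_litwareInitCalls; }
void LitWareUninitialize() { ++g_litwareUninitCalls; }

// Hypothetical helper: ties LitWare's lifetime to the number of live
// objects instead of to DLL load and unload.
class LitWareLifetime {
public:
    static void AddRef() {
        std::lock_guard<std::mutex> guard(s_lock);
        if (s_count++ == 0) {
            LitWareInitialize();    // first object created
        }
    }
    static void Release() {
        std::lock_guard<std::mutex> guard(s_lock);
        assert(s_count > 0);
        if (--s_count == 0) {
            LitWareUninitialize();  // last object destroyed, not in DllMain
        }
    }
private:
    static std::mutex s_lock;
    static int s_count;
};
std::mutex LitWareLifetime::s_lock;
int LitWareLifetime::s_count = 0;

// Hypothetical object handed out by the DLL.
class Doodad {
public:
    Doodad()  { LitWareLifetime::AddRef(); }
    ~Doodad() { LitWareLifetime::Release(); }
    Doodad(const Doodad&) = delete;
    Doodad& operator=(const Doodad&) = delete;
};
```

With this arrangement the wait inside LitWareUninitialize happens on whatever thread destroys the last Doodad, which is ordinary code running outside the loader lock. By the time COM decides the DLL can be unloaded, there is nothing left for a global destructor to wait on.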

Comments (32)
  1. Joshua says:

    [... optimism ...] And the crash I debugged yesterday had a stack trace not come through for one thread because RSP was pointing to unallocated RAM. Guess which stack trace I was after.

  2. alegr1 says:

    That's what Internet Explorer loves to do. Or used to do: I gave up on IE a couple years ago. The fault may be with Flash, or other plugin.

    [Everybody wants a Web browser to have a plug-in model, but they don't realize that having a plug-in model means that plug-ins can screw up. -Raymond]
  3. Eric says:

    And this is why I'll stick to web programming.  Although I guess poring over Fiddler logs isn't really any better, maybe I'm just used to it.

  4. 12BitSlab says:

    Raymond, Microsoft needs to increase the support cost for Contoso a whole bunch.  Seems like their code is always causing you problems.

    :)

    P.S. Thanks for the debugging lesson.

  5. Joshua says:

    [Everybody wants a Web browser to have a plug-in model, but they don't realize that having a plug-in model means that plug-ins can screw up. -Raymond]

    And many years later we finally learned to host plug-ins in their own processes as much as possible (flash is a great example of one that can be and should be).

  6. Gabe says:

    So Thread 1 is inside the loader lock waiting for Thread B to exit, but Thread B can't exit because it is waiting on the loader lock that Thread 1 holds?

    I kinda saw that coming when I saw LdrpAcquireLoaderLock and figured that the earlier posts this week were leading up to this.

  7. KyleJ61782 says:

    And this is the reason why Raymond always says that doing anything substantial in DllMain is usually a bad idea.

  8. not important says:

    What about destructors for global objects? When are these executed? If they are executed when the DLL is unloaded then these destructors cannot call LitewareUninitialize. Maybe the moral of the story includes: do not do "big" things in your destructor (like call into third party code. Or wait on a thread. etc...). Because sometimes you do not control when the destructor is executed.

  9. Yuri says:

    Very interesting post, I like these type of debugging sessions.

    Used to read Mark Russinovich excellent step by step debugging, lots of useful things I learned there, sad that he stopped posting.

  10. Tim says:

    One of the problems I have with this blog is it shakes my belief that I'm the awesomest programmer in the land. I sure as heck can't read assembler or a stack dump like this. Thanks for another humility lesson, Raymond.

    <eyes crossed in confusion while bowing and proclaiming *I'm not worthy!*>

  11. Ian Boyd says:

    IE would love non-binary plugins (cf. another browsers). And with IE (and other browser) hosting each page in separate Low Mandatory Integrity Level processes, plugins can do zero damage and require no binaries to secure (only HTML and JavaScript).

  12. Joker_vD says:

    @Joshua: Are you proposing that instead of injecting plugins' code inside our address space, giving it a few pointers to our internal structures, we run them in separate processes, maybe even with different rights, giving them a few pipes for data exchange and RPC and stuff?

    That sounds like a great idea: if a plugin crashes, its process handle is signaled (and the pipe is broken, but it may break on its own). And if the host crashes, symmetrically, the plugin sees the broken pipe, and should exit. Unfortunately, plugins are not written well, so they don't account for host crashes, so they don't exit. Yay, zombie plugins!

  13. Harry Johnston says:

    @Joker_vD: easily solved; put the plugin process into a job object with the JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE flag.

  14. Neil says:

    Bah, I had a loader lock hang this morning but I wasn't running a symbols build so I was too lazy to debug it. (Also when I killed it and ran the debug build it didn't lock. Sigh.)

  15. – "Everybody wants a Web browser to have a plug-in model, but they don't realize that having a plug-in model means that plug-ins can screw up."

    Oh, believe me, everyone does realize that. Only the majority consensus is to accept it as part of the ecosystem and deal with it, because it is beneficiary. Mozilla Foundation and Google have succeeded. Ask yourself: Why Microsoft hasn't come up with an "app-free OS" concept yet!? (Office-free Windows! *Chuckle*)

  16. Joker_vD says:

    @Harry Johnston: Oh, but the host process is already in a job object. What do I do now? Nested jobs were introduced in Windows 8 only.

  17. cheong00 says:

    @Joshua: Maybe, may not.

    In old model when some Flash Ad. screw up, it just crashed the current process. Now when it blows up, it crashes all the Flash plugins in IE, Firefox and Chrome, no matter I run them in protected mode (or equivalent) or not. That means it's not safe to do equity trading on web while surfing websites that may have Flash Ad., and I have no way to evade it except by running the thing inside a VM.

    Talk about usability.

  18. Harry Johnston says:

    @Joker_vD: I think the only common scenario where the browser would be in a job object is when it is launched from the Startup menu, and in that case you can use CREATE_BREAKAWAY_FROM_JOB to escape.  But if you're in a non-breakaway job for some reason, then IMO the job owner is responsible for worrying about leftover processes.

    On the other hand, I don't see that zombie plugins are likely to do much harm anyway.  You can always kill them when the user restarts the browser.  So perhaps there was no need for a job object in the first place.  (I wonder what Chrome does?)

  19. Harry Johnston says:

    @cheong00: that's an implementation detail.  There's no need for different browsers to share the same plugin process.

  20. Drak says:

    @cheong00: Or, you could simply uninstall Flash completely. I haven't had it on my PC for a couple of years now, and I don't feel I'm missing out :)

  21. cheong00 says:

    @Drak: Or I should have changed my bank, except that trading fee is only waived if you put your salary paying bank account in the same bank, and not all banks offer something like this. Btw, seems the equity price update component is offered by 3rd party as well.

    @Harry: Agreed. Just that their current implementation make me feel more inconvenient than the old way. And honestly speaking, I smell the possibility of unexpected information leakage if some hole is found in the shared process. Not feeling very comfortable about that.

  22. Daniel says:

    Honestly, I'm still astonished, that this problem hasn't been addressed long ago.

    So far there still is no way to initialize a DLL without publishing a second method which MUST be called after loading it (and a cleanup method which MUST be called before cleanup).

    This in turn requires that the dll itself contains a second reference count (initialize will increase, cleanup will decrease it) or that the client does some reference counting (good luck if you have other dll's dependent on that dll too).

  23. alegr1 says:

    @Harry:

    >I think the only common scenario where the browser would be in a job object is when it is launched from the Startup menu, and in that case you can use CREATE_BREAKAWAY_FROM_JOB to escape.

    The browser needs to create a job object and attach the plugin processes to it. When the browser exits or crashes, it implicitly or explicitly closes the last handle to the job, and that would kill the remaining plugin processes.

  24. AsmGuru62 says:

    @Daniel:

    Just one method needed:

    BOOL ComponentInitialize (BOOL bConstructing);

    Pass 1 and DLL constructs its stuff.

    Pass 0 and DLL destructs its stuff.

  25. Joker_vD says:

    @Harry Johnston: Zombie plugins are a problem. I've seen plugins that open some important files in exclusive mode, so the second copy of the plugin can not work. And how do I kill zombie plugins on restart? Run through the list of all processes and kill everything that has the image file inside my "Plugins" folder and has parent process "[System]"?

  26. Joker_vD says:

    @Harry: Wait, so if I start two browsers, one crashes, I restart it, then it will see a bunch of plugins that don't have his current PID and terminate them all—including the plugins of the first copy that are working fine. Or worse, it restarts with the same PID, and now things get very interesting...

    Also, a DLL can start another process which won't be killed by the watchdog thread, and why would the plugin do it? Maybe it starts tor.exe and uses it to provide access to the .onion sites, whatever. It would be much easier if I could just reliably ask the OS to "If I die, kill every process I spawned, yes, even those ones that specifically asked not to kill them in this scenario".

    Well, I guess I can make the watchdog to put a breakpoint on the CreateProcess, and instead spawn a separate watchdog that will launch the requested executable and will monitor the first watchdog... I wonder what does Cygwin do.

  27. @alegr1: yes, exactly - but if the browser was launched from the Startup folder, it already belongs to a job created by Explorer, so you can't put your child processes in *your* job unless you use CREATE_BREAKAWAY_FROM_JOB.  (Or you could use nested jobs in Windows 8 or later.)

    @Joker_vD: all the plugins would normally use the same executable, e.g., Firefox's plugin-container.exe, so you can enumerate the candidates easily enough.  You could perhaps create an event object matching each plugin process, with the process ID as part of the name; any candidate process without a corresponding event object is a zombie.  But in retrospect it would be easier for each of the plugin processes to have a thread dedicated to watching the parent process.  If you launch the watchdog thread *before* you load the plugin DLL, it should be reliable enough no matter what the DLL does.

  28. Anomymous Coward says:

    @Joker_vD: that would quickly lead to a "If my parent dies, don't kill me, yes, even if the parent specified the 'If I die, kill every process I spawned, yes, even those ones that specifically asked not to kill them in this scenario' flag" flag.

    And then a `If I die, kill every process I spawned, yes, even those ones that specified the "If my parent dies, don't kill me, yes, even if the parent specified the 'If I die, kill every process I spawned, yes, even those ones that specifically asked not to kill them in this scenario' flag" flag` flag.

  29. @Joker_vD: no, in the model I described the other browser's plugin processes will not be affected, because the associated event objects will still exist.  Nor will it matter if the browser gets the same process ID it had before; we're associating the event objects with the process IDs of the plugin processes, not that of the parent process.  (The watchdog model is still preferable, IMO.)

    Regarding the possibility that the plugin creates a subprocess itself: personally, I would be inclined to explicitly prohibit doing so.  If I *had* to allow it, though, the plugin would be responsible for shutting down the children and itself cleanly when signaled by the watchdog thread; if it failed to do so in a timely manner, the watchdog thread would kill the plugin process and the user would have to deal with the child manually.  (That shouldn't happen frequently, and if it does, it's the plugin author's fault: my advice to the user would be to uninstall the plugin.)

    It wouldn't really be safe to simply kill the entire process tree as soon as the browser dies; what if one of the children is in the middle of writing to a file?  But we can do so if we want to, using a job object.  (We can't do that if we're in Windows 7 or earlier, and the process already belongs to a job object and isn't allowed to use CREATE_BREAKAWAY_FROM_JOB.  But that shouldn't ever happen, so we don't really need to worry about it too much.)

  30. Joker_vD says:

    @Anonymous Coward: No, it wouldn't. It would be simply "I don't care about my children processes", "I want my children processes to terminate when I terminate", and "I want my children processes to terminate when I terminate except those children processes that asked to not suffer this fate"; plus a special "I'd like to escape from the impeding doom if you please" call for children processes, and that's it.

    @Harry Johnston: Well, I don't know if it's possible to prohibit a process to create another processes on Windows. And "plugin would be responsible..." is BS. We moved plugins from DLLs into separate processes exactly because they're irresponsible, aren't we? Yes, if the plugin process writes to a file, and the main process crashes, it will cause incomplete write — but it's no different from when the DLL writes to a file, and the host process crashes. We don't lose any guarantees we had.

    "We can't do that if we're in Windows 7 or earlier" is sad, because our clients are not going to move from Windows 7 and Windows Server 2008 for at least another couple of years. After all, we already wrote a BAT-file around taskkill.exe which can be used for cleaning after crashes in our use scenarios, so while it's not anywhere near perfect, it works well enough to justify not "upgrading" to Windows 8.

  31. @Joker_vD: we put the plugin into a separate process so it can't crash the browser as easily, and to reduce the risk of the plugin and the browser interfering with each other.  Recognizing that the quality of plugins isn't always as good as that of the browser doesn't mean that we can't expect some minimal level of correct behaviour.  Terminating it the instance the browser crashes is no worse than the all-in-one-process model, but it's no better either; why not give the plugin a few seconds to shut down cleanly if it can?  If it can't, *then* we can terminate it.

    If a plugin launches a subprocess when we've prohibited doing so, or if it is allowed to create subprocesses but fails to properly clean them up when it is told to exit, we can enforce the rules by adding the plugin in question to the browser's blacklist.

    All that said, we *can* still do the kill-all-the-children thing if we want to, even in Windows 7, or Windows XP for that matter, as I've already pointed out more than once.  We just use a job object, combined with CREATE_BREAKAWAY_FROM_JOB to escape Explorer's job object if it is present.  It's not trivial, but it isn't rocket science.

  32. Marc K says:

    Even under the old model of hosting a plugin within the main process, if the plugin creates a child process and the host crashes, we're still left with the child orphaned.  So, switching to an out of process plugin model changes nothing in regards to that issue.  (And the issue shouldn't be used as an argument against an out of process plugin model.)
