Debug session: Why is an LPC server not responding?


A particular scenario was hanging, and the team responsible for the scenario debugged it to the point where they saw that their X component was waiting for their Y component, which was waiting for Explorer, so they asked for help chasing the hang into Explorer.

The team was kind enough to have shared what they've learned so far:

kd> !alpc /m 9c14d020

Message 9c14d020
  MessageID             : 0x0274 (628)
  CallbackID            : 0xCCA5 (52389)
  SequenceNumber        : 0x00000016 (22)
  Type                  : LPC_REQUEST
  DataLength            : 0x0094 (148)
  TotalLength           : 0x00AC (172)
  Canceled              : No
  Release               : No
  ReplyWaitReply        : No
  Continuation          : Yes
  OwnerPort             : 82bb9db8 [ALPC_CLIENT_COMMUNICATION_PORT]
  WaitingThread         : 834553c0
  QueueType             : ALPC_MSGQUEUE_MAIN
  QueuePort             : 84646730 [ALPC_CONNECTION_PORT]
  QueuePortOwnerProcess : 846209c0 (explorer.exe)
  ServerThread          : 00000000 <-----------------------
  QuotaCharged          : No
  CancelQueuePort       : 00000000
  CancelSequencePort    : 00000000
  CancelSequenceNumber  : 0x00000000 (0)
  ClientContext         : 02a56b80
  ServerContext         : 00000000
  PortContext           : 0701ea20
  CancelPortContext     : 00000000
  SecurityData          : 962f89b8
  View                  : 00000000
kd> !process 846209c0 0
PROCESS 846209c0  SessionId: 1  Cid: 0804    Peb: 7fbac000  ParentCid: 0724
    DirBase: 3e546380  ObjectTable: 97195300  HandleCount: 1041.
    Image: explorer.exe

Yikes, there is no thread signed up to service the request.

I don't know much about ALPC, but I can fumble around. Fortunately, this is debugging and not rocket surgery, so you still get full points if you stumble across the answer by guessing.

I decided to start guessing by looking at what the !alpc command can tell me.

kd> !alpc -?

  !alpc /m MessageAddress
    Dumps the message at the specified address.

  !alpc /p PortAddress
    Dumps the port at the specified address.

Well, I already saw what the result was for dumping the message, so I may as well dump the port.

kd> !alpc /p 84646730

...
  8 thread(s) are registered with port IO completion object:
    THREAD 84658d40  Cid 0804.0888  Teb: 7fa7e000 Win32Thread: 8214a748 WAIT
    THREAD 8466a040  Cid 0804.08c4  Teb: 7fa74000 Win32Thread: 8214c800 WAIT
    THREAD 84659a00  Cid 0804.08ec  Teb: 7fa72000 Win32Thread: 82158d08 WAIT
    THREAD 8466c8c0  Cid 0804.08f0  Teb: 7fa6e000 Win32Thread: 82160420 WAIT
    THREAD 84671040  Cid 0804.0910  Teb: 7fa68000 Win32Thread: 8217c4e8 WAIT
    THREAD 8460d180  Cid 0804.099c  Teb: 7fa5e000 Win32Thread: 820bad08 WAIT
    THREAD 834278c0  Cid 0804.0c80  Teb: 7fa6b000 Win32Thread: 820b9620 WAIT
    THREAD 8345ad40  Cid 0804.0da0  Teb: 7fba9000 Win32Thread: 821c6d08 WAIT
...

So it looks like there are eight threads signed up to process events on this port. (Is that what this means? I don't know, but I'm going to assume that it does, because this is debugging. Debugging is an exercise in optimism.) Let's see what they're doing.

kd> .thread 84658d40;k
Implicit thread is now 84658d40
  *** Stack trace for last set context - .thread/.cxr resets it
ChildEBP RetAddr
940ef394 80f1505f nt!KiSwapContext+0x19
940ef3d0 80f184e0 nt!KiSwapThread+0x34b
940ef3fc 80f163fc nt!KiCommitThreadWait+0x26f
940ef46c 80f4d2df nt!KeWaitForSingleObject+0x459
940ef4b4 80e20838 nt!KiSchedulerApc+0x298
940ef4c8 00000000 hal!KfLowerIrql+0x2c

[the others look the same]

Well, I don't know what they're doing, but it looks like they're waiting for something. But one of the threads looks different:

kd> .thread 84671040;k
Implicit thread is now 84671040
  *** Stack trace for last set context - .thread/.cxr resets it
ChildEBP RetAddr
9415f864 80f1505f nt!KiSwapContext+0x19
9415f8a0 80f184e0 nt!KiSwapThread+0x34b
9415f8cc 80eb3d6e nt!KiCommitThreadWait+0x26f
9415f934 810c0527 nt!KeWaitForMultipleObjects+0x4e3
9415fbe4 810c0703 nt!ObWaitForMultipleObjects+0x2fd
9415fd38 80ef113c nt!NtWaitForMultipleObjects+0xca
9415fd38 77945e04 nt!KiFastCallEntry+0x12c
07e2f1c4 779437f6 ntdll!KiFastSystemCallRet
07e2f1c8 7515c136 ntdll!NtWaitForMultipleObjects+0xa
07e2f34c 77752658 KERNELBASE!WaitForMultipleObjectsEx+0xee
07e2f368 777fbe60 KERNEL32!WaitForMultipleObjects+0x19
07e2f3d4 777fc5de KERNEL32!WerpReportFaultInternal+0x1a3
07e2f3e8 777df654 KERNEL32!WerpReportFault+0x6d
07e2f3f4 751e517c KERNEL32!BasepReportFault+0x19
07e2f490 77a0f95a KERNELBASE!UnhandledExceptionFilter+0x1e0
07e2f4a0 77a0fd4d ntdll!TppExceptionFilter+0x1b
07e2f4b4 77a1c66b ntdll!TppWorkerpInnerExceptionFilter+0x13
07e2fb34 77753278 ntdll!TppWorkerThread+0xa6092
07e2fb40 779761a6 KERNEL32!BaseThreadInitThunk+0xe
07e2fb80 77976152 ntdll!__RtlUserThreadStart+0x4a
07e2fb90 00000000 ntdll!_RtlUserThreadStart+0x1c

Ah, well that explains why Explorer isn't responding: It crashed on an unhandled exception! Windows Error Reporting is busy trying to generate a report.

Now to see what the crash was. I don't know for sure, but I'm pretty confident that one of the parameters to Basep­Report­Fault is an EXCEPTION_POINTERS. Why am I confident of that? Because it would be hard to report the fault without it!

kd> dd 07e2f3f4 l4
07e2f3f4  07e2f490 77a0f95a 07e2f4e8 00000001
          ChildEBP RetAddr  Param1

kd> dd 07e2f4e8 l2
07e2f4e8  07e2f620 07e2f63c 
                 ^ ^
   ExceptionRecord ContextRecord
kd> .cxr 0x07e2f63c
eax=00000000 ebx=0451e2f8 ecx=e2af034f edx=77945e00 esi=00000000 edi=0451e2e0
eip=1df7fc6a esp=07e2f920 ebp=07e2f938 iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010246
contoso!ContosoPower::Disconnect+0xdd:
001b:1df7fc6a 8b08  mov ecx,dword ptr [eax] ds:0023:00000000=????????

Aha, Explorer crashed due to a null pointer crash in the Contoso­Power::Disconnect function.

Passing the buck onward to Contoso, the report back was that this was a known issue, and a hotfix was available.

Comments (16)
  1. John says:

    More often than not it seems the fault lies with Contoso; those guys suck!

  2. Henke37 says:

    So, was the original team related to Contoso?

    [Nope. -Raymond]
  3. Joshua says:

    Nice. Ah yes the downside in allowing garbage into your memory space is you get the support requests for somebody else's garbage. Unfortunately the classical model of processes is the only model understood by managers.

  4. Douglas says:

    "Fortunately, this is debugging and not rocket surgery"

    Does one require less knowledge and skill than the other?

  5. asdbsd says:

    So if they just waited a bit they'd see the exception report with clear traces of Contoso?

  6. asdbsd says:

    @John: Just look at their wiki page. Contoso Bank, Contoso Pharmaceuticals, Contoso University, Contoso News Agency, Software Developement. It's like they're some evil corporation taking over the world, but this can't be because they're just a fictional entity thought up by Microsoft, right? Ha-ha. Ha-ha, ha… Gotta lock my door.

  7. kog999 says:

    "Fortunately, this is debugging and not rocket surgery"

    "Does one require less knowledge and skill than the other?"

    not really about skill but if you mess up the rocket surgery the rocket could exploade costing millions of dollars and lives. If you messe up debugging you have to emm, try again.

  8. The big problem with rocket surgery is the malpractice insurance.

  9. Gabe says:

    It seems to me like this hang should be temporary. Shouldn't WEP generate the report and then "unhang", somehow releasing the request made by the Y component, while simultaneously producing a report of the cause of the crash?

    Maybe the next question is "Why did WEP hang?"

    [The only way to "unhang" the process is to terminate it, at which point it's too late to generate the crash report. WER can't generate the report until the user gives permission, because generating the report includes transmitting information to Microsoft to ask, "Hey, I hit this failure, do you want me to capture any data beyond the default crash dump?" -Raymond]
  10. Matt says:

    @kog: "if you mess up the rocket surgery the rocket could exploade costing millions of dollars and lives. If you messe up debugging you have to emm, try again."

    Unless you're debugging a rocket.

  11. Joshua says:

    [The only way to "unhang" the process is to terminate it, at which point it's too late to generate the crash report. WER can't generate the report until the user gives permission, because generating the report includes transmitting information to Microsoft to ask, "Hey, I hit this failure, do you want me to capture any data beyond the default crash dump?" -Raymond]

    Close all open handles. Anybody waiting for it gets WAIT_ABANDONED_*. Since they're external to the calling process, WAIT_ABANDONED_* should already be handled. I can't believe I had to write that.

    [Um, WAIT_ABANDONED is not a valid status code when waiting for a process handle. And you didn't clean up other externally-visible state, like window handles, or remove the process from the Process List (because some apps do FindProcessByName). And then you have the impossible situation of a process that still exists but is not present in the process list. -Raymond]
  12. ErikF says:

    @Joshua: Yanking a program's infrastructure out from under it is simply delaying when the program is going to crash and burn. See blogs.msdn.com/…/463977.aspx for possible things that can happen when a program is presented with WAIT_ABANDONED; the answer to "What can I do with invalid handles?" is "Not much."

    Terminating a hung process is probably better all around because you don't leave programs in unstable states where they think that everything's OK but in reality nothing is!

  13. Joshua says:

    Wow. Raymond got it ErikF didn't. My general idea is whatever handles that are being used for the LPC would be closed anyway when WER terminates the process so it could potentially close them now. I overlooked LPC by Windows messaging as opposed to kernel32 handles.

  14. alegr1 says:

    Microsoft is just an umbrella corporation for Contoso.

  15. Gabe says:

    I had assumed that WER was going to silently create a small crash dump and then at some later point send it to MS, whereupon MS might say "next time you hit this crash, capture additional information X".

    Regardless, it sounds like you're saying WER put up UI to ask if the user wants to give permission. Did the user who reported the issue not notice the UI, or did they ignore it because they didn't realize it had to do with their issue?

    [(It could have been done the way you suggested, but that means everybody has to hit the crash twice before the information can be collected. And if the problem is sporadic, it may never get collected since it would require lightning to strike twice.) This happened in a lab, so there is nobody to look at the screen. The testing system reported "test X is unresponsive" and the developers connected to the machine and found this. -Raymond]

Comments are closed.

Skip to main content