Troubleshooting Event ID 5010 — IIS Application Pool Availability


With this event you will get a message stating, “A process serving application pool '%1' failed to respond to a ping. The process id was '%2'.” When Event ID 5010 appears in your System log with WAS as the source, you can be assured of two things. The first is that a worker process failed to respond to a ping from the Windows Process Activation Service (WAS). The second is that WAS attempted to shut down the offending process. Generally, this means the worker process either hung or did not have enough resources to respond to the ping request. The resolution listed in the documentation for this event is:

Diagnose an unresponsive worker process

A worker process that fails to respond may exhibit one of the following symptoms:

  • Ping failure.
  • Communication failure with WAS.
  • Worker process startup time-out.
  • Worker process shutdown time-out.
  • Worker process orphan action failure.

 

The documentation also links to guidance on troubleshooting memory leaks, long response times, and high CPU usage in the worker process. In each of those three cases there is a deterministic trigger you can use to capture a memory dump and investigate the issue. In practice, however, there are times when the worker process simply hangs without displaying any of those behaviors. So the question becomes: how do we know when the worker process is hung? The answer is that WAS knows, so we just have to ask WAS.
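As a quick aside, for those three deterministic cases Sysinternals ProcDump is one common way to capture the dump on a trigger. A minimal sketch, assuming ProcDump is on the PATH; the w3wp PID and thresholds here are only illustrative:

REM Full dump if the worker process sustains 90%+ CPU for 10 seconds
procdump -accepteula -ma -c 90 -s 10 8012

REM Full dump if the worker process commits more than 1500 MB of memory
procdump -accepteula -ma -m 1500 8012

The hang scenario this post covers has no obvious trigger like that, which is why we have to ask WAS instead.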

I’m going to go through the steps of finding the breakpoints and getting the needed information because the methodology can be used elsewhere.  For example, though the steps at the bottom of the page work for IIS 7, 7.5, and 8, I haven’t tested them with IIS 6 or below.  If you’re currently experiencing this issue and just want to know how to get the memory dump of the Worker Process, feel free to skip down to the bottom.

When working with an issue like this, our first step is always to clarify what we know and what we need to know.

What we know:

  • WAS knows when a worker process hangs
  • WAS knows enough about the worker process to shut it down if it appears hung

 

What we need to know:

  • How to tell when WAS thinks a worker process is hung
  • How to identify the hung process
  • How to create a dump of the worker process before it’s terminated by WAS

 

To get to the answers we need, we’ll investigate the WAS process with WinDbg.  Our goal will be to find a method call that signifies a failed ping, set a breakpoint on it, then identify the associated worker process at the time of failure.  We’ll start by attaching the debugger to the WAS process and reloading symbols.  Symbols are important here because we’ll need them to investigate the method calls.  Once we have the debugger attached, we can take a look at the loaded modules in an effort to make some guesses at where we should look.  We’ll do this by running lm, for list modules.  We can append 1m and sm to only list the module names and sort them alphabetically.
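Attaching to WAS is straightforward in a test environment. A minimal sketch of the setup, where the svchost PID and the symbol cache path are illustrative:

C:\> sc queryex WAS | findstr PID
        PID                : 1464
C:\> windbg -p 1464
0:000> .symfix c:\symcache
0:000> .reload

With that in place, the module list looks like this: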

0:000> lm 1m sm
ADVAPI32
bcryptPrimitives
clbcatq
combase
CRYPT32
CRYPTBASE
CRYPTSP
DPAPI
GDI32
HTTPAPI
IISRES
iisutil
iisw3adm
KERNEL32
KERNELBASE
ktmw32
logoncli
mlang
MSASN1
msvcrt
nativerd
NSI
ntdll
ntmarta
ole32
OLEAUT32
pcwum
profapi
RPCRT4
rsaenh
sechost
Secur32
SSPICLI
svchost
USER32
USERENV
w3ctrlps
W3TP
WS2_32
XmlLite

Now that we have the modules, we can make some guesses as to which ones might be related to the failed ping processing. It seems likely that the responsible module will be either one of the IIS modules or one with w3 in the name. Next, in the debugger, we can search those modules for method names that look related, using the x command with wildcards.

0:000> x *iis*!*ping*
000007fb`45efb2d4 iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem (<no parameter info>)
000007fb`45f2c330 iisw3adm!g_aIISWPSiteMappings = <no type information>
000007fb`45f247dc iisw3adm!AddDomainMapping (<no parameter info>)
000007fb`45f13fb0 iisw3adm!CONFIG_CS_PATH_MAPPER::SetConfigFileMappings (<no parameter info>)
000007fb`45f2c4d0 iisw3adm!g_aIISWPGlobalMappings = <no type information>
000007fb`45efdde4 iisw3adm!MESSAGING_HANDLER::HandlePingReply (<no parameter info>)
000007fb`45efc59c iisw3adm!PingResponseTimerExpiredCallback (<no parameter info>)
000007fb`45f24918 iisw3adm!EnsureLSAMapping (<no parameter info>)
000007fb`45f2c490 iisw3adm!g_aIISULGlobalMappings = <no type information>
000007fb`45f2c2d0 iisw3adm!aIISULSiteMappings = <no type information>
000007fb`45f059f4 iisw3adm!PERF_MANAGER::ProcessPerfCounterPingFired (<no parameter info>)
000007fb`45f2c168 iisw3adm!g_cIISWPSiteMappings = <no type information>
000007fb`45efb64c iisw3adm!WORKER_PROCESS::CancelPingResponseTimer (<no parameter info>)
000007fb`45f2d3cc iisw3adm!g_fWASStoppingInProgress = <no type information>
000007fb`45f06be0 iisw3adm!PerfCountPing (<no parameter info>)
000007fb`45f31dc0 iisw3adm!_imp_LsaManageSidNameMapping = <no type information>
000007fb`45ed98a0 iisw3adm!CheckWASIsStopping (<no parameter info>)
000007fb`45f31eb0 iisw3adm!_imp_CreateFileMappingW = <no type information>
000007fb`45f31eb8 iisw3adm!_imp_OpenFileMappingW = <no type information>
000007fb`45f2c164 iisw3adm!cIISULSiteMappings = <no type information>
000007fb`45f2c484 iisw3adm!g_cIISULGlobalMappings = <no type information>
000007fb`45efb164 iisw3adm!WORKER_PROCESS::SendPingWorkItem (<no parameter info>)
000007fb`45efc51c iisw3adm!SendPingTimerCallback (<no parameter info>)
000007fb`45f31df0 iisw3adm!_imp_CreateFileMappingA = <no type information>
000007fb`45f31df8 iisw3adm!_imp_OpenFileMappingA = <no type information>
000007fb`45f2c488 iisw3adm!g_cIISWPGlobalMappings = <no type information>

The first method returned looks very promising and is a good place to start. We can set a breakpoint on it, let the process continue, and then intentionally cause a failed ping to see if the breakpoint trips.

 

0:000> bp iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem
0:000> g

To cause a failed ping request, we can suspend a worker process with a tool such as Process Explorer.

[Screenshot: suspending the w3wp.exe worker process in Process Explorer]
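If you prefer the command line, the Sysinternals PsSuspend tool does the same thing; a quick sketch, where the PID is that of the worker process you want to appear hung (and -r resumes it afterward):

C:\> pssuspend 8012
C:\> pssuspend -r 8012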

 

After the time specified in the Application Pool’s configuration, WAS should identify the worker process as hung and attempt to shut it down. If our breakpoint is correct, the debugger, still attached to the WAS process, should break in. As it turns out, that is exactly what happens in this case.
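If you would rather not wait on the default timers, the settings involved are the application pool’s processModel.pingInterval and processModel.pingResponseTime. One way to check them with appcmd is sketched below; the pool name is just an example, and lowering these values in a test environment shortens the wait before WAS reacts.

%windir%\system32\inetsrv\appcmd.exe list apppool "DefaultAppPool" /text:processModel.pingInterval
%windir%\system32\inetsrv\appcmd.exe list apppool "DefaultAppPool" /text:processModel.pingResponseTime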

Breakpoint 0 hit
iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem:
000007fb`45efb2d4 48895c2410      mov     qword ptr [rsp+10h],rbx ss:000000cd`a86cfba8=705541a9cd000000

The breakpoint has been hit, and if we check Process Explorer we can see the worker process is still alive. Further, because this is a WORKER_PROCESS member function, the calling convention tells us a WORKER_PROCESS object should be close at hand. At this point, we could absolutely write a script to dump all of the worker processes on the machine, and we would know we had captured a dump of the hung one. If, however, we’re working on a machine that has a lot of worker processes, it is more useful to identify the specific worker process being referred to and dump just that one. The WORKER_PROCESS object will almost certainly contain the PID of the worker process, which a script could use to dump only that process. So we can investigate the current thread, try to find the worker process object, and then look for an offset within the object that holds the PID.
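As an aside, the “dump everything” route really can be that simple. A rough sketch from an elevated command prompt, assuming Sysinternals ProcDump is on the PATH:

for /f "tokens=2" %p in ('tasklist /fi "imagename eq w3wp.exe" /nh') do procdump -accepteula -ma %p

The rest of this post takes the more targeted route and pulls the PID out of the WORKER_PROCESS object instead.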

We’ll start by dumping the call stack.

0:002> k
Child-SP          RetAddr           Call Site
000000cd`a86cfb98 000007fb`45ef7385 iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem
000000cd`a86cfba0 000007fb`45ede8e3 iisw3adm!WORKER_PROCESS::ExecuteWorkItem+0x41
000000cd`a86cfbf0 000007fb`45ede442 iisw3adm!WORK_ITEM::Execute+0x13
000000cd`a86cfc30 000007fb`45edb129 iisw3adm!WORK_QUEUE::ProcessWorkItem+0xa2
000000cd`a86cfc70 000007fb`45edafb7 iisw3adm!WEB_ADMIN_SERVICE::MainWorkerThread+0x35
000000cd`a86cfcb0 000007fb`45ed9aec iisw3adm!WEB_ADMIN_SERVICE::ExecuteService+0xdf
000000cd`a86cfcf0 000007fb`4e8f167e iisw3adm!WASServiceThreadProc+0x94
000000cd`a86cfd60 000007fb`50533501 KERNEL32!BaseThreadInitThunk+0x1a
000000cd`a86cfd90 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

 

This gives us an idea of the calling order, but the public symbols we’re using won’t give us local variables, and in this instance they won’t tell us whether any parameters were passed or what they were. The Microsoft x64 calling convention dictates that in a 64-bit process the first four parameters are passed in the registers RCX, RDX, R8, and R9, in that order. Any additional parameters are stored on the stack. If this were a 32-bit process, the parameters would most likely all be on the call stack. Knowing this, we can start by searching the call stack and those four registers for the PID of the worker process we suspended. In fact, any of the non-null registers displayed with the r command is a good place to look, excluding RIP, which points to code.

0:002> r
rax=0000000000000000 rbx=0000000000000000 rcx=000000cda910e520
rdx=000000cda94159d0 rsi=0000000000000001 rdi=000000cda910e520
rip=000007fb45efb2d4 rsp=000000cda86cfb98 rbp=000000cda8709ec0
  r8=000000cda86cfb28  r9=000000cda86cfc10 r10=0000000000000000
r11=0000000000000246 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl zr na po nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246
iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem:
000007fb`45efb2d4 48895c2410      mov     qword ptr [rsp+10h],rbx ss:000000cd`a86cfba8=705d41a9cd000000

We can see that RAX and RBX are null, but RCX and RDX are not. We’ll start with the call stack, then move on to the other registers. For the call stack, we’ll search from 0x100 bytes below the current stack pointer (RSP) up to RBP. Remember, call stacks start at a higher address and grow toward lower addresses as calls are made. We’re searching for a word-sized value, which is plenty of room for this PID, and the value we’re looking for is the known PID of the suspended worker process, 8012.

0:002> s -[1]w @rsp-100 @rbp 0n8012

This returned no results, so the PID is not on that part of the call stack. Next we’ll search the 0x300 bytes starting at RCX.

0:002> s -[1]w @rcx L300 0n8012
0x000000cd`a910e568
0x000000cd`a910e56c

In this case we have two results.  We can see how far from RCX the first result is.

0:002> ?0x000000cd`a910e568-@rcx
Evaluate expression: 72 = 00000000`00000048

We get an offset of 72 bytes (0x48), which is quite close to the start of whatever RCX points at. We can do a couple of things to verify that this offset can be relied on in the future. First, we can check whether there is any symbol information at RCX.

0:002> dps @rcx
000000cd`a910e520  000007fb`45ecc2b8 iisw3adm!WORKER_PROCESS::`vftable'
000000cd`a910e528  000000cd`43525057
000000cd`a910e530  000000cd`a90f4998
000000cd`a910e538  000000cd`a90f4998
000000cd`a910e540  00000000`00000005
000000cd`a910e548  000000cd`a9390db0
000000cd`a910e550  00000000`00000006
000000cd`a910e558  000000cd`a90f4930
000000cd`a910e560  000000cd`a90f4b40
000000cd`a910e568  00001f4c`00001f4c
000000cd`a910e570  00000000`0000031c
000000cd`a910e578  00000000`00000000
000000cd`a910e580  000000cd`a7f7ed20
000000cd`a910e588  00000000`00000001
000000cd`a910e590  000000cd`00000000
000000cd`a910e598  00000000`00000000

 

The first pointer stored at RCX is the WORKER_PROCESS virtual function table pointer, which means RCX is essentially the address of the WORKER_PROCESS object itself. It also means our offset of RCX+72 is a field within that object and can probably be used consistently. We can be sure with one last check: we let the debugger continue, create and suspend another worker process, let the debugger break in at our earlier breakpoint, and then validate that RCX+72 points to the new PID.

With the new worker process suspended, we note its PID is 4356. In hex that is 0x1104, so that’s the value we’re looking for.

0:002> g
Breakpoint 0 hit
iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem:
000007fb`45efb2d4 48895c2410      mov     qword ptr [rsp+10h],rbx ss:000000cd`a86cfba8=705d41a9cd000000
0:002> dw @rcx+0n72 l1
000000cd`a910e568  1104

 

Clearly our offset works: it points to the correct new PID. This means we can use the breakpoint and the offset to script a dump of the hung worker process. Below, I’ve listed the steps to script this with DebugDiag, but it could easily be adapted to work with ADPlus, or with CDB and WinDbg directly; a sketch of the manual WinDbg route follows.
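The manual version, still attached to WAS, can be as small as asking the breakpoint itself to print the PID. A sketch, where 0x48 is the 72-byte offset derived above and the PID shown is illustrative:

0:002> bp iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem "? dwo(@rcx+0x48)"
0:002> g

When the ping fails, the breakpoint command evaluates the DWORD at RCX+0x48 and prints something like “Evaluate expression: 8012 = 00000000`00001f4c”. At that point you can dump the reported PID from a second command prompt (for example, procdump -accepteula -ma 8012) and then let WAS continue with g.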

(On a side note, a review of the disassembly for PingResponseTimerExpiredWorkItem showed that it does not take any explicit parameters. RCX was not a parameter here; as a C++ member function it carries the implicit this pointer, but it’s a common enough register that it is always worth checking, and it definitely paid off in this instance.)

 

Steps to get a memory dump using DebugDiag from the faulting worker process when a 5010 is raised.

These steps assume that the production machine with the issue does not have internet access, as is common practice. If that’s not the case and internet access is available, you can complete Step 1 on the production machine itself.

 

Step 1: Get Symbols for WAS Service

  1. Open DebugDiag on a production server
  2. Navigate to the Processes tab
  3. Sort by the Service Column Descending and identify the line containing the process for the WAS service
  4. Right click the WAS service and choose create full memory dump
  5. Zip the dump and copy it to a machine with external internet access
  6. On the new machine, use the LoadSymbols script found here: http://blogs.msdn.com/b/puneetgupta/archive/2010/05/14/debugdiag-script-to-load-all-symbols-in-a-dump-file.aspx to collect symbols for the WAS service (an alternative using symchk is sketched after this list)
  7. On the analysis tab of DebugDiag, highlight the line for LoadSymbols
  8. Click Add Data File
  9. Navigate to the dump from the prod server and click ok
  10. Now click Start Analysis
  11. By default, DebugDiag will save the symbol files to c:\symcache.  Copy the contents of that folder to the same path on the production machine.  (If you’ve changed the symbol settings in DebugDiag previously, you must make sure you’re pointing to the Microsoft symbol server and make note of your local path)
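As an alternative to the LoadSymbols script in step 6, if the internet-connected machine has the Debugging Tools for Windows installed, symchk can download the PDB that matters for this breakpoint (iisw3adm.pdb) against a copy of the production server’s iisw3adm.dll. A sketch, assuming the DLL was copied from %windir%\system32\inetsrv on the production server to c:\temp:

symchk c:\temp\iisw3adm.dll /s SRV*c:\symcache*http://msdl.microsoft.com/download/symbols

Either way, the contents of c:\symcache then get copied to the production machine as described in step 11.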

 

Step 2: Set up DebugDiag on Production machine

  1. On the Rules tab, click Add Rule and select a Crash Rule
  2. Select “A specific NT Service” and click Next
  3. Highlight the WAS service and click Next
  4. Click Breakpoints then Add Breakpoint
  5. In Breakpoint Expression enter: iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem
  6. Under Action Type select Custom
  7. Add the following script in the window that pops up:
Dim w3wpProcessId
' RCX holds the WORKER_PROCESS object when the breakpoint fires
rcx = CausingThread.Register("rcx")
' The worker process PID sits at offset 72 (0x48) in the object
w3wpProcessId = Debugger.ReadDWORD(rcx + 72)
WriteToLog "w3wp process id = " & w3wpProcessId
' Dump the hung worker process before WAS terminates it
Debugger.CreateDumpForProcessID w3wpProcessId, "Failed Ping", False

 

  8. Set the Action Limit to 5, then click OK
  9. Click Save & Close
  10. Click Next all the way through.  You can specify the location of the dump file on one of the screens before you click Finish.
  11. Check the log for svchost under c:\program files\DebugDiag and make sure it shows that the breakpoint was set.  If it was not set, try restarting the Debug Diagnostic Service.  The log should contain something like the following:

 

[9/30/2012 2:46:53 PM] Initializing control script
[9/30/2012 2:46:53 PM] Clearing any existing breakpoints
[9/30/2012 2:46:53 PM]
[9/30/2012 2:46:53 PM] Attempting to set unmanaged breakpoint at iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem
[9/30/2012 2:46:53 PM] bpID = 0
[9/30/2012 2:46:53 PM] Current Breakpoint List(BL)
  0 e 000007fb`45efb2d4     0001 (0001)  0:**** iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem

 

You can test the above setup by suspending a worker process with Process Explorer (or PsSuspend) and making sure a memory dump is created. That’s all there is to it. Of course, once you have the memory dump, that’s really just the beginning: you’ll still have to debug the dump to identify the cause of the hang.
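Once you have that w3wp dump open in WinDbg, a few common starting points for the hang investigation; the symbol cache path is illustrative, and the sos commands assume a .NET 4.x application pool (skip them for purely native code):

0:000> .symfix c:\symcache
0:000> .reload
0:000> !analyze -v -hang
0:000> ~* kb
0:000> .loadby sos clr
0:000> ~*e !clrstack
0:000> !syncblk

The per-thread native and managed stacks usually show what every request thread is waiting on, and !syncblk points out managed locks that are held while other threads wait.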

 

 

-Jarrod

Comments (2)

  1. Bill Sytin says:

    Perfect debugging technique!

    As far as I know, there is an IIS feature called "Process Orphaning" (Application Pools -> Your Pool -> Advanced Settings).

    There you could place a command, e.g. VsJITDebugger.exe -p %1%, to invoke the Visual Studio debugger in case of Event ID 5010.

  2. Neville says:

    Excellent post. Very helpful. Thanks.
