With this event you will get a message stating, “A process serving application pool '%1' failed to respond to a ping. The process id was '%2'.” When Event ID 5010 appears in your System log with WAS as the source, you can be assured of two things. The first is that a worker process failed to respond to a ping from the Windows Process Activation Service (WAS). The second is that WAS attempted to shut down the offending process. Generally, this means the worker process either hung or did not have enough resources to respond to the ping request. The documentation lists the resolution for this event as:
Diagnose an unresponsive worker process
A worker process that fails to respond may exhibit one of the following symptoms:
- Ping failure.
- Communication failure with WAS.
- Worker process startup time-out.
- Worker process shutdown time-out.
- Worker process orphan action failure.
There are also links that provide guidance on troubleshooting memory leaks, long response times, and high CPU usage in the worker process. In each of those three cases, there is a deterministic approach: you can take a memory dump and investigate the issue. In practice, however, there are times when the worker process simply hangs without displaying any of those behaviors. So the question becomes: how do we know when the worker process is hung? The answer is that WAS knows, so we just have to ask WAS.
I’m going to go through the steps of finding the breakpoints and getting the needed information because the methodology can be used elsewhere. For example, though the steps at the bottom of the page work for IIS 7, 7.5, and 8, I haven’t tested them with IIS 6 or below. If you’re currently experiencing this issue and just want to know how to get the memory dump of the Worker Process, feel free to skip down to the bottom.
When working with an issue like this, our first step is always to clarify what we know and what we need to know.
What we know:
- WAS knows when a worker process hangs
- WAS knows enough about the worker process to shut it down if it appears hung
What we need to know:
- How to tell when WAS thinks a worker process is hung
- How to identify the hung process
- How to create a dump of the worker process before it’s terminated by WAS
To get the answers we need, we'll investigate the WAS process with WinDbg. Our goal is to find a method call that signifies a failed ping, set a breakpoint on it, then identify the associated worker process at the time of failure. We'll start by attaching the debugger to the WAS process and reloading symbols; symbols are important here because we'll need them to investigate the method calls. Once the debugger is attached, we can look at the loaded modules to make some educated guesses about where to look. We'll do this by running lm, for list modules, appending the 1m and sm options to list only the module names, sorted alphabetically.
0:000> lm 1m sm
Now that we have the modules, we can make some guesses as to which ones might be related to the failed ping request processing. It seems likely that the responsible module will either be one of the IIS modules or a module with w3 in the name. Next, in the debugger, we can search those modules for method calls that look related, using the x command with wild cards.
0:000> x *iis*!*ping*
000007fb`45efb2d4 iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem (<no parameter info>)
000007fb`45f2c330 iisw3adm!g_aIISWPSiteMappings = <no type information>
000007fb`45f247dc iisw3adm!AddDomainMapping (<no parameter info>)
000007fb`45f13fb0 iisw3adm!CONFIG_CS_PATH_MAPPER::SetConfigFileMappings (<no parameter info>)
000007fb`45f2c4d0 iisw3adm!g_aIISWPGlobalMappings = <no type information>
000007fb`45efdde4 iisw3adm!MESSAGING_HANDLER::HandlePingReply (<no parameter info>)
000007fb`45efc59c iisw3adm!PingResponseTimerExpiredCallback (<no parameter info>)
000007fb`45f24918 iisw3adm!EnsureLSAMapping (<no parameter info>)
000007fb`45f2c490 iisw3adm!g_aIISULGlobalMappings = <no type information>
000007fb`45f2c2d0 iisw3adm!aIISULSiteMappings = <no type information>
000007fb`45f059f4 iisw3adm!PERF_MANAGER::ProcessPerfCounterPingFired (<no parameter info>)
000007fb`45f2c168 iisw3adm!g_cIISWPSiteMappings = <no type information>
000007fb`45efb64c iisw3adm!WORKER_PROCESS::CancelPingResponseTimer (<no parameter info>)
000007fb`45f2d3cc iisw3adm!g_fWASStoppingInProgress = <no type information>
000007fb`45f06be0 iisw3adm!PerfCountPing (<no parameter info>)
000007fb`45f31dc0 iisw3adm!_imp_LsaManageSidNameMapping = <no type information>
000007fb`45ed98a0 iisw3adm!CheckWASIsStopping (<no parameter info>)
000007fb`45f31eb0 iisw3adm!_imp_CreateFileMappingW = <no type information>
000007fb`45f31eb8 iisw3adm!_imp_OpenFileMappingW = <no type information>
000007fb`45f2c164 iisw3adm!cIISULSiteMappings = <no type information>
000007fb`45f2c484 iisw3adm!g_cIISULGlobalMappings = <no type information>
000007fb`45efb164 iisw3adm!WORKER_PROCESS::SendPingWorkItem (<no parameter info>)
000007fb`45efc51c iisw3adm!SendPingTimerCallback (<no parameter info>)
000007fb`45f31df0 iisw3adm!_imp_CreateFileMappingA = <no type information>
000007fb`45f31df8 iisw3adm!_imp_OpenFileMappingA = <no type information>
000007fb`45f2c488 iisw3adm!g_cIISWPGlobalMappings = <no type information>
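As an aside, the wildcard matching that x performs can be mimicked with Python's fnmatch, which also explains why seemingly unrelated symbols such as AddDomainMapping and CheckWASIsStopping show up above: "mapping" and "stopping" both contain "ping". A minimal sketch, using a hand-picked subset of the output:

```python
import fnmatch

# A hand-picked subset of the symbols x returned above
symbols = [
    "iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem",
    "iisw3adm!AddDomainMapping",
    "iisw3adm!MESSAGING_HANDLER::HandlePingReply",
    "iisw3adm!CheckWASIsStopping",
]

# x matches case-insensitively, so lower-case the symbol before matching
pattern = "*iis*!*ping*"
hits = [s for s in symbols if fnmatch.fnmatch(s.lower(), pattern)]
print(len(hits))  # → 4: all of them match, "mapping" and "stopping" included
```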
The first method returned looks very promising and is a good place to start. We can set a breakpoint on it, let the process continue, then intentionally cause a failed ping request to see if the breakpoint trips.
0:000> bp iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem
To cause a failed ping request, we can suspend a worker process with a tool such as Process Explorer.
After the time specified in the application pool's configuration, the WAS process should identify the worker process as hung and attempt to shut it down. If our breakpoint is correct, the debugger, still attached to the WAS process, should break in. As it turns out, that is exactly what happens in this case.
Breakpoint 0 hit
000007fb`45efb2d4 48895c2410 mov qword ptr [rsp+10h],rbx ss:000000cd`a86cfba8=705541a9cd000000
The breakpoint has been hit, and if we check Process Explorer, we can see the process is still alive. Further, we know because of the calling convention that we're dealing with a WORKER_PROCESS object. At this point, we could absolutely write a script to dump all of the worker processes on the machine, and we would know that one of those dumps captured the hung process. If, however, we're working on a machine with many worker processes, it may be more beneficial to find the actual worker process being referred to and dump just that one. The WORKER_PROCESS object almost certainly contains the worker process's PID, which we could use in a script to dump just that process. We can investigate the current thread, try to find the worker process object, and then look for an offset from the object that points to the PID.
We’ll start by dumping the call stack.
Child-SP RetAddr Call Site
000000cd`a86cfb98 000007fb`45ef7385 iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem
000000cd`a86cfba0 000007fb`45ede8e3 iisw3adm!WORKER_PROCESS::ExecuteWorkItem+0x41
000000cd`a86cfbf0 000007fb`45ede442 iisw3adm!WORK_ITEM::Execute+0x13
000000cd`a86cfc30 000007fb`45edb129 iisw3adm!WORK_QUEUE::ProcessWorkItem+0xa2
000000cd`a86cfc70 000007fb`45edafb7 iisw3adm!WEB_ADMIN_SERVICE::MainWorkerThread+0x35
000000cd`a86cfcb0 000007fb`45ed9aec iisw3adm!WEB_ADMIN_SERVICE::ExecuteService+0xdf
000000cd`a86cfcf0 000007fb`4e8f167e iisw3adm!WASServiceThreadProc+0x94
000000cd`a86cfd60 000007fb`50533501 KERNEL32!BaseThreadInitThunk+0x1a
000000cd`a86cfd90 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
This gives us an idea of the calling order, but the public symbols we're using won't give us local variables, and in this instance they won't tell us whether (and which) parameters are passed. The Microsoft x64 calling convention dictates that in a 64-bit process, the first four parameters are passed in the registers RCX, RDX, R8, and R9, in that order; any additional parameters are stored on the stack. (If this were a 32-bit process, the parameters would most likely all be on the call stack.) Knowing this, we can start by searching the call stack and those four registers for the PID we identified. In fact, any of the non-null registers displayed with the r command are a good place to look, excluding RIP, which points to code.
rax=0000000000000000 rbx=0000000000000000 rcx=000000cda910e520
rdx=000000cda94159d0 rsi=0000000000000001 rdi=000000cda910e520
rip=000007fb45efb2d4 rsp=000000cda86cfb98 rbp=000000cda8709ec0
r8=000000cda86cfb28 r9=000000cda86cfc10 r10=0000000000000000
r11=0000000000000246 r12=0000000000000000 r13=0000000000000000
iopl=0 nv up ei pl zr na po nc
cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000246
000007fb`45efb2d4 48895c2410 mov qword ptr [rsp+10h],rbx ss:000000cd`a86cfba8=705d41a9cd000000
We can see that RAX and RBX are null, but RCX and RDX are not. We'll start with the call stack, then move on to the other registers. For the call stack, we'll search from 0x100 bytes below the stack pointer (RSP) up to the stack base (RBP). Remember, call stacks start at a higher address and grow downward as calls are made. We'll search for a word-sized value, which is plenty of space for a PID, using our known PID, 8012.
0:002> s -w @rsp-100 @rbp 0n8012
This returned no results, so the PID is not on the call stack. Next we'll search a range of 0x300 words starting at RCX.
0:002> s -w @rcx L300 0n8012
In this case we get two results. We can use the ? command to subtract RCX from the first match's address and see how far from RCX it is.
Evaluate expression: 72 = 00000000`00000048
We get an offset of 72 (0x48) bytes, which is very close to the start of the object. We can do a couple of things to verify that this offset will hold in the future. First, we can check whether there is any symbol information at RCX.
0:002> dps @rcx
000000cd`a910e520 000007fb`45ecc2b8 iisw3adm!WORKER_PROCESS::`vftable'
It seems RCX is a pointer to a WORKER_PROCESS virtual function table, which effectively makes it the address of the WORKER_PROCESS object. It also means our offset of RCX+72 is probably correct and can be used consistently. We can be sure with one last check: let the debugger continue, create and pause another worker process, let the debugger break in at our earlier breakpoint, then validate that RCX+72 points to the new PID.
With the new worker process paused, we note its PID is 4356. In hex that comes to 0x1104, so that's the value we're looking for.
Breakpoint 0 hit
000007fb`45efb2d4 48895c2410 mov qword ptr [rsp+10h],rbx ss:000000cd`a86cfba8=705d41a9cd000000
0:002> dw @rcx+0n72 l1
Clearly, our offset works: it points to the correct new PID. This means we can use our breakpoint and this offset in a script to dump the hung worker process. Below, I've listed the steps to script this with DebugDiag, but it could easily be adapted to work with ADPlus or even CDB and WinDbg directly.
(On a side note, a review of the disassembly for PingResponseTimerExpiredWorkItem showed that it takes no explicit parameters. Checking RCX still paid off, though: under the x64 convention, a C++ member function receives its implicit this pointer in RCX, which is exactly why it pointed at the WORKER_PROCESS object.)
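The s -w search we performed above can be sketched in ordinary code: scan a buffer word by word for the little-endian PID and report the offset of the first hit. A minimal sketch, assuming a hypothetical buffer standing in for the memory at RCX (not real WAS memory):

```python
import struct

def search_words(buf, value):
    """Mimic WinDbg's s -w: return offsets of word-aligned matches
    of a 16-bit little-endian value."""
    needle = struct.pack("<H", value)
    return [off for off in range(0, len(buf) - 1, 2)
            if buf[off:off + 2] == needle]

# Hypothetical buffer standing in for the memory at RCX, with the
# PID 8012 (0x1f4c) planted at offset 72, matching what we found
mem = bytearray(0x300 * 2)  # s -w ... L300 scans 0x300 word-sized elements
struct.pack_into("<H", mem, 72, 8012)

offsets = search_words(bytes(mem), 8012)
print(offsets[0])  # → 72, i.e. RCX+0x48
```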
Steps to get a memory dump of the faulting worker process using DebugDiag when a 5010 is raised.
These steps assume the production machine with the issue does not have internet access, as this is a common practice. If internet access is available, you can complete Step 1 directly on the production machine.
Step 1: Get Symbols for WAS Service
- Open DebugDiag on a production server
- Navigate to the Processes tab
- Sort by the Service Column Descending and identify the line containing the process for the WAS service
- Right click the WAS service and choose create full memory dump
- Zip the dump and copy it to a machine with external internet access
- On the new machine use the LoadSymbols script found here: http://blogs.msdn.com/b/puneetgupta/archive/2010/05/14/debugdiag-script-to-load-all-symbols-in-a-dump-file.aspx to collect symbols for the WAS service
- On the analysis tab of DebugDiag, highlight the line for LoadSymbols
- Click Add Data File
- Navigate to the dump from the prod server and click ok
- Now click Start Analysis
- By default, DebugDiag will save the symbol files to c:\symcache. Copy the contents of that folder to the same path on the production machine. (If you’ve changed the symbol settings in DebugDiag previously, you must make sure you’re pointing to the Microsoft symbol server and make note of your local path)
Step 2: Set up DebugDiag on Production machine
- On the Rules tab, click Add Rule and select a Crash Rule
- Select “A specific NT Service” and click Next
- Highlight the WAS service and click Next
- Click Breakpoints then Add Breakpoint
- In Breakpoint Expression enter: iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem
- Under Action Type select Custom
- Add the following script in the window that pops up:
rcx = CausingThread.Register("rcx")
w3wpProcessId = Debugger.ReadDWORD(rcx + 72)
WriteToLog "w3wp process id = " & w3wpProcessId
Debugger.CreateDumpForProcessID w3wpProcessId, "Failed Ping", False
- Set the Action Limit to 5 then click ok
- Click Save & Close
- Click Next all the way through. You can specify the location of the dump file on one of the screens before you click finish.
- Check the log for svchost under c:\program files\DebugDiag and make sure it shows that the breakpoint was set. If it was not set, try restarting the Debug Diagnostic Service. The log should have something like the following:
[9/30/2012 2:46:53 PM] Initializing control script
[9/30/2012 2:46:53 PM] Clearing any existing breakpoints
[9/30/2012 2:46:53 PM]
[9/30/2012 2:46:53 PM] Attempting to set unmanaged breakpoint at iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem
[9/30/2012 2:46:53 PM] bpID = 0
[9/30/2012 2:46:53 PM] Current Breakpoint List(BL)
0 e 000007fb`45efb2d4 0001 (0001) 0:**** iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem
You can test the above setup by suspending a worker process with Process Explorer and making sure a memory dump is created. That's all there is to it. Of course, the memory dump is just the beginning: you'll still have to debug it to identify the cause of the hang.
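For reference, the heart of the custom action, Debugger.ReadDWORD(rcx + 72), boils down to a little-endian DWORD read at a fixed offset. A stand-alone sketch of that read (the buffer here is a stand-in for live WAS memory, not the real DebugDiag API):

```python
import struct

PID_OFFSET = 72  # offset of the PID within WORKER_PROCESS, found empirically above

def read_pid(worker_process_bytes):
    """Equivalent of Debugger.ReadDWORD(rcx + 72): little-endian DWORD read."""
    return struct.unpack_from("<I", worker_process_bytes, PID_OFFSET)[0]

# Hypothetical object memory holding the second test PID, 4356 (0x1104)
mem = bytearray(0x100)
struct.pack_into("<I", mem, PID_OFFSET, 4356)
print(read_pid(mem))  # → 4356
```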