Red alert! My Server is hung – what do I do?


So you have a dump from a hung server and you’re the first person on the scene. Your IT Manager is jumping up and down, the phone is ringing off the hook and people are hovering outside your cube.  It’s game time and the pressure is on!!!  Now what do you do? 


 


Well, take a deep breath, get a cup of coffee, and relax, because I’m here to help you out!  Let me share what we typically do on our first pass through a hung server kernel debug.  These steps work for both live debugs and dumps, and they do find problems!


 


Here’s something else to consider.  If the server is mission critical you will probably want to collect a dump rather than do a live debug, so you can get the server back up and running.  This takes the pressure off because you can then do the debug offline and, if need be, send the dump to other people for review.


 


Before we get started, let me state that the following data is completely fabricated and many of the process names and addresses in this output have been made up.  Do not question odd offsets or alignments.


 


I’m also assuming that you know how to:


 


1.       Collect a kernel dump: http://support.microsoft.com/kb/244139


 


2.       Set up the debugger: http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx


 


3.       Use the symbol server: http://support.microsoft.com/kb/311503


 


 


0)      Before I start these types of debugs I like to open a log file.


 


1: kd> .logopen H:\repro\hungserver.log


Opened log file ‘H:\repro\hungserver.log’


 


 


1)      !vm – Look at memory usage.  Generally speaking, you want to compare the current pool and memory usage values against the maximums available.


 


 


1: kd> !vm


 


 


*** Virtual Memory Usage ***


      Physical Memory:      982890 (   3931560 Kb)


      Page File: \??\P:\pagefile.sys


        Current:   3931560 Kb  Free Space:   3742548 Kb


        Minimum:   3931560 Kb  Maximum:      4193280 Kb


      Available Pages:      631300 (   2525200 Kb)


      ResAvail Pages:       888171 (   3552684 Kb)


      Locked IO Pages:         195 (       780 Kb)


      Free System PTEs:     202830 (    811324 Kb) < THIS IS OK


      Free NP PTEs:          32765 (    131060 Kb) < THIS IS OK


      Free Special NP:           0 (         0 Kb)


      Modified Pages:          241 (       964 Kb)


      Modified PF Pages:       241 (       964 Kb)


      NonPagedPool Usage:    11377 (     45508 Kb) < THIS IS OK


      NonPagedPool Max:      65536 (    262144 Kb) 


      PagedPool 0 Usage:      6398 (     25592 Kb)


      PagedPool 1 Usage:      2201 (      8804 Kb)


      PagedPool 2 Usage:      2216 (      8864 Kb)


      PagedPool 3 Usage:      2179 (      8716 Kb)


      PagedPool 4 Usage:      2199 (      8796 Kb)


      PagedPool Usage:       15193 (     60772 Kb) < THIS IS OK


      PagedPool Maximum:     67584 (    270336 Kb)


      Shared Commit:         24569 (     98276 Kb)


      Special Pool:              0 (         0 Kb)


      Shared Process:        12519 (     50076 Kb)


      PagedPool Commit:      15252 (     61008 Kb)


      Driver Commit:          2083 (      8332 Kb)


      Committed pages:      313611 (   1254444 Kb) < THIS IS OK


      Commit limit:        1925815 (   7703260 Kb)


 


Check to see if any apps are using tons of memory.  In this case I don’t see a problem.


 


      Total Private:        239673 (    958692 Kb)


         36b0 EXCEL.EXE        10775 (     43100 Kb) < THIS IS OK, etc


         2ee8 myapploc.exe     10288 (     41152 Kb)


         097c MySSrv.exe        7497 (     29988 Kb)


         0418 MyFun32.exe       6277 (     25108 Kb)


         0474 svchost.exe       6164 (     24656 Kb)


         1be8 ABCDEFGH.EXE      4984 (     19936 Kb)


         0480 IEXPLORE.EXE      4924 (     19696 Kb)


         09c4 ANOTHER.exe       4768 (     19072 Kb)


         19a4 HMMINTER.exe      4207 (     16828 Kb)


         1b30 ohboya.EXE        4146 (     16584 Kb)


         4558 aprocess.EXE      4138 (     16552 Kb)


         30e8 another.exe       3691 (     14764 Kb)


         0924 aservicec.exe     3508 (     14032 Kb)


         0854 RRXXc.exe         3400 (     13600 Kb)


         3458 MYWIN.EXE         3389 (     13556 Kb)


         0d90 FunService.exe    3298 (     13192 Kb)


         1180 CustomAp.exe      3221 (     12884 Kb)


         06ac XYZvrver.exe      2769 (     11076 Kb)


         2cdc ABCDEFGH.exe      2591 (     10364 Kb)


         02f4 lsass.exe         2567 (     10268 Kb)


         21b4 IEXPLORE.EXE      2516 (     10064 Kb)


         3420 Process.exe       2450 (      9800 Kb)


         4cd4 XYZXY.EXE         2305 (      9220 Kb)


         4a30 lookup.EXE        2244 (      8976 Kb)


         4360 Process.exe       2201 (      8804 Kb)


         0564 spoolsv.exe       2166 (      8664 Kb)


         2e5c XYZXYZEXE         2076 (      8304 Kb)


         02bc winlogon.exe      1964 (      7856 Kb)


         4e48 winlogon.exe      1958 (      7832 Kb)


         42bc ABCDEFGH.exe      1943 (      7772 Kb)


         0eb8 svchost.exe       1922 (      7688 Kb)


         3b98 Process.exe       1919 (      7676 Kb)


         4c1c IEXPLORE.EXE      1864 (      7456 Kb)


         17b8 winlogon.exe      1852 (      7408 Kb)


         3124 winlogon.exe      1849 (      7396 Kb)


         14b8 winlogon.exe      1847 (      7388 Kb)


         32cc winlogon.exe      1843 (      7372 Kb)


         1f84 winlogon.exe      1843 (      7372 Kb)


         2ebc winlogon.exe      1842 (      7368 Kb)


         1548 winlogon.exe      1840 (      7360 Kb)


         21c4 PROCESS213.EXE    1833 (      7332 Kb)


         3b58 MYWIN.EXE         1817 (      7268 Kb)


         4b3c winlogon.exe      1816 (      7264 Kb)


 


NOTE: If you see high pool values, issue a !poolused 2 and a !poolused 4 to dump the pool usage so you can see which pool tags are consuming pool.  (We will write a dedicated blog on this topic later.)
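If you have saved the session with .logopen, the "compare usage to the max" check can be scripted.  The following is only a rough Python sketch of that comparison, not a tool we ship: the function names (parse_vm, percent_used) are made up for illustration, and it assumes the !vm line layout shown above.

```python
import re

# Matches counter lines such as "NonPagedPool Usage:    11377 (     45508 Kb)".
VM_LINE = re.compile(r"^\s*([A-Za-z0-9 ]+?):\s+(\d+)\s+\(\s*(\d+)\s+Kb\)")


def parse_vm(text):
    """Return {counter name: size in Kb} for each 'name: pages (N Kb)' line."""
    counters = {}
    for line in text.splitlines():
        match = VM_LINE.match(line)
        if match:
            counters[match.group(1).strip()] = int(match.group(3))
    return counters


def percent_used(counters, used_key, max_key):
    """Express one counter as a percentage of its maximum."""
    return 100.0 * counters[used_key] / counters[max_key]


# Lines copied from the !vm output above (annotations included on purpose).
SAMPLE_VM = """
      NonPagedPool Usage:    11377 (     45508 Kb) < THIS IS OK
      NonPagedPool Max:      65536 (    262144 Kb)
      PagedPool Usage:       15193 (     60772 Kb) < THIS IS OK
      PagedPool Maximum:     67584 (    270336 Kb)
      Committed pages:      313611 (   1254444 Kb) < THIS IS OK
      Commit limit:        1925815 (   7703260 Kb)
"""

# Usage counter paired with the limit it should be compared against.
PAIRS = [
    ("NonPagedPool Usage", "NonPagedPool Max"),
    ("PagedPool Usage", "PagedPool Maximum"),
    ("Committed pages", "Commit limit"),
]

vm = parse_vm(SAMPLE_VM)
for used_key, max_key in PAIRS:
    print(f"{used_key}: {percent_used(vm, used_key, max_key):.1f}% of {max_key}")
```

On the sample numbers this works out to roughly 17% of nonpaged pool, 22% of paged pool and 16% of the commit limit, which matches the "THIS IS OK" annotations.  What counts as "too high" is a judgment call, so the sketch deliberately reports percentages instead of applying a threshold.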


 


 


2) !sysptes – See if one of the lists is low (less than 10)


 


 


1: kd> !sysptes


 


All of these are ok


 


System PTE Information


  Total System Ptes 224223


     SysPtes list of size 1 has 225 free


     SysPtes list of size 2 has 57 free


     SysPtes list of size 4 has 136 free


     SysPtes list of size 8 has 59 free


     SysPtes list of size 16 has 95 free


 


    starting PTE: c022b000


    ending PTE:   c03dff78


 


  free blocks: 652   total free: 202831    largest free block: 191973
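The "less than 10" check is easy to automate against a saved log.  Here is a small Python sketch under the assumption that the output keeps the "SysPtes list of size N has M free" wording shown above; low_syspte_lists is a hypothetical helper name, and the second sample is invented to show what a hit looks like.

```python
import re

# One match per line such as "SysPtes list of size 8 has 59 free".
SYSPTE_LIST = re.compile(r"SysPtes list of size (\d+) has (\d+) free")


def low_syspte_lists(text, threshold=10):
    """Return [(list size, free count)] for lists with fewer than `threshold` free."""
    return [
        (int(size), int(free))
        for size, free in SYSPTE_LIST.findall(text)
        if int(free) < threshold
    ]


# Copied from the !sysptes output above.
HEALTHY = """
     SysPtes list of size 1 has 225 free
     SysPtes list of size 2 has 57 free
     SysPtes list of size 4 has 136 free
     SysPtes list of size 8 has 59 free
     SysPtes list of size 16 has 95 free
"""

# Invented values, only to demonstrate a low list being reported.
STARVED = """
     SysPtes list of size 1 has 225 free
     SysPtes list of size 8 has 3 free
"""

print(low_syspte_lists(HEALTHY))
print(low_syspte_lists(STARVED))
```

The threshold defaults to the 10 mentioned in this step; pass a different value if your own rule of thumb differs.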


 


 


3) !defwrites – If write throttling is engaged, the server is doing little other than writing to the disk.


 


 


1: kd> !defwrites


*** Cache Write Throttle Analysis ***


 


      CcTotalDirtyPages:                   187 (     748 Kb)


      CcDirtyPageThreshold:             130560 (  522240 Kb)


      MmAvailablePages:                 631300 ( 2525200 Kb)


      MmThrottleTop:                       450 (    1800 Kb)


      MmThrottleBottom:                     80 (     320 Kb)


      MmModifiedPageListHead.Total:        241 (     964 Kb)


 


Write throttles not engaged  < THIS IS OK. Good = NOT engaged.


 


 


4) !ready – See whether any threads are sitting in the READY state, which would suggest that something is holding up the processors.


 


 


1: kd> !ready


Processor 0: No threads in READY state  < THIS IS OK


Processor 1: No threads in READY state  < THIS IS OK


 


If there were threads in the READY state, you would want to investigate what those threads are and what is currently running on each processor.


 


 


5) !pcr x; kv on each processor (where x is the processor number) – If a processor isn’t idle, it could be busy running DPCs.


 


 


1: kd> !pcr 0  < Dump the processor control region for CPU 0


KPCR for Processor 0 at ffdff000:


    Major 1 Minor 1


      NtTib.ExceptionList: ffffffff


          NtTib.StackBase: 00000000


         NtTib.StackLimit: 00000000


       NtTib.SubSystemTib: 80042000


            NtTib.Version: 012e7ace


        NtTib.UserPointer: 00000001


            NtTib.SelfTib: 00000000


 


                  SelfPcr: ffdff000


                     Prcb: ffdff120


                     Irql: 00000000


                      IRR: 00000000


                      IDR: ffffffff


            InterruptMode: 00000000


                      IDT: 8003f400


                      GDT: 8003f000


                      TSS: 80042000


 


            CurrentThread: 8056cd00


               NextThread: 00000000


               IdleThread: 8056cd00


 


                DpcQueue: < NO DPCs: Not much to look at then 


    


1: kd> !pcr 1  < Dump the processor control region for CPU 1


KPCR for Processor 1 at f773f000:


    Major 1 Minor 1


      NtTib.ExceptionList: f5ba1d30


          NtTib.StackBase: 00000000


         NtTib.StackLimit: 00000000


       NtTib.SubSystemTib: f773fef0


            NtTib.Version: 0121925d


        NtTib.UserPointer: 00000002


            NtTib.SelfTib: 7ffda000


 


                  SelfPcr: f773f000


                     Prcb: f773f120


                     Irql: 00000000


                      IRR: 00000000


                      IDR: ffffffff


            InterruptMode: 00000000


                      IDT: f77456e0


                      GDT: f77452e0


                      TSS: f773fef0


 


            CurrentThread: 8963cb90


               NextThread: 00000000


               IdleThread: f7741fa0


 


                DpcQueue: < NO DPCs: Not much to look at then
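A processor is idle when its CurrentThread is the same as its IdleThread.  In the output above that is true for CPU 0 but not for CPU 1.  As a rough illustration, this Python sketch pulls those two fields out of saved !pcr text and compares them; pcr_summary is a hypothetical helper, and it assumes the field names appear exactly as printed above.

```python
import re

# Captures "CurrentThread: 8056cd00", "NextThread: ...", "IdleThread: ...".
THREAD_FIELD = re.compile(r"(\w+Thread):\s+([0-9a-fA-F]+)")


def pcr_summary(text):
    """Return the current and idle thread of one !pcr dump and whether they match."""
    fields = {name: value.lower() for name, value in THREAD_FIELD.findall(text)}
    return {
        "current": fields["CurrentThread"],
        "idle": fields["IdleThread"],
        "is_idle": fields["CurrentThread"] == fields["IdleThread"],
    }


# Excerpts copied from the two !pcr dumps above.
CPU0 = """
            CurrentThread: 8056cd00
               NextThread: 00000000
               IdleThread: 8056cd00
"""
CPU1 = """
            CurrentThread: 8963cb90
               NextThread: 00000000
               IdleThread: f7741fa0
"""

for number, dump in enumerate((CPU0, CPU1)):
    summary = pcr_summary(dump)
    state = "idle" if summary["is_idle"] else f"running thread {summary['current']}"
    print(f"Processor {number}: {state}")
```

For any processor that is not idle, the current thread address is the one to look at next (for example with !thread, or by switching to that processor and running kv).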


 


6) !locks – Look for deadlocks and contention


 


 


The following output is of interest.


The thread ID with the <*> next to it has exclusive access to the resource, and all the other threads are waiting on that thread to finish its work.  Typically you would !thread that OWNER THREAD ID <*> (e.g., !thread 87bddda0) to see what that thread is doing.  If two threads have exclusive access to two different resources, and each thread appears in the exclusive waiters list of the resource the other one owns, you have a deadlock.  The following is an example of what a deadlock might look like.  In this case you would want to !thread each owner and evaluate the logic of the code in each stack that allowed the threads to get into this state.


 


1: kd> !locks


**** DUMP OF ALL RESOURCE OBJECTS ****


KD: Scanning for held locks……


 


Resource @ 0x8a50ee98    Shared 4 owning threads


     Threads: 896856d0-01<*> 89686778-01<*> 896862d0-01<*> 89685da0-01<*>


KD: Scanning for held locks……………………………………………………


 


Resource @ 0x896da1bc    Exclusively owned


     Threads: 896e3b20-01<*>


KD: Scanning for held locks..


 


 


Resource @ 0x81234567    Shared 1 owning threads


    Contention Count = 15292


    NumberOfSharedWaiters = 1


    NumberOfExclusiveWaiters = 39


     Threads: 87bddda0-01<*> 806d2020-01 


 


 


     Threads Waiting On Exclusive Access:


              80ced020       80c036f8       80cdc7a0       80c438b0      


              80e6cda0       80f96987       8007fd60       8004dc10      


              80d7b020       80a2dd70       80b89620       80b58020      


              8036eda0       87abc123       80606da0       8056e890      


              802b3630       80cc7590       80d64020       80f7dda0      


              80129580       80b73da0       806d2578       80b505d8      


      


 


KD: Scanning for held locks…………….


 


Resource @ 0x83245678    Exclusively owned


    Contention Count = 4827


    NumberOfExclusiveWaiters = 35


     Threads: 87abc123-01<*>


     Threads Waiting On Exclusive Access:


              803e6aa0       80876020       80240020       80f56588      


              808174f0       80bd6b28       80c3c448       8046d6c8      


              801e8da0       80356518       80b4c978       8069e020      


              80cb9020       87bddda0       80c65020       86daaac0      


              80379020       80fe4020      
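In the output above, thread 87bddda0 owns resource 0x81234567 and waits on 0x83245678, while thread 87abc123 owns 0x83245678 and waits on 0x81234567.  Spotting that by eye gets harder as the waiter lists grow, so here is a hedged Python sketch of the same cross-check.  The helper names are invented, the parsing assumes the !locks layout shown above, and it only finds two-thread deadlocks, not longer chains.

```python
import re

OWNER = re.compile(r"([0-9a-f]{8})-\d+<\*>")   # e.g. "87bddda0-01<*>"
THREAD = re.compile(r"\b[0-9a-f]{8}\b")        # bare thread address


def parse_locks(text):
    """Return {resource address: (owners, exclusive waiters)} from !locks text."""
    resources = {}
    for block in text.split("Resource @ ")[1:]:
        address = block.split()[0]
        head, _, tail = block.partition("Threads Waiting On Exclusive Access:")
        owners = set(OWNER.findall(head))
        # Ignore any "KD: Scanning for held locks" progress text after the list.
        waiters = set(THREAD.findall(tail.split("KD:")[0]))
        resources[address] = (owners, waiters)
    return resources


def find_deadlocks(resources):
    """Return sorted pairs of threads that each wait on a resource the other owns."""
    waits_on = {}  # thread -> owners of the resources it is waiting for
    for owners, waiters in resources.values():
        for waiter in waiters:
            waits_on.setdefault(waiter, set()).update(owners - {waiter})
    pairs = set()
    for thread, blockers in waits_on.items():
        for blocker in blockers:
            if thread in waits_on.get(blocker, set()):
                pairs.add(tuple(sorted((thread, blocker))))
    return sorted(pairs)


# Abbreviated from the !locks output above (one waiter row per resource).
SAMPLE_LOCKS = """
Resource @ 0x81234567    Shared 1 owning threads
     Threads: 87bddda0-01<*> 806d2020-01
     Threads Waiting On Exclusive Access:
              8036eda0       87abc123       80606da0       8056e890

Resource @ 0x83245678    Exclusively owned
     Threads: 87abc123-01<*>
     Threads Waiting On Exclusive Access:
              80cb9020       87bddda0       80c65020       86daaac0
"""

for first, second in find_deadlocks(parse_locks(SAMPLE_LOCKS)):
    print(f"Possible deadlock: !thread {first} and !thread {second}")
```

The script only tells you which pair of owners to look at; you still need to !thread each one and read the stacks to understand how they got there.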


 


 


 


8) !process 0 0 – Search for drwtsn32.  This would indicate that we have a process that has crashed and is in the process of being dumped.  This could cause a server hang.  Look at the PEB for drwtsn32 and get its command line to see what process is being dumped.  You should be able to do this by taking its process address (the value after PROCESS in the !process 0 0 output) and doing a .process PROCESSADDRESS;.reload;!peb


 


The following shows how to extract the command line for any process; the same steps work for Dr. Watson.


 


1: kd> .process 89f31020 


Implicit process is now 89f31020


1: kd> .reload


Loading Kernel Symbols


………………………………………………………………………………………………………………………….


Loading User Symbols


………………………….


Loading unloaded module list


……………


1: kd> !peb


PEB at 7ffdf000


    InheritedAddressSpace:    No


    ReadImageFileExecOptions: Yes


    BeingDebugged:            No


    ImageBaseAddress:         01000000


    Ldr                       77fc23a0


    Ldr.Initialized:          Yes


    Ldr.InInitializationOrderModuleList: 00171ef8 . 00176c90


    Ldr.InLoadOrderModuleList:           00171e90 . 00176c80


    Ldr.InMemoryOrderModuleList:         00171e98 . 00176c88


            Base TimeStamp                     Module


         1000000 3e80245d Mar 24 05:41:49 2003 \??\P:\WINDOWS\system32\winlogon.exe


        77f40000 3e802494 Mar 25 05:42:44 2003 P:\WINDOWS\system32\ntdll.dll


        77e40000 44c60ec8 Jul 25 08:30:00 2006 P:\WINDOWS\system32\kernel32.dll


        77ba0000 3e802496 Mar 25 05:42:46 2003 P:\WINDOWS\system32\msvcrt.dll


        77da0000 3e802495 Mar 25 05:42:45 2003 P:\WINDOWS\system32\ADVAPI32.dll


        77c50000 40566fc9 Mar 15 23:08:57 2004 P:\WINDOWS\system32\RPCRT4.dll


        77d00000 45e7bafc Mar 02 00:49:48 2007 P:\WINDOWS\system32\USER32.dll


        77c00000 45e7bafc Mar 02 00:49:48 2007 P:\WINDOWS\system32\GDI32.dll


        75970000 3e8024a2 Mar 25 05:42:58 2003 P:\WINDOWS\system32\USERENV.dll


        75810000 3e8024a3 Mar 25 05:42:59 2003 P:\WINDOWS\system32\NDdeApi.dll


        761b0000 3e8024a0 Mar 25 05:42:56 2003 P:\WINDOWS\system32\CRYPT32.dll


       


    SubSystemData:     00000000


    ProcessHeap:       00070000


    ProcessParameters: 00020000


    WindowTitle:  ‘< Name not readable >’


    ImageFile:    ‘\??\P:\WINDOWS\system32\winlogon.exe’


    CommandLine:  ‘winlogon.exe’ < HERE IS THE COMMAND LINE.. No args in this case


 


 


( output is truncated … )
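If the !process 0 0 listing is long, a saved log can be searched for the image name instead of scrolling.  This Python sketch is one way to do that, assuming the PROCESS/Image layout that !process 0 0 prints; find_processes and peb_commands are made-up names, and the sample listing (including the drwtsn32.exe entry and its address) is invented for the example.

```python
import re

# One entry: "PROCESS <address> ... Image: <name>" spanning several lines.
ENTRY = re.compile(r"PROCESS\s+([0-9a-f]{8})\b.*?Image:\s+(\S+)", re.DOTALL)


def find_processes(text, image_name):
    """Return the process addresses whose image name matches (case-insensitive)."""
    wanted = image_name.lower()
    return [address for address, image in ENTRY.findall(text) if image.lower() == wanted]


def peb_commands(address):
    """Build the command sequence from this step for one process address."""
    return f".process {address}; .reload; !peb"


# Invented listing in the !process 0 0 format; addresses are not real.
SAMPLE_LISTING = """
PROCESS 8a104343  SessionId: 0  Cid: 02bc    Peb: 7ffdf000  ParentCid: 0274
    DirBase: ed539000  ObjectTable: e18c67b0  HandleCount: 498.
    Image: winlogon.exe

PROCESS 89f31020  SessionId: 0  Cid: 0a10    Peb: 7ffdf000  ParentCid: 02e8
    DirBase: eb111000  ObjectTable: e1a00000  HandleCount: 40.
    Image: drwtsn32.exe
"""

for address in find_processes(SAMPLE_LISTING, "drwtsn32.exe"):
    print(peb_commands(address))
```

The printed line is meant to be pasted back into the debugger; the script itself never talks to the debugger.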


 


9) Look at the handle table size.  If it’s over 10000 you may have trouble.  If you do have a handle leak, refer to the TalkBack video “Understanding handle leaks and how to use !htrace to find them”.


 


 


1: kd> !process 0 0


 


**** NT ACTIVE PROCESS DUMP ****


PROCESS 8a613270  SessionId: none  Cid: 0004    Peb: 00000000  ParentCid: 0000


    DirBase: 0acc0000  ObjectTable: e1001d10  HandleCount: 2510.


    Image: System


 


PROCESS 8a294328  SessionId: none  Cid: 0274    Peb: 7ffdf000  ParentCid: 0004


    DirBase: ef1ac000  ObjectTable: e14ac1d0  HandleCount: 124.


    Image: smss.exe


 


PROCESS 8a103424  SessionId: 0  Cid: 02a4    Peb: 7ffdf000  ParentCid: 0274


    DirBase: ed804000  ObjectTable: e18caa68  HandleCount: 1171.


    Image: csrss.exe


 


PROCESS 8a104343  SessionId: 0  Cid: 02bc    Peb: 7ffdf000  ParentCid: 0274


    DirBase: ed539000  ObjectTable: e18c67b0  HandleCount: 498.


    Image: winlogon.exe


 


PROCESS 8a0f6634  SessionId: 0  Cid: 02e8    Peb: 7ffdf000  ParentCid: 02bc


    DirBase: ece72000  ObjectTable: e1668e40  HandleCount: 568.


    Image: services.exe


 


PROCESS 8a123423  SessionId: 0  Cid: 02f4    Peb: 7ffdf000  ParentCid: 02bc


    DirBase: ecd7a000  ObjectTable: e16684a0  HandleCount: 30000. < This is bad


    Image: lsass.exe


 


PROCESS 89f96453  SessionId: 0  Cid: 03e0    Peb: 7ffdf000  ParentCid: 02e8


    DirBase: eb99c000  ObjectTable: e16bb570  HandleCount: 500.


    Image: svchost.exe


 


PROCESS 8a0c6532  SessionId: 0  Cid: 042c    Peb: 7ffdf000  ParentCid: 02e8


    DirBase: eb6d7000  ObjectTable: e1731170  HandleCount: 156.


    Image: svchost.exe


 


PROCESS 8a0a8d88  SessionId: 0  Cid: 0460    Peb: 7ffdf000  ParentCid: 02e8


    DirBase: eb58f000  ObjectTable: e17372e8  HandleCount: 124.


    Image: svchost.exe


 


PROCESS 89f77678  SessionId: 0  Cid: 0474    Peb: 7ffdf000  ParentCid: 02e8


    DirBase: eb484000  ObjectTable: e17305b8  HandleCount: 1457.


    Image: svchost.exe
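On a terminal server the process list can run to hundreds of entries, so it helps to filter a saved log for large handle counts.  A minimal Python sketch, assuming the !process 0 0 layout shown above; high_handle_counts is a hypothetical name, and the 10000 default is simply the rule of thumb from this step.

```python
import re

# One !process 0 0 entry, from the PROCESS line through the Image line.
PROCESS_ENTRY = re.compile(
    r"PROCESS\s+(?P<address>[0-9a-f]{8})\b.*?"
    r"HandleCount:\s+(?P<handles>\d+)\..*?"
    r"Image:\s+(?P<image>\S+)",
    re.DOTALL,
)


def high_handle_counts(text, threshold=10000):
    """Return [(image, process address, handle count)] for counts above threshold."""
    return [
        (m.group("image"), m.group("address"), int(m.group("handles")))
        for m in PROCESS_ENTRY.finditer(text)
        if int(m.group("handles")) > threshold
    ]


# Entries copied from the !process 0 0 output above.
SAMPLE_PROCESSES = """
PROCESS 8a0f6634  SessionId: 0  Cid: 02e8    Peb: 7ffdf000  ParentCid: 02bc
    DirBase: ece72000  ObjectTable: e1668e40  HandleCount: 568.
    Image: services.exe

PROCESS 8a123423  SessionId: 0  Cid: 02f4    Peb: 7ffdf000  ParentCid: 02bc
    DirBase: ecd7a000  ObjectTable: e16684a0  HandleCount: 30000. < This is bad
    Image: lsass.exe

PROCESS 89f77678  SessionId: 0  Cid: 0474    Peb: 7ffdf000  ParentCid: 02e8
    DirBase: eb484000  ObjectTable: e17305b8  HandleCount: 1457.
    Image: svchost.exe
"""

for image, address, handles in high_handle_counts(SAMPLE_PROCESSES):
    print(f"{image} ({address}) has {handles} handles")
```

A high count is only a hint that a leak may exist; confirming it still means looking at what kind of handles the process is holding.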


 


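9) !process 0 0 system – Find the System process and check its worker threads (search for srv! to find the server service worker threads).  With flags of 0 only the process summary is shown, so use a more detailed flag value, such as !process 0 17 system, to see the thread stacks.  What are these threads doing?  Are they blocked on I/O or waiting for a resource?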


 


10) !process 0 17 csrss.exe – Look for 16 LPC server threads.  What are they doing?  Are they blocked?


 


11) !stacks 2 – This will dump every call stack on the server.  You may need to go through and evaluate each one.  Look for threads waiting on critical sections, etc.


 


15) !qlocks – This will allow you to check the state of all the queued spin locks on the machine.  For further information on spin locks, refer to the Windows Internals book.


 


1: kd> !qlocks


Key: O = Owner, 1-n = Wait order, blank = not owned/waiting, C = Corrupt


 


                       Processor Number


    Lock Name         0  1    << Nothing to worry about here.


 


KE   – Dispatcher        


MM   – Expansion         


MM   – PFN               


MM   – System Space      


CC   – Vacb              


CC   – Master            


EX   – NonPagedPool      


IO   – Cancel            


EX   – WorkQueue         


IO   – Vpb                


IO   – Database          


IO   – Completion        


NTFS – Struct            


AFD  – WorkQueue         


CC   – Bcb               


MM   – NonPagedPool     


 


16) !process 0 17 winlogon.exe – Look for hung LPC calls.  If you find an LPC call going out of winlogon, you can follow the call with the !lpc debugger command.  This will allow you to see what the thread is doing in the other process.


 


 


If you have further questions on any of these commands, please refer to the debugger.chm file in the Windows debugger tools install.


 


Good luck and happy debugging.


 


“This debugger is mine, there are many like it but this one is mine!” Jeff Dailey
