Hang caused by GC – XML Deadlock

In December I blogged about a little tool that i wrote to analyze hangs in dumps, and i showed the following output, but didnt really get into the details of why the process was stuck in here...  



The following threads are GC threads:
	18 19 

The following threads are waiting for the GC to finish: 
	14 16 24 26 27 28 30 31 36 37 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 57 58 59 60 62 63 64 65 66 67 68 69 70 71 72 73 74 77 78 

The GC was triggered by thread: 75 

The GC is working on suspending threads to continue with garbage collection
The following threads can't be suspended because preemptive GC is disabled:
	23 25 33 34 35 38 56 61 

The Finalizer (Thread 20) is not blocked

The issue the customer is running into is a hang during heavy load.  The only way to get out of the hang is to recycle the process (IISReset). 


Debugging the issue:

I have seen this issue before on a few occations and although as you will see later it was currently fixed in the framework, but in my earlier cases on this we ended up not needing a fix since the customers I worked with made code changes that made it such that they no longer were subject to the issue.

So what is going on here?

Thread 75 triggered a garbage collection by making an allocation that would have made Gen 0 go over its allocation budget. 

0:075> kb 2000
ChildEBP RetAddr  Args to Child              
1124de6c 7c822124 77e6bad8 000002e8 00000000 ntdll!KiFastSystemCallRet
1124de70 77e6bad8 000002e8 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
1124dee0 79e718fd 000002e8 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac
1124df24 79e718c6 000002e8 ffffffff 00000000 mscorwks!PEImage::LoadImage+0x199
1124df74 79e7187c ffffffff 00000000 00000000 mscorwks!CLREvent::WaitEx+0x117
1124df84 7a0d0d0f ffffffff 00000000 00000000 mscorwks!CLREvent::Wait+0x17
1124dfa8 7a0d5289 ffffffff 000d4558 106cb970 mscorwks!SVR::gc_heap::wait_for_gc_done+0x99
1124dfcc 7a0d5fa2 00000000 00000000 00000020 mscorwks!SVR::GCHeap::GarbageCollectGeneration+0x267
1124e058 7a0d691f 106cb970 00000020 00000000 mscorwks!SVR::gc_heap::try_allocate_more_space+0x1c0
1124e078 7a0d7ecc 106cb970 00000020 00000000 mscorwks!SVR::gc_heap::allocate_more_space+0x2f
1124e098 7a08bd32 106cb970 00000020 00000002 mscorwks!SVR::GCHeap::Alloc+0x74
1124e0b4 79e7b43e 00000020 00000000 00080000 mscorwks!Alloc+0x60
1124e180 79e8f41c 79157f42 1124e230 00000001 mscorwks!AllocateArrayEx+0x1d1
1124e244 7937f5c2 064f60b8 064f60b8 064f60b8 mscorwks!JIT_NewArr1+0x167
1124e27c 5088a509 00000000 00000000 00000000 mscorlib_ni!System.Reflection.RuntimeMethodInfo.GetParameters()+0x4a

Nothing strange there, allocation and garbage collection happens all the time... however a GC is usually extremely fast and in this case we have 45 threads waiting for the GC to finish...

0:014> kb
ChildEBP RetAddr  Args to Child              
01a0fc74 7c822124 77e6bad8 000002e4 00000000 ntdll!KiFastSystemCallRet
01a0fc78 77e6bad8 000002e4 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
01a0fce8 79e718fd 000002e4 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac
01a0fd2c 79e718c6 000002e4 ffffffff 00000000 mscorwks!PEImage::LoadImage+0x199
01a0fd7c 79e7187c ffffffff 00000000 00000000 mscorwks!CLREvent::WaitEx+0x117
01a0fd8c 7a0851cb ffffffff 00000000 00000000 mscorwks!CLREvent::Wait+0x17
01a0fd9c 79f40e96 00000000 13d43bb8 0e874858 mscorwks!SVR::GCHeap::WaitUntilGCComplete+0x32
01a0fdd8 79e7385b 00000001 7a0e607b 00000001 mscorwks!Thread::RareDisablePreemptiveGC+0x1a1
01a0fde0 7a0e607b 00000001 13d43838 00000102 mscorwks!GCHolder<1,0,0>::GCHolder<1,0,0>+0x2d
01a0fe2c 7a0e673e 00000000 7a393704 7a114dea mscorwks!Thread::OnThreadTerminate+0x53
01a0fe38 7a114dea 0e874858 13d4385c 00000000 mscorwks!DestroyThread+0x43
01a0fe94 79f79c4f 00000000 00000000 00000000 mscorwks!ThreadpoolMgr::CompletionPortThreadStart+0x33d
01a0ffb8 77e6608b 000c4c00 00000000 00000000 mscorwks!ThreadpoolMgr::intermediateThreadProc+0x49
01a0ffec 00000000 79f79c09 000c4c00 00000000 kernel32!BaseThreadStart+0x34

so for some reason the GC appears to be taking some time...

There are 2 GC threads (one per logical processor), thread 18 and 19 in this case...  Thread 19 is simply waiting for work

0:019> kb
ChildEBP RetAddr  Args to Child              
01e2fd68 7c822124 77e6bad8 000002d0 00000000 ntdll!KiFastSystemCallRet
01e2fd6c 77e6bad8 000002d0 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
01e2fddc 79e718fd 000002d0 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac
01e2fe20 79e718c6 000002d0 ffffffff 00000000 mscorwks!PEImage::LoadImage+0x199
01e2fe70 79e7187c ffffffff 00000000 00000000 mscorwks!CLREvent::WaitEx+0x117
01e2fe80 7a0d8898 ffffffff 00000000 00000000 mscorwks!CLREvent::Wait+0x17
01e2fea8 7a0d8987 01e2ff00 77e60eb5 01e2fec8 mscorwks!SVR::gc_heap::gc_thread_function+0x58
01e2ffb8 77e6608b 000d5050 00000000 00000000 mscorwks!SVR::gc_heap::gc_thread_stub+0x9b
01e2ffec 00000000 7a0d88eb 000d5050 00000000 kernel32!BaseThreadStart+0x34

But interestingly enough thread 18 is waiting to suspend all managed threads in order to continue the GC (to avoid for anyone to allocate any more data while it is performing the GC)

0:018> kb
ChildEBP RetAddr  Args to Child              
01defb10 7c821524 77e98ef4 00000f64 01defb34 ntdll!KiFastSystemCallRet
01defb14 77e98ef4 00000f64 01defb34 01defe04 ntdll!NtGetContextThread+0xc
01defb24 7a0de046 00000f64 01defb34 00010002 kernel32!GetThreadContext+0x11
01defe04 7a0defc1 00000f64 1069d328 13aa3858 mscorwks!EnsureThreadIsSuspended+0x3f
01defe4c 7a0e290a 00000000 00000000 13aa3874 mscorwks!Thread::SuspendThread+0xd0
01defe9c 7a086e76 00000000 13aa399c 00000000 mscorwks!Thread::SysSuspendForGC+0x5a6
01deff88 7a0d867b 00000001 00000000 000d4368 mscorwks!SVR::GCHeap::SuspendEE+0x16c
01deffa8 7a0d8987 00000000 13aa39ac 01deffec mscorwks!SVR::gc_heap::gc_thread_function+0x3b
01deffb8 77e6608b 000d4368 00000000 00000000 mscorwks!SVR::gc_heap::gc_thread_stub+0x9b
01deffec 00000000 7a0d88eb 000d4368 00000000 kernel32!BaseThreadStart+0x34

Normally suspending all managed threads happens in a matter of nanoseconds, and pretty much the only thing that could cause the process to be stuck while suspending is if some thread has disabled preemptive GC, i.e. told the GC that it is in a state where it can't be disturbed...

Yun Jin describes PreemptiveGC like this in one of his posts 

Preemptive GC: also very important. In Rotor, this is m_fPreemptiveGCDisabled field of C++ Thread class. It indicates what GC mode the thread is in: "enabled" in the table means the thread is in preemptive mode where GC could preempt this thread at any time; "disabled" means the thread is in cooperative mode where GC has to wait the thread to give up its current work (the work is related to GC objects so it can't allow GC to move the objects around). When the thread is executing managed code (the current IP is in managed code), it is always in cooperative mode; when the thread is in Execution Engine (unmanaged code), EE code could choose to stay in either mode and could switch mode at any time; when a thread are outside of CLR (e.g, calling into native code using interop), it is always in preemptive mode.

In our case, as we can see from the output from my tool the following threads have preemptive GC disabled (which is very uncommon)

23 25 33 34 35 38 56 61

In the !threads output it looks like this (notice the PreEmptive GC column)

0:018> !threads
ThreadCount: 59
UnstartedThread: 0
BackgroundThread: 59
PendingThread: 0
DeadThread: 0
Hosted Runtime: yes
                                      PreEmptive   GC Alloc           Lock
       ID OSID ThreadOBJ    State     GC       Context       Domain   Count APT Exception
  16    1 1008 000d0828   1808220 Enabled  06b79310:06b79db0 000f0728     1 MTA (Threadpool Worker)
  20    2 1e7c 000d62d0      b220 Enabled  00000000:00000000 000cd190     0 MTA (Finalizer)
  21    3 1d3c 000db120      1220 Enabled  00000000:00000000 000cd190     0 Ukn
  22    4 1c38 000ed3c8    80a220 Enabled  00000000:00000000 000cd190     0 MTA (Threadpool Completion Port)
  23    5 19ac 0013a520   180b222 Disabled 06c17754:06c18bd8 00158e70     2 MTA (Threadpool Worker)
  24    6 1fd4 0014d1a0      b220 Enabled  00000000:00000000 000f0728     1 MTA
  25    7 1568 0e81aae0   180b222 Disabled 02cea570:02cec4a8 00158e70     2 MTA (Threadpool Worker)
  26    8 1ad0 0e82bd58      b220 Enabled  00000000:00000000 00158e70     0 MTA
  27    9 1f04 0e829310      b220 Enabled  032b0cb8:032b2938 00158e70     0 MTA
  28    a  ffc 0e820d18      b220 Enabled  070173e4:070193c0 00158e70     1 MTA
  14    b 1ee8 0e874858   1800220 Enabled  02b34750:02b34c78 000cd190     0 Ukn (Threadpool Worker)
  30    d 1080 0e8aff40      b220 Enabled  00000000:00000000 0e87c310     0 MTA
  31    e 11ec 0e8d1f28   8801220 Enabled  02b418e4:02b42c78 000cd190     0 MTA (Threadpool Completion Port)
  33    f 1f1c 0e8e6420   180b222 Disabled 06bc1fa0:06bc3dcc 00158e70     2 MTA (Threadpool Worker)
  34   10 1a6c 0e8e8b20   180b222 Disabled 02c449b8:02c45d74 00158e70     2 MTA (Threadpool Worker)
  35   11  710 0e8ee550   180b222 Disabled 06cdde68:06cdea48 00158e70     2 MTA (Threadpool Worker)
  36    c 1ae0 0e875228   180b220 Enabled  072bd88c:072bf508 000cd190     0 MTA (Threadpool Worker)
  37   12 19c8 00147290   180b220 Enabled  06bb4b70:06bb5dcc 000f0728     1 MTA (Threadpool Worker)
  38   13 18f0 0e8901d8   180b222 Disabled 02c49ce8:02c49d74 00158e70     2 MTA (Threadpool Worker)
  39   14 19b8 0e8f7338   180b220 Enabled  073c6fbc:073c6fe8 00158e70     1 MTA (Threadpool Worker)
  40   15 1308 0e8f8610   180b220 Enabled  02af10a4:02af10ac 00158e70     3 MTA (Threadpool Worker)
  41   16 1f38 0e8f99e8   180b220 Enabled  070c7154:070c761c 000cd190     0 MTA (Threadpool Worker)
  42   17 1be0 0e8facc0   180b220 Enabled  06ad3a74:06ad4cc8 000cd190     0 MTA (Threadpool Worker)
  43   18 1efc 0e8fbf98   180b220 Enabled  0321b11c:0321c938 000cd190     0 MTA (Threadpool Worker)
  44   19 1470 0e8fd160   1801220 Enabled  06ef6770:06ef7178 000cd190     0 MTA (Threadpool Worker)
  45   1a  150 0e8fe438   180b220 Enabled  02a15a34:02a161bc 000cd190     0 MTA (Threadpool Worker)
  46   1b 1b10 0e8ff710   180b220 Enabled  06b216b8:06b22d00 000f0728     1 MTA (Threadpool Worker)
  47   1c 1c8c 10690ac0   180b220 Enabled  02be2430:02be3b00 000f0728     1 MTA (Threadpool Worker)
  48   1d 16c8 10691ad8   180b220 Enabled  06b31520:06b3189c 000f0728     1 MTA (Threadpool Worker)
  49   1e 1a5c 10692e20   180b220 Enabled  02ab3a48:02ab50ac 000f0728     1 MTA (Threadpool Worker)
  50   1f 1908 10694168   180b220 Enabled  06b686fc:06b69db0 000f0728     1 MTA (Threadpool Worker)
  51   20  284 106954d0   180b220 Enabled  0319d1e4:0319e900 000cd190     0 MTA (Threadpool Worker)
  52   21 1d74 10696708   180b220 Enabled  06bac634:06baddcc 000f0728     1 MTA (Threadpool Worker)
  53   22  58c 106979c8   180b220 Enabled  02c14784:02c15d58 000f0728     1 MTA (Threadpool Worker)
  54   23 1860 10698d10   180b220 Enabled  072e1984:072e3508 00158e70     1 MTA (Threadpool Worker)
  55   24 1c9c 1069ae00   180b220 Enabled  071a4784:071a5508 000cd190     0 MTA (Threadpool Worker)
  56   25 15c4 1069d328   180b222 Disabled 06bc7174:06bc7dcc 00158e70     2 MTA (Threadpool Worker)
  57   26 1968 106a09e0   180b220 Enabled  0321da48:0321e938 000cd190     0 MTA (Threadpool Worker)
  58   27 1e10 106a3608   180b220 Enabled  070172c8:070173c0 000cd190     0 MTA (Threadpool Worker)
  59   28 1a18 106a6418   180b220 Enabled  02ac036c:02ac10ac 000f0728     1 MTA (Threadpool Worker)
  60   29 1d2c 106a9148   180b220 Enabled  06a9b5a4:06a9cc90 000cd190     0 MTA (Threadpool Worker)
  61   2a 1b50 106ac0a8   180b222 Disabled 06cdaf94:06cdbda8 00158e70     2 MTA (Threadpool Worker)
  62   2b  96c 106ae8d0   8801220 Enabled  02c00980:02c01d3c 000cd190     0 MTA (Threadpool Completion Port)
  63   2c 1b18 106afa50   180b220 Enabled  0322214c:03222938 000cd190     0 MTA (Threadpool Worker)
  64   2d 1d78 106b2d48   180b220 Enabled  06f26e78:06f26eb4 000cd190     0 MTA (Threadpool Worker)
  65   2e 198c 106b5cb8   180b220 Enabled  06ed8418:06ed8c98 000cd190     0 MTA (Threadpool Worker)
  66   2f 1fcc 106b8b28   180b220 Enabled  0728d400:0728d508 00158e70     1 MTA (Threadpool Worker)
  67   30 1958 106bb998   180b220 Enabled  0304e7e8:03050420 000cd190     0 MTA (Threadpool Worker)
  68   31 1b48 106beb58   180b220 Enabled  031a5b98:031a691c 000cd190     0 MTA (Threadpool Worker)
  69   32 1d80 106c19c8   180b220 Enabled  06b13444:06b14d00 000f0728     1 MTA (Threadpool Worker)
  70   33 17cc 106c4838   180b220 Enabled  0700d468:0700f3a4 000cd190     0 MTA (Threadpool Worker)
  71   34 1424 106c8890   180b220 Enabled  06b6f8bc:06b6fdb0 000f0728     1 MTA (Threadpool Worker)
  72   35 1e00 106bd870   180b220 Enabled  06b8dcbc:06b8ddb0 000f0728     1 MTA (Threadpool Worker)
  73   36 1d08 106c95e0   180b220 Enabled  071a081c:071a1508 000cd190     0 MTA (Threadpool Worker)
  74   37 1fdc 106ca680   180b220 Enabled  02c05810:02c05d3c 000cd190     0 MTA (Threadpool Worker)
  75   38 1b8c 106cb930   180b220 Enabled  0335f888:0335f88c 00158e70     2 MTA (Threadpool Worker)
  76   39 1928 106cc900   880b220 Enabled  072bc2f0:072bd508 000cd190     0 MTA (Threadpool Completion Port)
  77   3a 1fd0 106d7270   8801220 Enabled  02c18ca4:02c19d58 000cd190     0 MTA (Threadpool Completion Port)
  78   3b 1640 106d7908   180b220 Enabled  0294474c:02945a7c 000cd190     0 MTA (Threadpool Worker)

Why have these threads disabled preemptive GC, not allowing the GC to suspend them, and ultimately blocking our process?

All the blocked threads are sitting in this type of callstack... trying to enter a lock (JIT_MonTryEnter).  Normally you would see a thread either owning the lock or waiting in an awarelock like in this post, but here it is just spinning trying to enter the lock...  

0:056> kb 2000
ChildEBP RetAddr  Args to Child              
10b0f4d8 0eb65afc 06bc5448 06bc53f8 22a9c796 mscorwks!JIT_MonTryEnter+0xad
WARNING: Frame IP not in any known module. Following frames may be wrong.
10b0f504 69918f30 10b0f560 06bc5448 06bc5448 0xeb65afc
00000000 00000000 00000000 00000000 00000000 System_Web_Services_ni+0x28f30

unfortunately the managed stack is not very helpful in telling us where the lock was taken

0:056> !clrstack
OS Thread Id: 0x15c4 (56)
ESP       EIP     
10b0f8f8 79e73eac [ContextTransitionFrame: 10b0f8f8] 
10b0f948 79e73eac [GCFrame: 10b0f948] 
10b0faa0 79e73eac [ComMethodFrame: 10b0faa0] 

and if we run !syncblk we have no active sync blocks, so no help there when it comes to finding out who owns this lock...

0:056> !syncblk
Index SyncBlock MonitorHeld Recursion Owning Thread Info  SyncBlock Owner
Total           165
CCW             4
RCW             1
ComClassFactory 0
Free            8

seems like we are stuck between a rock and a hard place here...

I have shown the command !dumpstack before... it shows a mixture of a managed and native callstack but it is a raw stack, meaning that it will pretty much just display any addresses on the stack that happen to be pointing to code.  This means that !dumpstack will not give a true stack trace (i.e. like !clrstack and kb where if functionA calls funcitonB we will see functionA directly below functionB in the stack) and anything shown by !dumpstack may or may not be correct.  Anyways, with that little warning, here comes !dumpstack:)

0:056> !dumpstack
OS Thread Id: 0x15c4 (56)
Current frame: mscorwks!JIT_MonTryEnter+0xad
ChildEBP RetAddr  Caller,Callee
10b0f47c 638c7552 (MethodDesc 0x63a539a0 +0x62 System.Xml.Schema.XmlSchemaSet.Add(System.Xml.Schema.XmlSchemaSet)), 
	calling mscorwks!JIT_MonTryEnter
10b0f480 793463bb (MethodDesc 0x7923bbc8 +0xdb System.Collections.Hashtable..ctor(Int32, Single)), 
	calling mscorwks!JIT_Dbl2IntSSE2
10b0f498 638b36f6 (MethodDesc 0x63a505f0 +0xa6 System.Xml.Schema.SchemaInfo..ctor()), 
	calling (MethodDesc 0x7923bbc8 +0 System.Collections.Hashtable..ctor(Int32, Single))
10b0f49c 638b3719 (MethodDesc 0x63a505f0 +0xc9 System.Xml.Schema.SchemaInfo..ctor()), 
	calling mscorwks!JIT_Writeable_Thunks_Buf
10b0f4d8 0eb65afc (MethodDesc 0xeb756f0 +0x5c CustomComponent.Tools.Web.Services.Extensions.ValidationExtension.ProcessMessage(System.Web.Services.Protocols.SoapMessage)), 
	calling (MethodDesc 0x63a539a0 +0 System.Xml.Schema.XmlSchemaSet.Add(System.Xml.Schema.XmlSchemaSet))
10b0f504 69918f30 (MethodDesc 0x699c4798 +0x3c System.Web.Services.Protocols.SoapMessage.RunExtensions(System.Web.Services.Protocols.SoapExtension[], Boolean))
10b0f518 699227b3 (MethodDesc 0x699c5250 +0x2cb System.Web.Services.Protocols.SoapServerProtocol.Initialize()), 
	calling (MethodDesc 0x699c4798 +0 System.Web.Services.Protocols.SoapMessage.RunExtensions(System.Web.Services.Protocols.SoapExtension[], Boolean))

!dumpstack is a bit tricky to deal with, but in the stack above if we start from the bottom of the part of the stack that i've shown we have SoapServerProtocol.Initialize() which calls SoapMessage.RunExtensions...

this we can trust because !dumpstack is actually telling us that Initialize is calling RunExtensions in this case (assuming that we believe that Initialize was called). 

RunExtensions doesn't tell us what it is calling but that is because it is calling into a custom component 

This is in turn calling into XmlSchemaSet.Add which is calling into JIT_MonTryEnter.  So long story short... the real stacktrack should look like this

CustomComponent.Tools.Web.Services.Extensions.ValidationExtension.ProcessMessage(System.Web.Services.Protocols.SoapMessage)System.Web.Services.Protocols.SoapMessage.RunExtensions(System.Web.Services.Protocols.SoapExtension[], Boolean)

!sos.clrstack just wasn't able to rebuild it because of the state it was in

Looking through the output from ~* e !clrstack I find one stack that is currently in XmlSchemaSet.Add and thus is probably the one holding the lock we are trying to enter here

OS Thread Id: 0x1308 (40)
ESP       EIP     
104ce878 7c82ed54 [HelperMethodFrame: 104ce878] 
104ce8d0 638b46bf System.Xml.Schema.SchemaNames..ctor(System.Xml.XmlNameTable)
104cebd4 638ca6fb System.Xml.Schema.XmlSchemaSet.GetSchemaNames(System.Xml.XmlNameTable)
104cebe0 638c96fc System.Xml.Schema.XmlSchemaSet.PreprocessSchema(System.Xml.Schema.XmlSchema ByRef, System.String)
104cebf8 638c86b7 System.Xml.Schema.XmlSchemaSet.Add(System.String, System.Xml.Schema.XmlSchema)
104cec04 638c7667 System.Xml.Schema.XmlSchemaSet.Add(System.Xml.Schema.XmlSchemaSet)
104cec60 0eb65afc CustomComponent.Tools.Web.Services.Extensions.ValidationExtension.ProcessMessage(System.Web.Services.Protocols.SoapMessage)
104cec8c 69918f30 System.Web.Services.Protocols.SoapMessage.RunExtensions(System.Web.Services.Protocols.SoapExtension[], Boolean)
104ceca4 699227b3 System.Web.Services.Protocols.SoapServerProtocol.Initialize()
104cece8 6990d904 System.Web.Services.Protocols.ServerProtocolFactory.Create(System.Type, System.Web.HttpContext, System.Web.HttpRequest, System.Web.HttpResponse, Boolean ByRef)
104ced28 699263ab System.Web.Services.Protocols.WebServiceHandlerFactory.CoreGetHandler(System.Type, System.Web.HttpContext, System.Web.HttpRequest, System.Web.HttpResponse)
104ced64 69926329 System.Web.Services.Protocols.WebServiceHandlerFactory.GetHandler(System.Web.HttpContext, System.String, System.String, System.String)
104ced88 65fc057c System.Web.HttpApplication.MapHttpHandler(System.Web.HttpContext, System.String, System.Web.VirtualPath, System.String, Boolean)
104cedcc 65fd58cd System.Web.HttpApplication+MapHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
104ceddc 65fc1610 System.Web.HttpApplication.ExecuteStep(IExecutionStep, Boolean ByRef)
104cee1c 65fd32e0 System.Web.HttpApplication+ApplicationStepManager.ResumeSteps(System.Exception)
104cee6c 65fc0225 System.Web.HttpApplication.System.Web.IHttpAsyncHandler.BeginProcessRequest(System.Web.HttpContext, System.AsyncCallback, System.Object)
104cee88 65fc550b System.Web.HttpRuntime.ProcessRequestInternal(System.Web.HttpWorkerRequest)
104ceebc 65fc5212 System.Web.HttpRuntime.ProcessRequestNoDemand(System.Web.HttpWorkerRequest)
104ceec8 65fc3587 System.Web.Hosting.ISAPIRuntime.ProcessRequest(IntPtr, Int32)
104cf078 79f35ee8 [ContextTransitionFrame: 104cf078] 
104cf0c8 79f35ee8 [GCFrame: 104cf0c8] 
104cf220 79f35ee8 [ComMethodFrame: 104cf220] 

0:040> kb
ChildEBP RetAddr  Args to Child              
104ce5d4 7c822124 77e6bad8 000002e8 00000000 ntdll!KiFastSystemCallRet
104ce5d8 77e6bad8 000002e8 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
104ce648 79e718fd 000002e8 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac
104ce68c 79e718c6 000002e8 ffffffff 00000000 mscorwks!PEImage::LoadImage+0x199
104ce6dc 79e7187c ffffffff 00000000 00000000 mscorwks!CLREvent::WaitEx+0x117
104ce6ec 7a0d0d0f ffffffff 00000000 00000000 mscorwks!CLREvent::Wait+0x17
104ce710 7a0d5dfa ffffffff 00001037 000d4368 mscorwks!SVR::gc_heap::wait_for_gc_done+0x99
104ce788 7a0d691f 0e8f8650 00000014 00000000 mscorwks!SVR::gc_heap::try_allocate_more_space+0x17
104ce7a8 7a0d7ecc 0e8f8650 00000014 00000000 mscorwks!SVR::gc_heap::allocate_more_space+0x2f
104ce7c8 7a08bd32 0e8f8650 00000014 00000002 mscorwks!SVR::GCHeap::Alloc+0x74
104ce7e4 79e754ff 00000014 00000000 00080000 mscorwks!Alloc+0x60
104ce824 79e755c1 639f59e0 02382edc 02af0090 mscorwks!FastAllocateObject+0x38
104ce8c8 638b46bf 02af0e60 00000000 00000000 mscorwks!JIT_NewFast+0x9e
104ce8cc 02af0e60 00000000 00000000 00000000 System_Xml_ni+0x1146bf
WARNING: Frame IP not in any known module. Following frames may be wrong.
104ce8d0 00000000 00000000 00000000 00000000 0x2af0e60

But unfortunately it can't give up this lock and finish what it is doing because it is waiting for the GC to finish, so effectively we are in a deadlock.

Thread 40 holds a lock but to release it it needs the GC to complete.  The GC can't complete because it can't suspend thread 56 (and other threads) that have preemptive GC disabled.  Thread 56 can't enable preemptive GC until it gets the lock owned by thread 40.

I should add, that the only times I have seen this it has been in the lock in XmlSchemaSet.Add, and under heavy load when multiple threads were trying to access the same XmlSchemaSet.


So what can we do about this?  Well, a hotfix just came out (KB946644) that will fix this problem so if you are running into this issue you can call into support and ask for that hotfix. 

In my earlier cases, the customers have locked around the XmlSchemaSet in the custom component since according to the msdn documentation any instance members of XmlSchemaSet are not guaranteed to be threadsafe.  This resolved the issue...



Comments (14)

  1. ASP.NET ASP.NET Wiki Beta [Via: Scott Hanselman ] Sharepoint Dev Tip: The SharePoint University of…

  2. Link Listing – February 10, 2008

  3. Marty says:

    Tess, is that a CLR hotfix or an XML hotfix? We are seeing the same root problem (gc thread cant suspend due to other threads w/ preemptive disabled waiting on crit secs owned by gc thread) but XML is not in the stacks at all.

  4. Tess says:

    this hotfix is specifically for this issue… you might be running into the gc/loaderlock deadlock described here though


    there are two versions of it, 1. you load up mixed mode (c++) dlls and you either have managed code in dllmain or other managed entry points like static arrays etc.  or 2. you are loading up a dll that references a strong named mixed mode dll and you block in the policy resolution, and the load is done using a native loading method like createobject (something that grabs the loaderlock)

    in case #1 you have to recompile the dll with /noentry and follow the articles referenced in that post to make sure you have no entry points.  in case #2 you should manually load the referenced dll using assembly.load in application_start for example

    Sorry for the convoluted answer, the space is a little short in the comments:)  I can write a post on case #2 soon if you think you are running into that, but hopefully this helps you in the meantime…

  5. Marty says:

    From looking at the stacks, I know 1 thread (#31) is in gc trying to suspend 2 other threads (#17,30) that are in preemptive disabled mode that are trying to enter crit secs owned by the gc thread (#31), hence the deadlock. (note: the non-gc threads waiting on the crit secs are in exception handlers at the time). This is happening in Microsoft CRM 3.0 (which is an asp.net 1.1 app). I am trying to better understand if this can be caused by the crm app code or if that hotfix was for the CLR itself. It looks like an EE crit sec, and my understanding of preemptive is that it can only be set from the CLR unmanaged code. So based on that I am guessing its a CLR bug, but I could be missing something. There is a US MS Case open, would love if you could take a look at the dump I uploaded 🙂 Waiting for TAMs to get thier acts together.

  6. Tess says:

    sounds interesting, the support engineer you are working with probably debugs as much as I 🙂 and is probably specializing in CRM, but send me the case number and support engineers name or email address via the contact me section and I’ll talk to him or her to see if they need a second opinion…

  7. Paul K says:

    Tess, we are having exactly the same issue as described in your article and are also working with MS support to resolve it. I’ll send you the details and would appreciate if you could give the person we’re working with your opinion as they don’t seem to be able to find the hotfix you mentioned in the article and we’re running out of options to try on our side. This is how the stack looks in our case:

    2f8384b4 0a06f23b System_Xml!System.Xml.Schema.SchemaNames..ctor(System.Xml.XmlNameTable)+0x6d1

    0a45e7fc 0a043bf0 System_Xml!System.Xml.Schema.XmlSchemaSet.GetSchemaNames(System.Xml.XmlNameTable)+0x43

    0a45e7fc 0a043b98 System_Xml!System.Xml.Schema.XmlSchemaSet.PreprocessSchema(System.Xml.Schema.XmlSchema ByRef, System.String)+0x20

    0a45e840 0a04f35a System_Xml!System.Xml.Schema.XmlSchemaSet.Add(System.String, System.Xml.Schema.XmlSchema)+0x28

    0a45e840 09f6209a System_Xml!System.Xml.Schema.XmlSchemaSet.Add(System.Xml.Schema.XmlSchemaSet)+0x19a

    As you can see it’s almost the exact replica of what you have in the article.

  8. Tess says:

    Hi Paul,

    The article doesn’t seem to be public yet but the hotfix is available, if they search on the KB 946644,  if they can’t find it they can contact me.


  9. Paul K says:

    Hi Tess,

    It seems like our support person already talked to you and indeed, the hotfix for KB 946644 helped in our case (SP1 by itself didn’t). I can’t say that the situation is completely reolved though as we’re till getting strange errors related to schema validation (102 errors out of 33k+ requests) that look like the following:

    System.Xml.Schema.XmlSchemaValidationException: The ‘Service’ element is not declared. at System.Xml.Schema.XmlSchemaValidator.SendValidationEvent(XmlSchemaValidationException e, XmlSeverityType severity)

    at System.Xml.Schema.XmlSchemaValidator.SendValidationEvent(String code, String arg)

    at System.Xml.Schema.XmlSchemaValidator.ThrowDeclNotFoundWarningOrError(Boolean declFound)

    at System.Xml.Schema.XmlSchemaValidator.ValidateElement(String localName, String namespaceUri, XmlSchemaInfo schemaInfo, String xsiType, String xsiNil, String xsiSchemaLocation, String xsiNoNamespaceSchemaLocation)

    at System.Xml.XsdValidatingReader.ProcessElementEvent()

    at System.Xml.XsdValidatingReader.ProcessReaderEvent()

    at System.Xml.XsdValidatingReader.Read()

    at System.Xml.XmlLoader.LoadNode(Boolean skipOverWhitespace)

    at System.Xml.XmlLoader.LoadDocSequence(XmlDocument parentDoc)

    at System.Xml.XmlLoader.Load(XmlDocument doc, XmlReader reader, Boolean preserveWhitespace)

    at System.Xml.XmlDocument.Load(XmlReader reader)

    Surely enough, the ‘Service’ element IS declared and the same request is executed just fine right before and right after the one that fails. The element may be different in different errors, but it’s always the same error and it’s just few lines of code away from the place where we had problems with XMLSchemaSet.Add() call previously. Can it be possibly related to the hotfix? It appears that the schema object gets corrupted in some rare cases (internally), which causes it to loose information about some of the elements that are properly declared in the schema.

    Also, we do see abnormal GC patterns that were not observed under 1.1 on the same box with the same application (last time tested three weeks ago). This includes large "% time in GC" (even without any external requests), number of Gen 0 to Gen 1 collections is close to 2:1 rather than to recommended 10:1, Gen 2 Heap size is significantly larger than Gen 0 or Gen 1, number of allocated bytes / sec and number of Gen 0 promoted bytes / sec is large even with zero users and no incoming requests and so on. I sent all the details with graphs and correlations to Michael Noto, who is handling our case (SRX080xxxxx0861).

    This difference between 1.1 and 2.0 appears to be somewhat similar to what was reported long time ago by "philippe" in one of the comments to your High CPU in GC article (http://blogs.msdn.com/tess/archive/2006/06/22/643309.aspx). Any insight as to what might be causing it? Thanks much.


  10. Tess says:

    Hi Paul,

    off the top of my head I can’t say that I’ve seen this before, and I am currently on vacation so I don’t have access to my usual resources.

    Based on what I know about the hotfix I don’t think it is related though as the hotfix deals with blocking the thread for GCs but of course you can never be sure if it causes sideeffects.

    You might want to set up a breakpoint and dump on the first occurrence of this  System.Xml.Schema.XmlSchemaValidationException so that you can examine the structures, but I would suggest that it is set up in such a way that the breakpoint is disabled after it’s been hit once if they are very frequent.

  11. I very frequently get emails like the one I got this morning: "Tess, It sounds like the hotfix for kb946644

  12. Mark Weinheimer says:

    I tried to open a support case to get hotfix KB 946644, and was told that this hotfix has been discontinued.  Do you know why this would be the case, or if the problem is being addressed in some other way in a future hotfix?

    It would be nice to try the hotfix, at least for experimental purposes, to see whether the problem it addresses really is the cause of our symptoms.  Is there any other way for us to get a copy?

  13. Tess says:

    I had a look and I can’t see that it has been discontinued,  from the looks of it it will also be included in SP1.  I think the confusion might be in that the kb is not released yet, but the hotfix still seems to be there.  Unfortunately I can’t post the hotfix here, but if you contact your support specialist again they can contact me if they want.

  14. Mikhail Fedarovich says:

    Hi Tess,

    Our team has faced with similar deadlock recently. After several weeks of head's pain we finally found the reason. As far as I didn't find any similar information, I want to share it. Possibly it will help somebody and safe time & nerves.

    It was reproduced on .NET framework v4.0.

    Information about threads prove that we faced with similar situation.

    0:000> !threads

    ThreadCount:      21

    UnstartedThread:  0

    BackgroundThread: 21

    PendingThread:    0

    DeadThread:       0

    Hosted Runtime:   no

                                      PreEmptive   GC Alloc                Lock       ID  OSID ThreadOBJ    State GC           Context       Domain   Count APT Exception

     11    6   d94 0fa8f048      b220 Enabled  6ef8e9c8:6ef8e9c8 005c7b40     0 MTA (GC) System.ExecutionEngineException (03551120)

     19   11  13d0 0f58ca78      b220 Disabled 6ef8d70c:6ef8e9b0 005c7b40     2 MTA

    0:011> !threadpool

    CPU utilization: 81%

    Worker Thread: Total: 9 Running: 1 Idle: 3 MaxLimit: 2047 MinLimit: 4

    Work Request in Queue: 1

       AsyncTimerCallbackCompletion TimerInfo@0b7e9db8


    Number of Timers: 1


    Completion Port Thread:Total: 0 Free: 0 MaxFree: 8 CurrentLimit: 0 MaxLimit: 1000 MinLimit: 4

    Dumps of threads #11 and #19 shows that they use the same instance of one object 3c762ae8. However it is “impossible” because this object is born and die in thread #19. And there is no ways how execution can be split on two threads. Except one. The hole is event that is raised inside thread. Framework send execution of handler to another thread.

    As result thread #19 raise event. Then handler in thread #11 do his work, allocate memory, but memory is fragmented and GC start working in this thread. GC try to stop all threads but can’t do it because thread #19 is already blocked by #11.

    In our case tablet was quite simple – we replaced event by delegate and hole was closed.

    0:011> !dumpstack

    OS Thread Id: 0xd94 (11)

    Current frame: clr!MethodDesc::GetSig+0x3b

    ChildEBP RetAddr  Caller,Callee

    2471efd8 73e0370b clr!Thread::StackWalkFrames+0xc1, calling clr!__security_check_cookie

    2471efe0 73f69583 clr!MethodDesc::ReturnsObject+0x24, calling clr!MetaSig::MetaSig

    2471f058 3c923948 (MethodDesc 3c762ae8 +0x158 SomeNamespace.SomeAlgoObject`1[[System.Double, mscorlib]].DoSmth(…)), calling 3c8ba030

    2471f080 73f6cbf6 clr!Thread::HandledJITCase+0xc4, calling clr!Thread::StackWalkFramesEx

    0:019> !dumpstack

    OS Thread Id: 0x13d0 (19)

    Current frame: 3c8ba034

    ChildEBP RetAddr  Caller,Callee

    2810e4c4 3c923948 (MethodDesc 3c762ae8 +0x158 SomeNamespace.SomeAlgoObject`1[[System.Double, mscorlib]].DoSmth(…), calling 3c8ba030

    2810e53c 3c923695 (MethodDesc 3c762aa8 +0x65 SomeNamespace.SomeAlgoObject`1[[System.Double, mscorlib]].DoWork(…)), calling (MethodDesc 3c762ae8 +0 SomeNamespace.SomeAlgoObject`1[[System.Double, mscorlib]].DoSmth(…))

Skip to main content