The mysterious case of the 1 minute pauses

I spend a lot of time debugging issues on our service - we have a big team, and periodically we hit some really weird issues that get escalated to me. In this particular case, we deployed our new sprint, full of exciting features, to our dogfood server, and after a day or two, people started complaining of sluggishness.

A quick check of the dashboard showed our availability was slowly getting worse:

Our DRIs had been tracking this issue but couldn't get a usable PerfView log, so I got called in to help out - and I saw the same behavior, i.e. the ETL traces that we auto-collect were not helpful, mostly because they either didn't show high CPU or because the symbols were messed up - the dreaded ?!? in PerfView (which happens sometimes with dynamically generated code, like XML serializers or compiled regexes) - but this was worse than usual.

So I hopped onto the machines and captured a more detailed log of requests/sec and queued requests:



And lo and behold, we have a one-minute gap where nothing happens, followed by a massive increase in queued requests, and then everything goes back to normal. I was pretty sure I was onto something, so I figured I had to get a trace during that exact minute. So I fired up PerfView and waited - once I saw requests/sec plummet to zero, I clicked "Collect", and this is what I saw in the CPU Stacks view - just one thread doing work (and a lot of it - basically burning an entire core):


That was still puzzling - we have one thread getting called by an ETW callback and doing rundown (rundown is basically the process that tells the CLR to dump type information into the ETW stream so that profilers can make sense of things like dynamic types/assemblies) and absolutely nothing else going on... I decided a dump might be helpful, so I waited again and collected a reflected process dump with ProcDump.
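For context, rundown isn't something your code calls directly - it's triggered by enabling the CLR's rundown ETW provider, which tools like PerfView do at the end of a collection so that addresses in the trace can be mapped back to names. A rough sketch of what that trigger looks like, assuming the Microsoft.Diagnostics.Tracing.TraceEvent NuGet package (the session name is hypothetical, and this requires Windows and elevation):

```csharp
using Microsoft.Diagnostics.Tracing.Session;

class RundownSketch
{
    static void Main()
    {
        // Hypothetical session name; creating an ETW session requires admin.
        using (var session = new TraceEventSession("DemoRundownSession"))
        {
            // Enabling the rundown provider asks the CLR to enumerate its
            // loaded modules, types, and methods into the ETW stream. With
            // tens of thousands of dynamic assemblies in the process, that
            // enumeration is the work burning an entire core here.
            session.EnableProvider("Microsoft-Windows-DotNETRuntimeRundown");
        }
    }
}
```

The provider name is the CLR's real rundown provider; everything else here is a sketch, not how our profiler is actually wired up.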

I opened it up and saw tons of these types of threads:


Except for one thread whose stack didn't fully resolve but did match the thread ID from my PerfView trace above (i.e. the rundown thread):

clr!operator delete[]+0x20

So here's what we know so far:

  • All threads are stuck waiting for a GC
  • One thread is burning an entire core doing rundown, enumerating all of the types in the process.

The questions that remain are:

  1. Why is the rundown taking so long?
  2. Why is it holding up the GC threads?
  3. Why is there a rundown at all?

Since I have a process dump, I am going to look at the types in there:

0:092> !DumpDomain
normal output followed by thousands of these guys:
Assembly: 0000025879e43270 (Dynamic) []
ClassLoader: 000002587c924da0
SecurityDescriptor: 0000025879cd5220
 Module Name
00007ff939994ea8 Dynamic Module

Assembly: 0000025879e44230 (Dynamic) []
ClassLoader: 000002587c924e50
SecurityDescriptor: 0000025879cd5d60
 Module Name
00007ff939995b08 Dynamic Module

So that confirms the runaway-dynamic-types theory - now we just have to figure out what they are and what is causing the problem - so I pick a couple of random modules and dump them (I also notice there's a handy -MT option to !DumpModule):

0:092> !DumpModule -MT 00007ff93bc7f7a0
Name: Unknown Module
Attributes: Reflection Supports
Types defined in this module
MT TypeDef Name
00007ff93bc7fed0 0x02000002 <Unloaded Type>
00007ff93bcc0068 0x02000003 <Unloaded Type>
00007ff93bcc0150 0x02000004 <Unloaded Type>
00007ff93bcc0260 0x02000005 <Unloaded Type>
00007ff93bcc03b0 0x02000006 <Unloaded Type>

Then I dump the types themselves and, lo and behold, it's a bunch of XML serializers:

0:092> !DumpMT -MD 00007ff93bcc03b0
EEClass: 00007ff93bcbdac8
Module: 00007ff93bc7f7a0
Name: <Unloaded Type>
mdToken: 0000000002000006
File: Unknown Module
BaseSize: 0x28
ComponentSize: 0x0
Slots in VTable: 12
Number of IFaces in IFaceMap: 0
MethodDesc Table
 Entry MethodDesc JIT Name
00007ff987692c20 00007ff9871f7de0 PreJIT System.Object.ToString()
00007ff987696f40 00007ff9871f7de8 PreJIT System.Object.Equals(System.Object)
00007ff9876960a0 00007ff9871f7e10 PreJIT System.Object.GetHashCode()
00007ff9876bb7b0 00007ff9871f7e28 PreJIT System.Object.Finalize()
00007ff93b0f1e58 00007ff93bcc0358 NONE Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializerContract.get_Reader()
00007ff93b0f2280 00007ff93bcc0360 JIT Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializerContract.get_Writer()
00007ff93b0f1e68 00007ff93bcc0368 NONE Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializerContract.get_ReadMethods()
00007ff93b0f22c0 00007ff93bcc0370 JIT Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializerContract.get_WriteMethods()
00007ff93b0f1e78 00007ff93bcc0378 NONE Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializerContract.get_TypedSerializers()
00007ff93b0f1e80 00007ff93bcc0380 NONE Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializerContract.CanSerialize(System.Type)
00007ff93b0f1e88 00007ff93bcc0388 NONE Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializerContract.GetSerializer(System.Type)
00007ff93b0f2260 00007ff93bcc0390 JIT Microsoft.Xml.Serialization.GeneratedAssembly.XmlSerializerContract..ctor()

But wait, aren't those serializers supposed to be cached? Well, yes and no: in this case, we're using one of the XmlSerializer constructor overloads that does not use the built-in cache, so a new serializer - and a new dynamic assembly - is created every time we serialize, which means we're generating a new assembly every couple of seconds...
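The XmlSerializer documentation calls this out: only the XmlSerializer(Type) and XmlSerializer(Type, String) constructors cache the generated assembly; every other overload emits a fresh dynamic assembly per construction, and those assemblies are never unloaded. Here's a minimal sketch of the pitfall and the kind of fix we applied - cache the serializer yourself (the Order type and SerializerCache helper are hypothetical, invented for illustration):

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Xml.Serialization;

public class Order { public int Id; }

public static class SerializerCache
{
    // Overloads beyond XmlSerializer(Type) and XmlSerializer(Type, String)
    // bypass the built-in cache: every construction emits a new dynamic
    // assembly. If you need one of those overloads, reuse the instance.
    private static readonly ConcurrentDictionary<Type, XmlSerializer> Cache =
        new ConcurrentDictionary<Type, XmlSerializer>();

    public static XmlSerializer For(Type type) =>
        Cache.GetOrAdd(type, t =>
            new XmlSerializer(t, new XmlRootAttribute(t.Name.ToLowerInvariant())));
}

public class Program
{
    public static void Main()
    {
        // The leaky pattern: constructing this per request generates a new
        // assembly every time - exactly what filled our dump with
        // thousands of dynamic modules:
        // var leaky = new XmlSerializer(typeof(Order), new XmlRootAttribute("order"));

        // The fix: one serializer per type, reused for every request.
        var s1 = SerializerCache.For(typeof(Order));
        var s2 = SerializerCache.For(typeof(Order));
        Console.WriteLine(ReferenceEquals(s1, s2));

        using var w = new StringWriter();
        s1.Serialize(w, new Order { Id = 42 });
        Console.WriteLine(w.ToString().Contains("<Id>42</Id>"));
    }
}
```

If you don't actually need a custom root element or other extra metadata, the even simpler fix is to switch to the plain XmlSerializer(Type) overload and let the built-in cache do its job.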

But why does the CLR rundown block the GC? Well, that was a bug, and it was fixed last fall! We just hadn't picked up the new build yet.

But wait, why are we even doing rundown? That's because we're running a super sweet background profiler that wakes up every so often and collects some data that gets aggregated later.



Comments (4)

  1. Mike DePouw says:

    Very interesting, thanks for sharing! Wish I could go that deep!

    If it was a bug in the CLR, why wasn’t it always a problem? Why did it just start?

    What was the fix? Upgrade the .NET version? I assume there would be concerns because that could break other stuff and would require more intense testing?

    1. We fixed the root cause by reusing the XmlSerializers – we will eventually get the new version of the CLR.

  2. Dave Rendón says:

    Very useful, thank you, do you have any best practices for enabling on-premises server profiler? Seems you can download support code for on-premises servers but MS is not responsible for maintaining that code, or for responding to issues and feature requests related to the code.

    1. I don’t really have any tips there – we don’t have it enabled on our on-premises servers either, because we don’t have the right telemetry pipeline in place to monitor it.
