The curious case of a Windows Service consuming 3% of CPU time

Hello,

Recently I had a nice case which I thought I should share. My hope is that the steps might be useful for the beginner/intermediate level debugging fans out there.

My customer's complaint was that an in-house developed Windows Service was steadily consuming 3% of CPU time. To start troubleshooting we have created a dump file via Task Manager.

To see for how long the process was alive I've used the ".time" command :

0:000> .time
Debug session time: Thu Nov  3 09:54:01.000 2016 (UTC + 3:00)
System Uptime: 140 days 19:23:12.533
Process Uptime: 0 days 7:54:01.000
Kernel time: 0 days 0:00:35.000
User time: 0 days 7:30:28.000

 

The output states that the process was running for 0 days, 7 hours, 54 minutes and 1 second. Also, the process has consumed 7 hours 31 minutes and 3 second processor time during its lifetime (Kernel Time + User Time).

Let's convert both of the life and CPU-consumption times into seconds. Hence, we have 28441 for Up-Time and 27063 for CPU-Time. CPU-Time divided by Up-Time and multiplied by 100 will result into CPU utilization. This calculates a 95.15 % CPU utilization.

But, in Task Manager each process's CPU usage is displayed as CPU usage with respect to all of the CPUs in the system. Hence, we need to learn the number of cores in this server. We get this information via the "!cupid" command. Let's do it :

0:000> !cpuid
CP  F/M/S  Manufacturer     MHz
0  6,13,7  <unavailable>   2700
1  6,13,7  <unavailable>   2700
2  6,13,7  <unavailable>   2700
3  6,13,7  <unavailable>   2700
4  6,13,7  <unavailable>   2700
5  6,13,7  <unavailable>   2700
6  6,13,7  <unavailable>   2700
7  6,13,7  <unavailable>   2700
8  6,13,7  <unavailable>   2700
9  6,13,7  <unavailable>   2700
10  6,13,7  <unavailable>   2700
11  6,13,7  <unavailable>   2700
12  6,13,7  <unavailable>   2700
13  6,13,7  <unavailable>   2700
14  6,13,7  <unavailable>   2700
15  6,13,7  <unavailable>   2700
16  6,13,7  <unavailable>   2700
17  6,13,7  <unavailable>   2700
18  6,13,7  <unavailable>   2700
19  6,13,7  <unavailable>   2700
20  6,13,7  <unavailable>   2700
21  6,13,7  <unavailable>   2700
22  6,13,7  <unavailable>   2700
23  6,13,7  <unavailable>   2700
24  6,13,7  <unavailable>   2700
25  6,13,7  <unavailable>   2700
26  6,13,7  <unavailable>   2700
27  6,13,7  <unavailable>   2700
28  6,13,7  <unavailable>   2700
29  6,13,7  <unavailable>   2700
30  6,13,7  <unavailable>   2700
31  6,13,7  <unavailable>   2700

There are 32 cores in the computer. A 100% CPU usage of a single core by a process would be showed in the CPU column in Task Manager as 1 / 32 x 100. This equals to 3.125%. However, in this case the service process consumes roughly 95.15% of a single core. Then, we need to find out the 95.15 percent of 3.125. And, this is 3,125 x 0,9515 = 2,9734. Which might be rounded up as 3 % in Task Manager.

This leaves us with the probability of one thread of the process being in a tight loop. Let's verify. The "!runaway" command lists the threads of a process and their CPU usage time. The list contains only the threads those are running at the time when the dump was generated. It does not contain any thread that has executed and exited before the dump was made.

0:000> !runaway
User Mode Time
Thread       Time
  89:e370      0 days 7:29:37.128
34:3fc0      0 days 0:00:05.304
45:9b20      0 days 0:00:02.636
84:8830      0 days 0:00:00.546
93:21340      0 days 0:00:00.390
91:1b3f4      0 days 0:00:00.265
94:21264      0 days 0:00:00.234
59:179f4      0 days 0:00:00.234
75:1edc4      0 days 0:00:00.218
68:20aec      0 days 0:00:00.218

[Truncated the complete output for brevity]

The top consumer is the thread with Debugger ID 89. It has used 7 hours 29 minutes and 37 seconds of CPU time. This is almost equal to the total CPU usage time of the process. Let's switch to that thread :

0:089> ~89s
XYZ_ABC_PQR!XYZ.ABC.PQR.ResourcePool.PoolManagerFunc()+0xb4:
000007fe`99834b14 85c0            test    eax,eax

and see what it is doing :

0:089> !clrstack -n
OS Thread Id: 0xe370 (89)
Child SP               IP Call Site
0000000023d2e590 000007fe99834b14 XYZ_ABC_PQR!XYZ.ABC.PQR.ResourcePool.PoolManagerFunc()
0000000023d2e5f0 000007fe99543cb5 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
0000000023d2e750 000007fe995408b9 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
0000000023d2e780 000007fe995407ef System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
0000000023d2e7d0 000007fe9953c130 System.Threading.ThreadHelper.ThreadStart()
0000000023d2eb38 000007fef8739e03 [GCFrame: 0000000023d2eb38]
0000000023d2ee68 000007fef8739e03 [DebuggerU2MCatchHandlerFrame: 0000000023d2ee68]
0000000023d2f048 000007fef8739e03 [ContextTransitionFrame: 0000000023d2f048]
0000000023d2f238 000007fef8739e03 [DebuggerU2MCatchHandlerFrame: 0000000023d2f238]

There is no function name that suggests any wait. Hence, the thread should be spinning inside XYZ_ABC_PQR!XYZ.ABC.PQR.ResourcePool.PoolManagerFunc(). We need to understand what this function is doing. Let's find the module information first :

0:089> lm
start             end                 module name
00000000`69370000 00000000`693b0000   XYZ_ABC_PQR C (no symbols)

[Truncated the complete output for brevity]

 

And, then write the module into a file.

0:089> !savemodule 00000000`69370000 F:\work\blog\modules\XYZ.ABC.PQR.dll
3 sections in file
section 0 - VA=2000, VASize=38904, FileAddr=200, FileSize=38a00
section 1 - VA=3c000, VASize=3d0, FileAddr=38c00, FileSize=400
section 2 - VA=3e000, VASize=c, FileAddr=39000, FileSize=200

 

Finally, open the saved module XYZ.ABC.PQR.dll within ILSpy (or any .NET reflector tool) and browse the PoolManagerFunc :

private static void PoolManagerFunc()
{
while (ResourcePool.Initialized)
{
if (ResourcePool.availableResources.Count <= ResourcePool.POOL_MIN_SIZE)
{
// Open a resource and add it to the pool. Then, sleep for 50 milliseconds. Removed the actual code!
Thread.Sleep(50);
}
}
}

The code above suggests that if the number of resources in the pool is less than the configured minimum then create a new resource and add it to the pool.

However, if there are at least a minimum number of resources in the pool then the loop "while (ResourcePool.Initialized)" and the control statement "if (ResourcePool.availableResources.Count <= ResourcePool.POOL_MIN_SIZE)"  are executed all the time.

To have an idea of the number of resources in that pool we simply need to dump out the ResourcePool class instance. Since, that class instance is declared as static we have to dump it with the  command "!dumpclass".

Before that we have to find out the class address of XYZ.ABC.PQR.ResourcePool. We will first learn the MethodTable value using the XYZ.ABC.PQR.ResourcePool.PoolManagerFunc function. Let's re-visit the following output :

0:089> !clrstack -n
OS Thread Id: 0xe370 (89)
Child SP               IP Call Site
0000000023d2e590 000007fe99834b14 XYZ_ABC_PQR!XYZ.ABC.PQR.ResourcePool.PoolManagerFunc()

[Truncated the complete output for brevity]

Asking the SOS extension the MethodDescriptor value of the IP (Instruction Pointer) :

0:089> !ip2md 000007fe99834b14
MethodDesc:   000007fe998e80c0
Method Name:  XYZ.ABC.PQR.ResourcePool.PoolManagerFunc()
Class:        000007fe99912808
MethodTable:  000007fe998e8140
mdToken:      00000000060001d3
Module:       000007fe998e2220
IsJitted:     yes
CodeAddr:     000007fe99834a60
Transparency: Critical

So the class address is 000007fe99912808.

0:089> !dumpclass 000007fe99912808
Class Name:      XYZ.ABC.PQR.ResourcePool
mdToken:         0000000002000091
File:            C:\Windows\Microsoft.NET\Framework64\v4.0.30319\Temporary ASP.NET Files\www\c2274130\f70100df\assembly\dl3\0d903943\00c83a5a_1060d101\XYZ.ABC.PQR.ResourcePool.dll
Parent Class:    000007fe990f9730
Module:          000007fe998e2220
Method Table:    000007fe998e8140
Vtable Slots:    4
Total Method Slots:  5
Class Attributes:    181  Abstract,
Transparency:        Critical
NumInstanceFields:   0
NumStaticFields:     b
MT    Field   Offset                 Type VT     Attr            Value Name
[Truncated the complete output for brevity]

000007fe998e81d0  400085f      110 XYZ.ABC.PQR]]  0   static 00000006bf2e6880 availableResources

000007fe990df9e8  4000862       f4         System.Int32  1   static               20 POOL_MIN_SIZE
000007fe990df9e8  4000863       f8         System.Int32  1   static              500 POOL_MAX_SIZE
000007fe990dc918  4000865      101       System.Boolean  1   static                1 Initialized

 

Then dump the "availableResources" member's content :

0:089> !DumpObj 00000006bf2e6880
Name:        System.Collections.Generic.Queue`1[[XYZ.ABC.PQR.CustomResource, XYZ.ABC.PQR]]
MethodTable: 000007fe998e81d0
EEClass:     000007fe9990e458
Size:        48(0x30) bytes
File:        C:\Windows\Microsoft.Net\assembly\GAC_MSIL\System\v4.0_4.0.0.0__b77a5c561934e089\System.dll
Fields:
MT    Field   Offset                 Type VT     Attr            Value Name
000007fe990dbd20  40007fa        8      System.Object[]  0 instance 000000083f30bbe0 _array
000007fe990df9e8  40007fb       18         System.Int32  1 instance                1 _head
000007fe990df9e8  40007fc       1c         System.Int32  1 instance               23 _tail
000007fe990df9e8  40007fd       20         System.Int32  1 instance               22 _size
000007fe990df9e8  40007fe       24         System.Int32  1 instance           108510 _version
000007fe990d63a8  40007ff       10        System.Object  0 instance 0000000000000000 _syncRoot
000007fe990dbd20  4000800        8      System.Object[]  0   static  <no information>

Hence, there are currently 22 resources in the pool. The code fragment that is responsible for new resource creation, pool addition and sleeping will only execute if the number of resources drops below the minimum (which is 20, right now) else the whole time the thread will execute a tight loop. And, that will cause the 3% CPU usage.

To fix the problem the implementation of XYZ.ABC.PQR.ResourcePool.PoolManagerFunc() should be changed such that a new resource is added when the number of resources drops below the lower threshold and not polling the number of resources in a tight loop.

Good luck!

 

REFERENCES

.time (Display System Time) !cupid !runaway SOS.dll (SOS Debugging Extension) How to create a user-mode process dump file in Windows Vista and in Windows 7