I had an interesting issue reported where the instance count would grow from 10 to 50 over the course of a month, even though the load and the number of users stayed the same. This was perplexing for the customer’s Azure application developers, so they reported the issue to Microsoft Azure Support.
In the initial analysis we found that they were increasing the instance count manually, because the existing instances would reach about 90% CPU, plateau there, and become more or less unresponsive. So it was relatively simple to isolate the cause of the growing instance count. The crucial thing was to find the cause of the high CPU.
High CPU on an instance is generally caused by the application code, so we took memory dumps on one of the instances while it was in the high-CPU state. The analysis follows the approach I used for the situation described in this blog post: http://blogs.msdn.com/b/cie/archive/2013/11/28/windows-azure-worker-role-showing-high-cpu.aspx
Process to collect dumps
a) RDP to the instance running the Cloud Service.
b) From Task Manager, check which process is consuming the most CPU and staying there without coming down.
c) Right-click the process and collect a full crash dump; repeat this every minute for about 5 minutes, so that you have 5 dump files.
d) Once you have the files, you can analyze them yourself or create a ticket with Microsoft for an engineer to help with the analysis.
In this case it was the W3WP process, so we collected process dumps of it. In the first 2 memory dumps I didn’t find any high CPU.
I am not delving into the details of how I did the analysis, as it merits a separate discussion. In all the dumps I saw the CPU moving between 81% and 100%. Most of the calls that are stuck are as follows:
The common function in all the call stacks is
MoviePlayer.TableStorage.GetListRangeEntity[[System.__Canon, mscorlib]](System.Collections.Generic.List`1<MoviePlayer.QueryTableStorage>, System.String)
It is executing the following code
The collection passed is as below.
0000000f58a9f1f0 System.Collections.Generic.List`1[[MoviePlayer.CategoryList, MoviePlayerDistList]]
The collection is as follows and contains 726 objects
MT                Field    Offset  Type           VT  Attr      Value             Name
00007ff96af01250  4000cd1  8       System.Object  0   instance  0000000f58b9a7c8  _items
00007ff96af037c8  4000cd2  18      System.Int32   1   instance  726               _size
00007ff96af037c8  4000cd3  1c      System.Int32   1   instance  726               _version
00007ff96af011b8  4000cd4  10      System.Object  0   instance  0000000000000000  _syncRoot
Looking at the size of this object.
sizeof(0000000f58b40940) = 137048 (0x21758) bytes (MoviePlayer.CategoryList)
This MoviePlayer.CategoryList object is larger than 85,000 bytes. Objects of 85,000 bytes or more are not allocated on the small object heap (SOH) but on the Large Object Heap (LOH). Details around LOH and GC can be found in the articles
The process uptime in the first dump is 1:45:31.000 = 6331 seconds. The number of Gen 2 garbage collections in that time is very high: with 3,463 collections, a Gen 2 collection was attempted roughly every two seconds. The LOH is collected only as part of a Gen 2 collection, so frequent large-object allocations force these expensive full collections.
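As a quick check of the size reported above, the following Python sketch converts the hexadecimal value from the debugger output and compares it with the 85,000-byte threshold. The names `LOH_THRESHOLD_BYTES` and `goes_to_loh` are illustrative and not part of any .NET API, and the sketch takes the reported size at face value as the size of a single allocation.

```python
# Size at or above which the .NET runtime places an allocation on the
# Large Object Heap (LOH). The constant name is illustrative.
LOH_THRESHOLD_BYTES = 85_000


def goes_to_loh(size_bytes: int) -> bool:
    """Return True if an allocation of this size would land on the LOH."""
    return size_bytes >= LOH_THRESHOLD_BYTES


# Hex size copied from the debugger output: 137048 (0x21758) bytes.
reported_size = 0x21758

print(f"reported size: {reported_size} bytes")
print(f"on the LOH:    {goes_to_loh(reported_size)}")
print(f"over by:       {reported_size - LOH_THRESHOLD_BYTES} bytes")
```

Under that assumption the object exceeds the threshold by roughly 52 KB, so trimming a few fields would not be enough to move it back to the small object heap.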
.NET CLR Memory
Bytes in All Heaps 84,495,440
GEN 0 Collections 55,143
GEN 1 Collections 13,746
GEN 2 Collections 3,463
# Induced GCs 0
# of Pinned Objects 2
Sync Blocks in use 121
Finalization Survivors 0
Total Committed Bytes 502,095,872
Total Reserved Bytes 18,253,578,240
GEN 0 Heap Size 26,423,840
GEN 1 Heap Size 2,989,384
GEN 2 Heap Size 52,896,880
LOH Size 2,185,336
% Time in GC 7.90%
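The collection rate implied by these counters can be worked out from the process uptime. The Python sketch below is only arithmetic over the values listed in this post; the variable and function names are illustrative.

```python
from datetime import timedelta

# Process uptime from the first dump: 1:45:31.
uptime = timedelta(hours=1, minutes=45, seconds=31)
uptime_seconds = uptime.total_seconds()

# Collection counts copied from the .NET CLR Memory counters above.
collections = {"gen0": 55_143, "gen1": 13_746, "gen2": 3_463}


def seconds_per_collection(uptime_s: float, count: int) -> float:
    """Average time between collections over the process lifetime."""
    return uptime_s / count


for gen, count in collections.items():
    interval = seconds_per_collection(uptime_seconds, count)
    print(f"{gen}: one collection every {interval:.2f} s on average")

# Gen 0 and Gen 1 collections per Gen 2 collection.
gen2 = collections["gen2"]
print(f"gen0 per gen2: {collections['gen0'] / gen2:.1f}")
print(f"gen1 per gen2: {collections['gen1'] / gen2:.1f}")
```

The Gen 2 interval comes out just under two seconds, which matches the observation that a full collection was being attempted about every two seconds.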
So the action plan for this issue was to reduce the size of the MoviePlayer.CategoryList object. Since most developers outside of support engineering roles are not familiar with post-mortem analysis, the following could be used to find the size of the object using .NET or Visual Studio
After the suggestions were implemented, CPU usage grew linearly with load rather than exponentially. The CPU stopped hitting and staying above 90%, so there was no need to spawn additional instances of the role to serve users.
I hope this article helps in understanding one of the fundamental causes of frequent GC leading to high CPU. It is not specific to Azure; it can happen in on-premises applications as well.
Cloud Integration Engineer