I was looking at a memory dump recently for an issue where the process would grow to over 1 GB and return OutOfMemory Exceptions.
Debugging the issue:
A full user dump of the w3wp.exe process was taken with debug diag 1.1 when memory was really high.
If you are reading this and have no clue what debug diag is, how to get dumps with it, or use leak tracking, here is the download and the usage instructions.
The dump itself was 1.34 GB which is basically the memory we have committed (used) so it was pretty big, and the first thing we need to check for in this case is whether we are dealing with an issue with high .NET GC memory usage (lots of .net objects), .NET loader heap usage (lots of dlls) or native memory usage (native heaps or dlls)
I opened the dump in windbg and loaded sos (.loadby sos mscorwks) to take a peak at the managed memory
Number of GC Heaps: 8
GC Heap Size 0xffd12a8(268,243,624)
!eeheap –gc showed that the size of the .net GC heaps was only 268 MB so at first sight it didn’t really look like a .NET leak, which left us with a loader heap issue or a native leak.
——————– Usage SUMMARY ————————–
TotSize ( KB) Pct(Tots) Pct(Busy) Usage
3521d000 ( 870516) : 41.51% 46.09% : RegionUsageIsVAD
cb9d000 ( 208500) : 09.94% 00.00% : RegionUsageFree
8018000 ( 131168) : 06.25% 06.95% : RegionUsageImage
8c00000 ( 143360) : 06.84% 07.59% : RegionUsageStack
8c000 ( 560) : 00.03% 00.03% : RegionUsageTeb
2d58f000 ( 742972) : 35.43% 39.34% : RegionUsageHeap
0 ( 0) : 00.00% 00.00% : RegionUsagePageHeap
1000 ( 4) : 00.00% 00.00% : RegionUsagePeb
1000 ( 4) : 00.00% 00.00% : RegionUsageProcessParametrs
1000 ( 4) : 00.00% 00.00% : RegionUsageEnvironmentBlock
Tot: 7fff0000 (2097088 KB) Busy: 73453000 (1888588 KB)
!address –summary shows that most of the memory is used by RegionUsageIsVAD and RegionUsageHeap, i.e. virtual allocs and heap allocs. Now .NET memory goes into RegionUsageIsVAD but with this much RegionUsageHeap I suspected a native leak or high native memory usage. In order to figure out what was causing this high NT Heap usage I asked for a debug diag dump with leak tracking enabled.
Leak tracking in debug diag will basically track any allocations and de allocations you make, and at the end of the run you will end up with a dump containing information about what stacks allocated the memory currently in use in the dump.
In this case they had already ran debug diag with leak tracking enabled which was great, but I will show you a few gotchas with leak tracking that they had unfortunately ran into.
When you have used leak tracking, you need to run the dump through advanced analysis-Memory preassure analysis in debug diag to parse the leak tracking results. Debug diag will then give you information at the top about the biggest offenders, along with some heap usage information and sample stacks for allocations.
In this case debug diag was showing the following:
|Warning||gdiplus.dll is responsible for 35.86 MBytes worth of outstanding allocations. The following are the top 2 memory consuming functions:
gdiplus!GpMemoryBitmap::AllocBitmapData+4b: 35.37 MBytes worth of outstanding allocations.
gdiplus!GpMalloc+12: 496.51 KBytes worth of outstanding allocations.
|If this is unexpected, please contact the vendor of this module, Microsoft Corporation, for further assistance with this issue.|
|Warning||UNKNOWN_MODULE is responsible for 30.11 MBytes worth of outstanding allocations. The following are the top 2 memory consuming functions:
0x1aab1569: 30.11 MBytes worth of outstanding allocations. </>
|If this is unexpected, please contact the vendor of this module for further assistance with this issue|
Ok, so 35 MB worth of GDI+ objects and 30 MB from some unknown module. But we are dealing with a ~1.4 GB dump so this seem like a tiny drop of water in a huge ocean.
A couple of questions may appear in your head here (at least they do in mine when I see this)
Why didn’t the report show more allocations?
What is the GDI+ stuff?
and what the heck is the Unknown module?
Why didn’t the report show more allocations?
If we look at the report we first have a section with machine name, operating system, process name etc. Here we also have data about the process uptime, which in this case is ~17 hrs. So this memory (1.4 GB) we have in the process was built up over 17 hours.
System Up-Time 3 day(s) 06:47:23
Process Up-Time 17:15:10
If we then scroll down to the Leak Analysis report section we find
Number of allocations 14,959 allocations
Total outstanding handle count 180 handles
Total size of allocations 88.38 MBytes
Tracking duration 00:07:01
So this means that leak tracking has only been turned on for 7 minutes out of the 17 hrs that the process was up. During the time leak tracking was turned on we allocated 88.38 MB and 180 handles. Debug diag will not be able to report “leaks” that happened before leak tracking was turned on, so these 35 MB for GDI+ should be compared to the 88 MB of total allocations.
In reality, given that leak tracking was turned on for such a short amount of time we can’t really affirmatively say much at all about the results here since the size of the allocations are so small compared to the size of the process. We would need to rerun debug diag leak tracking, preferably from the start of the process to say anything for sure.
But, if we ignore this and extrapolate freely, assuming that the allocation pattern is uniform, we can assume that a lot of the memory in the dump is GDI+ items, and a lot comes from this “Unknown module”. I wouldn’t recommend this as proof, but it’s worth a try spinning on this thread to see if we can get any more info, until we get a new memory dump with leak tracking turned on for a longer time.
What is the GDI+ stuff?
Well, looking at the message debug diag gave us, it seems like most of the GDI+ objects come from this method: gdiplus!GpMemoryBitmap::AllocBitmapData. The name implies that we are creating a bitmap, which can either be done explicitly from some native component, or indirectly by creating a System.Drawing.Bitmap.
Using System.Drawing objects from a service is unsupported because of the nature of such applications (i.e. running under the context of a user context that might not be logged, and the potential for desktop heap exhaustion)
Classes within the System.Drawing namespace are not supported for use within a Windows or ASP.NET service. Attempting to use these classes from within one of these application types may produce unexpected problems, such as diminished service performance and run-time exceptions.
But…, I am not going to harp on the unsuitability of System.Drawing objects in a server app. What I would say though, is that if you choose to go this unsupported route, you need to make extremely sure that you got your ducks in a row and dispose of your System.Drawing objects appropriately to avoid handle leaks, desktop heap exhaustion etc.
Going back to the dump in windbg, we can take a look at the heap to see what types and how many System.Drawing objects we have here
MT Count TotalSize Class Name
7ae7a658 1 12 System.Drawing.SR
7ae73bf0 1 16 System.Drawing.SizeF
7ae73ae8 1 16 System.Drawing.Size
7ae73cf4 1 24 System.Drawing.Rectangle
7ae74194 4 48 System.Drawing.Printing.PrintingPermission
7ae73e6c 3 48 System.Drawing.Point
7ae7a28c 4 64 System.Drawing.ToolboxBitmapAttribute
7ae84314 13 208 System.Drawing.TextureBrush
7ae78784 12 288 System.Drawing.Imaging.ImageFormat
7ae7580c 22 704 System.Drawing.Graphics
7ae778e8 27 972 System.Drawing.Pen
7ae78308 299 5980 System.Drawing.Internal.GPStream
7ae78190 356 8544 System.Drawing.Bitmap
7ae75d7c 389 9336 System.Drawing.Color
7ae7747c 593 21348 System.Drawing.SolidBrush
7ae75684 1424 28480 System.Drawing.FontFamily
7ae742b0 1424 56960 System.Drawing.Font
Total 4574 objects
Among other things we can see that we have 356 Bitmaps, so since those are still in memory, unless they are disposed they will still hold on to a lot of native GDI+ mem. Well how much will of course depend on how big the respective bitmaps are.
We’ll get back to why these are sticking around in a moment.
What the heck is the Unknown module?
If you see unknown modules in debug diag, it means that debug diag couldn’t resolve the symbols for the module. This often happens if it is a .net method.
Clicking on the unknown module in the report header takes us to some sample stacks like this one. So some kind of data related stuff.
System.Data.Common.DataAdapter.FillFromReader(System.Data.DataSet, System.Data.DataTable, System.String, System.Data.ProviderBase.DataReaderContainer, Int32, Int32, System.Data.DataColumn, System.Object)
System.Data.Common.DataAdapter.Fill(System.Data.DataSet, System.String, System.Data.IDataReader, Int32, Int32)
System.Data.Common.DbDataAdapter.FillInternal(System.Data.DataSet, System.Data.DataTable, Int32, Int32, System.String, System.Data.IDbCommand, System.Data.CommandBehavior)
Let’s go back to windbg to figure out why these bitmaps are still around
Address MT Size
00e4d510 7ae78190 24
00e4d9a4 7ae78190 24
00e7d4a8 7ae78190 24
0106d20c 7ae78190 24
01074bf0 7ae78190 24
01074d1c 7ae78190 24
0108e08c 7ae78190 24
0108f3b4 7ae78190 24
0109185c 7ae78190 24
01091874 7ae78190 24
0:000> !gcroot 00e4d510
The bitmap that I happened to pick and do !gcroot on (to figure out why it was not collected), was rooted in the finalize queue, meaning that it is ready for finalization but waiting for the finalizer to finalize it.
The other bitmaps follow the exact same pattern, so the next step is to figure out what the finalizer is doing and figure out what is stopping it from finalizing these bitmaps.
Hosted Runtime: no
PreEmptive GC Alloc Lock
ID OSID ThreadOBJ State GC Context Domain Count APT Exception
1 1 190c 0015b598 8808220 Enabled 00000000:00000000 0014b380 0 Ukn (Threadpool Completion Port)
14 2 257c 0016c6d8 b220 Enabled 00000000:00000000 0019c920 0 MTA (Finalizer)
eax=00166878 ebx=00000000 ecx=00167208 edx=00000000 esi=40000ee0 edi=00000000
eip=77f88f13 esp=19a0f12c ebp=19a0f19c iopl=0 nv up ei pl nz na po nc
cs=001b ss=0023 ds=0023 es=0023 fs=0038 gs=0000 efl=00000202
77f88f13 c20c00 ret 0Ch
OS Thread Id: 0x257c (14)
19a0f444 77f88f13 [InlinedCallFrame: 19a0f444] System.Data.Common.UnsafeNativeMethods.OCIHandleFree(IntPtr, HTYPE)
19a0f440 6589a74c System.Data.OracleClient.TracedNativeMethods.OCIHandleFree(IntPtr, HTYPE)
19a0f480 658717b7 System.Data.OracleClient.OciHandle.Release()
19a0f4b4 6587183d System.Data.OracleClient.OciHandle.ReleaseHandle()
19a0f6c8 79e88f63 [GCFrame: 19a0f6c8]
19a0f8c0 79e88f63 [GCFrame: 19a0f8c0]
19a0f92c 79e88f63 [HelperMethodFrame_1OBJ: 19a0f92c] System.Runtime.InteropServices.SafeHandle.InternalFinalize()
19a0f984 7934dc52 System.Runtime.InteropServices.SafeHandle.Dispose(Boolean)
19a0f988 7934db81 System.Runtime.InteropServices.SafeHandle.Finalize()
19a0fc38 79f7b546 [ContextTransitionFrame: 19a0fc38]
19a0fd14 79f7b546 [GCFrame: 19a0fd14]
19a0f128 77f8f295 NTDLL!NtWaitForSingleObject+0xb
19a0f19c 77f87f26 NTDLL!RtlpWaitForCriticalSection+0x9e
19a0f1a4 6111e091 NTDLL!RtlEnterCriticalSection+0x46
WARNING: Stack unwind information not available. Following frames may be wrong.
19a0f1b4 6067819d ORACORE9!sltsmna+0x11
19a0f3fc 60677e82 oraclient9!kpufhndl0+0x30d
19a0f410 6060121a oraclient9!kpufhndl+0x12
19a0f420 25f9c85f oraclient9!OCIHandleFree+0x1a
19a0f430 6589a74c oci!OCICharSetToUnicode+0xaaae
19a0f478 658717b7 System_Data_OracleClient_ni!_load_config_used+0x39554
19a0f4ac 6587183d System_Data_OracleClient_ni!_load_config_used+0x105bf
19a0f4c0 79e88ee4 System_Data_OracleClient_ni!_load_config_used+0x10645
19a0f540 79e88e31 mscorwks!CallDescrWorkerWithHandler+0xa3
19a0f678 79e88d19 mscorwks!MethodDesc::CallDescr+0x19c
19a0f690 79eeea68 mscorwks!MethodDesc::CallTargetWorker+0x20
19a0f6b0 79f01004 mscorwks!MethodDescCallSite::Call_RetBool+0x1c
19a0f7b0 79f00f8a mscorwks!SafeHandle::RunReleaseMethod+0x75
19a0f8b0 79f00eb6 mscorwks!SafeHandle::Release+0xf2
19a0f8dc 79f357b7 mscorwks!SafeHandle::Dispose+0x2b
19a0f97c 7934dc52 mscorwks!SafeHandle::Finalize+0xab
19a0f9ac 79f7b546 mscorlib_ni+0x28dc52
Although I don’t have the oracle symbols so the oracore9, oraclient9 and oci methods are incorrect, we can see that the finalizer is currently trying to finalize an OciHandle, and in order to do so it needs to enter a critical section which is currently locked.
Until it has finalized this OciHandle it can’t finalize anything else, so our Bitmaps can’t be garbage collected.
If we want to go a bit further we can find out who owns the critical section and have a look at what it’s doing.
CritSec +40000ee0 at 40000EE0
!locks will give us a list of all the locks in the process and any locks that have a lock count higher than 0 are currently locked. In this case, the only one that fits that criteria is the one above, so this is the critsec that we are waiting for.
It is owned by the thread with the OSID 1ca0 so we can move to this thread to find out what it is working on.
eax=00000380 ebx=ffffffff ecx=29820e2c edx=00000000 esi=7fffffff edi=00000102
eip=77f88f13 esp=20b798dc ebp=20b79914 iopl=0 nv up ei ng nz ac pe cy
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000297
77f88f13 c20c00 ret 0Ch
20b798d8 74fd1394 NTDLL!NtWaitForSingleObject+0xb
20b79914 74fd160d msafd!SockWaitForSingleObject+0x1a8
20b79990 750313fa msafd!WSPRecv+0x1e9
20b799cc 61915574 ws2_32!WSARecv+0x8a
WARNING: Stack unwind information not available. Following frames may be wrong.
20b79a00 61912f74 orantcp9!nttini+0x4554
20b79a28 615050a3 orantcp9!nttini+0x1f54
20b79a94 614d68b0 oran9!ztvull+0x17263
20b7a2fc 614c12c8 oran9!nsrdr+0x140
20b7a3ac 614905ba oran9!nsdo+0xd38
20b7a3f4 60538544 oran9!nioqrc+0x3ea
20b7ad18 61493acd oracommon9!ttcdrv+0x404
20b7ad30 606ba7db oran9!nioqwa+0x2d
20b7cde0 606b06fd oraclient9!upicui2+0x75b
20b7ce14 6069e727 oraclient9!upirtr+0xbd
20b7ce50 6068266b oraclient9!kpurcs+0x97
20b7ef3c 60683b41 oraclient9!kpufGetRcvInfo+0xe2b
20b7ef94 60601cc4 oraclient9!kpufGetRcvInfo+0x2301
20b7efb4 25f9cf91 oraclient9!OCIStmtFetch+0x24
20b7f008 6589b9b3 oci!OCICharSetToUnicode+0xb1e0
20b7f028 6589b9b3 System_Data_OracleClient_ni!_load_config_used+0x3a7bb
OS Thread Id: 0x1ca0 (69)
20b7f020 77f88f13 [NDirectMethodFrameStandaloneCleanup: 20b7f020] System.Data.Common.UnsafeNativeMethods.OCIStmtFetch(System.Data.OracleClient.OciHandle, System.Data.OracleClient.OciHandle, UInt32, FETCH, MODE)
20b7f03c 6589b9b3 System.Data.OracleClient.TracedNativeMethods.OCIStmtFetch(System.Data.OracleClient.OciHandle, System.Data.OracleClient.OciHandle, Int32, FETCH, MODE)
20b7f05c 6588596d System.Data.OracleClient.OracleDataReader.ReadInternal()
20b7f0a8 658856de System.Data.OracleClient.OracleDataReader.Read()
20b7f0d8 652330ad System.Data.Common.DataAdapter.FillLoadDataRow(System.Data.ProviderBase.SchemaMapping)
20b7f110 65232e82 System.Data.Common.DataAdapter.FillFromReader(System.Data.DataSet, System.Data.DataTable, System.String, System.Data.ProviderBase.DataReaderContainer, Int32, Int32, System.Data.DataColumn, System.Object)
20b7f168 65232cf4 System.Data.Common.DataAdapter.Fill(System.Data.DataTable, System.Data.IDataReader, Int32, Int32)
20b7f1b4 6524068e System.Data.Common.DbDataAdapter.FillInternal(System.Data.DataSet, System.Data.DataTable, Int32, Int32, System.String, System.Data.IDbCommand, System.Data.CommandBehavior)
20b7f20c 652404d3 System.Data.Common.DbDataAdapter.Fill(System.Data.DataTable, Int32, Int32, System.Data.IDbCommand, System.Data.CommandBehavior)
20b7f24c 6524029c System.Data.Common.DbDataAdapter.Fill(System.Data.DataTable)
<removing customer specific code as it is not important for the discussion>
Again, the oracle methods are incorrect in this stack since we don’t have symbols, but in essence we are getting some data from an Oracle db, waiting for it to return the data.
What do we have so far?
We know that we have high memory usage, and that there is a big potential for this high memory usage to come from bitmaps.
We also know that there are indeed some System.Drawing.Bitmaps in the process that are waiting to be finalized, which means that these bitmaps are not explicitly disposed in code after we are finished with them. If they were explicitly disposed they would not have to be finalized, and thus we would not have to wait for the finalizer in order to release the underlying GDI+ resources.
We also know that the finalizer is blocked while trying to release some Oracle OCIHandle and that in order to unblock the request on thread 69 has to finish.
This in turn also means that we know that the applications code is not explicitly closing Oracle connections, command objects etc. when it’s finished using them. As with the Bitmaps, if it was explicitly closing them, they would not have to be finalized on the finalizer.
In a nutshell:
To answer the 1 GB question, to dispose or not to dispose? If you create objects that have pointers to native resources, like connections, bitmaps, files etc. you need to dispose them explicitly when you are done with them, if you don’t want to end up in a situation like this one.
MSDN will usually tell you if you need to dispose or not, like in the case of the bitmap dispose method (derived from Image dispose)
Until next time,