Debugging with the Right Tools

Wow, it’s been almost a year since I last blogged J We just shipped CLR V4.0. Yay!


An internal email thread prompted me to write this blog entry – one very powerful tool I wanted to point out when you need to debug/investigate issues is your debugger (if your debugger is also windbg/cdb that is J since that’s what I use and that’s what I will talk about in this article).


For those of you who are interested in investigating memory related issues at all, whether it’s because you don’t like your current app’s memory usage or you simply want to improve, learning to use the debugger is invaluable. If you haven’t started using windbg/cdb, I would strongly encourage you to – I promise you won’t regret it.


I’ve talked about using SoS before. We’ve added some more SoS commands in V4.0 and some of those are related to GC like !AnalyzeOOM, !GCWhere and !FindRoots. You can read about them on the MSDN page. But I wanted to talk about some techniques that may not be obvious from reading reference material.


What do you want to know about GC when your program is? 2 of the most common things are


1) Why are GCs being triggered?

2) Why do GCs take this much time?


And the answers are, respectively:


1) The difference of the heap before and after a GC

2) The survivors from that GC


Let me explain those in detail. Each generation has its own allocation budget which I explained here. It’s a value that we set and when exceeded we’d want to trigger a GC. If we do a collection and discovered that there’s a lot of memory survived we would set this allocation budget very big which means you’d need to allocate more into that generation for us to want to trigger a GC. The rationale is that then we’d have a chance to reclaim some sizeable space back next time we do a GC. Otherwise we’d do all this work for a GC and not be able to find much dead memory.


So if you do a !dumpheap right before a GC you see what objects you have; then you do a !dumpheap right after that GC you see some of those objects disappeared. It’s those objects that triggered this GC. Why? ‘cause if we didn’t have any objects disappearing, we wouldn’t be doing GC (‘cause we’d get no dead space back). When you hear people say “you are doing gen2 GCs because you are churning gen2” this is preciously the explanation for that.


As of now, the CLR GC doesn’t compact the LOH. So if you want to look at the LOH you can see exactly which parts of the memory disappeared. Here’s a sample output of a heap range before and after a gen2 GC. Before the gen2 GC (I formatted the !dumpheap with the methodtable pointer replaced with their readable names):


Address              MethodTable          Size


00000007f7ffe1e0     System.Byte[]        2,080,000

00000007f81f9ee0     System.String        100,000

00000007f8212580     System.Byte[]        140,000


After this gen2 GC for the same heap range:


Address              MethodTable          Size


00000007f7ffe1e0     Free                 2,320,000


So those 3 objects were collected. As the application author those objects may very well ring a bell for you why they were allocated so you can go from there see if/how you can do some optimizations.


The survivors from the collection are what you see in the output of !dumpheap at the end of that GC.


In this MSDN article I talked about how to set a breakpoint at the end of a gen2 GC. To set a bp at the beginning of a gen2 GC simply replace RestartEE with SuspendEE. If you want to do this for gen0 GCs, replace 2 with 0. Note that this applies to only non concurrent/background GCs though since concurrent/background GCs would not have the EE suspended for the most part.


Often I see people quickly go search for some other tools when they really could whip out the debugger and set a few breakpoints to see what things look like. Of course there are many useful tools out there – and they tend to aggregate info for you so you don’t have to do that yourself – but sometimes it’s so convenient to just use the most direct and handy tool which is your debugger.


Another classic example is when people ask me for help with debugging memory leaks and the easiest and quickest solution is to set a conditional bp on kernel32!VirtualAlloc for native memory leaks. I’ve seen this over and over again and this is how the conversation would usually go:


someone, “Maoni, I have a managed app that’s leaking memory. Can you help me with it?”

me, “Is it a managed memory leak?”

someone, “I don’t know.”

me, “How big is your managed heap? Can you do the SoS !dumpheap command?”

someone, “it’s only X MBs” [but his whole process uses much more than that]

me, “Does it grow? If you run your app for a while does the managed heap size change?”

someone, “No” (or “Not much, not nearly as much as the whole memory usage increases”)

me, “Can you take a look at your virtual memory address space by doing !address?” [!address gives you info on each virutal memory chunk. It’s also a windbg/cdb debugger extension but it’s in ext.dll  which comes with the debugger package.]

someone, “It looks like there’re a lot of chunks of size S…”

me, “Can you set a conditional bp on kernel32!VirtualAlloc when the allocate size is S?” [Which means: bp kernel32!virtualalloc “.if (@edx==S) {kp} .else {g}”]

[some time elapses]

someone, “It looks like they all have the same callstack!!!”


At this point it’s clear what causes their memory leak.


If you need to see continuous output of some commands without disturbing the process too much you can always log it to a file (.logopen c:\mydebugsession.txt) and analysis the log after the process runs for a while.


As I mentioned there are tools that will aggregate this info for you (eg, the DebugDiag tool you can find on which shows you the aggregated callstack from all the VirtualAlloc calls) but if you’ve never used those tools it may take a while to set them up and learn how to use them. The debugger is something you have right there and use all the time so it’s something you can fairly conveniently solve with a debugger I tend to use that.


Comments (10)

  1. Naveen says:

    How about CLR-ETW? I think that is really handy with getting call-stacks of allocations, which is non-intrusive. Here is a simple post of using CLR-ETW to track silverlight memory usage.

  2. maoni says:

    Naveen, ETW is an extremely useful tool but again different tools have different purposes – it’s much harder to use ETW events – you have to start xperf (or other tools to collect ETW events); you need to convert it to csv and you need to parse it.

    My point is when you can just use the debugger, which is something you already need to use anyway, sometimes it can be very handy to use that.

    BTW, just as a note, ETW is "lightweight" – it’s not to say that it’s always non intrusive when you use it. You can certainly observe perf degrade if you are collecting events that are too frequent.

  3. Regarding using a conditional break point on VirtualAlloc, we often use this technique to find unmanaged memory leaks as we use a number of native COM libraries in our managed application.

    However, we have also found it necessary to break on RtlAllocateHeap ( in ntdll.dll ) as not all native allocations go through VirtualAlloc.

  4. maoni says:

    Andrew, correct. The NT heap memory of course is allocated with VirtualAlloc but not each RtlAllocateHeap will call VirtualAlloc since they preallocate.

  5. liviu trifoi says:

    One small question:

    Doing an AppDomain.Unload are there any situations when memory is decomitted or the heap compacted?

    Because from what I’ve tested the objects in an AppDomain that has been unloaded only get to be marked as unreachable but the memory usage (private bytes) does not decrease. The memory usage will only decreases after the garbage collection kicks in (this will be some time later than the AppDomain.Unload call) and reclaims the memory for those unreachable objects.

    Is this correct?

  6. Mark says:

    It's been over 6 years, and still … "As of now, the CLR GC doesn’t compact the LOH."

    And worse, Microsoft will not supply the interface for us to call on our own when we are willing to take the time hit for the LOH compaction. Such a simple function, such a simple request, but Microsoft refuses to supply it.


    There are no "performance reasons" that justify requiring us to restart our application to recover usefulness of LOH once it has been fragmented.

    Even though there are numerous discussions of ways to bend over backwards in your code design to work around LOH fragmentation, it only accomplishes a delay of the inevitable. And, besides, I have no control over how the third party libraries I use manage their memory allocations.

  7. maoni says:

    liviu trifoi: You are correct – we don't proactively do a full GC at the time you call Unload – the AD unload logic will wait till the next full GC happens (which will get rid of all the objects that belonged to that AD (unless they are agile)) and then will release all native memory we allocated for this AD.

    Mark: I understand your frustration. And just to make it clear – we are not refusing to provide this feature to you. It is indeed scheduled for our future releases. We are also making the LOH allocator better so it will help with the fragmentation issue.

  8. Dmitri says:

    Mark and Maoni –

    It seems that GC creators policy has always been somewhat of a "nanny state" paradigm – we will take care of you, and internal (to GC team) optimization is better than anything done by developer.  Largely proven true when abstracted, but there are always corner cases in particular situations, and you do find yourself longing for more power.  

    I am actually quite surprised by Maoni's statement that they will open up LOH compaction. Let's just hope Patrick doesn't override you on this.

    And many thanks for you blog!

  9. xiaoyou.coding says:

    Mark, I also faced the problem of fragments of LOH. Since we are using a out-of-proc COM server to fetch data into our c# clients, most of the intermediate marshaling memory chunks are large objects and we are not able to change the interface. Fortunately, the extensive COM interops only occur when refreshing data, so, we know when it happens. During the refreshing process, we force to trigger full GC when the accumulated LOH objects' size goes up to a threshold(like 100M), then, the fragments can be controlled and LOH size will not go too high. Otherwise, for each refresh of a big dataset, nearly 1G memory may be used.

  10. zdenek says:

    not gonna say what i called my function – but essentially it takes an expected allocation in items, normally used to set capacity, and if less than 8000 returns the same value (on assumption that largest native storage should be 10 byte extended floating point at least on standard intel machines) and everything else will return 10k, 20k 50k 100k 200k and so on… you would think that allocating so much extra memory is detrimental to the outcome but in many cases it actually allowed reuse. Problem is that i can only control my code this way.