Why doesn’t HeapValidate detect corruption in the managed heap?

A customer had a program that was corrupting the managed heap by p/invoking incorrectly. The problem didn’t show up until the next garbage collection pass, at which point the CLR got all freaked-out-like. “According to Knowledge Base article 286470, the GFlags tool is supposed to catch heap corruption, but it doesn’t catch squat.”

Depending on your point of view, this is either a case of the customer not understanding what things mean in context or of the KB article author looking at the world through kernel-colored glasses.

The GFlags tool, pageheap, full pageheap, and the HeapValidate function all operate on heaps, but the sense of the word heap here is “heaps created by the HeapCreate function.” If your program does a VirtualAlloc and then carves out sub-allocations from it, well, it’s not like GFlags and HeapValidate are psychic and can magically reverse-engineer your code in order to understand your custom heap implementation and be able to determine whether your custom heap is corrupted.
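
To make "carves out sub-allocations" concrete, here is a minimal sketch of a custom heap. Everything in it is hypothetical (the `Arena` type, the `arena_*` functions, the demo), and `malloc` stands in for `VirtualAlloc` so the sketch stays portable. It is not how the CLR or any real allocator is written. It only shows why an overrun inside such a region is invisible from the outside: the region looks like one big block, and nothing but the custom heap's own code knows where the pieces begin and end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical bump allocator: one big region, carved into pieces. */
typedef struct Arena {
    unsigned char *base;  /* start of the region */
    size_t size;          /* total bytes in the region */
    size_t used;          /* bytes handed out so far */
} Arena;

/* Reserve the backing region. On Windows this is where a custom heap
   would typically call VirtualAlloc; malloc is a portable stand-in. */
int arena_init(Arena *a, size_t size)
{
    a->base = malloc(size);
    a->size = size;
    a->used = 0;
    return a->base != NULL;
}

/* Carve the next n bytes out of the region. No header or footer is
   written, and alignment is ignored to keep the sketch short. */
void *arena_alloc(Arena *a, size_t n)
{
    if (n > a->size - a->used) return NULL;
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

void arena_destroy(Arena *a)
{
    free(a->base);
    a->base = NULL;
    a->size = a->used = 0;
}

/* Overruns the first sub-allocation into its neighbor.
   Returns 1 if the neighbor was clobbered, -1 if setup failed. */
int demo_silent_overrun(void)
{
    Arena a;
    if (!arena_init(&a, 64)) return -1;
    char *first = arena_alloc(&a, 8);
    char *second = arena_alloc(&a, 8);
    assert(first && second);
    memcpy(second, "intact!", 8);  /* 7 characters plus the terminator */
    memset(first, 'X', 12);        /* 4 bytes too many: spills into second */
    int clobbered = memcmp(second, "intact!", 8) != 0;
    arena_destroy(&a);
    return clobbered;
}
```

The stray write never leaves the 64-byte region, so whoever handed out that region has nothing to complain about. Only code that knows about the 8-byte pieces could notice that anything went wrong.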

Clearly no such function could be written, because that’s even harder than the Halting Problem! One property of a non-corrupted heap is that it will not send the heap manager into an infinite loop. Therefore, proving that the heap is not corrupted, given no information about the heap implementation other than the code itself, would require proving that the next heap call will return. And that’s just one of the things the imaginary ValidateAnyHeap function would have to do. (We try to limit ourselves to one impossible thing at a time.)

The HeapValidate function only knows how to validate heaps created by the HeapCreate function. It does not have magic insight into custom heap implementations. Similarly, the GFlags program modifies the behavior only of heaps created by the HeapCreate function, because it naturally does not know what debugging features you’ve added to your custom heap implementation, so it doesn’t know what it needs to do to turn them on and off.

As far as the kernel folks are concerned, “heap” means “something created by the HeapCreate function.” Anything else is just an imposter.

If you are looking for corruption in a custom heap implementation, then you need to go ask the authors of that custom heap implementation if they provided any debugging facilities for that heap.
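
What might such a debugging facility look like? The sketch below is one common approach, under stated assumptions: a hypothetical custom heap (`DebugArena` and the `dbg_*` functions are invented for illustration) that writes a length header before each block and a run of guard bytes after it, plus a validator that walks the region and checks both. Again `malloc` stands in for `VirtualAlloc`, alignment is ignored, and this is not a description of what the CLR or any particular allocator does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_BYTE 0xFD  /* arbitrary fill value for the guard area */
#define GUARD_LEN  8     /* guard bytes placed after every block */

/* Hypothetical custom heap. Each block is laid out as:
   [size_t payload length][payload][GUARD_LEN guard bytes] */
typedef struct DebugArena {
    unsigned char *base;
    size_t size;
    size_t used;
} DebugArena;

int dbg_init(DebugArena *a, size_t size)
{
    a->base = malloc(size);  /* stand-in for VirtualAlloc */
    a->size = size;
    a->used = 0;
    return a->base != NULL;
}

void dbg_destroy(DebugArena *a)
{
    free(a->base);
    a->base = NULL;
    a->size = a->used = 0;
}

void *dbg_alloc(DebugArena *a, size_t n)
{
    size_t avail = a->size - a->used;
    size_t overhead = sizeof n + GUARD_LEN;
    if (avail < overhead || n > avail - overhead) return NULL;
    unsigned char *p = a->base + a->used;
    memcpy(p, &n, sizeof n);                          /* header */
    memset(p + sizeof n + n, GUARD_BYTE, GUARD_LEN);  /* trailing guard */
    a->used += overhead + n;
    return p + sizeof n;
}

/* Walks every block. Returns 1 if all headers are plausible and all
   guards are intact, 0 at the first sign of corruption. */
int dbg_validate(const DebugArena *a)
{
    size_t pos = 0;
    while (pos < a->used) {
        size_t n;
        size_t remaining = a->used - pos;
        size_t overhead = sizeof n + GUARD_LEN;
        if (remaining < overhead) return 0;
        memcpy(&n, a->base + pos, sizeof n);
        if (n > remaining - overhead) return 0;  /* header is damaged */
        const unsigned char *guard = a->base + pos + sizeof n + n;
        for (size_t i = 0; i < GUARD_LEN; i++)
            if (guard[i] != GUARD_BYTE) return 0;  /* guard is damaged */
        pos += overhead + n;
    }
    return 1;
}

/* Returns 1 if the heap validates before an overrun and fails after it,
   0 if not, -1 if setup failed. */
int demo_validate(void)
{
    DebugArena a;
    if (!dbg_init(&a, 128)) return -1;
    char *first = dbg_alloc(&a, 8);
    char *second = dbg_alloc(&a, 8);
    assert(first && second);
    memcpy(second, "payload", 8);
    int before = dbg_validate(&a);
    memset(first, 'X', 12);  /* 4 bytes too many: tramples first's guard */
    int after = dbg_validate(&a);
    dbg_destroy(&a);
    return before == 1 && after == 0;
}
```

The validator works only because it was written by someone who knows the block layout. That knowledge is exactly what a general-purpose tool does not have about your heap.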

Comments (17)
  1. Rick C says:

    "the CLR got all freaked-out-like."

    That must be a technical term.

  2. Silly says:

    Sounds to me like HeapValidate is at least one step lower level than them managed heaps.

    [It's not higher or lower. It's a sibling. -Raymond]
  3. Joshua says:

    In reading this, I thought of my slab allocator. It would be dead-set-sure to throw any analyzing heap validator into a tizzy.

  4. JDP says:

    Well, sure. That's like saying "I think my German essay would throw any English spell-checker into a tizzy"

  5. Evan says:

    Which is worse: getting all freaked-out-like, or getting thrown into a tizzy?

    These are important question.

  6. "We try to limit ourselves to one impossible thing at a time."

    Now you tell me! For years I've been trying to make a habit of believing as many as six impossible things before breakfast!

  7. Matt says:

    "that's even harder than the Halting Problem"

    Even harder?

  8. JDP says:

    As explained in the very next sentence!

  9. HiTechHiTouch says:

    I'm confused.

    The problem statement said "managed heap" and "CLR".  I immediately assumed that the customer was talking about .NET and that the heap was being created/used by the language environment.  Since this comes from Microsoft, one might think the MS debug tools would assist with this problem.

    What I think must be happening is that there is a Kernel provided heap (debugged with GFlags settings), which is not the same heap provided by .NET/runtime.  This would make the problem be one of "kernel colored glasses". (Follow-up: why doesn't .NET/rt use a kernel created heap?)

    A useful response would be information about debugging the runtime heap provided/used by .NET and the language environment.

    Nowhere in the problem statement did the customer say they were doing an explicit VirtualAlloc and carving sub-allocations, i.e. using a custom heap implementation.  Thus most of Raymond's comments, while true, seem mis-addressed.

    [The CLR is one example of a custom heap implementation that calls VirtualAlloc and carves out sub-allocations. (This should be obvious since the CLR uses a moving GC.) The customer was using a tool designed for HeapCreate heaps and expecting it to work for non-HeapCreate heaps. -Raymond]
  10. Mike says:

    Although the halting problem is totally solvable given deterministic transitions (which we presumably have) and non infinite memory (which unless MS Research are punching above their weight we can also assume).

    Just sayin'


  11. ChrisR says:

    @HiTechHiTouch:  The customer is a programmer, presumably professional.  It's reasonable to expect them to understand that the .NET runtime may or may not use the heap that GFlags can set options for.  It isn't necessarily a case of kernel-colored glasses for GFlags; in fact I'd say it's more likely a lazy and/or ignorant programmer asking the question.

    Or would you suggest that text be added to the GFlags documentation listing all the different heaps that it can't help debug?

  12. Matt says:

    @Mike: "The halting problem is totally solvable given deterministic transitions and non infinite memory".

    Sure. Using an algorithm provably not faster than O(2^n) where n is the number of bits of memory.

    You don't need a very big N before the difference between "takes O(2^N) operations" and "takes infinite time" is just a matter of academic semantics.

  13. Smitty says:

    @ChrisR:  If the .NET runtime uses the heap, it should know how to use it correctly.  I agree totally with @HiTechHiTouch, this issue hasn't really been addressed by Raymond, and in fact smacks a little of the elitism, or snobbery that I expect to find on a Solaris forum.

    [The point is that the .NET runtime doesn't use the heap (or more precisely, does not use a heap created by HeapCreate), so using a tool for debugging corruption in heaps created by HeapCreate is useless if the corruption is not in a heap created by HeapCreate. -Raymond]
  14. ChrisR says:

    @Smitty:  I guess you missed the part where Raymond wrote that the customer misused a P/Invoke to corrupt the managed heap themselves.

  15. ErikF says:

    @Smitty: The original question, as asked, was similar to "My SQL database is showing corruption, but fsck shows no problems! Why?"

  16. Silly says:

    @ErikF. From what I now understand (after taking off my unmanaged glasses), the original question is more like "My SQL database is showing corruption, but Oracle shows no problems! Why?"

  17. 640k says:

    Everyone that doesn't understand that this is a .NET issue, raise your hand.
