Troubleshooting heap corruption issues

One of my customers noticed that his worker process was crashing repeatedly (W3SVC error 1011). While troubleshooting the issue, we collected crash dumps using the DebugDiag tool. Looking into the dumps, I saw that we were running into heap corruption.

0:000> !heapvalidate 0
ERROR: Unable to resolve structure ntdll!_HEAP_FREE_ENTRY (0 == m_TargetLength).
WARNING: Unable to read HEAP_FREE_ENTRY at 0xffffffff1ae78411 (heap 0x00080000, FreeLists[0]).
ERROR: Unable to resolve structure ntdll!_HEAP_FREE_ENTRY (0 == m_TargetLength).
WARNING: Unable to read HEAP_FREE_ENTRY at 0xffffffff00180689 (heap 0x00180000, FreeLists[0]).
<Truncated the rest of the output>
NN error(s) found.

The dump file indicates that the problem was due to heap corruption. So what is heap corruption?

Heap corruption occurs when a thread allocates a block of heap memory of a given size and then writes to memory addresses that are beyond the requested size of the heap block. Another common cause of heap corruption is writing to a block of memory that has been freed (old pointer reuse) or freeing a block of memory that was already freed.

Debugging heap corruption issues is not an easy task because the thread that raises the exception is usually not the thread that caused the corruption. To find the thread that caused the corruption, you must use Pageheap.exe. Pageheap.exe is a software validation layer between the application and the system that verifies all dynamic memory operations, such as allocations and frees. You can enable Pageheap.exe in NORMAL or FULL mode; you can also enable Pageheap.exe for specific targeted DLLs.
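To make the idea of a validation layer concrete, here is a loose toy model in Python. It is only a sketch of the general technique (a fill pattern placed after each block and checked when the block is freed), not how Pageheap.exe is actually implemented. The class name ToyHeap, the HeapCorruption exception and the guard bytes are all made up for illustration.

```python
class HeapCorruption(Exception):
    """Raised when the toy heap detects a corrupted or invalid block."""


class ToyHeap:
    # Arbitrary guard pattern chosen for this sketch; not the real Pageheap pattern.
    GUARD = b"\xAB" * 8

    def __init__(self, size):
        self.memory = bytearray(size)
        self.blocks = {}      # block start address -> requested size
        self.next_free = 0    # simple bump allocator, never reuses memory

    def alloc(self, size):
        start = self.next_free
        end = start + size
        if end + len(self.GUARD) > len(self.memory):
            raise MemoryError("toy heap exhausted")
        # Place the guard pattern right after the requested bytes.
        self.memory[end:end + len(self.GUARD)] = self.GUARD
        self.blocks[start] = size
        self.next_free = end + len(self.GUARD)
        return start

    def write(self, addr, data):
        # Deliberately unchecked, like a raw pointer write in native code.
        self.memory[addr:addr + len(data)] = data

    def free(self, start):
        if start not in self.blocks:
            raise HeapCorruption("double free or invalid pointer")
        size = self.blocks.pop(start)
        guard = bytes(self.memory[start + size:start + size + len(self.GUARD)])
        if guard != self.GUARD:
            raise HeapCorruption("overrun past end of block")


heap = ToyHeap(64)
block = heap.alloc(4)
heap.write(block, b"abcdef")   # writes 6 bytes into a 4-byte block
try:
    heap.free(block)
except HeapCorruption as exc:
    print("detected:", exc)
```

Note that in this model the overrun is only noticed at free time, well after the bad write. That delay is the reason a check of this kind narrows down the corrupted block but does not by itself point at the offending write.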

The unfortunate thing about heap corruption is that the actual “problem” has occurred before the crash. So it is necessary to turn on a setting called PageHeap in order to catch the true cause of the crash. Hence we will create a crash rule on the W3WP.exe/aspnet_wp.exe process and turn on PageHeap for it (the steps below use Full PageHeap).

Since we are facing a heap corruption issue (AV exception), let’s enable the PageHeap flag on the server and collect new dumps. The steps are as follows:
1. Open DebugDiag
2. On the Rules tab, click Add Rule
3. Select Crash and click Next
4. Select "A specific process" and click Next
5. Under Advanced Settings, click PageHeap Flags
6. Select Enable Full PageHeap
7. Click OK, then click Yes on the prompt that is shown.
8. Click Next through the rest of the wizard
9. Reset IIS. This is very important for getting the correct dumps.

The dumps will be captured automatically the next time the issue happens. Please don’t restart IIS or the server until we have the dumps.

When the problem occurs again, the process should crash at the point where the problematic DLL is actually corrupting the heap. We will then need to review this dump file to check what caused the heap corruption.

NOTE: Turning Pageheap on will cause significant performance overhead and may cause the server to become unresponsive, but we need to proceed further to get to a conclusion on this issue. Please have a discussion with your team and decide upon a time when you can turn on page heap.

The above rule adds a DWORD value named GlobalFlag under the following registry key:
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\w3wp.exe]
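If you want to confirm that the setting took effect, a small script can read the value back. The following is a minimal sketch, assuming the value is named GlobalFlag and that page heap corresponds to the FLG_HEAP_PAGE_ALLOCS bit (0x02000000). The helper names are my own, and the registry read only works on Windows with Python's standard winreg module.

```python
FLG_HEAP_PAGE_ALLOCS = 0x02000000  # global flag bit for "enable page heap"

IFEO_PATH = (r"SOFTWARE\Microsoft\Windows NT\CurrentVersion"
             r"\Image File Execution Options")


def parse_global_flag(raw):
    """Accept the value as an integer (REG_DWORD) or as text such as '0x02000000'."""
    if isinstance(raw, str):
        return int(raw, 0)   # base 0 lets int() honor the 0x prefix
    return int(raw)


def page_heap_enabled(global_flag):
    """Return True when the page heap bit is set in the flag value."""
    return bool(global_flag & FLG_HEAP_PAGE_ALLOCS)


def read_global_flag(image_name="w3wp.exe"):
    """Read GlobalFlag for the given image. Windows only."""
    import winreg  # imported here so the helpers above work on any platform
    key_path = IFEO_PATH + "\\" + image_name
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        raw, _value_type = winreg.QueryValueEx(key, "GlobalFlag")
    return parse_global_flag(raw)


if __name__ == "__main__":
    import sys
    if sys.platform == "win32":
        flag = read_global_flag("w3wp.exe")
        print("GlobalFlag = 0x%08X, page heap enabled: %s"
              % (flag, page_heap_enabled(flag)))
```

Run it from an elevated prompt on the server after creating the rule. If the key or value is missing, winreg raises FileNotFoundError, which usually means the rule was not applied to that image name.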