Heap Corruption – a W3WP case study

This blog post has been authored by Yawei Wang, a Premier Field Engineer with Microsoft Singapore. He describes a very common and important type of issue – heap corruption – and how you can successfully debug it using the Debugging Tools for Windows.


Background

Heap corruption is by nature a complicated issue to troubleshoot and in some cases luck is as important as debugging knowledge as well. It normally happened intermittently and we cannot reproduce it within a short time frame. The best way to understand how it works is through a real case analysis. Today we are going to unveil the secret a W3WP heap corruption case together.

Symptoms

The background is that a customer reported an intermittent crash of W3WP process on IIS server when the server was under heavy load. Sometimes it would respond with HTTP 500 (IIS 5.0 / 6.0, IIS 7.x) but sometimes the process was just terminated unexpectedly. It’s a generic second chance access violation (0xC0000005) exception (see What is a First Chance Exception?) and we were able to get the crash dump with DebugDiag.

Troubleshooting

From the dump, we can find the thread was crashed on CUSTOM_ERROR_TABLE::FindCustomError. FindCustomError finds the applicable custom error entry for a given status / subcode and return it back to client. Here is the call stack:

0:033> knL
# ChildEBP RetAddr
00 06cee2cc 5a49fb48 w3core!CUSTOM_ERROR_TABLE::FindCustomError+0x18
01 06cee428 5a42392b w3core!ISAPI_REQUEST::GetCustomError+0x8e

1b 06ceffb8 7c82482f msvcrt!_threadstartex+0x74
1c 06ceffec 00000000 kernel32!BaseThreadStart+0x34

The calls stack was quite clean without any third-party components to suspect. It’s also very rare that the crash happened within IIS module and as you may not know w3core.dll is the core component in IIS 6 (same as iiscore.dll in IIS 7) there can’t be an issue in it as IIS 6 has been released for years and the number of calls to FindCustomError can be numerous. If there should be a issue, it wouldn’t survive for a minute.

So, how did it happen? Let’s first take a look at the register status: ESI is null, which seems to be the direct culprit.

eax=06ce0000 ebx=06ceedb4 ecx=017c4774 edx=000006e2 esi=00000000 edi=06ceedb4
eip=5a49fbd2 esp=06cee2c4 ebp=06cee2cc iopl=0 nv up ei ng nz ac po cy
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010293
w3core!CUSTOM_ERROR_TABLE::FindCustomError+0x18:
5a49fbd2 668b4608 mov ax,word ptr [esi+8] ds:0023:00000008=????

To understand why ESI was null we have to analyze the below assembly. We see the address stored in ECX was null which caused the C++ exception.   

0:033> u
w3core!CUSTOM_ERROR_TABLE::FindCustomError+0x18
5a49fbd2 668b4608 mov ax,word ptr [esi+8] <======esi is null
5a49fbd6 663b4508 cmp ax,word ptr [ebp+8]
5a49fbda 7404 je w3core!CUSTOM_ERROR_TABLE::FindCustomError+0x22 (5a49fbe0)
5a49fbdc 8b36 mov esi,dword ptr [esi]
5a49fbde ebea jmp w3core!CUSTOM_ERROR_TABLE::FindCustomError+0x34 (5a49fbca)
5a49fbe0 668b460a mov ax,word ptr [esi+0Ah]
5a49fbe4 663b450c cmp ax,word ptr [ebp+0Ch]
5a49fbe8 0f841c780000 je w3core!CUSTOM_ERROR_TABLE::FindCustomError+0x3f (5a4a740a)

Here is the disassembly of the function w3core!CUSTOM_ERROR_TABLE::FindCustomError:

5a49fbac 8bff mov edi,edi
5a49fbae 55 push ebp
5a49fbaf 8bec mov ebp,esp
5a49fbb1 56 push esi
5a49fbb2 57 push edi
5a49fbb3 8b7d10 mov edi,dword ptr [ebp+10h]
5a49fbb6 85ff test edi,edi
5a49fbb8 0f8467fdffff je w3core!CUSTOM_ERROR_TABLE::FindCustomError+0x59 5a49f925)
5a49fbbe 837d1400 cmp dword ptr [ebp+14h],0
5a49fbc2 0f845dfdffff je w3core!CUSTOM_ERROR_TABLE::FindCustomError+0x59 (5a49f925)
5a49fbc8 8b31 mov esi,dword ptr [ecx ]<=======poi(ecx) -> esi ; poi(ecx) = null
5a49fbca 3bf1 cmp esi,ecx
5a49fbcc 0f8448fdffff je w3core!CUSTOM_ERROR_TABLE::FindCustomError+0x38 (5a49f91a)

Why ECX was NULL? We know “this call” calling convention (used for calling C++ non-static member functions) will pass “this” pointer in ECX. As you see, The ‘dt’ command will show the structure name. Although the debugger says “this” is stored in EDX, it's wrong! It should be in ECX which is null.

0:033> dt this

Local var @ edx Type CUSTOM_ERROR_TABLE*

0:033> dd poi(ecx) L1

01dc7540 00000000

Then we dumped the CUSTOM_ERROR_ENTRY and would like to see why the object CUSTOM_ERROR_TABLE was null.

0:033> !list "-t ntdll!_LIST_ENTRY.Flink -e -x /"dt w3core!CUSTOM_ERROR_ENTRY @$extret/" 017c4774"
dt w3core!CUSTOM_ERROR_ENTRY @$extret
+0x000 _listEntry : _LIST_ENTRY [ 0x1dc7540 - 0x1d60620 ]
+0x008 _StatusCode : 0xbda0
+0x00a _SubError : 0x1db
+0x00c _strError : STRU
+0x03c _fIsFile : 0n0

dt w3core!CUSTOM_ERROR_ENTRY @$extret
+0x000 _listEntry : _LIST_ENTRY [ 0x0 - 0x0 ]
+0x008 _StatusCode : 0
+0x00a _SubError : 0
+0x00c _strError : STRU
+0x03c _fIsFile : 0n0    

0:033> ? 0xbda0
Evaluate expression: 48544 = 0000bda0

Obviously, the custom error list is completely corrupted so we can't get actual custom error 500.100 (Internal Server Error - ASP error).

0:033> .frame 0

00 06cee2cc 5a49fb48 w3core!CUSTOM_ERROR_TABLE::FindCustomError+0x18

0:033> dv
this = 0x000006e2
StatusCode = 0x1f4<=====500
SubError = 0x64<======100
pfIsFile = 0x06ceedb4
pstrError = 0x06cee334

When we check the list entry address 0x1d60620, we can see that it belongs to heap 0x2b0000, which is the MSVCRT (C Runtime) heap.

0:033> !address 0x1d60620
Usage: Heap
Allocation Base: 01d50000
Base Address: 01d50000
End Address: 01e50000
Region Size: 00100000
Type: 00020000 MEM_PRIVATE
State: 00001000 MEM_COMMIT
Protect: 00000004 PAGE_READWRITE
More info: heap containing the address: !heap 0x2b0000
More info: heap entry containing the address: !heap -x 0x1d60620
Heap 3 - 0x002b0000
Heap Name msvcrt!_crtheap
Heap Description This heap is used by msvcrt
Reserved memory 3.13 MBytes
Committed memory 1.64 MBytes (52.38% of reserved)
Uncommitted memory 1.49 MBytes (47.63% of reserved)
Number of heap segments 3 segments
Number of uncommitted ranges 1 range(s)
Size of largest uncommitted range 1.43 MBytes
Calculated heap fragmentation 3.94%

At the current stage, we believe that it is a typical heap corruption. While debugging heap corruption issues is not an easy task because the thread that caused the exception was not usually the thread that caused the corruption (FindCustomError is the victim in this case), we still can use the full PageHeap feature to capture another round of IIS crash dump.

With the assistance of pageheap.exe, we managed to capture another round of IIS crash dump and this time we found the culprit:

FAILED_INSTRUCTION_ADDRESS:
safeagent+1ae10
1001ae10 ?? ???
SYMBOL_STACK_INDEX: 0
SYMBOL_NAME: safeagent!unloaded+1ae10
FOLLOWUP_NAME: wintriag
MODULE_NAME: safeagent

IMAGE_NAME: safeagent.dll

DEBUG_FLR_IMAGE_TIMESTAMP: 45f51b03

FAILURE_BUCKET_ID: BAD_INSTRUCTION_PTR_c0000005_safeagent.dll!unloaded

BUCKET_ID: APPLICATION_FAULT_BAD_INSTRUCTION_PTR_INVALID_POINTER_READ_BAD_IP_safeagent!unloaded+1ae10

Followup: wintriag

The call stack was truncated, due to the module being unloaded:

0:007> knL
# ChildEBP RetAddr
WARNING: Frame IP not in any known module. Following frames may be wrong.
00 088bc188 00000000 <Unloaded_safeagent.dll>+0x1ae10

Conclusion

In this case, we can conclude that the third party module (safeagent.dll) was involved in this issue. The next step was to contact the provider of this module for further investigation.


Edited and posted by Arvind Shyamsundar , MSPFE Editor