Swallowing SEH Exceptions is EVIL!

This case was a good case-in-point of why crashing can sometimes be a good thing – at least it helps you find and fix your bugs! This issue was reported as ‘my application just randomly disappears’.  We had the developer run this app under procdump with unhandled exceptions being caught – ‘procdump -ma -e Test.exe’. They could still randomly reproduce the issue when loading the app, but we never were able to capture a crash dump. Even after enabling page-heap to trap memory buffer overwrites. Frustrating, time consuming…expensive.

At that point, we just enabled first-chance exception dumping – ‘procdump -n 100 -ma -e 1 Test.exe’. This generated many, many dump files. Did I mention this was a mixed-mode managed app that contains both managed and native code? Managed code can potentially throw a lot of first-chance exceptions – even a lot of AV first chance exceptions as the managed NullReferenceException starts out a first-chance AV.

After looking at a lot of dump files I was able to notice a pattern of native code FCE.

The Windbg callstack looks like:

This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(c3c.c88): Access violation - code c0000005 (first/second chance not available)
eax=0bd3f000 ebx=7ffffff6 ecx=00000073 edx=00000007 esi=01282138 edi=0bd3eff0
eip=6ad71891 esp=0019eae4 ebp=0019ef68 iopl=0         nv up ei pl nz na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010206
msvcr100!_woutput_l+0x93b:
6ad71891 66833800        cmp     word ptr [eax],0         ds:0023:0bd3f000=????

0:000> k
ChildEBP RetAddr 
0019ef68 6ad71af1 msvcr100!_woutput_l+0x93b [output.c @ 1625]
0019efac 6ad71b38 msvcr100!_vsnwprintf_l+0x8e [vswprint.c @ 155]
0019efc8 01281043 msvcr100!_vsnwprintf+0x18 [vswprint.c @ 227]
0019f000 012810d0 Test!EvilFunc+0x43 [c:\test.cpp @ 19]
0019f834 0128125a Test!main+0x60 [c:\test.cpp @ 38]
0019f878 7726ee6c Test!__tmainCRTStartup+0x122 [crtexe.c @ 555]

This is my after the fact repro code of course, developers are never kind enough to identify bad code as EvilFunc. This little sample shows the two bugs they had interacting to produce this scenario:

#define _CRT_SECURE_NO_WARNINGS
#include <stdio.h>
#include <stdarg.h>
#include <tchar.h>
#include <string.h>

// Build /Eha to catch SEH Exceptions
void EvilFunc(wchar_t * pWChar, ...)
{
    wchar_t sz[1024] = { 0 };
    int nBufLen = (sizeof(sz) / sizeof(sz[0])) - 1;
    va_list args;

    va_start(args, pWChar);
    int nBuf = 0;

    try
    {
        nBuf = _vsnwprintf(sz, nBufLen, pWChar, args);
        // Do something with sz...
    }
    catch(...)
    {
    }

    va_end(args);
}

int main()
{
    long count = 1;

    while(true)
    {
        char* szUserName = _strdup("Steve Horne");
        EvilFunc(_T("The UserName is : %s"), szUserName));
        printf("loop count %d\n", count++);

        free(szUserName);
    }

    return 0;
}

EvilFunc is using the ‘…’ vararg syntax in C/C++. The first parameter they were passing was a wide-char – they didn’t explicitly declare it wchar_t*, they were using one of the OLE macros which compiled to wchar_t* – I’ve simplified things down to the minimum repro here.

The first bug is passing in a char* as the … variable params. The compiler doesn’t type check the … so it was happily calling  _vsnwprintf with a wide-char format string and char string, just like the developer told it to.

The page-heap enabled crash dump shows the problem with that. They walked off the end of the char* string in CRT code (it was looking for a wchar_t two-byte NULL string termination) and started accessing  memory unintentionally:

0:000> r
eax=0bd3f000 ebx=7ffffff6 ecx=00000073 edx=00000007 esi=01282138 edi=0bd3eff0
eip=6ad71891 esp=0019eae4 ebp=0019ef68 iopl=0         nv up ei pl nz na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010206
msvcr100!_woutput_l+0x93b:
6ad71891 66833800        cmp word ptr [eax],0 ds:0023:0bd3f000=????

0:000> db @eax
0bd3f000  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????
0bd3f010  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????
0bd3f020  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

0:000> db @eax-10
0bd3eff0  53 74 65 76 65 20 48 6f-72 6e 65 00 d0 d0 d0 d0  Steve Horne.....
0bd3f000  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????
0bd3f010  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

This was corrupting the heap after the _strdup allocation – unless you turn on the wonderful page-heap which book-ends the allocation with non-writable memory so the overwrite faults.

The second bug, or issue is the catch(…) filter used with the /Eha compiler option. This allows the catch(…) to catch all SEH exceptions, completely hiding the AV from procdump or any other tool trying to trap unhandled exceptions. I’m not positive why this was done, but my suspicion is that they couldn’t figure out why this code kept crashing and just added the catch for SEH exceptions to keep the process up. Not a good idea. It’s far, far, far better to have you app crash and fix that crash than to mask a heap corrupting bug.

Heap corruption is an insidious evil. When you finally do get a crash or hang due to a heap corruption, it’s impossible to look back in time and find the cause of the heap corruption – even if you are able to figure out that your issue is actually caused by heap corruption.  Whatever code corrupted the heap is long past executed and not in any current stack frames. Page-heap can easily help you track down these types of buffer over-run issues – but not if you hide the exceptions from the debugging tools!

This repro requires an older CRT –  VS 2010 was the version being used here. Later versions of the CRT do not reproduce this buffer overwrite due to changes in the CRT implementation.