Experiencing a Weird Access Violation? It May Be a Race Condition

Recently I worked on a native crash caused by an access violation (AV). I have seen many access violations caused by heap corruption before, but this one looked so strange that I even suspected a hardware (CPU) problem. In the end I confirmed it was a race condition, and I would like to share how a lack of thread safety can affect an application.

Open the dump file in WinDbg and switch to the exception context. The AV happens inside oleaut32!SysAllocString when it reads from address 0x3, which is not a valid memory address.

0:077> .ecxr
eax=00000003 ebx=009a075c ecx=00408940 edx=00000005 esi=024ef3f0 edi=777a4642
eip=777a4660 esp=024ef348 ebp=024ef348 iopl=0 nv up ei pl nz na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010206
oleaut32!SysAllocString+0x15:
777a4660 668b08 mov cx,word ptr [eax] ds:0023:00000003=????
0:077> kL
# ChildEBP RetAddr
00 024ef348 004f2655 oleaut32!SysAllocString+0x15
01 024ef3d4 004f2756 AE!CFAC::_getAttr+0x396
02 024ef41c 761704e8 AE!CFAC::GetAttributes+0x85
03 024ef440 761d5311 rpcrt4!Invoke+0x2a
04 024ef848 7748aec1 rpcrt4!NdrStubCall2+0x2d6
05 024ef890 7748d876 ole32!CStdStubBuffer_Invoke+0x3c
06 024ef8d8 7748ddd0 ole32!SyncStubInvoke+0x3c
07 024ef924 773a8a43 ole32!StubInvoke+0xb9
08 024efa00 773a8938 ole32!CCtxComChnl::ContextInvoke+0xfa
09 024efa1c 7748a44c ole32!MTAInvoke+0x1a
0a 024efa4c 7748db41 ole32!AppInvoke+0xab
0b 024efb2c 7748a45c ole32!ComInvokeWithLockAndIPID+0x372
0c 024efb78 7617063c ole32!ThreadInvoke+0x302

The following is the source code for SysAllocString.

STDAPI_(BSTR)
SysAllocString(__in_z_opt const OLECHAR* psz)
{
    if(psz == NULL)
      return NULL;

    return SysAllocStringLen(psz, (DWORD)STRLEN(psz));
}

It looks like psz is simply the value in eax. How can we verify that? Let's have a look at part of the disassembly of AE!CFAC::_getAttr shown below. _bstr_t::GetBSTR returns a reference (an address in eax), and the value stored at that address is pushed onto the stack and passed to SysAllocString.

004f2643 8d4dc8 lea ecx,[ebp-38h]
004f2646 e8a862f1ff call AE!_bstr_t::GetBSTR (004088f3)
004f264b ff30 push dword ptr [eax]
004f264d 8b3da8437100 mov edi,dword ptr [AE!_imp__SysAllocString (007143a8)]
004f2653 ffd7 call edi

We can read the stack value at ebp + 8 in the SysAllocString frame and confirm that psz really is 0x3. Clearly, it is not a valid memory address.

0:077> ? poi(0x024ef348 + 0x8)
Evaluate expression: 3 = 00000003
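
The offset of 0x8 comes from the usual 32-bit x86 frame layout: after the callee runs "push ebp; mov ebp, esp", the saved ebp sits at [ebp], the return address at [ebp+4], and the first stack argument at [ebp+8]. A minimal sketch of that arithmetic, assuming the standard prologue (the helper names are illustrative and not part of any debugger API):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers describing a 32-bit x86 frame that uses the
// standard "push ebp; mov ebp, esp" prologue.

// The return address is stored just above the saved ebp.
constexpr std::uint32_t return_address_slot(std::uint32_t ebp) {
    return ebp + 4;
}

// index is zero-based: argument 0 is the one the caller pushed last.
constexpr std::uint32_t stack_argument_slot(std::uint32_t ebp, std::uint32_t index) {
    return ebp + 8 + 4 * index;
}

// Frame 00 in the dump has ebp = 0x024ef348, so psz is read from 0x024ef350.
static_assert(stack_argument_slot(0x024ef348u, 0) == 0x024ef350u, "psz slot");
```

This is the same address that "poi(0x024ef348 + 0x8)" dereferences in the command above.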

Because psz comes from _bstr_t::GetBSTR, let's suppose the _bstr_t object itself holds an invalid value. Based on the disassembly above, the object is located at ebp-38h in the frame of AE!CFAC::_getAttr.

inline BSTR& _bstr_t::GetBSTR()
{
    if (m_Data == NULL) {
        m_Data = new Data_t(0, FALSE);
        if (m_Data == NULL) {
            _com_issue_error(E_OUTOFMEMORY);
        }
    }
    return m_Data->GetWString();
}

inline wchar_t*& _bstr_t::Data_t::GetWString() throw()
{
    return m_wstr;
}

Based on the above source code, _bstr_t::GetBSTR is actually returning m_Data->m_wstr.
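
Note that GetBSTR returns BSTR& rather than BSTR, so the caller receives the address of m_wstr and reads the pointer through it. That is what "push dword ptr [eax]" does in the disassembly. A simplified sketch of the pattern, using hypothetical stand-in types rather than the real comdef.h classes:

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for _bstr_t::Data_t.
struct Data {
    const wchar_t* wstr;
};

// Hypothetical, simplified stand-in for _bstr_t.
struct Wrapper {
    Data* data;
    // Like GetBSTR, this returns a reference to the member, not a copy.
    const wchar_t*& Get() { return data->wstr; }
};

// Mirrors the call site: the pointer is read through the reference at the
// moment of the call and then passed on by value.
inline const wchar_t* ReadThrough(Wrapper& w) {
    const wchar_t*& ref = w.Get();  // corresponds to eax = &data->wstr
    return ref;                     // corresponds to push dword ptr [eax]
}
```

The value that reaches the callee is therefore whatever the memory behind the reference holds at the instant of that read.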

0:077> ? 0x024ef3d4 - 0x38
Evaluate expression: 38728604 = 024ef39c
0:077> dt 024ef39c AE!_bstr_t
"" +0x000 m_Data : 0x09d4fef8 _bstr_t::Data_t
0:077> dt 0x09d4fef8 _bstr_t::Data_t
AE!_bstr_t::Data_t
+0x000 m_wstr : 0x03050011 ""
+0x004 m_str : (null)
+0x008 m_RefCount : 0

What confuses me is that m_wstr is 0x03050011 in the dump, while the AV indicates that the value passed was 0x3. First, I don't think this is a bug in _bstr_t::GetBSTR or _bstr_t::Data_t::GetWString; they are very simple functions that are used by lots of applications. What else? If _bstr_t::GetBSTR is fine, another suspect is the instruction "lea ecx,[ebp-38h]". This instruction computes the address of a local variable on the stack and stores it in ecx (the this pointer). If the CPU were faulty, it could certainly produce a bogus ecx value.
Anything else? Hardware is one potential root cause, but it is not very convincing, especially since I was told that the application had been running on this machine for two years without problems. Let's check the source code of AE!CFAC::_getAttr for more clues.

CAttribute* pAttr = g_CAttributeManager.GetAttribute(attrID);
if(pAttr != NULL)
{
    ...
    CAutoReadLock lock(g_CAttributeRWLock);
    _bstr_t value;
    value = *pAttr;
    pVarData->bstrVal = SysAllocString(value.GetBSTR());
    ...
}

Interesting: a read lock is taken before calling SysAllocString, which means this data is shared between threads and a race condition is possible. After further code review, I found that g_CAttributeManager is a global variable that can be read or written by multiple threads, and that another function really does perform a write without taking the write lock. So, the whole story is:

  1. In the current thread, "value = *pAttr" obtains the expected _bstr_t object, which holds a valid m_Data member.
  2. Another thread destroys the pAttr referenced above, so the inner m_Data becomes invalid.
  3. Another thread creates a new CAttribute object whose address is the same as that of the old pAttr.
  4. The current thread continues to execute with the invalid m_Data, and the AV happens.
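
One way to close this kind of hole is to make every writer take the lock exclusively and to have readers copy the value while they still hold the shared lock. The sketch below uses std::shared_mutex from C++17 in place of the application's own lock classes. AttributeStore and its members are hypothetical names for illustration, not the real g_CAttributeManager interface.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for the attribute manager; not the application's class.
class AttributeStore {
public:
    // Writer: holds the lock exclusively, so no reader can observe an
    // entry while it is being replaced.
    void Set(int attrId, std::wstring value) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        attrs_[attrId] = std::move(value);
    }

    // Writer: removal also requires the exclusive lock.
    void Remove(int attrId) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        attrs_.erase(attrId);
    }

    // Reader: the lookup and the copy both happen under the shared lock,
    // and only the copy is returned, so no pointer into the map escapes.
    std::optional<std::wstring> Get(int attrId) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        auto it = attrs_.find(attrId);
        if (it == attrs_.end()) {
            return std::nullopt;
        }
        return it->second;
    }

private:
    mutable std::shared_mutex mutex_;
    std::map<int, std::wstring> attrs_;
};
```

Returning a copy instead of a pointer is a deliberate choice in this sketch: the caller never holds a reference to an object that another thread could destroy after the lock is released.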

The root cause of a race condition is very difficult to identify, and I was lucky to find this one through code review. I hope this helps if you experience similar problems.