Exception and Debug Event, the feedback from OS

This section first gives a brief overview of exception-related techniques, and then uses examples to demonstrate how to use exceptions to troubleshoot effectively.

Exception Brief

An exception is a mechanism for controlling the execution flow of code. Normally, code executes sequentially, as in the following:

*p=11;

printf("%d",*p);

It should print 11. But what if p points to an invalid memory address? Then the line that assigns a value to *p triggers an access violation exception, and the following line that prints may never execute.
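To make the change of flow concrete, here is a minimal sketch in portable C++. It does not reproduce the access violation itself, which is raised by the OS and is platform specific; a thrown C++ exception stands in for it. The names store and run are made up for this illustration.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: a C++ throw stands in for the access violation
// that the OS would raise on a write through an invalid pointer.
void store(int* p, int value) {
    if (p == nullptr) {
        throw std::runtime_error("invalid pointer");
    }
    *p = value;
}

// Returns a trace of which statements actually ran.
std::vector<std::string> run(int* p) {
    std::vector<std::string> trace;
    try {
        store(p, 11);                                    // *p = 11;
        trace.push_back("print " + std::to_string(*p));  // printf("%d", *p);
    } catch (const std::runtime_error& e) {
        // Control lands here and the "print" step above is skipped.
        trace.push_back(std::string("exception: ") + e.what());
    }
    return trace;
}
```

Calling run with a valid pointer records only the print step; calling it with a null pointer records only the exception, which is the point: the statement after the faulting line never runs.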

For applications, if the behavior does not match expectations, an exception is a likely direct cause, because it is the most obvious and common way to change the execution flow. In most cases, troubleshooting a problem amounts to troubleshooting an exception.

In the original Chinese version, I discuss how the OS plays an important role in exception handling and dispatching. I also outline how different programming languages leverage SEH to support their exception handling mechanisms. I will skip that introduction here because the following two articles cover it all:

A Crash Course on the Depths of Win32™ Structured Exception Handling

https://www.microsoft.com/msj/0197/Exception/Exception.aspx

RaiseException

https://msdn.microsoft.com/library/default.asp?url=/library/en-us/debug/base/raiseexception.asp

Case study, how to let C++ dump the call stack as C# does

Applications written in C# or Java are able to dump the call stack from which an exception originates when the exception occurs. In C++, however, we have to use a debugger to get the call stack. Now the customer wants to achieve a stack dump in C++. Any good ideas?

My solution is to use SEH, relying on the fact that the destructors of local variables are executed during stack unwinding when an exception occurs. The sample code worked fine on the VC6 + Win2k3 platform. However, when I retried the sample, the same code behaved strangely on VC2005 + Win2k3 SP1. Compiled in debug mode, it works fine. In release mode, however, the application quits silently. I saved the whole story in my MSN blog (English), please refer to:

SEH,DEP, Compiler,FS:[0] and PE format

https://eparg.spaces.msn.com/blog/cns!59BFC22C0E7E1A76!712.entry
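The idea behind that solution, destructors of local variables running during stack unwinding, can be sketched in portable C++. This is not the SEH-based sample from the blog post; it is a simplified stand-in that assumes C++17 (for std::uncaught_exceptions), and ScopeTracer, unwind_log and the inner/middle/outer functions are hypothetical names.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical log of frames that were unwound by an exception.
std::vector<std::string>& unwind_log() {
    static std::vector<std::string> log;
    return log;
}

// Place one instance at the top of every function to be tracked.
class ScopeTracer {
public:
    explicit ScopeTracer(const char* name)
        : name_(name), pending_(std::uncaught_exceptions()) {}

    ~ScopeTracer() {
        // More exceptions in flight than at construction time means this
        // destructor is running because the frame is being unwound.
        if (std::uncaught_exceptions() > pending_) {
            unwind_log().push_back(name_);
        }
    }

    ScopeTracer(const ScopeTracer&) = delete;
    ScopeTracer& operator=(const ScopeTracer&) = delete;

private:
    const char* name_;
    int pending_;
};

void inner()  { ScopeTracer t("inner");  throw std::runtime_error("boom"); }
void middle() { ScopeTracer t("middle"); inner(); }
void outer()  { ScopeTracer t("outer");  middle(); }

// Runs the call chain and returns the frames recorded during unwinding,
// innermost first.
std::vector<std::string> dump_stack_on_throw() {
    unwind_log().clear();
    try {
        outer();
    } catch (const std::exception&) {
        // The log is complete by the time the handler is entered.
    }
    return unwind_log();
}
```

The sketch records frames only when a handler exists somewhere up the stack, because that is what makes the unwinding happen; a frame that exits normally records nothing.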

Case study, Why Dr. Watson cannot save the dump file.

Problem Description:

The customer reports that their VC application crashes randomly. To obtain detailed information, the customer registers Dr. Watson so that a dump file is produced the next time an exception occurs. However, when the problem recurs, Dr. Watson saves nothing.

Background Info:

In the Chinese version, I provided brief information about what a dump file is and what can be found in a dump. Related information can be found at:

Description of the Dr. Watson for Windows (Drwtsn32.exe) Tool

https://support.microsoft.com/?id=308538

Specifying the Debugger for Unhandled User Mode Exceptions

https://support.microsoft.com/?id=121434

INFO: Choosing the Debugger That the System Will Spawn

https://support.microsoft.com/?id=103861

Generally speaking, by setting the AeDebug registry key, we can launch a debugger when an application crashes. If we choose Dr. Watson as the debugger, its default behavior is to generate a dump file.

Problem Analysis:

Back to the case: the customer fails to get the dump file. Possible causes are:

1. A bug in Dr. Watson. It does not work properly.

2. The customer’s application does not crash; it just exits, for example by calling ExitProcess.

To test point 1, I provided the following sample code:

int *p=0;

    *p=0;

With the above code, Dr. Watson captured the dump file successfully on the customer side, so Dr. Watson works fine. It seems that the crash claimed by the customer is not really caused by an unhandled exception. Maybe the application calls ExitProcess unexpectedly. Thus, when capturing information, we should not limit ourselves to unhandled exceptions. What we need to find out is how the process disappears: it may be a normal exit, or it may be an unhandled exception.

One way to find out is to run the application under windbg. However, doing that manually is troublesome, and an automatic way would be nicer. Windows provides a registry setting (Image File Execution Options) that makes an application start under a debugger. With this setting, when the specified process is started, the OS starts the debugger first and passes it the target process path and command line; the debugger then starts the target process and debugs it. This option is especially useful when we cannot start the process manually, for example a Windows service that starts before the user logs on:

How to debug Windows services

https://support.microsoft.com/?kbid=824344

Some malicious software abuses this mechanism to start itself silently whenever the target process is launched. In China this technique is known as IFEO (Image File Execution Options) hijacking.

The windbg folder contains a script called adplus.vbs, which can launch the debugger to obtain dump files. We will use this script here:

How to use ADPlus to troubleshoot "hangs" and "crashes"

https://support.microsoft.com/kb/286350/EN-US/

Use adplus /? to obtain detailed info.

The Actions:

Based on the above analysis, the detailed actions are:

1. On the customer’s machine, create a key named after the problematic process under Image File Execution Options.

2. Under the key, create a string value called Debugger.

3. Set the value to Debugger= C:\Debuggers\autodump.bat

4. Edit C:\Debuggers\autodump.bat as follows:

cscript.exe C:\Debuggers\adplus.vbs -crash -o C:\dumps -quiet -sc %1

With the above setting, when the application starts, the OS launches autodump.bat, which runs cscript.exe to execute the adplus.vbs script. The -sc switch of adplus.vbs specifies the path of the target process, -crash means the process is monitored until it quits, -o specifies the dump output folder, and -quiet disables prompts. We can use notepad.exe as a test to check whether a dump is generated when notepad.exe quits.

With the above setting, when the problem recurs, we get two dump files in the c:\dumps folder, called:

PID-0__Spawned0__1st_chance_Process_Shut_Down__full_178C_DateTime_0928.dmp

PID-0__Spawned0__2nd_chance_CPlusPlusEH__full_178C_2006-06-21_DateTime_0928.dmp

Pay attention to the second filename. The name indicates that a 2nd chance C++ exception did happen. Opening the dump in windbg and checking the call stack shows that the customer’s code throws a C++ exception but forgets to catch it. Adding the corresponding catch block fixed the issue.

The solution is nice, but why can’t Dr. Watson get the dump?

Dr. Watson’s behavior still confuses me. Since this is an unhandled exception, why can’t Dr. Watson capture the dump file? First, I created two different applications to double-check the behavior of Dr. Watson:

int _tmain(int argc, _TCHAR* argv[])

{

    throw 1;

    return 0;

}

int _tmain(int argc, _TCHAR* argv[])

{

    int *p=0;

    *p=0;

    return 0;

}

For the first one, Dr. Watson does not save the dump. For the second, Dr. Watson saves the dump. It looks like the behavior is related to the exception type.

Recall how the two applications above crash when the Auto value under AeDebug is set to 0. On my side, the message boxes shown for the crashes are:

---------------------------

Microsoft Visual C++ Debug Library

---------------------------

Debug Error!

Program: d:\xiongli\today\exceptioninject\debug\exceptioninject.exe

This application has requested the Runtime to terminate it in an unusual way.

Please contact the application's support team for more information.

(Press Retry to debug the application)

---------------------------

Abort Retry Ignore

---------------------------

---------------------------

exceptioninject.exe - Application Error

---------------------------

The instruction at "0x00411908" referenced memory at "0x00000000". The memory could not be "written".

Click on OK to terminate the program

Click on CANCEL to debug the program

---------------------------

OK Cancel

---------------------------

The behaviors are totally different! And the behavior is related to the compilation mode.

The SetUnhandledExceptionFilter API is used to replace the default unhandled exception handler. When the C++ runtime (CRT) initializes, it registers its own implementation (msvcrt!__CxxUnhandledExceptionFilter). When an unhandled exception occurs, that function checks the exception code. If it is a C++ exception, the function shows the first dialog; otherwise it passes the exception on to the default handler provided by the OS (kernel32!UnhandledExceptionFilter). For the first situation, the call stack is:

USER32!MessageBoxA

MSVCR80D!__crtMessageBoxA

MSVCR80D!__crtMessageWindowA

MSVCR80D!_VCrtDbgReportA

MSVCR80D!_CrtDbgReportV

MSVCR80D!_CrtDbgReport

MSVCR80D!_NMSG_WRITE

MSVCR80D!abort

MSVCR80D!terminate

MSVCR80D!__CxxUnhandledExceptionFilter

kernel32!UnhandledExceptionFilter

MSVCR80D!_XcptFilter

For the second, it is:

ntdll!KiFastSystemCallRet

ntdll!ZwRaiseHardError+0xc

kernel32!UnhandledExceptionFilter+0x4b4

release_crash!_XcptFilter+0x2e

release_crash!mainCRTStartup+0x1aa

release_crash!_except_handler3+0x61

ntdll!ExecuteHandler2+0x26

ntdll!ExecuteHandler+0x24

ntdll!KiUserExceptionDispatcher+0xe

release_crash!main+0x28

release_crash!mainCRTStartup+0x170

kernel32!BaseProcessStart+0x23

For detailed info, please refer to:

SetUnhandledExceptionFilter

https://msdn.microsoft.com/library/default.asp?url=/library/en-us/debug/base/setunhandledexceptionfilter.asp

UnhandledExceptionFilter

https://msdn.microsoft.com/library/default.asp?url=/library/en-us/debug/base/unhandledexceptionfilter.asp
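The routing decision described in this analysis can be modeled in a few lines. This is only a simplified model, not the CRT source: the real filter inspects more than the code, and route_unhandled and Route are hypothetical names. The two constants are not taken from the text above; they are the well-known values the debugger shows for an MSVC C++ exception and for an access violation.

```cpp
#include <cstdint>

// Exception codes as they appear in the debugger.
constexpr std::uint32_t kCppException    = 0xE06D7363;  // raised by an MSVC C++ throw
constexpr std::uint32_t kAccessViolation = 0xC0000005;  // invalid memory access

enum class Route {
    CrtTerminate,     // CRT path: terminate -> abort -> runtime dialog
    OsDefaultFilter,  // kernel32!UnhandledExceptionFilter -> Application Error dialog
};

// Hypothetical model of the check made by the CRT's unhandled exception filter.
constexpr Route route_unhandled(std::uint32_t code) {
    return code == kCppException ? Route::CrtTerminate : Route::OsDefaultFilter;
}
```

Under this model, the throw 1 sample takes the CrtTerminate route and the null pointer write takes the OsDefaultFilter route, which matches the two call stacks and the two different dialogs.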

Does the above analysis help explain Dr. Watson’s behavior? To be honest, I do not think so. I think it is due to Dr. Watson’s special handling of different exception types. The detailed research can be found at:

https://eparg.spaces.msn.com/blog/cns!59BFC22C0E7E1A76!1213.entry

Debug Event – communication between the OS and the debugger

A notification, also called a debug event, is a mechanism by which the OS notifies the debugger when something happens. As with exception handling, the OS dispatches the notification when the event occurs, provided a debugger is attached. Unlike an exception, a notification can be observed only by the debugger, not by the target process. Also, there is no distinction between 1st chance and 2nd chance. In windbg’s help file, all the notifications are listed in the Controlling Exceptions and Events topic. Common notifications include DLL load and unload, and thread creation and exit.

With exceptions and notifications, we can capture the key information for an issue.

Case study, VB6’s version.

The customer’s VB6 application cannot open a data file created by Access 2003 on the developer machine, although it works fine with data created by Access 97. On other machines, data from both Access 2003 and Access 97 works fine.

The reasoning is straightforward. Since the problem occurs only on a specific machine, the issue lies in the environment, not in the code. Since it depends on the Access version, it should be related to the DAO version. By checking the modules loaded by the EXE, I found that dao350.dll was loaded instead of dao360.dll. The next step is to figure out why dao350.dll gets loaded instead of dao360.dll.

DAO is a COM component, so it is likely created through a COM API. A simple approach is to trace the execution of a COM API such as CoCreateInstanceEx with the wt command, as I did in the ShellExecute case. However, if we really try that, the wt command may run for a whole day. A more workable way is needed. Since we would trace all the way to the library load anyway, why not set a breakpoint at LoadLibrary to check how dao350.dll gets loaded?

However, setting a breakpoint on LoadLibrary is not a very good way, because:

1. A DLL is not necessarily loaded through LoadLibrary. A native API such as ntdll!LdrLoadDll may load the module directly.

2. If hundreds of DLLs are loaded, breaking into LoadLibrary each time is troublesome, even if we set a conditional breakpoint to filter.

A better way is to leverage notifications. During a module load, the OS sends a notification to the debugger. In windbg, we can use wildcards to match and filter the DLL filename, which is easy to operate. First, use the “sxe ld:dao*.dll” command to intercept the module load notification; when the filename matches dao*.dll, the debugger breaks. (Detailed windbg usage is covered in later sections.) The result in the debugger is:

0:008> sxe ld:dao*.dll

ModLoad: 1b740000 1b7c8000 C:\Program Files\Common Files\Microsoft Shared\DAO\DAO360.DLL

eax=00000001 ebx=00000000 ecx=0013e301 edx=00000000 esi=7ffdf000 edi=20000000

eip=7c82ed54 esp=0013e300 ebp=0013e344 iopl=0 nv up ei pl zr na po nc

cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246

ntdll!KiFastSystemCallRet:

7c82ed54 c3 ret

ntdll!KiFastSystemCallRet

ntdll!NtMapViewOfSection

ntdll!LdrpMapViewOfDllSection

ntdll!LdrpMapDll

ntdll!LdrpLoadDll

ntdll!LdrLoadDll

0013e9c4 776ab4d0 0013ea40 00000000 00000008 kernel32!LoadLibraryExW

ole32!CClassCache::CDllPathEntry::LoadDll

ole32!CClassCache::CDllPathEntry::Create_rl

ole32!CClassCache::CClassEntry::CreateDllClassEntry_rl

ole32!CClassCache::GetClassObjectActivator

ole32!CClassCache::GetClassObject

ole32!CServerContextActivator::GetClassObject

ole32!ActivationPropertiesIn::DelegateGetClassObject

ole32!CApartmentActivator::GetClassObject

ole32!CProcessActivator::GCOCallback

ole32!CProcessActivator::AttemptActivation

ole32!CProcessActivator::ActivateByContext

ole32!CProcessActivator::GetClassObject

ole32!ActivationPropertiesIn::DelegateGetClassObject

ole32!CClientContextActivator::GetClassObject

ole32!ActivationPropertiesIn::DelegateGetClassObject

ole32!ICoGetClassObject

ole32!CComActivator::DoGetClassObject

ole32!CoGetClassObject                               

VB6!VBCoGetClassObject

VB6!_DBErrCreateDao36DBEngine

Checking the parameter of LoadLibraryExW shows:

0:000> du 0013ea40

0013ea40 "C:\Program Files\Common Files\Mi"

0013ea80 "crosoft Shared\DAO\DAO360.DLL"

From the above information, we see:

1. DAO360 is not created by CoCreateInstanceEx. Instead, it is created by CoGetClassObject, so tracing CoCreateInstanceEx would have wasted time.

2. The COM invocation starts from the VB6!_DBErrCreateDao36DBEngine function. We should check this function in detail.

Given the lessons of DLL hell, the first thing to check is the version of VB6.EXE, since the function resides in VB6. Compared with a working machine, the working module version is 6.00.9782, while the problematic one is 6.00.8176. Installing VS6 SP6 fixed the issue.

Discussions:

(In the Chinese version, I discussed how to analyze a dump even if it was not captured at the moment the exception first happened. I have to skip that here.)

Exit proactively for unhandled exception

In some situations, the developer makes the application exit proactively when an unhandled exception occurs, instead of waiting for the OS to terminate it. COM+ and ASP.NET use this technique. A Chinese C2C application called taobao wangwang (also named ali wangwang) uses it too. The benefits are:

1. We can define the UI for the crash.

2. We can save the unhandled exception info for later analysis.

3. We avoid interference from the debugger, guarantee immediate recycling, and can attempt necessary rescue operations such as restarting the process.

It is easy to implement. One way is to use the __try and __except clause. The other way is to use SetUnhandledExceptionFilter API. For the study of taobao wangwang, please refer to:

https://eparg.spaces.msn.com/blog/cns!59BFC22C0E7E1A76!817.entry (Chinese)

Based on my analysis, taobao uses SetUnhandledExceptionFilter to capture unhandled exceptions, and uses the MiniDumpWriteDump API to capture the dump proactively.

With this technique, it is hard for the debugger to get the crash dump directly. Some additional configuration and windbg commands are necessary:

How To Obtain a Userdump When COM+ Failfasts

https://support.microsoft.com/?id=287643

How to find the faulting stack in a process dump file that COM+ obtains

https://support.microsoft.com/?id=317317
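The proactive-exit idea in this section can be sketched in portable C++. This is a stand-in, not the Windows implementation: instead of SetUnhandledExceptionFilter and MiniDumpWriteDump it uses std::set_terminate, which only covers uncaught C++ exceptions (not access violations), and it writes a line of text rather than a dump. The names describe, on_unhandled and install_crash_handler are hypothetical.

```cpp
#include <cstdio>
#include <cstdlib>
#include <exception>
#include <stdexcept>
#include <string>

// Turns the exception in flight into text worth saving for later analysis.
std::string describe(std::exception_ptr ep) {
    if (!ep) {
        return "no active exception";
    }
    try {
        std::rethrow_exception(ep);
    } catch (const std::exception& e) {
        return std::string("std::exception: ") + e.what();
    } catch (...) {
        return "unknown exception type";
    }
}

// Runs instead of the default terminate handler: record, then exit at once.
[[noreturn]] void on_unhandled() {
    const std::string info = describe(std::current_exception());
    std::fprintf(stderr, "fatal: %s\n", info.c_str());
    std::_Exit(EXIT_FAILURE);  // leave without waiting for the default crash handling
}

// Call once at startup.
void install_crash_handler() {
    std::set_terminate(on_unhandled);
}
```

A program would call install_crash_handler() at the top of main. The handler is where a custom crash UI, a saved report, or a restart of the process would go, which corresponds to the three benefits listed above.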

How to troubleshoot UnhandledExceptionFilter

According to MSDN, UnhandledExceptionFilter is invoked only if no debugger is attached. Thus an application can use UnhandledExceptionFilter to escape the debugger’s tracing and protect some sensitive code. There are at least two ways for the target to evade the debugger:

1. The target uses the IsDebuggerPresent API to check whether a debugger is attached. If so, it refuses to execute the sensitive code.

2. The target puts the sensitive code in a function and registers that function as the unhandled exception filter. To execute the sensitive code, it just triggers an exception manually. Due to the design of exception handling, the code escapes the debugger’s tracing.

The first way is easy to bypass. Look at the implementation of IsDebuggerPresent:

0:000> uf kernel32!IsDebuggerPresent

kernel32!IsDebuggerPresent:

  281 77e64860 64a118000000 mov eax,fs:[00000018]

  282 77e64866 8b4030 mov eax,[eax+0x30]

  282 77e64869 0fb64002 movzx eax,byte ptr [eax+0x2]

  283 77e6486d c3 ret

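IsDebuggerPresent checks a flag that is reached through the FS register. FS:[18] holds the address of the current thread’s TEB, the field at offset 0x30 of the TEB points to the PEB of the current process, and the byte at offset 2 of the PEB is the BeingDebugged flag. In the debugger, we can change registers and memory easily. Here we just need to change the value of [[FS:[18]]:30]:2 to 0 to cheat IsDebuggerPresent into returning false.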
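The three instructions can be replayed against a simulated address space to see why patching one byte is enough. This is a model only: the byte vector, the addresses 0x100 and 0x200, and all function names are hypothetical; the offsets 0x18, 0x30 and 0x2 are the ones from the disassembly above.

```cpp
#include <cstdint>
#include <vector>

using Memory = std::vector<std::uint8_t>;  // simulated 32-bit address space

// Little-endian 32-bit load, as the x86 mov instructions perform.
std::uint32_t read32(const Memory& m, std::uint32_t addr) {
    return  static_cast<std::uint32_t>(m[addr])
         | (static_cast<std::uint32_t>(m[addr + 1]) << 8)
         | (static_cast<std::uint32_t>(m[addr + 2]) << 16)
         | (static_cast<std::uint32_t>(m[addr + 3]) << 24);
}

void write32(Memory& m, std::uint32_t addr, std::uint32_t value) {
    for (int i = 0; i < 4; ++i) {
        m[addr + i] = static_cast<std::uint8_t>(value >> (8 * i));
    }
}

// Hypothetical layout, chosen for the simulation only.
constexpr std::uint32_t kTeb = 0x100;  // where the FS segment base points
constexpr std::uint32_t kPeb = 0x200;

Memory make_process(bool debugged) {
    Memory m(0x300, 0);
    write32(m, kTeb + 0x18, kTeb);     // TEB self pointer
    write32(m, kTeb + 0x30, kPeb);     // TEB -> PEB
    m[kPeb + 0x2] = debugged ? 1 : 0;  // PEB.BeingDebugged
    return m;
}

bool is_debugger_present(const Memory& m, std::uint32_t fs_base) {
    std::uint32_t eax = read32(m, fs_base + 0x18);  // mov eax,fs:[00000018]
    eax = read32(m, eax + 0x30);                    // mov eax,[eax+0x30]
    return m[eax + 0x2] != 0;                       // movzx eax,byte ptr [eax+0x2]
}
```

Creating the memory with make_process(true) and then writing 0 to the byte at offset 2 of the simulated PEB makes is_debugger_present return false, which is the same patch a person at the debugger would apply to the real process.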

For the second way, changing [[FS:[18]]:30]:2 does not work, because the decision is based on the result of a kernel call. However, that does not make it impossible. Kwan Kyun Kim provides a way to cheat:

How to debug UnhandleExceptionHandler

https://eparg.spaces.msn.com/blog/cns!59BFC22C0E7E1A76!1208.entry

Next I will discuss memory, including Heap, Stack, and the lovely heap corruption and pageheap.