Hello - Matthew here again. Today I'll be discussing in detail hang scenario #1 that Tate first mentioned a few blogs posts ago. From a debugging perspective, in an ideal world an application would always provide some kind of feedback when a failure occurs. The reality is that sometimes an application just doesn't do anything in response to a user-initiated action (such as a mouse-click) that results in a failure of some kind. When this happens, how can you determine what is going wrong?
In general, Process Monitor from Sysinternals is a great tool for gathering clues about this kind of failure. If the failure is due to a file system or registry problem, Process Explorer will catch the problematic event. Or if you suspect that the problem is due to something failing over the network, a network trace could help.
Let's assume that the techniques described above do not move us any closer to understanding the failure. What next? If you don't know what the app does in the background, and you don't have source code for the app, how do you proceed? Let's answer those questions by debugging a sample application...
Here's what we know:
1. Clicking the “Button 1” button is supposed to display a useful dialog box
2. For certain users, clicking “Button 1” results in no observable response
3. You don't have access to the source code or debug symbols for the app
4. No one seems to know what the “Button 1” button does behind the scenes
5. The developer of the application hasn't been heard from in 3 years
We need to understand what exactly happens when the user clicks the “Button 1” button. Every window has a WindowProc function that receives messages upon user input. Since a button is considered a "control", the WindowProc for the main application window will get a WM_COMMAND message indicating that the button was clicked. To differentiate which specific control sent the message, each WM_COMMAND message has a control identifier associated with it.
So we need to:
a) Find the control identifier for “Button 1”
b) Find the WindowProc for the main application window
c) Determine where WM_COMMAND messages for the “Button 1” button are handled
d) Figure out what is failing in that code
Let’s get started...
Find the control identifier for “Button 1”
Spy++ (available as part of Visual Studio) is the tool for this job. Click Spy -> Log Messages. Use the Finder Tool to select the main window of ntdbghang1.exe. On the messages tab, click “Clear All” then select “WM_COMMAND”. Click “OK” and SPY++ will start logging. Click “Button 1” in ntdbghang1.exe and you should see one line of text appear in the output window. For good measure, click “Button 2” also, and you should see a second line of text. At this point, your SPY++ window should look something like this....
We can tell from this output, that the control identifier for “Button 1” is 257 (0x101), and the control identifier for “Button 2” is 258 (0x102). We’ll need this information later.
Find the WindowProc for the main application window
We can find the WindowProc address by using SPY++ again. Click Spy -> Find Window and use the Finder Tool to select the main window of ntdbghang1.exe. Make sure “Show properties” is selected, and then click “OK”. You will get a dialog that looks like this....
Note the value of Window Proc. This is the address of the window procedure for the main window (the exact value may differ on your system). We know that this function is a WindowProc, which means it is of the form...
LRESULT CALLBACK WindowProc(
The meaning of wParam and lParam differ depending on what uMsg is being processed. When a WM_COMMAND message is passed, the low-word of wParam is the control identifier, which we determined to be 0x101. The high-word of wParam is the control notification code, which in the case of a button being clicked is BN_CLICKED (literally 0).
So, we specifically are interested in the case where:
uMsg = WM_COMMAND (literally 0x111)
wParam = 0x101
While looking at assembly language for the WindowProc, the stack frame will look like this...
ebp = “old ebp”
ebp+4 = “return address”
ebp+8 = hwnd
ebp+c = uMsg
ebp+10 = wParam
ebp+14 = lParam
Determine where WM_COMMAND messages for the “Button 1” button are handled
Once you have the address of the WindowProc, we can examine the assembly code for that function using Windbg. Launch windbg.exe and select “File -> Attach to a Process”. Select “ntdbghang1.exe” from the list and click “OK”. When the debugger breaks in, you can unassemble the beginning of the WindowProc by typing “u <address>”. Based on the address on my system, that command will be “u 01002830”. To unassemble more just type “u” again. I’m now going to unassemble the relevant code and explain what it means along the way...
0:001> u 01002830
The first 3 instructions are a function prologue, just getting everything set up.
01002830 8bff mov edi,edi
01002832 55 push ebp
01002833 8bec mov ebp,esp
Here is the code that moves the value of uMsg into ecx
01002835 8b4d0c mov ecx,dword ptr [ebp+0Ch]
Now the app enters the assembly equivalent of a switch statement.
We are interested in the check for uMsg = WM_COMMAND
if uMsg = WM_CREATE (1) goto 01002893
01002838 49 dec ecx
01002839 7458 je ntdbghang1+0x2893 (01002893)
if uMsg = WM_ DESTROY (2) goto 01002889
0100283b 49 dec ecx
0100283c 744b je ntdbghang1+0x2889 (01002889)
if uMsg = WM_CLOSE (0x10) goto 01002889
0100283e 83e90e sub ecx,0Eh
01002841 743b je ntdbghang1+0x287e (0100287e)
if uMsg = WM_COMMAND (0x111) goto 01002853
01002843 b801010000 mov eax,101h
01002848 2bc8 sub ecx,eax
0100284a 7407 je ntdbghang1+0x2853 (01002853)
From the above switch statement we see that when uMsg = WM_COMMAND, execution moves to 01002853. So let’s go there...
0:001> u 01002853
Move the value of wParam into edx
01002853 8b5510 mov edx,dword ptr [ebp+10h]
If LOWORD(wParam) == 0x101 (control ID for button 1) goto 0100286f
01002856 0fb7ca movzx ecx,dx
01002859 2bc8 sub ecx,eax
0100285b 7412 je ntdbghang1+0x286f (0100286f)
From the above assembly we can see that right away we are checking for the control ID for button 1. This is the code path we want to follow....
0:001> u 0100286f
If HIWORD(wparam) != BN_CLICKED (0) then goto 0100289b
Otherwise, call the function at 010027f6
0100286f c1ea10 shr edx,10h
01002872 6685d2 test dx,dx
01002875 7524 jne ntdbghang1+0x289b (0100289b)
01002877 e87affffff call ntdbghang1+0x27f6 (010027f6)
Now it is clear that when the BN_CLICKED occurs for control id 0x101, we are going to execute the function at address 010027f6.
Figure out what is failing
So now we know what code is going to run every time the button is clicked (the function at 010027f6). Let’s examine that code and figure out what may be failing. We’ll use “uf” to unassemble the entire function, as it is rather short. During this first pass we’ll make up names for our local variables and functions that are called, and then we’ll revisit and figure out what they are.
0:001> uf 010027f6
010027f6 8bff mov edi,edi
010027f8 55 push ebp
010027f9 8bec mov ebp,esp
010027fb 51 push ecx
010027fc 56 push esi
call func1(0x20) [func1 address at 01002dca]
010027fd 6a20 push 20h
010027ff c745fc10000000 mov dword ptr [ebp-4],10h
01002806 e8bf050000 call ntdbghang1+0x2dca (01002dca)
Store the result of func1 in esi (call it localvar2)
0100280b 8bf0 mov esi,eax
Call func2(localvar2, &localvar1) [func2 address at 0100b484]
0100280d 59 pop ecx
0100280e 8d45fc lea eax,[ebp-4]
01002811 50 push eax
01002812 56 push esi
01002813 e86c8c0000 call ntdbghang1+0xb484 (0100b484)
If func2 returns 0, then goto 01002821
01002818 85c0 test eax,eax
0100281a 7405 je ntdbghang1+0x2821 (01002821)
Call func3() [func3 address at 0100278d]
0100281c e86cffffff call ntdbghang1+0x278d (0100278d)
Call func4(localvar2) [func4 address at 01002ce3]
01002821 56 push esi
01002822 e8bc040000 call ntdbghang1+0x2ce3 (01002ce3)
Clean up and exit the function
01002827 59 pop ecx
01002828 5e pop esi
01002829 c9 leave
0100282a c3 ret
So let’s translate this to some pseudo-code...
localvar1 = 0x10;
localvar2 = func1(0x20);
So the behavior of the function will differ, depending on the results of func2. Let’s see if we can figure out what func2 does...
0:001> u 0100b484
0100b484 ff2568110001 jmp dword ptr [ntdbghang1+0x1168 (01001168)]
0100b48a cc int 3
0100b48b cc int 3
0100b48c cc int 3
0:001> dps 01001168 L1
01001168 70b88cb1 WINSPOOL!GetDefaultPrinterW
GetDefaultPrinterW is a public API function. Here’s the function prototype from MSDN...
LPTSTR pszBuffer, // printer name buffer
LPDWORD pcchBuffer // size of name buffer
This fits our pseudo-code, since the function takes two parameters and returns a BOOL. Let’s update our pseudo code to match what we know, replacing func2 with GetDefaultPrinter, etc...
DWORD cchBuffer = 0x10;
LPWSTR pszBuffer = func1(0x20);
Based on the above, we can make a couple of assumptions. Func1 is likely an allocator function of some kind (such as malloc) and func4 is likely a memory free function (such as free). Since the app is using the Unicode version of GetDefaultPrinter (it ends with a W) it make sense that func1 is an allocator asking for 0x20 bytes, and the size of the buffer, in characters, passed to GetDefaultPrinterW is 0x10. So that accounts for all of the functions, except for func3. We could unassemble func3 to see what it does, but that may not be needed. From the pseudo-code we’ve already created, we can see that a failure in GetDefaultPrinterW will prevent the button click function from doing anything except allocating and freeing some memory. So GetDefaultPrinterW is a likely potential point of failure.
Let’s test this theory out. We’ll set a breakpoint on GetDefaultPrinterW and see if it is failing upon the button click.
0:001> bp WINSPOOL!GetDefaultPrinterW
<Now click Button 1>
Breakpoint 0 hit
eax=0006fb84 ebx=00000000 ecx=00000020 edx=00dc0e98 esi=00dc0e70 edi=0006fc0c
eip=70b88cb1 esp=0006fb74 ebp=0006fb88 iopl=0 nv up ei pl nz na po nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000202
70b88cb1 8bff mov edi,edi
// Dump out the call stack...
ChildEBP RetAddr Args to Child
0006fb70 01002818 00dc0e70 0006fb84 00000111 WINSPOOL!GetDefaultPrinterW
WARNING: Stack unwind information not available. Following frames may be wrong.
0006fb88 0100287c 0006fbbc 75d41a10 00320f78 ntdbghang1+0x2818
0006fb90 75d41a10 00320f78 00000111 00000101 ntdbghang1+0x287c
0006fbbc 75d41ae8 01002830 00320f78 00000111 USER32!InternalCallWinProc+0x23
0006fc34 75d4286a 00000000 01002830 00320f78 USER32!UserCallWinProcCheckWow+0x14b
0006fc74 75d42bba 00a90b80 0095ee08 00000101 USER32!SendMessageWorker+0x4b7
// Go to the return address. Note that eax=0, meaning GetDefaultPrinterW returned FALSE.
eax=00000000 ebx=00000000 ecx=76f22033 edx=00e10178 esi=00dc0e70 edi=0006fc0c
eip=01002818 esp=0006fb80 ebp=0006fb88 iopl=0 nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246
01002818 85c0 test eax,eax
// Check the last error...
LastErrorValue: (Win32) 0x7a (122) - The data area passed to a system call is too small.
// Resume execution
The error returned was 0x7a = ERROR_INSUFFICIENT_BUFFER. It looks like the buffer of 0x20 bytes being passed to GetDefaultPrinterW is too small to hold the name of the default printer for that user. This explains why the app works for some users and fails for others. A quick change of the default printer name to something less than 0x10 characters (16 decimal) will work around this issue.
To wrap things up, here is the C source for the Button1_OnClick function (at address 010027f6 in the assembly) so you can see how this was actually written...
DWORD cch = 16;
pPrinterName = (LPTSTR) malloc(16 * sizeof(TCHAR));
Look for an upcoming blog post that shows another way this problem could have been approached – using the “wt” command in the debugger. I hope you found this useful and can apply to real-world debugging scenarios. Please post any questions / comments!