Tarik Soulami: Anatomy of a debugging investigation

9780735662780xThis guest post is by Tarik Soulami, a Principal Development Lead at Microsoft and author of Inside Windows Debugging: A Practical Guide to Debugging and Tracing Strategies in Windows, which is getting very positive reviews on Amazon. Enjoy!

Tarik here. When you think of scripts, you don’t usually think of threads, synchronization or deadlocks. However, that was precisely the curious bug I had to investigate a few weeks ago. The script I was trying to write had a simple goal: parse the standard and error outputs from another existing command-line tool (a console application), and write the processed (modified) text to the command prompt.

The repro case

My console application tool was fairly complex, but I was able to simulate the hang condition even with the following simple program:

    for (int n = 0; n < 500; n++)
        fprintf(stdout, "%s\n", "This is an output sent to stdout.");
        fprintf(stderr, "%s\n", "This is an output sent to stderr.");

To help isolate the problem further, I went ahead and removed all the parsing logic that my Visual Basic script contained. Even with this simplified configuration where the output was redirected as-is, I continued to get the hang behavior, as shown in the following output:

C:\ScriptHang>cscript test.vbs consoleapp.exe
TRACE: — Output begin —
This is an output sent to stdout.
This is an output sent to stdout.
This is an output sent to stdout.

This is an outpu^C

Here is the simplified script that I was using for this test experiment. It sure looked innocent enough!

Test.vbs: captures and redirects output of another command-line application

Option Explicit
Dim result

If WScript.Arguments.Count <> 1 Then
    WScript.StdOut.WriteLine("USAGE: cscript.exe test.vbs <ConsoleAppPath>")
End If

WScript.StdOut.WriteLine("TRACE: — Output begin —")

result = Run(WScript.Arguments.Item(0))

WScript.StdOut.WriteLine("TRACE: —- Output end —-")
WScript.StdOut.WriteLine("TRACE: Exit code: " & result)


‘ Execute the console program and capture its stdout/stderr output
Function Run (cmdLine)
    Dim shellObj, execObj, status    
    Set shellObj = CreateObject("WScript.Shell")
    Set execObj = shellObj.Exec(cmdLine)

    If execObj.ProcessID = 0 And execObj.Status = 1 Then
        ‘ Error launching the executable.
        Run = -1
        Exit Function
    End If

        ‘ Process stdout.
        Do Until execObj.StdOut.AtEndOfStream

        ‘ Process stderr.
        Do Until execObj.StdErr.AtEndOfStream
    Loop Until (execObj.Status <> 0 And execObj.StdOut.AtEndOfStream And

    ‘ Return the exit code
    Run = execObj.ExitCode
End Function

Choosing the debugging tool/environment

I must admit that I was both surprised and stumped at first. Why would a fairly simple script like this one hang? My immediate reaction was to attach the WinDbg user-mode debugger to the script host process (cscript.exe) and see what it was waiting on. Unfortunately, that wasn’t terribly useful. All I could see was a call to ReadFile, presumably related to the script’s call in progress to read from the console application’s stdout stream (as evidenced by the truncated message coming out of the script when it stopped printing to the console). However, there was no telling what was preventing this call from completing:

windbg.exe -pn cscript.exe
0:004> .symfix
0:004> .reload
0:004> ~*k
ChildEBP RetAddr
0029e4f4 775262c4 ntdll!KiFastSystemCallRet
0029e4f8 757de43e ntdll!ZwReadFile+0xc
0029e55c 76ee9bba KERNELBASE!ReadFile+0x118
0029e5a4 73643804 kernel32!ReadFileImplementation+0xf0
0029e5d0 73643890 scrrun!CFileStream::Read+0x3d
0029e678 7364394a scrrun!CTextStream::ReadBlock+0x86
0029e684 73643b85 scrrun!CTextStream::GetCharCore+0x15
0029e698 73643b36 scrrun!CTextStream::PeekChar0+0x2f
0029e6b0 76f93e75 scrrun!CTextStream::get_AtEndOfStream+0x60
0029e6cc 76f93cef OLEAUT32!DispCallFunc+0x165

Attaching WinDbg to the consoleapp.exe test tool showed a similar call stack, except the main thread in the process was stuck inside a call to the WriteFile Win32 API, which was in turn invoked by one of the fprintf calls in the C++ test program:

windbg.exe -pn consoleapp.exe
0:004> .symfix
0:004> .reload
0:004> ~*k
ChildEBP RetAddr
0015efbc 775a6a74 ntdll!KiFastSystemCallRet
0015efc0 757775d4 ntdll!ZwWriteFile+0xc
0015f024 770a543c KERNELBASE!WriteFile+0x113
0015f040 76064236 kernel32!WriteFileImplementation+0x76
0015f5fc 760640eb msvcrt!_write_nolock+0x3fb
0015f640 760645f9 msvcrt!_write+0x9f
0015f660 7605c835 msvcrt!_flsbuf+0xc0
0015f66c 7605d0ef msvcrt!_get_printf_count_output+0x41
0015f67c 7605d0ba msvcrt!_get_printf_count_output+0xa8
0015f90c 76063ef9 msvcrt!_output_l+0xb57
0015f954 000c121c msvcrt!fprintf+0x123
0015f97c 000c1390 consoleapp!wmain+0x37 [c:\scripthang\main.cpp @ 12]

At this point, I knew I had to use a kernel debugger so that I could peek at the kernel side of these two threads and see precisely what file objects they were waiting on. Because I didn’t have a second machine to use as a host kernel mode debugger, I went for the next best thing: local (live) kernel debugging.

Digging deeper with the debugger

After enabling kernel debugging on the machine (and rebooting), I re-ran the scenario once again and used local kernel debugging to examine the previous two call stacks. I started with the call in the script host process:

kd.exe -kl
lkd> .symfix
lkd> .reload
lkd> !process 0 0 cscript.exe
PROCESS 89c21030 SessionId: 1 Cid: 0510 Peb: 7ffd7000 ParentCid: 0b34
    Image: cscript.exe
lkd> .process /r /p 89c21030
lkd> !process 89c21030 7
PROCESS 89c21030 SessionId: 1 Cid: 0510 Peb: 7ffd7000 ParentCid: 0b34
    Image: cscript.exe
        THREAD 8a9aad48 Cid 0510.08ac Teb: 7ffdf000 Win32Thread: fe8a2dd8 WAIT: (Executive)
            89bdbfdc NotificationEvent
        ChildEBP RetAddr Args to Child
        b48cab70 82c8a65d 84fa1888 807c9308 807c6120 nt!KiSwapContext+0x26
        b48caba8 82c894a7 84fa1948 84fa1888 866676bc nt!KiSwapThread+0x266
        b48cabd0 82c830cf 84fa1888 84fa1948 00000000 nt!KiCommitThreadWait+0x1df
        b48cac48 82e36b81 866676bc 00000000 b48cad01 nt!KeWaitForSingleObject+0x393
        b48cac7c 82e6f4e8 00000103 86667660 00000001 nt!IopSynchronousServiceTail+0x270
        b48cad08 82c4a27a 86731e30 a4419508 00000000 nt!NtReadFile+0x644
        b48cad08 775a7094 86731e30 a4419508 00000000 nt!KiFastCallEntry+0x12a
        0016e834 775a62c4 7577e43e 0000016c 00000000 ntdll!KiFastSystemCallRet
        0016e838 7577e43e 0000016c 00000000 00000000 ntdll!ZwReadFile+0xc
        0016e89c 77099bba 0000016c 0016e934 00000080 KERNELBASE!ReadFile+0x118

In the previous listing, the handle to the file object being read from is 0x16c (the first argument to the ReadFile API call), and it is a named pipe:

lkd> !handle 0x16c
PROCESS 885cc780 SessionId: 1 Cid: 1dd4 Peb: 7ffd5000 ParentCid: 1c88
    Image: cscript.exe
016c: Object: 86667660 GrantedAccess: 00120189 Entry: 8ed9f2d8
Object: 86667660 Type: (84eab9b8) File
    ObjectHeader: 86667648 (new version)
        HandleCount: 1 PointerCount: 3
lkd> !fileobj 86667660
Device Object: 0x89abded8
Flags: 0x40082
       Synchronous IO
       Named Pipe
       Handle Created
File Object is currently busy and has 0 waiters.

It turns out the thread is waiting on the internal event of a named pipe file object to get signaled after the I/O read request was posted. This can also be verified by comparing the event object stored in the named pipe object structure with the notification event encountered earlier in the thread’s call stack:

lkd> dt nt!_FILE_OBJECT

   +0x05c Event            : _KEVENT…
lkd> dd 86667660+0x5c
Evaluate expression: -1984053284 = 89bdbfdc

Because the named pipe is currently empty, the thread blocks and waits for new bytes to be written to the named pipe stream.

Note  The blocking behavior in the named pipes case (NPFS file system) isn’t really surprising. However, you should also remember that even in the case of NTFS file handles, synchronous I/O read or write requests are handled similarly and involve a two-phase process. First, the request is queued up and the thread blocks, and then when the I/O is actually completed (hopefully very shortly thereafter in the case of NTFS), the internal event handle of the file object is signaled to unblock the requesting thread, which then completes the call and returns to user mode.

The question now is why consoleapp.exe isn’t writing any new data to its stdout stream (keeping the script waiting for new input). For that, let’s look at the state of its main thread (which we had already seen was blocked in a WriteFile call). It’s easy again to see that we have a very similar situation, where WriteFile is “pended” but cannot complete because the named pipe became full.

lkd> !process 0 0 consoleapp.exe
PROCESS 885cc780 SessionId: 1 Cid: 1e8c Peb: 7ffd9000 ParentCid: 2260
    Image: consoleapp.exe
lkd> .process /r /p 885cc780
lkd> !process 885cc780 7

        ChildEBP RetAddr Args to Child
        98758b60 82c8a65d 858edd08 807c9308 807c6120 nt!KiSwapContext+0x26
        98758b98 82c894a7 858eddc8 858edd08 85908184 nt!KiSwapThread+0x266
        98758bc0 82c830cf 858edd08 858eddc8 00000000 nt!KiCommitThreadWait+0x1df
        98758c38 82e36b81 85908184 00000000 885cc701 nt!KeWaitForSingleObject+0x393
        98758c6c 82e7c840 00000103 85908128 00000001 nt!IopSynchronousServiceTail+0x270
        98758d08 82c4a27a 86731e30 00000000 00000000 nt!NtWriteFile+0x6e8
        98758d08 775a7094 86731e30 00000000 00000000 nt!KiFastCallEntry+0x12a
        0021ef5c 775a6a74 757775d4 00000168 00000000 ntdll!KiFastSystemCallRet
        0021ef60 757775d4 00000168 00000000 00000000 ntdll!ZwWriteFile+0xc

lkd> !handle 00000168

0168: Object: 85908128 GrantedAccess: 00120196 (Inherit) Entry: a5d392d0
Object: 85908128 Type: (84eab9b8) File
    ObjectHeader: 85908110 (new version)
        HandleCount: 1 PointerCount: 3
lkd> !fileobj 85908128
Device Object: 0x86731e30
Flags: 0x40082
       Synchronous IO
       Named Pipe
       Handle Created
File Object is currently busy and has 0 waiters.

Everything starts to make sense now. It appears that the stderr pipe of the console process is full (this happens when the internal pipe buffer fills up with unprocessed data), and so the test tool isn’t able to write any more characters into the pipe until a client has pumped data off of it. However, cscript.exe hasn’t gotten around to reading from stderr just yet (the second loop in test.vbs). At the same time, cscript.exe has read all the characters from the stdout stream of the tool and is waiting on the test tool to emit more data into the empty pipe. This results in a deadlock!

Potential mitigations

Clearly, the console application tool isn’t at fault here. It is simply writing messages to its stdout and stderr streams. The problem lies within the script stream object implementation. The AtEndOfStream property was unfortunately designed as a blocking call when it could arguably have been designed to first peek the pipe (using for instance PeekNamedPipe) to see if there was any data in the pipe’s buffer before reading those available bytes. This issue has in fact been documented in a Microsoft KB article; see http://support.microsoft.com/kb/960246.

If you were writing your own client program (C++ or C#, for example) to parse the output of the console program, you could easily do just that (peek the pipe first before reading). But when using VBScript or JScript, you are unfortunately stuck with the built-in implementation of the scripting engine stream object (unless you want to implement and deploy your own COM scripting object to all the client machines).

A simple approach that works around this problem in the scripting engine’s stream object is to redirect the data from the stdout and stderr streams of the console application to temporary files on the file system. Once the files are written to disk, the test script could then parse and modify the data from those files as needed. This solution essentially buffers the streams manually instead of relying on the scripting object to redirect the standard and error output streams from the child process on the fly.

C:\ScriptHang>cscript test.vbs "cmd.exe /c consoleapp.exe 1>stdout.txt 2>stderr.txt"
TRACE: — Output begin —
TRACE: —- Output end —-
TRACE: Exit code: 0
echo After executing the command, the script can now safely process stdout.txt and stderr.txt


This case study demonstrated several important steps in a typical debugging investigation:

1. Isolate a good repro case:   When I started noticing the deadlock behavior in the script from this case study, it wasn’t actually occurring every time I ran the script, but it was depending on the specific command output from the console tool that I was passing to the script. Simplifying the console application to continuously write interleaving stderr and stdout messages highlighted the conditions under which the hang occurred, making for a 100% repro rate as I increased the number of iterations in the console application.

2. Pick the debugging environment that you’ll use to examine the scenario:   In this case, the script was hung in native code, outside the Visual Basic script code that I had written in Visual Studio. In fact, the hung script was waiting on an event in kernel mode! WinDbg is a great debugger to use when examining native code, and because I also wanted to dig deeper into what was happening on the kernel side of the fence, I ended up using live (local) kernel debugging. Even when debugging high-level code such as a script program, low-level debugging techniques can still sometimes prove very useful.

3. Understand the code paths that you’re dealing with:   A debugger leads you to the door step of the failure, but it’s still up to you to unveil the mystery by understanding how the code you’re debugging is supposed to work. In this case study, it was important to understand several system architecture fundamentals such as how named pipes work internally, how I/O requests are processed by the Windows OS, and many more aspects of synchronization. It was also useful to search the Internet for similar issues, which in this case led me right to a known KB article on the MSDN website.

4. Determine possible bug fixes:   The final step once you’ve understood the nature of the bug that you’re dealing with is to think about potential solutions. The more you understand about the bug, of course, the easier it is going to be to come up with sound approaches, so this step usually runs in parallel with the previous one.

Comments (2)

  1. Joseph Miller says:

    Great post!

  2. Brahim says:

    Nice ! great tuto