Case Study: Debug distributed application using packet analyzer


Packet analyzers are powerful tools to troubleshoot remote communications between distributed applications. Microsoft are creating various new frameworks to lower the bar/cost for the developers to build distributed applications in Windows ecosystem, while those new technologies introduces additional layers and abstractions on top of existing APIs. And sometimes, if certain error happens in the lower stacks, the error information could be easily lost/translated by layers of wrapper codes in higher stacks, and this could make debugging such error extremely difficult. By using packet analyzers, one can see the raw information being transferred over the wire; therefore, this bottom-up approach could give developers a better understanding of what actually is going on first, and then, knowing what it is would make debugging in higher level more efficient later on.

The Bug:

The following C# code segment is to obtain the hardware plug-and-play information from a remote machine:

ConnectionOptions options = new ConnectionOptions(

    null,

    "administrator",

    "!!123abc",

    "NTLMDOMAIN:10.177.17.6",

    ImpersonationLevel.Impersonate,

    AuthenticationLevel.PacketPrivacy,

    true,

    null,

TimeSpan.MaxValue);

 

ManagementScope scope = new ManagementScope(

    @"\\10.177.17.6\root\cimv2",

options);

 

scope.Connect();

 

ManagementClass targetClass = new ManagementClass(

    scope,

    new ManagementPath("Win32_PnPEntity"),

    null);

 

ManagementObjectCollection collection = targetClass.GetInstances(); 

foreach (ManagementBaseObject wmiObject in collection)

{

   ...

}

The above codes use WMI .NET with 3.5 SP1 runtime. It connect to the target machine (10.177.17.6) using that machine’s local administrator account, and it also specifies NTLM as the authentication protocol (NTLMDOMAIN).  After the connection is established, it remotely enumerates all the instances from WMI Win32_PnPEntity class.

This seemingly simple code works fine in most of time; however, in one particular situation: if the target machine is running under Windows Preinstallation Environment 3.0, the code throws an COMException at some random time during the foreach loop. And the exception code can be either RPC_S_CALL_FAILED_DNE (0x800706BF The remote procedure call failed) or RPC_S_CALL_FAILED (0x800706BE The remote procedure call failed and did not execute).

The Debugging:

Firstly, some knowledge about how Microsoft technologies are layered can help to understand why the exception talks about RPC failure and why it is the type of COMException:

The WMI .NET is just a wrapper above native WMI framework, and the remote communication feature of WMI is built on top of DCOM, and then again, DCOM utilize MSRPC as its inter-process communication infrastructure. So this exception is basically trying to tell the developers that, there is an error on the RPC level detected by DCOM stack, and then that error is reported to the .NET framework layer, and lastly .NET framework raises that error as COMException back to the user code.

If Windbg is attached to the codes, the output shows that error 0x80070721 and 0x80070005 always happened before that COMException

(dd8.a6c): Unknown exception - code 80070721 (first chance)
(dd8.a6c): Unknown exception - code 80070005 (first chance)
(dd8.a6c): Unknown exception - code 800706bf (first chance)

0x8007005 is access denied, and 0x80070721 is less known, use !error command

0:000> !error 0x80070721
Error code: (HRESULT) 0x80070721 (2147944225) - A security package specific error occurred.

It looks like some security related problem has happened right before the client codes failed. That’s all the information there, and nothing furthermore can be investigated in the debugger (unless some serious reverse engineering are done, which is beyond this case study). It seems the debugging is getting into a dead-end, and now it is the time to start the bottom-up approach.

Note: Another possible approach is to setup the debugger to use symbols from Microsoft Reference Source Server, which would provide source codes access for .NET framework.

Use any packet analyzers (such as WireShark or Microsoft Network Monitor) and start a capture session, and repro this bug again, then save the capture file. The content of the capture file is shown below

No.     Time        Source                Destination           Protocol Info

    406 5.078086    10.177.17.15          10.177.17.6           DCERPC   Bind: call_id: 3 ISystemActivator V0.0, NTLMSSP_NEGOTIATE
    407 5.078721    10.177.17.6           10.177.17.15          DCERPC   Bind_ack: call_id: 3, NTLMSSP_CHALLENGE accept max_xmit: 5840 max_recv: 5840
    408 5.078987    10.177.17.15          10.177.17.6           DCERPC   AUTH3: call_id: 3, NTLMSSP_AUTH, User: 10.177.17.6\administrator

At the beginning of our capture file, the RPC layer established TCP connection and started to authenticate with the remote machine (10.177.17.6). The first 3 frames (highlighted in gray) are the standard 3-leg NTLM authentication protocol, NEGOTIATE –> CHALLENGE –> AUTH, see [PWS00] chapter 7 for details. As recorded in frame 408 (AUTH), the credential used for authentication is the local administrator account on the target machine. Everything is as expected.

    409 5.079045    10.177.17.15          10.177.17.6           ISystemActivator RemoteCreateInstance request[Malformed Packet] 
    411 5.081677    10.177.17.6           10.177.17.15          ISystemActivator RemoteCreateInstance response[Malformed Packet]

    ...

    517 5.334018    10.177.17.15          10.177.17.6           DCERPC   Request: call_id: 7 opnum: 6 ctx_id: 0 IEnumWbemClassObject V0
    518 5.334248    10.177.17.6           10.177.17.15          TCP      49157 > 59916 [ACK] Seq=221 Ack=789 Win=130560 Len=0
    519 5.334574    10.177.17.6           10.177.17.15          DCERPC   Response: call_id: 7 ctx_id: 0 IEnumWbemClassObject V0

    ...

After the initial authentication, the RPC conversation is established, and the DCOM starts to do its own stuff, such as remote activation (frame 409 - 411), reference count keeping, object lifecycle management etc., and then it is followed by WMI communication, such as IEnumWbemClassObject::Clone (frame 517 – 519, opnum 6 is Clone) and etc.. Most of those packages are about standard operations and have no interests to this bug, so they are not shown in this blog.

Note: see [EC98] and [MS-WMI] for more information about DCOM and WMI

Then, what happened next is most interesting

    520 5.334778    10.177.17.15          10.177.17.6           DCERPC   Alter_context: call_id: 8, 2 context items, 1st IRemUnknown2 V0.0, NTLMSSP_NEGOTIATE
    521 5.334891    10.177.17.6           10.177.17.15          DCERPC   Alter_context_resp: call_id: 8, NTLMSSP_CHALLENGE accept max_xmit: 5840 max_recv: 5840
    522 5.335016    10.177.17.15          10.177.17.6           DCERPC   AUTH3: call_id: 8, NTLMSSP_AUTH, User: REDMOND\jiangyue
    523 5.335047    10.177.17.15          10.177.17.6           IRemUnknown2 RemQueryInterface request IID[1]=IManagedObject
    524 5.335231    10.177.17.6           10.177.17.15          TCP      49157 > 59916 [ACK] Seq=821 Ack=1543 Win=129792 Len=0
    525 5.335313    10.177.17.6           10.177.17.15          DCERPC   Fault: call_id: 8 ctx_id: 1 status: Unknown (0x00000721)
    526 5.335379    10.177.17.15          10.177.17.6           IRemUnknown2 RemQueryInterface request IID[1]=IProvideClassInfo
    527 5.335574    10.177.17.6           10.177.17.15          DCERPC   Fault: call_id: 9 ctx_id: 1 status: nca_s_fault_access_denied

    528 5.335574    10.177.17.6           10.177.17.15          TCP      49157 > 59916 [FIN, ACK] Seq=885 Ack=1687 Win=131328 Len=0
    529 5.335600    10.177.17.15          10.177.17.6           TCP      59916 > 49157 [ACK] Seq=1687 Ack=886 Win=130304 Len=0
    530 5.336227    10.177.17.15          10.177.17.6           TCP      59916 > 49157 [FIN, ACK] Seq=1687 Ack=886 Win=130304 Len=0
    531 5.336356    10.177.17.6           10.177.17.15          TCP      49157 > 59916 [ACK] Seq=886 Ack=1688 Win=131328 Len=0

As highlighted by yellow color, RPC decided to re-authenticate with target machine using a different set of credential redmond\jiangyue (frame 520 - 522), and this credential is belong to the security token of current calling thread. However, redmond\jiangyue is not a valid credential on the target machine.  Therefore, when the client machine called QueryInterface on the remote object for IManagedObject, the RPC layer on the remote machine responded with Fault message containing error code 0x00000721 (frame 525). Then again, the client tried to make another request for IProviderClassInfo to remote PRC layer, this time it responded with another Fault message containing error code 0x00000005 (access_denied). Those two errors matched the two exceptions seen by Windbg earlier on. After received two consecutive failed attempt, the RPC layer on the remote machine decided to send out the FIN message and close the current TCP connection, as indicated in frame 528 – 531.

    532 5.336862    10.177.17.15          10.177.17.6           TCP      59917 > 49157 [SYN] Seq=0 Win=8192 Len=0 MSS=1460 WS=8 SACK_PERM=1
    533 5.336943    10.177.17.6           10.177.17.15          TCP      49157 > 59917 [SYN, ACK] Seq=0 Ack=1 Win=8192 Len=0 MSS=1460 WS=8 SACK_PERM=1
    534 5.337055    10.177.17.15          10.177.17.6           TCP      59917 > 49157 [ACK] Seq=1 Ack=1 Win=131328 Len=0

    535 5.337389    10.177.17.15          10.177.17.6           DCERPC   Bind: call_id: 11, 2 context items, 1st IEnumWbemClassObject V0.0, NTLMSSP_NEGOTIATE
    536 5.337595    10.177.17.6           10.177.17.15          DCERPC   Bind_ack: call_id: 11, NTLMSSP_CHALLENGE accept max_xmit: 5840 max_recv: 5840
    537 5.337729    10.177.17.15          10.177.17.6           DCERPC   AUTH3: call_id: 11, NTLMSSP_AUTH, User: 10.177.17.6\administrator

    538 5.337755    10.177.17.15          10.177.17.6           DCERPC   Request: call_id: 11 opnum: 3 ctx_id: 0 IEnumWbemClassObject V0
    539 5.337971    10.177.17.6           10.177.17.15          TCP      49157 > 59917 [ACK] Seq=221 Ack=789 Win=130560 Len=0
    540 5.338106    10.177.17.6           10.177.17.15          DCERPC   Response: call_id: 11 ctx_id: 0 IEnumWbemClassObject V0
    541 5.338385    10.177.17.15          10.177.17.6           DCERPC   Alter_context: call_id: 12, 2 context items, 1st IRemUnknown2 V0.0
    542 5.338450    10.177.17.6           10.177.17.15          DCERPC   Alter_context_resp: call_id: 12 Provider rejection, reason: Proposed transfer syntaxes not supported
    543 5.338491    10.177.17.15          10.177.17.6           IRemUnknown2 RemQueryInterface request
    544 5.338605    10.177.17.6           10.177.17.15          IRemUnknown2 RemQueryInterface response 

The client seemed ignore all the security errors and continue its execution. The client made a new TCP connection to the remote machine by sending a new SYN message, as indicated in frame 532 – 534. And this time, it uses the right credential 10.177.17.6\administrator so it was enable to make some progress, until again the RPC layer decided to re-authenticate with a wrong credential redmond\jiangyue to query IManagedObject interface. This pattern repeated itself again and again in the captured file.

The client was struggling to move forward until it reached the foreach loop, where it needed to call MoveNext() on the enumerator repeatedly.  

    745 5.338889    10.177.17.15          10.177.17.6           DCERPC   Alter_context: call_id: 13 IWbemFetchSmartEnum V0.0, NTLMSSP_NEGOTIATE 
    746 5.338996    10.177.17.6           10.177.17.15          DCERPC   Alter_context_resp: call_id: 13, NTLMSSP_CHALLENGE accept max_xmit: 5840 max_recv: 5840 
    747 5.339116    10.177.17.15          10.177.17.6           DCERPC   AUTH3: call_id: 13, NTLMSSP_AUTH, User: REDMOND\jiangyue 
    748 5.339146    10.177.17.15          10.177.17.6           DCERPC   Request: call_id: 13 opnum: 3 ctx_id: 3 IWbemFetchSmartEnum V0 
    749 5.339337    10.177.17.6           10.177.17.15          TCP      49157 > 59917 [ACK] Seq=685 Ack=1759 Win=131328 Len=0 
    750 5.339415    10.177.17.6           10.177.17.15          DCERPC   Fault: call_id: 13 ctx_id: 3 status: Unknown (0x00000721
    756 5.339866    10.177.17.6           10.177.17.15          DCERPC   Response: call_id: 14 ctx_id: 0 IEnumWbemClassObject V0 
    757 5.339887    10.177.17.15          10.177.17.6           TCP      59917 > 49157 [ACK] Seq=1871 Ack=5101 Win=131328 Len=0 
    758 5.340308    10.177.17.15          10.177.17.6           DCERPC   Request: call_id: 15 opnum: 3 ctx_id: 3 IWbemFetchSmartEnum V0 
    759 5.340479    10.177.17.6           10.177.17.15          DCERPC   Fault: call_id: 15 ctx_id: 3 status: nca_s_fault_access_denied
 
    760 5.340500    10.177.17.6           10.177.17.15          TCP      49157 > 59917 [FIN, ACK] Seq=5133 Ack=1967 Win=131072 Len=0 
    761 5.340515    10.177.17.15          10.177.17.6           TCP      59917 > 49157 [ACK] Seq=1967 Ack=5134 Win=131328 Len=0 
    762 5.340568    10.177.17.15          10.177.17.6           TCP      59917 > 49157 [FIN, ACK] Seq=1967 Ack=5134 Win=131328 Len=0 
    763 5.340651    10.177.17.6           10.177.17.15          TCP      49157 > 59917 [ACK] Seq=5134 Ack=1968 Win=131072 Len=0
  

As highlighted above in yellow, after the client enter the foreach loop,  it sent out the IWbemFetchSmartEnum::GetSmartEnum request (in frame 747, opnum 3) using the wrong credential once again, and it resulted the remote RPC to drop the TCP connection. This was the last user actions. After that, DCOM started garbage collections (frame 842-845) and then the communication over the wire was completely stopped.

    836 6.612612    10.177.17.15          10.177.17.6           TCP      59918 > 49157 [SYN] Seq=0 Win=8192 Len=0 MSS=1460 WS=8 SACK_PERM=1  
    837 6.612861    10.177.17.6           10.177.17.15          TCP      49157 > 59918 [SYN, ACK] Seq=0 Ack=1 Win=8192 Len=0 MSS=1460 WS=8 SACK_PERM=1  
    838 6.613020    10.177.17.15          10.177.17.6           TCP      59918 > 49157 [ACK] Seq=1 Ack=1 Win=131328 Len=0
 
    839 6.613592    10.177.17.15          10.177.17.6           DCERPC   Bind: call_id: 17, 3 context items, 1st IRemUnknown2 V0.0, NTLMSSP_NEGOTIATE  
    840 6.614019    10.177.17.6           10.177.17.15          DCERPC   Bind_ack: call_id: 17, NTLMSSP_CHALLENGE accept max_xmit: 5840 max_recv: 5840  
    841 6.614199    10.177.17.15          10.177.17.6           DCERPC   AUTH3: call_id: 17, NTLMSSP_AUTH, User: 10.177.17.6\administrator  
    842 6.614236    10.177.17.15          10.177.17.6           IRemUnknown2 RemRelease request  
    843 6.614429    10.177.17.6           10.177.17.15          TCP      49157 > 59918 [ACK] Seq=245 Ack=881 Win=130560 Len=0  
    844 6.614761    10.177.17.6           10.177.17.15          IRemUnknown2 RemRelease response  
    845 6.616079    10.177.17.15          10.177.17.6           TCP      59918 > 49157 [RST, ACK] Seq=881 Ack=309 Win=0 Len=0

Frame 845 is the last record in the capture file.

After walking through the capture file, the root cause of the mysterious COMException should be clear now: Some code in WMI .NET layer was connect to remote RPC with a wrong credential, which makes the RPC server (on the remote machine) to drop the TCP connection. While the foreach loop was not aware of this, and kept reusing the dead TCP connection to make requests. When that happens, RPC reports generic error 0x6BE or 0x6BF

Now the next question is that why this issue only happens if the remote machine is running under WinPE environment, not in other full OS version. To answer that, another capture is made for the network traffic with the same codes running against a remote machine under full OS. By analyzing the new capture, it turns out that in full OS scenario, the RCP server behaves differently: It does NOT terminate the TCP connection when wrong credential is used.

With the knowledge obtained so far, do a search on Bing.com about “WMI.NET IManagedobject Current Security Token”, and it would reveal two KB articles from Microsoft (KB967622 and KB974893). They talked about exact the same issue faced here: When WMI.NET was using DCOM wrapper layer provided by .NET framework, the .NET framework is always trying to query IManagedObject interface for any COM object returned. However, .NET framework does not set the security blanket properly before issuing such a request. Without the security blanket being set, the request would use the default security context, i.e. the current security token of that process. This problem is annoying at least, because it would generate bunch of “Failure Audit” events on the remote machine, which causes security concerns, or in this WinPE case, it would actually disrupt the normal code execution.

Note: see [PWS00] chapter 9 for more information about SetSecurityBlanket

The same KB articles provides hotfix from Microsoft to fix this bug for earlier version of .NET. From .NET 4.0 onwards, this fix has already been integrated. After installing the hotfix for .NET 3.5 SP1, the same codes run smoothly against a WinPE machine. By observing the network traffic using package analyzer, it can be confirmed that the remote QueryInterface for IManagedObject requests are not sent over the wire anymore.  

Take Away:

Coding in managed environment has greatly simplified the developer’s life. While enjoying the mighty of modern programming techniques, it is also very helpful to understand the history of Microsoft technologies and the internal working of such technologies. To understand what happens under the hood, it is not only giving the developers the confidence when facing bugs, but also giving out examples how ones could design/implement to solve similar issues. Similarly, to know the path of how technologies evolve, it gives the developers the insight to those technologies’ rising and decline along with paradigm shift in this industry.

Package analyzers are efficient tools not only for traditional RPC based intranet applications, it is useful as well for newly Internet/Cloud based program. Having basic understanding of some popular network protocol, like TCP and HTTP, can greatly help to solve bugs. Therefore, It is well worthy to invest time to learn them. More inter-process communication debugging tips and expert advises can be found at [AWD07] Chapter 8.

References:

1. [PWS00] Programming Windows Security

2. [EC98] Essential COM

3. [MS-WMI] Windows Management Instrumentation Remote Protocol Specification

4. [AWD07] Advanced Windows Debugging

 

Special thanks to Gabriel and Pradeep’s help in solving this bug in the first place

Skip to main content