Debugging Whidbey Remoting AccessViolation Problem

There has been some cases where users have reported an AccessViolation when upgrading their Remoting app's to Whidbey. Some users found that the problem reprod only when they had some anti virus software (Nod32 in particular) installed and the AV went away when they configured the anti-virus not to scan the problematic exe's. In addition to working on WCF, I am also the dev servicing Remoting for all versions of the framework. We had never seen this problem in our internal testing and since there were enough cases out there to indicate that its not due to one single machine/configuration, we decided to see if there really was a problem.

Reproing the problem.

First we had to repro the problem and since most users said it was reproducible reliably with Nod32 anti virus installed, we installed a trial version of the software and wrote a simple remoting app that just echoes the value sent by the client. The client was able to connect to the server and have its input echoed but when the client shutdown, it caused the server process to AV.

Unhandled Exception: System.AccessViolationException: Attempted to read or write protected memory. This is often an indi
cation that other memory is corrupt.
at System.Net.UnsafeNclNativeMethods.OSSOCK.WSAGetOverlappedResult(SafeCloseSocket socketHandle, IntPtr overlapped, U
Int32& bytesTransferred, Boolean wait, IntPtr ignored)
at System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32 errorCode, UInt32 numBytes, NativeOverl
apped* nativeOverlapped)
at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32 errorCode, UInt32 numBytes, NativeOverla
pped* pOVERLAP)

 

Debugging the problem:

Now that the issue was reproducible, we ran the server process in windbg and ran the repro again. With windbg attached the AV is caught as a first chance exception.

First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=00000000 ebx=001872f0 ecx=011efa74 edx=00000000 esi=012a2498 edi=011efc48
eip=20b0cd29 esp=011ef9cc ebp=011efa00 iopl=0 nv up ei ng nz na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010286
imon!NOD32Ioctl+0xaec9:
20b0cd29 8910 mov dword ptr [eax],edx ds:0023:00000000=????????

So Nod32 tried to save value of EBX register in EAX which means that the system expects the register to hold a pointer reference and not value 0. Just to be sure that the heap isnt corrupted, ran the !verifyheap command.

0:005> !verifyheap
-verify will only produce output if there are errors in the heap

VerifyHeap command reports a clean heap. Lets take a look at the stacktrace and unwind the stack.

0:005> !clrstack
OS Thread Id: 0x89c (5)
ESP EIP
011efa50 20b0cd29 [NDirectMethodFrameStandaloneCleanup: 011efa50] System.Net.UnsafeNclNativeMethods+OSSOCK.WSAGetOverlappedResult(System.Net.SafeCloseSocket, IntPtr, UInt32 ByRef, Boolean, IntPtr)
011efa6c 7a60e1ce System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
011efaa4 793d6ac4 System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
011efc48 79e88f63 [GCFrame: 011efc48]

0:005> kb 100
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
011efa00 00fc0a97 000005b4 012a2498 011efa74 imon!NOD32Ioctl+0xaec9
011efa38 7a60e1ce 012a2084 00000040 012a2498 CLRStub[StubLinkStub]@fc0a97
011efa98 793d6ac4 00000000 00000000 00000000 System_ni!System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)+0xb2 [f:\RTM\ndp\fx\src\Net\System\Net\Sockets\_BaseOverlappedAsyncResult.cs @ 360]
011efab8 79e88f63 00000000 00000000 00000000 mscorlib_ni!System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)+0x68 [f:\RTM\ndp\clr\src\BCL\System\Threading\Overlapped.cs @ 85]
011efacc 79e88ee4 011efc80 00000001 011efc74 mscorwks!CallDescrWorker+0x33 [f:\rtm\ndp\clr\src\vm\i386\asmhelpers.asm @ 936]
011efb4c 79f20212 011efc80 00000001 011efc74 mscorwks!CallDescrWorkerWithHandler+0xa3 [f:\rtm\ndp\clr\src\vm\class.cpp @ 13568]
011efb6c 79f201bc 011efc7c 00000001 011efc74 mscorwks!DispatchCallBody+0x1e [f:\rtm\ndp\clr\src\vm\crossdomaincalls.cpp @ 504]
011efbd0 79f2024b 011efc7c 00000001 011efc74 mscorwks!DispatchCallDebuggerWrapper+0x3d [f:\rtm\ndp\clr\src\vm\crossdomaincalls.cpp @ 528]
011efc04 7a07bebf 011efc7c 00000001 011efc74 mscorwks!DispatchCallNoEH+0x51 [f:\rtm\ndp\clr\src\vm\crossdomaincalls.cpp @ 620]
011efcac 79ecb4a4 012a246c 001872f0 00000001 mscorwks!BindIoCompletionCallBack_Worker+0x123 [f:\rtm\ndp\clr\src\vm\comthreadpool.cpp @ 702]
011efcbc 79ecb442 011efd98 011efd44 79f93fe6 mscorwks!ManagedThreadBase_DispatchInner+0x4d [f:\rtm\ndp\clr\src\vm\threads.cpp @ 16923]
011efd50 79ecb364 011efd98 bdde0c59 001872f0 mscorwks!ManagedThreadBase_DispatchMiddle+0xb0 [f:\rtm\ndp\clr\src\vm\threads.cpp @ 16974]
011efd8c 7a0e1b7e 011efd98 00000001 00000000 mscorwks!ManagedThreadBase_DispatchOuter+0x6d [f:\rtm\ndp\clr\src\vm\threads.cpp @ 17085]
011efdb4 7a0e1bab 00000001 7a07bdfe 011efdec mscorwks!ManagedThreadBase_FullTransitionWithAD+0x25 [f:\rtm\ndp\clr\src\vm\threads.cpp @ 17107]
011efdc8 7a07c031 00000001 7a07bdfe 011efdec mscorwks!ManagedThreadBase::ThreadPool+0x13 [f:\rtm\ndp\clr\src\vm\threads.cpp @ 17149]
011efe1c 7a07c063 00000040 00000000 012a2498 mscorwks!BindIoCompletionCallbackStubEx+0x8c [f:\rtm\ndp\clr\src\vm\comthreadpool.cpp @ 777]
011efe30 79f2f3b0 00000040 00000000 012a2498 mscorwks!BindIoCompletionCallbackStub+0x13 [f:\rtm\ndp\clr\src\vm\comthreadpool.cpp @ 797]
011efe94 79ecb00b 00000000 badb0d00 00000000 mscorwks!ThreadpoolMgr::CompletionPortThreadStart+0x406 [f:\rtm\ndp\clr\src\vm\win32threadpool.cpp @ 4060]
011effb4 7c80b683 00183ff0 7c919a9c 7c800000 mscorwks!Thread::intermediateThreadProc+0x49 [f:\rtm\ndp\clr\src\vm\threads.cpp @ 3001]
011effec 00000000 79ecafc5 00183ff0 00000000 KERNEL32!BaseThreadStart+0x37 [d:\nt\base\win32\client\support.c @ 532]

Looks like Network Class Libraries (NCL) called in to unmanaged code WSAGetOverlappedResult and the call to that AV'd. Since the unmanaged call goes straight to Nod32 it means that its an LSP which sits on top of TCP. To confirm it, ran a query to see the list of installed LSP's on the system.

C:\Program Files\Microsoft Platform SDK\Samples\NetDS\WinSock\LSP>nonifslsp\XP32_RETAIL\instlsp.exe -p

Winsock 32-bit Catalog:
=======================
1019 - NOD32 protected [MSAFD Tcpip [TCP/IP]]
1020 - NOD32 protected [MSAFD Tcpip [UDP/IP]]
1021 - NOD32 protected [MSAFD Tcpip [RAW/IP]]
1022 - NOD32 protected [RSVP UDP Service Provider]
1023 - NOD32 protected [RSVP TCP Service Provider]
1001 - MSAFD Tcpip [TCP/IP]
1002 - MSAFD Tcpip [UDP/IP]
1003 - MSAFD Tcpip [RAW/IP]
1004 - RSVP UDP Service Provider
1005 - RSVP TCP Service Provider
1006 - MSAFD NetBIOS [\Device\NetBT_Tcpip_{EF980A36-42D5-4F46-8C5E-B9D289A5A7AE}] SEQPACKET 0
1007 - MSAFD NetBIOS [\Device\NetBT_Tcpip_{EF980A36-42D5-4F46-8C5E-B9D289A5A7AE}] DATAGRAM 0
1008 - MSAFD NetBIOS [\Device\NetBT_Tcpip_{14C32293-28C3-40B2-A4FF-FD3B9147252F}] SEQPACKET 1
1009 - MSAFD NetBIOS [\Device\NetBT_Tcpip_{14C32293-28C3-40B2-A4FF-FD3B9147252F}] DATAGRAM 1
1010 - MSAFD NetBIOS [\Device\NetBT_Tcpip_{D7376ECB-770A-4B34-8AFB-E553ADEA4B9A}] SEQPACKET 2
1011 - MSAFD NetBIOS [\Device\NetBT_Tcpip_{D7376ECB-770A-4B34-8AFB-E553ADEA4B9A}] DATAGRAM 2
1018 - NOD32

So Nod32 does indeed sit on top of Microsoft's Tcp provider and is routing calls. When the client disconnects then the server receives that socket shutdown and as part of that NCL tries to get the overlapped result on that socket. If you look at the disassembly at the place the AV happened you will find that its all in Nod32 code which is proprietary.

0:005> u 20b0cd29-10 20b0cd29+10
imon!NOD32Ioctl+0xaeb9:
20b0cd19 4c dec esp
20b0cd1a 241c and al,1Ch
20b0cd1c 8b4604 mov eax,dword ptr [esi+4]
20b0cd1f 5f pop edi
20b0cd20 8901 mov dword ptr [ecx],eax
20b0cd22 8b442420 mov eax,dword ptr [esp+20h]
20b0cd26 8b560c mov edx,dword ptr [esi+0Ch]
20b0cd29 8910 mov dword ptr [eax],edx ==> line where exception was thrown.
20b0cd2b 8b542424 mov edx,dword ptr [esp+24h]
20b0cd2f 8b4e08 mov ecx,dword ptr [esi+8]
20b0cd32 33c0 xor eax,eax
20b0cd34 890a mov dword ptr [edx],ecx
20b0cd36 8b4e08 mov ecx,dword ptr [esi+8]

So disassembling that code will not be possible and so lets take a step back and try to debug the next code in the stack before Nod32. Lets take a look at the arguments passed in to the WSAGetOverlappedResult call. The signature is defined in NCL as follows.

[DllImport(WS2_32, SetLastError=true)]
internal static extern bool WSAGetOverlappedResult(
[In] SafeCloseSocket socketHandle,
[In] IntPtr overlapped,
[Out] out uint bytesTransferred,
[In] bool wait,
[In] IntPtr ignored
);

Lets take a look at the invocation of this function in BaseOverlappedAsyncResult.CompletionPortCallback().

bool success = UnsafeNclNativeMethods.OSSOCK.WSAGetOverlappedResult(
socket.SafeHandle,
(IntPtr)nativeOverlapped,
out numBytes,
false,
IntPtr.Zero);

From the list of arguments lets try to rule out potentially problematic arguments. Socket.SafeHandle isnt null or NCL would have NullRef'd in managed code itself. In the debugger it was ensured that the nativeOverlapped structure wasn't null. numBytes is passed by reference and the the value is passed in to this function by  _IOCompletionCallback.PerformIOCompletionCallback. The bool is passed by value and so that leaves the the only other variable IntPtr.Zero that could be the problematic argument. But this part of code hasn't been changed from V1 which means that if this is the issue then it should repro in V1 and V1.1 as well. Since its a Whidbey only issue may be something changed in IntPtr implementation from V1.1 and 2.0. Lets take a look at the struct IntPtr.Zero implementation in Whidbey.

unsafe private void* m_value;

public static readonly IntPtr Zero;

So we have memory allocated for the struct but there is no memory allocated for the the void pointer! This means that if someone tries to reference/dereference the value pointed by this IntPtr struct it will either throw NullRef in managed code or throw AV in unmanaged code. So this explains the AV in this case. So does this mean that the implementation has changed in V2.0 only? Lets take a look at the implementation of IntPtr.Zero in V1.1

public static readonly IntPtr Zero = new IntPtr(0);

Sure enough the implementation has changed. In v1.1 the struct was initialized with a pointer pointing to zero and hence when Nod32 tried to dereference the pointer it works fine. So next question is should Nod32 be dereferencing this pointer? For that lets see what MSDN documents about the WSAGetOverlappedResult  method.

 

 BOOL WSAAPI WSAGetOverlappedResult( 
  SOCKET  s  , 
  LPWSAOVERLAPPED  lpOverlapped  , 
  LPDWORD  lpcbTransfer  , 
  BOOL  fWait  , 
  LPDWORD  lpdwFlags  ); 
Parameters
  • s

    [in] A descriptor identifying the socket. This is the same socket that was specified when the overlapped operation was started by a call to WSARecv, WSARecvFrom, WSASend, WSASendTo, or WSAIoctl.

  • lpOverlapped

    [in] A pointer to a WSAOVERLAPPED structure that was specified when the overlapped operation was started. This parameter must not be a NULL pointer.

  • lpcbTransfer

    [out] A pointer to a 32-bit variable that receives the number of bytes that were actually transferred by a send or receive operation, or by WSAIoctl. This parameter must not be a NULL pointer.

  • fWait

    [in] A flag that specifies whether the function should wait for the pending overlapped operation to complete. If TRUE, the function does not return until the operation has been completed. If FALSE and the operation is still pending, the function returns FALSE and the WSAGetLastError function returns WSA_IO_INCOMPLETE. The fWait parameter may be set to TRUE only if the overlapped operation selected the event-based completion notification.

  • lpdwFlags

    [out] A pointer to a 32-bit variable that will receive one or more flags that supplement the completion status. If the overlapped operation was initiated through WSARecv or WSARecvFrom, this parameter will contain the results value for lpFlags parameter. This parameter must not be a NULL pointer.

Turns out the flag cannot be a null pointer which means that NCL was passing the wrong argument for this. So the fix is to pass a valid pointer to the unmanaged API.

Now for users that want to try this out without having to install Nod32 (because you already have anti-virus installed) then you can use the LSP route. Windows SDK ships a LSP sample that you can install on your box on top of TCP and then run any Remoting app and sure enough you can find odd behaviors. I will leave that as an exercise for readers.

 

How to get the QFE:

We have issued an QFE for this and since it hasnt been through the requirements for a public KB yet this is only available by calling Microsoft product support. Dont worry this should be a free call. To get this QFE call Microsoft Product Support Services (https://support.microsoft.com) and ask for the Hotfix for KB number 923028 or mention that you need the fix for "Visual Studio Update QFE 4333" (Thats the internal bug # of this).

 

Maheshwar Jayaraman [WCF]