The Case of the Mysterious Exchange Server Hang

I have been working an interesting case for a large enterprise customer with a few other people from my group. We were having a problem where outlook clients were unable to connect to the exchange server, however the exchange server was operating like normal. After looking at the ESE internal structures I could see that the information store was not backed up on any I/O requests. As I started to debug a number of store dumps, I saw two interesting things that stood out. Below is a RPC fault, which is very rare in a store dump.

0:261> kbnL
# ChildEBP RetAddr  Args to Child             
00 61d9fc18 7c827d29 77e61d1e 000089dc 00000000 ntdll!KiFastSystemCallRet
01 61d9fc1c 77e61d1e 000089dc 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
02 61d9fc8c 77cdbedf 000089dc ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac
03 61d9fca8 77cdbf39 61d9fd20 000089dc 00000000 rpcrt4!UTIL_WaitForSyncHTTP2IO+0x18
04 61d9fcd0 77cccb39 67d1f18c 61d9fd20 000089dc rpcrt4!UTIL_GetOverlappedHTTP2ResultEx+0x20
05 61d9fcf8 77ccf2e8 67d1f18c 61d9fd1c 67d1f18c rpcrt4!WaitForSyncSend+0x26
06 61d9fd5c 77cd05ac 00000020 61d9fdd0 00000001 rpcrt4!HTTP2VirtualConnection::SyncSend+0xdd
07 61d9fd7c 77ccc1f6 00000020 61d9fdd0 00000001 rpcrt4!HTTP2ServerVirtualConnection::SyncSend+0x2a
08 61d9fd98 77c689e8 67d1f18c 00000020 61d9fdd0 rpcrt4!HTTP_SyncSend+0x2a
09 61d9fdc0 77ca9db9 61d9fdd0 00000020 03030005 rpcrt4!OSF_SCONNECTION::TransSend+0x74
0a 61d9fe08 77caa3cf 00000005 00000000 00000e1d rpcrt4!OSF_SCONNECTION::SendFault+0xb5
0b 61d9fe30 77c93e4e 00000005 00000000 67c57af8 rpcrt4!OSF_SCALL::CleanupCallAndSendFault+0x46
0c 61d9fe60 77c68b5e 67c57af8 030001f0 0000000a rpcrt4!OSF_SCALL::ProcessReceivedPDU+0x4a6
0d 61d9fe80 77c6e8db 00000000 000001f0 00000000 rpcrt4!OSF_SCALL::BeginRpcCall+0x194
0e 61d9fee0 77c6e7b4 00000000 67c57af8 000001f0 rpcrt4!OSF_SCONNECTION::ProcessReceiveComplete+0x435
0f 61d9fef4 77cae695 00091918 0000001c 00000000 rpcrt4!ProcessConnectionServerReceivedEvent+0x21
10 61d9ff18 77c7b799 00091918 0000001c 00000000 rpcrt4!ProcessComplexTReceive+0x54
11 61d9ff84 77c7b9b5 61d9ffac 77c8872d 00091918 rpcrt4!LOADABLE_TRANSPORT::ProcessIOEvents+0x1b8
12 61d9ff8c 77c8872d 00091918 00000000 00000000 rpcrt4!ProcessIOEventsWrapper+0xd
13 61d9ffac 77c7b110 0008f5f8 61d9ffec 77e6482f rpcrt4!BaseCachedThreadRoutine+0x9d
14 61d9ffb8 77e6482f 6a084fa0 00000000 00000000 rpcrt4!ThreadStartRoutine+0x1b
15 61d9ffec 00000000 77c7b0f5 6a084fa0 00000000 kernel32!BaseThreadStart+0x34

and a ton of threads calling in to “secur32!CallSPM”.

0:329> kbn
# ChildEBP RetAddr  Args to Child             
00 6ccafbe4 7c827899 76f52135 000006f4 6ccafc24 ntdll!KiFastSystemCallRet
01 6ccafbe8 76f52135 000006f4 6ccafc24 6ccafc24 ntdll!NtRequestWaitReplyPort+0xc
02 6ccafc04 76f55173 0009ad70 6ccafc24 6ccafc24 secur32!CallSPM+0x23
03 6ccafd30 76f54eb5 00000000 0009ad70 6ccafdac secur32!SecpAcceptSecurityContext+0x23c
04 6ccafdc0 76f54d63 00000000 6ccafe00 6ccafef0 secur32!LsaAcceptSecurityContext+0x15d
05 6ccafe10 77c8b90e 0010a090 67b31e68 6ccafef0 secur32!AcceptSecurityContext+0xa8
06 6ccafe58 77c8b8b8 00000010 6ccafef0 00000000 rpcrt4!SECURITY_CONTEXT::AcceptThirdLeg+0x42
07 6ccafea0 77c8b1c5 00000010 6ccafef0 00000000 rpcrt4!OSF_SCONNECTION::AcceptThirdLeg+0x61
08 6ccaff04 77c6e7b4 00000000 69ca4c68 000000ce rpcrt4!OSF_SCONNECTION::ProcessReceiveComplete+0x594
09 6ccaff18 77c7b799 00091918 0000000c 00000000 rpcrt4!ProcessConnectionServerReceivedEvent+0x21
0a 6ccaff84 77c7b9b5 6ccaffac 77c8872d 00091918 rpcrt4!LOADABLE_TRANSPORT::ProcessIOEvents+0x1b8
0b 6ccaff8c 77c8872d 00091918 00000000 00000000 rpcrt4!ProcessIOEventsWrapper+0xd
0c 6ccaffac 77c7b110 0008f5f8 6ccaffec 77e6482f rpcrt4!BaseCachedThreadRoutine+0x9d
0d 6ccaffb8 77e6482f 5ac85d28 00000000 00000000 rpcrt4!ThreadStartRoutine+0x1b
0e 6ccaffec 00000000 77c7b0f5 5ac85d28 00000000 kernel32!BaseThreadStart+0x34

Based on the “secur32!CallSPM” call stack, this tells me that we are processing authentication requests. These requests need to go to the domain controller for processing, and could potentially be proxied to another domain controller for processing. Being there could be a latency here I contacted my good friend Tate Calhoun up in the Platforms group so he could take a look at the LSASS dumps for confirmation. Tate is a Platforms EE debugging god, so I know I was in good hands here :) From the LSASS dumps Tate was able to tell me that we have 199 call stacks found containing : "netlogon!NLPUSERVALIDATEHIGHER"

0:061> kbn fff
#   ChildEBP RetAddr  Args to Child             
00  02e3ec98 7c827d29 77e61d1e 000005b4 00000000 ntdll!KiFastSystemCallRet
01  02e3ec9c 77e61d1e 000005b4 00000000 02e3ece0 ntdll!NtWaitForSingleObject+0xc
02  02e3ed0c 77e61c8d 000005b4 0000afc8 00000000 kernel32!WaitForSingleObjectEx+0xac
03  02e3ed20 74266947 000005b4 0000afc8 000e0270 kernel32!WaitForSingleObject+0x12
04  02e3ed3c 74266e4c 000a9f70 0000afc8 000e0270 netlogon!NlAllocateClientApi+0x53
05 02e3edbc 74266cf8 000a9f70 00000000 00000006 netlogon!NlpUserValidateHigher+0x5d
06  02e3ee1c 74267407 000e0270 00000001 00000000 netlogon!NlpUserValidate+0x309
07  02e3eea4 7426728d 00000000 00000000 00000000 netlogon!NlpLogonSamLogon+0x3f5
08  02e3eee0 76c9ce2f 00000000 00000000 00000000 netlogon!NetILogonSamLogon+0x32
09  02e3f720 76c9a815 ffffffff 00000003 03b61f40 msv1_0!LsaApLogonUserEx2+0x1106
0a  02e3fa80 76c95302 00607a20 01b47508 00200a03 msv1_0!SsprHandleAuthenticateMessage+0xcca
0b  02e3fc98 4ab85ebb 00607a20 00b48f80 02e3fe40 msv1_0!SpAcceptLsaModeContext+0x1f8
0c  02e3fd0c 4ab860a8 01b474c0 01b474c8 02e3fe40 lsasrv!WLsaAcceptContext+0x139
0d  02e3fe84 4ab7ae1b 01b47498 00000000 005ed200 lsasrv!LpcAcceptContext+0x13b
0e  02e3fe9c 4ab7ad1e 01b47498 4ac22738 05924808 lsasrv!DispatchAPI+0x46
0f  02e3ff54 4ab7a769 01b47498 02e3ff9c 77e5ba69 lsasrv!LpcHandler+0x1fe
10  02e3ff78 4ab8f465 0062c510 00000000 00000000 lsasrv!SpmPoolThreadBase+0xb9
11  02e3ffb8 77e6482f 006bee48 00000000 00000000 lsasrv!LsapThreadBase+0x91
12  02e3ffec 00000000 4ab8f40e 006bee48 00000000 kernel32!BaseThreadStart+0x34

So, we are looking at authentication requests going to the LSASS process. LSASS has two threads that process authentication requests for the operating system. Here is one of the two allowed to proceed….

0:093> kbn fff
#   ChildEBP RetAddr  Args to Child             
00  0363e618 7c827d29 77e61d1e 00001541 00000001 ntdll!KiFastSystemCallRet
01  0363e61c 77e61d1e 00001541 00000001 0363e660 ntdll!NtWaitForSingleObject+0xc
02  0363e68c 77c6a85f 00001541 00057e40 00000001 kernel32!WaitForSingleObjectEx+0xac
03  0363e6a8 77c69bf7 0167b15c 00000001 00057e40 rpcrt4!UTIL_WaitForSyncIO+0x20
04  0363e6cc 77c59546 0167b124 0167b15c 0363e6fc rpcrt4!UTIL_GetOverlappedResultEx+0x1d
05  0363e700 77c6af6a 00000001 0363e84c 0363e848 rpcrt4!WS_SyncRecv+0xbe
06  0363e728 77c69667 00000000 00ab8f70 000001a0 rpcrt4!OSF_CCONNECTION::TransSendReceive+0xb8
07  0363e7b0 77c695d4 10000000 01688228 00000001 rpcrt4!OSF_CCONNECTION::SendFragment+0x2ae
08  0363e808 77c6977a 00000000 00000160 0363e84c rpcrt4!OSF_CCALL::SendNextFragment+0x1e2
09  0363e850 77c699f2 0363e8d8 0363e894 00000001 rpcrt4!OSF_CCALL::FastSendReceive+0x148
0a  0363e86c 77c69975 0363e8d8 0363e894 0363e8d8 rpcrt4!OSF_CCALL::SendReceiveHelper+0x5b
0b  0363e89c 77c7fcf0 00ab8f88 0363e8bc 77c80673 rpcrt4!OSF_CCALL::SendReceive+0x41
0c  0363e8a8 77c80673 0363e8d8 71c43dd0 0363ecc4 rpcrt4!I_RpcSendReceive+0x24
0d  0363e8bc 77ce315a 0363e904 00ab90e8 00000000 rpcrt4!NdrSendReceive+0x2b
0e  0363eca4 71c4e665 71c43dd0 71c4b2e4 0363ecc4 rpcrt4!NdrClientCall2+0x22e
0f  0363ecbc 71c4e620 01b2ee68 03985a38 000e0300 netapi32!NetrLogonSamLogonEx+0x1c
10  0363ed1c 74266fdc 01b2ee68 03985a38 000e0300 netapi32!I_NetLogonSamLogonEx+0x41
11  0363edbc 74266cf8 000a9f70 00000000 00000006 netlogon!NlpUserValidateHigher+0x2d5
12  0363ee1c 74267407 000e0270 00000001 00000000 netlogon!NlpUserValidate+0x309
13  0363eea4 7426728d 00000000 00000000 00000000 netlogon!NlpLogonSamLogon+0x3f5
14  0363eee0 76c9ce2f 00000000 00000000 00000000 netlogon!NetILogonSamLogon+0x32
15  0363f720 76c9a815 ffffffff 00000003 00b55030 msv1_0!LsaApLogonUserEx2+0x1106
16  0363fa80 76c95302 00607708 01b05188 00200a03 msv1_0!SsprHandleAuthenticateMessage+0xcca
17  0363fc98 4ab85ebb 00607708 017920d0 0363fe40 msv1_0!SpAcceptLsaModeContext+0x1f8
18  0363fd0c 4ab860a8 01b05140 01b05148 0363fe40 lsasrv!WLsaAcceptContext+0x139
19  0363fe84 4ab7ae1b 01b05118 00000000 005ed050 lsasrv!LpcAcceptContext+0x13b
1a  0363fe9c 4ab7ad1e 01b05118 4ac22738 01b420b8 lsasrv!DispatchAPI+0x46
1b  0363ff54 4ab7a769 01b05118 0363ff9c 77e5ba69 lsasrv!LpcHandler+0x1fe
1c  0363ff78 4ab8f465 0062c510 00000000 00000000 lsasrv!SpmPoolThreadBase+0xb9
1d  0363ffb8 77e6482f 006bf2c8 00000000 00000000 lsasrv!LsapThreadBase+0x91
1e  0363ffec 00000000 4ab8f40e 006bf2c8 00000000 kernel32!BaseThreadStart+0x34

Now the question remains, why are we having a backlog. Mike wrote a detailed blog describing these findings and how to fix. Please see Mike Lagase’s blog: https://blogs.technet.com/mikelag/archive/2009/08/04/the-case-of-the-mysterious-exchange-server-hang.aspx

Dave