Troubleshooting intermittent connection timeout

In this blog, we are going to take a deep dive into a connectivity issue my customer encountered a while ago. DBA was pulled into this case because the error was caused by SQL Server connection timeout. We looked at SQL Server log and there was no messages about connection failure. The issue turned out to be an application server resource issue but the investigation was fun and I would like to share with you J

Symptom:

Application intermittently fails to connect to SQL Server. In application log, we saw error messages like below:

To begin with this investigation, we looked at SQL Server Error Log and found nothing that could be related to the connection issue.

Next, we created a new SQL Server Extended Event session and capture connectivity_ring_buffer_recorded event. This event fires when there is login failure:

And this Extended Event trace captured connectivity errors like "PhysicalConnectionIsKilled", "ErrorFoundBeforeLogin", "SessionIsKilled" during the test:

These messages suggested SQL Server had connection error establishing a connection between application server and SQL Server. Since the error happens during handshaking process and SQL Server doesn't savethese information anywhere, we had to use NetMon to capture network trace on both SQL Server and application server.

During the test, we found 2 typical scenarios when connections failed:

Case 1 – Missing TLS message:

As part of handshaking process, SQL Server expects 5 TLS messages. In our case, SQL Server failed to receive all 5 TLS messages within timeout threshold and the connection timed out:

On SQL Server side, we could see the trace like below:

Network trace on application server shows application server didn't send out the last TLS message which caused network failure:

At the same time, application log shows application experienced connection timeout a few seconds later:

Case 2: Incomplete NTLM authentication

Besides missing TLS message, we also found another connection failure caused by lack of responding to NTLM authentication from application server:

Below is trace from SQL Server side, SQL Server sends out NTLM Challenge message and expect a response from application server:

From the trace captured on application server for the same connection, it shows application server received the NTLM challenge message but didn't send out response within the timeout threshold:

Both cases indicate this connection issue was caused by application server failed to complete the handshaking process within timeout threshold. Based on these evidence, we looked at the application server and found the performance counters during the network failures.

Below is CPU utilization on application server which indicates CPU was at 100% for some time during the tests:

Memory counters on application server also suggest memory consumption was at its peak at same time:

Network counters showed same pattern:

Based on what we found, we concluded the intermittent connection issue was caused by resource bottleneck on the application server. The recommendation to development team is double the number of CPUs and RAM to application server, after which, the issue disappeared.