Case study: The intermittent SqlException

 

Recently I received a service call about the following SqlException in the event log:

 

Event Type: Warning

Event Source: ASP.NET 2.0.50727.0

Event Category: Web Event

Event ID: 1309

Description:

..

Event code: 3005

Event message: An unhandled exception has occurred.

 

 Exception information:

    Exception type: SqlException

    Exception message: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)

The exception message clearly stated that the web application could not connect to SQL Server, which is usually due to a failure to establish a TCP connection.
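For context, here is a minimal sketch of how this exception typically surfaces in code (the server name "someserver" and the rest of the connection string are placeholders, not the customer's values):

using System.Data.SqlClient;

static void Probe()
{
    // Opening a connection to an unreachable server blocks until the TCP
    // handshake times out, then throws the SqlException quoted above.
    using (SqlConnection conn = new SqlConnection("data source=someserver;initial catalog=master;integrated security=SSPI"))
    {
        conn.Open();
    }
}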

 

The customer said this issue didn't happen all the time; it was intermittent.

 

To get more detail about the root cause than what was recorded in the event log, I asked the customer to keep nmcap.exe (the Network Monitor command-line capture tool) running until the next repro.

 

Nmcap.exe has a /File switch that can split the output into small files. This was useful here, since I would only need to read the file covering the moment of the repro. The event log would still be useful, as it would record the time of the next repro.
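For example, a capture command along these lines writes a rolling chain of files (the interface, filter, file name, and size here are illustrative; check the exact filter syntax against your Network Monitor version):

nmcap /network * /capture "tcp.port == 1433" /file sql.chn:100M

The .chn extension tells nmcap to keep writing new capture files, each up to 100 MB, instead of one ever-growing file, so only the files around the repro time need to be opened.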

 

I also enabled DebugDiag to monitor the SqlException and configured it to capture a full memory dump on that exception.

 

Several days later, I received the nmcap log files. Assuming the SQL Server used the default port 1433, I filtered the log by that port and found the following TCP SYN requests:

 

Time Delta    Source    Destination    Description
0.0000000     Web       SQL            TCP:Flags=......S.,
2.8593750     Web       SQL            TCP:[SynReTransmit]Flags=......S.,
6.0162010     Web       SQL            TCP:[SynReTransmit]Flags=......S.,

 

This was the typical pattern of a TCP handshake that gets no response from the other end. The web client sent a SYN to the SQL Server but received no reply. It retransmitted the SYN packet after about 3 seconds and was met with silence again. The retransmit delay doubled to 6 seconds, still with no luck. After another 12 seconds the handshake timed out, and at that very second a SqlException was thrown and recorded in the event log. DebugDiag caught the exception as well and wrote a full memory dump to disk at the same moment. Perfect!

 

Below is the connection string I found in the dump file:

 

data source=test1;persist security info=True;initial catalog=…
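For reference, one way to locate such a string in a dump with the SOS extension looks roughly like this (the object address is a placeholder; follow the connection-string field shown in the !DumpObj output, as field names vary between framework versions):

!dumpheap -type System.Data.SqlClient.SqlConnection
!DumpObj 0x0a1b2c3d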

 

I was about to ask the customer to check the network connectivity between the web server and the SQL Server test1 when I got a weird feeling about that host name. "test1"? Was that the customer's test SQL Server? Why was it being used in production? It couldn't simply be that the customer had forgotten to update the connection string in web.config when deploying the application to production; otherwise the issue wouldn't have been intermittent.

 

I dumped the .NET assemblies out of the dump file (a sketch of the commands for this step follows the snippet below). ILSpy then showed me how the connection string was read from web.config:

 

public static string getConnectionString(string name)
{
    string str = "T";
    if (!isTesting())
    {
        str = "P";
    }
    return ConfigurationManager.ConnectionStrings[str + name].ConnectionString;
}
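As promised, a sketch of the SOS commands typically used to extract the assemblies (the module base address is a placeholder):

.loadby sos mscorwks     (load SOS for .NET 2.0)
!DumpDomain              (list the assemblies and module base addresses in each AppDomain)
!SaveModule 0x12340000 c:\temp\app.dll     (write the module image at that base address back to disk)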

Both the testing and production connection strings were in the configuration file: the testing one had a 'T' prefix in its key name, and the production one a 'P' prefix. The web application decided at runtime which one to load.
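So the configuration file looked something like this (the key names and the production server name are made up for illustration):

<connectionStrings>
  <add name="TAppDb" connectionString="data source=test1;..." />
  <add name="PAppDb" connectionString="data source=prod1;..." />
</connectionStrings>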

 

Here is the logic used to make the decision.

 

public static bool isTesting()
{
    string a = HttpContext.Current.Request.Url.Host.ToLower();
    return !(a == "www.contoso-production.com");
}

Aha, they decided it by comparing the host name of the request. The hardcoded host name was correct; only the production server had that domain name. So why did it fail?

 

Finally, I loaded the sos.dll debugger extension into WinDbg. It showed me that the request URL below was the one that triggered the exception:

 

https://www.contoso-production.com.:80/default.aspx

The root cause was obvious once I saw the dot between ".com" and ":80". DNS names that end with a dot are actually valid. From RFC 1034 (https://www.ietf.org/rfc/rfc1034.txt):

 

The most common interpretation uses the root "." as either the single origin or as one of the members of the search list, so a multi-label relative name is often one where the trailing dot has been omitted to save typing.

 

Whenever a request arrived with that trailing root dot in the host name, the string comparison in isTesting() failed to match, so the application loaded the testing connection string and tried to reach test1, which was not accessible from production. That also explains why the issue was intermittent: only the occasional request using the trailing-dot form of the URL triggered it. It is a good lesson to keep in mind if your application depends on similar logic to decide between production and testing.
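A minimal sketch of a more defensive check, assuming the same hardcoded host name; it strips the optional trailing root dot before comparing:

public static bool isTesting()
{
    // "www.contoso-production.com." (with a trailing dot) is a valid FQDN,
    // so normalize the host before comparing.
    string host = HttpContext.Current.Request.Url.Host.ToLowerInvariant().TrimEnd('.');
    return host != "www.contoso-production.com";
}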

 

Regards,

Juntao Zhu from APGC DSI Team