TCP Offloading again?!

I have spent probably hundreds of hours on cases involving TCP Offloading and I know most of the signs (intermittent dropped connections, missing traffic in network traces).  However, I have to admit I got burned by it the other day and spent several more hours working an issue than I should have.

I was working on a server-down case for a financial trading company (in other words, large dollars involved every minute they were down) where the customer was experiencing slow connections to SQL Server.  The customer reported only some Linux ODBC clients were impacted.  Based on that description, we started looking at the client side.  However, we soon discovered that, while there was no detectable correlation between the clients, the problem was only visible going to a specific SQL Server instance.  The affected clients had no problem communicating with other instances of SQL Server.  Based on this, we started focusing on the SQL Server machine itself.

From the client application’s perspective, every query was taking roughly five seconds longer than expected.  Therefore, we collected a PSSDiag and looked at the performance of the SQL Server machine as a whole.  The Profiler traces showed that there was no delay inside SQL Server:

image

So, where were the five seconds coming from?

The next step was to look at a network trace:

image

Check out the two sets of timestamps circled.  Both of them had a five-second delta!  Now we had physical proof of the problem, but we still didn’t have a reason…

Then, I noticed something that turned out to be the key: the five-second delay was always between the data sent from the client and the server’s response to that data.  That clinched it: this was 100% a server-side issue.  I couldn’t explain yet why only some clients were impacted, but the delay was definitely being introduced on the server.  The other interesting thing to notice above is that the delay is even visible on the login!  This was completely surprising because this customer was using SQL Authentication.  That is a highly optimized query which should never have performance issues.  This, combined with the fact that the subsequent query wasn’t showing up inside SQL Server as being delayed, caused me to start thinking about things outside of SQL Server.
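To make the "client data, then a five-second wait, then server response" pattern concrete, here is a minimal Python sketch of the same check done by eye in the trace.  It is only an illustration: the Packet type, the "client"/"server" labels, the one-second threshold and the sample timestamps are all made up for this example and are not taken from the customer's capture.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    timestamp: float  # seconds since the start of the capture (hypothetical)
    sender: str       # "client" or "server" (hypothetical labels)

def response_gaps(packets):
    """Delay between each client packet and the next server packet.

    If the client sends several packets in a row, the gap is measured
    from the last one before the server answers.
    """
    gaps = []
    last_client_ts = None
    for p in packets:
        if p.sender == "client":
            last_client_ts = p.timestamp
        elif p.sender == "server" and last_client_ts is not None:
            gaps.append(p.timestamp - last_client_ts)
            last_client_ts = None
    return gaps

def slow_gaps(packets, threshold=1.0):
    """Gaps at or above the threshold (in seconds)."""
    return [g for g in response_gaps(packets) if g >= threshold]

# Made-up trace: login and first query are delayed, the last exchange is not.
trace = [
    Packet(0.000, "client"),   # login request
    Packet(5.002, "server"),   # login response, about five seconds late
    Packet(5.010, "client"),   # query
    Packet(10.013, "server"),  # result, about five seconds late
    Packet(10.020, "client"),
    Packet(10.021, "server"),  # normal, near-instant response
]

for gap in slow_gaps(trace):
    print(f"server took {gap:.3f}s to respond")
```

If every slow gap sits on the client-to-server-response side, as it did here, the delay is being added on the server or somewhere in its network stack rather than on the client.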

The next thing to check was for filter drivers that might have inserted themselves in the TCP stack – antivirus, firewall, NIC teaming, etc.  Unfortunately, nothing like this was installed so there were no clues there.  We also reconfirmed that TCP Chimney was turned off at the OS level. And then it hit me…NIC level TCP Offloading!!!

We pulled up the Ethernet adapter settings (Network Connections –> LAN XXX –> Properties –> Configure –> Advanced) and saw something that looked like this:

image

Lo and behold – TCP Offloading was enabled!

We disabled all of the Offloading settings, clicked OK and performance was back to normal.  Connections were fast and query results were returning right after SQL Server generated them.  I should point out that we didn’t take a stepwise approach here because this customer was losing large amounts of money every minute this system was down.  In a less critical issue, it would be worth doing each setting one at a time and testing in between.  In addition, I would also recommend that you go back after the fact and test enabling each setting to see if there is a negative impact.  There are some non-trivial performance benefits to be gained from these settings if everything is working properly.

We never did figure out why only some clients were impacted since all of the clients were using the same driver.  Nor were we able to figure out why only this SQL Server instance was impacted when several other SQL Server machines were configured the same way at the driver level.

The moral of the story?  I need to update my standard steps for capturing network traces to include NIC level TCP Offload settings!

As of this morning, my first four steps for capturing a network trace now look like this:

1a. Turn off TCP Chimney if any of the machines are Windows 2003
Option 1) Bring up a command prompt and execute the following:
Netsh int ip set chimney DISABLED
Option 2) Apply the Scalable Networking Patch -
https://support.microsoft.com/default.aspx?scid=kb;EN-US;936594

1b. Confirm that TCP Chimney is turned off if any of the machines are Windows 2008 (see https://support.microsoft.com/default.aspx/kb/951037 for more details)
a) bring up a command prompt and execute the following:
netsh int tcp show global
b) if it turns out TCP Chimney is on, disable it
netsh int tcp set global chimney=disabled

2. Turn off TCP Offloading/Receive Side Scaling/TCP Large Send Offload at the NIC driver level

3. Retry your application. Don't laugh - many, many problems are resolved by the above changes.
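
If you have to run the step 1b check on a lot of machines, it can be scripted.  The following is a minimal Python sketch, not a supported tool.  It assumes the output of netsh int tcp show global is a list of "Name : value" lines with a line named "Chimney Offload State" (that is the English output format as I remember it; localized or newer builds may differ), and the function names are invented for this example.

```python
import subprocess

def parse_tcp_global(output):
    """Turn 'Name : value' lines into a dict; other lines are ignored."""
    settings = {}
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip():
            settings[name.strip()] = value.strip()
    return settings

def chimney_state(output):
    """Return the reported chimney state, or None if the line is missing.

    Assumes the setting is labeled 'Chimney Offload State'.
    """
    return parse_tcp_global(output).get("Chimney Offload State")

def needs_disabling(output):
    """True when a chimney state is reported and it is not 'disabled'."""
    state = chimney_state(output)
    return state is not None and state.lower() != "disabled"

def read_live_settings():
    """Run the step 1b command on the local machine (Windows only)."""
    result = subprocess.run(
        ["netsh", "int", "tcp", "show", "global"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Assumed sample output, used here so the sketch runs anywhere.
SAMPLE = """Querying active state...

TCP Global Parameters
----------------------------------------------
Receive-Side Scaling State          : enabled
Chimney Offload State               : enabled
"""

if needs_disabling(SAMPLE):
    print("TCP Chimney is on; run: netsh int tcp set global chimney=disabled")
```

On a real Windows 2008 machine you would pass read_live_settings() instead of SAMPLE.  Note that this only covers the OS-level setting; the NIC driver settings in step 2 still have to be checked in the adapter's Advanced properties.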

Evan Basalik | Senior Support Escalation Engineer | Microsoft SQL Server Escalation Services