Since TFS 2010, it has been possible to have multiple Application Tier servers configured in a load-balanced configuration. If you use something like an F5 BIG-IP LTM device, then the default Idle Timeout settings for the TCP Profile can cause problems. (But don’t despair, read the whole post.)
Here’s the scenario:
- Between the TFS ASP.NET Application and SQL Server, there is a maximum execution timeout of 3600 seconds (1 hour)
- In IIS/ASP.NET there is a maximum request timeout of 3600 seconds (it’s no coincidence that it matches)
- This allows TFS operations to run for up to an hour before they get killed off. In reality, you shouldn’t see any TFS operations run for anywhere near this long – but on big, busy servers like the ones inside Microsoft, this was not uncommon.
Load balancers, in their default configuration, usually have an ‘Idle Timeout’ setting of around 5 minutes. The reason for this is that every connection that stays open consumes memory in the load balancer device. A longer timeout means that more memory is consumed, and it’s a potential Denial-of-Service attack vector. (Side note: What’s stopping somebody using TCP Keep-Alives like I describe below to keep a huge number of connections open and have the same DoS effect?)
So why is this a problem if your ‘Idle Timeout’ is set to something less than 3600 seconds? This is what can happen:
- The client makes a request to TFS – for example: “Delete this really large workspace or branch”. That request/connection remains open until the command completes.
- The TFS Application Tier then goes off and calls a SQL Stored Procedure to delete the content.
- If that Stored Procedure takes longer than the ‘Idle Timeout’ value, the load balancer will drop the connection between the client and the application tier.
- The request in IIS/ASP.NET will get abandoned, and the stored procedure will get cancelled.
- The client will get an error message like ‘The underlying connection was closed: A connection that was expected to be kept alive was closed by the server’. Basically, this means that the connection got the rug pulled out from under it.
Prior to Visual Studio & Team Foundation Server 2012, I recommended that people talk to their Network Admin guys and get the load balancer configuration updated to a higher ‘TCP Idle Timeout’ setting. This usually involved lots of back-and-forth with the grumpy admins, and eventually you could convince them to begrudgingly change it, just for TFS, to 3600. If you think that you’re hitting this problem – one way to verify is to try the same command directly against one of your application tier servers, rather than via the load balancer. If it succeeds, then you’ve likely found your culprit.
If you’ve administered web sites/webservers before, you’ve likely heard of HTTP Keep-Alive. Basically, when it’s enabled on the client and the server, the client keeps the TCP connection open after making an HTTP request, and reuses the connection for subsequent HTTP requests. This avoids the overhead of closing and re-establishing a new TCP connection for every request.
That doesn’t help our Idle Timeout problem, since we only make a single HTTP request. It’s that single HTTP request that gets killed halfway through – HTTP Keep-Alives won’t help us here.
Introducing TCP Keep-Alives
There’s a mechanism built-in to the TCP protocol that allows you to send a sort-of “PING” back and forth between the client and the server, but not pollute the HTTP request/response.
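To make the mechanism concrete, here is a minimal sketch of turning on TCP Keep-Alives directly on a socket. It assumes Windows, where the keep-alive timings are set through the Winsock SIO_KEEPALIVE_VALS control code (exposed in .NET as IOControlCode.KeepAliveValues); the class and method names are made up for illustration, and the 50 second / 1 second values are just example timings.

```csharp
using System;
using System.Net.Sockets;

static class TcpKeepAliveSketch
{
    // Packs the Winsock tcp_keepalive structure: three 32-bit values
    // (on/off flag, idle time in ms before the first probe, retry interval in ms).
    public static byte[] BuildKeepAliveValues(bool enabled, uint timeMs, uint intervalMs)
    {
        var values = new byte[12];
        BitConverter.GetBytes(enabled ? 1u : 0u).CopyTo(values, 0);
        BitConverter.GetBytes(timeMs).CopyTo(values, 4);
        BitConverter.GetBytes(intervalMs).CopyTo(values, 8);
        return values;
    }

    // Hypothetical helper: enables keep-alives and sets the timings (Windows only).
    public static void Enable(Socket socket, uint timeMs, uint intervalMs)
    {
        socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, true);
        socket.IOControl(IOControlCode.KeepAliveValues,
                         BuildKeepAliveValues(true, timeMs, intervalMs), null);
    }

    static void Main()
    {
        using (var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp))
        {
            // Example timings: first probe after 50 s idle, retry every 1 s if unacknowledged.
            Enable(socket, 50 * 1000, 1000);
            Console.WriteLine("Keep-alive option: " +
                socket.GetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive));
        }
    }
}
```

The probes are handled entirely by the TCP stack, so the application never sees them in its request or response stream. On other platforms the IOControl call is not expected to work, and the timings would have to be set some other way.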
If you have a .NET client application, this is the little gem that you can use in your code:
webRequest.ServicePoint.SetTcpKeepAlive(true, 50 * 1000, 1000); // Enable TCP Keep-Alives. Send the first Keep-Alive after 50 seconds, then if no response is received in 1 second, send another keep-alive.
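For context, a fuller sketch of where that line might sit in a client. This assumes the .NET Framework HttpWebRequest API; the URL, class name and CreateRequest helper are hypothetical, and the one-hour timeouts are chosen only to match the server-side limit described earlier. Note that HttpWebRequest has its own default timeout of 100 seconds, so a long-running call needs that raised as well.

```csharp
using System;
using System.IO;
using System.Net;

static class KeepAliveClient
{
    // Hypothetical helper: builds a request that tolerates a long-running server operation.
    public static HttpWebRequest CreateRequest(string url)
    {
        var webRequest = (HttpWebRequest)WebRequest.Create(url);

        // Allow up to an hour for the response instead of the 100 second default.
        webRequest.Timeout = 3600 * 1000;
        webRequest.ReadWriteTimeout = 3600 * 1000;

        // First keep-alive after 50 s of idle time; if a probe is not
        // acknowledged, send another one every 1 s.
        webRequest.ServicePoint.SetTcpKeepAlive(true, 50 * 1000, 1000);
        return webRequest;
    }

    static void Main(string[] args)
    {
        // Placeholder URL: point this at an endpoint that takes minutes to respond.
        string url = args.Length > 0 ? args[0] : "http://localhost:8080/sleep?seconds=320";

        HttpWebRequest webRequest = CreateRequest(url);
        using (var response = (HttpWebResponse)webRequest.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            Console.WriteLine("{0}: {1}", (int)response.StatusCode, reader.ReadToEnd());
        }
    }
}
```

The keep-alive settings live on the ServicePoint rather than on the individual request, so they need to be in place before the connection is opened. On newer .NET versions HttpWebRequest is considered legacy and these settings may not be honoured in the same way, so treat this as a .NET Framework sketch.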
In this example NetMon network trace:
- I deployed a web service to Windows Azure, where the load balancer had a TCP Idle Timeout set to 5 minutes (this has changed lately in Azure now that they moved to a software-based load balancer).
- This web service was coded to do a Thread.Sleep(seconds) for however long I told it to, then send a response back.
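The original test service isn’t shown here, but a rough stand-in for local experiments could look like the sketch below. It is not the service I deployed to Azure: it uses HttpListener, and the port, the “seconds” query-string parameter and the class name are all assumptions for illustration.

```csharp
using System;
using System.Net;
using System.Text;
using System.Threading;

static class SleepService
{
    // Parses the "seconds" query-string value; returns 0 when missing, invalid or negative.
    public static int ParseSeconds(string value)
    {
        int seconds;
        return int.TryParse(value, out seconds) && seconds > 0 ? seconds : 0;
    }

    static void Main()
    {
        var listener = new HttpListener();
        listener.Prefixes.Add("http://localhost:8080/"); // assumed port
        listener.Start();
        Console.WriteLine("Try http://localhost:8080/sleep?seconds=320");

        while (true)
        {
            // Handles one request at a time, which is enough for a trace experiment.
            HttpListenerContext context = listener.GetContext();
            int seconds = ParseSeconds(context.Request.QueryString["seconds"]);

            // Simulate a long-running operation with no traffic on the connection.
            Thread.Sleep(TimeSpan.FromSeconds(seconds));

            byte[] body = Encoding.UTF8.GetBytes("Slept for " + seconds + " seconds");
            context.Response.ContentType = "text/plain";
            context.Response.ContentLength64 = body.Length;
            context.Response.OutputStream.Write(body, 0, body.Length);
            context.Response.Close();
        }
    }
}
```

To reproduce the idle timeout you would still need something between the client and this service that drops idle connections, such as a load balancer with a 5 minute limit.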
First of all, you’ll notice that I did this investigation quite some time ago (~2 years…). Next, you’ll see that there’s some other traffic that happens on my connection between the HTTP:Request at frame 179 and the HTTP:Response at frame 307. Those are the TCP Keep-Alive ‘PING’ and ‘ACK’ packets.
Finally, you can see that after 320 seconds have passed (i.e. 20 seconds after the load balancer should’ve closed the connection), I get a valid HTTP:Response back. This means that we have successfully avoided the load balancer killing our connection prematurely.
What’s in it for me?
I did this investigation while I was working on the TFS team, as they were getting ready to launch the Team Foundation Service. Although it was quite rare, there were instances where users could hit this TCP Idle Timeout limitation.
The good news is that by working with the rock star dev on the Version Control team, Philip Kelley – we were able to include a change in the TFS 2010 Forward Compatibility update and the TFS 2012 RTM clients to send TCP Keep-Alives every 30 seconds, thus avoiding the issues altogether when talking to the Team Foundation Service, and on-premises TFS servers deployed behind a load balancer. You can see this for yourself in Microsoft.TeamFoundation.Client.Channels.TfsHttpRequestHelpers.PrepareWebRequest().
webRequest.ServicePoint.SetTcpKeepAlive(true, 30000, 5000);
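If you don’t have a direct connection between your client and your server, and you go via an HTTP proxy server or something like ISA/Forefront Threat Management Gateway, the TCP Keep-Alive packets aren’t propagated through those proxies. TCP Keep-Alives only apply to a single TCP connection, and a proxy terminates the client’s connection and opens a separate one of its own to the TFS server, so the keep-alives never reach the second hop. You’ll get an error back with something like ‘502: Bad Gateway’, which basically means that the connection between the Proxy server and the TFS server was dropped.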
Here’s what the NetMon trace looks like for this example: