TCP Keep Alive

How do I detect when the other side of a TCP connection has gone away? Does TCP keep-alive take care of this for me?

Although we take it for granted that change can be quickly detected for closely connected components, it turns out to be surprisingly difficult to detect change when two machines are isolated by more than a simple wire. Even a really big change to the system, like one of the machines disappearing, is hard to spot.

Detecting that the other side has disappeared is a common request because, on a server, knowing that the client has dropped the connection allows you to clean up resources much faster. The TCP transport sometimes tells you quickly that the connection has been dropped by aborting the session. However, there’s no guarantee that the transport will be able to detect that the other side has gone away. That’s because the notification of a TCP connection reset has to travel just like any other piece of data and can be lost or redirected along the way, if it was sent at all. The only sure thing is that the next time you attempt an IO operation, you’ll find out whether the channel is still good.
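To make that last point concrete, here is a minimal sketch at the raw socket level rather than in WCF. The helper name `probe_connection` and its three result strings are made up for illustration. It attempts a read and reports what the attempt revealed; a timeout only means that no notification has arrived yet, not that the peer is healthy.

```python
import socket


def probe_connection(conn, timeout=1.0):
    """Attempt a read and report what the IO operation revealed.

    Hypothetical helper: returns "closed", "failed", or "no news".
    """
    conn.settimeout(timeout)
    try:
        # MSG_PEEK looks at pending data without consuming it.
        data = conn.recv(1, socket.MSG_PEEK)
    except socket.timeout:
        # Nothing arrived in time. The peer may be fine, or it may be gone
        # and the notification was lost or never sent.
        return "no news"
    except OSError:
        # For example, a connection reset reached us.
        return "failed"
    # An empty read means the peer closed its side in an orderly way.
    return "closed" if data == b"" else "no news"
```

Note that the "no news" result is exactly the uncertainty described above: without an IO operation that has to complete, silence tells you nothing.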

If you’re unhappy waiting for the next IO operation, then you can make IO operations happen more often. The basic concept is to have a cheap IO operation that does nothing but bounce between the two parties. This is sometimes called a heartbeat, and it is exactly what TCP keep-alive does. However, the standard keep-alive interval for TCP is 120 minutes, which is probably worse than your current latency for detecting change. By default, a service gets bored waiting on an idle connection after about 10 minutes and gives up. The chance of a keep-alive probe happening between the time that a client disconnects and the time the service notices on its own is pretty small.
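For a sense of what changing the keep-alive interval involves, here is a sketch for a socket that you own yourself (the WCF TCP transport does not hand you its socket). The function name `enable_keepalive` and the default numbers are illustrative assumptions. The three tuning options are platform-specific, so the sketch only sets the ones the local Python build defines.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keep-alive and, where supported, shorten its timing.

    Hypothetical helper; the default values are examples, not recommendations.
    """
    # SO_KEEPALIVE is the portable on/off switch.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket tuning knobs are not defined on every platform, so guard each.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between unanswered probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes allowed before the connection is declared dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```

With these example values, a silent peer would be declared dead after roughly `idle_s + interval_s * probes` seconds on platforms that honor all three options, at the cost of extra probe traffic on every idle connection.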

If you want something faster but don’t want to change the timeouts, then you can take the basic concept into your own hands. One approach is to create a keep-alive method on your service contract that does nothing but let you trigger IO operations at whatever frequency you desire. Another approach, if you control both ends and don’t want to change your service contract, is to do the same thing in a protocol channel and swallow those messages so that the service never has to see them.
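The do-it-yourself heartbeat can be sketched with plain sockets. This is not WCF code and not a protocol channel; the `PING`/`PONG` messages and all function names are invented for illustration. One side answers every ping, and the other side forces a round trip whenever it wants to know if the connection is still good.

```python
import socket

PING = b"PING\n"  # hypothetical heartbeat request
PONG = b"PONG\n"  # hypothetical heartbeat reply


def recv_exact(sock, count):
    """Read exactly count bytes, or fewer if the peer closes first."""
    received = b""
    while len(received) < count:
        chunk = sock.recv(count - len(received))
        if not chunk:  # empty read: the peer closed the connection
            break
        received += chunk
    return received


def answer_heartbeats(conn):
    """Service side: reply to every PING until the client goes away."""
    while recv_exact(conn, len(PING)) == PING:
        conn.sendall(PONG)


def peer_is_alive(sock, timeout=1.0):
    """Client side: force a round trip and report whether it completed in time."""
    sock.settimeout(timeout)
    try:
        sock.sendall(PING)
        return recv_exact(sock, len(PONG)) == PONG
    except OSError:  # covers timeouts, resets, and broken pipes
        return False
```

Calling `peer_is_alive` on a timer gives you a detection latency of about one timer period plus the timeout, which you pick yourself instead of inheriting the transport’s defaults.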

Next time: Collections without CollectionDataContract

Comments (9)

  1. Yair Zadik says:

    Is there any easy way to get to the underlying TCP stream and change the keep-alive to something shorter like 1 minute, assuming you’re willing to pay for the extra network traffic and processing time?

    Is this something that could be done in a custom binding?

  2. Sean says:

    Shy Cohen at MSFT suggested using a reliable session:

    Are there problems with this solution?


  3. Hi Yair,

    Controlling the keep-alive is a socket option and there’s no way to get at the socket held by the TCP transport.

  4. Hi Sean,

    There are a couple of ways to get keep-alives with just the components that ship in the box, but you generally have to pay something in return for the convenience. Reliable messaging is a very heavy protocol if you just want a heartbeat, so you’ll pay in terms of latency and throughput for the extra features you’re not using.

  5. Interestingly enough, I implemented the heartbeat model for a project based on WCF. We were tasked with creating a seamless offline system. Luckily, in my scenario, we have a series of services, so creating a new service with the required heartbeat contract was very simple. A background manager is used to handle subscriptions and send out callbacks. That said, there seem to be many corner cases to watch out for when implementing this method.

    Maybe there is a better way to implement this model, but the code we have written works just fine.

  7. Sean says:

    Thanks, Nicholas, for the quick reply!

    You mentioned there are a couple of ways to get keep-alives with just the components that ship in the box. It seems to me that in a duplex connection, the channel faults when the other side is down or throws an exception. Is this an alternative way to get keep-alive?


  8. Hi Sean,

    Any configuration that keeps a continuously pending IO will allow you to detect a disconnection.  An ordinary duplex connection won’t do this but if you have a callback contract so that there is a local listener, then you’ll be able to detect when the connection is dropped.