Custom Transport Retry Logic



What are the best practices for building retry logic around network transport failures?

Let’s define some terms first so that we have a common language for communication. I’ll say that “retry logic” is any automatically applied compensation activity that replays the same messages to either the same or a different destination. Also, I’ll say that a “network transport failure” is any delivery or communication failure while attempting to transmit an application message or an infrastructure protocol message. With that out of the way, there are two places where you could attempt to handle transport failures.

  • In a layered channel. Use a layered channel when you want the retry logic to be applied to all network calls. A layered channel also allows you to finely control the order of operations by positioning the retry logic channel within the channel stack. Placing the retry logic in a layered channel means that you don’t have to deal with it at each application call site.
  • In the application. Use application code to perform the retry logic when a retry decision involves business logic, business rules, or application state. For example, if you encounter an error while sending the third of four related messages, you may need to manipulate application state to reestablish consistency within the system. Placing the retry logic in application code means that you have to make use of it explicitly at each application call site.
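
To make the contrast concrete, here is a minimal sketch of the two placements. It is written in Python rather than against the WCF object model, and every name in it (CommunicationException, FlakyChannel, RetryingChannel, send_batch) is a hypothetical stand-in invented for illustration. The wrapper plays the role of a layered channel that retries every send; the batch function plays the role of application code that decides for itself what a failure means.

```python
class CommunicationException(Exception):
    """Stand-in for a transport-level communication failure (hypothetical)."""


class FlakyChannel:
    """Hypothetical inner channel: fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.sent = []

    def send(self, message):
        if self.failures > 0:
            self.failures -= 1
            raise CommunicationException("simulated transport failure")
        self.sent.append(message)


class RetryingChannel:
    """Layered placement: wraps an inner channel so that every send is retried."""

    def __init__(self, inner, max_attempts=3):
        self.inner = inner
        self.max_attempts = max_attempts

    def send(self, message):
        for attempt in range(1, self.max_attempts + 1):
            try:
                return self.inner.send(message)
            except (TimeoutError, CommunicationException):
                if attempt == self.max_attempts:
                    raise  # out of attempts: surface the failure to the caller


def send_batch(channel, messages):
    """Application placement: the call site decides what a failure means."""
    delivered = []
    try:
        for message in messages:
            channel.send(message)
            delivered.append(message)
    except (TimeoutError, CommunicationException):
        # Assumed business rule: report what got through so the caller can
        # restore consistency before replaying the remainder.
        return delivered, messages[len(delivered):]
    return delivered, []
```

Callers of RetryingChannel never see the retries, which is the appeal of the layered placement. send_batch has to be invoked and interpreted at each call site, but it can return application-level information (what was delivered, what is still pending) that a generic wrapper has no way to know about.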

Regardless of the location, the retry logic for a network transport failure is going to look fairly similar.

  • You will find out about a transport failure because some network operation threw an exception. The exceptions that you should consider handling will be either a TimeoutException or a CommunicationException, including their subtypes. Your retry logic needs to decide whether the specific exception is recoverable. That decision depends on both the types of network operations that you’re performing and the types of failures that your application is resilient to.
  • Before attempting to retry an operation that uses the same channel, you first need to check that the channel is still usable. If the channel state is anything except for Opened, then you will be unable to send messages using that channel. The only thing that you can do with a Closed or Faulted channel is to throw it away and create a replacement.
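
The two steps above can be sketched as a single retry loop. This is an illustration under stated assumptions, written in Python instead of real WCF code: Channel, CommunicationState, is_recoverable, and send_with_retry are hypothetical names, the exception classes are stand-ins named after their .NET counterparts, and the policy of treating a missing endpoint as unrecoverable is just one example of an application-specific choice.

```python
from enum import Enum


class CommunicationState(Enum):
    OPENED = "Opened"
    FAULTED = "Faulted"
    CLOSED = "Closed"


class CommunicationException(Exception):
    """Stand-in for a transport-level communication failure."""


class EndpointNotFoundException(CommunicationException):
    """Stand-in subtype that this sketch chooses to treat as unrecoverable."""


class Channel:
    """Hypothetical channel that replays a scripted list of failures."""

    def __init__(self, script):
        self.script = script  # shared list, so replacement channels continue it
        self.state = CommunicationState.OPENED

    def send(self, message):
        if self.state is not CommunicationState.OPENED:
            raise CommunicationException("channel is not usable")
        outcome = self.script.pop(0) if self.script else None
        if outcome is not None:
            # In this sketch only communication failures fault the channel;
            # a timeout leaves it Opened.
            if isinstance(outcome, CommunicationException):
                self.state = CommunicationState.FAULTED
            raise outcome
        return f"ack:{message}"


def is_recoverable(error):
    """Assumed policy: retry timeouts and generic communication failures,
    but give up when the endpoint cannot be found."""
    if isinstance(error, EndpointNotFoundException):
        return False
    return isinstance(error, (TimeoutError, CommunicationException))


def send_with_retry(create_channel, message, max_attempts=3):
    channel = create_channel()
    for attempt in range(1, max_attempts + 1):
        try:
            return channel.send(message)
        except (TimeoutError, CommunicationException) as error:
            if attempt == max_attempts or not is_recoverable(error):
                raise
            if channel.state is not CommunicationState.OPENED:
                # A Closed or Faulted channel cannot send again: discard it
                # and create a replacement before the next attempt.
                channel = create_channel()
```

The loop takes a factory instead of a channel because the caller cannot know in advance whether the original channel will survive a failure. A production version would usually also wait between attempts and release the discarded channel's resources; both are left out here to keep the sketch short.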

Next time: A Call to SSPI Failed

Comments (7)

  1. Sam Gentile says:

    Finally! It’s taken me like 5 years to get to 200. I sometimes wish I was prolific like Mike Gunderloy

  2. How can I find out whether my service is running in ASP.NET compatibility mode? Why do you need to detect

  3. Sam Gentile says:

    SOA Nick has his fourth post in a series on the impact of the business operating model on Service Oriented

  4. Q says:

    Well, thanks to my luck, I see this entry just as I was wondering how on earth WCF deals with something as simple as a NetTcp channel experiencing a transport failure, without my having to dive into ping-like and retry/reconnect logic.

    So is it at all advisable for it to be a part of the contract?

  5. Q says:

    Also, looking around I can see various timeout properties available for reliable nettcp config but also notes that there are bugs around. I am no longer confident WCF will work in the following scenario.

    In short, what I am looking for is an explanation of how an endpoint can continuously call back into a client (ideally not a poll but a stream), over whichever NetTcp binding is required. How does WCF do that? Can it detect on either or both ends (via an exception, I presume) that an ‘underlying transport’ (aka TCP connection) has been lost (router, path problem, cable unplugged, etc.)? It can be a failure on any side, for any reason, but I am looking for stream-like behaviour and not RPC that is buffered.

    Is there such a feature at all?

    Thanks in advance.

  6. It’s not possible to have reliable transmission without buffering data.

    If you want reliable transmission, then that’s exactly what the reliable session channel does.  Those are the knobs that you see in the NetTcp binding.  A reliable session reestablishes the connection and replays the data that didn’t get through.  You can use chunking to make the size of the buffers smaller and approach a more stream-like behavior.

    If you want streaming, then that’s exactly what datagram channels like HTTP and UDP do.  You can keep sending messages and while they may or may not get there, the underlying transport gets reestablished whenever the next message is sent.

  7. SOA Nick has his fourth post in a series on the impact of the business operating model on Service Oriented Architecture with SOA in the Replication Model Microsoft My collegue Mickey Williams has posted that Microsoft Search Server (MSS) 2008 and MSS