Reliable Messaging demystified

Lately I have been hearing a lot of questions about Reliable Messaging in WCF. In fact, I just had an IM chat with Clemens which resulted in this blog entry. I want to expand on his comments with a little more of the insight I gained as a past owner of this feature.

The first thing we should clear up is terminology. Reliable Messaging means a lot of things to a lot of people, so I like to use two separate terms when it comes to WCF features: Reliable Sessions and Queued Messaging.

Reliable Sessions provide the equivalent of “TCP at the SOAP level” and give you exactly-once, in-order message delivery. TCP is reliable like that too (at the packet level), but only hop-to-hop. If all you ever have is a single, un-bridged connection on a super-reliable network, then reliable sessions don’t give you much. However, this is rarely the case inside the corporate network, let alone across the Internet. Reliable sessions overcome failures at the transport level (e.g. a wireless network connection blinks in and out), at the transport intermediary level (e.g. a web proxy drops the request or the response), and at the SOAP intermediary level (e.g. a SOAP router blinks or drops your message due to load issues). Without this feature it is very hard to write “connected” applications that work correctly in the face of communication errors. Reliable sessions also track the liveness of the connection and let you free resources on the server side if the client “goes away” for longer than a given amount of time.

On top of all of that, reliable sessions give you network-adaptive congestion control, which adjusts the rate of sending to the network’s ability to handle the communication load, and end-to-end flow control, which adapts the rate of sending to the server’s ability to handle that load. Both of these features result in better network utilization. Reliable sessions are a wonderful thing for the “connected” case, where both sides are up and running at the same time, and they allow you to use request-reply operations and the like safely.
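To make “exactly-once, in-order” a bit more concrete, here is a minimal sketch of the receiving side of such a session. It is not WCF code and it is not the WS-ReliableMessaging wire protocol; the class and function names (InOrderReceiver, demo_reliable_session) are made up for illustration, and I am assuming each message carries a sequence number starting at 1. The idea is simply that duplicates are dropped, early arrivals are held back until the gap is filled, and the receiver reports how far it has received without gaps so the sender knows what to retransmit.

```python
from typing import Dict, List


class InOrderReceiver:
    """Toy receiver: hands each numbered message to the app once, in order."""

    def __init__(self) -> None:
        self.next_expected = 1              # assumed: numbering starts at 1
        self.pending: Dict[int, str] = {}   # arrived early, waiting for a gap
        self.delivered: List[str] = []      # what the application actually sees

    def receive(self, seq: int, body: str) -> int:
        """Accept one message; return the highest gap-free number seen (the ack)."""
        if seq >= self.next_expected and seq not in self.pending:
            self.pending[seq] = body
        # Anything else is a duplicate and is silently dropped.
        while self.next_expected in self.pending:
            self.delivered.append(self.pending.pop(self.next_expected))
            self.next_expected += 1
        return self.next_expected - 1


def demo_reliable_session() -> List[str]:
    receiver = InOrderReceiver()
    # Simulated flaky network: message 2 arrives late, 1 and 3 are retransmitted.
    for seq, body in [(1, "a"), (3, "c"), (1, "a"), (2, "b"), (3, "c")]:
        receiver.receive(seq, body)
    return receiver.delivered
```

Calling demo_reliable_session() shows the application seeing each message once and in order, even though the simulated network reordered and duplicated them. A real implementation also needs sender-side retransmission timers, plus the flow and congestion control mentioned above, none of which is shown here.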

Queued Messaging is all about having a buffer between the client and the service that decouples them in terms of availability (they don’t have to be up at the same time) and processing capacity (the service only needs to be able to handle the average client load, not the peak). It also enables wonderful things like offline support for the client (i.e. sending messages to the service while it is unreachable or not even running), load-sharing and load-curve-smoothing on the service side, etc. Obviously, when you don’t know when the other side is going to process your message, you can’t assume connectivity and would never block a thread waiting for the service to respond. In WCF terms this means that all your operations would be one-way.
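The decoupling described above can be sketched in a few lines. Again, this is an illustration rather than WCF or MSMQ code: ToyQueue and demo_queued_messaging are hypothetical names, and the queue here lives in memory, whereas a real queue would be durable. The client sends a burst of one-way messages while the service is down, and the service later drains the queue at its own pace.

```python
from collections import deque
from typing import Deque, List


class ToyQueue:
    """In-memory stand-in for a durable queue (a real one persists messages)."""

    def __init__(self) -> None:
        self._messages: Deque[str] = deque()

    def send(self, message: str) -> None:
        """One-way send: returns immediately, no reply is expected."""
        self._messages.append(message)

    def drain(self, limit: int) -> List[str]:
        """Service side: take up to `limit` messages, oldest first."""
        batch: List[str] = []
        while self._messages and len(batch) < limit:
            batch.append(self._messages.popleft())
        return batch


def demo_queued_messaging() -> List[List[str]]:
    queue = ToyQueue()
    # The client sends a burst while the service is not running.
    for i in range(5):
        queue.send(f"order-{i}")
    # The service comes up later and handles 2 messages per cycle.
    cycles: List[List[str]] = []
    while True:
        batch = queue.drain(limit=2)
        if not batch:
            break
        cycles.append(batch)
    return cycles
```

Note that send never waits for the service and returns nothing, which is why the operations have to be one-way. The burst of five messages is spread over three processing cycles, which is the load-curve-smoothing effect in miniature.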

Reliable sessions are implemented using the WS-ReliableMessaging protocol. This protocol is yet another misnamed WS-* protocol, as it actually only deals with the reliability of the transfer and says nothing about durability, delivery acknowledgments, TTL for a message, long-running sessions where a particular message is lost forever, etc. Currently, there is no active work going on to develop an interoperable Queued Messaging specification, but I expect that we will get to it in the near future.

With no interoperable WS-* spec for Queued Messaging, we have implemented this capability on top of the known and trusted infrastructure provided by MSMQ. We did this in a way that gives the programmer all the benefits and capabilities of working in the WCF programming model combined with the durability and transacted capture/delivery features of MSMQ. In addition, and some would say more importantly, it gives the administrator the familiar MSMQ management and administration features. In an enterprise environment one can’t just “stick” a new durable resource into the application model without getting the IT folks to buy off on it. They need to manage this resource, including doing backups, troubleshooting it when something goes wrong, etc. Building this feature on a familiar V3 product like MSMQ (V4 in Windows Vista), with well-documented tools and practices, is a boon for the IT guys.

We do hear what folks are saying though, and we are actively working on a set of features that will allow Queued Messaging implementations to leverage WCF’s WS-ReliableMessaging implementation. With this feature set you’d be able to “bring your own durable store” (BYODS) and have an interoperable Queued Messaging channel that fits nicely and naturally into the WCF architecture. In fact, if we do the job right, then at the programming model level one would see no difference between the MSMQ channel and the BYODS channel (not including configuration and error handling, of course). We also expect to have, at the least, a reference Queued Messaging implementation that uses these features and illustrates the model. All I can say with any level of certainty at this point, though, is that we expect this to be available some time after v1 ships.