Try, Try, and Try Again

There are some seemingly simple phrases that trip so easily off the tongue, but end up leaving you tongue-tied. Or, if not physically entangled, then tied in knots both architecturally and programmatically. Our intrepid little band of developers and writers just encountered an interesting example of one of these disarticulating phrases: namely "reliable messaging".

It all seemed so easy at the start. We have an application that needs to send a message to a partner organization instructing that partner to fulfill a specific task, and then - at some time in the future - receive a message back from the partner indicating that the task has been successfully completed. What could be difficult about that? Surely applications do that every day...

So we create some code that sends a message. As our application is running in Windows Azure, we decide to use Service Bus Brokered Messaging, which provides a robust and reliable mechanism that survives application failures and role restarts. The application also listens on a separate Service Bus queue for an acknowledgement message that the partner sends back to confirm that it received the message. If we don't get an acknowledgement within a specified time, we send the message to the partner again. What could go wrong?
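Just to make that concrete, here's roughly what this first, optimistic version might look like. The connection string, queue names, and order identifier are all placeholders invented for the sketch, and it glosses over plenty of detail:

```csharp
using System;
using Microsoft.ServiceBus.Messaging;

class InstructionSender
{
    // Placeholder connection string and queue names - assumptions for this sketch.
    const string ConnectionString = "Endpoint=sb://...";
    const string InstructionQueue = "instructions";
    const string AckQueue = "instruction-acks";

    static void SendAndWaitForAck(string orderId)
    {
        var instructionClient = QueueClient.CreateFromConnectionString(ConnectionString, InstructionQueue);
        var ackClient = QueueClient.CreateFromConnectionString(ConnectionString, AckQueue);

        var acknowledged = false;
        while (!acknowledged)
        {
            // Send the instruction to the partner (a fresh BrokeredMessage each time round the loop).
            instructionClient.Send(new BrokeredMessage("Fulfil order") { MessageId = orderId });

            // Listen on the separate acknowledgement queue; null means we timed out.
            var ack = ackClient.Receive(TimeSpan.FromMinutes(2));
            if (ack != null)
            {
                // A fuller version would check ack.CorrelationId against orderId here.
                ack.Complete();
                acknowledged = true;
            }
            // No acknowledgement within the timeout: loop round and send again.
        }
    }
}
```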

Well, the message is going across the Internet, which everyone tells us is not a fully reliable transport mechanism, and our application is likely to suffer transient loss of connectivity. So we include a retry mechanism for sending the message, namely the Enterprise Library Transient Fault Handling Application Block (also known as "Topaz"). We tell it to make four attempts and then, if the operation still fails, give up and raise an exception. Easy.
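Wrapping the send in Topaz might look something like this. The exact package and namespace layout has shifted between Topaz releases, so treat the using directives as indicative rather than definitive:

```csharp
using System;
using Microsoft.Practices.EnterpriseLibrary.TransientFaultHandling;
using Microsoft.ServiceBus.Messaging;

class RetrySender
{
    static void SendWithRetry(QueueClient client, BrokeredMessage message)
    {
        // Only retry on errors that Topaz classifies as transient for Service Bus.
        // Three retries after the initial attempt (four tries in all) with a short
        // pause between them; after that the exception escapes to the caller.
        var policy = new RetryPolicy<ServiceBusTransientErrorDetectionStrategy>(
            new FixedInterval(3, TimeSpan.FromSeconds(5)));

        policy.ExecuteAction(() => client.Send(message));
    }
}
```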

Ah, but what if the message is sent OK, yet an acknowledgement never comes back? So we build a custom mechanism that maintains the status of the overall process in a local database table in SQL Azure. But reading and updating the database might fail, so we need to use Topaz to retry that operation as well. If we don't get a reply, we can restart the whole sending-and-waiting-for-an-acknowledgement process; though we also need to keep track of how many times we've restarted it, in case there's a blocking fault - a duff certificate that makes Service Bus authentication fail every time, say - or the partner has gone away for good.
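Something along these lines, perhaps, with the ProcessStatus table, its columns, and the helper method all invented for the sake of the sketch. The SQL detection strategy has also been renamed along the way (SqlAzureTransientErrorDetectionStrategy in earlier Topaz releases, SqlDatabaseTransientErrorDetectionStrategy later), so again take the names as indicative:

```csharp
using System;
using System.Data.SqlClient;
using Microsoft.Practices.EnterpriseLibrary.TransientFaultHandling;

class ProcessStatusStore
{
    // Placeholder connection string; the ProcessStatus table and its columns
    // are assumptions for this sketch.
    const string ConnectionString = "Server=tcp:...;Database=...;";

    static readonly RetryPolicy DbPolicy =
        new RetryPolicy<SqlDatabaseTransientErrorDetectionStrategy>(
            new FixedInterval(3, TimeSpan.FromSeconds(2)));

    // Record another send attempt for the given order and return the running total,
    // so the caller can stop restarting when the count says it's never going to work.
    public static int RecordSendAttempt(string orderId)
    {
        return DbPolicy.ExecuteAction(() =>
        {
            using (var connection = new SqlConnection(ConnectionString))
            using (var command = new SqlCommand(
                @"UPDATE ProcessStatus
                  SET Status = 'Sent', SendCount = SendCount + 1
                  WHERE OrderId = @orderId;
                  SELECT SendCount FROM ProcessStatus WHERE OrderId = @orderId;",
                connection))
            {
                command.Parameters.AddWithValue("@orderId", orderId);
                connection.Open();
                return (int)command.ExecuteScalar();
            }
        });
    }
}
```

The caller compares the returned count with some cap and marks the whole process as failed once it's clear that no amount of restarting is going to help.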

Of course, it may be that the partner received the message but failed when attempting to send back the acknowledgement message, so we'll implement the same retry mechanism there. But how will the partner know that their acknowledgement message was delivered and processed by the application? Does it matter? Well, it might if the partner later sends a message to say that the requested task is complete. The application will think it never managed to send the original message because it never got a reply, but here's a message to say the task is complete.

Perhaps the partner should expect the application to send an acknowledgement message to say it received the original acknowledgement sent by the partner, and wait for this before actually starting the instructed task? It could end up like two lovers trying to finish a phone conversation: "You hang up first", "No, you hang up first"... And, of course, because we now have dozens of duplicate messages going both ways, we also need to include code in the partner that prevents it from carrying out the same task again when a duplicate instruction message arrives.
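The partner-side check might look something like the sketch below; a real partner would keep the processed IDs in durable storage rather than in memory. (Service Bus queues can also do some of this work for you: a queue created with RequiresDuplicateDetection set to true drops messages that reuse a MessageId within a configurable time window.)

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ServiceBus.Messaging;

class PartnerReceiver
{
    // In-memory stand-in for a durable store of instructions already handled;
    // a real partner would persist these IDs so duplicates survive a restart.
    static readonly HashSet<string> ProcessedMessageIds = new HashSet<string>();

    static void ReceiveLoop(QueueClient instructionClient, QueueClient ackClient)
    {
        while (true)
        {
            var message = instructionClient.Receive(TimeSpan.FromSeconds(30));
            if (message == null) continue;

            if (ProcessedMessageIds.Add(message.MessageId))
            {
                // First time we've seen this instruction: do the work...
                FulfilOrder(message);
            }
            // ...but whether it was new or a duplicate, acknowledge it so the
            // sender stops retrying, and remove it from the queue.
            ackClient.Send(new BrokeredMessage("ACK") { CorrelationId = message.MessageId });
            message.Complete();
        }
    }

    static void FulfilOrder(BrokeredMessage message) { /* carry out the requested task */ }
}
```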

And then, after we've satisfactorily completed the original send/acknowledge cycle, the partner carries out the requested task and, when complete, sends the confirmation of completion message to the application. Using, of course, a suitable retry mechanism to ensure the message gets sent. And a restart mechanism in case the application fails to send back an acknowledgement of the confirmation message within a specified time. And maybe an acknowledgement of the acknowledgement so the partner can confirm that the application received its acknowledgement...?
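On the application side, the handler for that completion message could be something like the following sketch, where MarkComplete is a hypothetical helper standing in for an idempotent update of the same status table (wrapped in the same Topaz policy as before):

```csharp
using System;
using Microsoft.ServiceBus.Messaging;

class CompletionHandler
{
    static void HandleCompletion(QueueClient completionClient, QueueClient partnerAckClient)
    {
        var message = completionClient.Receive(TimeSpan.FromSeconds(30));
        if (message == null) return;

        // CorrelationId carries the original order identifier, so the right status
        // row can be marked complete. Because the update is idempotent, a duplicate
        // completion message (or a lost acknowledgement that makes the partner
        // resend it) does no harm.
        MarkComplete(message.CorrelationId);

        // Acknowledge so the partner can stop retrying, then remove the message.
        partnerAckClient.Send(new BrokeredMessage("ACK") { CorrelationId = message.CorrelationId });
        message.Complete();
    }

    static void MarkComplete(string orderId)
    {
        // Hypothetical: an idempotent UPDATE of the status table
        // ("SET Status = 'Complete' WHERE OrderId = @orderId"),
        // executed through the Topaz retry policy shown earlier.
    }
}
```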

What we end up with is a cornucopia of database tables, Service Bus queues, and multiple layers of code resembling an onion - retry and restart functionality wrapping more retry and restart functionality wrapping more retry and restart functionality. All to ensure that our message-based communication actually is "reliable messaging".

Wouldn't it be easier just to fix the Internet...?