Reliability in SOA is HUGE

A colleague of mine, Dottie Shaw, blogged recently about why Durable Messaging matters.  I agree with everything she says.  I'd go further: of the system quality attributes, the one most endangered by the SOA approach, and therefore the one we need to be most aware of, is reliability.

Reliability takes many forms, but the definition that I work from comes from the IEEE.  IEEE 610.12-1990 defines Reliability as "The ability of a system or component to perform its required functions under stated conditions for a specified period of time."

This becomes a problem in SOA because the basic strength of SOA is the message, and the weakest link is the mechanism used to move the message.  If we create a message but cannot be certain that it gets delivered, then we have created a point of failure that is difficult to overcome.

One friend of mine, Harry Pierson, likes to point out that the normal notion of 'Reliable Messaging' is not sufficient to provide system reliability.  You need more.  You need durable messaging.  Durable messaging is more than reliable messaging, in his lexicon, because durable messages are stored and forwarded.  Therefore, if a system goes down, you can rely on the storage mechanism to keep the message from being lost.  Reliable messages are kept in memory and simply retried until acknowledged, but they are lost if the sending system goes down during the process.
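To make the weakness of memory-only "reliable" messaging concrete, here is a minimal sketch in Python.  The class name, the `transport` callable and the flaky transport helper are all hypothetical, invented for illustration; they are not any real messaging API.  The sender retries until acknowledged, but its pending list lives only in process memory.

```python
class InMemoryReliableSender:
    """Hypothetical sender: retries until acked, but holds messages only in memory."""

    def __init__(self, transport, max_attempts=5):
        self.transport = transport        # assumed callable: returns True on ack
        self.max_attempts = max_attempts
        self.pending = []                 # lost if this process dies

    def send(self, message):
        self.pending.append(message)

    def flush(self):
        """Attempt delivery of every pending message; keep the unacknowledged ones."""
        still_pending = []
        for message in self.pending:
            for _ in range(self.max_attempts):
                if self.transport(message):
                    break                 # acknowledged, stop retrying
            else:
                still_pending.append(message)  # never acknowledged
        self.pending = still_pending


def flaky_transport_factory(fail_first_n):
    """Hypothetical transport that fails a fixed number of calls, then acks."""
    state = {"calls": 0, "delivered": []}

    def transport(message):
        state["calls"] += 1
        if state["calls"] <= fail_first_n:
            return False
        state["delivered"].append(message)
        return True

    return transport, state
```

Retrying covers a flaky network.  It does not cover a crash of the sender: a freshly started `InMemoryReliableSender` begins with an empty `pending` list, so anything that was unacknowledged at the time of the crash is simply gone.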

Of course, Harry and Dottie are not alone in this.  In fact, when discussing reliability these days, web authors have started clubbing the terms together for clarity.  Just search on "reliable durable messages" to get a feel for how pervasive these linguistic gymnastics have become.  Clearly, messages have to be durable in order to improve system reliability.  Discussing one without the other has become passé.

Note that I view durability as an attribute of the message.  I view reliability as a measurable condition of a system, usually measured in Mean Time Between Failures (MTBF).  What becomes clear from this thread is this: in order to increase system reliability, especially in a system based on messages, we need to ensure message delivery, and the best way to do this is through message durability.
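For readers who have not worked with MTBF, the usual calculation is total operating time divided by the number of failures observed in that time.  A tiny sketch, with a hypothetical function name and made-up figures:

```python
def mtbf_hours(total_operating_hours, failure_count):
    """Mean Time Between Failures: operating time divided by failures observed."""
    if failure_count <= 0:
        raise ValueError("MTBF is undefined without at least one observed failure")
    return total_operating_hours / failure_count


# Illustrative numbers only: 1,000 hours of operation with 4 failures.
example = mtbf_hours(1000, 4)
```

In a message-based system, every lost message that causes a required function to fail counts against that number.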

So, we need message durability to get system reliability.  Cool.

Where do we get it from?

Well, durability requires that a message be stored and that a mechanism exist to forward it.  (You heard me right... I just equated 'durability' to store-and-forward.  Prove me wrong.  Find a single durable system that doesn't, essentially, store the message and then forward it.)

By separating storage from forwarding, we get durability.  The message is saved, and the time and place at which it is forwarded are decoupled from the system that sends it.  Of course, the most demanding folks will ask for more than simple durability.  They will ask that messages be delivered once and in order.  Not always needed, but nice when you can get it.
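Here is a minimal store-and-forward sketch, assuming a local SQLite file as the durable store.  The `Outbox` and `DedupingReceiver` classes, the table layout and the `transport` callable are all hypothetical names for illustration, not the mechanism used on any particular project.  The sender commits the message to disk before any delivery attempt, forwards in id order, and stops at the first failure so that order is preserved.  Because a crash between acknowledgement and the "delivered" update causes a resend, the receiver discards ids it has already seen.

```python
import sqlite3


class Outbox:
    """Hypothetical durable outbox: store first, forward later."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " body TEXT NOT NULL,"
            " delivered INTEGER NOT NULL DEFAULT 0)"
        )
        self.db.commit()

    def store(self, body):
        """Persist the message; it now survives a crash of this process."""
        cur = self.db.execute("INSERT INTO outbox (body) VALUES (?)", (body,))
        self.db.commit()
        return cur.lastrowid

    def forward(self, transport):
        """Send undelivered messages in id order; stop at the first failure."""
        rows = self.db.execute(
            "SELECT id, body FROM outbox WHERE delivered = 0 ORDER BY id"
        ).fetchall()
        sent = 0
        for msg_id, body in rows:
            if not transport(msg_id, body):   # assumed: True means acknowledged
                break                         # later messages wait, keeping order
            self.db.execute("UPDATE outbox SET delivered = 1 WHERE id = ?", (msg_id,))
            self.db.commit()
            sent += 1
        return sent


class DedupingReceiver:
    """Hypothetical receiver that ignores ids it has already processed."""

    def __init__(self):
        self.seen = set()
        self.processed = []

    def receive(self, msg_id, body):
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.processed.append(body)
        return True                           # acknowledge, even for duplicates
```

A new `Outbox` opened on the same file after a restart picks up whatever was stored but not yet delivered, which is exactly what the memory-only approach cannot do.  In a real system the receiver's record of seen ids would need to be durable as well; the in-memory set here only keeps the sketch short.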

So, in your SOA, consider this: if you are sending messages from one point to another, and you wish to increase the reliability of your system, you need to find a way to store your message first, and then forward it.

To build a quality system, however, you want to consider more than one System Quality Attribute.  Sure, reliability is important, but if I build a system that is reliable yet brittle, I'd be a poor architect indeed.

We need to consider reliability... and... agility, flexibility, scalability, maintainability, and all the rest.  Just as SOA reliability requires durability, SOA flexibility and SOA agility both require the use of standard transport mechanisms.  SOA scalability and maintainability both require intermediability.  So we need a solution that doesn't sacrifice one for another.

Unfortunately, our platform is lacking here.  To solve this problem, we need a mix of WCF, SSB, BizTalk, and good old-fashioned code.  MSMQ should be able to do this, and it gets kinda close, but it sacrifices ease of operations, so no easy answer there.

On the project I'm on, we are using BizTalk for transactional messages.  For data syndication, we wrote our own mechanism based on SQL Agent and a durable protocol that gives us reliability without sacrificing intermediability and standard protocols.

Now if I could only get that out of the box...