Fronting long-running WF Services with MSMQ, the right way

A couple of weeks ago I was working with a customer who engaged me to assist with a "little" problem they had: under certain conditions, the system would lose messages that arrived via an MSMQ queue and were meant to correlate to instances of a long-running WF service hosted in IIS/AppFabric.

The Rule of Thumb

Now, because I don't want to waste anybody's time, before I go into explaining all the details I'll just give you the rule of thumb from my learning: if you want to reliably expose your WF services via MSMQ, always use transactional queues coupled with a TransactedReceiveScope inside the workflow definition.

As long as you remember the above and are comfortable implementing it, you can happily stop reading right here. For the curious amongst us, I'll continue with the full story.

The Scenario

So, the customer had a long-running WF-based service that at some point in the execution was awaiting a correlated message to come in, via an endpoint bound to a non-transactional MSMQ queue.

(By the way, if you are looking for information on enabling MSMQ communication for WF services, I would point you to an excellent step-by-step guide written by Cindy Song and available here.)

Anyway, under normal circumstances everything works OK: the correlated message is consumed off the queue and fed into the right workflow instance, which continues its execution as expected. However, as with any real-life solution, the customer had a few scenarios (not at all uncommon) that pushed the envelope a bit – welcome to the real world! So, let's look at the one that is easiest to explain in a blog post: a workflow instance starts, sends a one-way request to an external application or a human to perform some task, continues to do a bit of work, and then waits for the external one-way "response" to come in (via a non-transactional MSMQ queue) before completing. Conceptually, this is depicted below:

Now, let's look at what happens if one of the "Do something" activities fails. The workflow instance will get suspended and never reach the "Receive TaskCompleted" activity. Unless you've created some special logic on the client to detect such conditions, the following sequence will occur:

  1. The client "Complete Task" message will go to the queue bound to the "Receive TaskCompleted" endpoint
  2. From there, the workflow endpoint using the MSMQ binding will pick up the message (this takes the message off the queue completely and unconditionally)
  3. The message dispatcher, which is the WF runtime component responsible for inspecting incoming messages and matching them to their corresponding WF instances, will try to get to our now suspended WF instance, which of course will fail.

What now? The message has been consumed from the queue, and because it is a non-transactional queue there is no rollback. At the same time, the target WF instance is incapable of receiving the message, and AppFabric itself does not provide a temporary message store to put the message into. So, at that point, the runtime's only choice is to log an exception in the AppFabric monitoring DB and discontinue further actions for this message, as follows:

The bottom line is that in cases like this we are losing the message payload with no ability to handle the message manually or through a different component.

Once you realize what's happening in step 3 above, even if it is not the desired/ideal outcome, at least it is all logical and makes sense. So, let's focus on the solution: transaction-enable both the queue and the corresponding receive activity within the workflow, so that if the WF runtime cannot deliver the message to an instance, the pick-up from the queue is rolled back and delivery is retried later, if so configured.

Using a transactional MSMQ queue

To enable transactions for an MSMQ queue, you will need to create the queue as Transactional. In other words, you cannot change an existing non-transactional queue to a transactional one; you have to delete it and re-create it.

  • To create a transactional queue, use the Computer Management console, right-click on the "Private Queues" node, and select "New->Private Queue" from the context menu:

    Make sure you name the queue after the relative URL of your WF service, including the forward slash in the path. The service in this example is deployed to https://localhost/MsmqWF/Service1.xamlx, so the queue name is MsmqWF/Service1.xamlx. Then select the "Transactional" checkbox.
  • Once the queue is created, adjust its permissions so that the account used by the app pool running your WF service has read/write access to the queue (for simplicity, the screenshot below grants "Full Control" to all Authenticated Users):
  • You can also elect to enable the Journal via the General properties tab, which will provide traceability for your messages:

Configuring the service to receive messages from the transactional queue

The first step is to update the workflow to use the TransactedReceiveScope for all MSMQ-bound Receive activities. Let's assume we have the following super-simple workflow with just two one-way Receive activities: Start, and FollowUp, which is correlated to Start:

We will need to modify the definition to use the TransactedReceiveScope activity as depicted below:

As we can see, the change is as simple as placing the FollowUp activity into the Request placeholder of a TransactedReceiveScope activity.
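For readers who prefer markup to designer screenshots, the wrapped activity might look roughly like the XAML fragment below. This is a minimal sketch and not the original workflow definition: the Body content and display names are hypothetical, and the correlation settings of the FollowUp Receive are left out because they stay exactly as they were before the change.

```xml
<!-- Illustrative fragment of Service1.xamlx; not the original workflow. -->
<p:TransactedReceiveScope
    xmlns="http://schemas.microsoft.com/netfx/2009/xaml/activities"
    xmlns:p="http://schemas.microsoft.com/netfx/2009/xaml/servicemodel">

  <!-- Request: the MSMQ-bound Receive. The message is taken off the queue
       inside the transaction that this scope flows into the workflow. -->
  <p:TransactedReceiveScope.Request>
    <!-- Correlation attributes omitted; keep the ones your Receive already has. -->
    <p:Receive DisplayName="FollowUp" OperationName="FollowUp" />
  </p:TransactedReceiveScope.Request>

  <!-- Body: work that should run under the same transaction
       (hypothetical empty Sequence used as a placeholder). -->
  <p:TransactedReceiveScope.Body>
    <Sequence DisplayName="Handle FollowUp" />
  </p:TransactedReceiveScope.Body>
</p:TransactedReceiveScope>
```

Only the Receive moves into the Request slot; activities that do not need to participate in the receive transaction can remain outside the scope.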

The second step is to create the correct web.config file entries for the service's endpoint(s), along with their binding configuration. For the sample service below, my updated web.config file has the following content:

The second, MSMQ-bound, endpoint is using the netMsmqBinding, with a custom configuration named netMsmqBinding_Config, which specifies the exactlyOnce attribute with a value of true (meaning it should use transactional semantics for communicating with the queue). The custom binding configuration also specifies what retry semantics should be used for "failed" messages via the receiveRetryCount, maxRetryCycles, and retryCycleDelay attributes. The meaning and default values for these attributes and how they affect the MSMQ message delivery retry logic can be found on MSDN here. With the sample web.config above, if the target WF instance cannot receive the request, the message will be placed in the "retry" pool of the queue and a single re-delivery attempt will be made after 1 minute. The receiveErrorHandling attribute with a value of Move means that if all specified delivery attempts fail, the message will be moved to the poison queue where another process or a person can handle the error condition (the topic of poison queues is also covered in the same MSDN article).
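As a rough illustration of the settings described above, the MSMQ-related part of such a web.config could be sketched as follows. Treat it as an assumption-laden example rather than the original file: the service and contract names are hypothetical, other endpoints are omitted, and the retry values are simply ones that would produce a single re-delivery attempt after 1 minute.

```xml
<!-- Sketch only; service/contract names are hypothetical placeholders. -->
<system.serviceModel>
  <services>
    <service name="Service1">
      <!-- Other (non-MSMQ) endpoints omitted. -->
      <!-- The address maps to the private queue named MsmqWF/Service1.xamlx. -->
      <endpoint address="net.msmq://localhost/private/MsmqWF/Service1.xamlx"
                binding="netMsmqBinding"
                bindingConfiguration="netMsmqBinding_Config"
                contract="IService" />
    </service>
  </services>
  <bindings>
    <netMsmqBinding>
      <!-- exactlyOnce="true": transactional receive from the queue.
           receiveRetryCount="0": no immediate retries (assumed value).
           maxRetryCycles="1" and retryCycleDelay="00:01:00": one retry
           cycle, started 1 minute after the failed attempt (assumed values).
           receiveErrorHandling="Move": after the last failure, move the
           message to the poison queue. -->
      <binding name="netMsmqBinding_Config"
               exactlyOnce="true"
               receiveRetryCount="0"
               maxRetryCycles="1"
               retryCycleDelay="00:01:00"
               receiveErrorHandling="Move" />
    </netMsmqBinding>
  </bindings>
</system.serviceModel>
```

If you keep the default receiveRetryCount instead of 0, each cycle makes several immediate attempts before the message is parked for the retry delay, so adjust the three retry attributes together.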

The Conclusion

When using MSMQ to communicate with long-running WF services, the system design should always employ transactional queues for reliable delivery. This will effectively eliminate the potential for message loss in the WF message dispatcher in cases where the correlated target WF instance cannot be found or activated, or is not capable of receiving the incoming message at the given point in time.

Thanks for reading again and happy MSMQ'ing! :)

Authored by: Emil Velinov
Reviewed by: Keith Bauer, Christian Martinez