An orchestration pattern that can hog the master message box CPU

Overview

We had a customer scenario in which a complex business process, executed at a rate of 1 request per second, quickly drove the CPU utilization of the SQL Server hosting the master message box to a sustained 100%.  Sustained high CPU utilization on the master message box is undesirable because it can:

  • cause SQL connection timeouts, which have the side effect of restarting BizTalk host instances
  • cause SQL jobs to fail, which has the side effect of aged data not getting cleaned out of the database
  • degrade performance, since the master message box is responsible for subscription matching and messages take longer to submit into the message box

Background

In this scenario, there were two SQL Servers allocated for BizTalk; one held only the master message box with publication turned off and the other had all of the other BizTalk databases including a secondary message box.  Both SQL Server machines had 8 hyper-threaded 3.0 GHz processors and 8GB of RAM. 

 

The solution consisted of about 4 orchestrations chained via messaging, including about 2 called orchestrations.

 

In this scenario a new business process is created per order, and only one business process can be running at any one time for a particular {customer, order} pair.  Each order can have updates, which can interrupt the business process currently handling that order.  An interruption can only happen at certain points in the business process (i.e. business process atomicity).  So if an update comes in while a business process is handling a request, the update is queued until the business process reaches a point where it can check whether the current instance should terminate and allow the update to start a new business process.

 

To accomplish this in orchestration we used correlations.  At certain points in the orchestrations there would be a Listen shape with a Delay of 0 on one branch and a Receive following a correlation on the other branch.  When the orchestration reaches one of these points and no update is waiting, the business process continues until the next interrupt point, where it checks again.
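
As a rough analogy only (this is Python, not orchestration code, and every name in it is hypothetical), the Listen shape with a zero Delay behaves like a non-blocking poll of a queue of updates.  The sketch assumes an interrupt point after every step, which is simpler than the real design:

```python
import queue


def check_for_update(updates):
    """Return a waiting update, or None if nothing is queued.

    Mirrors the two Listen branches: the Receive branch wins if an update
    is already waiting, otherwise the zero-length Delay branch fires.
    """
    try:
        return updates.get_nowait()
    except queue.Empty:
        return None


def run_business_process(steps, updates):
    """Run steps in order, stopping at the first interrupt point that sees an update."""
    completed = []
    for step in steps:
        completed.append(step)              # atomic unit of work, never interrupted
        update = check_for_update(updates)  # interrupt point after the step
        if update is not None:
            return completed, update        # terminate; the update starts a new process
    return completed, None
```

With an empty queue the process runs every step; with one update queued it stops after the first step and hands the update back to the caller.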

 

In the design of the orchestration there are several .NET remoting calls.  If a remoting call fails then an exception orchestration is “Called”.  Because there are several remoting calls, the exception orchestration is called from several places throughout the main orchestration.  The exception orchestration includes logic to post a request to an operator, who decides whether to terminate the instance or to try again.  Since this is a blocking call waiting for user input, there can be a significant window of time during which the orchestration instance is still running.  In the meantime an update message could arrive, which would invalidate the original request.  To allow this interruption inside the exception handling orchestration, the correlation set was passed in as a parameter to the called orchestration so that it can wait (listen) for either the response from the operator or an update message that interrupts the currently running instance.

BizTalk Behavior

The master message box is responsible for doing subscription matching.  If publication is turned off on the master message box, then it will only do subscription matching and the other message boxes will handle message publication and storage.  The master message box can only be scaled up, not out, so it can eventually become the limiting factor in how far the message boxes can be scaled.

 

The orchestration engine creates the necessary subscriptions when the orchestration is instantiated.  One set of subscriptions it creates is for followed correlations.  The orchestration engine finds all of the points where the correlation is followed and creates subscriptions for them.  If a correlation set is passed to a called orchestration then the engine crawls the called orchestration to create those subscriptions as well.  A called orchestration is essentially in-lined code, which means that the called orchestration looks like it is part of the orchestration that calls it.

 

A subscription, in this case, consists of a message type, an orchestration instance, a port operation, and the properties used in the correlation set.  So the subscriptions for the called orchestrations will actually have the name of the caller orchestration as their orchestration instance name.

 

When you create a port in the orchestration designer, the first operation under the port is named Operation_1 by default.  Unless developers have an explicit reason to change it (for example, when an orchestration is exposed as a web service, the operation becomes part of the method name), they typically leave the default name.

 

Because the same exception orchestration is called many times within the main orchestration, each time with the same correlation set and the same port operation name, identical subscriptions are created in the master message box.  As a general rule of thumb, the master message box can get overwhelmed when trying to match a message to a subscription with more than 20 identical subscriptions (I won’t go into the complexities of what happens when a number of subscriptions are matched for a particular message).
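
To make the duplication concrete, here is a toy Python model; it is not BizTalk's actual subscription store.  The Subscription record, the helper names, and the orchestration and message names used with it are all hypothetical.  The only things taken from the text above are the four parts of a subscription, the default Operation_1 name, and the rule of thumb of 20:

```python
from collections import Counter
from typing import Dict, List, NamedTuple, Tuple

# Rule of thumb quoted in the text above, not a documented BizTalk limit.
IDENTICAL_SUBSCRIPTION_THRESHOLD = 20


class Subscription(NamedTuple):
    message_type: str
    orchestration: str  # the caller's name, because called orchestrations are in-lined
    port_operation: str
    correlation: Tuple[Tuple[str, str], ...]


def subscriptions_for_calls(caller: str, message_type: str,
                            operations: List[str],
                            correlation: Dict[str, str]) -> List[Subscription]:
    """Build one subscription per call site of the called orchestration."""
    props = tuple(sorted(correlation.items()))
    return [Subscription(message_type, caller, op, props) for op in operations]


def largest_identical_group(subs: List[Subscription]) -> int:
    """Size of the biggest group of field-for-field identical subscriptions."""
    return max(Counter(subs).values(), default=0)
```

For example, 25 call sites that all use Operation_1 with the same correlation properties produce a single group of 25 identical records in this model, which is past the threshold above.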

So in this case each request puts a lot of strain on the master message box as it tries to match the request to a particular subscriber, since the message matches the subscription for the activating receive as well as the subscriptions for all of the receives in each call to the exception orchestration.  Messages destined for the receive points in the exception orchestration are not common, but their subscriptions are still evaluated because the subscriptions are all the same.  To alleviate the strain on the matching process we changed the names of the port operations to be unique.  Since the orchestration engine uses the port operation name as the distinguishing property for a subscription, the subscription for the activating receive is now unique and the master message box doesn’t have to waste CPU cycles figuring out which of the other subscriptions it needs to match against.
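
The effect of the rename can be sketched with another toy Python model.  It assumes, purely for illustration, that a message is compared against subscriptions on the port operation name alone; the real matching logic in the message box is more involved, and the names below (Sub, build_subscriptions, candidates, and the operation names used with them) are hypothetical:

```python
from typing import List, NamedTuple


class Sub(NamedTuple):
    receive: str         # which Receive shape owns the subscription
    port_operation: str  # the distinguishing property in this toy model


def build_subscriptions(activation_op: str, exception_ops: List[str]) -> List[Sub]:
    """One activating receive plus one receive per exception-orchestration call."""
    subs = [Sub("activating receive", activation_op)]
    subs += [Sub(f"exception receive {i}", op)
             for i, op in enumerate(exception_ops, start=1)]
    return subs


def candidates(message_operation: str, subs: List[Sub]) -> List[Sub]:
    """Subscriptions the message has to be matched against in this model."""
    return [s for s in subs if s.port_operation == message_operation]
```

With default names everywhere, a request sent to Operation_1 has the activating receive plus every exception receive as candidates.  With a unique name on each port operation, the same request has exactly one candidate.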

 

In the scenario described above, this change dropped the CPU utilization from 100% to about 20%.

Recommendations

Change the port operation name to something unique.  Even if the design doesn’t have a called orchestration with a correlation set passed to it, it is a best practice to change these names in case the orchestrations get repurposed in the future.  Unique names also make it easier to find the subscription for a particular port operation in the orchestration.

 

Lee talks about this scenario in more technical detail in his blog post “Is there a pub/sub system underneath BizTalk?”