BizTalk Server 2006 - Enterprise Production Considerations - Part 3 - Clustering

I have had this recurring conversation with customers and partners.  Clustering is not a simple concept, and neither is BizTalk.  If you put them together you seem to have a perfect storm.  There's lots of confusion, which I hope to be able to dispel with this post.

Also, I would like to have a link to send people, just so that I don't have to have the conversation again :)

The Summary

  1. The SQL servers hosting the BizTalk databases can be clustered for high availability.
  2. The Enterprise Single-Sign On Master Secret Service (MSS) can be clustered for high availability.
  3. The BizTalk hosts can be clustered for the following reasons:
    1. They are hosting something (e.g. a Receive Port) that has to be single instance, but still highly available.
    2. They are running on the same nodes as a clustered instance of the MSS.
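
The summary rule for BizTalk hosts can be written as a tiny decision helper. This is only a sketch of the reasoning in this post; `HostProfile`, its fields, and `should_cluster` are made-up names for illustration and are not part of any BizTalk API.

```python
from dataclasses import dataclass


@dataclass
class HostProfile:
    # All fields are hypothetical, used only to express the rule above.
    name: str
    needs_single_instance: bool           # e.g. hosts an FTP receive port
    shares_node_with_clustered_mss: bool  # runs on a node with a clustered MSS


def should_cluster(host: HostProfile) -> bool:
    """Cluster the host if either reason from the summary applies."""
    return host.needs_single_instance or host.shares_node_with_clustered_mss


if __name__ == "__main__":
    for h in (
        HostProfile("FtpReceiveHost", True, False),
        HostProfile("OrchestrationHost", False, False),
    ):
        print(h.name, "-> cluster" if should_cluster(h) else "-> leave as is")
```

A host that matches neither reason is normally left non-clustered, with instances on more than one box.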

FAQ

  1. What parts should I cluster?
    You should cluster the SQL instance(s) holding the BizTalk databases, the ENTSSO Master Secret service, and any BizTalk hosts that require a single instance running at a time but still need to be highly available.
  2. Is a non-clustered BizTalk host still highly available?
    If you have at least two host instances running in the same group for this host, then YES. It behaves more like an NLB cluster than an MSCS service.

The Easy Part

I say that the SQL clustering is the "Easy Part" not because it is simple, but because I don't feel compelled to explain the nitty gritty details in this post. I will give some of the BizTalk specifics here, and dedicate the next post to SQL cluster configuration. For now, as a rule, go with SQL 2005 Active/Active/Passive.

  1. Separate the MessageBox and Tracking Database among separate LUNs and separate SQL instances.
  2. The "Other" databases can often be placed on the same node running the tracking DB. If you are heavy on BAM or BAS you might want to break off the respective databases to a separate instance.
  3. All the databases should have at least 2 LUNs (data and log). Often they have more, but that gets into mount points, and that is a tangent.
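
To make the three placement rules concrete, here is a rough sketch that describes a layout as plain data and checks it against them. The SQL instance names and LUN labels are invented for the example, and the database names are the usual defaults, which may differ in your install. `layout_problems` is a hypothetical helper, not a SQL Server or BizTalk tool.

```python
# Hypothetical layout: instance names and LUN labels are placeholders.
layout = {
    "BizTalkMsgBoxDb": {"instance": "SQLINST1", "data_lun": "E:", "log_lun": "F:"},
    "BizTalkDTADb": {"instance": "SQLINST2", "data_lun": "G:", "log_lun": "H:"},
    "BizTalkMgmtDb": {"instance": "SQLINST2", "data_lun": "I:", "log_lun": "J:"},
}


def layout_problems(layout, msgbox="BizTalkMsgBoxDb", tracking="BizTalkDTADb"):
    """Return a list of rule violations; an empty list means the layout passes."""
    problems = []
    # Rule 3: every database gets separate data and log LUNs.
    for name, db in layout.items():
        if db["data_lun"] == db["log_lun"]:
            problems.append(f"{name}: data and log share a LUN")
    # Rule 1: MessageBox and Tracking on separate instances and separate LUNs.
    mb, tr = layout[msgbox], layout[tracking]
    if mb["instance"] == tr["instance"]:
        problems.append("MessageBox and Tracking share a SQL instance")
    if {mb["data_lun"], mb["log_lun"]} & {tr["data_lun"], tr["log_lun"]}:
        problems.append("MessageBox and Tracking share a LUN")
    return problems


if __name__ == "__main__":
    print(layout_problems(layout) or "layout follows the rules above")
```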

The Somewhat Easy Part

Clustering the ENTSSO MSS is pretty simple and the procedure is posted here. Basically, this is just a running service of the ENTSSO that also takes on the responsibility of distributing the Master Secret Key. This is the key that all the ENTSSO services running on other machines need to read the encrypted information in the SSO Database.

  1. This should be clustered, unless you don't care about high availability. It can really mess up BizTalk if it goes down.
  2. This is not really a resource intensive service (typically) so it often piggy-backs on the SQL server nodes running the clustered instances of SQL Server.
  3. You don't want this clustered on the BizTalk boxes as a rule of thumb. (reasons covered later in this post)

Ok, now the meat.

BizTalk hosts generally have instances on more than one box, so they are already highly available. This means that clustering a BizTalk host is not generally necessary to ensure high availability. If one instance goes down the others will pick up the work, with the out-of-the-box standard config. The problem comes in when you REALLY don't want to have multiple instances of a host running at the same time. The classic example of this is a receive port for an adapter that cannot gracefully handle two threads reading it at the same time (e.g. FTP). It also happens when you need to be certain that all of the messages are picked up and processed sequentially and you don't want to code around the problem.

So you want an FTP receive port, but you cannot afford duplicate message reads.  That means you can have only one instance of the host, but that kills the out-of-the-box high availability.  Now you need to cluster that host instance. That is the only way to ensure that exactly one instance of the host will be running (so long as at least one of the BizTalk app servers is running).
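
The difference between the two modes can be shown with a toy model. This is not how BizTalk or MSCS is implemented; `running_instances` and the node names are assumptions made up for the illustration.

```python
def running_instances(nodes_up, instance_nodes, clustered):
    """Return the nodes where the host is actively running.

    nodes_up       -- set of nodes that are currently alive
    instance_nodes -- nodes with a host instance, in preferred-owner order
    clustered      -- True if the host is a clustered resource
    """
    candidates = [n for n in instance_nodes if n in nodes_up]
    if clustered:
        # A clustered host runs on at most one node at a time.
        return candidates[:1]
    # A non-clustered host runs on every live node that has an instance.
    return candidates


if __name__ == "__main__":
    nodes = ["BTS1", "BTS2"]
    print(running_instances({"BTS1", "BTS2"}, nodes, clustered=False))
    print(running_instances({"BTS1", "BTS2"}, nodes, clustered=True))
    print(running_instances({"BTS2"}, nodes, clustered=True))
```

With both boxes up, the non-clustered host reads from two places at once, which is the duplicate-read risk for FTP. The clustered host always has exactly one active instance as long as any box is alive.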

The Very Weird Part

This is the one that people scratch their heads over. It does not happen much, and it's not really all that complex. But it is not intuitive, so it messes with people's heads. 

For this to really make sense, I have to back up a step and explain how the ENTSSO service generally works in a multi-box BizTalk installation.

  1. All of the BizTalk App servers have an instance of the ENTSSO service running on the machine. It is this service that obtains the secret key from the ENTSSO Master Secret Server and then uses it to independently access the info in the SSO Database.
  2. One of the ENTSSO servers is set up as the Master Secret Server. This means that it will distribute the secret key to the other ENTSSO service instances, in addition to being the standard ENTSSO service for the local box. This is the service instance that needs to be configured for High Availability because all of the other ENTSSO service instances depend on it.
  3. The BizTalk host instances have a dependency on the ENTSSO Service running on the same box as them. This is because they desperately need it to give them info from the SSO DB. This information is typically configuration for their ports that is stored in the SSO DB for safe keeping.

Figure 1 - Standard Multi-Box ENTSSO Strategy

Let's say that you have only 2 production servers. They will be running both SQL and BizTalk.  Like a good boy/girl you decide to cluster SQL and the ENTSSO MSS.

Now you try to configure BizTalk host instances, and you find that they don't want to work quite right. That is because they have a dependency on the ENTSSO service, they always need it.  Now that the ENTSSO service is clustered, only the host instances running on the active (ENTSSO) node actually have access to the service they need. All the other instances will fail based on dependencies.

Figure 2 - BizTalk Hosts with Broken Dependency
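
The broken dependency can be modeled in a few lines. This is a simplified sketch of the behavior described above, not the real service control logic. `startable_hosts` and the node names are hypothetical.

```python
def startable_hosts(nodes, entsso_clustered, active_entsso_node=None):
    """Map each node to whether its local BizTalk host instance can start.

    A host instance needs the ENTSSO service running on its own box.
    When ENTSSO is clustered, it runs only on the active node.
    """
    result = {}
    for node in nodes:
        if entsso_clustered:
            entsso_running_locally = node == active_entsso_node
        else:
            # Non-clustered: every box runs its own ENTSSO service.
            entsso_running_locally = True
        result[node] = entsso_running_locally
    return result


if __name__ == "__main__":
    nodes = ["Node1", "Node2"]
    print(startable_hosts(nodes, entsso_clustered=False))
    print(startable_hosts(nodes, entsso_clustered=True, active_entsso_node="Node1"))
```

In the clustered case only the host instance on the active node can start, which matches the failure shown in Figure 2.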

As an extra bonus you have messed up the high availability of the MSS as well. The clustering service detects that the host instances are depending on the ENTSSO MSS service and probably will not allow it to fail over to other nodes.

...NICE...

Option 1 is to not cluster ENTSSO MSS in a 2 box scenario

Bottom line is that, if you have only budget for two prod boxes, you probably don't want to cluster the ENTSSO MSS. But this is really the only time I can think of that clustering that service is bad.  If it is not clustered you better be a "Johnny on the spot" with MOM and implement a way to have MOM do a poor man's failover when it detects that the ENTSSO MSS server is down.

Option 2 is to also cluster all BizTalk host instances that depend on a clustered ENTSSO service.

If you really do want to cluster the ENTSSO MSS, then you also have to cluster all of the BizTalk hosts as well. This ensures that as the ENTSSO Service "flips" from one box to the next, the Microsoft Clustering Service can ensure that the BizTalk host instances flip at the same time and don't come crashing down.
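
Option 2 can be pictured as resources that move between nodes as a unit. The sketch below is a toy model that assumes the clustered ENTSSO MSS and the clustered BizTalk hosts fail over together. The `ClusterGroup` class and all names in it are invented for illustration and are not the Microsoft Clustering Service API.

```python
class ClusterGroup:
    """Toy model: a set of resources that always live on the same node."""

    def __init__(self, resources, nodes):
        self.resources = list(resources)
        self.nodes = list(nodes)
        self.active = self.nodes[0]  # start on the first node

    def fail_over(self):
        """Move the whole group to the next node in the list."""
        idx = self.nodes.index(self.active)
        self.active = self.nodes[(idx + 1) % len(self.nodes)]

    def placement(self):
        """Return a resource -> node mapping for the current state."""
        return {r: self.active for r in self.resources}


if __name__ == "__main__":
    group = ClusterGroup(["ENTSSO MSS", "BizTalk host"], ["Node1", "Node2"])
    print(group.placement())
    group.fail_over()
    print(group.placement())
```

Because the host instance never ends up on a node without the ENTSSO service, its dependency stays satisfied through the flip.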