Load-balancing MSMQ - a brief discussion

There are several ways of load balancing MSMQ with varying levels of support and efficiency, including:

  1. A hardware load balancer (load-balancing on the wire)
  2. DNS Round Robin (load-balancing at the outgoing queue)
  3. Software-implemented Round Robin (load-balancing within the sending application)

Hardware load balancing 

This KB article covers MSMQ and hardware load balancing very well:

 899611 How Message Queuing can function over Network Load Balancing (NLB)

In short, use NLB only for sending non-transactional messages. Transactional messaging doesn't work because acknowledgements cannot be returned to the sending machine, as its IP address has been masked by that of the NLB device. Additionally, the messages within a single transaction could be split among different balanced destinations, resulting in out-of-order deliveries being discarded.
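The splitting problem can be illustrated with a toy model. This is only a sketch and not MSMQ's actual protocol: it assumes the balancer hands consecutive messages to different nodes in turn, and that each node accepts a transactional message only if it carries the next sequence number that node expects. The function name deliver and its parameters are hypothetical.

```python
def deliver(sequence_numbers, node_count):
    """Toy model: spread ordered messages across nodes in rotation.

    Each node accepts a message only when it carries the next sequence
    number that node expects (starting at 1); anything else is discarded.
    This is an illustrative assumption, not MSMQ's real wire behaviour.
    """
    next_expected = [1] * node_count
    accepted = [[] for _ in range(node_count)]
    discarded = [[] for _ in range(node_count)]
    for index, seq in enumerate(sequence_numbers):
        node = index % node_count  # the balancer picks the next node in turn
        if seq == next_expected[node]:
            accepted[node].append(seq)
            next_expected[node] += 1
        else:
            discarded[node].append(seq)
    return accepted, discarded


if __name__ == "__main__":
    accepted, discarded = deliver([1, 2, 3, 4], node_count=2)
    print("accepted per node: ", accepted)
    print("discarded per node:", discarded)
```

With a single destination every message arrives in sequence and is accepted; with two destinations only the first message gets through in this model, because every later message looks out of order to the node that receives it.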

DNS Round Robin

In this implementation, a network name is created in DNS to represent the group of destination servers. The IP addresses of these servers are associated with this network name, using records with short lifetimes. The idea is that whenever an application needs to connect to one of the remote servers, DNS will provide a different IP address from the one the network name resolved to last time. TechNet has an article discussing how to do this here:

Configuring round robin

This is similar to hardware load balancing and so shares some of its problems.

MSMQ maintains network connections as long as it thinks it needs them (and for a little while afterwards just in case). This behaviour is controlled by the CleanupInterval registry value; by default you can expect a sending machine to wait for 2 or 5 minutes (depending on the installation type) of inactivity before the network connection is dropped. Every time a message is sent within this period, the inactivity timer is reset. You can see that a busy server that operates 24x7 may never voluntarily close the network connection. The registry value is in milliseconds so it is possible to force MSMQ to drop the connection almost immediately after a message is sent. This isn't recommended, though, as the next message has to wait for a new network connection to be created; although this doesn't take a long time, the accumulated delay would become significant with a very high message volume. 
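The effect of the inactivity timer on DNS Round Robin can be sketched with a small simulation. It is a simplified model under stated assumptions: the connection is dropped once the idle time reaches the cleanup interval, and every new connection resolves to the next server in the rotation. The function assign_servers and the server labels are hypothetical and only for illustration.

```python
from itertools import cycle


def assign_servers(send_times_ms, cleanup_interval_ms, servers):
    """Return the server each message goes to in a simplified model.

    A new connection (and so a new DNS lookup) happens only for the first
    message or when the idle gap since the previous send has reached
    cleanup_interval_ms. Each new lookup is assumed to yield the next server.
    """
    rotation = cycle(servers)
    current = None
    last_send = None
    assignments = []
    for t in send_times_ms:
        if last_send is None or t - last_send >= cleanup_interval_ms:
            current = next(rotation)  # reconnect: pick up the next address
        assignments.append(current)
        last_send = t  # every send resets the inactivity timer
    return assignments


if __name__ == "__main__":
    sends = list(range(0, 600_000, 60_000))  # one message per minute
    print(assign_servers(sends, 300_000, ["A", "B"]))  # 5-minute interval
    print(assign_servers(sends, 1_000, ["A", "B"]))    # 1-second interval
```

In this model, a sender that sends once a minute with a five-minute interval keeps using the first server indefinitely, while a one-second interval spreads messages across both servers at the cost of a reconnect for every message.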

Note - DNS Round Robin doesn't work for sending MSMQ messages over the HTTP protocol. This is because you would need to disable KeepAlives on the web server to allow the connections to drop, but MSMQ was designed to work with KeepAlives enabled. With KeepAlives disabled, message delivery is delayed to such an extent as to become unusable.

DNS Round Robin differs from hardware load balancing in that the network connection is between sender and destination, instead of sender and hardware NLB device. This allows transactional acknowledgements to be returned to the correct sender, but that does not automatically mean transactional messages can be sent successfully. You still have the problems discussed in the KB article: potential message duplication, and a transaction of multiple messages being split across several machines. Admittedly they may occur less frequently than with NLB because of the direct connection, but that is not good enough to guarantee once-only and in-order delivery.

Software-implemented Round Robin

In this scenario, the sending application has a list of destination servers that it rotates through when sending messages. This results in multiple outgoing queues, one per destination queue. This avoids all the NLB issues with sending transactional messages, as they can never be delivered to any machine other than the one initially selected from the list.
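A minimal sketch of this rotation follows. It does not call MSMQ; the RoundRobinSender class, the in-memory outgoing lists that stand in for outgoing queues, and the server names in the format names are all hypothetical. The point it shows is that a destination is chosen once per transaction, so all messages in that transaction go to the same machine.

```python
from itertools import cycle


class RoundRobinSender:
    """Rotate through destinations; lists stand in for MSMQ outgoing queues."""

    def __init__(self, destinations):
        if not destinations:
            raise ValueError("at least one destination is required")
        self._rotation = cycle(destinations)
        # One outgoing queue per destination queue.
        self.outgoing = {dest: [] for dest in destinations}

    def send(self, message):
        """Send a single message to the next destination in the rotation."""
        dest = next(self._rotation)
        self.outgoing[dest].append(message)
        return dest

    def send_transaction(self, messages):
        """Send all messages of one transaction to a single destination."""
        dest = next(self._rotation)  # chosen once for the whole transaction
        self.outgoing[dest].extend(messages)
        return dest


if __name__ == "__main__":
    # Hypothetical direct format names for three destination servers.
    sender = RoundRobinSender([
        r"DIRECT=OS:server1\private$\orders",
        r"DIRECT=OS:server2\private$\orders",
        r"DIRECT=OS:server3\private$\orders",
    ])
    print(sender.send_transaction(["order-1a", "order-1b"]))
    print(sender.send_transaction(["order-2a", "order-2b"]))
```

A real sender would also need to decide what to do when a destination is unavailable, for example skipping it in the rotation; that is left out of this sketch.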