Click, Boom...

MSMQ is a very robust product and people expect it to look after data no matter what. To test that it does, customers sometimes perform scary tests like hitting the power switch while the server is running. A power cut is a real-world event, and customers should not expect to lose messages (which usually also means losing money) when one occurs. Surprisingly, messages DO disappear, and customers rise up to condemn MSMQ's transactional robustness.

But you need to dig deeper to find out the real cause of the problem. MSMQ handles transactions in the expected way - either the message is safely written to the storage files or it is rejected back to the sender. There is no grey area for lost messages (assuming you are using Transactional Dead Letter Queues, etc. for reliable messaging).

The trick is that the hardware may be using smoke and mirrors to improve performance. Modern drives have a few megabytes of RAM to cache disk reads and writes. Caching reads means that data can be recalled more quickly than by waiting for the head to reach the corresponding blocks. Similarly, caching writes reduces head movement, as updates that fall in the same sector can be written all at once instead of separately. The end result is a drive that is faster than the spin speed alone would allow - this is good for the vendor, as they can extend the life of their technology, and also good for the customer because of the improved performance. The downside is that the write cache is transient and cannot survive a power cut.

For MSMQ the problem is detailed as follows:

  1. The Message Queue Access Control (MQAC) service passes message data to the I/O drivers
  2. The drivers send the data to the disk controller
  3. The disk controller returns a "Success" status but caches the disk writes
  4. The drivers return a "Success" status to MQAC
  5. MQAC informs the Queue Manager that the transactional message committed successfully
  6. Power is lost along with the cache contents
  7. When the MSMQ service comes back on-line, none of the storage files have been updated and it is as if the message never existed
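
The steps above can be played out in a toy model. This is only an illustrative sketch and not MSMQ or driver code: the CachingDisk class, its methods and the "msg-1" block name are all hypothetical, invented here to show why an acknowledged write can still vanish.

```python
class CachingDisk:
    """Toy model of a drive with a volatile write cache (hypothetical)."""

    def __init__(self, write_cache_enabled=True):
        self.write_cache_enabled = write_cache_enabled
        self.platters = {}  # durable storage: survives a power cut
        self.cache = {}     # RAM on the drive: lost on a power cut

    def write(self, block, data):
        """Accept a write and report success (step 3)."""
        if self.write_cache_enabled:
            self.cache[block] = data     # acknowledged, but not yet durable
        else:
            self.platters[block] = data  # durable before acknowledging
        return "Success"

    def power_cut(self):
        """Lose everything that only lived in the cache (step 6)."""
        self.cache.clear()

    def read(self, block):
        """Return the cached copy if present, else whatever is on the platters."""
        if block in self.cache:
            return self.cache[block]
        return self.platters.get(block)


def send_then_power_cut(disk):
    """Write one message, cut the power, then see what is left (steps 1-7)."""
    status = disk.write("msg-1", b"transfer 100")
    disk.power_cut()
    return status, disk.read("msg-1")


cached = send_then_power_cut(CachingDisk(write_cache_enabled=True))
uncached = send_then_power_cut(CachingDisk(write_cache_enabled=False))
print(cached)    # ('Success', None)
print(uncached)  # ('Success', b'transfer 100')
```

In both runs the sender is told "Success"; only with the write cache disabled is the message still there after the power cut.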

The obvious workaround is to disable the hard disk write cache (if you can), but is the risk of losing one message through a power cut (or catastrophic hardware failure) high enough to justify the performance hit? Your call...
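
For comparison, here is roughly how an application of your own can ask for a single write to be pushed out to the device, rather than turning the cache off for everything. This is a minimal sketch, not how MSMQ itself writes its storage files; durable_write and the file name are made up for illustration. Note the assumption baked into it: os.fsync only asks the operating system to flush, and whether the drive really empties its own cache on that request depends on the hardware and driver.

```python
import os
import tempfile


def durable_write(path, data):
    """Write data and ask the OS to push it to the device before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain Python's user-space buffer to the OS
        os.fsync(f.fileno())  # ask the OS to flush its buffers to the device
    return len(data)


# Hypothetical usage: a throwaway file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "message.bin")
written = durable_write(path, b"transfer 100")
```

The trade-off is the same one as above: every flushed write waits for the disk, so you pay for the safety in throughput.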