A reasonably common question, this.
It is important to note that MSMQ doesn’t lose your data. There’s no hole at the bottom of a queue where small messages can accidentally fall out, never to be seen again. Instead MSMQ will discard messages in certain circumstances unless told otherwise.
Here is a range of scenarios that explain why messages seem to disappear without reason. You may think that I’m inventing purely theoretical situations for some of these but you’d be wrong – all of the following are based on real customer cases.
Time-to-live for message is too short
MSMQ allows two different message timers:
- Time-To-Reach-Queue (TTRQ) – the time allowed for the remote server to confirm message delivery into the destination queue
- Time-To-Be-Received (TTBR) – the time allowed for the remote server to confirm the message has been read from the destination queue
The defaults are quite generous – 4 days and infinite respectively. So an undelivered message can sit in an outgoing queue for up to 4 days before being removed by MSMQ and, once delivered, can remain in the destination queue indefinitely. Should someone set the default lifetime for messages to a short value, or specifically assign one in their application code, then a delay in the system, such as a network outage, could cause messages to expire unexpectedly. If negative source journaling has been enabled then these messages should be in the sending machine’s dead letter queue with an explanation that the timer expired. If you don’t have journaling then the message is discarded.
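To make the timer behaviour concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the MSMQ API: the `Message` class, the `sweep` function and the reason strings are all hypothetical names invented for illustration, and the model assumes both timers are measured from the moment the message is sent.

```python
from dataclasses import dataclass
from typing import Optional

FOUR_DAYS = 4 * 24 * 60 * 60  # default TTRQ described above, in seconds
INFINITE = None               # default TTBR described above


@dataclass
class Message:
    # Hypothetical stand-in for an MSMQ message; not a real API type.
    label: str
    sent_at: float
    ttrq: Optional[float] = FOUR_DAYS
    ttbr: Optional[float] = INFINITE
    negative_source_journaling: bool = False


def expired_timer(msg, now, reached_queue):
    """Return 'TTRQ' or 'TTBR' if that timer has run out, else None."""
    age = now - msg.sent_at
    if not reached_queue and msg.ttrq is not None and age > msg.ttrq:
        return "TTRQ"
    if msg.ttbr is not None and age > msg.ttbr:
        return "TTBR"
    return None


def sweep(messages, now, reached_queue, dead_letter):
    """Remove expired messages; keep a record only if journaling was requested."""
    survivors = []
    for msg in messages:
        timer = expired_timer(msg, now, reached_queue)
        if timer is None:
            survivors.append(msg)
        elif msg.negative_source_journaling:
            dead_letter.append((msg.label, f"{timer} expired"))
        # Otherwise the message is simply discarded, leaving no trace.
    return survivors


# Three messages stuck in an outgoing queue during a ten-minute outage.
outage_seconds = 10 * 60
outgoing = [
    Message("default", sent_at=0),
    Message("short", sent_at=0, ttrq=60),
    Message("short-journaled", sent_at=0, ttrq=60,
            negative_source_journaling=True),
]
dead_letter = []
remaining = sweep(outgoing, now=outage_seconds, reached_queue=False,
                  dead_letter=dead_letter)
print([m.label for m in remaining])
print(dead_letter)
```

In this model only the message with default timers is still waiting after the outage. Of the two with a 60-second TTRQ, one appears in the dead letter list with a reason and the other has vanished, which is exactly the "disappearing message" a customer would report.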
Incorrect queue or message property
If a message is sent with properties that don’t match the destination then it will be discarded. The common properties that can be misapplied are:
- Transactional message to non-transactional queue, and vice versa
- Queue requires authentication but no certificates used in the message
- Queue requires encryption but the message is sent unencrypted
- … and there are probably others.
As with the timers, if negative source journaling has been enabled then these messages should be in the sending machine’s dead letter queue with an appropriate explanation. If you don’t have journaling then the message is discarded.
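One way to guard against these mismatches is a pre-flight check in the sending application before the message leaves the machine. The sketch below is a toy Python model of the rules listed above, not MSMQ's own validation: the `Queue` and `Message` classes, their field names and the reason strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Queue:
    # Hypothetical description of what the destination queue demands.
    transactional: bool = False
    requires_authentication: bool = False
    requires_encryption: bool = False


@dataclass
class Message:
    # Hypothetical description of how the message is being sent.
    transactional: bool = False
    authenticated: bool = False
    encrypted: bool = False


def rejection_reason(msg: Message, queue: Queue) -> Optional[str]:
    """Return why the destination would discard the message, or None if it fits."""
    if msg.transactional and not queue.transactional:
        return "transactional message sent to non-transactional queue"
    if queue.transactional and not msg.transactional:
        return "non-transactional message sent to transactional queue"
    if queue.requires_authentication and not msg.authenticated:
        return "queue requires authentication"
    if queue.requires_encryption and not msg.encrypted:
        return "queue requires encryption"
    return None


orders = Queue(transactional=True, requires_authentication=True)
print(rejection_reason(Message(), orders))
print(rejection_reason(Message(transactional=True), orders))
print(rejection_reason(Message(transactional=True, authenticated=True), orders))
```

A check like this only helps if the sender knows how the destination queue is configured, which is an assumption that often does not hold for remote queues. Negative source journaling remains the reliable way to find out after the fact.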
Express messages lost when MSMQ service is restarted
If you want a message to survive a reboot then it must be set as recoverable in the application. Although express messages are copied to the disk, this is not to make them persistent. As a result, when the MSMQ service starts up again all the express messages are immediately discarded. Transactional messages are recoverable by design and no extra action is required for these.
Server name used to address message doesn’t match destination machine
When MSMQ receives a message from over the wire, it always validates that this machine is the correct recipient. This is to ensure that something like a DNS misconfiguration does not result in messages being delivered to the wrong place. The messages are, instead, discarded unless the IgnoreOSNameValidation registry value is set appropriately. You may want to do this with an Internet-facing MSMQ server, for example, where the domain and server names visible to MSMQ clients on the Internet often bear no resemblance to the real ones (for good security reasons).
Receiving application experiences a problem
Here the message has been successfully delivered to the destination queue and has been received but the application then encounters a problem. Maybe a back-end database is offline and the message body cannot be written to it as a record. If the message is not received within a transaction (which can abort and roll back) then such a fault could mean the message is lost. An alternative to using transactions is to perform a peek and then a receive: the peek reads the message data for processing without removing it, and the receive deletes the message once processing has succeeded. If processing fails then the message is still in the queue and a retry is possible. Journaling on the queue itself is a useful step, as a copy of the message in the queue’s journal queue shows that the message was successfully read.
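The peek-then-receive pattern can be sketched as follows. This is a toy in-memory queue written in Python, not the MSMQ API: `ToyQueue`, `process_next` and `write_to_database` are hypothetical names, and the sketch assumes a single reader, since with several readers the message removed by the receive might not be the one that was peeked.

```python
from collections import deque


class ToyQueue:
    """Hypothetical in-memory queue with a journal of received messages."""

    def __init__(self, messages=()):
        self._messages = deque(messages)
        self.journal = []  # copies of messages removed by receive()

    def peek(self):
        # Read the first message without removing it.
        return self._messages[0] if self._messages else None

    def receive(self):
        # Remove the first message and record it in the journal.
        msg = self._messages.popleft()
        self.journal.append(msg)
        return msg

    def __len__(self):
        return len(self._messages)


def process_next(queue, handler):
    """Peek, process, and only then receive. Returns True on success."""
    msg = queue.peek()
    if msg is None:
        return False
    try:
        handler(msg)
    except Exception:
        return False  # processing failed: the message is still in the queue
    queue.receive()   # processing succeeded: now delete the message
    return True


database = {"online": False}
saved = []


def write_to_database(body):
    if not database["online"]:
        raise ConnectionError("database offline")
    saved.append(body)


queue = ToyQueue(["order-1"])
first_try = process_next(queue, write_to_database)   # database is offline
still_queued = len(queue)
database["online"] = True
second_try = process_next(queue, write_to_database)  # retry succeeds
print(first_try, still_queued, second_try, saved, queue.journal)
```

The first attempt fails while the database is offline, yet the message is not lost because nothing has removed it from the queue. The journal entry only appears after the successful retry, which mirrors how a queue journal shows that a message really was read.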
Multiple applications are accessing the same queue without your knowledge
It is really hard to work out which applications are pulling messages from a queue, and they could be physically located anywhere on the network. There’s no query you can run to show which applications have open handles on a queue or have pulled messages from it in the past. You may be able to get a clue by running a security audit on the queue, as the account used to gain access may, if you’re lucky, indicate a particular machine as the source. Again, use journaling on the queue to prove the message has been delivered and removed correctly.
The server is switched off and the hard drive has write caching enabled
This has been covered in another blog post.
I’ll come back to this blog post over time as more scenarios occur to me.