The most interesting issue in DB Mirroring you will ever see

I recently worked on a very interesting “issue” in DB mirroring, one that shows up only in a very specific scenario. Read on to find out more.

Basically, we have a setup which looks something like this:

Initial setup with machines A, B and C
A principal
B mirror
C witness
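
For reference, a session like this is typically built with ALTER DATABASE SET PARTNER / SET WITNESS. A minimal sketch, with illustrative host names and ports:

            -- On mirror B (after restoring XYZ there WITH NORECOVERY):
            alter database XYZ set partner = 'TCP://A.contoso.com:5022'
            go

            -- On principal A:
            alter database XYZ set partner = 'TCP://B.contoso.com:5022'
            go
            alter database XYZ set witness = 'TCP://C.contoso.com:5033'
            go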

Take down the principal A (network disconnect or stop SQL)
A down
B online principal
C witness

Failover happens cleanly. While A is down, we do our automatic repair: remove the witness, add a new mirror on D, and establish a witness on E (sketched in T-SQL after the layout below).
A down
B online principal
D mirror
E witness
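
In T-SQL terms, the repair amounts to something like the sketch below, run while A is still down (server names and ports again illustrative):

            -- On B (now the principal): tear down what is left of session #1
            alter database XYZ set witness off
            go
            alter database XYZ set partner off
            go

            -- Restore XYZ on D from a backup of B WITH NORECOVERY, then:
            -- On D:
            alter database XYZ set partner = 'TCP://B.contoso.com:5022'
            go
            -- On B:
            alter database XYZ set partner = 'TCP://D.contoso.com:5022'
            go
            alter database XYZ set witness = 'TCP://E.contoso.com:5033'
            go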

Now when we bring A back up (reconnect network or start SQL)
A (principal in recovery)
B online principal
D mirror
E witness

At this point A correctly stays in recovery because it doesn’t have quorum. Now if you restart SQL on A
A online principal
B online principal
D mirror
E witness

So we end up with two copies of the database online as principal at the same time (a split-brain), which can be undesirable in certain situations.
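
You can confirm the split with the sys.database_mirroring catalog view; run on both A and B at this point, each side should report itself as PRINCIPAL for XYZ:

            select db_name(database_id) as [database],
                   mirroring_role_desc,
                   mirroring_state_desc
            from sys.database_mirroring
            where database_id = db_id('XYZ')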

Looking at the errorlog from when the service was restarted the first time, we saw these messages:

2012-06-26 17:10:23.80 spid21s     Error: 1438, Severity: 16, State: 1.

2012-06-26 17:10:23.80 spid21s     The server instance Partner rejected configure request; read its error log file for more information. The reason 1460, and state 1, can be of use for diagnostics by Microsoft. This is a transient error hence retrying the request is likely to succeed. Correct the cause if any and retry.

2012-06-26 17:10:26.49 spid19s Bypassing recovery for database 'XYZ' because it is marked as an inaccessible database mirroring database. A problem exists with the mirroring session. The session either lacks a quorum or the communications links are broken because of problems with links, endpoint configuration, or permissions (for the server account or security certificate). To gain access to the database, figure out what has changed in the session configuration and undo the change.

Attempting to access the database gives:

Msg 955, Level 14, State 1, Line 2

Database XYZ is enabled for Database Mirroring, but the database lacks quorum: the database cannot be opened. Check the partner and witness connections if configured.

However, when you restart the service for the second time, you see:

2012-06-26 17:32:32.51 spid7s      Recovery is writing a checkpoint in database 'XYZ' (5). This is an informational message only. No user action is required.

After some research and a lot of discussion, we were able to nail the behaviour down to the following sequence of events:

When A comes back online (first startup), it looks for its partner B and witness C. It is able to communicate with witness C (say, on port 5033). Witness C sends back a message (actually a DBM_NOT_SHIPPING error) indicating that it is no longer part of mirroring session #1.

So the old principal A removes witness C from its local configuration. Now, after the next restart, A again attempts to contact mirror B (but not witness C, because C has been removed from the configuration on A, remember). Mirror B replies that it is already part of a different mirroring session, session #2. So principal A removes mirror B from its configuration as well.
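
If you want to watch this happen, the same sys.database_mirroring view queried on A after each restart should show the witness entry disappearing first and the partner entry afterwards (an expectation based on the behaviour described here, not captured output):

            -- Run on A after each restart
            select mirroring_partner_name,
                   mirroring_witness_name,
                   mirroring_state_desc
            from sys.database_mirroring
            where database_id = db_id('XYZ')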

At this point A is a restarting principal with no witness configured, so it has the implied quorum vote of a principal and is able to restart and come online. This is the same case as if a drop-witness command had been executed and had acted on the mirror and witness without the acknowledgement reaching the principal before a restart (the command was accepted remotely, so the restarted node syncs with the latest configuration on restart).

In the normal case, where the session is dropped while the principal is down, the old mirror returns DBM_NOT_SHIPPING, which causes the old principal to drop mirroring locally and stay online.

The mirror in our case, however, has been configured into a different DBM session, so it returns DBM_ALREADYSHIPPING_REMOTE, which does not cause the session to drop; instead, the database on A comes online as a principal with no witness and the mirror not connected. Running alter database XYZ set partner off will put it into the same state as the normal case.

As you have probably surmised already, this behaviour is by design. But how do you avoid it? One of my esteemed colleagues came up with the following workaround:

When you remove mirroring session #1 and establish the new session with B as principal, D as mirror and E as witness, you need to make sure that the old witness C is no longer listening on the same endpoint (port 5033). This, in turn, ensures that the old principal A cannot talk to any remnants of mirroring session #1: any attempt by A to communicate with witness C will time out, and the old principal A will remain in the “RECOVERING” state, since quorum is never established. The only negative impact of this approach is that you cannot share the same witness server/endpoint across multiple mirroring sessions.
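
Putting this in place means checking which port the mirroring endpoint on C listens on and, if it is still the one session #1 used (5033 here), dropping and recreating it on a different port. The endpoint name and the new port below are illustrative:

            -- On C: find the mirroring endpoint and the port it listens on
            select name, port, state_desc
            from sys.tcp_endpoints
            where type_desc = 'DATABASE_MIRRORING'
            go

            -- Recreate it on a different port so A can no longer reach C on 5033
            drop endpoint Mirroring
            go
            create endpoint Mirroring
                state = started
                as tcp (listener_port = 5044)
                for database_mirroring (role = witness)
            go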

Now, apart from this, there are a few other things you need to account for:

  • After the new mirroring session is set up, you need to drain the old principal A's details from all existing application connections and point them at the new principal and its partner instead; this requires a disconnect and reconnect (see the connection string example after this list).
  • After the old principal A comes back up, you need to use the following commands to remove the remnants of mirroring session #1 from it and keep it out of application use:

            alter database XYZ set partner off
            go

            alter database XYZ set single_user with rollback immediate
            go

            alter database XYZ set offline
            go
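
For the first point, if your applications use ADO.NET/SqlClient-style connection strings, repointing them means handing out something along these lines after the repair (server names illustrative):

            Server=B;Failover Partner=D;Database=XYZ;Integrated Security=true

The Failover Partner keyword lets the client library retry against the mirror D if the principal B is unreachable.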

Not a very common scenario, but an interesting one nonetheless. What say?