Database mirroring questions from TechEd: failovers and partner timeouts

This year at TechEd there's been even more interest in database mirroring than last year - Kimberly's Always-On Demo-Fest yesterday had a whole bunch of questions during the first demo on mirroring and 50% of the booth questions I've answered have been on it too. It seems that more people are getting to the stage of actually implementing SQL Server 2005 this year, which kind of makes sense now that we've put out two Service Packs and companies have had a chance to implement and test new application designs and thoroughly test out workloads on the new version.

There were two really interesting questions that came up that I want to discuss in this post.

Clustering, mirroring and altering the partner timeout

The first question came up during Kimberly's session and is about combining failover clustering and database mirroring. A customer wanted to put the instance hosting the principal database on a failover cluster and have the mirror database on an unclustered server. The issue he saw with the setup was when something happened on the active cluster server and it failed over. The failover would take 60-90 seconds, which means the principal database would not be available during that time. 

The problem stems from the way database mirroring figures out whether to initiate a failover. The mechanism it uses is to have all SQL Server instances within a mirroring session send out 'pings' on each of its open mirroring connections every second. If an instance does not receive a ping from one of the other instances in the session within a defined time-out period (called the partner time-out), it considers the other instance as unavailable and takes the appropriate action. The default time-out period is ten seconds.

In the case above, when the cluster is failing over and the principal instance is unavailable for 60-90 seconds, the other instances in the mirroring session will time-out after ten seconds and initiate a mirroring failover. Although this is the correct behavior of mirroring, it is undesirable behavior for this customer. The solution to the problem is to change the partner time-out value - a facility that isn't well known.

To change the partner time-out value for a mirroring session, use the following code:

ALTER

DATABASE mydatabase SET PARTNER TIMEOUT 90;

GO

The value at the end of the statement is the new partner timeout value in seconds. There are a couple of things to be aware of if you're going to do this:

  • You can only issue this statement on the principal server.
  • Be very careful not to set the time-out value too low otherwise you run the risk of triggering failovers becuase of false-failures - especially on heavily-loaded systems that may not be able to respond within the time-out period. In fact, if you specify a time-out value of 4 seconds or lower, SQL Server will automatically set the time-out to 5 seconds to help avoid such problems.

Database mirroring failover types

The second question came up during our VLDB maintenance Q&A later in the afternoon and is about the different kinds of failures that can trigger mirroring failovers, and how quickly the failover happens after the problem occurs. As with most questions I get, I can use my favorite answer of "It depends!" :-) Let's look at some examples of failures and see how quickly the failover occurs, in decreeasing order of speed.

  • Fastest: The fastest possible failover occurs when the SQL Server instance crashes (so the mirroring connection endpoint no longer exists) but the operating system is still running. When the next ping comes from a partner instance, the OS knows that the network port (that was being used by the mirroring connection endpoint) is no longer being listened to and returns a failure. This immediately triggers a failover.
  • Fast: The next fastest failover occurs when the machine hosting the SQL Server instance crashes or shuts down (e.g. power supply cord pulled out of the back). In this case, nothing happens until the partner time-out period has expired (as there's no OS running to return the immediate failure) and then a failover will occur. By default this will be in ten seconds, but you can change this as I explained above.
  • Slow: A slow failover is when something happens to the server but it takes a while for the database to go offline. An example of this would be someone pulling out the transaction log drive on the principal server. Writes to the transaction log will start to queue up. After 20 seconds SQL Server will issue an IO warning but it isn't until 40 seconds has passed that SQL Server issues an IO failure and the database goes suspect. It's not until this point that the failover occurs - even though the partner timeout value is 10 seconds!
  • Maybe fast or not at all: This discussion all started by someone asking about how quickly a failover happens if a page checksum failure is detected. My answer was - it depends! If the page checksum failure happens during a query, then all that happens is that the checksum failure is reported, the query rolls back, and the connection is broken. It's only if the page checksum failure happens during a transaction rollback that a failover will be triggered. This is because a transaction rollback failure means the database is in a transactionally inconsistent state and has to be set to suspect mode - which triggers a mirroring failover.

So, don't assume that just because mirroring is setup that every failure will trigger a fast failover.

Hopefully there will be more questions during the rest of the week that I'll get time to blog about!