AlwaysOn Availability Group Forced Failover: Under the Hood

In this walkthrough I will take you through one aspect of an FCI + AG configuration: forced failover. We will simulate a failure and force a failover, and in the process we will incur some … data loss. Does that not sound horrible? Relax! You should not worry too much, because what we are going to discuss is actually a ‘manufactured’ data loss situation. The true intent of this post is to show you how the Availability Group architecture deals with a ‘multiple master’ scenario and how it avoids the ‘split-brain’ condition.

Lab setup

For our testing the lab setup looks like this:

[Diagram: lab setup]

INST1 is a clustered instance of SQL with 2 nodes as possible owners, running SQL 2012 CU3. W2K8R2N3 is a server hosting a standalone instance of SQL 2012 CU3. All 3 servers are nodes in the same Windows Server Failover Cluster (WSFC), SQLAGCLUST.

Design decision #1: Asynchronous versus Synchronous secondary

We are allowed to configure either, but note the latency effect that a synchronous secondary can have. If W2K8R2N3 were on a WAN link, for example, I would prefer to use asynchronous commit.
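For reference, switching the secondary replica to asynchronous commit later is a single statement. Here is a minimal sketch using the replica name from this lab:

ALTER AVAILABILITY GROUP [FCI_AG]
MODIFY REPLICA ON N'W2K8R2N3'
WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);
GO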

Design decision #2: Why not automatic failover for the secondary replica?

While configuring this setup, please note that although we are allowed to have a synchronous secondary replica, we cannot make it an automatic failover partner. If you try to configure W2K8R2N3 as a synchronous replica with automatic failover (which requires setting FAILOVER_MODE = AUTOMATIC on the FCI replica INST1 as well):

USE [master]
GO
ALTER AVAILABILITY GROUP [FCI_AG]
MODIFY REPLICA ON N'INST1' WITH (PRIMARY_ROLE(READ_ONLY_ROUTING_LIST = (N'W2K8R2N3')))
GO
ALTER AVAILABILITY GROUP [FCI_AG]
MODIFY REPLICA ON N'INST1' WITH (FAILOVER_MODE = AUTOMATIC)
GO


When you do this, you will receive the following error message:

Msg 35215, Level 16, State 17, Line 1
The Alter operation is not allowed on availability replica ‘INST1’, because automatic failover mode is an invalid configuration on a SQL Server Failover Cluster Instance.  Retry the operation by specifying manual failover mode.

This is expected and by design; the logic is that the FCI already has automatic failover configured between its nodes. The current implementation of AGs does not allow automatic failover outside of the FCI nodes.
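As the error message suggests, the allowed configuration for the FCI replica is manual failover mode. A quick sketch of that retry:

ALTER AVAILABILITY GROUP [FCI_AG]
MODIFY REPLICA ON N'INST1' WITH (FAILOVER_MODE = MANUAL)
GO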

 

Test Case

We will simulate a ‘perfect storm’: failure of the link to the DR site, followed by failure of the primary site, followed by a forced failover to the secondary site, and finally a failback to the primary site. In this rarest-of-rare scenario we will show how the scope for data loss can arise.

Database and AG details

Here is the ‘full disclosure report’ for the test:

  • In our sample AG, there is just one database: ‘mydb’.
  • The primary replica of ‘mydb’ is hosted on the FCI INST1; it is configured for synchronous commit, is readable when in the secondary role, and uses manual failover.
  • The secondary replica of ‘mydb’ on W2K8R2N3 is configured for synchronous commit and is readable; as discussed above, failover to it has to be manual.
  • For our testing we will use a simple table ‘newtab’ which has one integer column, ‘i’ (see the setup sketch after this list).
  • Prior to these tests, the table ‘newtab’ was truncated and the ‘mydb’ transaction log was backed up.
  • We will test for missing rows to determine data loss.
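For clarity, here is a minimal sketch of how the test objects could be set up. The backup path below is hypothetical, and ‘mydb’ is assumed to already be a member of the availability group FCI_AG:

USE mydb
GO
-- Test table with a single integer column, as described above
CREATE TABLE dbo.newtab (i int);
GO
-- Reset the state before the tests: empty the table and back up the transaction log
TRUNCATE TABLE dbo.newtab;
BACKUP LOG mydb TO DISK = N'f:\backup\mydb_log.trn';   -- hypothetical backup target
GO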

Cluster node weight assignment

As part of good practice, since W2K8R2N3 cannot participate in automatic failover, it is recommended to set its node weight to 0. By default it is 1:

PS C:\Windows\system32> (Get-ClusterNode "w2k8r2n3").NodeWeight
1

We will set it to 0:

PS C:\Windows\system32> (Get-ClusterNode "w2k8r2n3").NodeWeight = 0
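If you prefer to verify the vote assignment from SQL Server rather than PowerShell, the sys.dm_hadr_cluster_members DMV exposes it. A quick sketch:

-- Quorum vote assignment as seen by SQL Server
SELECT member_name, member_type_desc, member_state_desc, number_of_quorum_votes
FROM sys.dm_hadr_cluster_members;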

Test 0: Ensure synchronization works normally

We insert our first row, and check the secondary replica to see if the log records made it across.

insert newtab values (1)

Almost immediately we can see the record is synchronized on the secondary replica. A casual inspection of the transaction log on both sides using fn_dblog() shows that the log records are in sync. Here are the last 3 entries displayed by fn_dblog:

0000001f:0000007b:0018    LOP_INSERT_ROWS    LCX_HEAP    0000:000002e9
0000001f:0000007b:0019    LOP_SET_FREE_SPACE    LCX_PFS    0000:00000000
0000001f:0000007b:001a    LOP_COMMIT_XACT    LCX_NULL    0000:000002e9

fn_dblog() is undocumented and unsupported, but it can be useful for learning and testing purposes; various community resources describe it in more detail.
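For reference, here is a minimal sketch of the kind of query used above to inspect the tail of the log; run it on both replicas and compare the output:

USE mydb
GO
SELECT TOP (10) [Current LSN], Operation, Context, [Transaction ID]
FROM sys.fn_dblog(NULL, NULL)
ORDER BY [Current LSN] DESC;   -- LSNs are zero-padded hex strings, so string ordering works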

 

Test 1: Link failure / secondary down

This is simulated by stopping the SQL Server service on W2K8R2N3. In this case, the primary replica (which is the FCI) runs ‘exposed’: the RPO of zero data loss (for the AG) is potentially not met because there is no link to the secondary replica.

At this time (when the secondary replica is stopped or disconnected) we see the following entry in the errorlog of primary replica INST1:

2012-09-20 11:52:42.880 spid44s      AlwaysOn Availability Groups connection with secondary database terminated for primary database ‘mydb’ on the availability replica with Replica ID: {ac95b2e1-d5fa-48c2-8995-2f7be4140ed3}. This is an informational message only. No user action is required.

Now, let’s continue ‘modifying’ data on INST1.mydb:

insert newtab values (2)

Let’s review the last few LSNs:

0000001f:00000081:0002    LOP_INSERT_ROWS    LCX_HEAP    0000:000002ec
0000001f:00000081:0003    LOP_COMMIT_XACT    LCX_NULL    0000:000002ec

So to summarize: at this time we have 1 row (with a value of 2) in mydb.newtab which has not been transferred to W2K8R2N3. This is why we say we are ‘exposed’: we could lose this change if something bad were to happen to INST1 now.
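One way to observe that the primary is running ‘exposed’ (this is an illustrative query, not part of the original test) is to check the synchronization state and the log send queue via the AlwaysOn DMVs:

SELECT ar.replica_server_name,
       drs.synchronization_state_desc,
       drs.synchronization_health_desc,
       drs.log_send_queue_size        -- KB of log not yet sent to the secondary
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar ON ar.replica_id = drs.replica_id
WHERE drs.database_id = DB_ID(N'mydb');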

Test 2: INST1 also fails; W2K8R2N3 comes up later

Imagine that due to a steady degradation (maybe a power issue or similar) the INST1 SQL FCI also fails. We will simulate this by powering off the N1 and N2 nodes, so that INST1 is unavailable. With those two voting nodes offline, the cluster itself loses quorum and goes down.

Also imagine that, after a while, W2K8R2N3 (the erstwhile secondary) comes back up. Let’s see what happens in that case. Firstly, Failover Cluster Manager reports that it is unable to connect to our WSFC. So let’s see what we get when we generate the cluster log file:

C:\Windows\system32>cluster log /g
Generating the cluster log(s) …
The cluster log has been successfully generated on node ‘w2k8r2n3’…
The cluster log could not be created on node ‘W2K8R2N1’…
System error 1722 has occurred (0x000006ba).
The cluster log could not be created on node ‘W2K8R2N2’…
System error 1722 has occurred (0x000006ba).     

The multiple ‘errors’ are expected: the cluster log is generated on all nodes, and nodes 1 and 2 are down (error 1722 is ‘The RPC server is unavailable’). Let’s see what is in the cluster log on W2K8R2N3:

00000578.00000764::2012/09/20-19:10:39.769 INFO  [QUORUM] Node 3: setting next best epoch to 2012/09/13-06`03`57.603_11
00000578.00000764::2012/09/20-19:10:39.769 DBG   [QC] Calculating quorum: Configured nodes count = 3, Node weights count = 2, Quorum witness configured = false, Intersection count = 0
00000578.00000764::2012/09/20-19:10:39.769 INFO  [QUORUM] Node 3: Coordinator: no quorum yet. do nothing

Aha! So due to lack of quorum, the N3 cluster service is unable to form a cluster. At this stage, I will go ahead and stop the cluster service on N3 (if it has not stopped already). We will start it and ‘force quorum’ later.

Test 3: Forcing quorum on W2K8R2N3

This part is quite simple in operation, but underneath there are many complex details. I will not attempt to reproduce those details here; you can search for ‘Paxos tag’ to understand the mechanism.

So here are the commands to start the cluster service and force quorum. First, we use PowerShell to set the node weight of W2K8R2N3 back to 1:

PS C:\Windows\system32> (Get-ClusterNode "w2k8r2n3").NodeWeight = 1

Next, we use the ForceQuorum switch to start the cluster service.

C:\Windows\system32>net start clussvc /fq

At this stage, if you look at SQL Server Management Studio, you will see the database in a ‘Not Synchronizing’ state, and the AG resource in the ‘Resolving’ state:

[Screenshot: mydb in ‘Not Synchronizing’ state and the availability group in ‘Resolving’ state]

Next, we will force the failover of the AG to this node. To do this, we right-click the AG and select the Failover option, which brings up a set of wizard screens. The first one warns us about the scope of data loss:

[Screenshot: Failover wizard warning about potential data loss]

The second one ensures that we understand the implications of forcing failover:

[Screenshot: Failover wizard confirmation of the implications of forcing failover]
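If you prefer T-SQL over the wizard, the equivalent command (executed on the target secondary, W2K8R2N3 in this lab) is the forced failover statement:

ALTER AVAILABILITY GROUP [FCI_AG] FORCE_FAILOVER_ALLOW_DATA_LOSS;
GO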

Once the necessary steps are done, you will see the database status as Synchronized:

[Screenshot: database status showing Synchronized]

Forced failover: Under the hood

And most importantly from an internals perspective, looking at the transaction log using fn_dblog() shows the same record as the original modification (the row with value 1); the LSN is 1f:7b:18:

0000001f:0000007b:0018    LOP_INSERT_ROWS    LCX_HEAP    0000:000002e9

At this time, we can also check the rows in the table on N3:

select * from mydb.dbo.newtab

The above shows just 1 row, with value 1 as expected. Let’s take a look at the error log on W2K8R2N3:

2012-09-20 12:32:38.110 spid53       The state of the local availability replica in availability group ‘FCI_AG’ has changed from ‘RESOLVING_NORMAL’ to ‘RESOLVING_PENDING_FAILOVER’. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error.

Below you will see the state transition which is in response to our request to force failover:

2012-09-20 12:32:48.050 spid55       AlwaysOn: The local replica of availability group ‘FCI_AG’ is preparing to transition to the primary role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required
2012-09-20 12:32:48.060 spid55       The state of the local availability replica in availability group ‘FCI_AG’ has changed from ‘RESOLVING_PENDING_FAILOVER’ to ‘RESOLVING_NORMAL’. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error.
2012-09-20 12:32:48.080 spid55       The state of the local availability replica in availability group ‘FCI_AG’ has changed from ‘RESOLVING_NORMAL’ to ‘PRIMARY_PENDING’. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error.

2012-09-20 12:32:48.200 Server       The state of the local availability replica in availability group ‘FCI_AG’ has changed from ‘PRIMARY_PENDING’ to ‘PRIMARY_NORMAL’. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error.

At this stage, the database can be brought online on W2K8R2N3:

2012-09-20 12:32:48.210 spid21s      Starting up database ‘mydb’.
2012-09-20 12:32:48.300 spid21s      3 transactions rolled forward in database ‘mydb’ (5:0). This is an informational message only. No user action is required.
2012-09-20 12:32:48.340 spid10s      0 transactions rolled back in database ‘mydb’ (5:0). This is an informational message only. No user action is required.  

Test 4: Bring INST1 back online

By doing this, we potentially risk what is known as a ‘split-brain’ situation. Normally the implementation of the AG resource DLL and the lease timeout avoids this issue if all nodes in the cluster can talk to each other. In a real DR failback scenario, one would need to be very careful when bringing INST1 back online; the recommended way to prepare for this is to start the cluster service on those nodes with the Prevent Quorum switch (/pq) first.

When we do bring the nodes for INST1 back online, mydb on INST1 goes into a Not Synchronizing state.

[Screenshot: mydb on INST1 in ‘Not Synchronizing’ state]

If we try to use the mydb database on INST1 it gives us the following error:

The target database, ‘mydb’, is participating in an availability group and is currently not accessible for queries. Either data movement is suspended or the availability replica is not enabled for read access. To allow read-only access to this and other databases in the availability group, enable read access to one or more secondary availability replicas in the group.  For more information, see the ALTER AVAILABILITY GROUP statement in SQL Server Books Online. (Microsoft SQL Server, Error: 976)

Let’s see if we can create a database snapshot to peer into what is inside INST1.mydb.

-- Create the database snapshot
CREATE DATABASE mydb_dr_snap ON
( NAME = mydb, FILENAME =
'f:\sql2012\data\mydb_dr_snap.ss' )
AS SNAPSHOT OF mydb;
GO
  

At this time, we query the snapshot using the following command:

select * from mydb_dr_snap.dbo.newtab

This gives us 2 rows back. That means the transactions on the erstwhile primary instance INST1 are still ‘there’, but because of the forced failover a common LSN cannot be negotiated automatically, which is why INST1.mydb sits in the Not Synchronizing state. You can see the proof for this in the error log. To compare the state of the new primary (W2K8R2N3) with this snapshot on INST1, you can use the tablediff utility.

First, you will see the attempt made by the new secondary (INST1) to connect to the new primary (W2K8R2N3):

2012-09-20 12:40:34.990 spid23s      A connection for availability group ‘FCI_AG’ from availability replica ‘INST1’ with id  [280B1AED-49EA-4EF7-9BAF-C1EC13F50E51] to ‘W2K8R2N3’ with id [AC95B2E1-D5FA-48C2-8995-2F7BE4140ED3] has been successfully established.  This is an informational message
2012-09-20 12:40:37.630 spid29s      AlwaysOn Availability Groups connection with primary database established for secondary database ‘mydb’ on the availability replica with Replica ID: {ac95b2e1-d5fa-48c2-8995-2f7be4140ed3}. This is an informational message only. No user action is required.
2012-09-20 12:40:37.630 spid29s      Error: 35285, Severity: 16, State: 1.

This next entry is SO important for us to note. It tells us that the ‘recovery LSN’, which is the ‘common handshake’, is 31:129:1. In hexadecimal notation, this is 1F:81:1. From the earlier steps in this walkthrough, you might remember that when we inserted row 2, the LSN was 1f:81:2.

2012-09-20 12:40:37.630 spid29s      The recovery LSN (31:129:1) was identified for the database with ID 5. This is an informational message only. No user action is required.
2012-09-20 12:40:37.630 spid29s      AlwaysOn Availability Groups data movement for database ‘mydb’ has been suspended for the following reason: “failover from partner” (Source ID 1; Source string: ‘SUSPEND_FROM_PARTNER’). To resume data movement on the database, you will need to resume the d  
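As a quick sanity check of the decimal-to-hex conversion mentioned above, you can let SQL Server do the arithmetic:

SELECT FORMAT(31, 'X') AS [31_in_hex], FORMAT(129, 'X') AS [129_in_hex];   -- returns 1F and 81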

Here is the notification that the snapshot was created:

2012-09-20 12:44:36.460 spid57       Starting up database ‘mydb_dr_snap’.
2012-09-20 12:44:36.470 spid57       4 transactions rolled forward in database ‘mydb_dr_snap’ (6:0). This is an informational message only. No user action is required.
2012-09-20 12:44:36.520 spid57       0 transactions rolled back in database ‘mydb_dr_snap’ (6:0). This is an informational message only. No user action is required.
2012-09-20 12:44:36.520 spid57       Recovery is writing a checkpoint in database ‘mydb_dr_snap’ (6). This is an informational message only. No user action is required.
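With the snapshot in place, one way to identify the rows that exist on the old primary but not on the new primary is a simple EXCEPT query. This is only an illustrative sketch: it assumes a linked server named [W2K8R2N3] pointing at the new primary, which is not part of the original setup (the tablediff utility mentioned earlier is the more general option):

SELECT i FROM mydb_dr_snap.dbo.newtab
EXCEPT
SELECT i FROM [W2K8R2N3].mydb.dbo.newtab;   -- rows present in the snapshot but missing on the new primary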

Resume synchronization for INST1.mydb

Later, we resume synchronization for this database on INST1. That effectively ‘sets the clock back’ on INST1.mydb by rolling back the transactions subsequent to the recovery LSN. We do this by right-clicking the database under the Availability Group and selecting Resume Data Synchronization (the T-SQL equivalent is shown after the screenshot below).

[Screenshot: Resume Data Synchronization menu option for mydb on INST1]
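For reference, the T-SQL equivalent of the Resume Data Synchronization menu option, run on INST1, is:

ALTER DATABASE mydb SET HADR RESUME;
GO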

At this time, select * from mydb.dbo.newtab gives 1 row back! That confirms the second of our ‘manufactured data loss’ scenarios.

As always let us look at what happened under the hood, by noting the error log entries on INST1. First, you will see the effect of the RESUME:

2012-09-20 12:48:26.410 spid55       ALTER DB param option: RESUME
2012-09-20 12:48:26.410 spid55       AlwaysOn Availability Groups data movement for database ‘mydb’ has been resumed. This is an informational message only. No user action is required.
2012-09-20 12:48:26.530 spid29s      AlwaysOn Availability Groups connection with primary database established for secondary database ‘mydb’ on the availability replica with Replica ID: {ac95b2e1-d5fa-48c2-8995-2f7be4140ed3}. This is an informational message only. No user action is required.
2012-09-20 12:48:26.530 spid29s      Error: 35285, Severity: 16, State: 1.
2012-09-20 12:48:26.530 spid29s      The recovery LSN (31:129:1) was identified for the database with ID 5. This is an informational message only. No user action is required.
2012-09-20 12:48:26.640 spid29s      Error: 35278, Severity: 17, State: 1.
2012-09-20 12:48:26.640 spid29s      Availability database ‘mydb’, which is in the secondary role, is being restarted to resynchronize with the current primary database. This is an informational message only. No user action is required.  

What follows is even more critical to note. It says we identified a higher LSN (31:130:1) but are ‘going back’ to the recovery LSN (31:129:1):

2012-09-20 12:48:26.790 spid40s      Error: 35285, Severity: 16, State: 1.
2012-09-20 12:48:26.790 spid40s      The recovery LSN (31:130:1) was identified for the database with ID 5. This is an informational message only. No user action is required.
2012-09-20 12:48:26.790 spid40s      Error: 35286, Severity: 16, State: 1.
2012-09-20 12:48:26.790 spid40s      Using the recovery LSN (31:129:1) stored in the metadata for the database with ID 5.

Finally, the database can be recovered and resume synchronization:

2012-09-20 12:48:26.960 spid40s      Starting up database ‘mydb’.
2012-09-20 12:48:27.000 spid40s      Starting up database ‘mydb_dr_snap’.
2012-09-20 12:48:27.420 spid40s      AlwaysOn Availability Groups connection with primary database established for secondary database ‘mydb’ on the availability replica with Replica ID: {ac95b2e1-d5fa-48c2-8995-2f7be4140ed3}. This is an informational message only. No user action is required.
2012-09-20 12:48:27.430 spid40s      Error: 35285, Severity: 16, State: 1.
2012-09-20 12:48:27.430 spid40s      The recovery LSN (31:129:1) was identified for the database with ID 5. This is an informational message only. No user action is required.
2012-09-20 12:48:27.430 spid40s      3 transactions rolled forward in database ‘mydb’ (5:0). This is an informational message only. No user action is required.
2012-09-20 12:48:27.460 spid29s      AlwaysOn Availability Groups connection with primary database established for secondary database ‘mydb’ on the availability replica with Replica ID: {ac95b2e1-d5fa-48c2-8995-2f7be4140ed3}. This is an informational message only. No user action is required.  

Here is the final state of the transaction log on INST1. You will notice the absence of the LSNs associated with the row with value 2 (1f:81:2):

0000001f:0000007b:0019    LOP_SET_FREE_SPACE    LCX_PFS    0000:00000000
0000001f:0000007b:001a    LOP_COMMIT_XACT    LCX_NULL    0000:000002e9
0000001f:00000080:0001    LOP_MODIFY_ROW    LCX_BOOT_PAGE    0000:00000000
0000001f:00000081:0001    LOP_MODIFY_ROW    LCX_BOOT_PAGE    0000:00000000
0000001f:00000082:0001    LOP_MODIFY_ROW    LCX_BOOT_PAGE    0000:00000000
0000001f:00000083:0001    LOP_BEGIN_XACT    LCX_NULL    0000:000002ec
0000001f:00000083:0002    LOP_HOBT_DELTA    LCX_NULL    0000:000002ec

So this proves that we did lose the changes made on INST1 because of the forced failover to W2K8R2N3. However, with the database snapshot, we can attempt to reconcile those ‘lost’ changes with the base table.

Wrapping up

We covered quite a lot of deep stuff here. Here is a quick summary:

  1. Automatic failover is not possible when the primary replica is hosted on an FCI.
  2. Because of this, it is recommended to assign a node weight of 0 to the node hosting the secondary replica.
  3. If you lose the primary site, you must start the cluster service on the secondary replica node with forced quorum (/fq).
  4. Next, you force the AG to fail over to the secondary replica; data loss might occur in this scenario.
  5. If the primary site comes back in the meantime, start the cluster service on the erstwhile primary nodes with the Prevent Quorum switch (/pq) to avoid a ‘split-brain’ condition.
  6. If you take the right steps, the earlier primary will be in a Not Synchronizing state.
  7. The earlier primary replica will still contain the changes which were made to it but never transmitted to the secondary site; you can use a database snapshot on the earlier primary replica to identify these changes, typically with a utility like tablediff.
  8. On resuming data movement, the earlier primary takes on the role of a secondary; the changes it made but never transmitted are rolled back so that it can rejoin the new primary, and that data loss is expected.

Wow! That was a long post but I hope it brings to light a lot of the detail around this very specific scenario of forced failover. I hope you enjoyed reading this. Please leave your comments, questions and feedback below. I appreciate those very much!
