How It Works: DROP AVAILABILITY GROUP Behaviors

I just learned something new about the DROP AVAILABILITY GROUP command behavior that I didn't realize. The comment on TechNet, alludes to how DROP works but we definitely need to update our documentation.

Remove an Availability Group (SQL Server) - https://technet.microsoft.com/en-us/library/ff878113(v=sql.110).aspx

"When the availability group is online, deleting it from a secondary-replica causes the primary replica to transition to the RESTORING state."

The statement is true but it leaves too much to the imagination - so here is How It Works.

Let's start with a fully functioning availability group (AG), for simplicity, with a 2 node cluster example (assume quorum using 3rd disk resource).  

image The Windows cluster is communicating between Node 1 and Node 2 and keeping the cluster registry and cluster activities in sync. The RHS.exe process is running, hosting the SQL Server cluster resource DLL, on the primary. The SQL Server resource DLL is maintaining the proper connection to the primary instance and servicing the IsAlive and LooksAlive requests. The Secondary is connected to the primary and receiving data. Note: The primary does not connect to the secondary, only the secondary connects to the primary in Always On. This will be important fact for step #3 later.

Step #1 - The first step in dropping an AG is to take the AG offline.
Step #2 - Remove the AG from the registered cluster resources.

image Offline is coordinated using the Windows clustering APIs to trigger the necessary behaviors. The change in resource state signals to the SQL Server instances. No further changes should be allowed so the databases on all nodes are taken to the restoring state. Important Fact: Offline can occur for more than just DROP AVAILABILITY GROUP. For example, if the cluster losses network communication the resource can be taken offline. The offline state protects the databases from issues such as split brain and changes in general. Once the AG is taken offline and the AG resource is removed from the cluster you are left with the databases in restoring state on all the impacted nodes. WARNING: Running restore with recovery only, on the specified instance, allows you to access the database and make changes. Use this with caution because if you want to re-join the AG only one of the copies can be recovered and modified. Recovering more than one copy of the database would require you to merge changes (you create your own split brain situation.)

Step #3 - Only Occurs when DROP AVAILABILITY GROUP is executed on the OLD/Original Primary

image Step #3 recovers the database(s) that were in the AG but it is only executed on the OLD/Original PRIMARY. It is only safe to execute the recovery on the old primary and only if the command was executed on the OLD PRIMARY. Offline can occur for many reasons (DROP AG, Loss of quorum, …) making the original, primary SQL Server instance, performing the T-SQL DROP AVAILABILITY GROUP command, the only safe/known owner of the AG. I pointed out a subtle (Note) during my description of step #1 and #2 about the connections to the primary from the secondary. When the DROP AVAILABILITY GROUP command is executed on a secondary the communication with the primary SQL Server instance is lost during the offline phase. (DROP AVAILABILITY GROUP is not recommended on a secondary node and should only be used for emergency purposes.) The secondary no longer has a safe communication channel with the primary resulting in no clear way to tell the primary it is okay to recovery and allow changes and it can't blindly assume who it thinks is primary is really the current primary. The primary is signaled to go offline by the cluster resource changes and the primary can't be certain the offline, indicated from the cluster resource change, was because of a quorum loss, forced failover or drop availability group is taking place. Following the safety protocol offline only takes the database to the restoring state.

The confusion comes with the behavior of step #3.  

  • If the DROP occurs on the original primary the database is restored and changes can be made to the database. This is safe because all secondary's maintain the restoring state. You could recreate the AG and the changes made to the primary while the AG was DROPPED will get propagated to the secondary's.
  • If the DROP occurs on the secondary the primary is left in restoring state, requiring administrator intervention to restore with recovery only to allow changes to the database preventing introduction of something like a split-brain. 

Once I went back and looked at the technical flow of this I understand the behavior but it is worth noting that the behavior triggers different outcomes.

  • If I drop the AG from the primary changes are allowed in the database without active High Availability (HA) capabilities (though they could be restored by recreating the AG.)
  • If I drop the AG from a secondary no changes are allowed so it may impact production applications, preventing split-brain from occurring. 

Bob Dorr - Principal SQL Server Escalation Engineer