Issue: Replica Unexpectedly Dropped in Availability Group

You have deployed AlwaysOn availability groups and observe that one of the replicas unexpectedly disappears from your availability group. In addition to replica loss, there may be issues with the availability group resource and availability.

Why did my replica disappear?

As a clustered resource, the availability group state information is stored and maintained in the Windows Cluster store. SQL Server must be able to communicate and access the availability replica's state information using Windows cluster protocol. SQL Server will drop a replica if it tries to read the replica information from the cluster store (registry) and one of the following conditions occurs:

  • SQL queried the cluster store successfully and the replica did not exist - then SQL Server will drop the replica

  • SQL queried the cluster store successfully and the replica was found, but the data was corrupt

  • SQL queried the cluster store unsuccessfully

If one of these conditions occurs, the replica may be removed. This behavior is by design, but should be very uncommon and may signify a problem with Windows Cluster responsiveness and warrants further investigation.

How do I know SQL Server dropped the replica? 

If SQL Server dropped the replica for one of the reasons described here, SQL Server native error 41096 is raised and reported in the SQL Server error log:

2014-01-21 11:53:16.53 spid30s AlwaysOn: The local replica of availability group 'groupname' is being removed. The instance of SQL Server failed to validate the integrity of the availability group configuration in the Windows Server Failover Clustering (WSFC) store. This is expected if the availability group has been removed from another instance of SQL Server. This is an informational message only. No user action is required.
2014-01-21 11:53:16.53 spid30s The state of the local availability replica in availability group 'groupname' has changed from 'SECONDARY_NORMAL' to 'NOT_AVAILABLE'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

How to Respond 

If the replica removed is a secondary, you can add the replica back into the group without re-initializing the databases so long as the transaction log has not been truncated (due to log backup):

1. Add the replica back to the availability group. Connect to SQL Server hosting the primary and execute ALTER AVAILABILITY GROUP ADD REPLICA command:

use master
go
alter availability group ag
add replica on 'sqlnode2'
with(endpoint_url = 'tcp://sqlnode2:5022',
availability_mode=asynchronous_commit, failover_mode=manual)

 2. Join the availability group. Connect to the SQL Server hosting the secondary and execute the ALTER AVAILABILITY GROUP JOIN command:

 alter availability group ag join

 3. Join each database back to the replica. Connect to the SQL Server hosting the secondary and execute the ALTER DATABASE SET HADR command:

 alter database agdb set hadr availability group = ag

If the command 'alter database...set hadr...' fails with error 1478 it is because a log backup has occurred and the truncation point has advanced, creating a break in the log chain:

Msg 1478, Level 16, State 211, Line 2
The mirror database, "a", has insufficient transaction log data to preserve the log backup chain of the principal database. This may happen if a log backup from the principal database has not been taken or has not been restored on the mirror database. it is because the database log was backed up and the log chain has been broken. To avoid full re-initialization of the database, restore the backed up log to the database at the secondary.

It is still possible to restore the backed up log(s) files to the secondary database WITH NORECOVERY and then successfully join the database back into the group:

If the replica is removed at the primary:

  • If the databases were in the primary role at the time the replica was removed, they are in a recovered state and accessible.

  • If the databases had transitioned to a resolving role at the time the replica was removed, then the databases need to be restored with recovery in order to put them back into production:

 restore database agdb with recovery

In this case, the availability group must be dropped and recreated.