SQL Server Availability Groups – Enhanced Database Level Failover

Article
07/25/2018

Database level health detection failover (DB_Failover) option for Availability Groups was introduced in SQL Server 2016 with the objective to provide a mechanism for availability groups to failover, if one or more databases in the availability had any issues.This feature helps guarantee the high availability for your databases and is a recommended best practice for all availability Groups with mission critical databases. This Microsoft document describes the database level health detection failover option in detail. In its initial implementation the database level health detection option was designed to check the following conditions on the primary replica of the availability group.

DB Status is online,
If the transaction log file for a database was available for writing the transactions

If either of two conditions list above, is not true, the availability group hosting the databases would failover to one of the available synchronous (synchronized) secondary nodes.

In the past multiple customers and users in the SQL Server community have requested for additional checks (like errors arising because of hardware issues) which could potentially leave the database non-operational, to be included as part of database level health check detection. A new implementation of the database level health check detection option is available in the latest servicing release for SQL Server.

SQL2017RTM CU9 (https://support.microsoft.com/en-us/help/4341265/cumulative-update-9-for-sql-server-2017);
SQL2016SP1 CU10 (https://support.microsoft.com/en-us/help/4341569/cumulative-update-10-for-sql-server-2016-sp1);
SQL2016SP2 CU2 (https://support.microsoft.com/en-us/help/4340355/cumulative-update-2-for-sql-server-2016-sp2);

Customers can revert to the original (SQL Server 2016) implementation of database level health detection using TF 9576 as either a startup parameter or enabled using DBCC TRACEON command. This new implementation is currently only available for SQL Server running on Windows and will be ported to SQL Server 2017 on Linux in an upcoming cumulative update.

In addition to the existing checks, the new implementation has the following additional checks.

The new implementation stores and uses a historical snapshot of the database state information to decide if the AG needs to be marked in error state or not. The health check routine caches the database state and associated error information, for the last three executions, which is then compared with the state information from the current execution of the health detection routine. If the same error condition (for the below mentioned error codes) exists in the four consecutive runs of the health detection routine, a failover is initiated. So for example if during the first run, let's say at 22:00:00 there is an error 823, and the same error conditions exists at the subsequent runs at 22:00:10, 22:00:20 and 22:00:30 then the AG is marked in an error state and the control is passed to the cluster. This implementation is intended to provide safeguards against transient errors and issues which can be fixed by the auto page repair capabilities of the availability groups.
The new implementation checks for following additional errors. Majority of these errors are indicative of a hardware issues on the server. Please note, that this is not an exhaustive list of errors which could impact the database availability. There is an outstanding item to include error 824 to this list.

Error	Cause	Documentation
605	Page or allocation corruption.	https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-605-database-engine-error?view=sql-server-2017
823	Checkpoint failures.	https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-823-database-engine-error?view=sql-server-2017
829	Disk corruption.
832	Hardware or memory corruption.
1101	No disk space available in a filegroup.	https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-1101-database-engine-error?view=sql-server-2017
1105	No disk space available in a filegroup.	https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-1105-database-engine-error?view=sql-server-2017
5102	Missing filegroup ID requests.
5180	Wrong file ID requests.
5515		https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-5515-database-engine-error?view=sql-server-2017
5534	Log corruption due to FILESTREAM operation log record.
5535	FILESTREAM data container corruption.
9004	Log Corruption	https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-9004-database-engine-error?view=sql-server-2017

This enhancement is aimed at improving the high availability of user databases in an availability group, thereby guaranteeing a higher uptime for the applications.

Sourabh Agarwal (@SQLSourabh)
Senior PM, SQL Server Tiger Team
Follow us on Twitter: @mssqltiger | Team Blog: Aka.ms/sqlserverteam

SQL Server Availability Groups – Enhanced Database Level Failover

Additional resources