SQL Server Availability Groups – Enhanced Database Level Failover


Database level health detection failover (DB_Failover) option for Availability Groups was introduced in SQL Server 2016 with the objective to provide a mechanism for availability groups to failover, if one or more databases in the availability had any issues. This feature helps guarantee the high availability for your databases and is a recommended best practice for all availability Groups with mission critical databases. This Microsoft document describes the database level health detection failover option in detail. In its initial implementation the database level health detection option was designed to check the following conditions on the primary replica of the availability group.

  1. DB Status is online,
  2. If the transaction log file for a database was available for writing the transactions

If either of two conditions list above, is not true, the availability group hosting the databases would failover to one of the available synchronous (synchronized) secondary nodes.

In the past multiple customers and users in the SQL Server community have requested for additional checks (like errors arising because of hardware issues) which could potentially leave the database non-operational, to be included as part of database level health check detection. A new implementation of the database level health check detection option is available in the latest servicing release for SQL Server.

Customers can revert to the original (SQL Server 2016) implementation of database level health detection using TF 9576 as either a startup parameter or enabled using DBCC TRACEON command. This new implementation is currently only available for SQL Server running on Windows and will be ported to SQL Server 2017 on Linux in an upcoming cumulative update.

In addition to the existing checks, the new implementation has the following additional checks.

  1. The new implementation stores and uses a historical snapshot of the database state information to decide if the AG needs to be marked in error state or not. The health check routine caches the database state and associated error information, for the last three executions, which is then compared with the state information from the current execution of the health detection routine. If the same error condition (for the below mentioned error codes) exists in the four consecutive runs of the health detection routine, a failover is initiated. So for example if during the first run, let's say at 22:00:00 there is an error 823, and the same error conditions exists at the subsequent runs at 22:00:10, 22:00:20 and 22:00:30 then the AG is marked in an error state and the control is passed to the cluster. This implementation is intended to provide safeguards against transient errors and issues which can be fixed by the auto page repair capabilities of the availability groups.
  2. The new implementation checks for following additional errors. Majority of these errors are indicative of a hardware issues on the server. Please note, that this is not an exhaustive list of errors which could impact the database availability. There is an outstanding item to include error 824 to this list.

Error

Cause 

Documentation

605

Page or allocation corruption.  https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-605-database-engine-error?view=sql-server-2017

823

Checkpoint failures.  https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-823-database-engine-error?view=sql-server-2017

829

Disk corruption. 

832

Hardware or memory corruption. 

1101

No disk space available in a filegroup.  https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-1101-database-engine-error?view=sql-server-2017

1105

No disk space available in a filegroup.  https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-1105-database-engine-error?view=sql-server-2017

5102

Missing filegroup ID requests. 

5180

Wrong file ID requests. 

5515

  https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-5515-database-engine-error?view=sql-server-2017

5534

Log corruption due to FILESTREAM operation log record. 

5535

FILESTREAM data container corruption. 

9004

 Log Corruption https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-9004-database-engine-error?view=sql-server-2017

 

This enhancement is aimed at improving the high availability of user databases in an availability group, thereby guaranteeing a higher uptime for the applications.

 

 

Sourabh Agarwal (@SQLSourabh)
Senior PM, SQL Server Tiger Team
Follow us on Twitter: @mssqltiger | Team Blog: Aka.ms/sqlserverteam

Comments (4)

    1. Hi Pavel,
      This is not a bug, but an expected behavior. Setting a max size limit or not configuring auto growth for your mission critical databases is not a recommended best practice. If you do not want DB_Level_Failover to check for error 1101, then I would recommend turning this off by using TF 9576. This would revert the behavior to a SQL 2016 behavior, where we only check if the DB is online or if the log file is unavailable.

      1. Pavel Málek says:

        Hi Sourabh,
        Thank you for your reply. I agree that max size limit or no auto growth is not best practice but is common for many deployments. One example for all is the default for Microsoft System Center Operations Manager database.

        See:
        “By default, the OperationsManager database is not configured with autogrow enabled”
        on https://docs.microsoft.com/en-us/system-center/scom/plan-sqlserver-design

        I am not saying it’s a good thing. I am only saying that performing a failover of a whole AOAG when it CAN’T help in any way is a bug and it only makes the life of customers more complicated.

        1. Thanks for the feedback. While I cannot comment on SCOM and their reasons for disabling auto-growth. I can certainly talk about best practices for SQL Server. As mentioned earlier, if you have a specific scenario where the new failover mechanism is not helpful, I would recommend reverting to the older behavior with TF 9576.

Skip to main content