Enhance AlwaysOn Failover Policy to Test SQL Server Database Data and Log Drives

In SQL Server 2012 and 2014, AlwaysOn health diagnostics detect the health of the SQL Server process in several ways. However, no health detection is performed on the accessibility or viability of the databases defined in AlwaysOn availability groups. If the disk hosting the availability group database or log files is lost, AlwaysOn health does not detect this event, and application runtime errors accessing the database will ensue. Loss of the drive or errors accessing the drive that host your availability database data and log files will affect access to your production data.

As per the Flexible Failover Policy for Automatic Failover of an Availability Group (SQL Server)

Damaged databases and suspect databases are not detected by any failure-condition level. Therefore, a database that is damaged or suspect (whether due to a hardware failure, data corruption, or other issue) never triggers an automatic failover.

NOTE SQL Server 2016 enhances the AlwaysOn health diagnostics with database health detection. If AlwaysOn availability group database health detection has been selected for your availability group and an availability database transitions out of the ONLINE state (sys.databases.state_desc), the entire availability group will failover automatically if configured to do so. For more information see MSDN topic 'CREATE AVAILABILITY GROUP' and section on ‘DB_FAILOVER’

CREATE AVAILABILITY GROUP (Transact-SQL)

You can enhance the availability of your databases by detecting disk health. Add a generic script resource to your availability group resource group that does basic read or write tests against the drives hosting your availability group database data and log files. The following describes how to add a generic script resource as a dependency to your availability group resource to enhance AlwaysOn health detection with a basic disk health check.

Use a generic script resource to do basic health check on SQL Server data and log drives

Here is the high level description of the implementation of the generic script resource to detect availability group database drive health.

Add a generic script clustered resource to the availability group resource group. Make the availability group resource dependent on the generic resource script. That way, if the script resource reports IsAlive failure, Windows Cluster will attempt to restart or failover the availability group resource if configured to do so.

The generic script clustered resource IsAlive test creates a text file in the specified Data drive location and a text file in the specified Log drive location. If the file exists, the script will overwrite it.

Attached to this blog is a zipped file, GenericScript_SQLIsAlive.zip, containing

sqlisalive.vbs <- The generic script written in Visual Basic Script and implements Windows Cluster IsAlive

Add_SQLIsAliveScript.ps1 <- The PowerShell script which adds the generic script resource to your availability group resource group and sets the availability group dependency on your generic script resource

readme.txt <- Step by step instructions for implementing the generic script resource and additional instructions on how to test the script.

Implement the generic script resource

I Configure the generic script, sqlisalive.vbs

Data and Log Drive Paths Currently, the generic script is configured to test the following drive and paths: c:\temp\data and c:\temp\log. For testing purposes, create these paths on the local drive of each replica (primary and automatic failover partner secondary). Later, you can change them to the appropriate drive and paths where your database data and log files live.

    DataDriveFile="c:\temp\data\ScriptFileData.txt"
    LogDriveFile="c:\temp\log\ScriptFileLog.txt"

II Configure and execute the Powershell script to deploy the generic script to your availability group

NOTE This generic script only implements IsAlive which runs every 60 seconds.

1 Ensure your availability group has two replicas configured for automatic failover.

2 Copy the generic script file to an identical local storage location like 'C:\temp\sqlisalive.vbs' on both servers whose replicas are configured for automatic failover.

3 Create the paths for the health check, c:\temp\data and c:\temp\log.

4 The Add_SQLIsAliveScript.ps1 PowerShell script adds the generic script resource to your availability group and adds a dependency to your availability group resource, on the generic script resource. In the Add_SQLIsAliveScript.ps1, change the following variables.

     Set $ag to your availability group name
     Set $listener to your availability group listener name. If your availabiltiy group does not have a listener, set $listener to ""
     Set $scriptfilepath to the path and file name of your sqlisalive.vbs script

5 On the server hosting the primary replica, run the PowerShell script Add_SQLIsAliveScript.ps1 to add the generic script resource to your availability group resource group.

6 Launch Failover Cluster Manager and review the availability group resource group to confirm addition of the generic script resource to the availability group resource group. The generic script should appear and come online in the availability group resource group under the Resources tab.

7 Confirm that the dependency has been created in the availability group resource, on the generic script resource.

NOTES

The attached readme.txt file has instructions on how to test the script resource to ensure that it can failover your availability group resource.

Diagnose failure detection by generic script resource

Generate the cluster log for the node hosting the primary replica and search for 'Data Drive Create File' or 'Log Drive Create File' - to locate success or failure report of the generic script resource IsAlive:

00001b04.00002924::2015/12/07-17:16:41.798 INFO  [RES] Generic Script <sqlisalive>: Entering IsAlive
00001b04.00002924::2015/12/07-17:16:41.801 INFO  [RES] Generic Script <sqlisalive>: Data Drive Create File Succeeded
00001b04.00002924::2015/12/07-17:16:41.801 INFO  [RES] Generic Script <sqlisalive>: Log Drive Create File Succeeded

Or for example, you set the \Data folder to read only:

00001b04.00002924::2015/12/07-17:17:41.801 INFO  [RES] Generic Script <sqlisalive>: Entering IsAlive
00001b04.00002924::2015/12/07-17:17:41.804 INFO  [RES] Generic Script <sqlisalive>: Data Drive Create File Succeeded
00001b04.00002924::2015/12/07-17:17:41.804 INFO  [RES] Generic Script <sqlisalive>: Data Drive Create File Failed
00001b04.00002924::2015/12/07-17:17:41.804 INFO  [RES] Generic Script <sqlisalive>: Permission denied
00001b04.00002924::2015/12/07-17:17:41.804 ERR   [RES] Generic Script <sqlisalive>: 'IsAlive' script entry point returned FALSE.'
00001b04.00002924::2015/12/07-17:17:41.804 INFO  [RES] Generic Script <sqlisalive>: Return value of 'IsAlive' script entry point caused HRESULT to be set to 0x00000001.
00001b04.00002298::2015/12/07-17:17:41.804 WARN  [RHS] Resource sqlisalive IsAlive has indicated failure.

GenericScript_SQLIsAlive.zip