AlwaysON - HADRON Learning Series: Automated Failover Behaviors (Denali - Logging History Information, FCI and Default Health Capture, sp_server_diagnostics)

Quite an encompassing title I will agree but the features all work together to build a really nice product.  I had this tucked away during beta and thought it would be helpful to post.

SQL Server AlwaysON and "Denali" has undergone a significant upgrade to the way the cluster resource (for an Availability Group or a Failover Cluster Instance) detects failover conditions. Instead of the previous behavior of Select @@SERVERNAME, some timeouts and retries, the SQL Server resource DLL executes an internal procedure (sp_server_diagnostics) that returns information on intervals to the resource DLL. The information involves memory, I/O, query processor, scheduling and other behavior and situational information. Based on this information and the failover setting (new private property for the SQL Server (Failover Cluster Instance - FCI) or Availability Group (AG) resource the failover decisions are made.

The easiest way to think of this is the Internet Explorer security setting slider. As you increase the security protection more factors are considered. The following figure shows an example from the build I have installed. For each AG or FCI instance the resource properties contain the "FailoverConditionLevel". The level of detection encompasses the lesser level detections as well. For a very sensitive AG or instance you set the detection to max (5) or you reduce sensitive to environmental changes by setting the level downward, perhaps (3).

image

sp_server_diagnostics (https://blogs.msdn.com/b/psssql/archive/2012/03/08/sql-server-2012-true-black-box-recorder.aspx)
The resource DLL maintains a persistent connection to the fully preemptive worker stream returned from sp_server_diagnostics. The figure below points out some of the sp_server_diagnostics behavior. It is a special note that the information returned from the procedure as well as the failover diagnostics are logged in a series of XEL rollover files. ( \LOG\*Diag*.xel)

Note: This procedure was not designed for use outside the SQL Server product line.

image

It is also of note that the sp_server_diagnostics information is saved in the default health session for ALL SKUS except SQL Server Express (MSSQL\Log directory) . YIPPEE: The DBA or CSS can walk up to any Denali server and look at the history of memory, I/O. QP. blocking, etc…for the past ## or hours. Something we have never had built into any previous version of SQL Server.  Not only can you now answer the question in a clustered (FCI) instance or HADRON/AlwaysOn AG move but on a stand-alone server.

Failover and Availability Modes
SQL Server Always on is designed to allow 3 specific failover models.

  • Automatic Failover - Sync Replica with Automatic Failover
  • High Performance - Async Replica with Manual Failover
  • High Safety - Sync Replica with Manual Failover

Each AG can be configured for different failover condition level tolerances.  The location of the principle and configuration of the replica combine to establish the type of failover for the AG. Take the following configuration for example.

· A: Synchronous commit with Auto failover

· B: Synchronous commit with Auto failover

· C: Synchronous commit with Manual failover

· D: Asynchronous commit with Manual failover

The behavior for each replica depends on 2 configurations levels and the lowest common setting. I numbered them 1 and 2 specifically because it works in my example formula well to help describe the matrix.

The following tables outline the failover and availability modes that can be used in the example configuration outlined above.

Failover Mode: Manual(1) or Automatic(2)

Scenario

A

B

C

D

1

Primary

Sync

Sync

Async

2

Sync

Primary

Sync

Async

3

Sync

Sync

Primary

Async

4

Async

Async

Async

Primary

 

Availability Mode: Async(1) or Sync (2)

Scenario

A

B

C

D

1

Primary

Automatic

Manual

Manual

2

Automatic

Primary

Manual

Manual

3

Manual

Manual

Primary

Manual

4

Manual

Manual

Manual

Primary

I use the following formula: (Take your current primary and determine how the Primary AG is configured and then do the same on the secondary you are evaluating.)

Failover Mode = min( Primary Failover Mode, Secondary Failover Mode)

Availability Mode = min( Primary Avail Mode, Secondary Avail Mode)

Let's start with Primary = A and Secondary = B

min (A=Automatic(2), B=Automatic(2)) = 2= Lowest Failover Mode (Automatic)

min(A=Sync(2), B=Sync(2)) = 2 = Lowest Availability Mode (Sync)

A and B can be Primary and Secondary with Sync / Automatic failover capabilities.

Now, let's look at A and C:

min(A=Automatic(2), C=Manual(1)) = 1 = Lowest Failover Mode (Manual)

min(A=Sync(2), C=Sync(2)) = 2 = Lowest Availability Mode (Sync)

A and C are Manual failover only with sync capabilities. This means A won't fail to C automatically and C can't fail back to A automatically. This is a Manual failover mode channel.

You can apply the same rules to any of the combinations. Find the lowest setting between the Primary and Secondary target and apply it as the behavior outcome. For example when D becomes the primary the lowest is 1 and 1 or Async with Manual failover across all replica targets.

 

A Failover Clustered Instance of SQL Server can host an availability group but it CAN NOT be set to automatic mode. The FCI failover takes precedence and the lowest common failover mode is (Manual) for any of its replica pairings.

 

Bob Dorr - Principal SQL Server Escalation Engineer