SQL 2012 AlwaysOn Availability groups Automatic Failover doesn’t occur or does it – A look at the logs

 

I was dealing with an issue where we had an AlwaysOn availability group with 2 replicas configured for Automatic failover. There was a 30 second glitch where folks could not connect to the Database on the primary and automatic failover did not kick in. Well, that was our initial impression at least. The purpose of this post is to expose you to the different logs available in troubleshooting AlwaysOn Availability group issues, not so much on this particular problem itself.

Symptoms on Primary:  Connections failed for a brief moment with the error below and then was all good.

Error: 983, Severity: 14, State: 1.

Unable to access database 'HADB' because its replica role is RESOLVING which does not allow connections. Try the operation again later.

So there were 3 questions to answer:

a. What was the reason for the error?

b. Why didn’t automatic failover kick in? Or did it?

c. Was it supposed to fail over to the other node?

First of all we need to understand the FailureConditionLevel which controls when failover occurs both from an SQL FCI (failover cluster) or AlwaysOn Availability group Automatic failover perspective.  For detailed information regarding Failover Policies for Failover Cluster Instances, refer to this article:  https://msdn.microsoft.com/en-us/library/ff878664.aspx

 

clip_image002

 

In my case the FailoverConditionLevel is set to 5 (Default is 3).  This setting can be altered with the following TSQL script:

clip_image004

If I look at the article referenced above, I notice that the FailoverConditionLevel has the following attributes:

5

Indicates that a server restart or failover will be triggered if any of the following conditions are raised:

  • SQL Server service is down.
  • SQL Server instance is not responsive (Resource DLL cannot receive data from sp_server_diagnostics within the HealthCheckTimeout settings).
  • System stored procedure sp_server_diagnostics returns ‘system error’.
  • System stored procedure sp_server_diagnostics returns ‘resource error’.
  • System stored procedure sp_server_diagnostics returns ‘query_processing error’.

 

One thing to note here is that the Cluster action is only if any of the subsystems report an “error”, no action is taken on a warning.

So effectively what happens is the following:

  •          Cluster service runs LooksAlive check
  •          Sp_server_diagnostics results sent to Resource Monitor DLL
  •          Resource Monitor DLL detects any error state and notifies the cluster service
  •          Cluster Service takes the resource offline
  •          Notifies SQL Server to issue an internal command to take the availability group offline.
  • There is also the whole concept of a lease that is explained here:

https://blogs.msdn.com/b/psssql/archive/2012/09/07/how-it-works-sql-server-alwayson-lease-timeout.aspx

 

In order to understand this better I attempted and was able to reproduce the scenario.

I then looked at the Cluster Diagnostic extended event Log, the AlwaysOn extended event log, the cluster log, and the SQL Server error log to try to piece together what exactly happened.

Cluster Diagnostic Extended Event Log:

We see from this log that the System component did throw an error. Which equates to there were N number of dumps created and or a spinlock orhpaned after an Access violation or a detected Memory scribbler condition

The Cluster Diagnostics Logs are located in the Log directory as shown below and are different log files than the cluster log itself.

They are of the format : ServerName_InstanceName_SQLDIAG_*.xel

clip_image005

 

As we can see below, you see the component_health_result indicate that the system component of sp_server_diagnostics returned an error, when the resource monitor than interpretted as a Failure due to the FailureConditionLevel set, and propagated the resource “NOTHEALTH” to the cluster service which triggered the LooksAlive check to return “not alive” or false status.

clip_image006

AlwaysOn Extended Event log

The AlwaysOn Health Extended Event logs cover the Availability Group related diagnostics such as State changes for the Group or Replica or Databases, errors reported, lease expiration and any Availability Group Related DDL that is executed. The format of the logs is: AlwaysOn_health*.xel

 

clip_image008

 

If we look at the log snippet below, we see that the AG lease expired, and that triggered us to attempt a failover which in turn changes the state from PRIMARY_NORMAL to RESOLVING_NORMAL.

 

clip_image010

Cluster Log

Note: The times are in UTC so you have to convert them to match up with the other log files.

To generate the log: How to Generate a Cluster Log on Windows 2008 onwards

00006918.00015978::2013/04/03-18:54:37.251 INFO [RES] SQL Server Availability Group: [hadrag] SQL Server component 'system' health state has been changed from 'warning' to 'error' at 2013-04-03 11:54:37.247

00006918.00014ef4::2013/04/03-18:54:37.970 ERR [RES] SQL Server Availability Group: [hadrag] Failure detected, the state of system component is error

00006918.00014ef4::2013/04/03-18:54:37.970 ERR [RES] SQL Server Availability Group< 2012AG>: [hadrag] Availability Group is not healthy with given HealthCheckTimeout and FailureConditionLevel

00006918.00014ef4::2013/04/03-18:54:37.970 ERR [RES] SQL Server Availability Group< 2012AG>: [hadrag] Resource Alive result 0.

00006918.00014ef4::2013/04/03-18:54:37.970 ERR [RES] SQL Server Availability Group< 2012AG>: [hadrag] Resource Alive result 0.

00006918.00014ef4::2013/04/03-18:54:37.970 WARN [RHS] Resource 2012AG IsAlive has indicated failure.

00019d20.00000e5c::2013/04/03-18:54:37.970 INFO [RCM] HandleMonitorReply: FAILURENOTIFICATION for '2012AG', gen(0) result 1.

00019d20.00000e5c::2013/04/03-18:54:37.970 INFO [RCM] TransitionToState(2012AG) Online-->ProcessingFailure.

00019d20.00000e5c::2013/04/03-18:54:37.970 INFO [RCM] rcm::RcmGroup::UpdateStateIfChanged: (2012AG, Online --> Failed)

00019d20.00000e5c::2013/04/03-18:54:37.970 INFO [RCM] resource 2012AG: failure count: 1, restartAction: 2.

00019d20.00000e5c::2013/04/03-18:54:37.970 INFO [RCM] Will restart resource in 500 milliseconds.

 

- If you see the “restart action” highlighted above, a restart is attempted on the current node first before failing over to the other node and in this case the restart is successful so it doesn’t really fail over to the other node. If we take a look at the cluster Availability group Resource Properties, you can confirm that the Restart action does indicate that a restart will be attempted on the current node first

- clip_image012

 

00019d20.00019418::2013/04/03-18:55:06.079 INFO [RCM] TransitionToState(2012AG) DelayRestartingResource-->OnlineCallIssued.

00019d20.00019418::2013/04/03-18:55:06.079 INFO [RCM] HandleMonitorReply: ONLINERESOURCE for '2012AG', gen(1) result 997.

00019d20.00019418::2013/04/03-18:55:06.079 INFO [RCM] TransitionToState(2012AG) OnlineCallIssued-->OnlinePending.

00006918.0001f1c0::2013/04/03-18:55:07.298 INFO [RHS] Resource 2012AG has come online. RHS is about to report status change to RCM

 

SQL Server Errorlog

More details on Lease Expiration: https://blogs.msdn.com/b/psssql/archive/2012/09/07/how-it-works-sql-server-alwayson-lease-timeout.aspx 

2013-04-03 11:54:43.59 Server Error: 19407, Severity: 16, State: 1.

2013-04-03 11:54:43.59 Server The lease between availability group '2012AG' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.

2013-04-03 11:54:43.64 Server AlwaysOn: The local replica of availability group '2012AG' is going offline because either the lease expired or lease renewal failed. This is an informational message only. No user action is required.

2013-04-03 11:54:43.64 Server The state of the local availability replica in availability group '2012AG' has changed from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

2013-04-03 11:54:43.84 spid31s The availability group database "HADB" is changing roles from "PRIMARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.

2013-04-03 11:54:43.84 spid27s AlwaysOn Availability Groups connection with secondary database terminated for primary database 'HADB' on the availability replica with Replica ID: {89c5680c-371b-45b9-ae19-2042d8eec27b}. This is an informational message only. No user action is required.

n The error below can occur if the Local Log records are hardened but quorum is lost with the cluster so the remote harden cannot be completed.

2013-04-03 11:54:45.16 spid58 Remote harden of transaction 'user_transaction' (ID 0x00000000001ee9e2 0000:000006eb) started at Apr 3 2013 11:54AM in database 'HADB' at LSN (37:28:204) failed.

2013-04-03 11:54:46.42 spid31s Nonqualified transactions are being rolled back in database HADB for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.

n This phase is after the “restart action” as seen in the cluster log where we are attempting a restart on the same node before failing over to the other node.

2013-04-03 11:55:06.25 spid58 AlwaysOn: The local replica of availability group '2012AG' is preparing to transition to the primary role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.

2013-04-03 11:55:07.27 spid58 The state of the local availability replica in availability group '2012AG' has changed from 'RESOLVING_NORMAL' to 'PRIMARY_PENDING'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

2013-04-03 11:55:07.55 Server The state of the local availability replica in availability group '2012AG' has changed from 'PRIMARY_PENDING' to 'PRIMARY_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

 

So in answering the 3 prior questions I had with the logs

a. The reason we got into this state was the system component reported an error ( was a bunch of exceptions), we won’t go into those here

b. Failover was attempted, but initial attempt is to restart on the same node and it did end up coming online on that node.

c. No, it should not have failed over to the other node

Hope the exposure to these logs is helpful in troubleshooting AlwaysON Availability group issues

 

-Denzil Ribeiro – Sr. SQL Premier Field Engineer