In this blog we would discuss about behavior of AlwaysOn availability group where the secondary database doesn't come automatically when the primary instance goes down. The secondary database goes into Resolving state. On the failover cluster manager the resource appears in fail state.
While doing some research, we found that if we stop SQL Service manually on primary replica, it fails over to second node only once. Any further attempts of stopping SQL Service (to test auto failover) would not cause failover. We looked into cluster logs and found below.
From the Cluster Log:
00001b50.00001e54::2012/01/31-03:22:05.170 INFO [RCM] HandleMonitorReply: TERMINATERESOURCE for 'KEXPServices_Staging', gen(3) result 0.
00001b50.00001e54::2012/01/31-03:22:05.170 INFO [RCM] TransitionToState(KEXPServices_Staging) [Terminating to Failed]—>Failed.
00001b50.00001e54::2012/01/31-03:22:05.170 INFO [RCM] rcm::RcmGroup::UpdateStateIfChanged: (KEXPServices_Staging, Pending --> Failed)
00001b50.00000cb4::2012/01/31-03:22:05.185 WARN [RCM] Not failing over group KEXPServices_Staging, failoverCount 2, failover threshold 4294967295, nodeAvailCount 1.
00001b50.00000cb4::2012/01/31-03:22:05.185 INFO [RCM] Will retry online of KEXPServices_Staging in 3600000 milliseconds.
We looked further into resource in failover cluster manager and found the cause as below.
By default the failover properties for the Resource group is set as follows.
- Maximum Failures in the specified period was set to 1
- Period (Hours) was set to 6
This means that when a failover happens and the resource fails to come online, the cluster service would try to get the resource online only once. It would try only once in 6 hours. Since the cluster service would check the log on how many errors have occurred in the past, it verifies the count set for “Maximum Failures in the specified period” and hence would fail if the number of failures equals or exceeds.
- A manual failover will work here.
- Hence in order for us to get this work automatically, is to have the value set to a higher number for “Maximum Failures in the specified period”.
- Have these values set as follows. This means that within an hour the cluster service will try to get the group online 60 times. In an ideal production environment we don’t expect so many failovers to happen.
1. Maximum Failures in the specified period: set to 60
2. Period (Hours): set to 1
Here is the screenshot showing the properties
DISCLAIMER: The number shown in the blog are for testing purpose only. Please perform testing in your environment before implementing the values.
Arvindh Kalidasan - Support Engineer,Microsoft GTSC
Balmukund Lakhani- Technical Lead, Microsoft GTSC