How It Works: HealthCheckTimeout Interval Activities

As I wrote my recent blog posts and did more research, I found that it would be helpful to highlight the HealthCheckTimeout behavior in more detail.

Always On FCI (Failover Cluster Instance) vs Non-FCI Installations Documentation

The first thing that I need to point out is a subtle wording difference in Books Online and other documentation that is easy to overlook.

When the documentation references Always On FCI, it means a clustered instance of SQL Server, not an AG on a standalone instance of SQL Server.

This is important because defaults such as HealthCheckTimeout are documented differently. For an FCI the default is 60 seconds, but for a non-FCI AG the default is 30 seconds. SQL Server Books Online outlines this, but the distinction is easy to miss unless you pay careful attention to the FCI reference.

Only 1 sp_server_diagnostics Execution Per Instance

The resource dll (hadrres.dll) hosts the SQL Server failover detection logic for the Availability Group (AG) resource. The logic is designed to execute only a single instance of sp_server_diagnostics, no matter how many AGs are in use by the SQL Server instance. This is where the 'How It Works' comes into play.

As the AGs are brought online (or the HealthCheckTimeout is adjusted for an AG), the resource dll calculates the smallest health check timeout value.

Smallest Timeout = max(5, min(All Active AGs for Same SQL Server Instance)/3 )

The logic looks at ALL the AG HealthCheckTimeout values for the same SQL Server Instance. It takes the smallest of these values and divides the value by 3, making sure the interval is no less than 5 seconds.
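The calculation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the hadrres.dll code: the function name and the choice to work in seconds with ordinary division are my assumptions, and the real rounding behavior is not documented here.

```python
# Illustrative sketch only; names and rounding are assumptions, not hadrres.dll code.
MIN_INTERVAL_SECONDS = 5


def diagnostics_interval(health_check_timeouts):
    """Return max(5, min(timeouts) / 3) for the active AGs of one instance.

    health_check_timeouts: HealthCheckTimeout values, in seconds, for every
    active AG on the same SQL Server instance.
    """
    if not health_check_timeouts:
        raise ValueError("at least one active AG is required")
    smallest = min(health_check_timeouts)
    return max(MIN_INTERVAL_SECONDS, smallest / 3)


if __name__ == "__main__":
    # Three AGs on one instance: the 30 second timeout drives the interval.
    print(diagnostics_interval([30, 60, 90]))
    # A very small timeout is clamped to the 5 second floor.
    print(diagnostics_interval([12]))
```

With timeouts of 30, 60 and 90 seconds the interval is 30 / 3 = 10 seconds. With a single timeout of 12 seconds, 12 / 3 = 4 falls below the floor, so the interval is 5 seconds.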

Using the calculated interval, the hadrres health worker establishes a persistent connection to SQL Server and invokes sp_server_diagnostics <<interval>>. As the result sets are returned, the health worker processes the results and broadcasts the updates to the active AG resource health monitors.

This allows a single result set stream to work at the interval derived from the smallest HealthCheckTimeout, while each AG (FIsHealthy) can still honor the HealthCheckTimeout established for that AG.
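To make the fan-out idea concrete, here is a small Python model of one result stream feeding several per-AG monitors. Everything in it is hypothetical: the class, the method names, and the rule that a monitor is unhealthy once its own timeout passes without an update are my assumptions for illustration, not the actual FIsHealthy implementation.

```python
# Hypothetical model; class and method names are invented for illustration.
class AgHealthMonitor:
    def __init__(self, name, timeout_seconds, now=0.0):
        self.name = name
        self.timeout_seconds = timeout_seconds
        self.last_update = now  # time the last diagnostics result arrived

    def on_result(self, now):
        """Record that the shared health worker delivered a result."""
        self.last_update = now

    def is_healthy(self, now):
        """Assumed rule: healthy while an update arrived within this AG's timeout."""
        return (now - self.last_update) <= self.timeout_seconds


def broadcast(monitors, now):
    """One result set from the single stream updates every AG monitor."""
    for monitor in monitors:
        monitor.on_result(now)


if __name__ == "__main__":
    monitors = [AgHealthMonitor("AG1", 30), AgHealthMonitor("AG2", 60)]
    broadcast(monitors, now=10.0)
    # No further results arrive; check both monitors 35 seconds later.
    for m in monitors:
        print(m.name, m.is_healthy(now=45.0))
```

In this model the last result arrives at 10 seconds. At 45 seconds, AG1 (30 second timeout) has gone 35 seconds without an update and reports unhealthy, while AG2 (60 second timeout) is still inside its own window.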

2 Instances of sp_server_diagnostics for Same SQL Server Instance

I just documented that a single copy of sp_server_diagnostics is used for all AGs on the same SQL Server Instance. Then why would I add this section?

When the HealthCheckTimeout is changed (a new AG is brought online or the HealthCheckTimeout property is updated), the resource dll's logic establishes a second connection and executes sp_server_diagnostics on it when a smaller timeout needs to be established. As soon as the new connection is properly receiving results, the old connection is closed. The two executions overlap only for the short window needed to handle the interval change.
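The hand-off described above might look like the following Python sketch. The connection class is a stand-in with invented names, not a real SQL Server client, and the sketch simply keeps the existing connection when the new interval is not smaller, because that case is not covered here.

```python
# Hypothetical sketch; FakeDiagnosticsConnection stands in for the real connection.
class FakeDiagnosticsConnection:
    def __init__(self, interval):
        self.interval = interval  # sp_server_diagnostics interval in seconds
        self.open = True
        self.receiving = False

    def wait_for_first_result(self):
        """Pretend the first result set has arrived on this connection."""
        self.receiving = True

    def close(self):
        self.open = False


def required_interval(timeouts):
    """max(5, min(timeouts) / 3), in seconds."""
    return max(5, min(timeouts) / 3)


def apply_timeout_change(current, timeouts):
    """Return the connection to use after the set of AG timeouts changes."""
    needed = required_interval(timeouts)
    if needed >= current.interval:
        return current  # assumption: no swap unless a smaller interval is needed
    replacement = FakeDiagnosticsConnection(needed)
    replacement.wait_for_first_result()  # both connections exist in this window
    current.close()  # old stream closed only after the new one is receiving
    return replacement


if __name__ == "__main__":
    old = FakeDiagnosticsConnection(20)          # single AG with a 60 second timeout
    new = apply_timeout_change(old, [60, 30])    # an AG with 30 seconds comes online
    print(new.interval, old.open, new.open)
```

The ordering is the point of the sketch: the replacement is started and confirmed to be receiving before the old connection is closed, so there is never a gap with no diagnostics stream.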

FCI Takes Precedence

Please keep in mind that if the instance of SQL Server is clustered (FCI), the FCI behavior takes precedence over the AG behaviors. The AGs will not automatically fail over when associated with an FCI.

Bob Dorr - Principal SQL Server Escalation Engineer