Failover Detection Utility – Availability Group Failover Analysis Made Easy


To analyze the root cause of an Availability Group failover, users must perform a coordinated analysis of several logs, including the cluster logs, the SQL Server error logs, and the Availability Group extended event logs. This coordinated analysis can be difficult and requires extensive knowledge of Availability Group internals and of the various logs involved. The failover detection utility is meant to make this analysis easier and to provide a quick root cause analysis for unexpected failovers and/or failed failovers. The failover detection utility currently supports analysis of Availability Groups on Windows only.

Failover Detection Utility

The failover detection utility can be downloaded here. The download consists of the following files.

  1. FailoverDetector.exe – The failover analysis utility.
  2. Configuration.json – The configuration file for the failover analysis utility. The file consists of the following fields.
    {
        "Data Source Path": "F:\\DataLocation\\Data",
        "Health Level": 3,
        "Instances":
        [
            "Replica1",
            "Replica2",
            "Replica3"
        ]
    }
    Data Source Path – The location of the logs from the different replicas in the Availability Group.
    Health Level – The failover policy level configured for the Availability Group. More information on the different failover conditions can be found here.
    Instances – The list of replicas in the Availability Group.
  3. Supporting DLLs – Supporting DLLs for the failover analysis utility.
  4. Data and Reports folders – The data and analysis report folders.

Preparing for Failover Analysis with the Failover Detection Utility

The first step in using the utility is to edit the Configuration.json file to include the location of the data files and the details of the Availability Group being analyzed. For a correct analysis, all replicas in the Availability Group need to be listed in the configuration file.
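
As a quick sanity check before running the tool, the configuration file can be validated with a short script. This is a minimal sketch and not part of the utility itself; the field names follow the sample configuration above, and the 1–5 range for the health level mirrors the Availability Group flexible failover policy levels.

```python
import json

# Hypothetical helper (not part of the utility): sanity-check
# Configuration.json before running the analysis, so a missing field or an
# empty replica list is caught early. Field names follow the sample above.
REQUIRED_FIELDS = ("Data Source Path", "Health Level", "Instances")

def validate_config(text):
    """Parse the configuration JSON and verify the required fields."""
    config = json.loads(text)
    for field in REQUIRED_FIELDS:
        if field not in config:
            raise ValueError(f"Missing field: {field}")
    # The health level mirrors the AG flexible failover policy levels (1-5).
    if not 1 <= config["Health Level"] <= 5:
        raise ValueError("Health Level must be between 1 and 5")
    if not config["Instances"]:
        raise ValueError("At least one replica must be listed under Instances")
    return config

sample = r'''
{
    "Data Source Path": "F:\\DataLocation\\Data",
    "Health Level": 3,
    "Instances": ["Replica1", "Replica2", "Replica3"]
}
'''
config = validate_config(sample)
print(config["Instances"])
```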

The next step is to capture the various logs from each of the replicas and add those under the data folder. The following files are required for the analysis.

  • SQL error logs
  • Always On Availability Groups Extended Event Logs
  • System Health Extended Event Logs
  • System log
  • Windows cluster log

The first three categories of logs can be found in the LOG folder of the SQL Server instance installation path. For example, if you have a SQL Server 2017 instance named "SQL17RTM01" installed on the C drive, you will find these logs under "C:\Program Files\Microsoft SQL Server\MSSQL14.SQL17RTM01\MSSQL\Log". Copy all of the AlwaysOn XEvent, system health XEvent, and SQL ERRORLOG files.

The last two categories of logs can be collected with the PowerShell commands below.

Get-ClusterLog -Node $server.ComputerNamePhysicalNetBIOS -Destination $server.ErrorLogPath

Get-EventLog -ComputerName $server.ComputerNamePhysicalNetBIOS -LogName System | Export-CSV -Path $SystemLogExportPath
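
To run the same collection step against every replica, the two commands above can be generated per node. The following is a minimal sketch (a hypothetical helper with illustrative names and paths, assuming the replica names from Configuration.json); it only builds the command strings rather than executing anything remotely:

```python
# Hypothetical helper (illustrative names and paths): build the per-replica
# PowerShell commands shown above so the log-gathering step can be scripted
# for every node in the availability group.

def collection_commands(node, destination):
    """Return the PowerShell commands that collect the cluster log and the
    Windows System event log for one cluster node."""
    return [
        f"Get-ClusterLog -Node {node} -Destination {destination}",
        f"Get-EventLog -ComputerName {node} -LogName System | "
        f"Export-CSV -Path {destination}\\{node}_system.csv",
    ]

for node in ("Replica1", "Replica2"):
    for cmd in collection_commands(node, r"C:\FailoverLogs"):
        print(cmd)
```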

Organizing the files

The log files collected in the previous step should be placed in a folder with the correct subfolder structure. This folder should be the path specified in the "Data Source Path" field of the configuration file Configuration.json. The following is an example of the subfolder structure, assuming there are two replicas (instances) in the AG, "Replica1" and "Replica2".

+---Data
|   +---Replica1
|   |       AlwaysOn_health_0_131532583880140000.xel
|   |       AlwaysOn_health_0_131532634780070000.xel
|   |       AlwaysOn_health_0_131532680183890000.xel
|   |       AlwaysOn_health_0_131532701164380000.xel
|   |       ERRORLOG
|   |       ERRORLOG.1
|   |       ERRORLOG.2
|   |       ERRORLOG.3
|   |       SQLDUMPER_ERRORLOG.log
|   |       system_health_0_131532583879830000.xel
|   |       system_health_0_131532634779760000.xel
|   |       system_health_0_131532680183430000.xel
|   |       system_health_0_131532701164070000.xel
|   |       Replica1_system.log
|   |       Replica1_cluster.log
|   |
|   +---Replica2
|           AlwaysOn_health_0_131532571347210000.xel
|           AlwaysOn_health_0_131532578226200000.xel
|           AlwaysOn_health_0_131532586348180000.xel
|           AlwaysOn_health_0_131532725682240000.xel
|           ERRORLOG
|           ERRORLOG.1
|           ERRORLOG.2
|           ERRORLOG.3
|           ERRORLOG.4
|           ERRORLOG.5
|           ERRORLOG.6
|           system_health_0_131532571346430000.xel
|           system_health_0_131532578225950000.xel
|           system_health_0_131532586347860000.xel
|           system_health_0_131532725681930000.xel
|           Replica2_system.csv
|           Replica2_cluster.log
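
Laying the files out in this structure can be scripted. The following is a minimal sketch, assuming the logs have already been collected into a staging folder; the helper and the file names used in the usage example are illustrative:

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical helper (file names are illustrative): copy each replica's
# collected log files into the Data\<ReplicaName> layout shown above, which
# is what the utility expects to find under "Data Source Path".

def organize_logs(collected, data_root):
    """collected maps a replica name to a list of its log file paths."""
    for replica, files in collected.items():
        target = Path(data_root) / replica
        target.mkdir(parents=True, exist_ok=True)
        for f in files:
            shutil.copy2(f, target / Path(f).name)

# Usage sketch with throwaway files:
with tempfile.TemporaryDirectory() as tmp:
    staging = Path(tmp) / "collected"
    staging.mkdir()
    errorlog = staging / "ERRORLOG"
    errorlog.write_text("sample log content")
    organize_logs({"Replica1": [errorlog]}, Path(tmp) / "Data")
    print((Path(tmp) / "Data" / "Replica1" / "ERRORLOG").exists())
```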

Executing the Utility

FailoverDetector.exe can be executed with the following flags, or in default mode without any flags.

Default mode (no execution flags defined) – In the default mode, where no parameters are passed to the utility, the analysis tool loads the configuration file, copies the log data from the data source path to the local workspace data folder, examines the files against the configuration settings, performs the analysis, and writes the analysis result to a JSON report file. Users should make sure the credentials under which the tool runs have access to the shared folder defined in Configuration.json. If the tool runs into an access-denied issue or the directory does not exist, it exits with an appropriate error message. After copying the log files to the local tool directory, the tool scans the local data folder and compares it with the AG configuration in Configuration.json. If the log data is incomplete, the tool shows which files are missing and alerts the user that the failover root cause might not be properly identified due to the missing files.

If the analysis is successful, the utility persists the analysis report as a JSON file in the Result directory. The analysis result is NOT presented on the console unless the "--Show" parameter is specified.

--Analyze – When "--Analyze" is specified, the utility loads the configuration file without copying the log data; it assumes the log files have already been copied over. It does everything the default mode does except copy the log data. This option is useful if you already have the data in the local tool execution subdirectories and want to rerun the analysis.

--Show – After analyzing the log data, the utility displays the results in the command console. Additionally, the results are persisted to a JSON file in the results folder.
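
Putting the three modes together, the invocations look like this (illustrative usage only; run from the folder containing the utility):

```shell
# Default mode: copy logs from the configured data source path, then analyze
FailoverDetector.exe

# Analyze logs already present in the local Data folder (no copy step)
FailoverDetector.exe --Analyze

# Analyze and also print the results to the console
FailoverDetector.exe --Show
```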

Currently supported failover root causes

In its current form, the tool can analyze and report the failover reason when a failover is due to one of the following causes.

Planned Failovers

  • Stopping the SQL Server service: An administrator or another process gracefully shuts down the SQL Server instance hosting the primary replica. In this case, if automatic failover is enabled, WSFC is notified to initiate a failover and picks one secondary as the new primary.
  • Shutting down Windows Server: The Windows Server hosting the SQL Server instance with the primary replica is shut down or restarted. In this case, if automatic failover is enabled, WSFC is notified to initiate a failover and picks one secondary as the new primary.
  • Manual failover: An administrator performs a manual failover through Management Studio or through a direct Transact-SQL statement "ALTER AVAILABILITY GROUP <AG name> FAILOVER" or "ALTER AVAILABILITY GROUP <AG name> FAILOVER WITH DATA_LOSS" on the secondary replica that will become the primary replica.

SQL Service Internal Health Issue

  • Too many memory dumps generated on the SQL Server hosting the primary replica: More than 100 SQL Server dumps have occurred on the primary replica since the last SQL Server restart, and there is at least one dump in the last 10 seconds. In this case, sp_server_diagnostics running in SQL Server determines that the SQL Server system component is in an error state and notifies WSFC to initiate a failover.
  • Memory scribbler: SQL Server has a write access violation and the write address is more than 64KB. If there are more than three such memory corruptions since SQL Server was started, sp_server_diagnostics running in SQL Server determines that the SQL Server system component is in an error state and notifies WSFC to initiate a failover.
  • Sick spinlock: After an access violation, a spinlock is marked as sick if it backs off more than three times (the threshold). In this case, sp_server_diagnostics running in SQL Server determines that the SQL Server system component is in an error state and notifies WSFC to initiate a failover.
  • Out of memory: If no memory has been freed in 2 minutes, sp_server_diagnostics running in SQL Server determines that the SQL Server system component is in an error state and notifies WSFC to initiate a failover.
  • Unresolved deadlock: If sp_server_diagnostics detects an unresolved deadlock in the query processing component, it determines that the SQL Server system component is in an error state and notifies WSFC to initiate a failover.
  • Deadlocked scheduler: If sp_server_diagnostics detects deadlocked schedulers in the query processing component, it determines that the SQL Server system component is in an error state and notifies WSFC to initiate a failover.

System Wide Issues

  • Unexpected crash: The SQL Server service was shut down unexpectedly. The resource host service (rhs.exe) does not receive the availability group lease check from SQL Server. This results in an AG lease timeout signal to WSFC, and WSFC initiates a failover.
  • Long dump: SQL Server is creating a dump file. During the process, the threads handling the AG lease are frozen, which may result in a lease timeout. This results in an AG lease timeout signal to WSFC, and WSFC initiates a failover if automatic failover is enabled.
  • Quorum loss: The AG resource is brought offline because quorum is lost. This could be caused by any of a number of issues, which would need to be determined by a closer examination of the cluster logs.
  • Network interface failure: The network interface used to communicate between cluster nodes fails, and the primary and secondary replicas cannot communicate. WSFC initiates a quorum vote and determines a new primary to fail over to.
  • Cluster disk failure: A cluster disk fails, resulting in a communication loss between the cluster nodes. The cluster initiates a quorum vote and determines a new primary to fail over to.
  • Cluster service paused: The cluster service on the primary was paused, leaving the primary unable to communicate with the other nodes in the cluster. The cluster initiates a quorum vote and determines a new primary to fail over to.
  • Node offline: When the primary node is frozen or loses power, WSFC loses its connection to and from the primary. The failover cluster decides to initiate a failover and picks a new primary from the other eligible nodes.
  • High CPU utilization: A system-wide performance issue leaves the SQL Server service unable to respond to the AG lease handler. For example, some process on the node may be consuming 100% CPU, so the lease renewal threads cannot execute. This is a best-effort estimate based on the inputs from the cluster logs.
  • High I/O: A system-wide performance issue leaves the SQL Server service unable to respond to the AG lease handler. For example, some process on the node may be throttling the disks on the system, so the lease renewal cannot be completed. This is again a best-effort estimate based on the inputs from the cluster logs.

The Failover Detection Utility can be used to analyze SQL Server 2012, 2014, 2016, and 2017 Availability Groups, provided the servers are on the latest service packs and cumulative updates.

Update 11/21

A few users have reported the following error while using the utility. We are looking into this and will post an update shortly.

“An unhandled exception of type ‘System.BadImageFormatException’ occurred in FailoverDetector.exe
Could not load file or assembly ‘Microsoft.SqlServer.XEvent.Linq, Version=14.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91’ or one of its dependencies. An attempt was made to load a program with an incorrect format.”

Update 11/27

New updated binaries for the Failover Detector Utility have been uploaded to GitHub. Before running the utility, please install the "Microsoft Visual C++ Redistributable for Visual Studio 2017" from here. On the download page, scroll down to the "Other Tools and Frameworks" section to download the redistributable (x64 version).

You may encounter a "Strong name validation failed" error while using the utility. To resolve it, use sn.exe (available in the zip file) to disable strong name validation with the following command.

sn.exe -Vr <<path to the DLLs>>

 

Sourabh Agarwal
Senior PM, SQL Server Tiger Team
Twitter | LinkedIn
Follow us on Twitter: @mssqltiger | Team Blog: Aka.ms/sqlserverteam

Comments (31)

  1. SkreebyDBA says:

    Sourabh,

    This looks amazing. If my replicas have more than the default 6 SQL Server error log files,do I need them all or just the ones that cover the time of the failover I want to analyze?

    The Tiger Team is awesome!

    Thanks,
    Frank

    1. You don’t need all the error log files. We do not check the number of files, only that the files are present. The utility will provide the best analysis it can with the files you have provided; if the files are not sufficient, it displays a warning.

  2. Kimberlin_D says:

    I have tried this on 2 different computers. It runs through the log collection to the local directory but errors when it begins to analyze.

    “An unhandled exception of type ‘System.BadImageFormatException’ occurred in FailoverDetector.exe
    Could not load file or assembly ‘Microsoft.SqlServer.XEvent.Linq, Version=14.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91’ or one of its dependencies. An attempt was made to load a program with an incorrect format.”

    I’ve tried everything I can think of to get passed the error to no avail. Any ideas? i’d love to get this working as I have already spent a decent chunk of time scripting out my log collections.

    1. Kimberlin, Let me check and get back to you on this.

      1. Hi Kimerlin, could you please register the Dll in the GAC and try again.

        1. zydft says:

          Have the same issue.

          Performed “C:\Program Files (x86)\Microsoft SDKs\Windows\v10.0A\bin\NETFX 4.6.1 Tools”\gacutil -i to all 3 dll available
          Does not help

          1. Hi, please refer to the latest update on the blog to resolve this issue.

        2. SVigil says:

          I had the same issue. Was able to register the DLL (“Assembly successfully added to the cache”) but the error persists.

          1. Hi, please refer to the latest update on the blog to resolve this issue.

        3. Sourabh,
          I’m having the same issue as Kimberlin… Do you have any solution for that rather than register the DLL?
          Thank you!
          Regards!

          1. Hi Gonzalo, please refer to the latest update on the blog to resolve this issue.

        4. Kimberlin_D says:

          I tried using gacutil and still the same error. Thanks for your quick response though.

          1. Hi Kimberlin, please refer to the latest update on the blog to resolve this issue.

            1. Kimberlin_D says:

              Mine is now working as well. Thanks for the updates. For anyone else reading, I had to apply the sn.exe -Vr to each of the XE DLLs explicitly as Cody mentioned. It took me a minute to figure that out. From admin command prompt I entered the directory where SN.exe and the DLLS live (C:\FOData for me) and ran the following
              C:\FOData>sn.exe -Vr Microsoft.SqlServer.XE.Core.dll
              and
              C:\FOData>sn.exe -Vr Microsoft.SqlServer.XEvent.Linq.dll
              After that it worked!
              Thanks again.

        5. Cody Konior says:

          FYI I tried registering the bad DLL in the GAC and I still received the BadImageFormat exception.

          1. Cody, please refer to the latest update on the blog to resolve the issue.

            1. Cody Konior says:

              Thanks it’s now working with the re-download, the VC++ installation, and running sn.exe on the two provided XE DLLs.

  3. SkreebyDBA says:

    Sourabh, I just ran this successfully and the report shows the following message: Failover root cause cannot be determined in this case.

    However, when I look at the System log for the primary replica at time of failover, I see that the cluster lost communication with its file share witness which caused the cluster service to restart. Any idea why the results don’t flag that as root cause?

    Also, is there any reason the log files cannot be copied directly into the Data folder?

    Thank you for making this available. It saved me a ton of time bouncing between logs to determine root cause.

    Frank

    1. Thanks Frank!! The analysis is based on certain decision trees we built in the code, and this sounds like one of the conditions we do not check in the decision tree (the complete list is in the blog). We do plan to collate the feedback and add more decision trees/conditions in the future. Regarding your second question: you can certainly copy all the data to the data folder and run the analysis with the –Analyze flag. In the –Show (or the default) mode, the utility will try to copy the files from the share (mentioned in the config file) to the local data folder.

    2. zydft says:

      Hi Frank,

      Please advise how have you made it running?
      Did you have any dll issue?
      Probably tell us more about your environment.

  4. Hello Sourabh,

    I am getting below error. This execution points to a different DLL than the one mentioned in your post…

    Start to analyze failover root cause.
    Saving results to JSON File.

    Unhandled Exception: System.BadImageFormatException: Could not load file or assembly ‘Newtonsoft.Json, Version=10.0.0.0,
    Culture=neutral, PublicKeyToken=30ad4fe6b2a6aeed’ or one of its dependencies. The module was expected to contain an ass
    embly manifest.
    at FailoverDetector.ReportMgr.SaveReportsToJson()
    at FailoverDetector.Program.Main(String[] args)
    PS C:\Alwayson_Failover>

    1. Hi Vivek, please refer to the latest update on the blog to resolve this issue.

  5. Rob Sewell says:

    cant wait for you to resolve the DLL errors. I have tried running this on several machines and registering the DLLs in the GAC without success, what I have done though is to write a PowerShell function to download the files, connect to an instance, get all of the required info from all of the replicas and run the tool

    You can find it here

    https://gist.github.com/SQLDBAWithABeard/2fc271e1a4be4554d0e14de31a83f889

    1. Hi Rob, please refer to the latest update on the blog to resolve this issue.

      1. Rob Sewell says:

        I can confirm that the exe is working, I have also written a PowerShell wrapper to download and gather all of the required information in case anyone finds it useful
        https://sqldbawithabeard.com/2018/11/28/gathering-all-the-logs-and-running-the-availability-group-failover-detection-utility-with-powershell/

        1. zydft says:

          It works. Thank you a lot for the best step-by-step explanation and automation

  6. Cody Konior says:

    In the example structure you have both Replicax_system.log and Replicax_system.csv. Which is the right file extension?

    Also are there plans to open source this code?

    1. CSV is preferred, but .log will also work. We do have a plan to open source the code in due time.

  7. BigRedDBA says:

    I keep getting:
    Unhandled Exception: System.IO.IOException: The process cannot access the file ‘
    G:\FailoverDetectionUtility\Data\servername\AlwaysOn_health_0_131854776100910
    000.xel’ because it is being used by another process.
    But I’m sure the file isn’t in use, I can rename it or delete it but then get the same error on the next file. Running under administrator in cmd window and I’m owner of the file. Logged off and on again but still get the same error.

    1. Hi, this is most likely because of a conflict between the data locations (the shared folder and the utility's local data cache) and the execution option being used. If you already have the data in the data folder (under the application directory), you can analyze it using the –Analyze switch. In the default mode, the tool tries to copy the files from the shared path to the local data folder.
