New in SSMS - Always On Availability Group Latency Reports

With SQL Server 2012 we introduced Always On Availability Groups, and the Always On Availability Group Dashboard in SQL Server Management Studio (SSMS). This dashboard can be utilized by database administrators to view the current health of an Availability Group and its availability replicas and databases. While the dashboard can be configured to provide information regarding the latency between Primary and Secondary Replicas (can be calculated using the Commit LSN, Sent LSN and harden LSN values), it does not provide insights into the reason for the latency. To understand the reason for latency, requires capture and analysis of Extended Events and Performance Monitor counters. This activity can be time consuming and requires extensive knowledge of the Extended Events associated with Always On.

With the new SSMS 17.4 release, we are introducing the Availability Group Latency data collection and reporting built into the Availability Group dashboard. This feature masks the capture and analysis of the Extended Events from the end user and provides an easy to understand report detailing the time spent during the various phases of the Log Transport process.

What to use it for?

The Latency data collection functionality and the associated reports allows a database administrator to quickly discern the bottleneck in the log transport flow between the Primary and the Secondary replicas of an Availability Group. This feature does NOT answer the question "Is there latency in the Availability Group deployment? " but rather provides a way to understand why there is latency in the Availability Group Deployment. This functionality provides a way to narrow down the potential cause of latency in an Availability Group deployment.

How does it Work?

As seen below, the Availability Group Latency Data Collection functionality can be accessed from the Availability Group Dashboard.

When an admin clicks on the "Collect Latency Data" link, SSMS behind the scenes connects to all the Replicas in the Availability and performs the following tasks.

  1. Checks if SQL Agent is running or not. For the latency Data collection to work, we require SQL Agent to be running on at least one secondary replica and the Primary Replica.
  2. Create a SQL Agent Job with the name "AlwaysOn_Latency_Data_Collection". This job has 7 steps for
    1. Collect Availability Group Information. This information is persisted (till the next SQL Server restart or the next execution of the job) in a table "AGInfo" in the TempDB database.
    2. Create an Extended Event Session for Availability Group data collection on the Replica
    3. Wait for the Extended Event to complete. At present we run the collection for a duration of 2 minutes. While this duration can be modified, we do not recommend users to modify this value. Modifying the duration affects the amount of time the Extended event session runs (and the data it collects) and will have a direct impact on the subsequent data extraction phase.
    4. End Extended Event session
    5. Extract the XML data from the Extended Event session into a temporary table and then parse the XML event information into "DMReplicaEvents" table in the TempDB database.
    6. Populate the results set for the Latency Reports
    7. Drop the Extended Event Session.

Once the job execution has completed on all the replicas, the latency information can be viewed using the new Latency Reports, Primary Replica Latency Report and the Secondary Replica Latency Report. As the name suggests, the primary Replica latency report applies to the Primary Replica, while the Secondary Replica Latency Report applies to all the secondary replicas in the Availability Group.

These reports are available as Standard Reports for the Availability Groups as shown below

Primary Replica Latency Report

The primary replica latency report has 3 sections. The first section provides Replica information for the AG.

Always On Replica Roles (at data collection time)
Availability Group Name Availability Replica Name Role
LatencyDemo SQLNode2 PRIMARY
LatencyDemo SQLNode1 SECONDARY

The Second section provides a graphical view of the avg. Commit time on the Primary Replica and the avg. remote Hardening time for all the secondary replicas. This information is based on the values retuned by the hadr_db_commit_mgr_harden and the hadr_receive_harden_lsn_message. For more information on these extended events refer here.

The third and final section of the report provides numerical values for commit duration, remote hardening duration, time spent in compression and send duration for the log block.

The following values are reported by the Primary Replica Latency Report

  1. Commit Time – Avg. Time to commit a transaction on the Primary Replica.
  1. Remote Harden Time – Time elapsed between sending a log block to a secondary replica and getting the associated harden_lsn message back from the secondary replicas. This would include the following
    1. Time spent in sending the log block to the UCS layer - Reported in the Primary Replica Report
    2. Time spent in sending the Log block to the Secondary over the network – NOT reported in any of the reports.
    3. Time spent in receiving the log block from the UCS layer on the Secondary Replica - Reported in the Secondary Replica Report
    4. Time spent in decompressing the log Block on the secondary replica. - Reported in the Primary Replica Report
    5. Time spent in writing the log block to the disk on the secondary replica - Reported in the Primary Replica Report
    6. Time spent in send the acknowledgement to the UCS Layer on the Secondary Replica - Reported in the Primary Replica Report
    7. Time spent in sending the acknowledgement to the Primary Server over the network – NOT reported in any of the reports
  2. Compression Time – Time spent in compression the log block before sending to the secondary replicas. In SQL Server 2016 and above, log transport to a Synchronous secondary replica is NOT compressed, while the log transport to Async replicas is compressed.
  3. Local Flush Time – Avg. time spent in writing a log block to the LDF file. This value is calculated using the Log_Flush_Start and Log_Flush_Complete Extended events.
  4. Send Time – Time spent in sending the log block to the UCS layer in SQL. This does not include the time spent in the network layer (i.e. while in transit between the primary and secondary replicas)

Secondary Replica Latency Report

The Secondary Replica Latency information provides a breakdown of the time spent on the secondary replica during the log block transport. This report also has 3 sections as shown below.

The following values are reported by the secondary replica latency report

  1. Local Flush Time – Avg. time spent in writing a log block to the LDF file. This value is calculated using the Log_Flush_Start and Log_Flush_Complete Extended events.
  2. Decompression Time – Time spent in decompressing the log block after receiving it from the Primary. In SQL Server 2016 and above, log transport to a Synchronous secondary replica is NOT compressed, while the log transport to Async replicas is compressed.
  3. Receive Time – Time spent in receiving the log block from the UCS layer and enqueuing it for further processing.
  4. Send Time – Time spent in sending the acknowledge message to the UCS layer in SQL. This does not include the time spent in the network layer (i.e. while in transit between the primary and secondary replicas)

Note - In the sample report images above the latency is being caused by the high network roundtrip latency between the primary and the secondary replicas. This demo setup included servers in different geographies (US West and US East), with an average roundtrip latency of 25 millisecond between the two servers.

Permissions Required

The Availability Group Latency data collection feature is only enabled for sysadmin (only Windows Authenticated) accounts and requires that the current user account be an admin on all the replica instances.

Limitations and Restriction

The current implementation of the Latency data collection has the following limitations

  1. This functionality can only be used to collect latency data for one Availability Group at a time.
  2. The names for the TempDB objects used for this functionality are hardcoded. Which means only one collection can run at any time.
  3. Latency data is stored in TempDB tables, which means this information would be lost if the SQL Server instance has restarted since the last capture.
  4. The Collect Latency Data is NOT enabled for SQL Authenticated user or logins, even if the login is an admin on the server.
  5. Given that the functionality is only enabled for Windows Authenticated users, it will not work for Availability Group deployments in a Workgroup or in cross domain deployments where there is no trust between the domains.
  6. This functionality does not work for Distributed Availability Groups.
  7. This functionality requires the following minimum versions for SQL Server
    1. SQL Server 2017 RTM
    2. SQL Server 2016 SP1
    3. SQL Server 2014 SP2

 

Sourabh Agarwal Senior PM, SQL Server Tiger Team Twitter | LinkedIn Follow us on Twitter: @mssqltiger | Team Blog: Aka.ms/sqlserverteam