Experiencing Data Latency for Many Data Types – 6/3 (Second Occurrence) – Resolved


Final Update: Monday, 6/8/2015 18:39 UTC

We have confirmed that all systems are back to normal with no customer impact as of 6/8, 09:50 UTC. Our logs show the incident started on 6/3, 19:15 UTC, and that during the approximately 110 hours it took to completely resolve the issue, some customers experienced data latency of up to 7 hours and temporary gaps in data that were later backfilled.

Root Cause: The failure was due to performance issues in a new loader topology.  These topologies have been reverted to a previous version and we have not seen the problem reappear.
Lessons Learned: This issue did not appear in any of our nonproduction environments, and we are actively investigating to understand the full root cause.  We are also making significant changes to our backlog processing topology that should allow greater speed and accuracy should we need it again in the future.
Incident Timeline:  

  • Initial Latency: 24 hours, 35 minutes - 6/3, 19:15 UTC through 6/4, 19:50 UTC - all data current at this time, with 3 known, contained windows for backlog processing
  • Backlog Processing: 86 hours - 6/4, 19:50 UTC through 6/8, 09:50 UTC - all backlogged data has been processed
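
As a quick check on the durations above (using the timestamps exactly as stated in this post; the snippet is purely illustrative and not part of the Application Insights tooling), the figures add up as follows:

    # Quick check of the durations stated in the timeline above.
    # Illustrative only; timestamps copied from this post.
    from datetime import datetime

    fmt = "%m/%d/%Y %H:%M"
    incident_start = datetime.strptime("06/03/2015 19:15", fmt)  # initial latency begins
    backlog_start  = datetime.strptime("06/04/2015 19:50", fmt)  # data current; backlog processing begins
    resolved       = datetime.strptime("06/08/2015 09:50", fmt)  # all backlogged data processed

    print(backlog_start - incident_start)   # 1 day, 0:35:00   -> 24 hours, 35 minutes
    print(resolved - backlog_start)         # 3 days, 14:00:00 -> 86 hours
    print((resolved - incident_start).total_seconds() / 3600)   # ~110.6 hours end to end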

We understand that customers rely on Application Insights as a critical service and apologize for any impact this incident caused.

-Application Insights Service Delivery Team


Update: Sunday, 6/7/2015 18:08 UTC

We have identified 3 separate periods that required individual processing and are working through them.  Reprocessing is complete for 2 of the periods and in progress for the third.  All other data is current, but customers may see gaps in data for the in-progress period until reprocessing is completed.

Period Start    Period End      Reprocessing Status
6/2 13:00       6/2 21:30       In Progress
6/4 16:00       6/4 20:30       Complete
6/5 14:00       6/5 22:30       Complete

Work Around: none
Next Update: Before 6/8/2015 17:00 UTC

-Application Insights Service Delivery Team


Update: Saturday, 6/6/2015 16:59 UTC

The catch-up process was paused for a time to implement some additional tuning specific to these catch-up loaders.  We have finished the tuning and are re-engaging the catch-up process.

We apologize for any inconvenience this may have caused.

Work Around: none
Next Update: Before 6/7/2015 17:00 UTC

-Application Insights Service Delivery Team


Update: Saturday, 6/6/2015 00:25 UTC

We have deployed the planned optimizations for the loader topologies, and we are seeing the loader throughput rate increase to manageable levels.

Additional loader topologies have been deployed to handle the backlogged data, and we are processing it as quickly as the secondary topology will allow while maintaining a focus on the most-recent data.

Work Around: none
Next Update: Before 6/6/2015 17:00 UTC

-Application Insights Service Delivery Team


Update: Friday, 6/5/2015 20:39 UTC

We are currently seeing approximately 7 hours of latency due to the slow loader topology.  We are investigating several options to optimize throughput.

Within the next 2 hours, we expect to complete evaluation of the optimization strategies and will ensure that the most-recent data is loaded with priority.  Any backlogged data will be processed on a lower-priority thread.
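
To illustrate the general idea of favoring the most-recent data while backlogged data is drained only when nothing newer is waiting, here is a minimal sketch.  It is not the actual loader code; the queue names and the process_batch function are hypothetical.

    # Minimal sketch of prioritized loading: most-recent data first, backlog only when idle.
    # Hypothetical names; not the actual loader topology.
    import queue
    import threading

    incoming = queue.Queue()   # most-recent telemetry batches
    backlog  = queue.Queue()   # batches from the latency window, reprocessed later

    def process_batch(batch):
        """Index the batch and make it available for queries (placeholder)."""
        ...

    def worker(stop: threading.Event):
        while not stop.is_set():
            try:
                batch = incoming.get(timeout=0.1)   # always prefer the most-recent data
            except queue.Empty:
                try:
                    batch = backlog.get_nowait()    # fall back to backlog only when idle
                except queue.Empty:
                    continue
            process_batch(batch)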

Work Around: none
Next Update: Before 17:00 UTC

-Application Insights Service Delivery Team


Update: Friday, 6/5/2015 17:07 UTC

Root cause has been isolated to instability and slow write times in our loader topology, which was causing delays in data being indexed and made available to customers.  To address this issue, we adjusted and redeployed the loader topology.  Indexing is now working as expected.  Some customers may still experience data latency, and we are cautiously optimistic about the progress in recovering the ongoing latency.

Work Around: none
Next Update: Before 21:00 UTC

-Application Insights Service Delivery Team


Update: Friday, 6/5/2015 12:38 UTC

We have recovered the application clusters and have been working through the backlog at an accelerated rate.  To address the backlog of data, we have been carefully monitoring and making adjustments where needed overnight.  The system is now working as expected.  Some customers may continue to experience data latency, and we estimate 6 hours before all residual data latency is addressed.

Work Around: none
Next Update: Before 17:00 UTC

-Application Insights Service Delivery Team


Update: Friday, 6/5/2015 07:57 UTC

The upstream issue with Azure has been mitigated, and we are now working on recovering our application clusters to fully resolve the impact from that issue.  Some customers will continue to see latency across multiple data types.  Recovery of the application clusters is proceeding according to expectations.  As soon as the recovery is completed, we will start work on recovering the latency.

Work Around: none
Next Update: Before 10:00 UTC

-Application Insights Service Delivery Team


Update: Friday, 6/5/2015 05:56 UTC

Our DevOps team continues to investigate issues within Application Insights. We are currently impacted by an issue with Azure that is causing additional latency and data access issues. Some customers continue to experience data latency and intermittent data access issues.

As soon as the upstream issue is resolved, we will be able to make further progress on recovering the latency.

Work Around: none
Next Update: Before 08:00 UTC

-Application Insights Service Delivery Team


Update: Friday, 6/5/2015 00:31 UTC

Our DevOps team continues to investigate issues within Application Insights. Root cause is not fully understood at this time and previous mitigation efforts have not resolved all latency. Some customers continue to experience data latency. We currently have no estimate for resolution.

Work Around: none
Next Update: Before 6/5/2015 05:00 UTC

-Application Insights Service Delivery Team


Update: Thursday, 6/4/2015 08:19 UTC

Root cause has been isolated to instability on several of our back-end nodes.  We have gathered the appropriate telemetry and are evaluating it for a full RCA.  Our systems are now working as expected, but some customers may experience continued data latency, and we estimate several hours before all latency is addressed.

Work Around: none
Next Update: Before 17:00 UTC

-Application Insights Service Delivery Team


Update: Thursday, 6/4/2015 05:44 UTC

Our DevOps team continues to investigate issues within Application Insights. Root cause is not fully understood at this time. Some customers continue to experience data latency and occasional query failures.  We currently have no estimate for resolution.

Work Around: none
Next Update: Before 10:00 UTC

-Application Insights Service Delivery Team


Update: Thursday, 6/4/2015 00:10 UTC

Our DevOps team continues to investigate issues within Application Insights. Root cause is not fully understood at this time. Some customers continue to experience latency.  We are working to establish the start time for the issue; initial findings indicate that the problem began at 6/3/2015 19:15 UTC. We currently have no estimate for resolution.

Work Around: none
Next Update: Before 6/4/2015 05:00 UTC

-Application Insights Service Delivery Team


Initial Update: Wednesday, 6/3/2015 21:25 UTC

We are aware of issues within Application Insights and are actively investigating. Some customers may experience data latency. The following data types are affected: Custom Event, Dependency, Exception, Metric, Page Load, Page View, Performance Counter, Request.

Our logs indicate this issue started at approximately 6/3/2015 19:15 UTC.

Work Around: none
Next Update: Before 6/4/2015 00:00 UTC

We are working hard to resolve this issue and apologize for any inconvenience.

-Application Insights Service Delivery Team
