Experiencing Latency, Data Loss and Data Access Issues for Multiple Functional Areas - 03/12 - Resolved


Final Update: Sunday, 13 March 2016 13:02 UTC

We've confirmed that all systems are back to normal with no customer impact as of 03/13, 13:00 UTC. Our logs show that the incident started on 03/12, 07:30 UTC, and that during the 29 hours 30 minutes it took to resolve the issue, some customers would have experienced data latency and data access issues.
  • Root Cause: The failure was caused by maintenance on a service that Application Insights depends on. Once that service recovered, the issue was mitigated.
  • Lessons Learned: We have collected the required telemetry data and will use it to investigate further, so that we can avoid similar occurrences in the future.
  • Incident Timeline: 29 hours 30 minutes - 03/12, 07:30 UTC through 03/13, 13:00 UTC

We understand that customers rely on Application Insights as a critical service and apologize for any impact this incident caused.

-Praveen


Update: Sunday, 13 March 2016 07:20 UTC

The Application Insights processing service is working through the backlog data at a healthy rate without interruption, and there is no latency in current data. The root cause has been isolated to maintenance on an underlying platform service. That maintenance is still in progress, so there is a risk of similar service interruptions until it concludes. We estimate 6 more hours before all the backlog data is processed. Some customers will continue to experience data latency for the backlog data.

  • Work Around: None
  • Next Update: Before 03/13 13:30 UTC

-Praveen


Update: Saturday, 12 March 2016 19:46 UTC

The root cause has been isolated to maintenance on an underlying platform service, which caused corruption on a few of our nodes. To address this, we repaired the corruption on the affected hosts. Some customers may experience data latency and data gaps until all the backlog data is processed. We currently estimate that it could take around 12 hours for all the data to recover.

Note that the underlying platform is still undergoing maintenance, so there is a higher-than-normal risk of similar service interruptions until the maintenance concludes.

  • Work Around: None
  • Next Update: Before 03/13 08:00 UTC

-Pankaj


Update: Saturday, 12 March 2016 13:57 UTC

We are aware of issues within Application Insights and continue to investigate. The issue is caused by maintenance activity on one of the services we depend on. Some customers will continue to experience latency, data access issues, and data gaps. Initial findings indicate that the problem began at approximately 07:30 UTC on 03/12. We currently estimate 6 hours for resolution.

  • Work Around: None
  • Next Update: Before 03/12 20:00 UTC

-Praveen


Initial Update: Saturday, 12 March 2016 08:19 UTC

We are aware of issues within Application Insights and are actively investigating. Some customers may experience latency, data access issues, and data gaps. The following data types are affected: Customer Event, Dependency, Exception, Metric, Page Load, Page View, Performance Counter, Request, Trace.
  • Work Around: None
  • Next Update: Before 03/12 14:30 UTC

We are working hard to resolve this issue and apologize for any inconvenience.
-Durga

