Experiencing Data Latency & Data Access for Multiple Metrics – 9/11 – Resolved


Final Update: Saturday, 9/12/2015 01:31 UTC

We’ve confirmed that all systems are back to normal with no customer impact as of 9/12, 01:00 UTC. Our logs show the incident started on 9/11, 15:00 UTC and that during the 8 hours it took to resolve the issue, a few customers experienced errors when querying. The issue is now mitigated and we continue to monitor the health of the service. We also learned later that for the summarized metrics type, data processing was normal and there was no impact during the incident time window mentioned earlier.

Root Cause: The failure was due to the query configuration in one of our services.
Incident Timeline: 7 Hours - 9/11, 18:00 UTC through 9/12, 01:00 UTC

We understand that customers rely on Application Insights as a critical service and apologize for any impact this incident caused.

-Application Insights Service Delivery Team


Update: Friday, 9/11/2015 22:28 UTC

We are investigating two different issues, one for the latency and the other for data access when querying.

For the data latency, the root cause has been isolated to an indexing issue, and we are actively changing our indexing mechanism to address it. In addition, we found a memory issue that was impacting data access for queries, and we have taken steps to address it. Some customers may experience latency for the summarized metrics, and we estimate 6-9 hours before all data is caught up.

Next Update: Before 9/12 03:00 UTC

-Application Insights Service Delivery Team


Update: Friday, 9/11/2015 19:41 UTC

We just found out that the scope of impact is much larger than expected. We now know for a fact that some customers will experience errors when accessing their data, in addition to the latency described in earlier posts. All data types are affected except Trace and Messages.

We are working on resolving the issue.

Next Update: Before 9/11 23:00 UTC

-Application Insights Service Delivery Team


Update: Friday, 9/11/2015 18:57 UTC

The recovery is taking much longer than expected because we hit another issue in which more machines went into an unhealthy state. We have fixed some of these machines and are working on fixing the rest. We currently have no estimate for resolution.

Next Update: Before 9/11 23:00 UTC 

-Application Insights Service Delivery Team


Update: Friday, 9/11/2015 16:56 UTC

The root cause has been isolated to a few of our machines going into a bad state, which was impacting summarized data. To address this issue, we have taken steps to bring the machines back to a healthy state. Some customers may experience latency, and we estimate 3 hours before all pending latency issues are addressed. This affects only the aggregated metrics; raw data is not affected.

Next Update: Before 9/11 19:00 UTC

-Application Insights Service Delivery Team


Initial Update: Friday, 9/11/2015 15:25 UTC

We are aware of issues within Application Insights and are actively investigating. Some customers may experience Data Latency. The following data types are affected: Summarized metrics. Raw metrics are not affected.

Next Update: Before 9/11 17:00 UTC

We are working hard to resolve this issue and apologize for any inconvenience.

-Application Insights Service Delivery Team