Experiencing Data Access Issues for all Data Types – 7/24 – Resolved


Final Update: Tuesday, 7/28/2015 23:22 UTC

We wanted to follow up on our effort to recover from the data gaps that resulted from this incident. Our analysis shows that some customers have approximately 0.05% of their data currently unavailable for querying. The affected data consists of the summarized telemetry and exceptions data types. We have this data in our storage queue and plan to load it into our query service; however, this will take approximately 1-2 weeks to complete because it requires new feature implementation. We apologize for the delay in processing this data and are working to automate this process end-to-end to reduce our time to resolve going forward.

Update: Saturday, 7/25/2015 10:18 UTC

We’ve confirmed that all systems are back to normal with no customer impact as of 7/25, 10:00 UTC. Our logs show the incident started on 7/24, 18:40 UTC and that, during the ~15 hours it took to resolve the issue, 20% of customers were impacted. However, this issue caused a small amount of trace data to be dropped from the current query system, so a subset of customers will still see partial results when querying this data. We plan to replay this data soon (no ETA yet), after which all data will be accessible. Customers will also see a data gap in the availability report for the initial impact window (7/24 18:40-20:00 UTC).

Root Cause: The failure was due to Microsoft Azure Storage downtime, which triggered an imbalance across our data nodes.
Incident Timeline: 15 Hours & 20 minutes – 7/24, 18:40 UTC through 7/25, 10:00 UTC

We understand that customers rely on Application Insights as a critical service and apologize for any impact this incident caused.

-Application Insights Service Delivery Team


Update: Saturday, 7/25/2015 05:23 UTC

We are still working to fully mitigate the issue. System health has been partially restored, and impact has been reduced to a subset of customers who are hosted on the impacted data nodes. At this moment we don’t have an ETA for full recovery, but we will provide an update as we progress.

Next Update: Before 17:00 UTC

-Application Insights Service Delivery Team


Update: Saturday, 7/25/2015 00:08 UTC

We continue to restore system health. The initial issue that was preventing users from accessing data has been completely fixed, and at present the impact is limited to data latency. Our system is recovering, but at a slow processing rate. The DevOps team is looking into further options to mitigate the situation as soon as possible.

Next Update: Before 05:00 UTC

-Application Insights Service Delivery Team


Update: Friday, 7/24/2015 21:05 UTC

The root cause has been isolated to storage service errors in Microsoft Azure Storage, which were impacting Application Insights. To address this issue, the Microsoft Azure team has applied a mitigation, and our services are recovering quickly. Some customers may experience data latency until the issue is completely resolved.

Next Update: Before 7/25 00:00 UTC

-Application Insights Service Delivery Team


Initial Update: Friday, 7/24/2015 19:12 UTC

We are aware of issues within Application Insights and are actively investigating. Some customers may experience data access issues for all data types.

Work Around: None
Next Update: Before 21:00 UTC

We are working hard to resolve this issue and apologize for any inconvenience.

-Application Insights Service Delivery Team

 
 
 
 
 
 
 
 
