Data Loss issue with Web Test Data – 2/13 – Resolved


Final Update: Saturday, 2/14/2015 03:54 UTC

We’ve confirmed that all systems are back to normal, with no customer impact as of 2/13/2015 20:48 UTC. Our logs show that the incident started on 2/13/2015 17:30 UTC and that, during the approximately 3 hours it took to resolve the issue, customers experienced 25% – 50% data loss for application dependency data.

Root Cause: The failure was due to a noisy neighbor, which impacted our ability to obtain a blob lease in storage.
Chance of Reoccurrence: Low
Lessons Learned: We are investigating additional service improvements and optimizations to avoid this issue in the future. 
Incident Timeline: 3 hours and 18 minutes – 2/13/2015 17:30 UTC through 2/13/2015 20:48 UTC

We understand that customers rely on Application Insights as a critical service and apologize for any impact this incident caused.

-Application Insights Service Delivery Team


Update: Saturday, 2/14/2015 02:31 UTC

While restoring the original configuration, two instances became non-responsive. DevOps is investigating those instances. There is currently no impact to customers.

Next Update: 2/14/2015 06:00 UTC

-Application Insights Service Delivery Team


Update: Saturday, 2/14/2015 00:30 UTC

Root cause has been isolated to a noisy neighbor, which impacted our ability to obtain a blob lease in storage. We have isolated the noisy neighbor’s traffic and are in the process of restoring the original configuration to the affected systems. There is currently no impact to customers.

Next Update: Before 2/14/2015 02:00 UTC

-Application Insights Service Delivery Team


Update: Friday, 2/13/2015 21:54 UTC

Root cause has been isolated to dynamic logic in our service that determines which instances should write to storage. An underlying issue with obtaining blob leases caused the impact. To work around this, we’ve implemented a manual configuration. We’re continuing to determine how to resolve the lease issue so the original configuration can be restored.
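
For illustration only, lease-based writer selection with a manual override might be sketched as below. This is not the Application Insights implementation: `InMemoryLeaseStore`, `LeaseUnavailable`, and `select_writers` are hypothetical names, and an in-memory object stands in for a blob lease in storage.

```python
from typing import List, Optional


class LeaseUnavailable(Exception):
    """Raised when the lease cannot be obtained."""


class InMemoryLeaseStore:
    """Hypothetical stand-in for a blob lease: one holder at a time, with expiry."""

    def __init__(self) -> None:
        self.throttled = False  # simulates storage failing lease requests
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def acquire(self, instance_id: str, duration_s: float, now: float) -> None:
        if self.throttled:
            raise LeaseUnavailable("storage did not grant the lease request")
        held_by_other = self.holder not in (None, instance_id)
        if held_by_other and now < self.expires_at:
            raise LeaseUnavailable(f"lease held by {self.holder}")
        # Lease is free, expired, or being renewed by its current holder.
        self.holder = instance_id
        self.expires_at = now + duration_s


def select_writers(
    instances: List[str],
    store: InMemoryLeaseStore,
    now: float,
    manual_writers: Optional[List[str]] = None,
    lease_duration_s: float = 30.0,
) -> List[str]:
    """Return the instances allowed to write to storage.

    Dynamic mode: the first instance that obtains the lease is the writer.
    Manual mode: a fixed list bypasses the lease entirely (assumed here to
    resemble the manual configuration used as a workaround).
    """
    if manual_writers is not None:
        return [i for i in instances if i in manual_writers]
    for instance_id in instances:
        try:
            store.acquire(instance_id, lease_duration_s, now)
            return [instance_id]
        except LeaseUnavailable:
            continue
    return []  # no instance could obtain the lease, so nothing is written
```

In this sketch, a store that cannot grant leases leaves the dynamic path with no writer at all, while passing `manual_writers` restores a writer without touching the lease.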

Customers will see data gaps (25% – 50% data loss) for application dependency data from 2/13/2015 17:40 UTC – 2/13/2015 20:48 UTC. Please note that any alerting rules for web tests were unaffected during the incident.

Work Around: None
Next Update: Before 23:30 UTC

-Application Insights Service Delivery Team


Update: Friday, 2/13/2015 19:20 UTC

Our DevOps team continues to investigate issues with web test data. Some customers may not be able to see all dependency data for applications. Root cause is not fully understood at this time. We are working to establish the start time for the issue; initial findings indicate that the problem began at approximately 2/13/2015 17:30 UTC. We currently have no estimate for resolution.

Work Around: None
Next Update: Before 21:30 UTC 

-Application Insights Service Delivery Team


Initial Update: Friday, 2/13/2015 18:37 UTC

We are actively investigating issues with web test data. Some customers may experience gaps in their web test results starting at approximately 2/13/2015 17:30 UTC. We currently have no estimate for resolution.

Work Around: None
Next Update: Before 20:30 UTC

We are working hard to resolve this issue and apologize for any inconvenience.

-Application Insights Service Delivery Team