We’ve confirmed that all systems are back to normal with no customer impact as of 03/02, 15:00 UTC. Our logs show that the incident started on 03/01, 16:48 UTC and that during the 22 hours and 12 minutes it took to resolve the issue, a small percentage of customers experienced data latency.
- Root Cause: The failure was caused by an exception triggered by invalid data sent by a customer application.
- Lessons Learned: We have deployed a hotfix that will prevent such issues from recurring.
- Incident Timeline: 22 hours and 12 minutes – 3/1, 16:48 UTC through 3/2, 15:00 UTC
We understand that customers rely on Application Insights as a critical service and apologize for any impact this incident caused.
Application Insights services are still processing some of the delayed data. Processing the backlog is taking longer than expected, and we are monitoring progress. Some customers may still experience data latency for data sent between 3/1 18:00 UTC and 3/2 00:00 UTC, and we estimate an additional 6 hours before all backlog data is processed.
- Workaround: None
- Next Update: Before 03/02 16:00 UTC
The root cause has been isolated to a lack of certain key data validations in our pipeline, which was impacting our processing components. To address this issue, we have deployed fixes to bring these processing components back to normal processing rates, and they are now working as expected. However, a subset of our customers will continue to see data gaps for data ingested between 3/1 18:00 UTC and 3/2 00:00 UTC. We expect closing these gaps to take another 6-10 hours.
- Next Update: Before 03/02 11:00 UTC
We continue to investigate issues within Application Insights and are still seeing corruption issues in one of our processing components. A subset of our customers will continue to see data gaps for data ingested between 3/1 18:00 UTC and 3/1 22:00 UTC. We currently have no estimate for resolution.
- Next Update: Before 03/02 04:30 UTC
The root cause has been isolated to deployments that were executed prior to this incident. To address this issue, we rolled back those deployments, which brought our processing components back to a normal state. Customers will now see their latest data in the portal. However, a subset of our customers may experience a gap in data ingested between 3/1 16:00 UTC and 3/1 18:00 UTC as a residual effect. We’ve put in place additional processing components to backfill this data and estimate another 4 hours for this activity to complete.
- Next Update: Before 03/01 23:30 UTC
We are aware of issues within Application Insights and are actively investigating. Some customers may experience data latency. The following data types are affected: Availability, Customer Event, Dependency, Exception, Metric, Page Load, Page View, Performance Counter, Request, Trace.
- Next Update: Before 03/01 19:00 UTC
We are working hard to resolve this issue and apologize for any inconvenience.