Experiencing Data Latency for Many Data Types – 5/30 – Resolved


Final Update: Sunday, 5/31/2015 05:19 UTC

We’ve confirmed that all systems are back to normal with no customer impact as of 5/31, 04:35 UTC. Our logs show that the incident started on 5/30, 15:00 UTC and that, during the ~14 hours it took to resolve the issue, customers experienced data latency for all telemetry data types.

Root Cause: The failure was due to a stuck thread pool queue for indexing. This was caused by a very large shard in our system, so we initiated a manual task to clear the stuck indexing queue, which helped restore health quickly.

Incident Timeline: 13 Hours & 35 minutes - 5/30, 15:00 UTC through 5/31, 04:35 UTC

We understand that customers rely on Application Insights as a critical service and apologize for any impact this incident caused.

-Application Insights Service Delivery Team


Update: Saturday, 5/30/2015 23:51 UTC

Our system remains in an unhealthy state, and we are working hard to restore it. We found that some nodes hosting old data are taking a long time to process, which is leaving the whole system in an unstable state. Customers might see a latency of around 3 hours while querying telemetry data. At this moment we have no estimate for resolution, but we will provide an update as we make progress.

Next Update: Before 5/31 06:00 UTC

-Application Insights Service Delivery Team



Update: Saturday, 5/30/2015 20:11 UTC

We continue to work on restoring system health. We have not seen any errors since 05/30 19:45, and latency is slowly going down for affected streams. The root cause is not yet fully confirmed, and the investigation continues so that we can understand the problem and apply a fix accordingly.

Next Update: Before 5/31 00:00 UTC

-Application Insights Service Delivery Team


Update: Saturday, 5/30/2015 17:50 UTC

Our DevOps team continues to investigate issues within Application Insights. The root cause is not fully understood at this time. Some customers continue to experience data latency for telemetry data. We are working to establish the start time for the issue; initial findings indicate that the problem began at 05/30 ~15:36 UTC. We currently have no estimate for resolution.

Next Update: Before 20:00 UTC

-Application Insights Service Delivery Team

 
 
 
 
