Experiencing data latency issues in Azure Portal for Application Insights Service in US regions - Mitigated


RCA Update: Saturday, 08 September 2018 01:41 UTC

We’ve received many questions over the last couple of days asking for more specifics about why this incident affected more than just the South Central US region, which was impacted by the datacenter cooling outage on September 4th. You can read more about that outage here:
https://azure.microsoft.com/en-us/status/history/

Application Insights resources located in the South Central US data center, plus some resources from the East US data center, were the most impacted. These resources could not be managed for the duration of the initial incident.

However, all Application Insights resources across all regions experienced some impact during this incident. This was caused by impact to non-regional services such as Azure Active Directory, Azure Resource Manager, and internal components that provide the data routing capability used by other regional components. The result was global impact to Application Insights, including significant delays in ingestion and impact to the ability to query data and to update and manage some types of resources, such as Availability Tests. This was not a result of customer data being stored in the South Central US data center; customer data stored within Application Insights resides in the geography it is sent to, as described here: https://docs.microsoft.com/en-us/azure/application-insights/app-insights-data-retention-privacy

Recovery from this incident took longer than usual because of continued authentication and scaling issues. Application Insights ingestion occurs at the closest ingestion endpoint. Ingestion continued across all regions during the outage, but due to the issues described above, the ingested data could not be routed to its regional storage location. This created a backlog of data that had to be cleared before new data could be persisted and made available to query. The impact of this ingestion latency surfaced in many ways, including gaps in data as seen in the Azure portal, Log Search alerts firing based on data that was ingested late, latency in reporting billing data to Azure commerce, and delays in seeing the results of Availability Tests in the Azure portal.

For historical reasons, Application Insights status is posted on this Application Insights Service Blog. We are working to retire this blog and to post all new service status on the Azure Service status page in the future. We understand that Application Insights is an important service for many of you, and we apologize again for the impact this incident caused. We are continuing to invest in improving the resiliency of the service so that future incidents with regional impact do not affect resources in other data centers.

-Suraj


Final Update: Friday, 07 September 2018 19:08 UTC

We've confirmed that all systems are back to normal. We will post a brief RCA about the incident shortly.

We understand that customers rely on Application Insights as a critical service and apologize for any impact this incident caused.

-Suraj


Update: Friday, 07 September 2018 12:13 UTC

Data latency is still being seen in the US regions where our service is deployed. Mitigation is in progress and the services are recovering. However, a few customers in the US might experience latency issues until the services have completely recovered.

  • Work Around: None
  • Next Update: Before 09/07 18:30 UTC

-Varun


Update: Friday, 07 September 2018 05:48 UTC

We are still seeing data latency in the US locations where our services are deployed. Mitigation is in progress and the services are showing recovery. However, a very small subset of customers in the US may still experience data latency issues until the services have completely recovered. We don't have an ETA for full recovery.

  • Work Around: None
  • Next Update: Before 09/07 12:00 UTC

-Varun


Update: Friday, 07 September 2018 00:37 UTC

Our services are recovering and latency is decreasing, but we have not completely recovered yet. Mitigation is still in progress. Customers in the US may continue to experience data latency issues until the services have completely recovered, and we don't yet have an ETA for full recovery.

  • Work Around: None
  • Next Update: Before 09/07 07:00 UTC

-Suraj


Update: Thursday, 06 September 2018 17:54 UTC

We are still seeing data latency in the US locations where our services are deployed. Mitigation is in progress and the services are showing recovery. However, some customers in the US may still experience data latency issues until the services have completely recovered, and we don't yet have an ETA for full recovery.

  • Work Around: None
  • Next Update: Before 09/07 00:00 UTC

-Suraj

Update: Thursday, 06 September 2018 11:24 UTC

This is caused by an issue in the US datacenter where our services are deployed. Mitigation is in progress and the services are showing recovery. However, some customers in the US may still experience data access and latency issues until the services have completely recovered.

  • Work Around: None
  • Next Update: Before 09/06 17:30 UTC

-Varun


Update: Thursday, 06 September 2018 07:15 UTC

The root cause has been isolated to a networking issue in our US data center where storage and other services are deployed. We have applied mitigation and are seeing the expected recovery in our services. However, US customers will continue to experience data access and latency issues until full service is restored. We will provide more updates in the next 4 hours.

  • Work Around: None
  • Next Update: Before 09/06 11:30 UTC

-Varun


Update: Thursday, 06 September 2018 04:05 UTC

The root cause has been isolated to networking issues in our US data center, which were impacting our backend service. We have taken the required mitigation steps and are seeing recovery. However, some customers may experience data access and latency issues until the service is fully restored. We will provide more updates in the next 3 hours.

  • Work Around: None
  • Next Update: Before 09/06 07:30 UTC

-Varun


Update: Thursday, 06 September 2018 00:49 UTC

The root cause has been isolated to a networking issue in our US data center where storage and other services are deployed. We have applied mitigation and are seeing the expected recovery in our services. However, US customers will continue to experience data access and latency issues until full service is restored. We will provide more updates in the next 2 hours.

  • Next Update: Before 09/06 03:00 UTC

-Arvind


Update: Wednesday, 05 September 2018 19:45 UTC

We continue to investigate issues within Application Insights. The datacenter outage in South Central US has impacted Application Insights customers. 18% of customers in East US continue to experience data access, alerting, and ingestion delays. We currently have no estimate for resolution.

  • Work Around:
  • Next Update: Before 09/06 02:00 UTC

-Suraj


Update: Wednesday, 05 September 2018 19:10 UTC

Azure South Central US datacenter issues are being mitigated. There is no ETA for complete mitigation. Some customers in South Central US continue to experience data access, alerting, and ingestion delays.

  • Work Around: None
  • Next Update: Before 09/06 01:30 UTC

-Suraj


Update: Wednesday, 05 September 2018 11:07 UTC

Azure South Central US datacenter issues are being mitigated. There is no ETA for complete mitigation. Some customers continue to experience data loss in South Central US, as well as alerting and data access issues.

  • Work Around: None
  • Next Update: Before 09/05 17:30 UTC

-Varun


Update: Wednesday, 05 September 2018 04:32 UTC

Azure South Central US datacenter issues are being mitigated. There is no ETA for complete mitigation. Some customers will continue to experience data loss in all regions, as well as alerting and data access issues, until all problems are addressed.

  • Work Around: None
  • Next Update: Before 09/05 11:00 UTC

-Varun


Update: Tuesday, 04 September 2018 22:38 UTC

Azure South Central US datacenter issues are being mitigated. There is no ETA for complete mitigation. Some customers will continue to experience data loss in all regions, as well as alerting and data access issues, until all problems are addressed.

  • Work Around: None
  • Next Update: Before 09/05 03:00 UTC

-Suraj


Update: Tuesday, 04 September 2018 18:37 UTC

The Azure South Central US datacenter is experiencing an outage, which has led to loss of ingested data for Application Insights. The team is currently bringing the datacenter back online. Some customers continue to experience data loss in all regions, as well as alerting and data access issues.

Initial findings indicate that the problem began at 09/04 ~10:00 AM UTC. We currently have no estimate for resolution.

  • Work Around: None
  • Next Update: Before 09/04 22:00 UTC

-Suraj


Update: Tuesday, 04 September 2018 15:59 UTC

We continue to investigate issues within Application Insights. Root cause is not fully understood at this time. Some customers continue to experience issues while accessing their data. Customers may also see data gaps and face alerting issues. Initial findings indicate that the problem began at 09/04 ~10:00 AM UTC. We currently have no estimate for resolution.

  • Work Around: None
  • Next Update: Before 09/04 20:00 UTC

-Varun


Update: Tuesday, 04 September 2018 12:05 UTC

We continue to investigate issues within Application Insights. Root cause is not fully understood at this time. Some customers continue to experience issues while accessing their data. Customers whose data resides in the South Central US region may also see data gaps and face alerting issues. We are working to establish the start time for the issue; initial findings indicate that the problem began at 09/04 ~10:00 AM UTC. We currently have no estimate for resolution.

  • Work Around: None
  • Next Update: Before 09/04 16:30 UTC

-Varun


Initial Update: Tuesday, 04 September 2018 10:25 UTC

We are aware of issues within Application Insights and are actively investigating. Some customers may experience data access issues in the Azure portal.

  • Work Around: None
  • Next Update: Before 09/04 12:30 UTC

We are working hard to resolve this issue and apologize for any inconvenience.
-Varun


