Warnings sent to customers when Azure is about to be updated

I sometimes get customers asking me about the warnings they’ll get when updates are rolled out across Azure. Well – at the bottom of this post is an example email sent to me. Notice the emphasis placed on:

  • Putting multiple VMs into availability sets
  • Creating multiple instances of each role in Cloud Services

Igal Figlin from Microsoft did some background research into this and found that around 40% of deployments are not in availability sets (I can’t remember the exact figure and it might be higher; you can watch the video here). Have a read of the email below and you’ll start to realise how much risk you are exposing yourself to if you don’t use multiple VMs in availability sets.

When you put VMs into availability sets they are also distributed across up to 5 update domains. When Microsoft updates Azure, they’ll walk from one update domain to the next. You can see what they are saying in this email: they’ll leave at least 30 minutes between updating each update domain. Let’s say you have 2 machines in an availability set. They’ll be spread across 2 fault domains and 2 update domains. That means if an infrastructure fault occurs (say a power or network-segment failure), only one of your VMs will be affected. It also means that if Microsoft has to do an update, it will take only one of your machines out of service at a time.
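To make that concrete, here’s a minimal Python sketch of how instances might land in update and fault domains. The round-robin rule, the `place_instances` function and the `Placement` type are my own assumptions for illustration; Azure’s real allocation logic isn’t described in the email. The defaults of 5 update domains and 2 fault domains are simply the numbers used in this post.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    instance: int
    update_domain: int
    fault_domain: int


def place_instances(count: int, update_domains: int = 5,
                    fault_domains: int = 2) -> list[Placement]:
    """Assign instances round-robin to update and fault domains.

    Round-robin is an assumption made for this sketch, not Azure's
    documented placement algorithm.
    """
    return [Placement(i, i % update_domains, i % fault_domains)
            for i in range(count)]


# Two VMs end up in different update domains and different fault domains.
for placement in place_instances(2):
    print(placement)
```

With two VMs, a single fault domain failure or a single update domain reboot touches only one of them, which is the point of the paragraph above.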

If you want to be super-cautious, you could protect against the scenario where, while Microsoft is walking the update domains in your availability set, you also get an infrastructure failure that takes out a further machine. The table below shows how that could happen with three instances.

                 Update Domain 0   Update Domain 1   Update Domain 2
Fault Domain 0   Instance 0                          Instance 2
Fault Domain 1                     Instance 1

Imagine the update process has finished the instance in Update Domain 0, walked on to Update Domain 1 and is in the middle of updating that instance. Instance 1 is now offline. At the same time, a power failure hits the rack in Fault Domain 0. That takes Instance 0 and Instance 2 offline as well. You’d now have an availability set with no running machines.

You can counter this by adding a fourth VM to the availability set. Because only one Update Domain in an availability set can ever be undergoing an update, you are protected. Let’s say you are in the middle of updating one of the services yourself. Your update will be stalled, the Microsoft update will complete and then your update will continue. In other words, updates are applied serially, one Update Domain at a time. And if you are in the middle of updating one Update Domain, Microsoft won’t start simultaneously updating a different Update Domain. So the layout in the following table keeps at least one instance running through simultaneous Update Domain and Fault Domain operations.

 

                 Update Domain 0   Update Domain 1   Update Domain 2   Update Domain 3
Fault Domain 0   Instance 0                          Instance 2
Fault Domain 1                     Instance 1                          Instance 3

The failure of any Fault Domain will take out 2 instances and a simultaneous update can take out only one Update Domain. This means a maximum of 3 instances can be offline because of simultaneous Update Domain/Fault Domain operations. That would leave you with one running instance.
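If you want to check the arithmetic, the sketch below brute-forces every combination of one Update Domain being updated and one Fault Domain failing, and reports the fewest instances left running. The function name `worst_case_survivors` and the (update domain, fault domain) tuples are hypothetical, chosen to mirror the two tables above. It assumes exactly one Update Domain and one Fault Domain are down at the same moment.

```python
def worst_case_survivors(layout: list[tuple[int, int]]) -> int:
    """Return the fewest instances still running when one update domain
    is being updated and one fault domain has failed at the same time.

    layout holds one (update_domain, fault_domain) pair per instance.
    """
    update_domains = {ud for ud, _ in layout}
    fault_domains = {fd for _, fd in layout}
    return min(
        sum(1 for ud, fd in layout if ud != down_ud and fd != down_fd)
        for down_ud in update_domains
        for down_fd in fault_domains
    )


# Layouts copied from the two tables: (update domain, fault domain).
three_instances = [(0, 0), (1, 1), (2, 0)]
four_instances = [(0, 0), (1, 1), (2, 0), (3, 1)]

print(worst_case_survivors(three_instances))  # 0
print(worst_case_survivors(four_instances))   # 1
```

The three-instance layout can drop to zero running machines, while the four-instance layout always keeps one alive under the same double failure.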

You’d have to be very unlucky for an infrastructure failure to occur while an update is going on. The availability SLA takes the above scenarios into consideration: you only need 2 instances in your availability set to enjoy the uptime guarantee. If you are unlucky enough to suffer a double problem and availability drops below the guarantee, Microsoft compensates you.

I made a post about Update Domains and Fault Domains a couple of weeks ago. Interesting stuff if you’re going to take the Azure Infrastructure exam.

 

Anyway – here’s the email:

---- cut here -----

   

Upcoming maintenance will affect deployments of Azure Virtual Machines in availability sets and Cloud Services.

 


 
   

As part of our ongoing commitment to performance, reliability, and security, we sometimes perform maintenance operations in our Azure regions and datacenters. We want to notify you of upcoming maintenance operations that will impact Virtual Machines in an availability set and Cloud Services.

Note: Currently, we’re only able to provide 2 days' advance notice for updates that impact Virtual Machines in availability sets and Cloud Services. We’re working to provide more advance notice in the future.

The following are the planned start times for infrastructure-as-a-service (IaaS) and platform-as-a-service (PaaS) maintenance operations, provided in both Coordinated Universal Time (UTC) and United States Pacific Daylight Time (PDT). Impacted deployments are listed at the bottom of this email.

   
   
 

Region             PDT                             UTC
North Central US   08:00 Monday, June 1, 2015      15:00 Monday, June 1, 2015
North Europe       08:00 Tuesday, June 2, 2015     15:00 Tuesday, June 2, 2015
East US            08:00 Wednesday, June 3, 2015   15:00 Wednesday, June 3, 2015

 
   
   

Microsoft Azure Virtual Machines (IaaS)

Maintenance operations are split between virtual machines (VMs) that are and are not in an availability set. This maintenance will impact VMs in an availability set. VM deployments referenced below will reboot during this maintenance operation, but temporary storage disk contents will be retained. We expect the update to finish within 48 hours of the start time. Note: If you have a single VM in an availability set, it will still be impacted by this maintenance operation. In addition, all VMs in the same availability set are not taken down at the same time—these VMs are spread across five update domains. Only VMs in the same update domain for the availability set may be rebooted at the same time, and there will be at least a 30-minute interval between processing each update domain. VMs that are in different availability sets may be taken down at the same time. For more information, please visit the availability sets documentation webpage.

If you’re not already, we recommend using availability sets in your architecture to ensure higher availability of your service. You can read our multiple instances service level agreement (SLA) commitment for Virtual Machines.

To learn more about our planned maintenance, please visit the Planned maintenance for Azure virtual machines documentation webpage. If you have questions, please visit the Azure Virtual Machines forums.

To ensure higher availability, the maintenance is scheduled in region pairs. To help determine whether the reboot you observed on your VM is due to a planned maintenance event, please visit the Viewing VM Reboot Logs blog post.

Microsoft Azure Cloud Services (PaaS)

All Cloud Services running web and/or worker roles referenced below will experience downtime during this maintenance. Cloud Services with two or more role instances in different upgrade domains will have external connectivity at least 99.95 percent of the time. Please note that the SLA guaranteeing service availability only applies to services that are deployed with more than one instance per role. Azure updates one upgrade domain at a time. For more information about distribution of roles across upgrade domains and the update process, please visit the Update an Azure Service webpage. If you have questions, please visit the Azure Cloud Services forums.

Please note that email addresses provided for any of the following account roles also received this communication: account and service administrators, and co-administrators.

Thank you, Your Azure Team

   
   

Account Information

Subscription ID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX

Cloud Services Name(s): XXXXXXX, XXXXXXX, XXXXXXX, XXXXXXX, XXXXXXX

   

Have fun – Planky == @plankytronixx