I’ve booked a couple of exams: the Azure Infrastructure exam (533) on 15th May and the Dev exam (532) on 29th May.
I thought I’d just write a few articles about stuff I’ve found out between now and then which others who are taking the exam might find interesting.
This post covers Fault Domains and Update Domains. I’ve known about them for years but never really thought about how they are actually calculated and how machines are provisioned in to them.
First – Azure VMs.
When you put VMs in to an availability set, Azure guarantees to spread them across Fault Domains and Update Domains. A Fault Domain (FD) is essentially a rack of servers. It consumes subsystems like network, power, cooling etc. So with 2 VMs in the same availability set, Azure will provision them in to 2 different racks so that if, say, the network or the power failed, only one rack would be affected.
I discovered there are always only 2 fault domains: FD0 and FD1. It makes it seem like your VMs only get spread across 2 racks but that’s not the case. They can be spread across more racks if you’ve got lots of VMs. But as far as your availability set is concerned, FD0 and FD1 are a way of saying “This bit of infrastructure (FD0) is different to this bit (FD1)”. As you boot VMs in to an availability set, they get allocated like this: FD0, FD1, FD0, FD1, FD0, FD1 and so on. The pattern never changes. You’ve probably seen this diagram hundreds of times:
Figure 1: Fault Domains and Availability Sets
You can see IIS1 and IIS2 are the web front end. They’re in different fault domains. If something happens to the power going to rack 1, IIS1 will fail and so will SQL1, but the other 2 servers will continue to operate.
Now, if you add more servers to each availability set, this is what happens:
Figure 2: FD0 and FD1 are populated.
Azure continues to distribute them across fault domains. Looking at a list of the 4 IIS VMs would give a table like this:
They are allocated to FDs in the order in which they boot. So if I’d booted these systems in reverse order, each one would have ended up in the opposite FD.
Sometimes you need to update your app, or Microsoft needs to update the host on which your VM(s) are running. Note that with IaaS VMs, Microsoft does not automatically update your VMs. You have complete control (and responsibility) over that. But say a serious security vulnerability is identified and a patch created: it’s in Microsoft’s interest to get that applied to the host underneath your VM as soon as possible. So how is that done without taking your service offline? Update Domains. It’s similar to the FD method, only this time, instead of an accidental failure, there is a purposeful move to take down one (or more) of your servers. So to make sure your service doesn’t go offline because of an update, Azure walks through your update domains one after the other.
Whereas FDs are assigned in the pattern 0, 1, 0, 1, 0, 1, 0, 1…. UDs are assigned 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4….
Both FDs and UDs are assigned in the order that Azure discovers them as they are provisioned. So if you provision machines in the order Srv0, Srv1, Srv2, Srv3, Srv4, Srv5, Srv6, Srv7, Srv8, Srv9, Srv10, Srv11 you’ll end up with a table that looks like this:
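|VM||Fault Domain||Update Domain|
|Srv0||0||0|
|Srv1||1||1|
|Srv2||0||2|
|Srv3||1||3|
|Srv4||0||4|
|Srv5||1||0|
|Srv6||0||1|
|Srv7||1||2|
|Srv8||0||3|
|Srv9||1||4|
|Srv10||0||0|
|Srv11||1||1|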
…you can see that UDs loop around a count of 5 (0, 1, 2, 3, 4).
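The round-robin rule described above can be sketched in a few lines of Python. This is only my own illustration of the pattern, not anything Azure exposes: the function name `assign_domains` and its parameters are made up for the example, and it assumes the list is already in the order Azure provisioned the machines.

```python
def assign_domains(vm_names, fd_count=2, ud_count=5):
    """Return (name, fd, ud) tuples for VMs listed in provisioning order.

    Hypothetical helper: FDs cycle 0, 1, 0, 1... and UDs cycle 0, 1, 2, 3, 4...
    exactly as described in the post.
    """
    return [
        (name, position % fd_count, position % ud_count)
        for position, name in enumerate(vm_names)
    ]


# Srv0 .. Srv11, provisioned in that order
vms = [f"Srv{i}" for i in range(12)]
table = assign_domains(vms)

for name, fd, ud in table:
    print(f"{name:<6} FD{fd}  UD{ud}")
```

The only inputs are the position of each machine in the provisioning order and the number of domains, which is why the pattern is so regular when machines come up in sequence.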
You can see that in the following screen shot of a collection of 9 VMs in a single availability set.
Figure 3: Fault and Update Domains in a Cloud Service comprised of Azure VMs
Second – Azure Cloud Services
With Azure VMs, FDs and UDs are assigned to the VMs in an availability set in the order in which they are provisioned. With Cloud Services it’s almost the same but roles are used instead of availability sets. For example, you might have a web role with 8 instances. The instances in the role are assigned FDs and UDs as they are provisioned and discovered by Azure. The order of the instance numbers is not necessarily the order in which they are successfully provisioned. It’s just a fact of life that some machines that start the provisioning process slow down in the middle, and machines that started later catch up and overtake them.
|Instance Number||Fault Domain||Update Domain|
You can also see in this shot of a Cloud Service, the same pattern:
…but notice how by the time we get to page three (there are 100 servers in this cloud service), it starts to break down:
…that’s because UDs and FDs are assigned in the order that instances are provisioned. Some of them provision more quickly than others and that causes the pattern to break down. But there are still the correct number of FDs and UDs.
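Here is a rough simulation of that effect. The timing model is entirely hypothetical (I just add a random delay to each instance so that later ones can overtake earlier ones); the point is only that the by-number pattern breaks while the overall spread stays even.

```python
import random
from collections import Counter

FD_COUNT = 2   # fault domains, per the post
UD_COUNT = 5   # default update domains, per the post


def assign_in_completion_order(completion_order):
    """Map instance number -> (fd, ud), given instances in the order they finished."""
    return {
        instance: (position % FD_COUNT, position % UD_COUNT)
        for position, instance in enumerate(completion_order)
    }


# Hypothetical timing model: instance i starts at time i and takes a random
# 1 to 4 time units to finish, so later instances can overtake earlier ones.
rng = random.Random(42)
finish_time = {i: i + rng.uniform(1, 4) for i in range(100)}
completion_order = sorted(finish_time, key=finish_time.get)

domains = assign_in_completion_order(completion_order)

# Instances whose UD no longer matches the tidy "instance number mod 5" pattern.
out_of_pattern = [i for i, (_, ud) in domains.items() if ud != i % UD_COUNT]
print(f"{len(out_of_pattern)} of 100 instances break the by-number pattern")

# The spread itself is still even: 50 instances per FD and 20 per UD.
fd_totals = Counter(fd for fd, _ in domains.values())
ud_totals = Counter(ud for _, ud in domains.values())
print(dict(fd_totals), dict(ud_totals))
```

However the completion order gets shuffled, the totals per domain come out the same, because the assignment depends only on position in that order.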
In Cloud Services, you can also set the number of update domains in the service model’s .csdef file. By default it’s set to 5 but you can increase that to a maximum of 20.
Q: If you add a new VM to an availability set, how many extra fault domains and update domains will you get if there are already 4 instances in the availability set?
A:
|VM||Update Domain||Fault Domain||Result|
|Srv0||0||0||1 UD and 1 FD|
|Srv1||1||1||2 UDs (UD0 & UD1) and 2 FD (FD0 & FD1)|
|Srv2||2||0||3 UDs and 2 FDs|
|Srv3||3||1||4 UDs and 2 FDs|
|Add new VM = Srv4||4||0||You get one extra UD (UD4), but no extra FDs|
Q: In a Cloud Service that has a Web Role with 12 instances, how many FDs and UDs will you get by default?
A: 2 FDs (FD0 and FD1). 5 UDs (UD0, UD1, UD2, UD3, UD4).
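If you want to sanity-check answers like the two above, the number of domains actually in use is just the instance count capped at the number of domains available. A minimal sketch, assuming the round-robin assignment described earlier (the helper name `domains_in_use` is mine, not an Azure API):

```python
def domains_in_use(instance_count, fd_count=2, ud_count=5):
    """Return (distinct FDs, distinct UDs) occupied under round-robin assignment."""
    return min(instance_count, fd_count), min(instance_count, ud_count)


# Going from 4 to 5 VMs in an availability set
before = domains_in_use(4)
after = domains_in_use(5)
extra_fds = after[0] - before[0]
extra_uds = after[1] - before[1]
print(f"extra FDs: {extra_fds}, extra UDs: {extra_uds}")

# A web role with 12 instances and the default settings
print(domains_in_use(12))

# The same role if the .csdef raised the UD count to 20
print(domains_in_use(12, ud_count=20))
```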
Q: You have set the maximum number of UDs in the .csdef for a Cloud Service to 20. You use Azure Virtual Machines to provision 18 VMs. How many Update Domains will you have as a result?
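A: This is a bit of a trick question: the .csdef is only used for Cloud Services, not for VMs. So regardless of what you set or even how you try to do it, Azure VM UDs come in groups of 5. With 18 VMs, that means you’ll have 5 UDs, UD0 to UD4, a la:
|VM||Update Domain|
|VM0||0|
|VM1||1|
|VM2||2|
|VM3||3|
|VM4||4|
|VM5||0|
|VM6||1|
|VM7||2|
|VM8||3|
|VM9||4|
|VM10||0|
|VM11||1|
|VM12||2|
|VM13||3|
|VM14||4|
|VM15||0|
|VM16||1|
|VM17||2|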
Q: If you have 13 VMs in an availability set, how many VMs will be in UD0?
A: Use the table above. You can see UD0 lines up with VM0, VM5 and VM10. So there will be 3 VMs in UD0.
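The same count can be worked out in code. This is just a sketch under the assumption that the VMs are provisioned in order VM0, VM1, VM2 and so on; `vms_per_update_domain` is a name I’ve invented for the example.

```python
from collections import defaultdict


def vms_per_update_domain(vm_count, ud_count=5):
    """Group VM names by update domain, assuming in-order round-robin assignment."""
    groups = defaultdict(list)
    for i in range(vm_count):
        groups[i % ud_count].append(f"VM{i}")
    return dict(groups)


groups_13 = vms_per_update_domain(13)
print(groups_13[0])        # ['VM0', 'VM5', 'VM10']
print(len(groups_13[0]))   # 3

# The 18-VM question: still only 5 update domains
groups_18 = vms_per_update_domain(18)
print(len(groups_18))      # 5
```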
I hope you find this useful if you’re going to take the Azure Infrastructure exam (533).
Planky == @plankytronixx