Best Practices for Higher Availability & Recovery (IaaS)

11/06/2014

This post discusses ways to minimize the impact of a loss of service on Azure virtual machines (IaaS), and provides information about the platform and the recovery options available for virtual machines.

The following topics are covered:

1. Possible causes of Downtime

2. Check List for Higher Availability

3. Recovery Scenarios

1. Possible causes of Downtime

There are several reasons why Azure IaaS virtual machines might be restarted or taken offline, which affects their availability, such as:

• Automatic upgrade of the host Operating System

• Service Healing (Hardware or Storage connectivity problem)

• Data Center Outages

• Network Issues

• Application failure/misconfigurations within the Virtual Machine

Automatic upgrade of the Host Operating System

To ensure reliability, performance, and security, Azure requires periodic updates to the underlying platform infrastructure on which your virtual machines run. These updates require a reboot of your virtual machines and are performed roughly once per month. You can find more details at the following link:

Windows Azure Host Updates: Why, When, and How

https://blogs.technet.com/b/markrussinovich/archive/2012/08/22/3515679.aspx

Fault Domain and Upgrade Domain

The concepts of fault domains and upgrade domains exist to prevent a single point of failure.

Fault domains refer to how hardware is arranged in racks within a datacenter. Placing the instances of a deployment in separate fault domains, and therefore in separate racks, means they run on different hardware, so it is unlikely that all of them would fail at the same time. Furthermore, a failure in one fault domain should not affect the others.

Windows Azure Compute service SLA guarantees the level of connectivity uptime for a deployed service only if two or more instances of each role of a service are deployed.
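As a minimal sketch of the SLA condition above, the following Python snippet flags roles that are deployed with fewer than two instances. The inventory format (a list of role and instance name pairs) and the function name are assumptions made for illustration; they are not part of any Azure API.

```python
from collections import Counter

def roles_without_redundancy(instances, minimum=2):
    """Return, sorted, the roles that have fewer than `minimum` instances.

    `instances` is a hypothetical inventory: a list of
    (role, instance_name) pairs.
    """
    counts = Counter(role for role, _name in instances)
    return sorted(role for role, count in counts.items() if count < minimum)

# Hypothetical deployment: two web instances, one database instance.
deployment = [("web", "web-0"), ("web", "web-1"), ("db", "db-0")]
print(roles_without_redundancy(deployment))
```

In this example only the "db" role is reported, because it is the only role with a single instance.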

Update domains (also called upgrade domains) let you group instances (units of deployment) so that an application stays up and running while the application or the guest operating system is being updated.

When upgrading a deployment, Windows Azure makes sure that the instances are upgraded one domain at a time. The steps are: stop the instances running in the first upgrade domain, upgrade the application, bring the instances back online, and then repeat these steps in the next upgrade domain. In this way Windows Azure ensures that an upgrade takes place with the least possible impact on the running service.

See more at: https://blogs.technet.com/b/yungchou/archive/2011/05/16/window-azure-fault-domain-and-update-domain-explained-for-it-.aspx#sthash.QqChyCiV.dpuf
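The domain-by-domain upgrade described above can be sketched in a few lines of Python. This is a conceptual simulation only, not how the Azure fabric is implemented: the instance names, the domain numbers, and the `upgrade_instance` callable (which stands in for "stop, upgrade, bring back online") are all hypothetical.

```python
from collections import defaultdict

def rolling_upgrade(instances, upgrade_instance):
    """Upgrade instances one upgrade domain at a time.

    `instances` is a list of (name, upgrade_domain) pairs and
    `upgrade_instance` is a callable standing in for "stop, upgrade,
    bring back online". Both are hypothetical, for illustration only.
    Returns a list of (domain, upgraded_names, names_still_serving).
    """
    by_domain = defaultdict(list)
    for name, domain in instances:
        by_domain[domain].append(name)

    steps = []
    for domain in sorted(by_domain):
        batch = by_domain[domain]
        # Instances in every other domain keep serving during this step.
        still_serving = [name for name, d in instances if d != domain]
        for name in batch:
            upgrade_instance(name)
        steps.append((domain, batch, still_serving))
    return steps

instances = [("web-0", 0), ("web-1", 1), ("web-2", 0), ("web-3", 1)]
upgrade_order = []
steps = rolling_upgrade(instances, upgrade_order.append)
for domain, batch, still_serving in steps:
    print(f"domain {domain}: upgrading {batch}, still serving {still_serving}")
```

With two upgrade domains, half of the instances remain online at every step, which is why a deployment needs at least two instances to stay available during an upgrade.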

Single-instance VMs receive advance notification of planned maintenance.

Service Healing

One major advantage of running virtual machines on Azure is that it can keep your VMs available even when there are problems. When Azure detects a problem with a node (this may include local network failures, local disk failures, or other rack-level failures), it proactively moves the VMs to new nodes so they are restored to a running and accessible state. This is called service healing. While service healing is in progress, you lose connectivity to the VM. After it completes and you connect to the VM again, you will likely find an event log entry indicating a VM restart or shutdown (either graceful or unexpected).

Data Center Outages

Service outages unfortunately sometimes happen with the cloud.

You can check the Dashboard for current or past outages in any of the data centers.

Additionally, when a service outage incident occurs that affects your apps, you will be able to see a notification in the Portal.

You will find alerts for the following types of incidents:

• Partial Performance Degradation

• Partial Service Interruption

• Full Performance Degradation

• Full Service Interruption

• Advisory

If you click OK within the notification window, you will see a dialog that provides more details about the incident(s).

This dialog will include key information such as the timestamp of the incident, the name of the service and the incident type, a description of the latest update related to the incident, and the SubscriptionID (where available) of the subscriptions you have that use the service in question. With this release, the SubscriptionID will be provided for incidents involving Virtual Machines, Cloud Services, Storage, SQL Databases, Service Bus and Web Sites. You may see “Not Available” for other services, but we are working to add these in future releases.

From this incident details dialog, you can navigate to the Operation Logs page by clicking on the link at the bottom of the dialog. This page will give you the filtered view of history for incidents that carry the same SubscriptionID information. This will allow you to see full details for every past incident involving this service (along with start and end times of the incidents).

Taken from: https://weblogs.asp.net/scottgu/azure-expressroute-dedicated-networking-web-site-backup-restore-mobile-services-net-support-hadoop-2-2-and-more

Network Issues

You might not be able to connect to your virtual machine in the cloud due to network issues. The cause can range from firewall misconfiguration within the virtual machine or network outages in the Azure data center to problems with proxies, routers, internet providers, VPN configuration, or DNS.

You can use the following steps to identify where the problem lies and to troubleshoot it:

If using VPN for connection

· Verify that the VPN tunnel is still up, both from the portal and on your VPN device. Try to access the VM using the VIP to bypass the VPN

· Run a trace route to the VM private IP and verify that the path goes through the VPN connection, for example: tracert 10.0.0.1

· If ping was previously enabled across the VPN tunnel and on the VM, verify that it still works

· When the issue is related to DNS or to a lack of internet access on the VM, and the DNS server is reached through the VPN tunnel, verify that there is communication with the DNS server and that it is responding to queries correctly. If possible, validate this communication using Wireshark or Netmon

If not using VPN

· Verify that the VM is responding to RDP requests using tools such as PsPing, PortQry, Telnet, or Nmap.

· Verify that your LAN allows traffic to the VM public ports. Some LANs block ports for security reasons; if possible, try a different connection

· Verify whether the public IP has changed; you may be using an out-of-date RDP file

· Verify whether you can RDP to the VM from another VM in Azure.
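If none of the tools mentioned above is at hand, a plain TCP connection attempt gives a similar answer. The following Python sketch is not a replacement for PsPing or PortQry; it only checks whether a TCP port accepts connections. The host name in the usage comment is a placeholder, and 3389 is the default RDP port, which may differ from the public endpoint port configured for your VM.

```python
import socket

def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds.

    Returns False on refusal, timeout, or name resolution failure,
    all of which Python reports as OSError subclasses.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Usage (placeholder host name, replace with your own endpoint):
#   tcp_port_open("myservice.cloudapp.net", 3389)
```

A False result from your LAN combined with a True result from another VM in Azure points at the local network rather than at the VM itself.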

Firewall Misconfiguration

There are several steps you can try if you suspect that the problem is in the firewall or RDP of the virtual machine:

· Use PowerShell to enable RDP or reset the password with the VM Agent (note: the VM Agent must be installed on the machine)

Enable RDP or Reset Password with the VM Agent

https://blogs.msdn.com/b/wats/archive/2014/03/06/enable-rdp-or-reset-password-with-the-vm-agent.aspx

· Or you can attach the OS disk to another VM in the same cloud service and perform any modifications needed, such as:

o Editing the registry to disable Network Level Authentication (NLA), which lets you see the logon screen when attempting an RDP logon so you can reset an expired password

o Disabling the guest firewall if you inadvertently blocked RDP or SSH.

o More generally, viewing the event logs to investigate a possible no-boot or hang issue.

Troubleshoot Azure VM by attaching OS disk to another Azure VM

https://social.technet.microsoft.com/wiki/contents/articles/18710.troubleshoot-azure-vm-by-attaching-os-disk-to-another-azure-vm.aspx

· Additionally, the following link describes generic steps you can try to restore the connection, such as restarting, resizing, or recreating the virtual machine

Troubleshooting Endpoint Connectivity (RDP/SSH/HTTP, etc. failures)

https://social.msdn.microsoft.com/Forums/windowsazure/en-US/538a8f18-7c1f-4d6e-b81c-70c00e25c93d/troubleshooting-endpoint-connectivity-rdpsshhttp-etc-failures?forum=WAVirtualMachinesforWindows

Application failure/misconfigurations within the Virtual Machine

Moving applications to the cloud requires some adaptation to the new infrastructure. Although this is not the topic of this document, downtime can sometimes be caused by inadequate configuration of applications.

2. Check List for Higher Availability

Avoiding Data Loss:

Temporary Disk D:

The temporary storage drive, labeled as the D: drive, is not persisted and is not saved in Windows Azure Blob storage. It is used primarily for the page file, and its performance is not guaranteed to be predictable. Management tasks such as changing the virtual machine size reset the D: drive. In addition, Windows Azure erases the data on the temporary storage drive when a virtual machine fails over. The D: drive is not recommended for storing any user or system database files, including tempdb.

On a Linux VM, the equivalent is “/dev/sdb1”.
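One simple way to catch this mistake early is to scan the file paths an application is configured to use and flag any that live on the temporary drive. The sketch below is only an illustration: the function name and the example paths are hypothetical, and it assumes the temporary drive is D:, as described above.

```python
from pathlib import PureWindowsPath

def paths_on_temporary_drive(paths, temp_drive="D:"):
    """Return the paths whose drive letter matches the temporary drive.

    PureWindowsPath parses Windows-style paths on any operating system,
    so this check can also run on a workstation or build server.
    """
    wanted = temp_drive.upper()
    return [p for p in paths if PureWindowsPath(p).drive.upper() == wanted]

# Hypothetical database file locations taken from an application config.
configured = [
    r"D:\SQLData\tempdb.mdf",
    r"F:\SQLData\app.mdf",
    r"d:\logs\app.ldf",
]
for path in paths_on_temporary_drive(configured):
    print(f"WARNING: {path} is on the non-persistent temporary drive")
```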

More information:

Understanding the temporary drive on Windows Azure Virtual Machines

https://blogs.msdn.com/b/wats/archive/2013/12/07/understanding-the-temporary-drive-on-windows-azure-virtual-machines.aspx

Availability Set

Configure multiple virtual machines in an Availability Set for redundancy and high availability. This is required to qualify for the SLA. More information:

Manage the availability of virtual machines

https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-manage-availability/

Load Balance

Azure infrastructure services provide load balancing capabilities that distribute client requests among the role instances, which also serve as backups in case the primary becomes unavailable.

Load Balancing for Azure Infrastructure Services

https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-load-balance/

Azure Traffic Manager

Distribute user traffic to similar hosted services within the same data center or in different data centers.

Traffic Manager applies an intelligent policy engine to the DNS queries on your domain names so that you can send traffic to the best data center for performance, business continuity, price, compliance, legal, or tax purposes.
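To make the idea of a DNS-level policy more concrete, here is a small conceptual sketch in Python of two such policies, failover and performance. It is not Traffic Manager's implementation or API: the endpoint names, priorities, latency figures, and the `pick_endpoint` function are all invented for illustration.

```python
def pick_endpoint(endpoints, method="failover"):
    """Choose the endpoint a DNS answer would point to.

    `endpoints` is a hypothetical list of dicts with the keys
    name, priority, latency_ms, and healthy. Unhealthy endpoints are
    never returned; None means nothing is available.
    """
    healthy = [e for e in endpoints if e["healthy"]]
    if not healthy:
        return None
    if method == "failover":
        # Lowest priority number wins among healthy endpoints.
        return min(healthy, key=lambda e: e["priority"])["name"]
    if method == "performance":
        # Lowest measured latency wins among healthy endpoints.
        return min(healthy, key=lambda e: e["latency_ms"])["name"]
    raise ValueError(f"unknown method: {method}")

# Hypothetical endpoints; the primary (priority 1) is currently down.
endpoints = [
    {"name": "west-europe", "priority": 1, "latency_ms": 90, "healthy": False},
    {"name": "north-europe", "priority": 2, "latency_ms": 95, "healthy": True},
    {"name": "east-us", "priority": 3, "latency_ms": 20, "healthy": True},
]
print(pick_endpoint(endpoints, "failover"))
print(pick_endpoint(endpoints, "performance"))
```

With the primary endpoint marked unhealthy, the failover policy falls back to the next priority, while the performance policy picks the endpoint with the lowest latency for this particular client.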

Traffic Manager Overview

https://msdn.microsoft.com/en-us/library/azure/hh744833.aspx

Mitigating Storage Failure

Because you might have several Azure services in the same storage account, Azure applies a number of resource throttles that you must take into account. If you exceed the throttle limit for a resource, further requests for that resource will result in an exception. Use multiple storage accounts when data or bandwidth exceeds quotas.

Additionally, a storage account can be a single point of failure, so it might be advisable to spread different services across different storage accounts depending on current needs.
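As a rough planning aid, the sketch below splits a list of VM disks into groups so that no group exceeds a per-account request budget. The two figures used as defaults (500 IOPS per disk and 20,000 IOPS per storage account) are assumptions for the example; check the scalability targets page linked below for the values that apply to your subscription. The function and disk names are hypothetical.

```python
def spread_disks(disks, iops_per_disk=500, account_iops_limit=20000):
    """Group disks so each group fits within one storage account's budget.

    Both limits are assumed planning figures, not values read from Azure.
    Returns a list of lists, one inner list per storage account needed.
    """
    per_account = account_iops_limit // iops_per_disk
    if per_account < 1:
        raise ValueError("a single disk exceeds the account limit")
    return [disks[i:i + per_account] for i in range(0, len(disks), per_account)]

# Hypothetical inventory of 100 VHDs.
disks = [f"vm{n:03d}-os.vhd" for n in range(100)]
groups = spread_disks(disks)
for index, group in enumerate(groups):
    print(f"storage account {index}: {len(group)} disks")
```

Under these assumed limits, 100 heavily used disks would need three storage accounts (40, 40, and 20 disks).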

Azure Storage Scalability and Performance Targets

https://msdn.microsoft.com/library/azure/dn249410.aspx

How to Monitor for Storage Account Throttling

https://blogs.msdn.com/b/wats/archive/2014/08/02/how-to-monitor-for-storage-account-throttling.aspx

Storage Geo Replication

Data in your storage account is replicated to ensure durability and high availability. You can learn more about locally redundant storage, zone-redundant storage, and geo-redundant storage at the following link:

Replicate your storage account data for durability and high availability

https://azure.microsoft.com/en-us/documentation/articles/storage-manage-storage-account/#georeplication

VM Backups

You can use the Azure Backup Agent to back up your virtual machines and set up a schedule:

Configure Azure Backup to quickly and easily back-up Windows Server

https://azure.microsoft.com/en-us/documentation/articles/backup-configure-vault/

Limits of the platform

To plan for future growth, it is good to take into account the limits of the Azure platform:

Azure Subscription and Service Limits, Quotas, and Constraints

https://azure.microsoft.com/en-us/documentation/articles/azure-subscription-service-limits/

Copy VHD

Use the Blob Copy API (Start-AzureStorageBlobCopy) to duplicate VM disks.

Create Backups of Virtual Machines in Windows Azure by using PowerShell

https://blogs.technet.com/b/heyscriptingguy/archive/2014/01/24/create-backups-of-virtual-machines-in-windows-azure-by-using-powershell.aspx

3. Recovery Scenarios

· Restart, resize (this operation moves the VM to a different node), or recreate endpoints

· Recreating a Virtual Machine from existing VHD

· Mounting disks to another VM – Windows

· Mounting disks to another VM – Linux

Recreating a Virtual Machine from existing VHD

The steps for deleting a VM and recreating from the original VHD are as follows:

1. In the Microsoft Azure Portal select the VM that you wish to recreate from VHD. Make sure to make a note of the VM name.

2. With the VM selected, click “Delete”. This will bring up the Delete menu.

3. Make sure to select “Keep the attached disks”

4. There will be a prompt to confirm the delete. Verify that it states, “The attached disks and their VHD files won’t be deleted from your storage account.”

5. After verifying that the disks will be kept, select “Yes”

6. The status of the VM will change to “Running (Deleting)”. Once the VM is deleted, it will be removed from the list

7. In order to recreate the VM, in the Management Portal, in the lower left, select “New”

8. Next select “From Gallery”

9. From the “Choose an Image” menu, select “MY DISKS”. This will bring up the list of available disks.

10. Select the disk with the VM name from Step 1.

11. On the right side of the “Create A Virtual Machine” window, note the Location. This is the region in which the VM must be deployed.

Mounting disks to another VM - Windows

Follow steps 1 to 6 from section Recreating a Virtual Machine from existing VHD

7. Attach the OS disk as a data disk to another VM that is up and running.

8. Connect to that VM and make sure the attached disk is visible.

Make any changes needed on the disk, such as changing registry settings.

9. Now detach the disk and create a new VM from the disk.

10. You can now log on to the VM and remove any firewall rules that you created

Mounting disks to another VM - Linux

A = Original VM (Inaccessible VM)

B = New VM (New Temp VM)

1) Stop VM A via the Azure management portal

2) Create a temporary VM in the same cloud service if you wish to retain the VIP. Alternatively, if you also want to delete the cloud service, just create a temporary VM

3) Delete VM A, but select “Keep the attached disks”

4) Once the lease is cleared, attach the disk from VM A as a data disk to VM B via the Azure Portal: Virtual Machines, select “B”, Attach Disk

5) On VM B, locate the drive you have attached

a. First, find the name of the drive to mount on VM B by looking in the relevant log file (note that each Linux distribution is slightly different)

grep SCSI /var/log/kern.log (ubuntu)

grep SCSI /var/log/messages (centos, suse, oracle)

Example: kernel: [ 9707.100572] sd 3:0:0:0: [sdc] Attached SCSI disk

b. You will not be able to mount the file system, so check and double-check that you are going to run fsck on the correct unmounted file system and not on one of your mounted file systems.

c. fdisk -l lists the attached disks; compare its output with the output of df -h

fdisk -l

Disk /dev/sdc: 32.2 GB, 32212254720 bytes

255 heads, 63 sectors/track, 3916 cylinders

Units = cylinders of 16065 * 512 = 8225280 bytes

Sector size (logical/physical): 512 bytes / 512 bytes

I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk identifier: 0x000c23d3

Device Boot Start End Blocks Id System

/dev/sdc1 * 1 3789 30432256 83 Linux

/dev/sdc2 3789 3917 1024000 82 Linux swap / Solaris

df -h

Filesystem Size Used Avail Use% Mounted on

/dev/sda1 29G 2.2G 25G 9% /

tmpfs 776M 0 776M 0% /dev/shm

/dev/sdb1 69G 180M 66G 1% /mnt/resource

d. sda1 and sdb1 are mounted and sdc1 is not, so we will run fsck against /dev/sdc1

fsck -yM /dev/sdc1

fsck from util-linux-ng 2.17.2

e2fsck 1.41.12 (17-May-2010)

/dev/sdc1: clean, 57029/1905008 files, 672768/7608064 blocks

6) Detach disk from VM B via the management portal

7) Recreate the original VM (Create VM from Gallery, select My Disks); you will see the disk referring to VM A
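The comparison in step 5 (partitions reported by fdisk -l versus file systems reported by df -h) is easy to get wrong by eye, so it can help to do it mechanically. The Python sketch below parses the text of both commands and returns the partitions that are not mounted and are not swap. It is a helper written for this article's sample output, not a general-purpose parser, and the function name is hypothetical; always confirm the device name yourself before running fsck.

```python
def unmounted_partitions(fdisk_output, df_output):
    """Return partitions listed by `fdisk -l` that `df -h` does not show.

    Swap partitions are skipped because fsck does not apply to them.
    Only lines whose first field starts with /dev/ are considered.
    """
    mounted = set()
    for line in df_output.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("/dev/"):
            mounted.add(fields[0])

    candidates = []
    for line in fdisk_output.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("/dev/"):
            continue  # skips headers such as "Disk /dev/sdc: 32.2 GB, ..."
        if "swap" in line.lower():
            continue
        if fields[0] not in mounted:
            candidates.append(fields[0])
    return candidates

# Sample output copied from the steps above.
FDISK_SAMPLE = """\
Disk /dev/sdc: 32.2 GB, 32212254720 bytes
Device Boot Start End Blocks Id System
/dev/sdc1 * 1 3789 30432256 83 Linux
/dev/sdc2 3789 3917 1024000 82 Linux swap / Solaris
"""
DF_SAMPLE = """\
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 29G 2.2G 25G 9% /
tmpfs 776M 0 776M 0% /dev/shm
/dev/sdb1 69G 180M 66G 1% /mnt/resource
"""
print(unmounted_partitions(FDISK_SAMPLE, DF_SAMPLE))
```

For the sample output this yields /dev/sdc1, which matches the partition chosen for fsck in step 5d.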

Maria Esteban Garcia
Windows Azure Support Engineer
