Running Mission Critical Solutions on Windows Azure

I wanted to follow up on my previous Common Tips post with something that covers how upgrades work and how to achieve maximum availability and scale, along with deployment and monitoring recommendations.

With Windows Azure there are a lot of options for how to manage and implement deployments, upgrades, scale and availability. Mission-critical applications require more effort and planning around which Azure features to leverage, based on their availability requirements.

Understanding Updates & Upgrades

There are three types of Update events that can occur on the Windows Azure Platform.

  • Application Upgrade
  • Guest OS Update
  • Host OS Update

An Application Upgrade occurs when you do an in-place upgrade of your application; the options for this are covered in the Deployment Process Recommendations section.

The Guest OS version is controlled via the Azure Service Configuration by setting the osFamily and osVersion attributes. The osFamily attribute can currently be 1 (Windows Server 2008 SP2) or 2 (Windows Server 2008 R2). The osVersion identifies a group of OS patches and updates released at a given time; by setting osVersion to “*” instead of an explicit version, OS patches will be installed automatically. For mission-critical applications you will have to determine whether having the latest patches, including fixes for security vulnerabilities, installed automatically is more of a risk to your solution than performing the upgrade manually. There are two ways to perform this update manually: either through the Service Configuration or through “Configure OS” within the Management Portal.
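As a minimal sketch (the service and role names are placeholders), a service configuration that pins the Guest OS to family 2 and lets patch versions roll forward automatically would look something like this:

    <?xml version="1.0" encoding="utf-8"?>
    <!-- ServiceConfiguration.cscfg: osFamily selects the OS family (1 = Windows Server 2008 SP2,
         2 = Windows Server 2008 R2); osVersion="*" opts in to automatic Guest OS updates.
         Replace "*" with an explicit version string if you want to control the update manually. -->
    <ServiceConfiguration serviceName="MyMissionCriticalService"
                          xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration"
                          osFamily="2"
                          osVersion="*">
      <Role name="WebRole1">
        <Instances count="2" />
      </Role>
    </ServiceConfiguration>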

The Host OS update occurs on the host system. Windows Azure tenants have no control over when this update happens. To maintain the 99.95% SLA, Windows Azure will not update all hosts at once, but performs host updates on hosts in different fault domains, which is why you must have at least 2 instances of each role for the guaranteed 99.95% SLA. The Scalability & Elasticity section goes into more detail on how you can ensure capacity during these updates.

Increasing Scalability & Elasticity

A lot of the time it is easy to choose the largest instance possible; surely adding more memory or CPU will increase the performance of your application? The problem is that most applications are not written specifically to leverage multiple CPU cores, and I have yet to see an application that actually needs the 14 GB of memory that the Extra Large VM provides.

Which is better: 4 Medium instances or 2 Large instances?

I would argue that 4 Mediums would be “better”, as they give you the elasticity to increase OR decrease the number of instances based on your load at any given time, while reducing cost by not over-provisioning resources. For example, if you have 2 Large instances, you couldn’t scale down without sacrificing the 99.95% SLA provided for roles that have at least 2 instances, and you couldn’t scale up without paying for an entire new Extra Large VM.

Also, by choosing Medium instances over Large instances, you can increase the number of update domains from 2 to 4, which gives you higher capacity availability during Host OS updates, Guest OS updates and application updates. For example, if a Host OS update occurs (which you cannot control) and you were using 2 Large instances, during that time frame your solution could only handle 50% of its maximum capacity. On the other hand, if you had used 4 Medium instances, you could choose to create 4 update domains, one for each instance, which means that during any update your solution would be able to handle 75% of its maximum capacity.

What is your solution’s tolerance for reduced capacity? Maybe 25%-50% reduced capacity for a short period is acceptable; if not, what are your options?

The simplest option is to create an additional update domain that runs the same number of instances as each of your other domains. In the example above of using 4 Medium instances, this would mean running a 5th update domain with 1 additional Medium instance.
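As a sketch of how that is wired up (service, role and endpoint names are placeholders), the number of update domains is set with the upgradeDomainCount attribute in the service definition, while the matching instance count lives in the .cscfg shown earlier:

    <?xml version="1.0" encoding="utf-8"?>
    <!-- ServiceDefinition.csdef: upgradeDomainCount controls how many update domains the
         role instances are spread across; vmsize selects the Medium instance size.
         Pairing upgradeDomainCount="5" with an instance count of 5 in the .cscfg gives
         one instance per update domain, matching the scenario described above. -->
    <ServiceDefinition name="MyMissionCriticalService"
                       xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition"
                       upgradeDomainCount="5">
      <WebRole name="WebRole1" vmsize="Medium">
        <Sites>
          <Site name="Web">
            <Bindings>
              <Binding name="HttpIn" endpointName="HttpIn" />
            </Bindings>
          </Site>
        </Sites>
        <Endpoints>
          <InputEndpoint name="HttpIn" protocol="http" port="80" />
        </Endpoints>
      </WebRole>
    </ServiceDefinition>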

Another option would be to consider autoscaling, or reducing the availability of features that are CPU and/or memory intensive. The Enterprise Library Integration Pack for Windows Azure includes WASABi, the Windows Azure Autoscaling Application Block. This application block allows you to create rules to scale up/down and to reduce resource-intensive features on the fly. Developers can download the application block from: https://www.microsoft.com/en-us/download/details.aspx?id=28189
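To give a rough feel for the kind of rules WASABi supports (the role name, operand alias and thresholds below are illustrative only, and the exact schema should be taken from the application block’s documentation), a rules store might combine a constraint rule that bounds the instance range with a reactive rule that scales on CPU load:

    <?xml version="1.0" encoding="utf-8"?>
    <!-- Illustrative WASABi rules store: the constraint rule keeps WebRole1 between 2 and 6
         instances; the reactive rule adds one instance when the 5-minute average CPU
         exceeds 80%. All names and thresholds are example values. -->
    <rules xmlns="http://schemas.microsoft.com/practices/2011/entlib/autoscaling/rules" enabled="true">
      <constraintRules>
        <rule name="Default" enabled="true" rank="1">
          <actions>
            <range min="2" max="6" target="WebRole1" />
          </actions>
        </rule>
      </constraintRules>
      <reactiveRules>
        <rule name="ScaleUpOnHighCpu" enabled="true" rank="2">
          <when>
            <greater operand="CpuAvg5m" than="80" />
          </when>
          <actions>
            <scale target="WebRole1" by="1" />
          </actions>
        </rule>
      </reactiveRules>
      <operands>
        <performanceCounter alias="CpuAvg5m" source="WebRole1"
                            performanceCounterName="\Processor(_Total)\% Processor Time"
                            aggregate="Average" timespan="00:05:00" />
      </operands>
    </rules>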

Increasing Availability with Geo-Redundancy

With two fault domains in a single datacenter and multiple upgrade domains, you are set up nicely to handle a failure within that datacenter. But what about a complete datacenter failure? As with a failure inside the datacenter, you have to decide what your tolerance is for a complete datacenter failure. If your tolerance is low, then you should consider deploying your application into two datacenters; if you are not using autoscaling-type features, this means running a complete duplicate of all of your instances in a secondary datacenter so that you can run at 100% of the capacity you had prior to the failure. Once you have your solution deployed into two datacenters, you can leverage the Traffic Manager feature to create a ‘Failover’ policy for your secondary deployment.

It sounds simple, right? Well, there are other considerations when planning for a complete failover, such as whether you are leveraging any other Windows Azure features like Storage, Service Bus, Caching or Access Control. Since these services are datacenter dependent, in the event of a failure they may not be available.

This is not a simple task and will take additional development and planning. While I’ve geared this section towards a complete datacenter failure, the same approach could be leveraged for a specific service failure. For example, if your solution leverages Table Storage and the storage service in your application’s datacenter has a failure, your Windows Azure compute instances will still be running, but any features leveraging storage will not be available, and depending on the feature this could be a critical part of the application.
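One concrete implication: the storage accounts and connection strings your roles use are tied to a specific datacenter, so a failover plan typically means provisioning equivalent accounts in the secondary datacenter and pointing the secondary deployment’s configuration at them. A minimal sketch, with entirely hypothetical setting and account names:

    <!-- Fragment of the secondary deployment's ServiceConfiguration.cscfg: the same setting
         name as the primary deployment, but pointed at a storage account provisioned in the
         secondary datacenter. The account name and key are placeholders. -->
    <Role name="WebRole1">
      <Instances count="4" />
      <ConfigurationSettings>
        <Setting name="DataConnectionString"
                 value="DefaultEndpointsProtocol=https;AccountName=mysecondarystorage;AccountKey=..." />
      </ConfigurationSettings>
    </Role>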

Deployment Process Recommendations

With Windows Azure there are multiple options for deploying applications. You can use Visual Studio, the Management Portal, or something custom built on the Service Management API. There isn’t one way that is better than another, but you should deploy your application in a consistent manner across multiple environments: Development, Testing, Staging and Production.

If it is possible within your team or organization, you should consider automated deployments. This option takes additional time, but allows you to deploy consistently. Automated deployments can be achieved in various ways, but currently require a custom development effort. The largest benefit of automated deployments is that operations teams are able to rapidly re-deploy services in other datacenters in the event of a catastrophic datacenter failure. For thoughts on how an automated deployment could be implemented, read “Automating Deployment and Using Windows Azure Storage” in Moving Applications to the Cloud, written by Microsoft’s patterns & practices team: https://msdn.microsoft.com/en-us/library/ff803365

If automated deployment isn’t feasible, you should consider leveraging the Windows Azure PowerShell Cmdlets, available for download here: https://wappowershell.codeplex.com/

These PowerShell cmdlets provide a consistent management experience, in line with other Microsoft products that leverage PowerShell as a management interface.

Another common question is “Should I use a VIP swap or leverage an in-place upgrade?” With the 1.5 Windows Azure SDK, in-place upgrade features are on par with a VIP swap. So what is the advantage of using a VIP swap? The advantage is the ability to follow a process and smoke test your application prior to it being published “live”. If you were to use an in-place upgrade, once it completes the application is “live”. Mistakes happen in IT; what if someone published the wrong version? If you used a VIP swap, you could smoke test the application first, and even after you perform the VIP swap, you could revert back to the previous version instantaneously!

Solution Monitoring

Even if you are using an autoscaling framework such as WASABi, monitoring is an important aspect. Monitoring will assist in diagnosing issues within your application and in knowing key metrics about your application’s performance, errors and load. While Windows Azure can log Windows events, performance counters and trace logs to your storage account, there is no “out of the box” solution to view this data in a user-friendly, graphical format.
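For example, a minimal diagnostics configuration (a diagnostics.wadcfg file placed in the role’s directory) that ships a CPU counter and Application event log errors to storage on a schedule could look roughly like the sketch below; the quotas, sample rates and transfer periods are placeholder values to tune for your solution:

    <?xml version="1.0" encoding="utf-8"?>
    <!-- diagnostics.wadcfg sketch: transfers the processor counter and Application event
         log errors to the diagnostics storage account every five minutes. -->
    <DiagnosticMonitorConfiguration
        xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration"
        configurationChangePollInterval="PT1M"
        overallQuotaInMB="4096">
      <PerformanceCounters bufferQuotaInMB="512" scheduledTransferPeriod="PT5M">
        <PerformanceCounterConfiguration
            counterSpecifier="\Processor(_Total)\% Processor Time"
            sampleRate="PT30S" />
      </PerformanceCounters>
      <WindowsEventLog bufferQuotaInMB="512"
                       scheduledTransferLogLevelFilter="Error"
                       scheduledTransferPeriod="PT5M">
        <DataSource name="Application!*" />
      </WindowsEventLog>
    </DiagnosticMonitorConfiguration>

Once the data is in your storage account, a tool such as the management pack described below can surface it in a graphical way.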

System Center Operations Manager offers a Windows Azure Management Pack, which allows you to view alerts for events and performance metrics of your applications in a friendly manner.

The Windows Azure Management Pack is available for download: https://www.microsoft.com/en-us/download/details.aspx?id=11324

Disaster Recovery

While it is true that Windows Azure stores your Storage and SQL Azure data in triplicate, this isn’t something you can leverage to restore your own data on demand. Those copies are used by the Windows Azure teams in case of a catastrophic failure so they can restore an entire datacenter, not necessarily to recover data for individual Windows Azure tenants. While this provides some relief, it is best to have backups of your application, configuration and data so that you can restore into another Azure datacenter manually. While creating backups of Storage data would be a custom development effort, relational data stored in SQL Azure can be backed up to and restored from a file using the Import/Export feature in Windows Azure.