Authors: Lisa Erickson (Symantec), Konstantin Dotchkoff (Microsoft)
One of the fundamental promises of cloud computing is the cost benefit of economies of scale. Running applications in a public cloud is typically less expensive than running them in an on-premises environment, but there is even more potential for cost reductions. Because cloud computing makes the cost of running an application very transparent, developers and operations teams tend to iterate on optimizations to further reduce operating costs. Developers can compare the cost of different implementation and design options to understand how to tune an app for cost, and using dynamic auto-scaling to respond to changes in demand is common practice as well. The ultimate goal is to use exactly the resources that are needed and thus to pay for only what you really need.
Obviously, the potential for savings depends on the workload profile. Disaster recovery (DR) is an interesting case because, most of the time, a DR site and its resources are not actively used; only in the case of a disaster is the secondary site fully needed. Using Azure as a DR site is an attractive option since you generally pay only for what you use and inbound data transfers are free. This allows you to greatly reduce the expenses for a DR site (in addition to shifting CapEx to OpEx).
Let’s take a look at an example of a disaster recovery solution and some of the design considerations applied to achieve cost optimizations. Symantec, with deep expertise and a long tradition in delivering enterprise high availability and disaster recovery solutions, partnered with Microsoft to provide a cost-effective solution for DR to Microsoft Azure. Symantec Disaster Recovery Orchestrator (DRO) enables businesses to automate and manage the takeover and failback of Windows-based applications residing on either physical or virtual machines (VMs) to Microsoft Azure. The solution is application-centric and provides fully automated, orchestrated, end-to-end application recovery.
In a traditional DR approach, the infrastructure required for the DR site is provisioned up front, and typically a 1:1 replication relationship, or pairing, between the source and target resources is established. Some businesses may instead decide to keep “cold” standby capacity for DR, but this greatly increases the recovery time objective (RTO) and still incurs costs for the procured hardware.
On the other hand, in a public cloud environment you pay only for what you use. In Azure, VMs are billed per minute and there are no charges for stopped (deallocated) VMs. This influenced the architecture of the solution: application data is replicated to the DR site without all application recovery VMs being up and running, since they are not needed until a takeover event occurs. The target VMs can be provisioned when needed for takeover, providing just-in-time disaster recovery. The replication streams for protected applications terminate at a central component running in Azure, which acts as a consolidated replication target. However, it is necessary a) to keep the data separated for each application and b) to give each application recovery VM access to its application data in case of disaster. To fulfill these requirements, the consolidated replication target runs within a separate Azure VM (the Controller VM in the graphic below) that writes replicated data to attached target disks, with a minimum of one disk per application, as shown below:
As you can see from the graphic, application recovery VMs are not actively running in this passive state. Those VMs are provisioned during the initial configuration of a protected application, but are removed, with the system disk retained, at the end of the configuration process. Upon a takeover operation, each application recovery VM is recreated from its saved system disk, and the application is recovered by detaching the replicated data disks from the Controller VM and attaching them to the newly created application recovery VM.
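To make the disk handling concrete, here is a minimal sketch of the consolidated-replication-target idea: one Controller VM holds dedicated data disks per protected application, and on takeover those disks are handed to the application's recovery VM. All names here are hypothetical illustrations, not the actual DRO implementation.

```python
from dataclasses import dataclass, field

@dataclass
class DataDisk:
    name: str
    attached_to: str  # VM the disk is currently attached to

@dataclass
class ControllerVM:
    """Consolidated replication target: one VM, separate disks per application."""
    name: str = "controller"
    app_disks: dict = field(default_factory=dict)

    def protect(self, app: str, disk_count: int = 1) -> None:
        # Attach dedicated replication target disk(s) for this application,
        # keeping each application's data separate from other applications'.
        self.app_disks[app] = [
            DataDisk(f"{app}-data-{i}", attached_to=self.name)
            for i in range(disk_count)
        ]

    def hand_over(self, app: str, recovery_vm: str) -> list:
        # On takeover: detach the application's disks from the controller
        # so they can be attached to the just-created recovery VM.
        disks = self.app_disks.pop(app)
        for d in disks:
            d.attached_to = recovery_vm
        return disks

controller = ControllerVM()
controller.protect("payroll", disk_count=2)   # passive state: disks on controller
moved = controller.hand_over("payroll", recovery_vm="payroll-recovery")
print([(d.name, d.attached_to) for d in moved])
```

The key point the sketch captures is that only the Controller VM runs (and bills) during normal operation; the per-application disks persist independently of any recovery VM.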
(Note: At the time of writing, up to 16 data disks of up to 1 TB each can be attached to a VM; if necessary, multiple DRO instances can be used for greater scale.)
In addition to the replication engine, the Controller VM also hosts the solution's automation and orchestration components and its management portal. All these components are deployed on the Controller VM to keep the number of VMs, and the associated costs, to a minimum.
Here are the detailed tasks the Disaster Recovery Orchestrator will handle during an application takeover to the DR site:
- Offline the on-premises application (when it is available and running)
- Pause the replication and wait for it to reach an up-to-date state
- Provision the application recovery VM in Azure
- Detach the replicated disks for the application from the Controller VM and attach them to the application recovery VM
- Update the replication primary to the application recovery VM (this reverses the replication direction, from Azure to the on-premises target, when it is available)
- Online the application in Azure
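The takeover steps above can be sketched as a simple state transition. The following is an illustrative simulation of the sequence, not the real orchestration code; the state keys are hypothetical names chosen for this example.

```python
def takeover(state: dict) -> dict:
    """Simulate the DRO takeover steps on a simple state dictionary."""
    # 1. Offline the on-premises application (when available and running)
    if state["onprem_available"] and state["app_location"] == "onprem":
        state["app_location"] = None
    # 2. Pause replication; assume it has reached an up-to-date state
    state["replication"] = "paused"
    # 3. Provision the application recovery VM from the retained system disk
    state["recovery_vm"] = "recreated-from-saved-system-disk"
    # 4. Detach the replicated data disks from the Controller VM and
    #    attach them to the recovery VM
    state["data_disks_attached_to"] = "recovery_vm"
    # 5. Make the recovery VM the replication primary, reversing the
    #    replication direction (Azure -> on-premises, once available)
    state["replication"] = "azure-to-onprem"
    # 6. Bring the application online in Azure
    state["app_location"] = "azure"
    return state

before = {
    "onprem_available": True,
    "app_location": "onprem",
    "replication": "onprem-to-azure",
    "recovery_vm": None,
    "data_disks_attached_to": "controller",
}
after = takeover(before)
print(after["app_location"], after["replication"])  # -> azure azure-to-onprem
```

Note the ordering: the data disks are moved only after replication is paused and up to date, so the recovery VM starts with a consistent copy of the application data.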
The following graphic shows the configuration after a takeover operation in a DR case:
After the takeover, the application VMs are running with the application data disks attached to them. The replication direction has been reversed, from the application VMs in Azure to the application nodes on-premises; once the primary site becomes available, DRO synchronizes the sites to enable a failback operation.
This example demonstrates how the solution design was optimized for cost reduction. Symantec Disaster Recovery Orchestrator leverages a consolidated replication target as the central component for applications that need to be recovered in the cloud. This approach reduces the number of VMs needed in Azure while providing the flexibility of just-in-time recovery of an application in the cloud.