SQL Server VM Disaster Recovery between AZURE and AMAZON

Some days ago, a recent article from former Microsoft employee Michael Washam (http://michaelwasham.com) captured my attention:

Connecting Clouds – Creating a site-to-site Network with Amazon Web Services and Windows Azure

http://michaelwasham.com/2013/09/03/connecting-clouds-site-to-site-aws-azure

Wow! Today we cannot (yet! :-) ) have an Azure Virtual Network/VPN spanning more than one Azure datacenter, but we can have a Virtual Network/VPN spanning two different Cloud providers... Awesome!

My mind immediately went to the new high availability and disaster recovery scenarios this makes possible, such as building a solution that is not tied to a single Cloud Provider. Working with partners on several Azure projects, I have heard this kind of request several times: they want to ensure that at least Disaster Recovery (DR), and maybe also High Availability (HA), can be achieved even if a single Cloud Provider fails completely.

Reading the article, the procedure is pretty simple:

  • Create a Virtual Private Cloud (VPC) on Amazon;
  • Create a Virtual Network (VNET) on Azure with a Gateway;
  • Deploy a Linux VM in the Amazon VPC to host the OpenSwan VPN software and configure it to connect to the Azure VNET Gateway.
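
A site-to-site tunnel can only route traffic if the address spaces on the two sides do not overlap, so it is worth a quick sanity check before creating the VPC and the VNET. Here is a minimal sketch in Python; both CIDR ranges are hypothetical values of mine, not taken from the article:

```python
import ipaddress


def ranges_overlap(cidr_a: str, cidr_b: str) -> bool:
    """Return True if the two CIDR blocks share at least one address."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


# Hypothetical address spaces, chosen only for illustration.
AMAZON_VPC_CIDR = "10.0.0.0/16"
AZURE_VNET_CIDR = "10.1.0.0/16"

if ranges_overlap(AMAZON_VPC_CIDR, AZURE_VNET_CIDR):
    print("Address spaces overlap: pick different ranges before building the VPN")
else:
    print("Address spaces are disjoint: routing between the two sides can work")
```

Replace the two constants with the ranges you plan to use; the check relies only on the standard `ipaddress` module.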

NOTE: OpenSwan is a complete IPsec implementation for Linux; for more information, see https://www.openswan.org/projects/openswan .

The overall configuration process is simple, but there are some caveats:

  • Even if OpenSwan, configured as in the article, seems to satisfy all the technical requirements for an Azure Virtual Network Gateway connection, it’s not officially supported by Microsoft: obviously, you will not be able to open a Support Case with Microsoft complaining that OpenSwan is not working;
  • For a list of Azure gateway requirements and supported VPN devices, see the links below:

http://msdn.microsoft.com/en-us/library/windowsazure/jj156075.aspx#BKMK_VPNGateway

http://msdn.microsoft.com/en-us/library/windowsazure/jj156075.aspx#bkmk_VPNDevice

  • While in Azure the VPN endpoint is highly available, since it is backed by TWO distinct (and hidden) Azure VMs, the architecture described in the article above has a single point of failure in the OpenSwan server: I don’t know whether that software supports some kind of HA, but you should definitely investigate and evaluate it.

But wait a moment: why do I have to use OpenSwan and Linux in the Amazon VPC, since it’s not officially supported by Azure? You can use a Windows Server 2012 VM with its RRAS feature, and that’s it! It’s officially supported, as you can read at the link below:

http://msdn.microsoft.com/en-us/library/windowsazure/jj156075.aspx#bkmk_VPNDevice

IMPORTANT: To the best of my knowledge, there is no way to make Windows Server 2012 RRAS highly available, so in this case too the proposed solution is more suitable for DR purposes than for HA.

 

OK, now that you know the whole story, which HA/DR scenarios can we build? Since I’m still a SQL Server guy, let me focus on SQL Server (in Azure IaaS VMs) for the sake of simplicity.

The starting point is the white-paper below, which covers all the possible HA/DR scenarios but does not consider what we are discussing in this blog post:

High Availability and Disaster Recovery for SQL Server in Windows Azure Virtual Machines

http://msdn.microsoft.com/en-us/library/jj870962.aspx 

Specifically, I’m interested in using SQL Server 2012 AlwaysOn Availability Groups (AG) to implement a DR scenario between AMAZON and AZURE, like the one below:

 

 

Here are my considerations:

  • Since all AG nodes must be part of the same Windows Cluster, Active Directory connectivity is required for the node in AMAZON too. In the picture above, I also placed a Domain Controller in the AMAZON VPC: for high-availability and performance reasons, it’s highly recommended to place at least one Domain Controller per Cloud provider;
  • Please note that all 3 nodes are part of the same Windows Cluster: the quorum model used is “Node Majority” since we have an odd number of nodes;
  • As on-premises, the quorum votes should be adjusted for the secondary DR site (AMAZON in my example picture above); for details, see the section “Quorum Model and Node Votes” in the white-paper mentioned at the end of this post;
  • The SQL Server AG replica in AMAZON should be configured for asynchronous replication (which allows data loss) and therefore for manual rather than automatic failover, due to the network latency; if you require zero data loss, you can switch it to a synchronous replica, but be sure to test the performance impact carefully;
  • The two nodes on the Azure side should be configured for synchronous replication and automatic failover;
  • Be aware of the costs: here you are paying for Gateway traffic on the Azure side; obviously, there are additional costs also on the AMAZON side;
  • Be aware of the bandwidth: I don’t know about AMAZON, but on the AZURE side there is a limit of approximately 60MB/sec, because the Azure VMs used to implement the VPN Gateway are “SMALL” sized;
  • Finally, I used AZURE as the primary cloud provider and AMAZON as the secondary; obviously you can do the converse, but I prefer to assume AZURE will have higher availability :-).
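
To make the replica settings above more concrete, here is a small Python sketch that only builds the T-SQL statements you would run on the primary. The AG and server names are hypothetical, and I am assuming (as far as I know this is how `ALTER AVAILABILITY GROUP ... MODIFY REPLICA` behaves) that each statement changes one option at a time and that automatic failover is only valid together with synchronous commit, which is why the order of the two statements matters:

```python
from typing import List


def replica_settings_tsql(ag_name: str, replica: str, synchronous: bool) -> List[str]:
    """Build the ALTER AVAILABILITY GROUP statements for one replica.

    Synchronous replicas get automatic failover, asynchronous ones manual.
    When moving to asynchronous, failover is switched to manual first;
    when moving to synchronous, the availability mode is changed first.
    """
    availability = "SYNCHRONOUS_COMMIT" if synchronous else "ASYNCHRONOUS_COMMIT"
    failover = "AUTOMATIC" if synchronous else "MANUAL"
    prefix = f"ALTER AVAILABILITY GROUP [{ag_name}] MODIFY REPLICA ON N'{replica}' WITH"
    mode_stmt = f"{prefix} (AVAILABILITY_MODE = {availability});"
    failover_stmt = f"{prefix} (FAILOVER_MODE = {failover});"
    return [mode_stmt, failover_stmt] if synchronous else [failover_stmt, mode_stmt]


# Hypothetical names, used only for illustration.
AG_NAME = "AG1"
REPLICAS = {"AZURE-SQL1": True, "AZURE-SQL2": True, "AMAZON-SQL1": False}

for server, is_sync in REPLICAS.items():
    for statement in replica_settings_tsql(AG_NAME, server, is_sync):
        print(statement)
```

The script does not connect to SQL Server: it prints the statements so you can review them and run them yourself in your own environment.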

Now, what will happen in case of a complete AZURE or AMAZON failure?

In the scenario proposed in the picture, a complete AMAZON failure will not affect the AZURE side of the architecture at all, and SQL Server will remain up and available. Conversely, in case of a complete AZURE failure, the Windows Cluster will not have the necessary quorum to remain online, so it will shut down and SQL Server will not be available. This is expected in a DR scenario: manual intervention will be required to force the surviving node on the AMAZON side to start and to make the SQL Server AG perform a forced failover (with potential data loss).
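
The quorum reasoning behind these two outcomes can be sketched in a few lines of Python. This is only a simplified model of “Node Majority” with one vote per node (node names are hypothetical, and the vote adjustment for the DR site is not included):

```python
from typing import Dict, Set


def has_quorum(votes: Dict[str, int], online: Set[str]) -> bool:
    """Node Majority: the online nodes must hold more than half of all votes."""
    total_votes = sum(votes.values())
    online_votes = sum(v for node, v in votes.items() if node in online)
    return online_votes > total_votes / 2


# Hypothetical node names, one vote each.
votes = {"AZURE-SQL1": 1, "AZURE-SQL2": 1, "AMAZON-SQL1": 1}

amazon_down = {"AZURE-SQL1", "AZURE-SQL2"}  # only the Azure nodes survive
azure_down = {"AMAZON-SQL1"}                # only the Amazon node survives

print(has_quorum(votes, amazon_down))  # True: 2 votes out of 3
print(has_quorum(votes, azure_down))   # False: 1 vote out of 3
```

With AZURE down, the surviving node holds 1 vote out of 3, so the cluster has to be started manually with forced quorum. Only then can the AG be failed over on the AMAZON replica, for example with `ALTER AVAILABILITY GROUP [AG1] FORCE_FAILOVER_ALLOW_DATA_LOSS;` (where `AG1` is again a hypothetical name).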

If you are interested in the recovery steps at the Windows Server Cluster and SQL Server 2012 AG level, look at the white-paper below (section “Recovering from a Disaster”):

AlwaysOn Architecture Guide: Building a High Availability and Disaster Recovery Solution by Using Failover Cluster Instances and Availability Groups

http://msdn.microsoft.com/en-us/library/jj215886.aspx

That’s all folks... I would like to know your opinion and, if you have any, your experience implementing this kind of scenario.

Regards.