SQL Server 2014 High-Availability and Multi-Datacenter Disaster Recovery with Multiple Azure ILBs

Article
12/22/2014

I already posted in my blog several articles related to high-availability (HA) and disaster recovery (DR) for SQL Server in Azure Virtual Machines (IaaS VM), but this time I’m going to add something new to what you may already know, then let me recap the situation. Today, if you want HA in Azure for SQL Server, you have to use AlwaysOn Availability Group (AG) since SQL Server Mirroring is a deprecated feature and Failover Clustering (AlwaysOn FCI) is not supported yet, due to the lack of shared storage being possible in Azure VMs.

Prolog

The first scenario I covered, more than one year ago, was related to implementing a simple AlwaysOn Availability Group, with two SQL Server instances in a single Azure datacenter and no support for Azure Internal Load Balancer (ILB):

SQL Server 2012 AlwaysOn Availability Group and Listener in Azure VMs: Notes, Details and Recommendations

http://blogs.msdn.com/b/igorpag/archive/2013/09/02/sql-server-2012-alwayson-availability-group-and-listener-in-azure-vms-notes-details-and-recommendations.aspx

As you can easily realize, this architecture presents two major weak points:

No DR site for protection from a complete Azure datacenter loss;
An internet facing endpoint for SQL Server AlwaysOn Availability Group (AG) listener must be exposed;

The second scenario that I worked on, with the possibility to connect Azure Virtual Networks (VNETs) in different regions, including a DR site for Geo-Disaster Recovery over a second Azure datacenter:

Deep Dive: SQL Server AlwaysOn Availability Groups and Cross-Region Virtual Networks in Azure

http://blogs.msdn.com/b/igorpag/archive/2014/07/03/deep-dive-sql-server-alwayson-availability-groups-and-cross-region-virtual-networks-in-azure.aspx

At this time, Azure Internal Load Balancer (ILB) was not supported yet, and then you had to expose a SQL Server endpoint over the Internet, and use the Cloud Service Virtual IP (VIP) to provide access to SQL Server instances through the AG Listener.

Architecture

Finally, Microsoft recently announced support for Azure ILB usage for AlwaysOn AG Listener, that is the last missing piece to have the perfect HA and Geo-DR architecture. This brings me to the third (and last for now) scenario that I recently implemented for one of my partners:

Let me recap here the main architectural choices and points of attention in Azure:

No public endpoint exposed over the Internet for SQL Server: in both the primary and the secondary sites I used Azure ILBs (1 for each site), then only accessible from services and VMs in the same VNETs used here.
- Be aware that only one ILB per Cloud Service can be used.
I created one VNET in the primary Azure datacenter and one VNET in the secondary Azure datacenter, then I connected them using Azure VPN.

- In this this specific scenario, high-performance Azure VPN Gateway has been used since more than enough in term of supported bandwidth, be sure to review its characteristics before deciding to adopt it:

Azure Virtual Network Gateway Improvements

http://azure.microsoft.com/blog/2014/12/02/azure-virtual-network-gateway-improvements

- If this VPN Gateway will not satisfy your bandwidth requirements, you need to consider Azure Express Route as indicated in the link below:

ExpressRoute or Virtual Network VPN – What’s right for me?

http://azure.microsoft.com/blog/2014/06/10/expressroute-or-virtual-network-vpn-whats-right-for-me

In the primary Azure datacenter, I installed two SQL Server 2014 VMs in the same Cloud Service (CS) and Availability Set (AS) to ensure 99,95% HA as requested by Azure Service Level Agreement (SLA). I also installed them in the same VNET and same subnet. Each SQL VM uses static IP address inside the VNET.

Manage the availability of virtual machines

http://azure.microsoft.com/en-us/documentation/articles/virtual-machines-manage-availability

For the above SQL Server instances, I configured synchronous data replication with automatic failover: in this way, in case of a single SQL Server VM failure, AlwaysOn AG will take over and failover to the second SQL Server VM with no data loss and no manual intervention:

NOTE: there is nothing here preventing you to allow readable secondaries, you can change this AlwaysOn configuration setting dynamically without any service interruption.

In the secondary Azure datacenter, I installed a third single SQL Server VM in its own CS, AS, VNET and subnet, using a static IP address: this is the minimum requirement, to have an effective DR solution, but if you want to have more protection you can also install a fourth SQL Server VM here to maintain HA also in the case of complete primary datacenter loss.
As you can see in the picture above, I used asynchronous data replication for the third SQL Server instance for the following reasons:
- SQL Server 2014 allows only 3 sync replicas (1 primary + 2 secondaries) but only 2 instances for automatic failover;
- Enabling synchronous data replica between remote datacenters will hurt the database performances on the primary instance since each transaction must be also committed by the remote SQL Server instance;

IMPORTANT: Async data replication means possible data loss (RPO>0) in case of a complete primary datacenter loss. If you want zero data loss (RPO=0), you should configure synchronous data replication also for the SQL Server instance in the DR site, but it’s highly recommended to test the performance impact of network latency between the two remote Azure datacenters. Be also aware that automatic failover between datacenters is not possible today with SQL Server 2014 (RTO>0).

All the above SQL Server instances are part of the same AlwaysOn AG and Cluster. Since it is a requirement for Cluster and AlwaysOn AG, all the VMs in both datacenters are part of the same Active Directory (AD) Domain: for this reason I installed two Domain Controllers (DCs) in the primary datacenter (same own Cloud Service, VNET, subnet and Availability Set) and one DC in the secondary datacenter (separate Cloud Service, VNET, subnet and Availability Set). Each DC is also a DNS Server and uses static IP address inside the VNET. The DNS zone used by Active Directory should be “Active Directory Integrated”, that is the DNS zone data will be replicated using Active Directory replication to all DCs/DNS.

In order to avoid the “Island” problem in DNS replication and name resolution, I strongly recommend you to adopt the following settings for DNS clients on all the VM, including the DCs:
- For all the VMs in the primary VNET:
  - Primary DNS Server = the first DC/DNS in the primary VNET;
  - Secondary DNS Server = the second DC/DNS in the primary VNET;
- For all the VMs in the secondary VNET:
  - Primary DNS Server = one of the DC/DNS in the primary VNET;
  - Secondary DNS Server = the DC/DNS in the secondary VNET

To complete the Cluster configuration required for this scenario, I removed the quorum vote for the SQL Server VM in the secondary DR site and created a new VM in the primary datacenter (same Cloud Service, VNET, subnet and Availability Set as for SQL Server VMs): no SQL Server installed in this VM, its only purpose is to be the Cluster Witness and provide a cluster vote to reach the quorum and then ensure Cluster healthy state in case of secondary datacenter loss.

Configure and Manage the Quorum in a Windows Server 2012 Failover Cluster

http://technet.microsoft.com/en-us/library/jj612870.aspx

Finally, I used Network Security Groups (NSG) to harden the security configuration and have strict control over the possible network communications between different subnets in both VNETs:

Network Security Groups

http://azure.microsoft.com/blog/2014/11/04/network-security-Groups

Since I used Azure ILB, then an internal non-internet facing IP (DIP), and due to the network restrictions on supporting AlwaysOn AG in Azure, I installed the application VMs in a different Cloud Service (CS), subnet and Availability Set (AS), but inside the same VNET in each site. This is necessary in order for the application to be able to access SQL databases over the Azure AG Listener.
Azure Storage accounts are an important part here since all the VM OS disks and additional data disks must reside on persistent Azure Blob storage. If for some (or all) VMs you are going to use multiple disks, as it’s likely to happen for SQL Server VMs, it’s highly recommended to do not use Azure geo-replication for storage accounts since not supported. Each disk is a blob and storage geo-replication in Azure is asynchronous and can ensure write ordering (then consistency) only at the single blob level, not between multiple blobs (= disks). Please note that geo-replication (GRS) is enabled by default when creating a new Azure storage account, be sure to use “Locally Redundant” (LRS) instead:

Azure Storage Redundancy Options

http://msdn.microsoft.com/en-us/library/azure/dn727290.aspx

In addition to Azure storage account replication mode, you also need to carefully consider how many storage accounts you need to use. For Azure Standard Storage, there is a global limit of 20K IOPS, this means a maximum number of 40 disks 1TB each (500 IOPS per disk). Depending on the VM size, you can use up to a certain amount of disks, then since you have to accommodate at least two SQL Server VMs, in addition to the Domain Controllers (DCs) and the Witness VM, be sure to do your maths correctly and eventually use more than one storage account in each Azure datacenter. In the first link below, Azure Premium Storage (currently in preview) is also mentioned: with this option, you can have up to 5K IOPS per single disk and up to 50K IOPS per VM.

Azure Storage Scalability and Performance Targets

http://msdn.microsoft.com/library/azure/dn249410.aspx

Virtual Machine and Cloud Service Sizes for Azure

http://msdn.microsoft.com/en-us/library/azure/dn197896.aspx

Inside SQL Server VMs, I used the maximum amount of disks permitted by the specific VM size, then I used Windows Server 2012 R2 (Guest OS) “Storage Pool” technology to group together all the physical disks and present a unique logical volume with increased IOPS, more details are reported at the link below:

Best Practices & Disaster Recovery for Storage Spaces and Pools in Azure

http://blogs.msdn.com/b/igorpag/archive/2014/06/10/best-practices-amp-disaster-recovery-for-storage-spaces-and-pools-in-azure.aspx

Multiple ILBs Configuration

The SQL team already published a step-by-step procedure on how to create an Azure Listener for AlwaysOn AG using Azure Internal Load Balancer (ILB), let’s use the article below as a starting point:

Tutorial: Listener Configuration for AlwaysOn Availability Groups

http://msdn.microsoft.com/en-us/library/azure/dn425027.aspx

This article describe how to setup a Listener using ILB but only for a single datacenter, then using a single ILB. Here is what you have to do to build a multi-datacenter configuration using one ILB for each site:

Step[2.9]: Execute first for the primary datacenter, then repeat again for the secondary datacenter. Be sure to include for each VNET only the VMs allocated inside it, and obviously be sure to sure different Cloud Services, ILB names, Subnet namez and static IPs:

Step[5.6]: Here you need to have 2 IP addresses, one for each VNET, be sure to configure the dependencies of the CAP (Client Access Point) network name on these IPs using “OR” and not “AND”:

Step[5.11]: Execute for each VNET and related SQL VM Cloud Service:
- For the Cluster network name, be sure to use the right network name for your VNET as shown in the Failover Cluster Manager:

- For the IP Resource Name, be sure to use the one in the list of dependencies mentioned at the previous point that is related to the current VNET;
- For the ILB IP, be sure to use the ILB created locally for each VNET:

Now you should complete the procedure mentioned in the article and test the failover behavior of the AG and its associated Listener. If everything is setup correctly, the AG should come always online succesfully, but with only one underlying IP as shown in the picture below:

As a final step, after the Client Access Point has been added to the cluster, it will be necessary to set its RegisterAllProvidersIP property to 1, if multi-subnet failover is desired and the application connecting to the listener will be using MultiSubnetFailover=True. By default, when the CAP is added to the cluster manually, this property is set to 0.

Create or Configure an Availability Group Listener (SQL Server)

https://msdn.microsoft.com/en-us/library/hh213080(v=sql.120).aspx#RegisterAllProvidersIP

SQL Server 2014 and Azure

Even if not strictly related to HA & DR, if you want to have the maximum from SQL Server 2014 when installed in Azure, I strongly recommend you to evaluate the following “integration features”:

SQL Server 2014 Backup to Azure Blob Storage: SQL Server 2014 has the possibility to use Azure Blob storage as target media for database backups: no VM disks are necessary to store your backups, you can directly use Azure Blob storage and then have benefits from 3 local replicas and remote 3 replicas for geo-DR. Backup set will be encrypted. You can read more details at the following link:

SQL Server Backup to URL

http://msdn.microsoft.com/en-us/library/dn435916.aspx

SQL Server 2014 Managed Backup: This a new way to manage backup in SQL Server: the engine itself will take care of taking backup of existing and future (non-existing) databases, based on policies that the DBA can define; this is not simply scheduling, it’s a dynamic and intelligent mechanism: the backup strategy used by SQL Server Managed Backup to Windows Azure is based on the retention period and the transaction workload on the database. This feature will use Azure Blob storage as media target, as explained at the previous point. You can read more details at the following link:

SQL Server Managed Backup to Windows Azure

http://msdn.microsoft.com/en-us/library/dn449496(v=sql.120).aspx

Azure D-SERIES local temporary SSD storage: Azure D-SERIES VMs, come with very powerful local temporary SSD disk, you can everage this important resource in the following SQL Server 2014 features:
- Buffer Pool extension: with this feature, SQL Server is able to use SSD storage to enhance its cache and then providing an extra-layer for data caching, you can read more details at the link below:

Buffer Pool Extension

http://msdn.microsoft.com/en-us/library/dn133176.aspx

- TEMPDB allocation: since the VM local SSD drive is temporary and not fully persistent, using it for TEMPDB database allocation is an optimal and recommended choice. You can then have high-performant TEMPDB without wasting Azure persistent data dsks that can be fully used for user databases. You can read more details at he following links:

Using SSDs in Azure VMs to store SQL Server TempDB and Buffer Pool xtensions

http://blogs.technet.com/b/dataplatforminsider/archive/2014/09/25/using-ssds-in-azure-vms-to-store-sql-server-tempdb-and-buffer-pool-extensions.aspx

That’s all folks! This is my last blog post in the current year, I wish you a merry Christmas and happy new year. Let me know if you have any feedback or question, as usual you can follow me on Twitter ( @igorpag). Regards.

SQL Server 2014 High-Availability and Multi-Datacenter Disaster Recovery with Multiple Azure ILBs

Additional resources