An overview of High Availability and Disaster Recovery solutions available for SQL Server

Article
06/15/2013

With the more recent releases of SQL 2012 and Windows 2012, we are seeing the landscape of available options for SQL HA/DR specific architectures is improving significantly. Below is a summary of the high availability and disaster recovery solutions available for a SQL environment.

To begin, it is important to outline some key terminology.

· RTO (Recovery Time Objective): The duration of acceptable application downtime, whether from unplanned outage or from scheduled maintenance/upgrades. The primary goal is to restore full service to the point that new transactions can take place.

· RPO (Recovery Point Objective): The ability to accept potential data loss from an outage. It is the time gap or latency between the last committed data transaction before the failure and the most recent data recovered after the failure. The actual data loss can vary depending upon the workload on the system at the time of the failure, the type of failure, and the type of high availability solution used.

· RLO ( Recovery Level Objective): This objective defines the granularity with which you must be able to recover data — whether you must be able to recover the whole instance, database or set of databases, or specific tables.

· High Availability (HA): The principal goal of a high availability solution is to minimize or mitigate the impact of downtime.

· Disaster Recovery (DR): Disaster recovery efforts address what is done to re-establish availability after an outage. It refers to restoring your systems and data to a previous acceptable state in the event of partial or complete failure of computers due to natural or technical causes.

· Quorum: Generally speaking, a quorum is the minimum number of members of an assembly, who must be present before the members can conduct business. The requirement for a quorum protects against unrepresentative action in the name of the body by a disproportionately small number of individuals (adapted from Bing Dictionary and Wikipedia).

o In the context of Windows Clustering, the quorum configuration determines the number of failures the cluster can sustain.

o In the context of SQL database mirroring or availability groups with automatic failover, quorum makes sure that a database is owned by only one partner or replica at a time.

Technical description of various HA/DR solutions:

1. Single Site SQL Failover Clustered Instance (FCI)

a. Primarily an HA feature

b. Uses time tested Windows Clustering technology. Advanced heartbeat, failover, quorum model, and quorum voting options available in the latest versions of Windows.

c. Protects against Machine failures (ex. CPU/motherboard causes machine to become unresponsive or failed)

d. Protects against OS failures (ex. blue screens)

e. Protects against SQL Instance failures (ex. SQL hangs/ AV’s)

f. Application connects using one virtual server name, which is not tied to a specific machine name. The current owning node is abstracted via the virtual server name.

g. Works well with most other features such as log shipping, replication, and asynchronous availability groups

h. Manages external dependencies well upon failover. All system and user databases reside on a shared drive – failover simply provides access to the same shared drives to the new owning node. Registry keys are replicated via cluster checkpoints.

i. Supports rolling upgrades scenarios

j. Instance-level protection without data redundancy (Instance [-]DATA RLO)

a. Considerations:

- There is no concept of a secondary database. Does not maintain a redundant copy of the data and so does not protect against an I/O subsystem failure

- No special requirements with respect to database recovery models

- Must ensure nodes are maintained properly (patching levels should match)

- A major outage like a datacenter power failure, or failure of the network link to the primary data center is not addressed because all nodes are within the same datacenter.

2. Database Mirroring:

With Automatic Failover:

a. Primarily a DR solution. Can be used as HA when used with synchronous/automatic failover options.

b. Protects against I/O subsystem failure on the primary server or datacenter

c. Log stream compression available for increased performance

d. Automatic page repair

e. Automatic failover is possible, requires a 3^rd witness server

f. Automatic redirection to secondary is possible

g. Can choose between several mirroring and failover options: synchronous or asynchronous streaming, automatic or manual failover

h. DB Level protection (database RLO)

i. Considerations:

- Database must be using the FULL recovery model

- Database on secondary is not writeable

- Database on secondary is not readable (can create database snapshots to work around this, but this can quickly become burdensome)

- Additional management overhead of third witness server

- Deprecated in SQL 2012

- Only 1 secondary allowed

- Can only failover at database granularity - cannot group a set of databases to failover together

- If automatic client redirection is required, the client must change the connection string to include the new FAILOVER_PARTNER option. Also, the client connectivity components must be at a version which supports the new connection string – so may not be suitable for legacy applications.

- Database failover occurs at the user database level, not at the server level. Note that the system databases master, tempdb, and model cannot be mirrored and that only one secondary database is supported with database mirroring in SQL Server 2008. Databases have dependencies on objects that reside in other databases such as the master database. These objects often include logins, jobs, certificates, and server audits but can include other objects. A process (normally involving scripts) needs to be adopted to keep objects that are external to the database synchronized across the servers. Suitable for environments where these external dependencies are not changed very often (which tends to be true in most environments).

- Some features are not supported, such as cross database transactions.

- Special configuration considerations when using Replication (https://msdn.microsoft.com/en-us/library/ms151799.aspx).

3. Transaction Log Shipping

a. Primarily a DR solution

b. Can use compressed backups

c. Very good way to validate the transaction log backups you are already taking

d. Can run in a delayed restore mode to protect secondary from human error on primary site (ex. If TableX was deleted on Primary by mistake, and there is a 2-hour delay, you may have enough time to be able to recover TableX from the Secondary)

e. DB Level protection (database RLO)

f. Can have multiple secondaries

g. Very good option when other DR options are not possible or supported by a vendor since it is essentially a constant transaction log backup and restore sequence.

h. Considerations:

- Database must be in Full or Bulk Logged recovery model

- No automatic failover

- No automatic redirection

- Can only failover at database granularity - cannot group a set of databases to failover together

- Database on secondary not readable during restore

- Database on secondary not writeable

- Databases have dependencies on objects that reside in other databases such as the master database. These objects often include logins, jobs, certificates, and server audits but can include other objects. A process (normally involving scripts) needs to be adopted to keep objects that are external to the database synchronized across the servers. Suitable for environments where these external dependencies are not changed very often (which tends to be true in most environments).

4. Peer-To-Peer Transactional Replication

a. Primarily a scalability solution

b. With Peer-To-Peer replication, we have the capability to keep multiple copies of the data on multiple sites, providing a scale-out, HA (if bandwidth permits) and DR solution. Because data is maintained across the nodes, peer-to-peer replication provides data redundancy, which increases the availability of data.

c. Database can be in Simple recovery model

d. Database on secondary is both readable and writeable

e. No automatic redirection

f. No concept of a failover since all “nodes” are always “active”

g. Can have multiple nodes participate in the topology

h. Can choose to replicate only a subset of tables within the database (table level or filtered rows from table RLO)

i. Since all databases are writeable, you can create different indexes optimized for reporting environment, optimal solution if you are running very expensive reporting queries which need custom indexes

j. Considerations:

- Client will have to change connection string to an available node if the node it is connected to becomes unavailable

- Replication is mainly intended for reporting offloading. Therefore the architecture builds an interim queue in the distribution database, causing higher latencies than would be observed with the Availability Groups architecture.

- Does not handle conflicts gracefully, application owner will need to ensure data modified at different sites does not conflict (app should logically partition the updates). P2P replication will simply overwrite data (lost updates) unless configured to fail when a conflict occurs, which requires manual intervention (https://technet.microsoft.com/en-us/library/bb934199.aspx).

- Does not support initialization and re-initialization via a snapshot

- Replicated tables must have a primary key

5. Multi-Site Clustering with SAN Replication

a. HA and DR solution using Failover Clustering in combination with SAN replication

b. Protects against I/O subsystem failure on the primary datacenter

c. Data disks can be synchronously or asynchronously mirrored

d. External dependencies handled well as in typical single site failover clusters

e. Instance-level protection WITH data redundancy (Instance [+] DATA RLO)

f. SQL 2012 supports multi-subnet environments, eliminating the need for a VLAN

g. Considerations:

- Additional expense of SAN replication technology

- Additional complexity of managing and configuring SAN replication technology

- Quorum model and node vote options should be configured to ensure an outage at the DR site will not cause the cluster to lose quorum

6. Multi-Site Clustering with NO SAN Replication and with NO shared storage

a. SQL 2012 supports multi-subnet environments, eliminating the need for a VLAN

b. Will become even more relevant:

- SQL 2012 now supports databases on SMB

- Windows 2012 clusters support 64 nodes

- Hyper-V Replicas feature on Windows 2012

- Hyper-V support for SMB

c. Used often to support DR within a SQL 2012 Availability Group topology

7. SQL Availability Groups

a. HA (synchronous, local datacenter) and DR(asynchronous, remote datacenter) solution

b. Protects against I/O subsystem failure on the primary server or datacenter

c. Log Stream compression available for increased performance

d. Automatic Page Repair

e. Automatic redirection to secondary is possible via an AG listener

f. Automatic failover is possible, does not require a 3^rd witness server, windows cluster is used for quorum

g. Up to 4 secondaries

h. Quorum is based on all nodes of the underlying Windows Cluster

i. Can failover a set of databases as a group (database set RLO)

j. Can choose between several mirroring and failover options: synchronous or asynchronous streaming, automatic or manual failover

k. Can offload T-Log backups on secondary

l. Considerations:

- Database must be using the FULL recovery model

- Database on secondary is not writeable

- Database on secondary can be set as readable (Readable or READ-INTENT secondary options)

- Note that the system databases master, tempdb, and model cannot be part of an AG. Databases have dependencies on objects that reside in other databases such as the master database. These objects often include logins, jobs, certificates, and server audits but can include other objects. A process (normally involving scripts) needs to be adopted to keep objects that are external to the database synchronized across the servers. Suitable for environments where these external dependencies are not changed very often (which tends to be true in most environments).

- Some features are not supported, such as cross database transactions (https://msdn.microsoft.com/en-us/library/ms366279.aspx).

- Special configuration considerations when using Replication (https://technet.microsoft.com/en-us/library/hh403414.aspx).

- Quorum model and node vote options should be configured to ensure an outage at the DR site will not cause the cluster to lose quorum

8. Windows 2012 Hyper-V Farm

a. Hyper-V Host Cluster with multiple standalone VM’s or Guest-Clustered VM’s – all VM’s Highly available within Host Cluster

b. New Windows 2012 cluster limits – 64 nodes

c. VM failover prioritization(Start order, Stop order, Node maintenance mode)

d. Anti-affinity VM rules

e. SMB 3.0 support for VHDX + Snapshots for flexibility, easier storage migration, performance

f. Guest clustering with virtual Fibre Channel adapters provide port virtualization by exposing host bus adapter (HBA) ports in the guest operating system

g. CSV, Clustered file system, enables multiple-servers to simultaneously access a volume

- Speeds up failover

- No drive letter restrictions

- Integration with SMB 3.0

- VMs not bound to storage

- Restrictions with clustered instances of SQL https://technet.microsoft.com/en-us/library/jj612868.aspx

h. Scale out file server: leverages CSV, allows active-active clustered file shares with fault-tolerance and with no downtime (app just sees slowness in I/O during failover). Client connections are distributed across nodes for better throughput.

- SQL Server is supported when using File Server for scale-out application data storage. SQL Server 2008 R2 and SQL Server 2012 are supported in this scenario. SQL Server 2008 R2 is supported in a stand-alone configuration, and SQL Server 2012 adds support for clustered servers. https://technet.microsoft.com/en-us/library/hh831815.aspx

i. Hyper-V Replica (can use VSS snapshot for Replica, which is app consistent snapshot) for DR. SQL Server on Hyper-V Replica is supported provided the EnableWriteOrderPreservationAcrossDisks

flag is set. https://support.microsoft.com/kb/956893

j. VM maintenance mode for testing DR

k. Guest NUMA, SQL is NUMA aware and can take advantage of this

l. Live Storage Migration

m. VHDX size up to 64TB

n. Topology can be managed via SCVMM

Microsoft Virtual Machine Manager in System Center 2012 sp1

How to configure SQL 2012 AlwaysON Availability Groups in System Center 2012 Virtual Machine Manager Sp1)

Putting it together:

1. RPO/RTO capabilities of each feature, from Microsoft SQL Server AlwaysOn Solutions Guide for High Availability and Disaster Recovery

2. You can combine many technologies and features to come up with a solid HA + DR plan which will meet your RLO, RTO, RPO objectives. For example you can address HA and DR by bringing together FCI, SQL Availability Groups, and a Multi-Site Cluster – no SAN replication: