Setting up SharePoint Disaster Recovery Sites with SQL AlwaysOn

Article
04/24/2015

Something I’ve been promoting as an essential high-availability strategy for SharePoint for a while now is having a Disaster Recovery (DR) site for SharePoint. A SharePoint DR site is simply an entirely separate SharePoint farm that uses a copy of the content data in case the first farm runs into problems for some reason, or for if we need to patch SharePoint without losing uptime.

Edit: If you want a quick overview of SharePoint DR, then this new post is for you.

Disaster recovery with SQL Server AlwaysOn is done by syncing content databases from the primary/active SharePoint farm to the secondary/passive farm with the secondary content databases in read-only mode until a failover. On failover we make the passive farm the new active farm and switch everything to read/write until we fix the old primary site.

This was traditionally done using log-shipping; a technology now that’s always worked pretty well but is somewhat basic. In the year 2015+ however there’s a more fitting technology that can do this content database synchronising – namely SQL Server AlwaysOn.

AlwaysOn I’ve written about before for setting up the entire farm with a single logical SQL Server instance with every database synchronously mirrored between X SQL Server instances + automatic failover between the SQL Server nodes if needed. That’s one use case for SQL Server AlwaysOn for SharePoint but what this post is about is using SQL Server AlwaysOn to asynchronously mirror just content databases to a disaster recovery farm SQL Server instance – like log-shipping but just better for all sorts of reasons (see below).

This diagram nicely shows SharePoint DR, which you might’ve seen before:

Logically it's the same thing as SharePoint DR with log-shipping, just now with much better automated updates to the DR content databases.

Disaster Recovery with SQL Server AlwaysOn vs Log-Shipping

SQL Server AlwaysOn is a much nicer way of synchronising databases between sites. Why?

Transactions are processed one-by-one instead of batches every X minutes (TRN files).
A big one: switching over from primary à secondary à primary again doesn’t require a full backup and restore. It’s automatic each failover; the new passive node starts getting updates from the new active without further intervention needed. If you have sizeable content databases this is a big deal.

It’s not all roses though; AlwaysOn does require more configuration and is more complicated to setup (namely Windows failover clustering and the complications that clustering can add). But still, once running it is a far more resilient and flexible setup.

Each SharePoint Database in One AlwaysOn Group

There is a small limitation we have to work around in that SQL Server databases can only be used in a single AlwaysOn group. What that means is just that any database used in AlwaysOn for a synchronous, automatic-failover configuration for a single farm won’t be able to be used for disaster recovery too – you have to pick one usage or the other because of this limitation.

It’s no biggie; you could in theory have everything but the content databases for each farm in their own synchronous/auto-failover AlwaysOn availability group and then a 3^rd group for just content databases. This setup would give you something like this diagram:

Edit: I've explained more about this setup here.

If you wanted to go true expert-mode now there’s seemingly a single point of failure per farm then each SQL instance could also be a failover cluster of course. That aside, the point is that there’s no local failover in AlwaysOn-land for our content databases as the AlwaysOn setup will be used to just ship-out content database updates to the disaster recovery farm.

You could also technically replicate synchronously replicate the content databases to SQL-N1 in addition to sending updates asynchronously to SQL-N3 but then without a listener (which would use all replica members) the value of this would be diminished. Also adding another synchronous replica on SQL-N4 for consistency in the DR site would mean we lose the speed-bump of running asynchronously to the DR site. In summary it’s probably best to just have two asynchronous replicas for content-database AlwaysOn groups with one replica instance in each site.

But I digress; for now we’re just going to focus on what it looks like to get that middle section working; the asynchronous content databases working as the rest I’ve covered pretty well already.

Setup AlwaysOn Group for Content Databases

Create a SQL Server AlwaysOn availability group; give it a sensible name and select your content databases you want to ship to the DR site. The next page in the wizard is what’s interesting:

We can’t failover automatically with asynchronous commits as by definition there’s never a confirmation the secondary committed so SQL Server will only failover by force to save any potential automatic data loss. That’s fine as automatic failovers between farms isn’t a goal as it’s a pretty manual task in the year 2015 still so just bear in mind that failing-over to the DR/inactive farm will require a manual failover in SQL Server for the content-databases AlwaysOn group.

We do want both nodes to be “readable secondary” though so the inactive farm can use the content DBs in read-only mode at least.

Test DR/Passive Site

Once the AlwaysOn group is setup and joined we can add the content-database to the 2^nd farm like any other, albeit in read-only mode.

Add the content database to the application there and check it works in the browser. Then you’ll want to get your search configured on the 2^nd farm too; check content sources & crawl schedules are as you want.

Just for clarity I’ve left the URL as “sp15-failover” for now but in a real-life scenario you’d have common alternative-access-mappings for both farms.

SharePoint Farm Failovers

Two things need to happen at a minimum when a failover to the passive site happens:

SQL Server AlwaysOn failover for the content databases so the passive farm gets read/write access.
- Highly recommended: dismount/mount refresh of content databases so passive farm can refresh site-collection list (see below).
DNS update for your HTTP traffic to your passive network-load-balancer/web-front-end.
- If you have a reverse-proxy this is even easier – switch the back-end address for the application and no DNS updates are needed. Good times.

That’ll switch the secondary farm to read/write mode and should start to send your users to the new site.

First the SQL Server failover for the content databases:

This will give all sorts of warnings as again, with asynchronous commits there’s no confirmation all the transaction have been written so SQL Server forces you to acknowledge data loss is a (small) possibility.

SQL Server will failover at gunpoint; once done you’ll see:

Well done – the DR site is now the read/write site!

As mentioned, it’s highly recommended you also dismount/mount the content database again so any new site-collections are seen by the passive farm.

Edit: actually, there’s a better method to refresh the content DBs instead of dismounting them - SPContentDatabase.RefreshSitesInConfigurationDatabase. Running this script will refresh all the content-databases:

Get-SPContentDatabase | foreach { $_.RefreshSitesInConfigurationDatabase(); }

Now you should be ready to redirect users to the 2^nd farm.

Very Important: Resume Data Synchronisation After Failover

Once a failover is complete in SQL Server availability-groups configured for asynchronous commits (which is what we’re doing here), it’ll be necessary to manually resume data synchronisation on the previously primary SQL instance that’s just been demoted to read-only secondary.

This is actually for a good reason; the old primary SQL instance can’t know for sure the now current primary SQL instance applied all the latest transactions before the failover happened (because we configured it this way). So what the old primary instance does does is freeze it’s own contents from any further updates (including any synchronised from the new active instance) until you, Mr Admin, confirm you’ve done everything humanly possible to ensure the new primary has everything and normal operation/synchronisation can resume.

If you’re worried about this potential problem, this is a good time to get a proper DBA involved in the conversation – SQL Server just makes sure nothing changes on the previously “most-up-to-date” database(s) until you’re sure you want to resume. Again, this is only an issue for availability-groups configured for asynchronous commits – synchronous commit availability groups don’t have this extra consideration.

Anyway, to resume data synchronisation you’ll need to run this command on the old-primary/now-secondary node once you’re sure it’s ok:

ALTER DATABASE WSS_Content SET HADR RESUME

Once executed, everything will go back to normal again and once all the changes have synchronised up we’re ready for another failover if necessary. See https://msdn.microsoft.com/en-us/library/ff877956.aspx for full documentation for this command.

All green – just the way we like it.

Test Failover to SharePoint DR Site

SharePoint will pick-up immediately that it’s now got read/write access to the content databases and continue as normal.

Here we are then; the failover site now working as the new primary SharePoint farm. If we look at the SQL-N2, it’s been relegated to read-only replica, and is now getting updates from the new primary replica in the group – SQL-N3 in the DR site:

…which is a big reason why AlwaysOn is far superior to log-shipping for disaster-recovery sites. If we want to switch now back from the failover site to the old primary, it’s just a case of failing-over the SQL Server group again just like we did before. With log-shipping we’d now have to backup/restore the content DBs & re-setup log-shipping which is a massive pain, but with AlwaysOn we don’t have to touch the databases.

Wrapping-Up

This guide shows the principal of running disaster-recovery with AlwaysOn. Not everything’s covered but this should give an idea how it works in principal at least. If anyone’s got any doubts then I’m happy to expand on the subject – please use the comments below!

Cheers,

// Sam Betts