Setting up SharePoint Disaster Recovery Sites with SQL AlwaysOn


Something I’ve been promoting as an essential high-availability strategy for SharePoint for a while now is having a Disaster Recovery (DR) site for SharePoint. A SharePoint DR site is simply an entirely separate SharePoint farm that uses a copy of the content data in case the first farm runs into problems for some reason, or for if we need to patch SharePoint without losing uptime.

Edit: If you want a quick overview of SharePoint DR, then this new post is for you.

Disaster recovery with SQL Server AlwaysOn is done by syncing content databases from the primary/active SharePoint farm to the secondary/passive farm with the secondary content databases in read-only mode until a failover. On failover we make the passive farm the new active farm and switch everything to read/write until we fix the old primary site.

This was traditionally done using log-shipping; a technology now that’s always worked pretty well but is somewhat basic. In the year 2015+ however there’s a more fitting technology that can do this content database synchronising – namely SQL Server AlwaysOn.

AlwaysOn I’ve written about before for setting up the entire farm with a single logical SQL Server instance with every database synchronously mirrored between X SQL Server instances + automatic failover between the SQL Server nodes if needed. That’s one use case for SQL Server AlwaysOn for SharePoint but what this post is about is using SQL Server AlwaysOn to asynchronously mirror just content databases to a disaster recovery farm SQL Server instance – like log-shipping but just better for all sorts of reasons (see below).

This diagram nicely shows SharePoint DR, which you might’ve seen before:

SharePoint AlwaysOn

Logically it's the same thing as SharePoint DR with log-shipping, just now with much better automated updates to the DR content databases.

Disaster Recovery with SQL Server AlwaysOn vs Log-Shipping

SQL Server AlwaysOn is a much nicer way of synchronising databases between sites. Why?

  • Transactions are processed one-by-one instead of batches every X minutes (TRN files).
  • A big one: switching over from primary à secondary à primary again doesn’t require a full backup and restore. It’s automatic each failover; the new passive node starts getting updates from the new active without further intervention needed. If you have sizeable content databases this is a big deal.

It’s not all roses though; AlwaysOn does require more configuration and is more complicated to setup (namely Windows failover clustering and the complications that clustering can add). But still, once running it is a far more resilient and flexible setup.

 

Each SharePoint Database in One AlwaysOn Group

There is a small limitation we have to work around in that SQL Server databases can only be used in a single AlwaysOn group. What that means is just that any database used in AlwaysOn for a synchronous, automatic-failover configuration for a single farm won’t be able to be used for disaster recovery too – you have to pick one usage or the other because of this limitation.

It’s no biggie; you could in theory have everything but the content databases for each farm in their own synchronous/auto-failover AlwaysOn availability group and then a 3rd group for just content databases. This setup would give you something like this diagram:

SharePoint AlwaysOn

Edit: I've explained more about this setup here.

If you wanted to go true expert-mode now there’s seemingly a single point of failure per farm then each SQL instance could also be a failover cluster of course. That aside, the point is that there’s no local failover in AlwaysOn-land for our content databases as the AlwaysOn setup will be used to just ship-out content database updates to the disaster recovery farm.

You could also technically replicate synchronously replicate the content databases to SQL-N1 in addition to sending updates asynchronously to SQL-N3 but then without a listener (which would use all replica members) the value of this would be diminished. Also adding another synchronous replica on SQL-N4 for consistency in the DR site would mean we lose the speed-bump of running asynchronously to the DR site. In summary it’s probably best to just have two asynchronous replicas for content-database AlwaysOn groups with one replica instance in each site.

But I digress; for now we’re just going to focus on what it looks like to get that middle section working; the asynchronous content databases working as the rest I’ve covered pretty well already.

Setup AlwaysOn Group for Content Databases

Create a SQL Server AlwaysOn availability group; give it a sensible name and select your content databases you want to ship to the DR site. The next page in the wizard is what’s interesting:

SharePoint AlwaysOn

We can’t failover automatically with asynchronous commits as by definition there’s never a confirmation the secondary committed so SQL Server will only failover by force to save any potential automatic data loss. That’s fine as automatic failovers between farms isn’t a goal as it’s a pretty manual task in the year 2015 still so just bear in mind that failing-over to the DR/inactive farm will require a manual failover in SQL Server for the content-databases AlwaysOn group.

We do want both nodes to be “readable secondary” though so the inactive farm can use the content DBs in read-only mode at least.

Test DR/Passive Site

Once the AlwaysOn group is setup and joined we can add the content-database to the 2nd farm like any other, albeit in read-only mode.

SharePoint AlwaysOn

Add the content database to the application there and check it works in the browser. Then you’ll want to get your search configured on the 2nd farm too; check content sources & crawl schedules are as you want.

SharePoint AlwaysOn

Just for clarity I’ve left the URL as “sp15-failover” for now but in a real-life scenario you’d have common alternative-access-mappings for both farms.

SharePoint Farm Failovers

Two things need to happen at a minimum when a failover to the passive site happens:

  • SQL Server AlwaysOn failover for the content databases so the passive farm gets read/write access.
    • Highly recommended: dismount/mount refresh of content databases so passive farm can refresh site-collection list (see below).
  • DNS update for your HTTP traffic to your passive network-load-balancer/web-front-end.
    • If you have a reverse-proxy this is even easier – switch the back-end address for the application and no DNS updates are needed. Good times.

That’ll switch the secondary farm to read/write mode and should start to send your users to the new site.

First the SQL Server failover for the content databases:

SharePoint AlwaysOn

This will give all sorts of warnings as again, with asynchronous commits there’s no confirmation all the transaction have been written so SQL Server forces you to acknowledge data loss is a (small) possibility.

SharePoint AlwaysOn

SQL Server will failover at gunpoint; once done you’ll see:

SharePoint AlwaysOn

Well done – the DR site is now the read/write site!

As mentioned, it’s highly recommended you also dismount/mount the content database again so any new site-collections are seen by the passive farm.

Edit: actually, there’s a better method to refresh the content DBs instead of dismounting them - SPContentDatabase.RefreshSitesInConfigurationDatabase. Running this script will refresh all the content-databases:

Get-SPContentDatabase | foreach { $_.RefreshSitesInConfigurationDatabase(); }

Now you should be ready to redirect users to the 2nd farm.

Very Important: Resume Data Synchronisation After Failover

Once a failover is complete in SQL Server availability-groups configured for asynchronous commits (which is what we’re doing here), it’ll be necessary to manually resume data synchronisation on the previously primary SQL instance that’s just been demoted to read-only secondary.

SharePoint AlwaysOn

This is actually for a good reason; the old primary SQL instance can’t know for sure the now current primary SQL instance applied all the latest transactions before the failover happened (because we configured it this way). So what the old primary instance does does is freeze it’s own contents from any further updates (including any synchronised from the new active instance) until you, Mr Admin, confirm you’ve done everything humanly possible to ensure the new primary has everything and normal operation/synchronisation can resume.

If you’re worried about this potential problem, this is a good time to get a proper DBA involved in the conversation – SQL Server just makes sure nothing changes on the previously “most-up-to-date” database(s) until you’re sure you want to resume. Again, this is only an issue for availability-groups configured for asynchronous commits – synchronous commit availability groups don’t have this extra consideration.

Anyway, to resume data synchronisation you’ll need to run this command on the old-primary/now-secondary node once you’re sure it’s ok:

ALTER DATABASE WSS_Content SET HADR RESUME

Once executed, everything will go back to normal again and once all the changes have synchronised up we’re ready for another failover if necessary. See https://msdn.microsoft.com/en-us/library/ff877956.aspx for full documentation for this command.

SharePoint AlwaysOn

All green – just the way we like it.

 

Test Failover to SharePoint DR Site

SharePoint will pick-up immediately that it’s now got read/write access to the content databases and continue as normal.

SharePoint AlwaysOn

Here we are then; the failover site now working as the new primary SharePoint farm. If we look at the SQL-N2, it’s been relegated to read-only replica, and is now getting updates from the new primary replica in the group – SQL-N3 in the DR site:

SharePoint AlwaysOn

…which is a big reason why AlwaysOn is far superior to log-shipping for disaster-recovery sites. If we want to switch now back from the failover site to the old primary, it’s just a case of failing-over the SQL Server group again just like we did before. With log-shipping we’d now have to backup/restore the content DBs & re-setup log-shipping which is a massive pain, but with AlwaysOn we don’t have to touch the databases.

Wrapping-Up

This guide shows the principal of running disaster-recovery with AlwaysOn. Not everything’s covered but this should give an idea how it works in principal at least. If anyone’s got any doubts then I’m happy to expand on the subject – please use the comments below!

Cheers,

// Sam Betts

Comments (22)

  1. Kevin says:

    Hi Sam, great article. It should be noted that you must be running SQL Server 201X ENTERPRISE to use Always On Availability Groups.

    msdn.microsoft.com/…/cc645993.aspx

  2. Hi Kevin,

    Yep that's correct – thanks for pointing that out.

    // Sam

  3. Paul says:

    Whta happen I need both SQL Server High Availability in Primary SharePoint Farm, plus failover to DR SharePoint Farm?

  4. Hi Paul,

    You could make both content instances (sql-n2 & sql-n3 in this instance) failover cluster instances which would give you high-availability for the content DBs on both sites. I actually did something similar in my 1st AlwaysOn post; each AlwaysOn member was in fact a failover cluster in itself – blogs.msdn.com/…/sharepoint-2013-and-sql-server-alwayson-high-availability-sharepoint.aspx

    // Sam

  5. LenC says:

    Sam, great article.  I still have some colleagues saying that SharePoint can use AO both HA and DR, simultaneously.  Sync locally, async to DR, where the replicated DB's are deliberately used by web applications and even many of the service apps.  What are the reasons this is –not– possible?

  6. Jish says:

    Do you have something for SP2010?

  7. LenC says:

    Another question: why not replicate the service DB's?  services such as MMS (with Keyword stores) and Profile DB (with user-provided properties) could be important.

  8. Hi LenC,

    MMC replication I'm going to look at but in general I wouldn't replicate much more than content as each farm really should be a logical separation from the other to give you the best shot at handling errors. More isolation gives you a better shot at this.

    You could use AO for HA + async DR with separate AO groups quite happily. There's a diagram that shows this above under "Each SharePoint Database in One AlwaysOn Group" and some discussion there about it – let me know if something's not clear.

    // Sam

  9. Hi Jish,

    AlwaysOn isn't supported for SharePoint 2010 officially. That's not to say it won't work or anything (it probably will) just we've never tested it sufficiently to say we do support it. Use it at your own risk, albeit a low risk.

    // Sam

  10. Steve says:

    Hi Samuel, great article. One question – if I sync the content databases to my DR environment, is it possible to mount these to the second SharePoint enviroment? I want to run the crawl of the DR Search to get an up to date search in case of an failure.

    // Steve

  11. Yip that should still work no problem.

  12. J Langston says:

    Just to be clear based upon what I see in the AO configuration, the secondary is read-only, not read/write – correct?

  13. Jurjen says:

    Hi Sam,

    according to this technet article: technet.microsoft.com/…/hh913923(v=office.14).aspx AlwaysOn and Sharepoint 2010 is supported.

    Great article tho!

  14. Sushanta says:

    Article for SharePoint Farm patching with it's DR (Hot Standby) with SQL Server AlwaysOn: The current version on my SharePoint 2013 + Project Server 2013 enterprise is RTM and I want to update this to 2014 September CU for both SharePoint and Project Server. While there are documentation available on the SharePoint Farm upgrade which also included the DR farm (in my case hot stand by) for the SQL server log shipping mechanism blogs.msdn.com/…/patching-sharep…, I could not find any reference or article on SharePoint patching on a SQL Server 2012 AlwaysOn setup with Asynchronous commit. Although there is an article which says AlwaysOn Asynchronous replication is supported by SharePoint 2013, but in terms of SharePoint patching and what should be the approach nothing that I can find of. If you are aware of any article, preferably Microsoft article or any blog post giving information on the above, please let me know. Thanks in advance, Sushanta

  15. Frank Norris says:

    SharePoint disaster recovery tool performs easy and convenient recovery of files from corrupt SharePoint .mdf database.  It restores and recovers from corrupt mdf database files, and rebuilds the database and uploads again to SharePoint server. – See more at: http://www.mozesoft.com/sharepoint-server-recovery.html

  16. Tony L says:

    Hi Sam,

    We are in the process of trying to set up both HA and DR for our Sharepoint environment using AO. The Sharepoint site is already running using a single instance of SQL. Our intent is to convert this to a cluster with a local node for HA and another node in our DR site for the DR Sharepoint farm. I have been looking for good information on being able to synchronously replicate the content to the local node and asynchronously replicate to the DR node, your article is very informative. So to get around the issues you mention in  the "Each SharePoint Database in One AlwaysOn Group" section, couldn't you have another listener defined for the SPContent AO group, so that the primary farm would use this listener to connect to the content db's whether the group is on SQL-N1 or SQL-N2. If it were to be moved manually to SQLN-3 then the DR sharepoint farm would then connect using the same listener. As it is your diagram doesn't show how each farm connects to the SPContent group.

  17. Hi Tony,

    The reason I avoided creating a listener for the SPContentDatabases is basically because I don't want the primary farm using the databases if it fails over to the DR cluster. Having two farms ever being able to update the same content DBs would literally be a supportability disaster in the making, so it's local only. Given the AG is to sync updates between farms, a listener for it too would mean both farms use the content DBs at the same time by design unless you tightly controlled when they were mounted on each farm.

    I hope that makes sense.

    // Sam

  18. Matt says:

    We inherited an AlwaysOn cluster that has 2 nodes in our data center and one node in another data center for DR. That third node is configured to Manual and we know it will show Data Loss in the Dashboard but in Cluster Manager the IPs for the DR site are always offline. Is that normal? I’m used to seeing everything in Cluster Manager as Online.

  19. John says:

    I want to create two aoag for sharepoint (AppAG and SearchAG)
    One for the Content database, one for the search database.
    how can do this?
    I need create two listeren for two aoag?
    How to set the search database use “SearchAG” and content database use “AppAG”

  20. Paul says:

    We have just set up two SharePoint 2013 environments (Primary and DR) with SQL 2014. When attaching a content directly database to one of the SQL instances I just get a 404 error on the site collection. If I attach is to the group listener it works from one farm, but not the failover farm.

    If attach either farm to a content database not in the availability group they work fine.

  21. Paul says:

    Will this work for a SharePoint 2013 environment hosting sites in SharePoint 2010 mode?

  22. Kristine says:

    This is a great blog, and we implemented via your instructions for our 2013 instance.
    Question though, while your databases are in read-only mode, on the secondary/passive site, is there a way to avoid the constant event error logged 4971 Failed to update database “DBName” because the database is read-only. Thanks!

Skip to main content