Hot-Standby/Disaster-Recovery SharePoint Farms – Basic Setup & Failover

Something that’s often required for high-availability SharePoint installations is the ability to fail over to another web-farm entirely, either because a failure of some kind has hit the 1st farm or because maintenance is taking the 1st farm offline completely.

Edit: If you want a quick overview of SharePoint DR, then this new post is for you.

It’s simple enough in concept but isn’t particularly easy to set up in SharePoint. Sure it’s not cheap, but as a business we care about uptime first and this is pretty much the only way of being sure we’ve removed any single point of failure for our super-important SharePoint farm + all the apps that run on it.

Update: SharePoint DR is even more awesome with SQL Server AlwaysOn – check out how to do this in my new post here!

In this guide I’ll build a second “datacentre” to mirror the first one, then we’ll see how we can switch users to the 2nd farm while the 1st one goes offline. Farms going offline can happen for all sorts of reasons, either deliberate (farm patching) or accidental (whoops, something fatal happened). Either way, the goal is clear: keep the SharePoint applications online at all costs.

So for our test we’re going to assume we have a nice SharePoint 2013 farm that we want a hot-standby for in case it goes offline for any reason (hint: there are many). The goal is simple; when farm 1 goes dark we want farm 2 to service requests instead and we don’t want users to realise that we’ve had a farm-eating problem. Our target setup will be something like this:

High Availability SharePoint - Target Setup

We’re building everything on the right so that when a failure occurs we can just redirect users there while the whole of the left-side SharePoint farm recovers from whatever knocked it offline. This is prepared for a complete lights-out scenario if necessary, in which case we switch to this:

High Availability SharePoint - Farm Failover

Building a Hot-Standby/Disaster Recovery SharePoint Farm

We call it a DR farm for short and broadly speaking setting up such a secondary farm involves:

  1. Setting up a secondary SharePoint farm with its own configuration DB, search app, etc, but with no web-application yet.
  2. Enabling log shipping for the content DB(s).
  3. Once the content databases are backed-up to the 2nd farm, configure/add the applications now to use them (don’t create new content DBs).
  4. Finally on failover, decide what you want the users to be able to do. The options are: read-only mode until farm 1 lives again (easiest), or full read/write in which case we’re going to have to figure out migrating the content changes back to the primary farm. More on this below.

Prepare Secondary Farm

We need a completely separate, second farm configured for when our first farm goes offline for whatever reason. The whole point of this exercise is that we can divert users to a second farm entirely, but one that is magically just as up-to-date as the first farm, so the first farm can go completely dark if necessary. Failovers are never predictable, so a true high-availability SharePoint environment will need x2 of everything locally (AD, SQL, and SharePoint obviously) because any one of those could knock out our primary farm.

Install & Configure SharePoint

I’m going to assume that there’s already a perfectly running SharePoint web-farm. The first step is obviously to install the 2nd farm & make sure it has the same SharePoint patch-level as the first. Really, make sure those SharePoint patch-levels are identical if you want your failover to work.

Enable Log Shipping from Primary SQL Server to Secondary SQL Server

There are three steps to log-shipping once it’s set up, and all three of them will need to be configured:

1. Backup.

    • Back up the database transaction logs to the backup target.

2. Copy.

    • Copy from server 1 to server 2. This can be done via Distributed File System automatically if configured after log-shipping is set up.

3. Restore.

    • Apply the copied transaction logs to the secondary copy.

4. Pause & repeat. Every 15 minutes by default.

High Availability SharePoint - Log-Shipping Process

Purple lines are done by the primary SQL instance; green by the secondary. For the purposes of this test we’re making the two file-shares the same thing, but in a real-world example this process wouldn’t work so well if the two datacentres are located in different countries for example. The backup/restore processes are not really suited to gently pushing/pulling bits across a WAN; they’re meant to read/write the core DB as quickly as possible from/to a local staging area, hence this extra copy stage.

This whole process does however need to be set up first, as obviously a copy of the database needs to exist on the 2nd server, and in the right mode. SQL Server Management Studio will give you the option of initializing the content automatically if you want, which means it can perform a full backup of the content database and restore it to a location on the 2nd server ready for log-shipping to begin. This is what we’ll do in this guide as it’s the simplest.

For the purposes of this demo SQL-SP2013\SP15 is our primary SQL instance with the content databases in read/write mode. SQL2-SP2013\SP15B is our secondary read-only with its own configuration DB, service-applications, etc.

So first, find the content databases we want to enable log-shipping for in Management Studio and enable them for log-shipping like so…

High Availability SharePoint - Our Target Content DB

Click the back-up settings button to configure how transaction-log backups will happen.

You need to specify where the logs will be copied to first; this should be somewhere “local” to this SQL instance; that’s to say, somewhere close by on the network (the whole idea of a second farm is to have it far away so any local disasters won’t affect it).

High Availability SharePoint - Setting up Log-Shipping

This job will back-up the most recent transaction-log data since the last backup to a file-share. You’ll probably want to leave the default values here. Click OK and close.

Now we have to configure what instance we’re restoring to – there has to be a copy of the content databases we’re log-shipping on the target 2nd SQL Server at some point and given we’ve not created any web-application yet on the 2nd farm we have no content-databases on the SQL Server yet. For this test we’re going to get SQL Server to do the hard work of restoring a copy on the 2nd instance for us, so after connecting to the 2nd SQL Server let’s select “generate a full backup and restore it onto the secondary”.

High Availability SharePoint - Setting up Log-Shipping

Connect your 2nd SQL instance here. We want SQL Server to initialise the target content DB too – on the 2nd database server there should be no sign of our content database WSS_Content yet. If there is, this operation will fail.

Now click on “Copy Files” to configure where the transaction logs are copied to having been backed up by the 1st SQL Server.

High Availability SharePoint - Setting up Log-Shipping

It’s the job of the 2nd SQL Server to pull the files from the backup location to this location we’re configuring now but it’s not essential if you have your own system of transferring files from one location to a remote location, like Distributed File System for example. In our example the source/destination are the same so we’re kinda skipping this step just for simplicity for now.

High Availability SharePoint - Setting up Log-Shipping

Finally, the transaction-log restore job. Here there are some things you really need to not leave to default.

Important SharePoint Specific Configuration!

First & foremost make sure you allow standby mode for the restore or the secondary farm won’t be able to read the DB until it’s restored to a usable mode manually. While transaction-logs are incoming we can use it in read-only mode if we select “standby mode”. This allows SharePoint to use the database at least, even if nothing can be written.

Also very important is the “disconnect users” option for the restore. This is because we need the restore to work above all else; if SharePoint has a connection open for example then the restore would fail without this option selected. That doesn’t sound so bad, but if the restore wasn’t possible for long enough then old (still unapplied) transaction log-files would be deleted as per the backup clean-up settings and now you have a broken transaction-log chain, meaning the only way to update the secondary copy would be a full backup & restore again. This obviously could be a big deal, depending on the database size. In short, make sure that SQL Server can restore transaction-log backups no matter what – if we need to use the database instance because of an outage at SQL Server 1 then we’ll not be getting any new transaction logs anyway so disconnects won’t happen.
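To make the clean-up risk concrete, here’s a minimal Python sketch that flags log backups which haven’t been restored yet but are already old enough for a clean-up job to delete. The function name, the timestamps and the 72-hour retention window are all assumptions for illustration; check the retention actually configured on your own backup job.

```python
from datetime import datetime, timedelta
from typing import List


def unapplied_backups_past_retention(
    backup_times: List[datetime],
    last_restored: datetime,
    now: datetime,
    retention: timedelta,
) -> List[datetime]:
    """Return log backups that are still unapplied but old enough to be cleaned up."""
    return [
        taken
        for taken in backup_times
        if taken > last_restored and now - taken >= retention
    ]


# Hypothetical timeline: restores have been blocked since 1 January, 08:45.
now = datetime(2024, 1, 4, 12, 0)
backups = [datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 4, 11, 45)]
at_risk = unapplied_backups_past_retention(
    backups,
    last_restored=datetime(2024, 1, 1, 8, 45),
    now=now,
    retention=timedelta(hours=72),  # assumed retention window
)
print(len(at_risk))  # 1
```

Anything this returns is a file the secondary still needs but could lose, which is exactly the broken-chain scenario described above.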

For the record, you’ll know if SharePoint tried to access the database & failed because of a restore happening because the login will be denied – you’ll see event ID 3760 as if the login isn’t setup correctly; in reality it is, but SQL has booted the connection SharePoint had open as per our configuration, is in “single-user” mode while it restores the transaction log and is denying all remote logins until it’s done.

Finish Log-Shipping Configuration

Click OK. This should now run:

High Availability SharePoint - Setting up Log-Shipping

What’s actually happened is it’s run x2 scripts, one on each SQL Server, to create x3 SQL Server Agent jobs to do the above-mentioned tasks – back-up; copy; restore.

High Availability SharePoint - Log-Shipping Setup

Here you see the jobs created for the transaction log shipping, and below we have our content database in the correct mode for log-shipping to take place:

High Availability SharePoint - Log-Shipping Setup

The database is mounted on the secondary SQL Server but in read-only mode.

Mount 2nd Database Copy

Once we have a content database we can create a SharePoint application to use it, but we need to add the application without a database first and attach the synced content DB afterwards. If you just supply the synced database name with the new application settings you’ll get a constant stream of errors from the search app complaining the DB is read-only. So either create a new app giving a temporary database name that you delete afterwards, or use PowerShell to create an application without any content database given – then add the content database with Mount-SPContentDatabase or via Central Administration.

Once done, open your web-application with the read-only content DB copy and you’ll see something like this:

High Availability SharePoint - DR Farm

The database is in read-only mode until we stop the log-shipping.

Testing the Log Shipping & Failover Farm

So let’s make sure our updates are being published to the 2nd farm then. We want to make a change to the 1st farm web-application and check the change is replicated to the 2nd farm.

Here we make a change to “sfb-sp15-wfe1” (slightly odd name for a SharePoint web-app but adding a NLB + a proper DNS name per farm was just a bit too much for this guide :P). Here we’ve changed the title & we see that a transaction log-file has been shipped after the page-edit:

High Availability SharePoint - Log-Shipping in Action

Here we can see the 1st farm with an edit applied and in the file-share the backup uses we can see a transaction log generated after the edit, which should contain the edit. Now we just have to wait for the destination server to pull the file off and restore the transaction log to the WSS_Content database. We’ve left the defaults in so this job runs every 15 minutes, so once the new transaction log becomes available to the 2nd server it can be anything up-to 15 minutes from then.
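As a rough worked example of that timing, here’s a small Python sketch of the worst-case delay before an edit shows up on the 2nd farm. It assumes the three jobs each run on their own 15-minute schedule, aren’t aligned with each other, and that job run time is negligible; the class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogShippingSchedule:
    """Intervals, in minutes, between runs of each log-shipping job."""
    backup_every: int = 15
    copy_every: int = 15
    restore_every: int = 15


def worst_case_lag_minutes(schedule: LogShippingSchedule) -> int:
    """Upper bound on how stale the secondary can be, ignoring job run time.

    A change made just after a backup waits one full backup interval, the
    resulting file can then wait one full copy interval, and the copied
    file can wait one full restore interval.
    """
    return schedule.backup_every + schedule.copy_every + schedule.restore_every


print(worst_case_lag_minutes(LogShippingSchedule()))  # 45
```

In this guide the backup and copy locations are the same share, so the copy stage adds nothing in practice and the bound is closer to 30 minutes.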

So if we look at our offsite farm SQL Server we see the job has run successfully after the last file was shipped:

High Availability SharePoint

Failing Over in Real Life – It’s Blown up For Real!

When your primary data-centre goes dark there is of course the question of “are we expecting the primary farm to come back online any time soon?”. We could enable read/write access but then we’d have to get those changes back to our primary SQL Server instance if we want users back on that farm again, which can be a hassle. For some failovers it’s enough to have just a read-only copy until the problem with the primary farm is resolved and log-shipping resumes – that way no changes need to be resolved back to the primary.

As part of the failover you’ll need to decide if you’re going to allow writing, as everything will still be in read-only mode until you do.

How to Failover to Your DR Farm – 1. Get the Latest Log-Data

The first thing you’ll want to do is make sure your secondary has the latest changes possible sent from the primary by finding the last available transaction logs to be restored. So to some extent you’ll have to figure out logically where you can get that – is the 1st SQL instance responding enough to be able to generate more logs? If not, what was the last transaction log it did generate? Did it copy it to the 2nd file-share? Etc. The point is, you’ll need to find them and then restore them to the 2nd instance if you want to ensure your failover farm has the latest content possible. That’s often quite a big deal and really down to your individual needs/usage etc, but grab all the TRN files you can & restore them.

High Availability SharePoint - Restoring Logs

Here I’ve gotten the latest TRN files from the incoming share and copied them into the backup folder for SQL Server to restore them to the secondary copy. If you can generate a newer transaction-log backup (assuming the 1st SQL instance is even responding) then do. Select the latest transaction-log; SQL is clever enough to go and find all the previous logs it needs to restore everything in the LSN chain.

High Availability SharePoint - Restoring Logs

On this screen we see what backup we really have in terms of LSN sequencing. We need everything from the last backup restored up until the latest TRN file – the newest possible. If there’s a gap in the LSN chain we can’t restore the log-files!
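The gap check can be sketched in a few lines of Python. This is only a model of the idea, assuming each log backup is described by a first and last LSN and that consecutive log backups meet end to start; the `LogBackup` type, the file names and the LSN values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class LogBackup:
    """A transaction-log backup file and the LSN range it covers."""
    file_name: str
    first_lsn: int
    last_lsn: int


def find_chain_gap(backups: List[LogBackup]) -> Optional[Tuple[LogBackup, LogBackup]]:
    """Return the first pair of backups with missing LSNs between them, or None."""
    ordered = sorted(backups, key=lambda b: b.first_lsn)
    for previous, current in zip(ordered, ordered[1:]):
        # A gap exists when the next file starts after the previous one ended.
        if current.first_lsn > previous.last_lsn:
            return previous, current
    return None


# Hypothetical files: the one covering LSNs 200 to 300 never arrived.
files = [
    LogBackup("WSS_Content_1.trn", 100, 200),
    LogBackup("WSS_Content_3.trn", 300, 400),
]
gap = find_chain_gap(files)
print(gap is not None)  # True
```

If a gap is reported, everything after it can’t be restored until the missing file is found.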

2. Redirect Traffic to Disaster Recovery Farm

So anyway, once you decide you do want to fail over you’ll want to change the DNS for your website first so users start being directed to the 2nd farm. So instead of the name pointing at the Network-Load-Balancer (NLB) of farm #1, we want it to point at the DR farm NLB. There’s nothing magical about this; update the DNS ‘A’ record ASAP to start sending people to the secondary – DNS can take a while to propagate so bear this in mind.

Another take on this is to have a separate DNS name for your second farm & have a reverse-proxy (TMG Server for example) redirect all requests to that 2nd name with a simple HTTP 302.

3. Stop Log-Shipping

Disable the jobs that execute the log-shipping on both nodes (if possible). As you saw above, these jobs are in the SQL Server Agent; disable them from Management Studio. Alternatively, open the source database’s properties in Management Studio, unselect “Enable Log-Shipping” (the opposite of what’s described to set it up) and click OK; this does the same.

4. Enable Read & Write for Users?

If you’ve decided to go all the way with your secondary farm and enable full read & write then you’ll need to set the content databases as writable. To do so, do the opposite of this – or just run (per DB):
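A minimal sketch, assuming the content databases are still in standby mode on the secondary: such a database is commonly brought online for read/write with T-SQL’s `RESTORE DATABASE … WITH RECOVERY`. The Python helper below (the name `recovery_statement` is made up) only builds that statement per content DB, so it can be reviewed and then run against the secondary instance; `WSS_Content` is the database name used earlier in this guide.

```python
def recovery_statement(database_name: str) -> str:
    """Build the T-SQL that brings a standby database online as read/write."""
    # Double any closing bracket so the name stays a single quoted identifier.
    quoted = database_name.replace("]", "]]")
    return f"RESTORE DATABASE [{quoted}] WITH RECOVERY;"


for name in ["WSS_Content"]:
    print(recovery_statement(name))
# RESTORE DATABASE [WSS_Content] WITH RECOVERY;
```

Bear in mind that once a database has been recovered it can’t accept any further transaction-log restores, so only do this once the last available logs have been applied.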



Other Things You Need To Do On Failover

Unless you have some kind of elaborate script that can detect when a web-farm has gone offline with 100% confidence, you’ll want to do this manually. Given this isn’t a simple setup the process isn’t particularly automatic, but the point is when our production farm goes dark for reasons unknown and the bosses are screaming down the telephone at you, believe me when I say you’ll be happy you have a DR farm ready to rock & roll; to cover you until you can figure out what went wrong with the 1st farm at least.

So to failover to our standby farm we need to:

  • Prepare the farm for operation:
  • Disable SQL Server Agent.
  • Dismount/mount content databases to refresh site-collection count. Any new site-collections added since the last DB mount on the secondary won’t be seen until you dismount & remount that particular content database.
  • Decide whether we want to allow read/write access on the secondary farm, which will basically involve having to perform a full backup back to the primary SQL instance if we want to switch everyone back there again.

Things Can Go Wrong on Failover…

Given we’re mirroring a very complex environment in two completely distinct locations, lots can go wrong. By “go wrong” I actually mean “upon failover, fail to give normal service because there’s a problem”. Some problems can be severe enough that the entire application fails completely, making your secondary data-centre look something of an expensive experiment gone wrong; other errors can mean unexpected behaviour, like searching for something and not getting any results.

Anyway, there are two categories of failure when failing over if you’re not careful:

Badly Wrong

  • SharePoint patch level. If this isn’t done right (i.e. the same on both farms for the most part) then expect terrible things to happen. SharePoint binaries only work with specific SharePoint database schemas – if you fail over a version 15.0.6060.5000 schema to a version of SharePoint that’s never seen that version (because it’s old) then it won’t touch it.
  • Custom solutions. These need to be handled carefully – more on that soon.
  • Hand-made web.config changes that for some reason the application needs but that aren’t a SPWebConfigModification (so aren’t applied automatically).
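The patch-level point in the list above boils down to comparing build numbers. Here’s a small Python sketch of that check, treating “DR binaries at least as new as the database schema” as a necessary condition only, not a guarantee the database will attach. The function names and the older build number are hypothetical; 15.0.6060.5000 is the schema version mentioned above.

```python
from typing import Tuple


def parse_build(version: str) -> Tuple[int, ...]:
    """Turn a dotted build string such as '15.0.6060.5000' into comparable integers."""
    return tuple(int(part) for part in version.split("."))


def dr_farm_can_host(schema_version: str, dr_farm_build: str) -> bool:
    """True when the DR farm's binaries are at least as new as the database schema."""
    return parse_build(dr_farm_build) >= parse_build(schema_version)


# Hypothetical older DR build against the schema version from the text.
print(dr_farm_can_host("15.0.6060.5000", "15.0.5000.1000"))  # False
print(dr_farm_can_host("15.0.6060.5000", "15.0.6060.5000"))  # True
```

Comparing integer tuples rather than raw strings avoids mistakes such as ranking “15.0.999” above “15.0.6060”.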

Not so Badly

  • Managed paths. Make sure they’re the same between farms.
  • Service-application settings.
  • Performance – can the DR farm handle the same traffic as the 1st farm? This can be a “badly wrong” problem depending on how bad it gets.

Next Steps – Custom Solutions (WSPs)

Check out how to maintain custom solutions between DR farms in the next section –


// Sam Betts

Comments (37)

  1. Fadi Abdulwahab says:

    Great job  , Thanks for posting.

  2. Eric says:

    Why do you choose log shipping as the backup configuration for SQL?  Is there some reason why you don't use database mirroring?

    We use database mirroring on a SharePoint 2007 farm, but so far I have not had to fail over to that mirror.  Also we use a SQL alias on the Web Front End servers so if only SQL goes down then we can point the alias to the backup server (once active) and then there is no need for a DNS entry change.  I'm sure it is just another way to skin the cat, but with this set up we don't have to fail over the entire farm if only a portion of the first farm goes down.  Now if a WFE goes down then we do need a DNS change to our backup WFE, but if SQL goes down then we only need to change the alias.

  3. Two reasons; log shipping has almost no integration points so there's less that can go wrong – the tail log backup just has to arrive in a file-share within a certain time window; any way that a file can arrive. The other main reason is that with log-shipping you have very deliberately x2 logical & separate farms. If you blow-up farm #1 config DB then because that breaking change isn't replicated across, farm #2 will happily live on. There are of course scenarios where even log-shipped changes will kill both farms but you've reduced the likelihood of that happening at least.

  4. spbaldness says:

    So this is great but how can you keep your SP patch levels the same with no downtime?  If you fail over to farm 2 and patch farm 1 then failing back to farm 1 would have  an inherent risk due to patch levels being out of sync.  

  5. It's easily doable; you just have to fail-over to the DR farm in R/O mode, suspend log-file shipping, fail back to the primary once patching is complete, detach content DBs from DR farm, patch binaries on DR, resume log-file shipping (which will now include the patches for the upgrade), and everything resumes as usual 🙂

    More about it on my other post –…/patching-sharepoint-farms-with-no-downtime-high-availability-sharepoint.aspx

  6. Ali says:

    Great post, thank you. If the requirement is to have continued R/W, it appears the only way to get back to your primary farm from the secondary is a full backup/restore of the content database. Correct?

  7. Tim says:

    What about using SAN replication for the disks between the two sites?

  8. You mean for replicating log-files between sites? Yep, that should be fine as long as they end up getting restored to the 2nd SQL instance without any errors. That's more of a SQL Server issue at that point.

  9. Tim says:

    Thanks for your quick reply, but no, I mean SAN replication technology that is synchronously replicating the disk writes. Ideally this would be completely transparent to the host as it is not involved at all in the process.

  10. In theory that could work if it was done transparently to SQL and you could maintain a 1ms response-time between all SP machine & SQL. That's more for a single logical farm I'm assuming whereas log-shipping gives you two which you can hop between as & when needed.

    Perhaps if you want offsite DR for a logical farm you could look at AlwaysOn?

  11. Tim says:

    Yes, AlwaysOn is up for consideration as well. SAN disk replication is what we use for a well-tested offsite DR implementation using another RDBMS platform, so we will be testing it out with SP and see if it is another viable option. Thanks for the replies!

  12. Bill says:

    I'm assuming service accounts in read only DR farm need to be identical to production?

  13. Not necessarily, although I've not tested that scenario. Just as long as both accounts are in the right SQL database roles (WSS_CONTENT_APPLICATION_POOLS) you might find it works fine. Again though, don't take this as gospel, but in theory using different service accounts should work.

  14. Micah Nikkel says:

    Thanks for the illustrative and helpful post on the subject.  In terms of scripting the Failover / Failback of the Log-Shipped databases, we currently use this solution for our SharePoint environment:…/Streamline-Log-Shipping-Failovers

    Any thoughts or comments on this after testing it?

  15. Richard Pyra says:

    Great article, thanks for posting, hard to find anything that actually discusses how to set up the DR farm. We plan our using AOAG and hopefully I can add to this discussion later.

  16. Alex says:

    Thanks for the post Sam; to setup the DR farm, do you recommend restoring a backup of the production farm to the DR farm? Or, setting up the DR farm manually, as if it was a totally new farm? Also, will I need to be sure that all features/apps are installed in both farms prior to moving over my content databases?

  17. The DR farm should be identical to the main farm (i.e. have a PowerShell script that does it for consistency), but essentially it's own independent entity.

    I did another post about maintaining custom code between the two @…/managing-custom-solutions-for-disaster-recovery-sharepoint-farms.aspx

    Definitely don't restore a backup from the main farm 🙂

  18. Kenny says:

    Are there any additional considerations when using SQL Always On groups to keep the content databases in sync instead of traditional log shipping?

  19. Tyrone says:

    How does adding a DR farm affect the current backup schedule of a live farm.  In my case, I have a SharePoint farm up and running with nightly Full backups.  Whenever a Full backup is run it wipes out the transaction logs.  So wouldn't that create a failure to restore transaction logs on the DR farm?   Or does enabling Transaction Log Shipping create a separate set of logs just for the DR farm?  

    Sorry if that's a newbie question.  My SQL knowledge is very basic.

  20. Hi Tyrone,

    Running a full backup shouldn't touch the transaction logs so log-shipping can continue uninterrupted. If you need to be 100.00% sure, run the backups with the COPY_ONLY parameter.

    // Sam

  21. Hi Kenny,

    TBH, using AlwaysOn to keep the content DBs synced is probably even easier than log-shipping if it's an option. I'm looking at how that would work when using AO already on each farm though as it's not clear yet to me at least – stay tuned.

    // Sam

  22. Tyrone says:

    Thanks Sam!  I was confusing transaction logs with differential backups, whoops!  

    My other question is, how do we manage the transaction logs when they are copied over to a file server or DFS?  Do we have to manually delete them when they are no longer needed?

  23. I seem to remember old TRN files are cleaned-up when they're applied to the DR databases with the default scripts that are generated. Otherwise you'll need to implement some kind of clean-up script.

  24. Tyrone says:

    Sam, thanks for all of your support.  The (hopefully) last question that I would like to ask is that I am now seeing Critical events logged pertaining to the User Profiles service every hour on the standby server.  It seems to be related to the User Profile Synch service (User Profile Service Application – System Job to Manage User Profile Synchronization) which runs during the time a transaction log restore runs and fails to log in the Farm account.  That timer job runs every minute.

    I'm assuming this is expected behavior since the restore prevents the account from accessing the database while it runs a restore.  Is this something that you have encountered in the past?

  25. Hmm, the user-profile app shouldn't be synced between farms (which I presume it isn't) so that shouldn't error. Not sure.

  26. Amanda says:

    Hi Sam, this is a great article !!  I would like to ask whether you have similar steps for setting up a cold standby in DR?  And perhaps any links for setting up a SharePoint 2013 from scratch in DR.

    Thank you

  27. Thanks Amanda for your nice comments 🙂

    Cold-standby is generally understood as "I could build another farm and have it running in hours/days, should my 1st one blow-up". Not very interesting to blog about in other words, but have a look at the TechNet article on it by all means –…/ff628971.aspx

    // Sam

  28. Darren says:

    Great post. I just wanted to ask – other than SQL log shipping between the farms, is there any scripting necessary to keep other components (or SharePoint DBs that don't support log shipping) in sync?

  29. soupau says:

    Hi Samuel, i am planning to do a DR exercise for the first time. May you please share me the complete step by step process in performing a DR from PROD farm to DR farm. Very much helpful if you can


  30. cr.gomezm says:

    Good blog Sam, congrats!!!! 😉 … Actually we moved from an AlwaysOn architecture to a Log-Shipping one, I must say, although it requires a little more work and maintenance I am impressed on how it Works.

    Only want to know if someone has found probrems with this setup, we have encountered that for many of out content databases, SharePoint behaves weirdly, locking them out and preventing any operations from SQL engine, including log restoring… the only way to bring them back is stopping SharePoint services so SQL have full Access to the DBs (no services have been deployed on this farm at this time).

  31. Hi cr.gomezm,

    Interesting. I assume the DBs on the secondary node are read-only? SharePoint shouldn't be able to do anything other than read, so I can't see how that would prevent log restores unless SQL was somehow set to be passive when active connections are in place or something. Sounds like a SQL issue to me either way – I'd open a ticket with support to investigate if you have a Premier contract.

    // Sam

  32. cr.gomezm says:

    Hi Sam,

    It seems that SharePoint is locking out the database, we have been investigating the issue, by now we have found two workarounds: A) sync the restoring schedule on the secondary replicas with a daily system task that starts sptimer services on the farm and launches stsadm -o execadmjobs, launch a full crawl and stop again the services; B) create a dummy R/W empty database on each WebApp and then leave all the content DBs dismounted.

    In SQL activity monitor you can check that it seems to be a job, so we have tried to disable all Jobs with a locking property set as contentdatabase but the issue persists.

    Do you think maybe it can help running the stsadm -o preparetomove command on the DR farm?

    Thanks for all, this is driving us nuts!!!! maybe we will end up contacting premier support.

  33. cr.gomezm says:

    by the way… forgot to mention that we double checked that the option to disconnect users when restoring is marked.

  34. Hi Cr,

    Yeah sorry; this sounds like a SQL issue really. I couldn't say what's going on but you shouldn't need to do anything SharePoint side to switch backend instances. With log-shipping you should just set the secondary as the new primary and that's it – no further interaction needed…

    // Sam

  35. Rishi says:

    This is a great article and exactly what I was looking for. I tried this process but I’m configured with HNSC. I replicated the WebApp Content DB and our Intranet site Content DB. I created the DR webapp, deleted the DB then mounted the log shipped webapp content DB. I can connect to the webapp portal site in read only without issue. How do I attach the DB that was configured to leverage the HostHeaderWebApplication switch?

  36. Rishi12345 says:

    Great article. I tried to follow this but having problems with getting “Access Denied” when trying to access the read only site. I am currently using Host Header site collections. I am replicating the portal site content DB and the intranet content DB. Mount them both to the web application I created. I can open up the portal site web site, but not the host header site collection (Intranet). Any help would be appreciated!

  37. Rishi says:

    Any tricks or special configuration for host named site collections. I have no problem hitting the read-only Web App portal site but the main intranet site collection under the web app keeps giving me the following “error: Sorry, this site hasn’t been shared with you”. can’t find anything around HNSC. Any help would be appreciate as I’m doing a site backup, copy, and restore right now. Its about 5-6 hour DR downtime and lots of data to copy over nightly.
