Something that’s often required for high-availability SharePoint installations is the ability to fail over to a second web-farm entirely, either because a failure of some kind has knocked the 1st farm out, or because maintenance will take the 1st farm offline completely.
Edit: If you want a quick overview of SharePoint DR, then this new post is for you.
It’s simple enough in concept but isn’t particularly easy to set up in SharePoint. Sure, it’s not cheap, but as a business we care about uptime first, and this is pretty much the only way of being sure we’ve removed any single point of failure for our super-important SharePoint farm + all the apps that run on it.
Update: SharePoint DR is even more awesome with SQL Server AlwaysOn – check out how to do this in my new post here!
In this guide I’ll build a second “datacentre” to mirror the first one, then we’ll see how we can switch users to the 2nd farm while the 1st one goes offline. Farms going offline can happen for all sorts of reasons, deliberate (farm patching) or accidental (whoops, something fatal happened). Either way, the goal is clear: keep the SharePoint applications online at all costs.
So for our test we’re going to assume we have a nice SharePoint 2013 farm that we want a hot-standby for in case it goes offline for any reason (hint: there are many). The goal is simple: when farm 1 goes dark we want farm 2 to service requests instead, and we don’t want users to realise we’ve had a farm-eating problem. Our target setup will be something like this:
We’re building everything on the right so that when a failure occurs we can just redirect users there while the whole of the left-side SharePoint farm recovers from whatever knocked it offline. This is prepared for a complete lights-out scenario if necessary, in which case we switch to this:
Building a Hot-Standby/Disaster Recovery SharePoint Farm
We call it a DR farm for short and broadly speaking setting up such a secondary farm involves:
- Setting up a secondary SharePoint farm with its own configuration DB, search app, etc, but with no web-application yet.
- Enabling log shipping for the content DB(s).
- Once the content databases are backed-up to the 2nd farm, configuring the applications to use them (don’t create new content DBs).
- Finally on failover, decide what you want the users to be able to do. The options are: read-only mode until farm 1 lives again (easiest), or full read/write in which case we’re going to have to figure out migrating the content changes back to the primary farm. More on this below.
Prepare Secondary Farm
We need a completely separate, second farm configured for when our first farm goes offline for whatever reason. The whole point of this exercise is so we can divert users to a second farm entirely, but one that is magically just as up-to-date as the first, so the first farm can go completely dark if necessary. Failovers are never predictable, so a true high-availability SharePoint environment will need x2 of everything locally – AD, SQL, and obviously SharePoint – because any one of those could knock out our primary farm.
Install & Configure SharePoint
I’m going to assume that there’s already a perfectly running SharePoint web-farm. The first step is obviously to install the 2nd farm & make sure it has the same SharePoint patch-level as the first. Really, make sure those SharePoint patch-levels are identical if you want your failover to work.
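A quick way to compare builds is from the SharePoint Management Shell on a server in each farm; a minimal sketch (the build numbers reported must match between farms):

```powershell
# Run on a server in each farm and compare the output – the builds must match.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
$farm = Get-SPFarm
Write-Host "Farm build version: $($farm.BuildVersion)"
```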
Enable Log Shipping from Primary SQL Server to Secondary SQL Server
There are three steps to log-shipping once it’s set up, and all three will need to be configured:
- Copy database transaction logs to backup target.
- Copy from server 1 to server 2. This can be done automatically via Distributed File System if that’s configured after log-shipping is set up.
- Apply the copied transaction logs to the secondary copy.
- Pause & repeat. Every 15 minutes by default.
Purple lines are done by the primary SQL instance; green by the secondary. For the purposes of this test we’re making the two file-shares the same thing, but in the real world this wouldn’t work so well if, say, the two datacentres are in different countries. The backup/restore processes aren’t really designed for gently pushing/pulling bits across a WAN; they’re designed to read/write the core DB as quickly as possible from/to a local staging area, hence this extra copy stage.
This whole process does however need to be set up first, as obviously a copy of the database needs to exist on the 2nd server, and in the right mode. SQL Server Management Studio will give you the option of initialising the content automatically if you want, meaning it can perform a full backup of the content database and restore it to a location on the 2nd server, ready for log-shipping to begin. This is what we’ll do in this guide as it’s the simplest.
For the purposes of this demo, SQL-SP2013\SP15 is our primary SQL instance with the content databases in read/write mode. SQL2-SP2013\SP15B is our secondary, read-only, with its own configuration DB, service-applications, etc.
So first, find the content databases we want to enable log-shipping for in Management Studio and enable them for log-shipping like so…
Click the back-up settings button to configure how transaction-log backups will happen.
You need to specify where the logs will be copied to first; this should be somewhere “local” to this SQL instance, that’s to say somewhere close by on the network (remember, the whole idea of the second farm is to have it far away so any local disasters won’t affect it).
This job will back-up the most recent transaction-log data since the last backup to a file-share. You’ll probably want to leave the default values here. Click OK and close.
Now we have to configure what instance we’re restoring to. There has to be a copy of the content databases we’re log-shipping on the target 2nd SQL Server at some point, and given we’ve not created any web-application yet on the 2nd farm, we have no content databases on that SQL Server yet. For this test we’re going to get SQL Server to do the hard work of restoring a copy on the 2nd instance for us, so after connecting to the 2nd SQL Server let’s select “generate a full backup and restore it onto the secondary”.
Connect your 2nd SQL instance here. We want SQL Server to initialise the target content DB too – on the 2nd database server there should be no sign of our content database WSS_Content yet. If there is, this operation will fail.
Now click on “Copy Files” to configure where the transaction logs are copied to having been backed up by the 1st SQL Server.
It’s the job of the 2nd SQL Server to pull the files from the backup location to the location we’re configuring now, but this isn’t essential if you have your own system for transferring files to a remote location – Distributed File System, for example. In our example the source/destination are the same, so we’re kinda skipping this step just for simplicity.
Finally, the transaction-log restore job. Here there are some things you really need to not leave to default.
Important SharePoint Specific Configuration!
First & foremost, make sure you allow standby mode for the restore, or the secondary farm won’t be able to read the DB until it’s manually restored to a usable mode. While transaction-logs are incoming we can use the database in read-only mode if we select “standby mode”. This allows SharePoint to at least use the database, even if nothing can be written.
Also very important is the “disconnect users” option for the restore. This is because we need the restore to work above all else; if SharePoint has a connection open, for example, the restore would fail without this option selected. That doesn’t sound so bad, but if restores were blocked for long enough, old (still unapplied) transaction-log files would be deleted as per the backup clean-up settings, leaving you with a broken transaction-log chain – at which point the only way to update the secondary copy is a full backup & restore again. That could obviously be a big deal, depending on the database size. In short, make sure SQL Server can always restore transaction-log backups, no matter what; if we need to use this database copy because of an outage at SQL Server 1 then we won’t be getting any new transaction logs anyway, so disconnects won’t happen.
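For reference, the restore job the wizard creates runs something equivalent to the statement below (the file paths here are made-up examples). STANDBY is what keeps the database readable between restores; the “.tuf” file holds the undo information needed to roll forward the next log:

```sql
-- Roughly what the secondary's restore job does (paths are examples only):
RESTORE LOG [WSS_Content]
FROM DISK = N'\\FileShare\LogShipping\WSS_Content_20131031.trn'
WITH STANDBY = N'D:\SQLData\WSS_Content_RollbackUndo.tuf';
```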
For the record, you’ll know if SharePoint tried to access the database during a restore because the login will be denied – you’ll see event ID 3760, as if the login isn’t set up correctly. In reality it is, but SQL has booted the connection SharePoint had open as per our configuration, and is in “single-user” mode while it restores the transaction log, denying all remote logins until it’s done.
Finish Log-Shipping Configuration
Click OK. This should now run:
What’s actually happened is it’s run x2 scripts, one on each SQL Server, to create x3 SQL Server Agent jobs to do the above-mentioned tasks: back-up, copy, restore.
Here you see the jobs created for the transaction log shipping, and below we have our content database in the correct mode for log-shipping to take place:
The database is mounted on the secondary SQL Server but in read-only mode.
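You can also confirm the state from T-SQL on the secondary instance; a quick check (assuming the database name from this guide):

```sql
-- Run on the secondary instance to confirm the database state:
SELECT DATABASEPROPERTYEX('WSS_Content', 'Updateability') AS Updateability, -- READ_ONLY
       DATABASEPROPERTYEX('WSS_Content', 'IsInStandBy')   AS IsInStandBy;   -- 1
```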
Mount 2nd Database Copy
Once we have a content database we can create a SharePoint web-application to use it, but we need to add the application without a database first and attach the synced content DB afterwards. If you just supply the synced database name with the new application settings, you’ll get a constant stream of errors from the search app complaining the DB is read-only. So either create a new app with a temporary database name that you delete afterwards, or use PowerShell to create an application without any content database given – then add the content database with Mount-SPContentDatabase or via Central Administration.
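A sketch of the temporary-database approach in PowerShell – the names, accounts and URLs below are examples only:

```powershell
# Create the web-application with a throwaway content DB...
$wa = New-SPWebApplication -Name "DR Web App" -Port 80 `
    -ApplicationPool "DRAppPool" `
    -ApplicationPoolAccount (Get-SPManagedAccount "CONTOSO\spapppool") `
    -DatabaseName "WSS_Content_Temp"

# ...detach the temporary database (drop it in SQL afterwards if you like)...
Dismount-SPContentDatabase "WSS_Content_Temp" -Confirm:$false

# ...then attach the read-only, log-shipped copy instead.
Mount-SPContentDatabase -Name "WSS_Content" -WebApplication $wa `
    -DatabaseServer "SQL2-SP2013\SP15B"
```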
Once done, open your web-application with the read-only content DB copy and you’ll see something like this:
The database is in read-only mode until we stop the log-shipping.
Testing the Log Shipping & Failover Farm
So let’s make sure our updates are being published to the 2nd farm then. We want to make a change to the 1st farm web-application and check the change is replicated to the 2nd farm.
Here we make a change to “sfb-sp15-wfe1” (slightly odd name for a SharePoint web-app, but adding an NLB + a proper DNS name per farm was just a bit too much for this guide :P). We’ve changed the title, and we can see that a transaction log-file has been shipped after the page-edit:
Here we can see the 1st farm with the edit applied, and in the file-share the backup uses we can see a transaction log generated after the edit, which should contain it. Now we just have to wait for the destination server to pull the file off and restore the transaction log to the WSS_Content database. We’ve left the defaults in, so this job runs every 15 minutes; once the new transaction log becomes available to the 2nd server, it can take up to 15 minutes from then.
So if we look at our offsite farm SQL Server we see the job has run successfully after the last file was shipped:
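If you’d rather check from T-SQL than the Agent job history, `msdb` records every restore; a quick query on the secondary instance (database name assumed from this guide):

```sql
-- On the secondary instance: when were the last transaction logs restored?
SELECT TOP 5 rh.restore_date, rh.restore_type, bs.backup_finish_date
FROM msdb.dbo.restorehistory AS rh
JOIN msdb.dbo.backupset AS bs ON rh.backup_set_id = bs.backup_set_id
WHERE rh.destination_database_name = 'WSS_Content'
ORDER BY rh.restore_date DESC;
```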
Failing Over in Real Life – It’s Blown up For Real!
When your primary data-centre goes dark there is of course the question of “are we expecting the primary farm to come back online any time soon?”. We could enable read/write access but then we’d have to get those changes back to our primary SQL Server instance if we want users back on that farm again, which can be a hassle. For some failovers it’s enough to have just a read-only copy until the problem with the primary farm is resolved and log-shipping resumes – that way no changes need to be resolved back to the primary.
As part of the failover you’ll need to decide if you’re going to allow writing as everything will still be read-only mode until you do.
How to Failover to Your DR Farm – 1. Get the Latest Log-Data
The first thing you’ll want to do is make sure your secondary has the latest changes possible from the primary, by finding the last available transaction logs to be restored. To some extent you’ll have to figure out logically where to get them – is the 1st SQL instance responding enough to generate more logs? If not, what was the last transaction log it did generate? Did it copy to the 2nd file-share? Etc. The point is, you’ll need to find them and restore them to the 2nd instance if you want your failover farm to have the latest content possible. How far you go is really down to your individual needs/usage, but grab all the TRN files you can & restore them.
Here I’ve gotten the latest TRN files from the incoming share and copied them into the backup folder for SQL Server to restore to the secondary copy. If you can generate a newer transaction-log backup (assuming the 1st SQL instance is even responding) then do. Select the latest transaction-log – SQL is clever enough to find all the previous logs needed to restore everything in the LSN chain.
On this screen we see what backup we really have in terms of LSN sequencing. We need everything from the last backup restored up until the latest TRN file – the newest possible. If there’s a gap in the LSN chain we can’t restore the log-files!
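Doing the same manually in T-SQL looks roughly like this (file names and paths are examples). Keep STANDBY while more logs may still arrive so the database stays readable; only recover once you’re sure you have the last log:

```sql
-- Apply remaining logs in LSN order, staying readable between restores:
RESTORE LOG [WSS_Content]
FROM DISK = N'\\FileShare\LogShipping\WSS_Content_0930.trn'
WITH STANDBY = N'D:\SQLData\WSS_Content_RollbackUndo.tuf';

-- When this is definitely the last log and you're failing over for real:
RESTORE LOG [WSS_Content]
FROM DISK = N'\\FileShare\LogShipping\WSS_Content_0945.trn'
WITH RECOVERY;  -- brings the DB fully online; no further logs can be applied
```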
2. Redirect Traffic to Disaster Recovery Farm
So anyway, once you decide you do want to failover you’ll want to change the DNS for your website first so users start being directed to the 2nd farm. So instead of www.MySuperAwesomeSite.com going to 10.0.0.1 which would be the Network-Load-Balancer (NLB) of farm #1, we want it to go to 10.0.1.1 which is the DR farm NLB. There’s nothing magical about this; update the DNS ‘A’ record ASAP to start sending people to the secondary – DNS can take a while to propagate so bear this in mind.
Another take on this is to have a separate DNS name for your second farm – something like ReadOnly.MySuperAwesomeSite.com for example & have a reverse-proxy (TMG Server for example) redirect all requests to the 2nd name with a simple HTTP 302.
3. Stop Log-Shipping
Disable the jobs that execute the log-shipping, on both nodes (if possible). As you saw above, these jobs are in the SQL Server Agent; disable them from Management Studio. Alternatively, finding the source database’s properties in Management Studio, unselecting “Enable Log-Shipping” (the opposite of what’s described to set it up) and clicking OK will do the same.
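You can also disable the Agent jobs from T-SQL. The job names below are what the wizard generates by default for this guide’s database/instance names; check yours under SQL Server Agent > Jobs before running anything:

```sql
-- On the primary instance:
EXEC msdb.dbo.sp_update_job @job_name = N'LSBackup_WSS_Content', @enabled = 0;

-- On the secondary instance:
EXEC msdb.dbo.sp_update_job @job_name = N'LSCopy_SQL-SP2013\SP15_WSS_Content',    @enabled = 0;
EXEC msdb.dbo.sp_update_job @job_name = N'LSRestore_SQL-SP2013\SP15_WSS_Content', @enabled = 0;
```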
4. Enable Read & Write for Users?
If you’ve decided to go all the way with your secondary farm and enable full read & write then you’ll need to set the content databases as writable. To do so, do the opposite of this – http://blogs.msdn.com/b/sambetts/archive/2013/05/06/set-sharepoint-content-database-in-read-only-mode.aspx or just run (per DB):
ALTER DATABASE [WSS_Content] SET READ_WRITE WITH NO_WAIT
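One caveat: if the database is still in standby (restoring) mode from log-shipping, it needs to be brought fully online before it can be altered. A sketch of the full sequence per database:

```sql
-- Bring the database out of standby/restoring mode first
-- (this ends the log-restore chain for good!)...
RESTORE DATABASE [WSS_Content] WITH RECOVERY;

-- ...then make it writable for SharePoint:
ALTER DATABASE [WSS_Content] SET READ_WRITE WITH NO_WAIT;
```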
Other Things You Need To Do On Failover
Unless you have some kind of elaborate script that can detect when a web-farm has gone offline with 100% confidence, you’ll want to do this manually. Given this isn’t a simple setup the process isn’t particularly automatic, but the point is when our production farm goes dark for reasons unknown and the bosses are screaming down the telephone at you, believe me when I say you’ll be happy you have a DR farm ready to rock & roll; to cover you until you can figure out what went wrong with the 1st farm at least.
So to failover to our standby farm we need to:
- Prepare the farm for operation:
- Disable the SQL Server Agent log-shipping jobs.
- Dismount/mount content databases to refresh site-collection count. Any new site-collections added since the last DB mount on the secondary won’t be seen until you dismount & remount that particular content database.
- Decide whether we want to allow read/write access on the secondary farm, which will basically involve having to perform a full backup back to the primary SQL instance if we want to switch everyone back there again.
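The dismount/remount step from the list above, sketched in PowerShell (database and server names are this guide’s examples; the web-application URL is made up):

```powershell
# Re-mount the content database so the DR farm picks up any site
# collections created on the primary since the DB was first attached.
Dismount-SPContentDatabase "WSS_Content" -Confirm:$false
Mount-SPContentDatabase -Name "WSS_Content" `
    -WebApplication "http://sfb-sp15-wfe1" `
    -DatabaseServer "SQL2-SP2013\SP15B"
```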
Things Can Go Wrong on Failover…
Given we’re mirroring a very complex environment in two completely distinct locations, lots can go wrong. By “go wrong” I actually mean “upon failover, fail to give normal service because there’s a problem”. Some problems can be severe enough that the entire application fails completely, making your secondary datacentre look like an expensive experiment gone wrong; other errors can mean unexpected behaviour, like searching for something and not getting any results.
Anyway, there are two categories of failure when failing over: things that go badly wrong, and things that don’t go quite so badly.

Badly
- SharePoint patch level. If this isn’t the same on both farms (for the most part) then expect terrible things to happen. SharePoint binaries only work with specific SharePoint database schemas – if you fail over a version 15.0.6060.5000 schema to a SharePoint install that’s never seen that schema version (because it’s older) then it won’t touch it.
- Custom solutions. These need to be handled carefully – more on that soon.
- Hand-made web.config changes that for some reason the application needs but isn’t a SPWebConfigModification (so isn’t applied automatically).
Not so Badly
- Managed paths. Make sure they’re the same between farms.
- Service-application settings.
- Performance – can the DR farm handle the same traffic as the 1st farm? This can be a “badly wrong” problem depending on how bad it gets.
Next Steps – Custom Solutions (WSPs)
Check out how to maintain custom solutions between DR farms in the next section – http://blogs.msdn.com/b/sambetts/archive/2013/10/31/managing-custom-solutions-for-disaster-recovery-sharepoint-farms.aspx
// Sam Betts