Patching SharePoint farms is a rare occurrence but one that will almost certainly happen at some point, if only to stay on a supported platform as time goes on. When patching SharePoint it’s necessary to take the entire farm offline at one point, for reasons explained here at length. In short, a SharePoint farm depends on specific binary + database version combinations to operate properly, so it’s just impossible to stay running in any capacity during patch-time, which makes any high-availability promises for your SharePoint applications difficult to keep.
That said, it’s of course possible to run a SharePoint web-farm that doesn’t go offline during a patch process if you have a Disaster Recovery (DR) farm to lean on while you patch the main SharePoint farm. This article is about how that’s done; in short, it involves using said Disaster Recovery farm in read-only mode, with SQL Server log-shipping enabled as per this article. This is truly highly-available SharePoint patching, explained.
Update: this process is even better if you use SQL Server AlwaysOn. Check the new post!
Prepare & Verify Disaster Recovery SharePoint Farm
Key to making this work is switching users to the DR site while we take the primary farm offline, which obviously implies checking first that the DR farm is working and up-to-date. For this demo I’m too lazy to use a common name for both farms that I can switch via DNS, but here we see both farms working anyway, albeit with the DR farm still in read-only mode.
Check the SQL Server events for log-shipping errors (there shouldn’t be any). Once it’s confirmed we have up-to-date content databases we can add them to the SharePoint web-application in Central Administration or with Mount-SPContentDatabase.
Both primary and DR SharePoint farms working just fine with the DR farm in read-only mode as is normal. The DR farm is loading the application without any error though, which is key.
Once we've confirmed everything is ok we need to suspend log-file shipping until we’re done fully patching the primary farm. Do this in SQL Server Management Studio – find the backup job and disable it. The DR farm will probably start worrying log-shipping isn’t working but we don’t mind because this time we meant it.
Once both the backup and alert jobs are disabled we should be ready to patch the primary farm. When log-shipping doesn’t happen, SQL Server will likely freak out somewhat…
You may see this error on the DR farm SQL Server – SQL is worried that log-shipping has stopped, which it has because we stopped it.
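If you'd rather script the job changes than click through Management Studio, here is a minimal sketch that only builds the T-SQL for you to run against the primary SQL Server. It assumes the standard `msdb.dbo.sp_update_job` procedure; the job names are hypothetical placeholders, so substitute whatever your log-shipping jobs are actually called.

```python
def agent_job_toggle_sql(job_name: str, enabled: bool) -> str:
    """Return a T-SQL statement that enables or disables a SQL Server Agent job."""
    # Double any single quotes so the name is safe inside a T-SQL string literal.
    safe_name = job_name.replace("'", "''")
    flag = 1 if enabled else 0
    return f"EXEC msdb.dbo.sp_update_job @job_name = N'{safe_name}', @enabled = {flag};"


# Hypothetical job names: check SQL Server Agent for the real ones on your server.
for job in ("LSBackup_WSS_Content", "LSAlert_SQLPRIMARY"):
    print(agent_job_toggle_sql(job, enabled=False))
```

Calling the same helper with `enabled=True` gives the statements for switching the jobs back on once the primary farm is patched.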
Switch Users to Disaster Recovery Farm
A key step before we patch, though, is to fail over to the DR farm, however you do that. A common solution is to update a DNS “A” record for the website name so users start going to the second site’s network load balancer (NLB) instead of the primary NLB, but bear in mind that the change can take a while to propagate.
This step is really important though because we’re about to drop the primary farm – anyone still on it will not be happy if they see an error because their DNS cache was stale for example.
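One rough way to check the DNS change before pulling the plug is to ask the local resolver what the site name currently points at. This is only a sketch: the hostname and address in the usage comment are made up, and it only tells you what the machine running it sees, not what every user's cache holds.

```python
import socket


def resolved_addresses(hostname, resolver=socket.getaddrinfo):
    """Return the set of IP addresses the resolver currently gives for hostname."""
    infos = resolver(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def points_at_dr(hostname, dr_address, resolver=socket.getaddrinfo):
    """True only when every address returned for hostname is the DR load balancer."""
    return resolved_addresses(hostname, resolver) == {dr_address}


# Example call (hypothetical name and address):
#   points_at_dr("intranet.example.com", "10.0.2.10")
```

The `resolver` parameter is there so the check can be exercised without touching real DNS; in normal use you leave it at its default.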
Also bear in mind that users will be in read-only mode on the DR farm: it’s impossible to reconcile user changes made in an unpatched content-database with the changes the patch makes to the primary copy, so we keep it simple by not letting any changes occur until we’re patched and back on the primary farm again. Well, impossible it might not be, but there’s certainly no supported way of doing it that I can think of right now.
Begin Patching Primary SharePoint Farm
Now that users are being sent to the DR farm, let’s fire the patch up on all the servers in the primary farm.
Patching binaries on the server – this can take a while. Once it is all done on all servers we need to upgrade the databases. Last time I did it with the wizard; this time for variety let’s use psconfig.exe.
That “100%” is what we’re looking for – that means the patch changes have been applied to all relevant SharePoint databases, including our content database. We now need to run the same command on every other server in the farm or just run the Configuration Wizard which should do the same post patch install.
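The exact arguments from the screenshot aren't reproduced in the text, so the sketch below uses a commonly used build-to-build upgrade invocation (`-cmd upgrade -inplace b2b -wait -force`) as an assumption. The path to psconfig.exe depends on your SharePoint version, and treating a zero exit code as success is also an assumption; the on-screen "100%" is still the thing to look for.

```python
import subprocess

# Assumed build-to-build upgrade arguments; adjust to match your own runbook.
PSCONFIG_UPGRADE_ARGS = ["-cmd", "upgrade", "-inplace", "b2b", "-wait", "-force"]


def psconfig_command(psconfig_path="psconfig.exe"):
    """Build the argument list for a build-to-build upgrade run of psconfig."""
    return [psconfig_path, *PSCONFIG_UPGRADE_ARGS]


def run_psconfig(psconfig_path="psconfig.exe"):
    """Run psconfig on this server and return (succeeded, captured output)."""
    result = subprocess.run(
        psconfig_command(psconfig_path), capture_output=True, text=True
    )
    # Assumption: a zero exit code means the upgrade finished without errors.
    return result.returncode == 0, result.stdout
```

Run it on one server at a time, from an elevated prompt on the SharePoint server itself.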
Once that’s all done let’s open central administration so we can check there’s nothing else to do:
The build version is newer and there’re no more pending actions on the servers in the primary farm except perhaps to verify the sites/apps still load. Also remember to check in Central Admin for any new health warnings – sometimes I’ve found the security token service can need a restart after a build-to-build upgrade.
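If you note down the build number each server reports, a few lines are enough to spot a server that got missed. This is a sketch with illustrative server names and build numbers; where the numbers come from (Central Administration, a script, a spreadsheet) is up to you.

```python
def parse_build(version: str) -> tuple:
    """Turn a dotted build string such as '15.0.4569.1000' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def servers_behind(target_build: str, server_builds: dict) -> list:
    """Return the names of servers whose build is older than target_build."""
    target = parse_build(target_build)
    return sorted(
        name for name, build in server_builds.items() if parse_build(build) < target
    )


# Illustrative values only.
builds = {"WFE1": "15.0.4569.1000", "WFE2": "15.0.4569.1000", "APP1": "15.0.4420.1017"}
print(servers_behind("15.0.4569.1000", builds))
```

Comparing tuples of integers rather than raw strings matters because, compared as text, a build part like "10337" would sort before "4569".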
The important thing to do now is test that everything is working as expected before we switch back the users.
Switch Users Back to Primary Farm
We’re now basically done with our DR farm as far as users are concerned. Fail back to the primary farm the same way you failed over to the DR farm before patching.
Resume Log-Shipping to Disaster Recovery Farm
First of all we want to disconnect the about-to-be-magically-upgraded content database from the not-yet-upgraded DR farm to avoid any potential conflicts. The worst thing that would happen is that SharePoint would refuse to touch it but still, just to be safe, dismount the content databases, either in PowerShell or Central Administration, before we resume database syncing.
This is how it looks if a SharePoint farm tries to use a content database that’s incompatible with the farm binary version (it says “WSS_Content is in an unsupported state, and could not be used by the current farm”). But we’re getting ahead of ourselves of course – this message only appears once log-shipping is resumed and the patched database makes it to the DR farm.
Re-enable the SQL Server Agent backup job (and the alert job, if you disabled it), then run the backup job (right-click on the job and select “Start Job at Step…”).
The next backup it does will include all the changes made by the patch so it’ll be bigger than normal. Sure enough, notice the file-size of the latest TRN file.
When that log-file backup gets restored to the DR farm, the patch for that content-database will be complete ahead of the DR farm binaries so the DR farm will quite likely start to fail as per the above message if the content database is still attached to the farm.
When it’s finally restored you should see something like this in the event-logs of the DR farm SQL Server:
Verify Log-Shipping is working to DR Farm
Let’s just make sure the secondary SQL Server is picking up the log-files again so we can make sure our DR farm is ready for an unplanned failover if necessary. Open up the event-viewer on the active node if it’s a cluster or just the machine if it’s just a standalone machine.
Notice the series of errors about SQL Server complaining it’s not been fed any data for a while and its DB copy is quite likely out-of-date. Normally this would be a problem but in this instance we deliberately caused it so we can ignore these messages. Event ID 18268 gives us a nice confirmation we’re back in the game though and restores are working as normal again.
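Besides the event log, the secondary server records the last restore time in the `msdb.dbo.log_shipping_monitor_secondary` table (the `last_restored_date` column). Assuming you've read that value for the content database, a small check like the sketch below tells you whether restores have caught up. The 45-minute threshold is an assumption; use whatever lag your DR plan tolerates.

```python
from datetime import datetime, timedelta


def restore_is_current(last_restored, now, max_lag=timedelta(minutes=45)):
    """True when the last log restore is no older than max_lag."""
    return (now - last_restored) <= max_lag


# Illustrative timestamps only.
checked_at = datetime(2014, 5, 1, 12, 0)
print(restore_is_current(datetime(2014, 5, 1, 11, 40), checked_at))
print(restore_is_current(datetime(2014, 5, 1, 9, 0), checked_at))
```

Right after resuming log-shipping the check will fail until the first restore completes, which matches the stale-data errors described above.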
Patch DR Farm to Match Primary & Job Done!
The only thing that remains is to bring the DR farm up to the same patch level as the primary farm. Once the content-database has been confirmed as being in sync again, patch the DR farm as before.
That’s it! Your disaster-recovery/hot-standby farm should now be operational just as before – ready to pick-up the load if the primary farm goes offline for any reason but now on a newer SharePoint build. We’ve managed to perform major heart-surgery on our SharePoint farm without losing any data or traffic.
// Sam Betts