What is AlwaysOn and why use it for SharePoint?
Update: SQL 2014 + SharePoint 2013 (abridged) guide also available. Also if you want to use AlwaysOn for SharePoint Disaster Recovery, then you’ll want to have a look at this article – http://blogs.msdn.com/b/sambetts/archive/2015/04/24/setting-up-sharepoint-disaster-recovery-sites-with-sql-alwayson.aspx
AlwaysOn is a great new feature to SQL Server 2012 that allows consuming clients to have a mirroring and failover database solution all transparently for a single named instance. SharePoint is one such consumer of SQL and depends on having a solid connectivity to run. If SQL goes down, so does SharePoint so it’s well within our interests to configure a bullet-proof SQL Server solution that just won’t die.
In this blog post we’ll set up an AlwaysOn cluster for a SharePoint 2013 farm in the following stages:
- Create SharePoint databases.
- Configure a SQL alias for a consistent SQL instance name.
- Create farm on SQL instance/replica 1, creating all databases needed (service-apps & web-apps).
- Stop SharePoint so databases remain static during migration to an AlwaysOn cluster.
- Create AlwaysOn high-availability group.
- Restore all the DBs onto SQL replica 2 (with NORECOVERY).
- Create AlwaysOn availability group
- Join the replica 2 databases to availability group.
- Create listener.
- Migrate SharePoint onto AlwaysOn on cluster
- Reconfigure SQL alias for new listener.
- Restart SharePoint services with updated alias.
- Test failover and that SharePoint can survive it.
- The failover process in action
- Why SharePoint doesn’t survive the 1st time.
Test Environment Details
For this demonstration I’ll be using a “super-cluster” configuration just because it’s the mother of all high-availability solutions. Each member for the AlwaysOn cluster will itself be a failover cluster with shared disks rather than a standalone machine. Why? That way we have instance redundancy and data redundancy too, all for the same logical SQL Server instance. Normally the two instances will be mirroring each others’ data but if one replica node goes offline the affected replica will live on no worries, and even if a whole replica (both nodes) does goes offline we can just failover to the other. The two replica members for this cluster will be:
- SQL cluster 1 (primary replica) – SQL-SP2013\SP15
- SQL cluster 2 (secondary replica) – SQL2-SP2013\SP15B
Both these instances are also on their own subnet so we’re covered for a whole bunch of doomsday scenarios – that being the whole point of this setup and blog-post. The parent cluster uses a file-share witness on a 3rd subnet to verify connectivity out of the subnet, or a routing failure.
To span both subnets we’re going to create a single logical SQL Server instance with IPs in both subnets 1 & 2, for SharePoint to connect to.
This is the SQL backend we’re aiming for in this example – it’s a cluster of clusters! Please note and to reiterate; for AlwaysOn to work this “cluster-of-clusters” setup isn’t required; you can setup an AlwaysOn cluster with just x2 standalone SQL Servers too, it’s just a single node failure would result in loss of mirroring whereas both replicas being failover clusters themselves allows us to survive up-to 3 simultaneous machine failures in our SQL Server cluster without breaking into a sweat! And that’s the goal here; a zombie SQL setup that just won’t die, for a SharePoint farm to lean on.
Create a SharePoint Farm on an AlwaysOn Cluster
Before we create a farm let’s connect to our two instances that’ll make up the replicas in the AlwaysOn cluster.
Both SQL clusters instances are ready and empty.
Create SharePoint Databases
Prepare SQL Alias
We need to fool SharePoint into thinking the SQL server where we’ll create the configuration database is called something else. That’s because when you create a farm, whatever destination server you use during the “create new farm” stage isn’t forgotten (to simplify), so first thing’s first let’s create fake alias for our SQL “server” – “SP15” as it’s SharePoint 2013 that’ll use it (version 15). As far as SharePoint needs to know everything is going into a SQL box called ‘SP15’.
Now assuming you don’t have SQL Server tools installed (which you shouldn’t if it’s on production!) you’ll need to use a tool called “cliconfg.exe” to create a SQL alias. Run it; click the “Alias” tab, and the Add button.
Add a name – SP15 for the alias name, but it can be anything – instead of referencing a server-name we’ll reference the name of the alias which will transparently redirect traffic to the alias destination instead.
Install & Configure Farm
The next thing we need to do is create all the databases we’ll need on the farm. For this example I’ll use a standard, home-baked script to create my basic services, accounts, and applications – and their corresponding databases:
- Farm (1 database; SharePoint configuration)
- Service Apps:
- User-profiles (3 databases; profiles; social; sync)
- Enterprise search (4 databases;
- State service (1 database)
- Usage & health service (1 database)
- Content apps:
- Default port-80 web-application (1 database)
We’re not so worried about having the perfect configuration at the moment so much as just getting the databases created.
The PowerShell script, doing the business. Notice the “database server” name is our faked alias and yet there’s no problem; aliases are very useful for this.
SQL instance 1 now with all the databases we’re now going to add to a High-Availability (HA), AlwaysOn group.
Stop SharePoint so databases remain static during migration
We now have everything we need to create the single SQL point of access. First thing’s first though, let’s stop any updates happening to any SharePoint database happening until we’re all setup by shutting the SharePoint services down – IIS, SharePoint timer are the important ones. We’ll restart them once we’ve updated the alias to use the AlwaysOn group and SharePoint will be none the wiser.
And we’re done with SharePoint for now.
Create SQL Server AlwaysOn Group
Restore all the DBs onto SQL instance 2 (with NORECOVERY)
Next thing we need to do is copy the databases to the 2nd instance. Part of the AlwaysOn wizard can do this automatically for you if both instances have exactly the same database file locations however we can’t do this as each failover replica has its own shared cluster-disk for its’ data files, and each disk has its’ own drive letter, therefore the paths will never match-up. And that’s a shame, but we have to do this the hard way.
To join the 2nd replica we need to run a full backup of all the databases on replace 1 but for now we’ll start with just the Central Administration content DB.
Take a full backup. Copy to destination server. Restore.
No need to create a new empty database. Just directly restore & SQL will create the necessary bits & pieces for you.
Important: the restore needs to be done with the NORECOVERY option. If this isn’t done, AlwaysOn can’t use the database.
Once successful, the database will stay in the “restoring” state. Don’t worry about this – it’s just SQL is basically saying the DB isn’t usable from this instance, for now.
Create AlwaysOn availability group on one database
Next, we’re going to create our AlwaysOn instance. We’re going to only add the Central Admin content DB to the group for now…it’ll be clear later why.
On the primary server (SQL-SP2013 in this example), expand “AlwaysOn High Availability” (this requires you to enable AlwaysOn in the SQL Server Configuration Manager) and click “New Availability Group Wizard…”
Add name. This will be the name of the role added to the parent cluster. Click “Next”.
This is where we specify which databases to add to the HA group. We’ve only backed up one the CA content DB so this is only one we can add for now.
Also this screen is a good place to point out something this technique requires that SharePoint doesn’t by default configure – a full recovery model for databases. We’ll change this later, but for now we just want to create the HA group. Click Next.
Here’s where we specify which SQL instances will make up our AlwaysOn HA group. Click “Add Replica” and connect to the 2nd instance. We’ll create a listener later; just accept the defaults for now.
Important notes here about the AlwaysOn configuration:
- The default is “asynchronous commit” for data-safety settings which basically means the primary instance doesn’t wait for the secondary instance to confirm its’ COMMIT until the primary does its’ own COMMIT for any updates to the AlwaysOn databases. This isn’t necessarily unsafe but there’s a small potential for data loss should the network breakdown between instances during an update of some kind, however performance is much better. We care about safety over speed in this example, so we’re going to “synchronous commit”.
- You can have the secondary instance as a read-only instance but this slows down the machine so it’s off by default. Obviously if there’s a failover the secondary instance will start accepting sessions again from SharePoint but enabling this is probably not necessary right now.
Here’s where we kick-start the synchronisation. Again, SQL Server can save a lot of pain here & do everything for you (saving you the whole backup/restore process above) but as I mentioned, my SQL instances have different drive-letters for their data-stores so a full synchronisation will never work. We’ll go with “Join”.
After a while, this should be what you see. If you get any error it’s likely because the backup/restore wasn’t done correctly – check the backup was a full backup?
Also it’s worth mentioning there’s some clustering considerations AlwaysOn requires like making sure each replica failover instance can’t be cross-hosted; I’m not sure why this is, but you’ll need to make sure if you have a cluster-of-clusters as I do here each cluster member can only host one failover cluster or the other, but not both.
Assuming it worked you should you be able to right-click on the availability group, select “Dashboard” and see this wonderful screen of green:
A couple of points here:
- We’ve configured the group for synchronous commits for both nodes. Again, this has a performance hit as a round-trip “ok” is needed before any writes are completed, but it is the safest. This makes little difference for SharePoint sites that are read-heavy compared to writes; a publishing site say, but could drag if lots of updates occur instead, say in a collaboration site. You can run in asynchronous mode which doesn’t wait for the ACK from the secondary node before continuing, but that’s at the risk of losing data.
- For the same reason the “failover readiness” says “no data loss”. If we were running async commits this would say “data loss” for the secondary databases but this is just over-dramatizing the fact replica 1 doesn’t wait for replica 2 to commit before committing itself, so it can’t know the failover commit occurred and therefore it assumes the worst. If you want absolute guarantees and don’t mind the performance hit, make the AlwaysOn group perform synchronous commits.
- Green is good; if one day it turns not-green (i.e. you have an unexpected failover or communication failure) then you need to figure that out with your DBA/networking people – don’t blame me, I’m just a SharePoint guy 🙂
Restore remaining databases to SQL machine 2
Next up; we need to set every database that isn’t already configured to full recovery that mode. I’d point out that there are possible supportability issues here in that the product-group haven’t officially said this is a supported action or not – I say this as a disclaimer on the very slim off-chance there are issues with changing the recovery model on SharePoint databases. There shouldn’t be any issues though although performance is obviously faster if simple transaction-logging is used.
It’s fairly simple to change the recovery model – right-click on a database, open “options”, select the recovery model there and press OK. That’s it – no downtime or restarts or anything needed.
Do this for any database not in full-recovery mode. That’ll be:
- Any search database.
- Any user-profile app database.
Next, backup each database as before. Copy as before.
Restore as before (WITH NORECOVERY). Now your SQL Server Management Studio should look something like this:
Scripting restores can be quicker than managing the GUI for multiple DBs at once (if they’re not overly complicated in file-structure that is). All databases are now ready to add to the high-availability group!
Add databases to availability group
On the primary instance, expand the AlwaysOn groups; right click on ours and select “Add Database”. We should now see the following:
Select them all & click next. Again, we’re joining only as we’ve restored the exact copy to the 2nd SQL replica. The process is basically as before just with a “connect” section to authorise.
This is hopefully what you’ll see. If some database(s) don’t join because of insufficient transaction-log data with this error:
The mirror database, “[database]”, has insufficient transaction log data to preserve the log backup chain of the principal database. This may happen if a log backup from the principal database has not been taken or has not been restored on the mirror database. (Microsoft SQL Server, Error: 1478)
If you see that persistently despite being sure of a full backup & restore, try again but just backing-up and restoring the transaction log to the same database(s) this time. It worked for me anyway.
Create SQL AlwaysOn Listener
We’re almost done with SQL; our SharePoint databases are mounted and are perfectly syncing one to the other, and in fact there’s nothing stopping us from directly using the old connection-string again knowing that our changes will be mirrored to the 2nd SQL instance. However we want a more automatic failover process than that – we want an abstract instance name that’s independent of which replica is the active one, and that’s what SQL AlwaysOn listeners give us.
Right-click on the HA group “SP15-SQL” again and click “Add Listener”. This is more of a network-admin type thing but the idea is to create IPs in both subnets. We give it a name which will be the equivalent of a generic clustered end-point and a port-number to listen on. Behold:
Click OK. This has now happened:
The listener has a DNS record, computer account in AD, and the assigned IPs. Test it by pinging it. Clustering is awesome!
Add SharePoint Servers to the farm
Configure SQL alias for listener
So that’s about it for the SQL side. We now need to add our SharePoint servers to use this new “server” we’ve setup, but given it’s on a non-standard port we need to use SQL aliases to get there with a simple name. How handy we have one as we needed to fool SharePoint into thinking the SQL Server name so we’re in luck – we’ll just update the alias with the new details!
Open “cliconfg.exe” again and edit the alias. Enter the listener name, select TCP/IP, unselect “dynamically determine port” and enter the port-number we assigned for the listener.
Accept all changes. Repeat this on all SharePoint servers.
Restart SharePoint services (with updated SQL alias)
The final stage in this journey now we’ve moved everything and have updated the alias is to restart all SharePoint services on our starting server and finally add any further servers to the same configuration database necessary. Start IIS & SPTimer and SharePoint will have no idea it’s now looking at a difference SQL instance.
Connection success (again)! Continue the wizard as normal & repeat, lather and rinse for any remaining SharePoint servers.
All SharePoint cares about is that its’ databases are on a SQL Server called “SP15”. Make sure this alias is consistent for all your SharePoint servers.
Test the Failover – Will SharePoint Survive?
So now’s the important bit – the main reason for doing all this is to ensure the continued uptime of SharePoint should the SQL back-end drop unexpectedly (even failover clusters go down sometimes). Suffice to say this blog is long enough as it is; I’ll just show what happens on a manual failover and you’ll have to believe that it kicks in automatically when various nodes go offline or don’t (assuming configured correctly).
The keyword here is “test” because as we’ll see, SQL will failover perfectly to the 2nd replica but SharePoint won’t survive for reasons we’ll see in just a moment…
So once we’re convinced we have both replicas talking and syncing to each other perfectly, we’re going to failover to the 2nd replica (SQL2-SP2013\SP15B). Right-click on the HA group again and select “Failover…”
This confirms what we’re about to do is “safe” – we’re running in synchronous mode so all update transactions are always guaranteed safe. Click next, then connect to the 2nd replica with an appropriate login and start the failover.
Success! If you have the dashboard of the AlwaysOn group of the 1st replica, it might show all sorts of critical errors but it’ll straighten itself out once it works out the primary is now the secondary. Close it and open the dashboard on the previously secondary replica and it should show all OK.
Everything is now being run by our 2nd replica. Ping SP15 and you’ll see the DNS has updated too.
But SharePoint is Down!
Agghh! The farm has been floored! Nightmare!
Yes indeed. This is also why planned, regular failovers are a good idea because you can never be certain of what’ll happen on a failover until it happens. You’ll notice SharePoint may now be complaining that various key accounts are being denied login by SQL Server. Well that simple why; your farm accounts probably aren’t known by the 2nd replica endpoint.
Solution: add the accounts to SQL as per this article (http://technet.microsoft.com/en-us/library/cc678863.aspx) and check everything is working as with both replicas being used. Make sure to check:
- All web-applications (default web-app; my-sites)
- All service applications (Search; UPA; Excel Services, etc.)
If something isn’t working with one replica active but on another one it is then there’ll be a difference in permissions. Testing is vital here; make sure you run through every service & app that’ll use different credentials and ensure each one works with each replica member in the cluster being the primary.
Don’t change anything in the failover cluster manager! I’m told by my SQL friends this will render the cluster in an unsupported state as there’s all sorts of meta-data that Management Studio needs & reads which the Cluster Manager does not.
This has hopefully shown the benefits in mounting SharePoint in an AlwaysOn cluster and how that can work. Later we’ll experiment with some of the options AlwaysOn gives us and the farm configuration in this example is far from ideal, but the key stage explained here is how to mount the SharePoint databases at least.
SharePoint is increasingly becoming critical to have online; this is yet another way we can make sure it stays online!