Those of us that have even the most passing knowledge of SharePoint will know that it needs a SQL Server instance to work at all. If SQL goes down it takes SharePoint with it which is why it’s quite common to have SQL clustered to make sure SQL stays alive at all costs. I’ve covered this before already but recently I had an issue where my multi-subnet SharePoint network had SQL timeout errors in one subnet and I finally discovered why; it was due to the multi-subnet SQL cluster the farm was connecting to.
Update: The preferred solution for this problem in 2013 is now setting SPDatabase.MultiSubnetFailover=true – this post describes the SQL-side solution if the SharePoint configuration isn’t possible.
Background: Multi-Subnet SQL Clusters – How they Work
One of the newer configurations that SQL Server in a cluster supports is a multi-subnet instance name with an IP address in each subnet, with just one IP address active at any point in time.
Multiple IP addresses for a DNS entry (multiple “A” records) is something most geeks that haven’t had much experience in networking tend to not be used to – which IP address is the right one? If I “ping” the name, why does Windows pick one IP over the others? It’s a fairly new idea on the networking scene and how this all works is a bit complicated in short.
To see all “A” records for a name, nslookup is our friend:
Notice there’s an address in 192.168.0.0/24 and 192.168.10.0/24.
This allows us to failover to another subnet entirely, and assuming our network routing is setup ok we should still be able to connect to the same instance – just using an IP address in another/the-other network. Clever, really; this was only doable before with VLANs and all sorts of networking pain.
The Multiple A-Records “Problem”
These days if a SQL client knows its host is multi-subnet it can try connecting in parallel to each IP address for the name at once; using whatever endpoint responds first to talk to the SQL instance. However, to maintain compatibility (I’m guessing), when SQL clients connect by default they’ll stick to the old way of connecting; the newest & shiniest SQL clients included unless you implicitly say “connect using the super-awesome multi-subnet method”.
The old way is basically “give me a list of IPs for this name” for which the client will then try and connect to each IP address one-by-one, moving on the next one only when the previous IP address timed-out. This is bad if say your SQL instance spans 3 subnets; if IP address 3 is the active one and is tried last then you can almost guarantee that the SQL timeout value will be hit while trying to connect to at least one inactive IP address, and the SQL client will subsequently report back to the application in question that there’s no-one listening on the end. Service failure, in other words and this is what SharePoint will suffer from unless the cluster is configured correctly.
In SharePoint you’ll start to see these sorts of errors:
Skip forward to 2012 & as mentioned it’s possible to have the SQL client try all X IP-addresses at once, using whichever one responds first to avoid this timeout issue. The problem is though you have to change the connection-string to enable this, which in SharePoint land is very much a no-no, meaning SharePoint just can’t take advantage of this.
Not to despair though; we can still use SharePoint on multi-subnet clusters, just we have to change the default cluster behaviour to not register X IP-addresses with DNS “A” records for each address – we can make it just update the one “A” record instead. This will suit SharePoint just fine, even if we’re now at the mercy of a DNS update from being able to see a failover.
Configuring Multi-Subnet SQL Clusters for Single “A” Records
It’s a fairly simple fix in principal; just update the cluster network name object to disable multiple “A-record” for that name. To do this just run:
Get-ClusterResource “[Network Name]” | Set-ClusterParameter RegisterAllProvidersIP 0
…changing “[Network Name]” for the cluster network name resource in the clustered role (“SQL Network name (SQL-SP15A)” in my example).
Don’t know what the network name is? Run “Get-ClusterResource” to get all resources – there’ll be at least two; one for the cluster root (ignore) and another for the AlwaysOn listener name. Make sure you get the right one!
After applying the change & restarting the clustered-role, the cluster manager will just update the one A-record to the active node instead of creating multiple records, and SharePoint will cope much better for databases that don’t have the option to set SPDatabase.MultiSubnetFailover = true (the configuration database for example).
Check the parameter was applied by running:
Get-ClusterResource “[Network Name]” | Get-ClusterParameter
More info on DNS registration & clusters – http://blogs.msdn.com/b/clustering/archive/2009/07/17/9836756.aspx
After restarting the cluster-name-object (basically restarting the role) & cleaning up all “A” records manually (clean-up isn’t done automatically) we can see our old A-records are still in DNS so we’ll need to delete those manually.
Once gone they’ll now not return and SharePoint will just use the one IP address for SQL Server. There is a risk of course that the A-record update won’t occur quick enough on failover for SharePoint to see – that’s the whole point of having multiple A-records, that you know in advance of which instances there’ll be available if you need to failover to another but again, this just isn’t possible yet in SharePoint. Maybe sometime soon it will be.
That’s it – happy SharePointing!