SharePoint + SQL Server AlwaysOn: Outage Troubleshooting

So you have a SharePoint/SQL outage despite having SQL Server AlwaysOn configured, and you naturally want to know why. Event IDs 5586 and 6398 are flooding the SharePoint servers and you don’t know why. On the SharePoint servers in the application log you might see either this:

Unknown SQL Exception 10060 occurred. Additional error information from SQL Server is included below.

A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)

…or this:

Unknown SQL Exception 976 occurred. Additional error information from SQL Server is included below.

The target database, 'SharePoint_Config', is participating in an availability group and is currently not accessible for queries. Either data movement is suspended or the availability replica is not enabled for read access. To allow read-only access to this and other databases in the availability group, enable read access to one or more secondary availability replicas in the group. For more information, see the ALTER AVAILABILITY GROUP statement in SQL Server Books Online.

Either way, this will be what the users see – a non-working SharePoint site across all pages:

image

This post is all about how you can figure out why SharePoint’s databases have disappeared and how to get them back. This guide does assume that all databases are on the same SQL instance and everything is offline. To some extent this guide works for single-server installations too.

Step 1 – Identify SharePoint’s SQL Endpoint Name

SharePoint can connect to SQL via a bunch of ways; as long as the TCP traffic ends up at a SQL Server instance of some kind it’s not bothered about how it got there. For us on the other hand, we do care about how SQL traffic should arrive to SQL Server because somewhere along the chain, it’s broken. We have to figure out where.

Where is the Configuration DB?

Let’s assume all databases are out so using PowerShell or Central Admin just isn’t going to happen – we need an “offline” way of getting back-end data. On the SharePoint server open the registry editor (be careful in there) and go-to “HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\15.0\Secure\ConfigDB” and check the “dsn” key.

image

So the first question is, what really is the instance name that SharePoint is trying to connect to? An alias or real endpoint?

In my example it’s pretty obvious just from the name but most installations have their own (often quite cryptic) naming conventions that might complicate things. We’ll find out regardless by pinging the server name.

Get SQL Endpoint Name - Is this a Non-Default SQL Instance?

Does the instance name have a backslash (“\”) in it?

Yes? The SQL endpoint name is the bit before the backslash. No backslash in the instance-name? The endpoint name is the whole thing.

An example SQL instance name could be “srvSQL1\SP” – the endpoint name is “srvSQL1”.

Get Real SQL Server Name – Is Endpoint Name a SQL Alias?

Open “cliconfig” from the “run” command from the SharePoint machine. Click on “alias” tab – if you see the endpoint name listed here then SharePoint is using an alias.

image

In my example, the SQL Server name is “sp15-ao” – SharePoint isn’t connecting to a real server even.

Step 2 - Identify what exactly is the SQL Server

We need to figure out if the SQL Server is – an actual server/replica or an AlwaysOn listener.

This you’ll need to compare by either asking a SQL Server guy or figuring it out from the servers themselves.

image

Here you can see my SharePoint is clearly using a listener. If SharePoint was using a replica instead of a listener we could have a problem (see below).

Is SharePoint Using a Replica?

If by now you see SharePoint is actually using a replica directly then that is your problem. If you’re using an alias with “cliconfg” then you can just edit the alias to point at the primary replica and SharePoint should be good again.

If SharePoint is not using a SQL alias and instead are just pointing directly at a replica then you can create an alias with the same name as the replica but have the alias point at the real primary, as a temporary hack to get up & running.

If you want automatic failover you should be using an AlwaysOn listener, but at least for now you can get SharePoint online again. Update/create alias on each server setup this way & enjoy a working SharePoint again. Plan to use the listener in the future – that’s what it’s there for, so you don’t have to do these manual changes every time there’s a failover.

The end.

Is SharePoint Using an AlwaysOn Listener?

If we’ve figured out that actually the SharePoint server is using an AlwaysOn listener (even through an alias) then we need to figure out why SharePoint can’t see it.

First, run a PING & nslookup test to the SQL listener name to see what we get.

image

NSLOOKUP Returns Single IP Address

This tells us all we need to know to fix this specific problem – we can see that PING resolved the name to a different IP address to what nslookup told us (192.168.20.80 and 192.168.0.80 respectively). What does that mean? Quite simply that the DNS server 192.168.0.10 has a different IP-address (A-record) than the SharePoint server thought it had because it was cached.

Easy fix in this example: clear DNS cache with “ipconfig /flushdns”. Once done, trying pinging the endpoint again.

image

It works! And so will now SharePoint.

Why did this happen? A SQL AlwaysOn failover happened and SharePoint hadn’t caught on in my instance because the DNS local cache hasn’t updated – a simple resolution of either flushing the cache or disabling the DNS client service which does the caching.

Are There Multiple IP Addresses for NSLOOKUP?

You might be suffering from this problem.

Wrap-up

In my experience AlwaysOn is often implemented imperfectly (albeit rarely gravely so). Getting a fully automatic failover isn’t easy but it is possible. Hopefully this post will have helped highlight some of the challenges in enabling fully automatic SQL Server failovers with SharePoint 2013.

Cheers,

Sam Betts