A key part of the SharePoint high-availability puzzle is a healthy and highly available Active Directory; if AD dies then so does your SharePoint farm, broadly speaking. More specifically, though, that opening statement should really say “each and every SharePoint server needs AD failover” for the whole thing to work. If just one server cannot access AD for whatever reason, then whatever roles that server has will also fail, which in itself can lead to a farm outage.
The good news is that Active Directory has been designed with high availability in mind for literally decades now, so the Windows platform is pretty good at not suffering too much from a single server outage. No two “apps” are guaranteed to behave the same when a Domain Controller (DC) drops off the IT mortal coil, though, so this article shows how SharePoint specifically carries on living even when its AD server suddenly disappears. We’ll assume from the get-go that there’s more than one AD server, as turning off the only AD server will obviously kill just about everything with no failover possible; the point here is to show SharePoint failing over to another AD server so we can see what happens and that it’s possible.
We’re also going to assume that DNS will always be available. Without DNS, AD connectivity will die too, as DNS is used to look up AD servers, so we use a primary & secondary DNS server, which also happen to be the DC1 & DC2 servers.
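For the curious, the DNS lookup in question is an SRV query. Here’s a minimal Python sketch of the record names a Windows client asks for when locating a DC, site-specific first and domain-wide second. The domain corp.example.com and the site name Site-A are made up for illustration, as is the helper function name.

```python
from typing import List, Optional


def dc_locator_srv_names(domain: str, site: Optional[str] = None) -> List[str]:
    """Return the DNS SRV names used to find DCs, most specific first.

    `domain` and `site` are whatever your AD uses; the values in the
    example below are hypothetical.
    """
    domain = domain.strip(".").lower()
    names = []
    if site:
        # DCs registered for a particular AD site (the "local" DCs).
        names.append(f"_ldap._tcp.{site}._sites.dc._msdcs.{domain}")
    # Every DC in the domain, regardless of site.
    names.append(f"_ldap._tcp.dc._msdcs.{domain}")
    return names


if __name__ == "__main__":
    for name in dc_locator_srv_names("corp.example.com", "Site-A"):
        print(name)
```

You can check either name by hand from a SharePoint server with nslookup -type=SRV followed by the record name; if both DCs come back, DNS is doing its part.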
The primary and secondary DNS settings on each SharePoint server are important, as they should make sure we can locate another DC without any problem. We’ll test a few things: browsing with NTLM, browsing with Kerberos, searching, and My Sites.
Test AD Failover with Working DNS
Here’s the setup: the SharePoint server has two DNS & AD servers, one in each subnet and Active Directory “site”, with the local subnet/site DNS server being preferred.
All being well, we should fail over entirely to DC2 for DNS first and then for AD too, once we find the new DC from DNS…
Let’s “accidentally” knock DC1 offline…
That should do it – DC1 will no longer be in service, to the surprise of everything else on the network. Worry not though; we have DC2 raring to take over when necessary.
The failover worked just fine: aside from a delay, the page loaded in the end. SharePoint tries to contact DC1, can’t, and so moves on to another DC. Have a look at the network trace:
Red shows where SharePoint is trying to contact its “alive-just-a-minute-ago” local DC (i.e. the DC in the same AD “site”) but gets no response to its ARP requests. In blue we see the SharePoint machine query DNS for another DC, and the DNS server actually returns a DC other than itself: DC3. Once that DNS query works, SharePoint continues talking to AD but via DC3 instead (in green), and the user only has to wait until this transaction is complete before the page is rendered.
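The “try the local DC, give up, ask for another” pattern in that trace can be sketched in a few lines of Python. To be clear, this is only an illustration of the idea and not how Netlogon’s DC locator is implemented; the function names are hypothetical and the probe is a plain TCP connect to the LDAP port (389).

```python
import socket
from typing import Callable, Iterable, Optional


def ldap_port_open(host: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the LDAP port (389) succeeds."""
    try:
        with socket.create_connection((host, 389), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name-resolution failures.
        return False


def pick_dc(
    same_site: Iterable[str],
    other_sites: Iterable[str],
    probe: Callable[[str], bool] = ldap_port_open,
) -> Optional[str]:
    """Try same-site DCs first, then the rest; return the first that answers."""
    for host in list(same_site) + list(other_sites):
        if probe(host):
            return host
    return None  # no DC reachable at all
```

Each dead DC costs one timeout before the next candidate is tried, which is the same kind of delay the user sits through while the page loads.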
Interestingly, when DC1 comes back online SharePoint switches back, slowly but surely. That makes sense, as the SharePoint box and DC1 are in the same site, so DC1 is going to be preferred for AD traffic. Importantly though, SharePoint lives on and nobody notices (hopefully) that we’ve actually failed over to another DC because our normal DC died unexpectedly.
Errors Logged on SharePoint Servers
It’s possible you might see errors while the netlogon service on the SharePoint server figures out that Mr Favourite DC has disappeared from the radar. There’s a wait period while the “client” machine (SharePoint in this case) tries to locate its DC and then picks another when it doesn’t respond, but that’s it. To get a good idea of what’s going on under the hood you need to enable netlogon logging, which will give you all the detail you’d need about what calls failed & when, and what netlogon did about it.
That said, user profile imports might fail if you’ve pinned them to a specific DC for security/performance reasons, but other than that, general use should continue after 5-10 seconds of waiting.
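Netlogon logging is switched on with nltest and its /dbflag option, with the output going to %windir%\debug\netlogon.log. A small Python wrapper, assuming you run it from an elevated prompt on the SharePoint server (the function names here are my own):

```python
import subprocess
from typing import List

# Where the Netlogon service writes its debug log.
NETLOGON_LOG = r"%windir%\debug\netlogon.log"


def netlogon_dbflag_command(enable: bool) -> List[str]:
    """Build the nltest command that turns verbose Netlogon logging on or off."""
    flag = "0x2080ffff" if enable else "0x0"
    return ["nltest", f"/dbflag:{flag}"]


def set_netlogon_logging(enable: bool) -> None:
    """Run the command; raises CalledProcessError if nltest reports failure."""
    subprocess.run(netlogon_dbflag_command(enable), check=True)


if __name__ == "__main__":
    print(" ".join(netlogon_dbflag_command(True)))
```

Remember to turn it off again afterwards, as the log is chatty. On older Windows versions you may also need to restart the netlogon service before the change takes effect.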
Ensure Failover for All SharePoint Servers
To re-emphasise, this failover setup needs to be possible for all servers in the farm, or at least for enough servers that core services don’t die.
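One quick sanity check is to list each server’s configured DNS servers and flag any box that only has one. A rough Python sketch, assuming you’ve already gathered the settings into a dictionary by whatever means you like; the server names and addresses below are hypothetical.

```python
from typing import Dict, List


def servers_without_dns_failover(dns_by_server: Dict[str, List[str]]) -> List[str]:
    """Return the servers that have fewer than two distinct DNS servers set."""
    return sorted(
        name for name, dns in dns_by_server.items() if len(set(dns)) < 2
    )


if __name__ == "__main__":
    # Hypothetical farm: server name -> configured DNS server addresses.
    farm = {
        "SP-WFE1": ["10.0.1.10", "10.0.2.10"],
        "SP-APP1": ["10.0.1.10"],
        "SP-SQL1": ["10.0.1.10", "10.0.1.10"],  # same server listed twice
    }
    print(servers_without_dns_failover(farm))
```

Anything this returns is a server that will lose DNS, and with it AD, the moment its one DNS server goes away.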
On an abstract level, this is what we want to achieve although I’d recommend keeping DC1 & DC2 in separate subnets for all sorts of reasons.
…because with this setup, this outage won’t matter:
The farm will live on, assuming DC2 can handle the extra load of being the only DC in the forest. That’s it! Another tool in the high-availability toolbox for SharePoint.
// Sam Betts