Why is Multi-Subnetting Awesome?
Having a SharePoint farm span multiple subnets can improve its reliability against networking failures. Reasons why a network could fail are mentioned below; they are just one more reason SharePoint can go offline, and removing them as a factor is the focus of this post. We’ll describe the idea first, then there’s a demonstration below.
When SharePoint falls over it’s not a pretty sight – more & more SharePoint is becoming an application platform on top of collaboration and data-storage and therefore it’s becoming increasingly important it does not keel-over unexpectedly.
Many farm outages occur when the transition from web-front-end to app-server isn’t possible for whatever reason, but many others are caused by breakdowns somewhere in the chain of communication from client to web-front-end “server” (whichever server that may be).
It’s possible of course to make the “web-app to service-app” transition more resilient to failure by providing more endpoints in case one dies for service-app redundancy – this is nothing new for anyone familiar with SharePoint farms.
What I really want to explore is extending this thinking into a networking paradigm by stretching our single SharePoint farm into multiple subnets with service failovers in each subnet. This way any one network (and servers inside) could go down and our SharePoint farm could live on anyway – front-end and back-end. If you like, this post is “network considerations for high-availability SharePoint farms”, or rather how to leverage networking tricks to allow failing services to have a lesser impact. Services will always fail, the only question is our ability to react to that happening.
So let’s step-back a second and work out…
When do SharePoint Outages Occur?
Let’s rephrase that question slightly – what’s the minimum that SharePoint needs in order to give successful responses to a client?
- A working network for SharePoint + relying services to work together
- A web-front-end.
- A server that has all the services that the user web-requests require.
- SQL Server.
- Active Directory.
- Connectivity & routing to the SharePoint network from the requesting client
  - Routing is a chain – if any parts break SharePoint won’t respond.
As you can see, the network is key to all of this playing nicely. If any one of the above fails, you have an outage and if you have an outage you have many hacked-off users, bosses, etc. We’re focussing on how to be able to survive a network outage specifically, or at least keep things usable until we can restore full service.
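The "any one failure is an outage" point can be put as a tiny model. This is only a sketch of the argument, not anything SharePoint exposes: the component names and the `farm_is_up` / `failed_links` helpers are made up for illustration.

```python
# Hypothetical component names taken from the checklist above.
REQUIRED = (
    "network",
    "web_front_end",
    "app_server",
    "sql",
    "active_directory",
    "client_route",
)


def farm_is_up(status):
    """The farm answers requests only if every required component is healthy."""
    return all(status.get(name, False) for name in REQUIRED)


def failed_links(status):
    """List the required components that are currently down (or unreported)."""
    return [name for name in REQUIRED if not status.get(name, False)]


if __name__ == "__main__":
    status = {name: True for name in REQUIRED}
    status["client_route"] = False  # one broken link in the chain
    print(farm_is_up(status), failed_links(status))
```

With a single subnet every entry in that list has exactly one instance, which is why one broken link is enough to take the whole thing down.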
When do Network Outages Occur?
- Routing failure to/from (network inaccessible even if working internally).
- Network/subnet flooding:
  - Bad driver/broken network-card
  - Broadcasting is contained to subnets
- Cable interference & site disruption:
  - Broken or damaged cables
  - Unplugged or failing router(s) or switch(es)
Any one of those events would plunge a SharePoint farm into murky darkness if it happens to be in just one subnet or network, and most network engineers would tell you they’re not as uncommon as you’d think. Subnets are a nice safety barrier for networking screw-ups – a fallout in one probably won’t affect the other, and we want to harness this to make sure our SharePoint farm stays online for longer.
A Typical SharePoint Network
This is a typical, basic network setup for SharePoint. A client from another location entirely connects over TCP/IP and HTTP to SharePoint and the web-front-end that gets the request responds calling whatever local network services it needs to formulate a complete response.
Blue arrows show HTTP traffic; orange is SQL, and green is Active Directory traffic. Any break in any one arrow is an outage to the whole system. And this is a bad thing, of course.
We know how little it takes for a failure to occur somewhere, and there are several failure points in this simple diagram, but in this post I want to focus on just networking for now, so here’s how we can ride out a network outage…
A Multi-Subnet SharePoint Network
This is a far superior solution for its redundancy capacity as now without impacting SharePoint we could suffer the following outages (aside from the usual AD/app-server failovers):
- Route from site A → sites B or C. Assuming sites B & C can communicate this won’t impact application servers, only web-front-ends.
- Entire network outage on either B or C (see below for diagram).
Of course, we’ve not increased redundancy just because of the multiple subnet factor; the design of the above farm also gives x2 AD and SharePoint application servers, which is what makes the multiple subnet design worth it. AD calls will use the server in the same subnet (depending on config) but WFE1 will at some point invoke APP2, for example, as SharePoint isn’t site-aware like AD, so internal calls are done on a round-robin basis. SQL, being a failover cluster, will use one subnet address or another depending on where the failover nodes are located, but the idea is to have at least one failover node on each subnet with local routing between sites for inter-site traffic.
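To make the round-robin idea concrete, here is a minimal sketch of rotating across application servers and skipping any that don’t respond. It is an illustration only and makes no claim about how SharePoint implements its internal load balancing; the `RoundRobin` class and the `is_reachable` callback are hypothetical.

```python
from itertools import cycle


class RoundRobin:
    """Rotate across endpoints, skipping any that are currently unreachable."""

    def __init__(self, endpoints):
        self._endpoints = list(endpoints)
        self._ring = cycle(self._endpoints)

    def pick(self, is_reachable):
        """Return the next reachable endpoint, trying each one at most once."""
        for _ in range(len(self._endpoints)):
            endpoint = next(self._ring)
            if is_reachable(endpoint):
                return endpoint
        raise ConnectionError("no application server reachable")


if __name__ == "__main__":
    balancer = RoundRobin(["APP1", "APP2"])
    subnet_1_down = lambda name: name != "APP1"  # pretend APP1's subnet died
    print([balancer.pick(subnet_1_down) for _ in range(3)])
```

The point is simply that with an app-server in each subnet there is always somewhere for the rotation to land when one subnet goes dark.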
Regardless, we can now have a whole bunch of things die on us and SharePoint will carry on living (subnet 1 has died):
Improving the route-failure resiliency from client to SharePoint WFE is only as good as the chain to & from the client. Take our example in more detail:
If we have a routing/network failure in location X we’re doomed. If however Z or Y goes down (taking with it B or C respectively) we’re fine. The further up the chain we can split the routes the better, and this depends purely on the IP allocation used. Obviously if the client’s ISP goes down then there’s nothing we can do (nor would we have to – everyone else should be able to access the site no problem) but if our subnets’ routes only split at the data-centre then the multiple subnets become correspondingly less advantageous.
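The same reasoning as a small sketch: each site is reachable only if every hop on its route is up. The route layout here (X shared by both routes, then Z leading to B and Y leading to C) is my reading of the description above, and `ROUTES` and `reachable_sites` are made-up names.

```python
# Hops the client's traffic crosses to reach each SharePoint site (assumed layout).
ROUTES = {
    "B": ("X", "Z"),
    "C": ("X", "Y"),
}


def reachable_sites(down):
    """Return the sites whose whole route avoids every failed location."""
    return sorted(
        site
        for site, hops in ROUTES.items()
        if not any(hop in down for hop in hops)
    )


if __name__ == "__main__":
    print(reachable_sites({"Z"}))  # losing Z only costs us site B
    print(reachable_sites({"X"}))  # losing the shared hop costs us everything
```

Every hop shared by both routes is a single point of failure, which is why splitting the routes as far up the chain as possible matters.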
Subnet Failover in Action
I’ve tested this on SP2010 just because my test environment doesn’t have the memory for a HA SharePoint setup in 2013. This is the same setup as the above diagram; x2 SharePoint subnets, web-front-ends & app-servers with each subnet containing an app + WFE + AD server.
Currently the client machine is configured to point at the web-front-end on subnet 1. We’re going to kill the router on subnet 1 so everything on that network goes dark, manually update the DNS for the client to point at subnet 2 instead (this can be automated of course), then watch to see how SharePoint survives.
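As a rough idea of what automating that DNS switch could look like, here is a sketch that probes each front-end over TCP and hands the first one that answers to a DNS-update step. It wasn’t used in this test; the addresses in the example and the `update_dns` callback are placeholders, and how the record actually gets changed depends entirely on your DNS setup.

```python
import socket


def first_reachable(candidates, timeout=1.0):
    """Return the first (host, port) pair that accepts a TCP connection, or None."""
    for host, port in candidates:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return (host, port)
        except OSError:
            continue  # refused, unreachable, or timed out: try the next subnet
    return None


def fail_over(candidates, update_dns):
    """Point DNS at the first healthy front-end.

    update_dns is a hypothetical callback that receives the chosen host.
    """
    target = first_reachable(candidates)
    if target is not None:
        update_dns(target[0])
    return target


if __name__ == "__main__":
    # Placeholder addresses for the front-ends on subnet 1 and subnet 2.
    front_ends = [("10.0.1.10", 80), ("10.0.2.10", 80)]
    print(fail_over(front_ends, lambda host: print("would repoint DNS to", host)))
```

A TCP connect only proves the port is open, not that SharePoint is serving pages, so a real check would probably request a page as well.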
Of course we have SharePoint configured so that both app-servers mirror each other for services-on-farm. Here’s the search topology – notice the query component in particular is set up to fail over to the 2nd subnet app-server:
Here’s a search page working with all machines up:
Both WFEs online, our host-header using WFE1, and search working perfectly (WFE1 and our host-name “SP14” use the same IP):
Now to cause an outage: break router from client network to WFE1 network.
Sure enough, the client can now only see one WFE (notice SP14/WFE1 aren’t responding to the ping request anymore):
When we detect the error we change the DNS for the SharePoint URL (manually in this example) – “SP14” responds to ping:
Now we load the site. With the WFE alone obviously the content will load but the question is whether the depending service-apps will too as just web-front-ends working is a long way from having an operational SharePoint. Let’s test the search functionality:
Everything’s working as expected – the WFE has picked up the 2nd application server for search and queried there as appropriate. If there’d been a problem we’d be looking at an error right now; restarting IIS would show an error if the restart happened after the outage (see below for why).
SharePoint will complain there’s an issue with a server, but will carry on regardless. It’s not supposed to be beautiful, just work well enough until you get the rest of the farm back online.
- Big one for search: if the new web-front-end server hasn’t loaded the application-pool for the website exposed to the users AND the search admin component was on the inaccessible server, the search service will fail even though a mirror has been configured. This is because the WFE needs the search admin component to get topology information to then know that there’s a local failover mirror. If the server’s sat around doing nothing with the SharePoint content-app not loaded your search service will never survive for traffic sent to that server.
- On that note, if the only accessible app-server(s) aren’t running the services you need then obviously that’ll fail too. This basically means having duplicate servers & services for everything that’s critical.
- Per subnet, it’s probably more realistic to have the destination IP set as a network-load-balancer rather than a single web-front-end. It’s done this way here for simplicity.
- This depends on the SQL cluster failing over to subnet 2 successfully. It’s possible to have the cluster work out there’s been a network outage and fail over automatically, but however it’s triggered, the failover has to happen once a subnet failure has occurred.
- AD too needs to be working, and each server should have a local AD server as its %logonserver% environment variable. No accessible AD, no SharePoint. Same with SQL.
- In the search administration page you’ll notice various components say “Online” even when the affected server has died. Think of it more like “Could be online” – meaning you’ve not taken it down for maintenance 🙂
DNS & Supportability Issues – Important!
The whole theory described in this post relies on the client knowing the IP address for the SharePoint site has changed. This is a DNS issue and I’m leaving it out of this discussion for now, but needless to say a DNS update is critical to clients being able to locate the new web-front-end(s).
Finally, for a SharePoint farm to be officially supported the latency between machines needs to be 1 millisecond or less. Don’t save money on hardware in other words 🙂
More info on supported network response times here – http://technet.microsoft.com/en-us/library/cc262485(v=office.15).aspx#hwLocServers
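If you want a quick feel for where you stand, timing a TCP connect between two farm machines gives a rough upper bound. This is only a sketch: connect time is a proxy rather than the measurement the support statement is written against, and `connect_time_ms` is a made-up helper.

```python
import socket
import statistics
import time


def connect_time_ms(host, port, samples=5, timeout=1.0):
    """Median time in milliseconds to open a TCP connection to host:port."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            pass  # connection opened; close it straight away
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)
```

Run it from one server against a port another server is listening on (SQL’s port from a WFE, say) and compare the result against the 1 millisecond figure.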
This hopefully showed how it’s possible to avoid some extra outage scenarios by designing your farm into a network-spanning zombie-farm that just won’t die. Well, never say never but you can at least make it less likely by providing redundancy this way. When downtime isn’t a possibility and there’s the investment to make it work, this is something that should be considered.