The “High-Availability SharePoint” Bible

04/17/2014

A big interest of mine is designing SharePoint farms to be highly-available through good architecture & solid design; something I’ve posted about quite a bit on this blog over time. This article summarises the high-availability strategies available for SharePoint and then touches on other common areas that cause SharePoint farms to fail, as a sort of grand wrap-up to the whole HA-SP series so far.

In reality, the question of “how can we avoid downtime” comes down to how much you care about users having no or limited access or, put more simply, how much money are you willing to spend to make sure things run smoothly?

Has your SharePoint farm gone down before? This is your guide to making sure it doesn’t happen again. If your farm has gone down once then it’s probably because, in the nicest possible way, your organisation or company just didn’t invest enough in keeping it online, and there was almost certainly something that could’ve been done to stop a failure from becoming a fatal blow to SharePoint running normally.

In short, keeping SharePoint online means designing a fault-tolerant architecture, coding customisations & apps in a well designed and tested manner, and implementing good SharePoint governance. First though, the architecture…

A High-Availability Architecture for SharePoint

So as previously mentioned in my blog, some of the tools & tricks in the high-availability SharePoint toolbox are:

- First & foremost: have a Disaster Recovery site ready should you need to take the primary farm offline for any reason, farm patching included. DR sites are crucial to maintaining high uptime, they just are. This isn’t a high-availability practice so much as a 2nd logical instance of SharePoint in case the 1st farm dies, but still, it’s very important if you’re serious about keeping users online.
- SharePoint depends on SQL Server, so making sure it’s always available by implementing failover clustering and/or AlwaysOn Availability Groups is another huge boost.
- Service-application and web-application redundancy (web-app redundancy is done by Network Load Balancing, which is outside the scope of SharePoint core). Related to that, federating service-applications is a great way of reducing internal service dependencies.
- Set up Active Directory so it can fail over should an AD server roll over and die. DNS redundancy is key to high-availability too.
- Multiple subnets for your SharePoint farm servers to avoid network failures, combined with multiple routes between subnets if possible. This is also known as a stretched SharePoint farm.
- Reverse-proxies for public SharePoint URLs: make sure that if something dies because of a denial-of-service attack, it’s not SharePoint.

The principal point of doing all of these is to make your SharePoint farm able to handle a failure of any one dependent service or resource. SQL failures, AD outages, even SharePoint server failures can be automatically worked around if the architecture is well designed – a key goal of any high-availability design.
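To make the “failure of any one service” goal concrete, here is a minimal Python sketch that flags roles served by only one machine. The role and server names are made up for illustration, and a real review would obviously look at far more than a server count.

```python
# Hypothetical topology: role -> servers currently providing it.
# All role and server names here are invented for illustration.
def single_points_of_failure(topology, minimum=2):
    """Return the roles that would go down if any one server failed."""
    return sorted(
        role for role, servers in topology.items()
        if len(set(servers)) < minimum
    )


farm = {
    "web-front-end": ["WFE01", "WFE02"],
    "search": ["APP01", "APP02"],
    "sql": ["SQL01"],                    # no failover partner yet
    "domain-controller": ["DC01", "DC02"],
    "dns": ["DC01"],                     # only one DNS server
}

print(single_points_of_failure(farm))  # ['dns', 'sql']
```

Anything the function returns is a candidate for the redundancy measures listed above.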

Hardware & Performance

Sometimes performance in a SharePoint application can be so slow that it’s no different from a complete system outage, at which point we often get a call to help out. Here’s how you can avoid that awkward situation:

Performance and capacity planning:
- How do you know what your maximum throughput is? What tests have you done to prove maximum throughput?
  - Does this take data growth into account? What’s the expected data growth rate?
  - What queries & code routines did your stress-test involve and invoke?
- How many users are you expecting at any one time?
- What type of load-balancing is involved? Are we using “sticky sessions” at all? This can impact caching, particularly.
- Hardware: is your topology big enough to support all the users? Is SQL Server capable of servicing all the SharePoint machines in the farm under high load?
Caching strategies:
- What caches are in use & active in SharePoint? If the answer is anything less than a very certain “blob + output + object caches at least” then you need to make sure all caches are fully configured (which some aren’t by default) and in active use. If you can’t prove they’re in active use then they’re probably not, and you’ll probably see SQL Server struggling to keep up as there’ll be lots more trips to the databases than normal.
Search & content-crawls:
- What servers are being used for crawls? What impact is it having on the whole system?
- How many search queries are expected per minute/second?
System bottlenecks:
- Which server(s) are running the slowest or working the hardest? Where are the bottlenecks? A chain is only as strong as its weakest link and all that.
- Learn how to use the Windows Performance Monitor. It’s golden for working out which components are slowing down the whole system.
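
As a rough illustration of the bottleneck question, the Python sketch below ranks servers by their average counter value. The server names and readings are hypothetical; in practice the numbers would come from counters you’ve collected with Performance Monitor, and CPU is only one of several counters worth comparing.

```python
from statistics import mean


def rank_by_load(samples):
    """Sort servers from busiest to least busy by average counter value."""
    averages = {
        server: mean(values)
        for server, values in samples.items()
        if values  # skip servers with no readings
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical "% Processor Time" readings per server.
cpu_samples = {
    "WFE01": [35.0, 42.0, 38.0],
    "WFE02": [33.0, 40.0, 41.0],
    "SQL01": [88.0, 93.0, 95.0],
}

for server, average in rank_by_load(cpu_samples):
    print(f"{server}: {average:.1f}% average CPU")
```

With these made-up numbers SQL01 comes out on top, which is the sort of hint that tells you where to dig next.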

Performance troubleshooting is a highly complex game and rarely ever highlights any single cause for slowness. Proactive planning is your strongest ally when it comes to making sure your SharePoint installation will cope.

A Problem of Performance and Customisations

Aside from the above risks, other farm failures are often caused by customisations to the out-of-the-box product. SharePoint supports customisations, albeit with disclaimers for performance, because it’s very common for SharePoint to get the blame for someone else’s bad coding over which we have no control. But that aside, application performance targets are a complex issue and rely on lots of factors and considerations, including:

- You should be getting 15-20 thousand users per web-front-end server, or around 50 concurrent HTTP connections at any one time. If you’re not hitting these statistics or even getting close then you need a serious review of why that is. Pro-tip: it’s probably because of your code – use the developer dashboard to figure out what.
- “Hot pages” – which pages are executing the most code/queries? Is one of the hot pages perchance the homepage?
- Memory management – does your code use the APIs correctly, disposing of objects where it should (and not disposing of them where the SPObjects are not supposed to be cleaned up)? Is SharePoint complaining about incorrect API usage (it will if it’s being used incorrectly)?
- Data growth – your awesome query may have worked in your dev environment where there are only 2 items returned, but how have you planned for ever-growing data? SharePoint starts struggling hugely with performance if your query returns over 5000 items.
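
To put a number on the data-growth question, here is a small Python sketch that estimates how many months of compound growth it takes for an item count to pass the 5000 mark mentioned above. The starting count and growth rate are invented; plug in your own figures.

```python
def months_until(current_items, monthly_growth_rate, threshold=5000):
    """Months of compound growth before the item count exceeds the threshold.

    Returns 0 if the count is already over the threshold, or None if the
    count never grows (so the threshold is never reached).
    """
    if current_items > threshold:
        return 0
    if current_items <= 0 or monthly_growth_rate <= 0:
        return None
    months = 0
    count = float(current_items)
    while count <= threshold:
        count *= 1 + monthly_growth_rate
        months += 1
    return months


# Assumed figures: 1200 items today, growing 10% a month.
print(months_until(1200, 0.10))  # 15
```

A query that looks fine today on those assumptions has a little over a year before it starts returning more than 5000 items.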

I’ll say it again because it’s important – most customer performance issues we deal with at Microsoft have 3rd-party code involved in the problem at least, if not directly at fault. Custom code is a big risk if not planned well.

System Patching and Updates

The other thing that can unfortunately kill a SharePoint farm is the monthly patches for Windows, .NET and SharePoint alike. Very occasionally a patch of some kind has a negative impact on a farm in some way, so patches really should be run through a staging/testing environment before being deployed in production, with some testing done on the application too. If you use some kind of patch control system, such as Windows Server Update Services or System Center, this task is fairly trivial to control.

Good patching practice is a subject all on its own but in short, make sure you’re not updating production, even with Windows updates, unless you’ve first checked them on a test environment and tested that the site still loads, searches still work, users can still update content, and so on.
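
The “check it on a test environment first” routine can be as simple as a handful of smoke tests. The Python sketch below shows one way to structure them; the checks are stand-ins (one is deliberately broken), and real ones would request pages from your test farm, run a search and save a test item.

```python
def run_smoke_tests(checks):
    """Run each named check; it passes only if it returns a truthy value
    without raising an exception."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def safe_to_promote(results):
    """A patch moves on to production only if every check passed."""
    return bool(results) and all(results.values())


# Stand-in checks for illustration only.
checks = {
    "site loads": lambda: True,
    "search works": lambda: True,
    "user can update an item": lambda: 1 / 0 > 0,  # simulates a broken feature
}

results = run_smoke_tests(checks)
print(results)
print("promote to production:", safe_to_promote(results))
```

Here the third check fails, so the patch stays in the test environment until someone works out why.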

Bad patches are rare but they occasionally slip through the cracks, as with any other vendor really; it’s ultimately your responsibility though to make sure your systems don’t break when they’re installed. Of course, even riskier is allowing your SharePoint farm to go unpatched but I digress.

A Question of Governance

The other thing worth mentioning is how well organised the operations of the farm are, specifically how well patching, archiving, application deployment, and general administration are planned. People with no processes in place tend to be the ones that go offline lots; SharePoint, like any system of any complexity, needs regular care & attention to keep it well-oiled and running smoothly, so make sure the roles & responsibilities for maintaining a healthy farm are clearly defined.

We take quality control very seriously at Microsoft but the SharePoint software stack is very complicated indeed; from Windows platform/IIS core to ASP.NET, to browser compatibility – everything has to work flawlessly together, so patching one element of that huge tech sandwich can occasionally have unforeseen consequences. A basic testing strategy mitigates this risk almost entirely and really should be in place if you’re serious about keeping the farm online.

Wrap-Up

That’s it! If you follow all of these rules you should never have any problems running your farm and rolling out applications. Running a SharePoint farm with applications on top is a tough job; there’s lots to remember, plan for, and think about, but hopefully this has given a good quick overview at least. If there’s any interest in looking deeper at any particular area please let me know in the comments.

Cheers,

// Sam Betts