I continue to hear questions and debate over how to build local or regional immunity into a single SharePoint farm. Enterprising SharePoint folks want to make sure that their SharePoint service remains online even if Dr. Evil fires the “laser” at their primary datacenter. While you probably don’t have to worry about Dr. Evil and his laser, you should definitely plan for scenarios where part or all of your primary SharePoint deployment is destroyed. Fortunately, SharePoint supports some great solutions to help you out. Let’s boil those solutions down to the basics.
First of all, have you seen the newest SharePoint database mirroring whitepaper? While it’s a little light on the end-to-end scenarios, it is very relevant to this conversation. Another great resource is the TechNet article on Optimizing MOSS Deployments for WAN environments. The supported scenarios are as follows:
Single SharePoint farm straddling two very close datacenters.
Two SharePoint farms in two separate datacenters of (nearly) any distance.
Let’s start with the former. In this scenario a single SharePoint farm consisting of two or more web and application servers (query, Excel, central administration, etc.) is deployed across two datacenters located within a very short distance of each other. Each application role should be represented in both locations, and a load balancing mechanism should be used to direct users to the proper web server on either side (or perhaps both). Finally, every SharePoint database needs to be replicated between two SQL servers, one on each side, using a synchronous replication solution such as highly available (synchronous) SQL database mirroring. You can also use a hardware replication solution, but make sure the vendor can guarantee that the entire dataset (config, SSP, admin, search, and content) is consistent.
So what constitutes a very *short* distance? Having 1 millisecond or less of latency (the key constraint) and enough *available* bandwidth to provide LAN-like performance. I would consider that 1Gbps or better, though the official guidance leaves that a little less defined. If you read my post on Mirroring and Bandwidth you can see why I recommend 1Gbps or better. Bandwidth is cheap, so I hope that isn’t a barrier for most folks. However, the latency thing is tricky. How do you get less than 1ms in latency? Simple physics suggests that at the speed of light (about 186 miles per millisecond) it should take about 1ms for your signal to travel 93 miles and back, as measured by a simple ping.exe test (RTT). The problem, however, is that at short distances the speed of light isn’t really the bottleneck. There’s a lot of overhead associated with networking, especially when it comes to making decisions about routing. Those routing decision clock cycles eat up precious time. Multiply by the number of routers between your two deployments and you can start to understand the problem. Practically, 1ms of latency probably means less than 10 miles apart, and only under the best conditions (read: fewer routed hops).
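To put rough numbers on that, here is a minimal Python sketch of the latency budget. It is a back-of-the-envelope model, not a measurement tool: the per-hop delay of 0.1ms and the fiber velocity factor of 0.67 are assumptions I picked for illustration, and the function names are made up for this post. Measure your own links with ping before trusting any of it.

```python
# Back-of-the-envelope latency budget for a straddled farm.
# Assumed values (per-hop delay, velocity factor) are illustrative only.
SPEED_OF_LIGHT_MILES_PER_MS = 186.282  # in a vacuum, about 186,282 miles per second


def round_trip_ms(distance_miles, hops=0, per_hop_ms=0.0, velocity_factor=1.0):
    """Estimated ping RTT: propagation both ways plus per-hop delay both ways."""
    if not 0 < velocity_factor <= 1:
        raise ValueError("velocity_factor must be in (0, 1]")
    propagation = 2 * distance_miles / (SPEED_OF_LIGHT_MILES_PER_MS * velocity_factor)
    forwarding = 2 * hops * per_hop_ms
    return propagation + forwarding


def max_distance_miles(budget_ms, hops=0, per_hop_ms=0.0, velocity_factor=1.0):
    """Largest one-way distance that fits the RTT budget (0 if the hops eat it all)."""
    if not 0 < velocity_factor <= 1:
        raise ValueError("velocity_factor must be in (0, 1]")
    remaining = budget_ms - 2 * hops * per_hop_ms
    if remaining <= 0:
        return 0.0
    return remaining / 2 * SPEED_OF_LIGHT_MILES_PER_MS * velocity_factor


if __name__ == "__main__":
    # Ideal case: vacuum, no routers.
    print(round(max_distance_miles(1.0), 1))  # 93.1
    # Hypothetical case: fiber at ~0.67c and 4 routers at 0.1 ms each, each way.
    print(round(max_distance_miles(1.0, hops=4, per_hop_ms=0.1, velocity_factor=0.67), 1))  # 12.5
```

Under those assumed numbers, four routers alone burn 0.8ms of the 1ms budget, which is how 93 theoretical miles shrinks to something much closer to the "less than 10 miles" rule of thumb.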
So what does this mean exactly? Good question. It means that while this solution is much better than a single datacenter deployment, it’s not perfectly immune. I generally refer to this as the local immunity solution. There are a lot of disaster scenarios that could affect both datacenters when they are located that close together. The most obvious ones are prolonged regional power outages, earthquakes, and floods. If you don’t have to worry about those problems then this is the solution for you. What’s great about the straddled farm scenario is that one farm = as little operational overhead as possible. When you make a change to the farm configuration, add a solution, change the service account, etc., these changes are replicated automatically for you across the entire farm thanks to SharePoint’s hot administration model. There is one glaring hole, however.
As you are probably aware, the index server role is not a redundant one. That means you have to choose a datacenter to stick the index server in. If anything happens to that datacenter, the index recovery fun begins. There are ways to mitigate the index redundancy issue, but that is a post for another day. The good news is that the query role should continue to serve queries for existing indexed content as long as there are functioning query server(s) in the active datacenter.
Now, I know what you are saying. Mike, my hardware vendor has a comprehensive storage replication solution and will guarantee transactional consistency to insane distances (1000+ miles). I’m not using database mirroring. Why can’t I spread the rest of the farm components further apart? To answer that we have to pull back the curtain on the wizard. The easy answer and the typical MSFT response is “you’ll shoot your eye out, kid.” It may seem harmless to spread a web or application server across large distances (hell, I once joined a web server to a farm that was located across the world), but there is a lot going on behind the scenes that will cause numerous problems. I can’t go into all these issues in depth (I’m a busy guy), but trust me. If anybody ever truly wanted this to work, it was me. The allure of a single farm stretched across the globe is nearly irresistible. One-stop administration and intercontinental redundancy. That’s insane goodness. I will give you a sneak peek into the problems: timer jobs, query propagation, query’s internal load balancing, and Excel’s internal load balancing.
So if you need a better redundancy solution than the one above allows... say your primary datacenter is on the US west coast and you have to worry about those pesky earthquakes, those rolling blackouts, or those nasty smug clouds, what do you do? (Just kidding SF, I love the environment and your beautiful city.) The solution is to deploy two separate farms of equal capacity and capability to two datacenters separated by very large distances, most typically at least 1000 miles apart. With two separate farms you are not constrained by the effects of bandwidth and latency on SharePoint’s functionality behind the scenes. In this model you use a database replication mechanism such as log shipping or mirroring, a backup replication solution such as DPM, or your storage vendor’s own hardware replication solution. These replication solutions ensure that the content databases are consistent and available in the remote datacenter and can be attached to the farm within the RTO period while also meeting the RPO. There are a lot of cool ways to do this, but I’ll save that for another day as the concept is pretty much the same across each solution.
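If RPO and RTO are fuzzy terms for you, here is a small Python sketch of how I think about them for a single failover event. RPO bounds how much data you can afford to lose (the gap between the last replicated copy and the failure), and RTO bounds how long the service can be down. The function name and the 15 minute / 4 hour targets are hypothetical values for illustration, not recommendations.

```python
from datetime import datetime, timedelta


def meets_objectives(last_replicated, failure_time, service_restored, rpo, rto):
    """Return (data_loss, downtime, ok) for one failover event.

    data_loss: changes made after the last replicated copy are gone.
    downtime:  time from the failure until users can reach the content again.
    """
    data_loss = failure_time - last_replicated
    downtime = service_restored - failure_time
    ok = data_loss <= rpo and downtime <= rto
    return data_loss, downtime, ok


if __name__ == "__main__":
    failure = datetime(2008, 1, 1, 12, 0)
    loss, down, ok = meets_objectives(
        last_replicated=failure - timedelta(minutes=10),  # e.g. last log shipped 10 minutes earlier
        failure_time=failure,
        service_restored=failure + timedelta(hours=2),    # databases attached, DNS repointed
        rpo=timedelta(minutes=15),                        # hypothetical target
        rto=timedelta(hours=4),                           # hypothetical target
    )
    print(loss, down, ok)
```

The point of the exercise: your replication interval has to be shorter than the RPO, and everything you do by hand during failover has to fit inside the RTO.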
With your “secondary” farm intact and the content databases available in the remote datacenter, failing over between datacenters can be as simple as bringing the databases online (log shipping and mirroring) and attaching them to the farm. You can either provide the users with an alternate URL to access the content immediately upon recovery or use the original URL after a DNS change points users to the right place. This is a very flexible solution and one I like, but there are a number of little caveats that make this less attractive than the straddled farm scenario.
The first problem is that the secondary farm does not have an SSP, or has a different SSP than the one in your primary farm. You can either configure the secondary SSP to search the primary farm (only in smaller environments) or restore the primary farm’s SSP to the secondary farm, assuming you have configured SSP backups to be replicated to the secondary farm. The final option with search is to simply forgo it altogether and rebuild the index after failover, which may be OK for organizations where search isn’t critical. More on the specifics of this in a different post.
The second problem, and probably the biggest, is the operations quotient. The simple fact is that maintaining two farms is much harder than maintaining one. It’s hard enough sometimes just keeping track of changes occurring in one environment, but having to replicate those changes to a secondary environment requires operational discipline that few companies enjoy (not even MSFT). Couple that with the complexity of deciding when and how to fail over between datacenters and you can see why this is a complex issue. I’m hoping an enterprising vendor will fill this niche with a nice solution. I’m offering my consulting services to any and all vendors who are interested. The good news is that you can count on SharePoint to be “dial tone” as long as the content is available and the bits that the content relies on (SP patches and customizations) are available. The rest of those farm settings generally don’t impact the availability of the service or the data, and those settings can probably be consistently replicated from the memory of the supporting administrators or rediscovered on the fly. (Example: “oh, we forgot to configure the people picker searchadforests settings. That’s an easy fix.”)
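Until a vendor fills that niche, even a crude drift check beats administrator memory. Here is a minimal Python sketch of the idea, assuming you already export each farm’s settings into a simple dictionary by whatever means you like. The export step, the function name, and every key and value below are hypothetical; this does not call any SharePoint API.

```python
def farm_drift(primary, secondary):
    """Return settings that differ as {name: (primary_value, secondary_value)}.

    A setting missing from one farm shows up with None on that side.
    """
    drift = {}
    for name in sorted(set(primary) | set(secondary)):
        p = primary.get(name)
        s = secondary.get(name)
        if p != s:
            drift[name] = (p, s)
    return drift


if __name__ == "__main__":
    # Hypothetical inventories, written by hand or by your own export script.
    primary = {
        "patch_level": "build-A",
        "solutions": ["branding.wsp", "workflows.wsp"],
        "peoplepicker-searchadforests": "domain:corp.example.com",
    }
    secondary = {
        "patch_level": "build-A",
        "solutions": ["branding.wsp"],
    }
    for name, (p, s) in farm_drift(primary, secondary).items():
        print(f"{name}: primary={p!r} secondary={s!r}")
```

Run something like this on a schedule and the "oh, we forgot" moments show up before the failover instead of during it.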
While this isn’t a comprehensive list of all the options available, it does cover the two major HA buckets and describes the main pros and cons of each. Please let me know through your comments and emails which specific issues you are concerned with and I’ll post specific information about them.
I apologize for any typos or other literary offenses. I typed this in a hurry. Look forward to some cool posts on how DPM can solve your redundancy problems. Until next time.