SharePoint Business Continuity Planning

For this blog post, I'd like to discuss a topic that seems to be getting a lot of attention with most of my customers lately: Business Continuity Planning. These discussions run the gamut, but I often find confusion and misunderstanding around key topics, as well as a blending of separate topics (primarily High Availability and Disaster Recovery). Often when discussing Disaster Recovery, I hear customers use terms such as "automatic failover" and "zero downtime". While important to any business continuity discussion, automatic failover and zero downtime are typically part of high availability planning rather than disaster recovery planning. Sure, you could argue that it's a matter of semantics and doesn't really matter, but in my experience it is important to keep HA planning and DR planning separate, because with SharePoint the optimal architectural design for HA may look very different than the optimal architectural design for DR.

In this post I will define key terms and concepts and then provide a list of questions that can be used to help facilitate conversations around Business Continuity Planning.

Business Continuity

Business Continuity refers to the activity performed by an organization to ensure that critical business functions will be available to entities that require access to those functions. Typically, business continuity planning for SharePoint environments includes architecting solutions for High Availability, planning for Disaster Recovery, and conducting normal day-to-day operations such as applying software patches and taking backups.

For more information on business continuity management and SharePoint, please see https://technet.microsoft.com/en-us/library/jj715263.aspx

High Availability (HA)

High-Availability is typically defined in terms of the end user's ability to access the system during designated times and perform expected business functions. As a result, the goal of High Availability planning is usually to minimize or eliminate system downtime.

In the case of SharePoint, the following descriptions usually apply to High Availability (HA):

  • There could be a single data center or multiple (close proximity) data centers
  • Utilizes a single Configuration database (meaning single SharePoint Farm)
  • Leverages Automatic failover (no user intervention required)
  • Typically consists of SQL Clusters and load balanced application servers
  • Depends on database and/or storage technologies to synchronize data between servers in near real-time
  • Provides protection from Failure of DB server and Application (SP) servers
  • There is a defined goal for desired uptime (three, four, or five nines, i.e. 99.9%, 99.99%, or 99.999%)
  • There is an attempt to eliminate single points of failure via redundancy (Hardware, power supply, network, etc.).

For more information about SharePoint High Availability see https://technet.microsoft.com/en-us/library/jj715263.aspx
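
To make the "nines" in the uptime goal concrete, here is a rough sketch that converts a number of nines into the downtime it allows per year. The function name and the 365-day year are my own assumptions for illustration; an actual uptime agreement may measure over a different period or exclude planned maintenance.

```python
# Assumption: a 365-day year, measured 24x7, with no exclusions.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(nines: int) -> float:
    """Return the yearly downtime budget, in minutes, for N nines of uptime.

    Three nines means 99.9% availability, so the unavailable
    fraction is 10**-3; four nines gives 10**-4, and so on.
    """
    unavailable_fraction = 10 ** -nines
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for nines in (3, 4, 5):
        minutes = allowed_downtime_minutes(nines)
        print(f"{nines} nines: about {minutes:.2f} minutes of downtime per year")
```

Each additional nine cuts the allowable downtime by a factor of ten, which is why the jump from three nines to five usually changes the architecture and not just the operating procedures.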

Disaster Recovery (DR)

Disaster Recovery Planning typically focuses on recovering from a planned or unplanned outage. It entails defining detailed operational processes to be followed to ensure that recovery is possible and documents business requirements around recovery. Such requirements typically include but are not limited to RTO (Recovery Time Objective) and RPO (Recovery Point Objective). Most Disaster Recovery plans cover the scenario of a complete loss of the primary data center and as such often include a secondary data center located some distance apart from the primary data center.

In the case of SharePoint, the following descriptions usually apply to Disaster Recovery (DR):

  • Servers are located in different data centers, often far apart
  • Consists of separate configuration database in the secondary data center (Separate SharePoint Farm)
  • There is a defined Recovery Time Objective (RTO)
    • Time frame after a disaster in which the recovered business functionality should be available (i.e. service downtime after event)
    • Typically described in hours
    • RTO requirements typically dictate the DR Approach (Hot versus warm versus cold)
    • For SharePoint, getting the data to the secondary data center is often the longest part of the recovery process
  • There is a defined Recovery Point Objective (RPO)
    • Specifies time frame of allowable data loss after recovery from event
    • Usually specified in hours
    • RPO requirements typically dictate data backup/synchronization approach and schedule 
  • Facilities of Secondary Farm may have various degrees of availability during normal operations (cold, warm, hot)
  • May be invoked via a planned or unplanned failover.
  • Failover is often a manual, deliberate event.

For more information about SharePoint Disaster Recovery Planning see https://technet.microsoft.com/en-us/library/ff628971.aspx
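
As a back-of-the-envelope illustration of how RTO and RPO relate to data movement, the sketch below estimates how long it takes to copy a given amount of data between data centers and what the worst-case data loss is for a given synchronization interval. The function names, the 70% link efficiency, and the sample numbers are all assumptions of mine; real figures depend on your network, storage, and backup tooling.

```python
def transfer_hours(data_gb: float, bandwidth_mbps: float, efficiency: float = 0.7) -> float:
    """Estimate hours needed to copy data_gb gigabytes over a link.

    bandwidth_mbps is the link speed in megabits per second.
    efficiency is an assumed fraction of the link that is actually
    usable (other applications, protocol overhead, and so on).
    """
    megabits = data_gb * 8 * 1000  # 1 GB = 8,000 megabits (decimal units)
    seconds = megabits / (bandwidth_mbps * efficiency)
    return seconds / 3600


def worst_case_data_loss_hours(sync_interval_hours: float, copy_lag_hours: float = 0.0) -> float:
    """Worst-case data loss: a disaster strikes just before the next
    backup or sync would have completed its copy to the secondary site."""
    return sync_interval_hours + copy_lag_hours


def meets_objective(estimate_hours: float, objective_hours: float) -> bool:
    """True when the estimate fits inside the stated RTO or RPO."""
    return estimate_hours <= objective_hours


if __name__ == "__main__":
    # Hypothetical environment: 500 GB of databases, 100 Mbps link.
    copy_time = transfer_hours(500, 100)
    print(f"Estimated copy time: {copy_time:.1f} hours")
    print("Fits an 8-hour RTO:", meets_objective(copy_time, 8))

    # Hypothetical schedule: sync every 4 hours, 0.5 hours to ship each copy.
    loss = worst_case_data_loss_hours(4, 0.5)
    print("Fits a 4-hour RPO:", meets_objective(loss, 4))
```

With these made-up numbers, neither objective is met, which would point toward keeping data at the secondary site current ahead of time (a warm or hot approach) and toward a shorter synchronization interval.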

Now that some common terminology and SharePoint specific characteristics have been established, it is time to list some of the important questions that should be answered as part of any Business Continuity Planning exercise.

Questions For Business Continuity Discussions

These discussion questions are broken down into five categories: Business Requirements, Infrastructure Inventory, Operational Practices, Asset Prioritization, and Service Prioritization.

Business Requirements

The following business requirements should be defined and documented as part of the Business Continuity Planning process.

Uptime

  1. What are the business needs around uptime?
  2. What are considered normal business hours for your organization?
  3. Does your organization differentiate between planned downtime and unplanned downtime?
  4. Do you have regular maintenance windows? If so, how long are they? How often do they occur?
  5. Do you proactively measure and report application uptime?
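
The answers to the planned-versus-unplanned question change the reported number noticeably. The sketch below is one simple way to compute an uptime percentage both ways; the function name and the sample figures are hypothetical, and it assumes that when planned maintenance is excluded, the maintenance window is also removed from the time the service was expected to be up.

```python
def availability_percent(
    scheduled_minutes: float,
    unplanned_downtime_minutes: float,
    planned_downtime_minutes: float = 0.0,
    count_planned_as_downtime: bool = False,
) -> float:
    """Return measured availability as a percentage.

    If planned downtime counts against uptime, it is added to the
    downtime total. Otherwise it is removed from the period in which
    the service was expected to be available.
    """
    if count_planned_as_downtime:
        expected = scheduled_minutes
        downtime = unplanned_downtime_minutes + planned_downtime_minutes
    else:
        expected = scheduled_minutes - planned_downtime_minutes
        downtime = unplanned_downtime_minutes
    return 100 * (expected - downtime) / expected


if __name__ == "__main__":
    # Hypothetical 30-day month measured 24x7: 43,200 minutes,
    # with a 4-hour maintenance window and 90 minutes of outages.
    month = 30 * 24 * 60
    print(f"Excluding planned: {availability_percent(month, 90, 240):.3f}%")
    print(f"Including planned: {availability_percent(month, 90, 240, True):.3f}%")
```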

Disaster Recovery

  1. Do you have a desired/defined RTO for a major extended outage?
  2. Do you have a desired/defined RPO for a major extended outage?
  3. Do you perform disaster recovery tests?  If so, how often?
  4. Do you have a disaster recovery site? Where is it located compared to the primary site?

Infrastructure Inventory

The following questions help define and document the physical infrastructure that will be utilized for business continuity discussions.

Computing Resources

  1. Are you running with physical or virtual servers?  If virtual, what platform is being used?
  2. Do you have a process in place to manage growth and prevent oversubscription?

Storage

  1. What storage infrastructure/vendor are you using?
  2. Is the same storage infrastructure used at both the primary and secondary data center?
  3. Does the storage infrastructure provide replication mechanisms to different data centers?
  4. What maximum IOPS capacity can the storage infrastructure provide?  Have you measured this?  If so, how?

Network

  1. What is the bandwidth and latency between data centers?
  2. What other applications are utilizing the network connection?
  3. Are WAN accelerators in use between data centers? If so, what kind?

Database

  1. Do you take regular database backups? What is the backup schedule?
  2. Do you use DB Log Shipping?
  3. Do you use DB Mirroring? If so, Synchronous or Asynchronous?
  4. How large are your SP databases?

Operational Practices

The following questions help define and document the operational aspects of running a SharePoint environment.

Patching

  1. Do you take planned downtime to patch?  If not, how do you achieve that?
  2. Do you run with limited capacity while patching? Read-only?
  3. Do you use a planned failover to DR to eliminate downtime during patching?
  4. What is your patching cadence? How frequently do you patch?

Backups

  1. Do you take regular backups? If so, what kind? What types of tools are used?
  2. Do you ever test restores?
  3. Do you take SharePoint farm backups? What about site collection backups?
  4. Where are backup files stored and how are they archived?
  5. Is there an off-site backup facility in place in case of total data center loss?

     

Asset Prioritization

Asset prioritization and inventory is an important part of the Business Continuity Planning process.  Identifying what business assets are most critical, where they are stored and what priority they should have from a high availability and/or disaster recovery perspective will help business entities devise an optimized business continuity plan.

SharePoint Asset Recovery

Since the focus here is about business continuity planning for SharePoint, these questions focus specifically on SharePoint assets but a complete business continuity plan should inventory all important information assets including those NOT stored in SharePoint.  (Note: Though intended for use in the upgrade process, the workbook at https://go.microsoft.com/fwlink/?LinkId=252097 can be used as a tool to assist in documenting the various SharePoint Assets).

  1. How important is availability of each of the following assets after recovery from a disaster event? (Scale of 1 - 5 : 1 - not very important, 5 - very important)
    • Documents
    • List items
    • Social data
    • User/Profile Data
    • Termsets/Taxonomy
    • MySites
  2. How quickly do you need access to each of the following assets after recovery from a disaster event? (Scale of 1 - 5 : 1 - can probably wait a while, 5 - need this right away)
    • Documents
    • List items
    • Social data
    • User/Profile Data
    • Termsets/Taxonomy
    • MySites
    • Ability to locate artifacts via search
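
Once the two sets of 1 - 5 answers are collected, they can be combined into a simple recovery order. The sketch below multiplies importance by urgency and sorts from highest to lowest; the scoring rule, the function name, and the sample scores are purely my own illustration, and your organization may prefer to weight the two scales differently.

```python
from typing import Dict, List, Tuple


def rank_assets(scores: Dict[str, Tuple[int, int]]) -> List[str]:
    """Order assets by importance x urgency, highest first.

    scores maps an asset name to (importance, urgency), each on the
    1 - 5 scale used in the questions. Ties are broken alphabetically
    so the result is deterministic.
    """
    return sorted(
        scores,
        key=lambda name: (-(scores[name][0] * scores[name][1]), name),
    )


if __name__ == "__main__":
    # Hypothetical answers from a planning workshop.
    answers = {
        "Documents": (5, 5),
        "List items": (4, 4),
        "Social data": (2, 1),
        "User/Profile Data": (3, 4),
        "Termsets/Taxonomy": (3, 2),
        "MySites": (3, 2),
    }
    for position, asset in enumerate(rank_assets(answers), start=1):
        print(position, asset)
```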

Service Prioritization

Again, there are non-SharePoint services that will need to be inventoried and prioritized, but these questions focus only on SharePoint.

SharePoint Service Prioritization

  1. Which of the following services must be highly available? What level of availability is desired?
    • Sites (Are there some sites that must be more available than others? If so which ones?)
    • Content (Pages, documents and lists)
    • Search Query
    • TermSets and Taxonomy
    • User Profiles
    • Social Activity
    • Excel Services
    • SharePoint 2013 Apps
    • Workflow
    • Custom SharePoint Solutions (.wsps)
    • Other?  If other, what?
  2. How important is the availability of each of the following services after a disaster event? (Scale of 1 - 5 : 1 - not very important, 5 - very important)
    • Sites (Are there some sites that must be more available than others? If so which ones?)
    • Content (Pages, documents and lists)
    • Search Query
    • TermSets and Taxonomy
    • User Profiles
    • Social Activity
    • Excel Services
    • SharePoint 2013 Apps
    • Workflow
    • Custom SharePoint Solutions (.wsps)
    • Other?  If other, what?
  3. How quickly do you need access to the following services after a disaster event? (Scale of 1 - 5 : 1 - can probably wait a while, 5 - need this right away)
    • Sites (Are there some sites that must be more available than others? If so which ones?)
    • Content (Pages, documents and lists)
    • Search Query
    • TermSets and Taxonomy
    • User Profiles
    • Social Activity
    • Excel Services
    • SharePoint 2013 Apps
    • Workflow
    • Custom SharePoint Solutions (.wsps)
    • Other?  If other, what?

Credits

A big thanks goes out to some colleagues whose feedback and discussion around this topic have been greatly appreciated and helped to complete the picture in several areas!  Thanks Bob Fox (https://blogs.technet.com/b/sharepoint_foxhole/), Kirk Evans (https://blogs.msdn.com/b/kaevans/) and Cory Roberts.  It’s awesome working with you guys!

I hope you have found this information useful and as always I welcome comments, feedback, community insights and corrections!
