What to know before you choose a SQL Server Disaster Recovery and/or High Availability solution

Why disaster recovery planning matters:

· Do you know what your plans are to recover from various data losses? Have you ever tested those plans? Can they be implemented within your Service Level Agreements (SLAs)? Are you confident your plans cover the most likely and/or most painful types of data losses?

· Who gets to explain to the CEO and potentially the press why you were down for X hours more than the agreed-upon SLA, or why you were never able to recover your customers’ data at all? Even if it’s not you and your department doing the explaining directly, you can still bet you’ll be grilled by the person who does get to do the talking.

· What will happen to the business if there’s a large and possibly catastrophic data loss to one or maybe even multiple sites? What will happen to your job?

 

Methodology:

· Decide what you would like to protect against. Examples: hardware failures (disk, CPU, memory, etc.), loss of server, loss of server room (flooding for example), loss of electric grid in a city or section of country, malicious attack (internal or external), accidental data loss (delete wrong rows, drop database), etc. You may want to do a probability/impact matrix.

· Decide how much downtime you can afford per type of potential downtime. For example, if it’s an electrical outage expected to last one day, will you bring up all systems on generators or bring up remote locations? It’s possible that with a bigger loss a longer downtime may be acceptable. If the entire city is without power, people may expect that they won’t be able to do business locally. But they may still expect their data to be available at other sites during the local outage (their last bank transaction, the prescription they just dropped off, the fantasy game predictions they just entered).

· Decide how much data you can afford to lose. Can you reenter the last X minutes/hours/days worth of data or must you be able to save every byte? What is the level of acceptable loss within your budget?

· How long do you need to keep data backups? If there’s an accidental data loss that is not discovered for X amount of time, what will you do? What if you find out backups have been failing for the last 3 weeks and for some reason no one knew about it, but all the older backups are gone and you’ve lost the current database?

· Once a system is taken offline, is there still a chance you’ll be asked to recover data from it? For example, regulatory requirements might demand that you be able to pull up data from 7 years ago, even if you migrated only the most recent year from the obsolete system to your current system. Will you have the hardware, software, and operational expertise to restore an older backup taken from a system on what is now outdated hardware and software?

· Once you know what you need to protect against, you can begin to consider technical and resource considerations that will help you meet those goals. This includes frequent testing of whatever technologies and processes you put into place. There will be many tradeoffs in cost vs. functionality. It would be astronomically expensive to protect against all types of potential failures; the business has to decide where to draw the line. The planning should be revisited periodically to make sure it still meets your needs and that it is still working properly (based on your testing).

· Once you decide on technologies, you can develop policies, procedures, and responsibility guidelines (which departments/groups and/or people own each step). This is at least as important as the technologies you choose. It includes how to implement the technology on existing and forthcoming systems, how to monitor the systems, and how to test periodically. The testing must cover the entire process across all responsible groups or it isn’t complete/accurate.

· Determine who is responsible for making sure this is done initially and who is responsible for making sure it is revisited periodically.

 

SQL Server 2008

· High Availability https://www.microsoft.com/sql/techinfo/whitepapers/SQL_2008_HA.mspx

· Always On https://download.microsoft.com/download/c/a/f/caff7135-8d80-4dad-a104-0da8558d8a0e/Availability%20DataSheet.pdf

 

SQL Server 2005

SQL Server 2000

Technical options/considerations:

· RAID arrays to reduce disk failure problems. RAID 10 is generally the best across the board.

· Clustering to reduce failures from non-disk hardware problems (this can be local or geographically dispersed).

· Mirroring or log shipping to protect against various types of failures (various levels/options are available).
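At its core, log shipping is just an automated backup/copy/restore cycle. A minimal sketch, using a hypothetical database name and share path (the backup runs on the primary, the restores on the secondary):

```sql
-- On the primary: back up the transaction log to a share the secondary can reach.
BACKUP LOG SalesDB
    TO DISK = N'\\backupshare\SalesDB_log.trn';

-- On the secondary: apply the log, leaving the database in a restoring state
-- so further log backups can still be applied.
RESTORE LOG SalesDB
    FROM DISK = N'\\backupshare\SalesDB_log.trn'
    WITH NORECOVERY;

-- At failover time: bring the secondary online (no more logs can be applied after this).
RESTORE DATABASE SalesDB WITH RECOVERY;
```

The built-in log shipping feature wraps these steps in SQL Agent jobs and monitoring; the point of the sketch is that the recovery path is just a restore sequence you can rehearse by hand.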

· Replication to provide concurrently usable copies of some data on another server (you must build your own recovery methods; nothing is built in). You have to consider recovery from loss of publishers, the distributor, and subscribers, as well as scenarios that require a reinitialization.

· Backups, which should include frequent testing of the entire restore process, both from a technology and a personnel/procedure perspective. You also have to make a wise decision on full, differential, and transaction log backup frequency and retention, as well as compression. Point-in-time recovery might be an option for incorrect data updates (malicious or accidental) but may be difficult to coordinate across multiple databases. The media (local disk, remote disk in the same room, very remote disk at a different location, tape/DVD) should also be considered. The frequency is important and may vary depending on the time of day or even time of year. Are the backups themselves protected from loss, theft, and corruption?
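A sketch of the three backup types and a point-in-time restore, with a hypothetical database name, paths, and timestamp (the STOPAT time would be just before the bad update):

```sql
-- Full, differential, and transaction log backups.
BACKUP DATABASE SalesDB TO DISK = N'D:\Backups\SalesDB_full.bak' WITH INIT;
BACKUP DATABASE SalesDB TO DISK = N'D:\Backups\SalesDB_diff.bak' WITH DIFFERENTIAL;
BACKUP LOG      SalesDB TO DISK = N'D:\Backups\SalesDB_log.trn';

-- Point-in-time restore: full, then the latest differential, then logs,
-- stopping just before the accidental/malicious update.
RESTORE DATABASE SalesDB FROM DISK = N'D:\Backups\SalesDB_full.bak' WITH NORECOVERY;
RESTORE DATABASE SalesDB FROM DISK = N'D:\Backups\SalesDB_diff.bak' WITH NORECOVERY;
RESTORE LOG      SalesDB FROM DISK = N'D:\Backups\SalesDB_log.trn'
    WITH STOPAT = '2008-06-01 14:55:00', RECOVERY;
```

Note that STOPAT requires log backups (and therefore the full or bulk-logged recovery model), which is one reason the recovery-model decision below matters.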

· Various recovery models and settings in the databases.
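The recovery model is the switch that decides whether log backups (and therefore point-in-time recovery, log shipping, and mirroring) are even possible. A quick check and change, using a hypothetical database name:

```sql
-- FULL is required for point-in-time recovery, log shipping, and mirroring;
-- SIMPLE trades those away in exchange for not having to manage log backups.
ALTER DATABASE SalesDB SET RECOVERY FULL;

-- Audit the current setting on every database.
SELECT name, recovery_model_desc
FROM sys.databases;
```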

· Database snapshots have a limited potential role in very specific scenarios (perhaps to protect against accidental data loss where you would know about the problem very quickly).
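A sketch of that scenario, with hypothetical names and paths (the logical file name must match the source database’s, and reverting discards everything since the snapshot):

```sql
-- Take a snapshot before a risky change (e.g. a large batch update).
CREATE DATABASE SalesDB_Snap_1400
ON (NAME = SalesDB_Data,   -- logical data file name of the source database
    FILENAME = N'D:\Snaps\SalesDB_Snap_1400.ss')
AS SNAPSHOT OF SalesDB;

-- If the mistake is caught quickly, revert the entire database to the snapshot.
RESTORE DATABASE SalesDB
    FROM DATABASE_SNAPSHOT = 'SalesDB_Snap_1400';
```

Remember the snapshot lives on the same server as the source, so it protects against bad changes, not against losing the server or the disks.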

· Be able to consolidate various systems on one set of hardware (if sufficient hardware cannot be found, you may have to run a second system on an existing server). This may involve a second instance or just another set of databases on an existing instance. There may be issues to consider with synchronizing logins/users, either across domains or if you combine two instances’ worth of databases into a single instance. Will you combine databases from different versions? This goes into the bigger issue of knowing when you can recover on the same location and/or hardware vs. new hardware/location. This could involve Virtual Server or the Hyper-V virtualization features in Windows Server 2008.

· You should plan to restore your backups and run DBCC checks against the restored copies to make sure the backups are valid.
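A sketch of the validation sequence, with hypothetical names and paths (the MOVE clauses assume you know the logical file names, which `RESTORE FILELISTONLY` will report):

```sql
-- VERIFYONLY only checks that the backup media is readable and complete;
-- it does not prove the database inside is logically consistent.
RESTORE VERIFYONLY FROM DISK = N'D:\Backups\SalesDB_full.bak';

-- The real test: restore under a scratch name (or on a scratch server) and run CHECKDB.
RESTORE DATABASE SalesDB_VerifyCopy
    FROM DISK = N'D:\Backups\SalesDB_full.bak'
    WITH MOVE 'SalesDB_Data' TO N'E:\Scratch\SalesDB_VerifyCopy.mdf',
         MOVE 'SalesDB_Log'  TO N'E:\Scratch\SalesDB_VerifyCopy.ldf',
         RECOVERY;

DBCC CHECKDB ('SalesDB_VerifyCopy') WITH NO_INFOMSGS;
```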

· Available disk space plays a big part in some of these decisions.

· Consistency of names, disk layouts, configuration, etc. can make recovery simpler. Regardless of planned consistency, you need to have metadata, login/user, and configuration information available remotely separately from the backups (or at least know/practice how to get the necessary data from the backups).
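Some of that server-level metadata can be captured with simple catalog queries and saved off separately from the backups; a sketch (the catalog views shown exist in SQL Server 2005 and later):

```sql
-- Logins (user databases restored to a new server will need matching logins).
SELECT name, type_desc, default_database_name
FROM sys.server_principals
WHERE type IN ('S', 'U', 'G');   -- SQL logins, Windows logins, Windows groups

-- Server configuration (sp_configure settings).
SELECT name, value_in_use
FROM sys.configurations;

-- Linked server definitions.
SELECT name, product, provider, data_source
FROM sys.servers
WHERE is_linked = 1;
```

Note that passwords and SIDs are not recoverable from these views alone; scripting logins with their SIDs intact (for example, with Microsoft’s documented sp_help_revlogin procedure) avoids orphaned users after a restore.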

· For a warehouse or Analysis Services database, you will need the ability to rebuild from the source data. The warehouse or AS database may have to be re-populated when the design changes, which requires that the source data still be available.

· Do you have the media available to apply the exact same OS, firmware, SQL Server, etc. versions including hotfixes, editions, and x86 vs. x64 vs. IA64? Do you have a way to track exactly what version/hotfix level is needed?
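The exact build, service-pack level, edition, and platform can be recorded as part of your saved-off metadata; a sketch:

```sql
-- Record the exact build so a replacement server can be patched to match.
SELECT SERVERPROPERTY('ProductVersion') AS build,      -- e.g. 10.0.1600.22
       SERVERPROPERTY('ProductLevel')   AS sp_level,   -- RTM, SP1, ...
       SERVERPROPERTY('Edition')        AS edition;    -- Standard, Enterprise, ...

SELECT @@VERSION;   -- one string including OS details and x86/x64/IA64
```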

· What about other applications on the box, especially those that rely on SQL Server? Do you have a recovery plan for SharePoint, Project Server, BizTalk, PerformancePoint, or whatever applications you run against SQL Server databases?

· What if you can’t restore one or more of the system databases? Do you have enough information saved off (IDs, passwords, job schedules, linked servers, etc.) to rebuild the server so you can use the user databases you did save?

 

Non-technical considerations:

· How often can you afford to test the restore process? This is a very important step and the proper resources should be dedicated to it; weigh the costs of testing against the costs if you cannot restore the data for some reason. Testing will probably involve training each time, as people move through various positions/responsibilities and the technology changes over time as well.

· Define general priorities in case multiple systems fail at once (such as a big storm or a malicious attack). You won’t necessarily have the hardware, bandwidth, or personnel available to restore everything at once. Where do you start?

· Are your SLAs realistic? Well-documented?

· Make sure the planning process is revisited at least once a year (preferably more often), including verifying that testing is still occurring and succeeding.

· Have well-defined procedures for reacting in a timely, preferably automated fashion to technical failures (such as the backups failing due to lack of disk space or one disk in a RAID 5 array failing).

· Will your plan facilitate/complement population of your QA or development environment? Will it facilitate/complement the plan to roll out new systems?

· How will you detect a problem? DBCCs, users reporting problems, etc.

· Document completely and clearly and make sure more than one person knows where the documents and passwords are and more than one person practices implementing them. This is a very important process and you cannot afford to leave it in one person’s hands in case that person is not available at the time of a disaster.

· Document why you didn’t implement the options you chose not to use. For example, maybe you chose not to implement option X because it would put you over budget, or you chose technology A over B because it was more important to have quick recovery even if it meant losing a small amount of data.

· Know and document where you are now and where you want and need to be.

· Know your executive sponsors and make sure they understand the tradeoffs in the system.