SharePoint vs. Snapshots (Part 2)

This is a continuation of the previous article, SharePoint vs. Snapshots, which is recommended prior reading.

Hopefully after reading my previous post on how SharePoint and snapshots really, really don't get along well, you're wondering how you might take advantage of this incredibly awesome capability to literally turn back time on your environment in case of an emergency.

Before we discuss how snapshots can help SharePoint, let’s discuss briefly how snapshots should NOT be used:

  • Snapshots have almost no place in a regularly scheduled backup plan. That is, unless you've followed the carefully coordinated steps below, there's pretty much nothing you can or should do with snapshots to recover a failed SharePoint server or drive (including SQL LUNs or data storage).
  • Snapshots should not be used casually. For all of the reasons listed in the previous article, snapshots can cause some interesting and extremely difficult-to-resolve issues in your SharePoint environment. They shouldn't be part of any frequent administrative activity.
  • Snapshots should not be used to replicate an environment. Along with all of the potential issues mentioned in the previous article, using snapshots to replicate an environment for testing or pre-production use introduces an entire list of additional risks, such as duplicate names on the network, AD membership failure, unexpected notifications to end users from the new, non-production system, etc. Don't do it. A well-designed deployment script (manual or code) is the better answer here.

So… what role can snapshots play?

Warning: I am about to discuss something that violates Microsoft supportability statements. Although the process below SHOULD work as described without any significant problems or errors, you do so at your own risk.

Any use of snapshots must avoid the key elements that present inconsistency in the environment. The use of snapshots must specifically avoid any loss of synchronization with the SharePoint Configuration database. Other issues listed in my previous entry, specifically storage, network, and domain membership synchronicity should also be avoided.

The unfortunate reality is that there is only one time when we can be absolutely certain that the environment is in a known, stable, unchanging state: when the machines are physically turned off. You can click a thousand buttons, pre-script backup commands, or press the "snapshot" button as fast as you can, and there is still going to be a risk of misalignment. It is only when the machines are turned off and SharePoint is physically incapable of doing work or making requests of the databases that you can be certain that any snapshots or backups you make will be consistent across all of the infrastructure components.

Of course, once you power down the machines and take the various snapshots, you must manage that backup as if it was one single atomic unit. That is, if any piece of it is to be restored, all of it is to be restored. You cannot choose to restore one SharePoint server and not the others. You cannot choose to restore one database and not the others. You cannot choose to restore the SharePoint servers and not the database server. They are all linked, cannot be separated, and cannot be restored independently. Restoring one means restoring all. This includes the content databases, which means that if you suddenly decide you need to revert your SharePoint environment to that snapshot, you will also be reverting the state of the SharePoint content to that point in time. This could potentially be a significant amount of data lost and would likely be unacceptable to your end users.
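To make the "one atomic unit" rule concrete, here is a minimal sketch in Python. It assumes you keep a simple record of every component captured in a given snapshot set; the names SnapshotSet and validate_restore, and the component names in the usage note, are hypothetical and not part of any SharePoint or virtualization tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotSet:
    """All captures taken together while the farm was powered off."""
    label: str
    components: frozenset  # names of every VM, LUN, and database in the set


def validate_restore(snapshot_set, requested):
    """Return the sorted restore list, or raise if it is not the whole set."""
    requested = frozenset(requested)
    unknown = requested - snapshot_set.components
    if unknown:
        raise ValueError(
            f"not part of {snapshot_set.label}: {sorted(unknown)}"
        )
    missing = snapshot_set.components - requested
    if missing:
        # Restoring one means restoring all, so a partial request is refused.
        raise ValueError(
            f"partial restore refused; also restore: {sorted(missing)}"
        )
    return sorted(requested)
```

Asking for only "wfe1" out of a set that also contains "app1", "sql1", and "sql-data-lun" raises an error naming the three components that were left out; asking for all four returns the full list.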

As I mentioned in the first bullet above: Snapshots have no place in a regular backup or disaster recovery strategy, mostly because of the issues listed above. In particular, the requirement that the content be included in this backup/recovery plan makes using snapshots as part of a standard backup strategy nearly impossible (notice the "nearly" ;)). There is only one process that can change this, which I'll discuss at the very end of this post.

So… where are snapshots actually useful? There's really only one place that they offer true value for SharePoint: Patching!

SharePoint (and I might contend Windows and SQL) patches are a one-way proposition. Once you've installed them, there is absolutely no supported way to remove them. Your options are to make them work, or completely reinstall your environment. However, a properly timed and executed snapshot can give you a (reasonably) safe back-out strategy that would otherwise be impossible using the built-in features. If you have a production-identical environment and you've done all of your testing, deploying a SharePoint service pack should be fairly simple, flawless, and give you no cause for concern… but no one will hold it against you for having a safety net. Snapshots are the only thing that offers the possibility of sufficiently capturing every aspect of the environment to be that safety net.

This is predicated on the process being followed properly. Assuming you've done adequate testing and are in your scheduled maintenance/outage window, the process should look like this:

  1. Shut down all of the VMs.
  2. Snapshot the VMs (either in the virtualization or storage tiers).
  3. Back up/snapshot the SQL storage.
  4. Power on all services.
  5. Apply any updates as normal.
  6. Verify the update(s) completed successfully.
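The ordering of steps 1 through 4 can be sketched as a small Python function. This is only an illustration under stated assumptions: shut_down, is_powered_off, snapshot, and power_on are hypothetical callables that you would wire to your own virtualization or storage tooling, and capture_farm is an invented name. The one thing the sketch enforces is that nothing is captured until every machine reports that it is off.

```python
def capture_farm(machines, sql_storage, shut_down, is_powered_off,
                 snapshot, power_on):
    """Shut down, verify, capture, then power on. Returns the action log.

    All four callables are hypothetical hooks supplied by the caller.
    """
    log = []
    for machine in machines:
        shut_down(machine)
        log.append(("shutdown", machine))

    # Refuse to capture anything while any machine could still write data.
    still_running = [m for m in machines if not is_powered_off(m)]
    if still_running:
        raise RuntimeError(
            f"refusing to snapshot; still running: {still_running}"
        )

    # VMs first, then SQL storage, matching steps 2 and 3 above.
    for target in list(machines) + list(sql_storage):
        snapshot(target)
        log.append(("snapshot", target))

    for machine in machines:
        power_on(machine)
        log.append(("power_on", machine))
    return log
```

Passing the operations in as callables keeps the ordering logic testable with in-memory fakes before it ever touches a real farm. Applying and verifying the updates (steps 5 and 6) stays a manual activity in this sketch.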

In the event that you find yourself needing to roll back to the snapshot that was taken using the above process, the rollback should look something like this:

  1. Power down all SharePoint VMs.
  2. Restore all snapshots/backups (including SQL and SharePoint VMs).
  3. Power on VMs.

Using this process there is a potential risk of a loss of connectivity to Active Directory. If the machines have changed their account passwords (yes, every AD-joined machine has its own account and account password) since the snapshot was taken, then reverting to the snapshot restores the old password and the domain membership will be broken. For this reason, be sure you have the password for a local administrator account. This should allow you to log into the machine, unjoin it from the domain, and re-join it. Re-joining resets the machine account password and will allow the system to operate normally.

As with all things, details matter in the backup/snapshot and the revert processes. It is important that the backup and snapshot be appropriately thorough, and that no major elements of the process be missed. This means that, depending on your infrastructure, the number of things and/or places that have to be backed up or snapshotted must be well known and accounted for.

  • (A) The easiest option is to simply take a snapshot at the storage tier, being certain to capture the virtual machine hard drives and any pass-through mounted LUNs for SQL Databases (or SharePoint servers, if applicable). In this instance, the SQL Servers should ALSO be powered down so as to prevent them from attempting to modify data during the snap.
  • (B) + (D) You may also choose to perform snapshots of the virtual machines in the virtualization tier, and snapshots of SQL at the SAN tier.
  • (C) + (D) You may choose to perform a backup copy of the VHDX or VMDK files which contain the virtual machine hard drives. This is the equivalent of a "manual" snapshot, but reduces the potential negative performance impact that true snapshots generally imply. SAN locations must also be backed up/snapshotted.
  • (B)/(C) + (E) You may choose to perform snapshots of the VMs in the virtualization layer and perform SQL-based backups of all databases.

The options listed above aren't exhaustive… it's up to you to know what you need to back up/snapshot and where it will best meet your needs to do so. For example, it is entirely possible for your VMs to use a VMDK for their system drive, and pass-through storage directly to your SAN for a non-system "data" drive. You must be aware of the fact that your VM snapshot will NOT snapshot the pass-through storage… it is up to you to either snapshot this at the storage tier or perform a high fidelity, full backup of that volume while the machines are powered down. The number of things you need to consider to be certain your snapshot process captures everything is going to increase as the complexity of your environment increases. As another example, RBS creates yet another point that must be captured as part of this process.
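One way to keep track of what a given plan actually captures is a simple inventory check, sketched here in Python. The Disk record and uncovered_disks function are hypothetical names for illustration; the only rule they encode is the one described above, that a VM snapshot does not capture pass-through storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disk:
    vm: str
    name: str
    pass_through: bool  # True when the disk maps directly to SAN storage


def uncovered_disks(disks, vm_snapshots, storage_captures):
    """Return the disks that no planned snapshot or backup would capture.

    vm_snapshots: set of VM names snapshotted in the virtualization tier.
    storage_captures: set of (vm, disk name) pairs captured at the storage
    tier or by a full backup taken while the machines are powered down.
    """
    missed = []
    for disk in disks:
        if (disk.vm, disk.name) in storage_captures:
            continue
        # A VM snapshot only covers virtual disks, never pass-through ones.
        if not disk.pass_through and disk.vm in vm_snapshots:
            continue
        missed.append(disk)
    return missed
```

For a SQL VM with a virtual system disk and a pass-through data disk, a plan that only snapshots the VM reports the data disk as uncovered. An empty result is the goal before the maintenance window starts. Anything else your environment adds, such as an RBS store, would need its own entry in the inventory.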

The ultimate goal of all of this is to help you understand that when you're using snapshots with SharePoint, you're not just snapshotting a machine, a database, or a set of files… you're attempting to capture the state of the SharePoint farm at a point in time, and you need to capture that state in such a way that it is in perfect sync across all of the components. Shutting the machines down reduces the number of things that could be altering that state at any given moment. That makes it more likely that your capture is complete, and that if you need to revert to it, there are fewer things you need to accommodate, fix, or otherwise deal with that could make your farm completely unstable (and unsupportable).

I've tried to emphasize that the farm backup is really only good if it is intact, and it's only good at one moment in time. There are some reading this that will try to do things I'm NOT suggesting. For example, I'm almost certain I will someday receive a support request that says something like this:

"Customer followed your blog entry on snapshots and did one just before deployment of a service pack. They then learned two weeks later that their customizations are not compatible, so they reverted the farm to the snapshots and then restored their content databases to the most recent backup so as to not lose any data. Now their farm is broken."

Just trying to head this one off at the pass, let me be clear: THIS WILL NOT WORK. You cannot have the farm in a state that is "pre" service pack and databases that are "post" service pack. The pre-SP farm may not know how to talk to the post-SP databases… there is no telling what structural changes may have been made as part of the service pack upgrade, and we cannot support any scenario in which you try to do this or any variation of this process. JUST SAY NO.
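A guard against this specific mistake might look like the following Python sketch. It assumes you recorded a dotted build string for the farm and for each database at capture time; how you obtain those strings is outside the sketch, and parse_build, check_pairing, and the build values in the usage note are hypothetical. It models only the one case described above, databases that are newer than the farm they would be paired with.

```python
def parse_build(text):
    """Turn a dotted build string such as '1.2.30' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def check_pairing(farm_build, database_builds):
    """Raise if any database is at a newer build than the restored farm.

    database_builds maps a database name to the build it was captured at.
    """
    farm = parse_build(farm_build)
    too_new = sorted(
        name for name, build in database_builds.items()
        if parse_build(build) > farm
    )
    if too_new:
        raise ValueError(
            f"farm at {farm_build} cannot be paired with "
            f"newer databases: {too_new}"
        )
    return True
```

With a farm reverted to an illustrative build "1.0.100" and a content database backed up at "1.0.200", check_pairing raises and names the database. Passing this check does not make a mixed restore safe or supported; it only catches the pre-SP farm with post-SP databases case.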

And now, I did promise to mention the one caveat to the "snapshots are not backups" statement. It MAY be possible to use snapshots as part of a regular backup process as long as all of the above stated rules of snapshotting are maintained (e.g., the machines are powered off, etc.) AND there have been no changes to the SharePoint platform. This means that the SharePoint version is exactly the same across all snapshots, no hotfixes or other patches have been deployed to SharePoint, no solutions or features have been deployed, changed, or upgraded, and no logical architectural changes have been made (no new web applications, modifications to service application design, etc.). This caveat also only really applies to the Farm vs. the Content Databases. In this scenario, as long as all other things are equal, it may be possible to restore an older snapshot and then restore a newer content database. The trick to this is forcing SharePoint to treat it as a "new" database so that it syncs up to the DB. The following process is an example:

  1. Restore the SharePoint Farm (including all content) as you normally would.
  2. Detach all SharePoint content databases from the farm (in SharePoint, not in SQL).
  3. "Catch up" the databases to the most recent version you're looking for.
  4. Re-attach the newer databases to SharePoint, forcing SharePoint to treat them as if they're brand new and doing all of the normal synchronization and maintenance processes against them.

This option comes with a ton of caveats, limitations, requirements, and details, and is generally NOT RECOMMENDED… even by me. I only mention it in the spirit of full disclosure (and because someone will eventually ask me the 'what if' question); it is not something I'm saying you should do. Pretty much ever. Again, just say no.

Thanks for reading… and if you have any questions please feel free to ask. I'll get back to you as soon as I'm able.

Chris Mullendore