VSS backup of AlwaysOn Secondaries

Hi Everyone,

Today I’m going to highlight one of the changes brought by SQL Server 2012 SP2, which is the way we handle VSS Backup requests on AlwaysOn Secondary Databases.

Until now, any request for a FULL database backup (VSS_BT_FULL) thru VSS against a DB that is an AlwaysOn secondary was failing by design. Our VSS Writer SQLWriter would return FAILED_AT_PREPARE_SNAPSHOT (0x800423f4 - VSS_E_WRITERERROR_NONRETRYABLE).

A copy-only VSS backup (VSS_BT_COPY) would work.

The rationale being the following: a FULL backup is actually updating the target DB (reset of differential bitmap mainly), which is not possible when the DB is read only. Furthermore, because of the failover possibilities introduced by AlwaysOn, the favored option was to use Native SQL Servers backup that could rely on SQL Server variable backup location (https://msdn.microsoft.com/en-us/library/hh245119.aspx) if needed, and be ‘alwayson –aware’.

So that could be the end of the story: against an AlwaysOn Secondary DB, either use Copy_only VSS backups or use native backups.

But of course that wouldn’t make for a very interesting blog post…

Enters HyperV…

Consider the following scenario:

Large Windows HyperV Servers, hosting many HyperV virtual Servers, some of them SQL Servers in Always On architecture.

In short: a Private Cloud.

In this context, the IT usually takes care of the infrastructure at host level, and lets users deal with whatever happens within the VMs. One of the key tasks of IT is to manage backups (eg. for disaster recovery at datacenter level, or to provide restore of single VMs).

And the mainstream way to do that is to take VSS backups of the Host Disk Volumes. Microsoft System Center DPM will do exactly that.

But VSS backups are all about taking backups that are consistent: in ‘standalone’ SQL Server context you may already know all the logic SQLWriter implements to make sure that IO against the Databases that are backed up are frozen during the snapshot operation. So, back to our HyperV context, collecting a point-in-time image of a VHD without bothering with what happens within the VM would be defeating that very purpose right?

So what happens is the following: the VSS backup is propagated to Guest VMs thru HyperV integration services. That propagation hardcodes the backup type to VSS_BT_FULL, and therefore all guest VMs are initiating a VSS backup/snapshot in their own context. The purpose is to make sure that all applications are quiesced within all running VMs at the time we take the snapshot at the host level. This will enable us to generate a consistent backup of running VMs.

But let’s now put this in the context where one of the VMs is running an AlwaysOn secondary DB: you guessed it, it’s not going to work:

clip_image002

The important thing to know here is that the error returned by SQLWriter in VM3 will actually bubble up all the way to the initial VSS backup command at Host level, and will make it fail as a whole.

So we ended up in a situation where the IT infrastructure folks would see their Host backups failing from time to time for an unknown reason, depending on whether one or more of the VM present on the Host Drive being backup up had a secondary AlwaysOn DB! It could be that the AlwaysOn AG spans different HyperV hosts and therefore that the presence of a Secondary DB on a given Host is not something static over time.

Because of the complexity of the whole call chain, and because infrastructure IT operators may not have any visibility (or understanding) of the VM content, you can imagine what kind of troubleshooting challenges this would offer… And even when the situation is understood, well, what do we do? If host level backup must be somehow manually synchronized to the applicative state of Guest VMs, the Private Cloud scenario becomes quite more complicated all of a sudden.

This is the reason why SQL Server 2012 SP2 ships a code change for SQLWriter that will implement the following:

clip_image004

As you can see, SQLWriter now detects this specific situation and changes the backup type to VSS_BT_COPYONLY. This will only happen for VSS_BT_FULL backups against AlwaysOn secondary Dbs. VSS_BT_FULL backups against primary DB will happen without change.

In this case, the VSS backup will now successfully complete in VM3 and the host-level backup success state will no longer be tied to guest VM’s AlwaysOn activity. Private Cloud scenario unlocked!

Important note: the fact that VSS backup of AlwaysOn secondaries now works does not make it the preferred solution to backup SQL Server AlwaysOn architectures. The main purpose of the SP2 change is to avoid a situation where a single SQL Server in a VM fails a complete host-level backup operation that encompassing dozens of VMs.

The resulting backup for the VM hosting SQL should be considered a Disaster Recovery one, where AlwaysOn will be removed at once at restore time, not as a way to rebuild a subset of the nodes for an AlwaysOn Availability group. And for regular databases within the VM, that backup is as good any regular VSS one.

Finally, SQL Server 2012 SP2 only contains a partial fix for this issue. Servers running case-sensitive sort orders will require SQL Server 2012 SP2 Cumulative Update 2.

HTH,

Guillaume Fourrat
SQL Server Escalation Engineer
Microsoft France