VSS writers and inconsistent shadow copies

A few days ago, a customer told me that he encountered a weird error while using NTBackup to back up the system. The error text looked like this in the backup log:

 Media name: "System State.bkf created 8/29/2005 at 5:38 PM"
Volume shadow copy creation: Attempt 1.
"MSDEWriter" has reported an error 0x800423f0. This is part of System State. The backup cannot continue.

Error returned while creating the volume shadow copy:800423f0

The first thing to notice is how cryptic the message is: it doesn't really tell us anything, except that an error occurred. Unfortunately, this is an example of a useless message: what is it trying to communicate? What should the next steps be to remedy the situation? Fortunately, developers at Microsoft (including me) have lately started to realize how important it is to have the right error message infrastructure in place - but I digress.

Being familiar with VSS, I immediately spotted the issue: 800423f0 is VSS_E_WRITERERROR_INCONSISTENTSNAPSHOT (documented briefly here on MSDN, in the VSS SDK section).
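If you don't happen to know these codes by heart, a small lookup helps. The following is a minimal sketch: the function name describe_hresult is my own invention for illustration, and the table covers only a handful of the writer error codes from the VSS headers, not the full list.

```python
# A few VSS writer error codes (a partial list, for illustration only).
VSS_WRITER_ERRORS = {
    0x800423F0: "VSS_E_WRITERERROR_INCONSISTENTSNAPSHOT",
    0x800423F1: "VSS_E_WRITERERROR_OUTOFRESOURCES",
    0x800423F2: "VSS_E_WRITERERROR_TIMEOUT",
    0x800423F3: "VSS_E_WRITERERROR_RETRYABLE",
    0x800423F4: "VSS_E_WRITERERROR_NONRETRYABLE",
}


def describe_hresult(text):
    """Split a hex HRESULT (as printed in the log) into its standard fields."""
    value = int(text, 16) & 0xFFFFFFFF
    return {
        "failed": bool(value >> 31),        # severity bit: 1 means failure
        "facility": (value >> 16) & 0x7FF,  # 4 is FACILITY_ITF (interface-specific)
        "code": value & 0xFFFF,
        "name": VSS_WRITER_ERRORS.get(value, "unknown"),
    }


print(describe_hresult("800423f0"))
```

The facility field explains why generic error lookup tools are of little use here: FACILITY_ITF codes only have meaning in the context of the interface that returned them, in this case VSS.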

What is an inconsistent shadow copy?

Let me step back and offer some context for this problem. VSS is a complex infrastructure that allows backup applications to create, list, and delete shadow copies in order to get a reliable backup. Shadow copies are nothing more than static images of the system volumes, frozen in time, which makes them especially useful during backup, where you need a stable "version" of your data. More technical information about VSS can be found here.

VSS also allows applications to participate in the shadow copy creation process. Applications can receive various notifications during backup and restore - notifications that are used to ensure that these applications have a consistent image stored on the shadow copy. These notifications are received through dedicated software components called VSS writers. For example, AD has one writer, Exchange another, SQL yet another, etc. (you can enumerate the writers on your machine, along with their state, through the VSSADMIN LIST WRITERS command).
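If you need to check writer state from a script, the command output is easy enough to parse. Here is a rough sketch, assuming the usual "Writer name: '...'" header followed by indented "Key: value" lines. The helper names are mine, and the SAMPLE text is an illustrative mock-up with made-up GUIDs, not a capture from a real machine.

```python
import re


def parse_writers(text):
    """Turn VSSADMIN LIST WRITERS output into a list of dicts, one per writer."""
    writers = []
    for line in text.splitlines():
        line = line.strip()
        match = re.match(r"Writer name: '(.*)'$", line)
        if match:
            writers.append({"name": match.group(1)})
        elif writers and ":" in line:
            # Indented "Key: value" lines belong to the most recent writer.
            key, _, value = line.partition(":")
            writers[-1][key.strip().lower()] = value.strip()
    return writers


def writers_not_stable(writers):
    """Names of writers whose reported state does not contain 'Stable'."""
    return [w["name"] for w in writers if "Stable" not in w.get("state", "")]


# Mock-up output; the GUIDs are placeholders.
SAMPLE = """\
Writer name: 'MSDEWriter'
   Writer Id: {00000000-0000-0000-0000-000000000001}
   Writer Instance Id: {00000000-0000-0000-0000-000000000002}
   State: [8] Failed
   Last error: Inconsistent shadow copy

Writer name: 'NTDS'
   Writer Id: {00000000-0000-0000-0000-000000000003}
   Writer Instance Id: {00000000-0000-0000-0000-000000000004}
   State: [1] Stable
   Last error: No error
"""

print(writers_not_stable(parse_writers(SAMPLE)))
```

On a real machine you would feed parse_writers the text captured from the command itself (for example through subprocess.run from an elevated prompt). Keep in mind that a writer can legitimately report a state other than Stable while a backup is in progress.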

Now, a consistent image is absolutely needed in order to guarantee a reliable restore of your applications. At restore time, nothing is worse than discovering that your backed-up data is corrupt or inconsistent. This is the reason for having a special error code to denote that a certain writer is in an "inconsistent state" during the creation of a certain shadow copy set.

For example, if you have a SQL database with its data files on X:\ and its logs on Y:\, and you create a shadow copy set containing only X:\, then the shadow copy is inconsistent with respect to the SQL writer. That doesn't mean that the shadow copy is inconsistent with respect to other writers - it is very likely that no other writer has data on X:\.
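To make the rule concrete, here is a toy model of the volume-level check. This is not how the SQL writer is actually implemented; it only captures the idea that a writer is fine when the shadow copy set covers all of its volumes or none of them, and unhappy when it covers only some. The file names are hypothetical.

```python
import ntpath  # Windows path rules, usable on any platform


def volume_of(path):
    """Return the drive root of a Windows-style path (drive letter, colon, backslash)."""
    return ntpath.splitdrive(path)[0].upper() + "\\"


def is_consistent(writer_files, snapshot_volumes):
    """True if the set covers all of the writer's volumes, or none of them."""
    needed = {volume_of(p) for p in writer_files}
    included = needed & {v.upper() for v in snapshot_volumes}
    return included == needed or not included


# Hypothetical database: data on X:, log on Y:
sql_files = [r"X:\data\sales.mdf", r"Y:\logs\sales.ldf"]

print(is_consistent(sql_files, {"X:\\"}))          # False: only half the database
print(is_consistent(sql_files, {"X:\\", "Y:\\"}))  # True: both volumes present
print(is_consistent(sql_files, {"C:\\"}))          # True: writer not involved at all
```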

What is the solution?

I'll skip for now the technical details of why this "inconsistent snapshot" can cause problems, and go directly to the solution. The workaround for the NTBackup problem sounds weirder than the usual set of workarounds, but here it is:
1) Identify all the volumes on the machine that might contain active SQL databases and/or logs.
2) On each of these additional volumes (which are not related to the system state), create one empty file.
3) In NTBackup, along with the system state, include the empty files created above (but not the large database files). This ensures that all these volumes are part of the shadow copy set, and therefore the MSDE writer should no longer fail with 0x800423f0.
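Step 2 is easy to script if you have many volumes. A minimal sketch, assuming you already know the volume roots from step 1. The function name and the default file name vss_placeholder.txt are arbitrary choices of mine, and selecting the files in NTBackup (step 3) remains a manual step.

```python
from pathlib import Path


def create_placeholders(volume_roots, name="vss_placeholder.txt"):
    """Create one empty marker file at the root of each given volume."""
    created = []
    for root in volume_roots:
        marker = Path(root) / name
        # Creates an empty file if missing; never truncates an existing one.
        marker.touch(exist_ok=True)
        created.append(str(marker))
    return created


# Example call (hypothetical volumes holding SQL data and logs):
# create_placeholders(["X:\\", "Y:\\"])
```

The function returns the paths it touched, so you have a list of exactly which files to add to the NTBackup selection.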

This will eliminate the NTBackup error when you hit this type of failure with the MSDE/SQL writer. A similar problem can happen with the AD writer (named the NTDS writer on domain controllers). The solution is very similar.

Technical details...

Why does this work? Well, it's essentially a limitation of NTBackup. NTBackup doesn't "know" whether the SQL writer is important enough that a failure in it should abort the system state backup. Part of the reason is that some system services (like UDDI) depend on SQL or MSDE. Also, the particular mode used to create shadow copies is based on the assumption that the complete set of SQL-related volumes needs to be present in the set (in the example above, X:\ and Y:\). If one volume is present and the other one is missing, the writer goes into an error state. Now, if you select at least one file on a given volume, NTBackup will include that volume in the shadow copy set. Since both the database and log volumes are now in the set, the SQL writer is happy.

That said, I would also like to mention that the level of integration between NTBackup and VSS is atypical in this specific case - a different backup application should choose to enforce writer consistency not at the volume level, but at the component/file level. Such a backup application will cause the SQL writer to go into a failed state only if you select the database file and not the log file, for example. In other words, the writer goes into a failed state only if one of the writer-related files is present in the set and the others are missing.

One thing might appear confusing in the details above. How does the writer know whether the backup application needs volume-level consistency (i.e. NTBackup) or component/file-level consistency (all other backup apps)? The answer is that the application chooses between the two consistency modes through the IVssBackupComponents::SetBackupState(...) API. The first parameter dictates the selection model for this backup session. The recommended value is bSelectComponents == TRUE, which tells VSS that the backup application will explicitly select the necessary writer components during backup. bSelectComponents == FALSE is maintained for legacy reasons (mainly compatibility with NTBackup), but in general it is not a good choice, especially if you want to keep track of which writer components are present (and in a consistent state) on the shadow copy set.
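The difference between the two modes can be illustrated with a small simulation. To be clear, this is a toy model and not the VSS API: the select_components flag merely stands in for the bSelectComponents parameter, real writers reason about components rather than bare file lists, and all paths are hypothetical.

```python
import ntpath  # Windows path rules, usable on any platform


def volume_of(path):
    """Return the drive root of a Windows-style path, upper-cased."""
    return ntpath.splitdrive(path)[0].upper() + "\\"


def writer_fails(writer_files, selected_files, select_components):
    """Toy model: would the writer report an inconsistent shadow copy?"""
    if select_components:
        # Component mode: consistency is judged on the writer's own files.
        needed = {f.upper() for f in writer_files}
        present = needed & {f.upper() for f in selected_files}
    else:
        # Legacy mode: consistency is judged on whole volumes.
        needed = {volume_of(f) for f in writer_files}
        present = needed & {volume_of(f) for f in selected_files}
    # Failure means partial coverage: something is present, but not everything.
    return bool(present) and present != needed


sql_files = [r"X:\data\sales.mdf", r"Y:\logs\sales.ldf"]
selection = [r"C:\boot.ini", r"X:\unrelated.txt"]

print(writer_fails(sql_files, selection, select_components=False))  # True
print(writer_fails(sql_files, selection, select_components=True))   # False
```

The same selection trips the writer in legacy mode, because an unrelated file drags X:\ into the set without Y:\, while in component mode the writer is left alone since none of its own files were selected.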

One more note: the VSHADOW.EXE sample requestor, provided in the VSS SDK, is a sample implementation of a backup application that works in component mode. BETEST.EXE is a more complex VSS requestor that uses advanced VSS features.