Oh no – my backup is corrupt too! Help!

Funny how a bunch of people all seem to have the same problem at the same time. Maybe it’s just that people don’t want to talk about corruption until someone else does – it’s like a dark secret that once let out of the bag – everyone talks. The common HA/DR issue over the last few days on the newsgroups and forums seems to be not just data corruption but backup corruption too.

It goes back to the statement that you don’t have a backup until you’ve done a restore. What do I mean by that? I don’t mean you need to restore it back over the original database, but unless you’ve checked the validity of the backup, how do you know it’s going to work in a disaster recovery situation?

You really want to check two things:

  1. The integrity of the database being backed up
  2. The integrity of the backup file itself

Check #1 is easy – run some form of DBCC consistency checks before taking a full backup – otherwise you may be backing up a corrupt database. In fact, I was just dealing with a customer who unfortunately had PFS page header corruption (which cannot be repaired through DBCC) in their database and then found out that it was also in their backup. Their choice then became restoring from a week-old backup or exporting all data into a new database – neither of which is particularly palatable…

Sometimes it may not be possible to run CHECKDB on the source database - commonly on VLDB systems or system with high, continuous workload where the extra overhead required cannot be tolerated for the length of the CHECKDB. In that case, consider running CHECKDB on a restored copy of the database. This not only proves to you that the backup is valid (because it restores correctly) but that the source database was not corrupt at the time the backup was taken. This is a common strategy on systems that where CHECKDB cannot be run. The only problem with this approach is – what does it mean if the CHECKDB comes back with errors? You can’t tell whether the restored database or the source database is corrupt so you’re forced to go back to the source database and run another CHECKDB. Either way – you’ve successfully caught some corruption before it has a chance to really bite you.

I’ve just explained a component of check #2 – validating the backup. The other thing to do before doing the backup in SQL Server 2005 is choosing to use the WITH CHECKSUM option on the backup command. This will do two things while the backup is being taken:

  • Evaluate and check all existing page checksums or torn-page protection on database pages as they’re read
  • Calculate a checksum over the whole backup stream and save it in the backup

This can help ensure that the database being backed up doesn’t have any corruption caused by the IO-subsystem.

Once the backup is taken, it’s possible to verify all of this without actually writing out anything – using RESTORE … WITH VERIFYONLY. This will do everything short of writing bytes to disk, including rechecking any existing page checksums or torn-page protection on pages in the backup and recalculating and checking the checksum on the entire backup.

So, there’s a lot you can do to cut down on the likelihood of having a corrupt backup when the time comes to use it. Now – what happens if you get into the situation from the title of this post – the database AND the backup are corrupt?

Well, many people would throw up their hands in despair, but there is one thing you can do – allow the restore to ignore errors. There’s an option on restore called CONTINUE_AFTER_ERROR which will let the restore continue even if page checksum failure or corruptions in the backup stream are found. The catch is that the database may be inconsistent afterwards so you’ll need to work out what data is corrupt and do an export/import or run repair. In the best scenario, the only corruptions needing repair will be in non-clustered indexes, in which case you can rebuild them or allow repair to fix them up. In the worst scenario, the corruptions will be in the base table data, you’ll need to use REPAIR_ALLOW_DATA_LOSS; however, this may still be better than not being able to restore the backup at all.

Bottom line – make sure you validate your backups after you take them so you don’t get into a difficult situation. Also, be sure to keep more than one business cycle’s worth of backups (not just the last week – but maybe the last month) so that if a backup becomes bad later (through tape degradation, etc), you will have other options for restore and recovery.