Repair vs Restore Dilemma

For some reason The Clash's 'Should I Stay or Should I Go?' has been going round in my head today - 'Should I RE-pair or restore? Nah nah NA-na nana nah' - must be something to do with Redmond's good weather.

Anyway, I get asked for help with this a lot - more so it seems over the last year or so as people are moving bigger databases to SQL Server and more people are realizing that they need a good backup strategy rather than relying on luck and repair. Usually I get asked this after someone's been faced with the dilemma under pressure, has just had to make an arbitrary choice (trading off data/work loss against limitation of downtime), and wants to know if there's a more sure-fire way of working out what to do.

The bottom line is that I can't give a general recommendation one way or the other - it all depends. However, by doing a little homework and planning, you can work towards a sensible (and defensible to your management chain) answer fairly easily. The trick is to practice going through this exercise now, rather than trying to do it for the first time when a disaster happens and everyone's in headless-chicken mode trying to get the business going again. If you can just plug the data into a tried and tested set of decision steps, you'll be more confident in your choice and things should go more smoothly.

I'm going to assume you're running SQL Server 2005, as the new online restore technologies allow for much reduced downtime and a wider spectrum of choices when formulating your backup strategy.

First, gather the following information:

  • What's your backup strategy, and what kinds of restores does it make possible (page, file, filegroup)?
  • How long does it take to restore from your backups, in the various granularities your backup strategy makes available?
  • How long does it take DBCC CHECKDB to run? What about DBCC CHECKTABLE on your biggest few tables? What about with the REPAIR_ALLOW_DATA_LOSS option? When a repair option is specified, these commands run a little differently and can take longer (a timing sketch follows this list).
  • How much data/work can your business afford to lose? (I'll call this the data-loss SLA)
  • What's the maximum downtime your business can tolerate in the event of a disaster? (I'll call this the downtime SLA)
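
If you don't have these numbers to hand, the easiest way to get them is to run the commands against a restored copy of the database (or during a quiet maintenance window) and note the elapsed times. Here's a minimal sketch - the database name SalesDB and the backup path are placeholders, and MyRockMusic stands in for one of your big tables:

    -- Time a full database backup (substitute your own database name and backup path).
    BACKUP DATABASE SalesDB TO DISK = N'D:\Backups\SalesDB_full.bak' WITH INIT, STATS = 10;
    GO
    -- Time a plain consistency check of the whole database.
    DBCC CHECKDB (SalesDB) WITH NO_INFOMSGS, ALL_ERRORMSGS;
    GO
    -- Time a check of just the biggest table (run in the database's context).
    USE SalesDB;
    GO
    DBCC CHECKTABLE (N'dbo.MyRockMusic') WITH NO_INFOMSGS, ALL_ERRORMSGS;
    GO
    -- The repair options require single-user mode, so time those against a
    -- restored copy rather than against production.

These times will grow as the database grows, so it's worth re-measuring every so often.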

You should make sure your database is structured in such a way that it is possible to formulate a restore or repair strategy that satisfies both your downtime and data-loss SLAs. If this isn't possible, you need to push back on the SLAs before you get blamed for not meeting them when a disaster occurs.
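
For example, one way to buy yourself more options is to put large or less-critical tables into their own filegroups, so that a single filegroup can be backed up, restored, or taken offline without touching the rest of the database. Here's a rough sketch with purely hypothetical names (SalesDB, ArchiveFG, SalesArchive and the file paths are all made up):

    -- Add a new filegroup, and a file to hold its data.
    ALTER DATABASE SalesDB ADD FILEGROUP ArchiveFG;
    ALTER DATABASE SalesDB ADD FILE
        (NAME = N'SalesDB_Archive', FILENAME = N'E:\Data\SalesDB_Archive.ndf', SIZE = 10GB)
        TO FILEGROUP ArchiveFG;
    GO
    -- Create the large table on that filegroup so it can be backed up and
    -- restored independently of the PRIMARY filegroup.
    USE SalesDB;
    GO
    CREATE TABLE dbo.SalesArchive
    (
        SaleID   int      NOT NULL PRIMARY KEY,
        SaleDate datetime NOT NULL,
        Amount   money    NOT NULL
    ) ON ArchiveFG;
    GO
    -- A filegroup-level backup (and hence a filegroup-level restore) is now possible.
    BACKUP DATABASE SalesDB FILEGROUP = N'ArchiveFG'
        TO DISK = N'D:\Backups\SalesDB_ArchiveFG.bak' WITH INIT;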

Now, there are some decision steps to go through to work out what to do, using the information gathered above. Of course, you need to factor in doing some kind of root-cause analysis on whatever disaster occurs so you can take whatever preventative steps are necessary to stop it happening again. This may be as simple as copying the database files somewhere else for later analysis.

Usually the disaster is that a query fails because of an I/O error - that's the first the DBA knows that something is wrong. Then some kind of DBCC consistency check (typically a DBCC CHECKDB - remember to use the WITH NO_INFOMSGS, ALL_ERRORMSGS options) is run to see what consistency errors the hardware hiccup has caused. So, assuming you have a set of results from a consistency check, here are the steps:
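
For reference, that check looks like this (MyDB is just a placeholder for your database name) - NO_INFOMSGS suppresses the informational chatter and ALL_ERRORMSGS reports every error rather than capping the number shown per object:

    DBCC CHECKDB (N'MyDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;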

  1. Do you even need to run repair or restore?
    • If the number of allocation errors detected (look at the bottom of the DBCC CHECKDB output) is greater than zero, then the answer is yes - you've got to do something.
    • If there are no allocation errors, and you can determine that all the reported consistency errors are from non-clustered indexes (i.e. they all reference index IDs greater than 1), then the answer is no. This is because non-clustered index errors don't represent damaged data, just a damaged alternative access path to the data. You can simply disable the index (e.g. 'ALTER INDEX IX_Song_Names ON MyRockMusic DISABLE') and then rebuild it at your leisure (e.g. 'ALTER INDEX IX_Song_Names ON MyRockMusic REBUILD WITH (ONLINE = ON)').
    • Note that there's a bug in the code to determine which repair option to recommend at the end of the check output. If all the errors are from non-clustered indexes, it should recommend REPAIR_REBUILD, as nothing that can be done to a non-clustered index can result in data loss. However, it will sometimes recommend REPAIR_ALLOW_DATA_LOSS, for instance if one of the repairs involves deallocating an index page before rebuilding the index. This should be fixed in the next service pack.
  2. What's your data-loss SLA?
    • This is simple - if the answer is zero, then you have no choice but to use a backup.
  3. Do you have a backup?
    • Another easy one - if the answer is no then you have no choice but to run repair. You really don't want to be saying no at this step, especially if the answer for step 2 was 'zero'... Jump to step 6.
  4. Is the damaged data in a non-critical filegroup?
    • Look at the file ID of the page IDs referenced in the errors to determine which filegroups are affected. A page ID is of the form (fileId:pageInFile).
    • If the affected filegroups are not critical, you can set them offline and take care of the problem at your leisure while processing continues in the rest of the database.
  5. Can you restore a small subset of the database?
    • Is the damage limited to just a few pages? If your configuration allows it, you can use online or offline page restore to effect a very quick return to service (there's a sketch of this after these steps).
    • If the damage is more extensive, is it all contained within a single file or filegroup (see step 4 for how to tell this)? If so then you can restore just that single file or filegroup.
    • Look for 'online restores' in Books Online for far more info on the various restore granularities available to you, based on your recovery model and Edition of SQL Server 2005.
  6. Can you repair a small subset of the database?
    • Taking a look at that list of consistency errors again, are there only allocation errors? If so you should be able to get away with only running DBCC CHECKALLOC with the REPAIR_ALLOW_DATA_LOSS option, which should be very fast, even for a VLDB.
    • If there are just a few consistency errors, are they limited to a single table (i.e. all the errors reference the same object ID)? If so you should be able to get away with running DBCC CHECKTABLE with the REPAIR_ALLOW_DATA_LOSS option, which is faster than running DBCC CHECKDB (again, see the sketch after these steps).
    • If there's a mixture of error types, or errors from a variety of tables, then you're far safer running a full DBCC CHECKDB with the REPAIR_ALLOW_DATA_LOSS option to make sure everything's cleaned up.
  7. If you still have the choice at this decision step, trade-off the estimated run-time of the minimal repair option against that of the minimal restore option and go with whatever fits in your downtime SLA.
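
To make steps 5 and 6 a little more concrete, here are two hedged sketches. First, a single-page restore - the page ID 1:57, the database name MyDB, and the backup file names are all placeholders; use the actual fileId:pageInFile values from the error messages. This assumes the FULL or BULK_LOGGED recovery model; online page restore needs Enterprise Edition, and in other editions the same sequence works but the database is offline while it runs:

    -- Restore just the damaged page from the most recent full backup.
    RESTORE DATABASE MyDB PAGE = '1:57'
        FROM DISK = N'D:\Backups\MyDB_full.bak' WITH NORECOVERY;

    -- Roll forward by restoring the existing log backups...
    RESTORE LOG MyDB FROM DISK = N'D:\Backups\MyDB_log_1.trn' WITH NORECOVERY;

    -- ...take one final log backup to capture the latest transactions...
    BACKUP LOG MyDB TO DISK = N'D:\Backups\MyDB_log_tail.trn' WITH INIT;

    -- ...and restore it to bring the page fully up-to-date and back online.
    RESTORE LOG MyDB FROM DISK = N'D:\Backups\MyDB_log_tail.trn' WITH RECOVERY;

Second, a targeted repair of a single damaged table, as in step 6. Remember that any repair option requires the database to be in single-user mode, so even a 'small' repair takes the whole database away from users while it runs:

    ALTER DATABASE MyDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    GO
    USE MyDB;
    GO
    DBCC CHECKTABLE (N'dbo.MyRockMusic', REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS, ALL_ERRORMSGS;
    GO
    ALTER DATABASE MyDB SET MULTI_USER;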

I'd be very interested to hear thoughts on this - especially if there's something I've missed out.

Thanks