Common bad advice around disaster recovery

Article
06/07/2006

I've just been itching to post this one.

I'm doing a chalk-talk (AKA cabana session, this year AKA Technical Learning Centre session) at TechEd next Wednesday (8.30am 6/14) called 'DBCC CHECKDB: Magic, Monsters, and Myths'. A lot of the material in this will be included in the mega-whitepaper on DBCC I'll be writing over the summer, but it also slices nicely into blog-size posts so I'm going to do a series of them over the next month or two covering everything from today's topic to very deep drilldowns into how specific repairs work. If there's anything specific you'd like to see, let me know.

Billy Connolly (a great Scottish comedian) tells a story from his childhood about his father listening to the radio at breakfast while eating his outmeal, and every so often shouting 'No!!!' at the radio. Eventually the radio had a nice spackling of dried outmeal. I usually read the newsgroups and various MS and non-MS forums every day or so - and if I were to read them at breakfast, my laptop screen would very quickly get a nice spackling too (not of oatmeal though - nasty stuff).

I really despair to see some of the 'advice' that's out there, even though its all well-intentioned. Here are some of the gems related to disaster recovery - and I'll try not to rant.

"Just run REPAIR_ALLOW_DATA_LOSS and you'll be fine..."

What about your backups? What's the point in taking backups if you don't use them. See my previous post for how to choose whether to repair or restore. Of course, you do have backups, right?
What about the fact that you may lose data? Do these advisors understand what repair's going to do to fix the errors that have been posted - I'm guessing not.
What about working out what happened? You should always do some root-cause analysis (RCA) to work out why the problem occured. This is so important, otherwise you're just patching holes with duct-tape when you repair or restore.
I've seen people recommend running repair to 'fix' 823 errors caused by unreadable disk pages...

"Just rebuild your transaction log using these steps..."

This one kills me every time. Very, very rarely is it advised in the right context - corrupt transaction log with no backups, and even then the ramifications aren't explained.
All the same arguments as above apply here too, but additionally:
Do you know what rebuilding a transaction log does to the database? It's about the worst thing you can do - ripping out all the uncommitted and unrecovered transactions and doing nothing to fix things up. This is why its undocumented and unsupported for use except under the guidance of Product Support in last-chance situations.
I've seen people advise doing this if recovery is taking too long...
Now, in SQL Server 2005 we have a documented and supported solution for these situations, Emergency Mode Repair. See 'DBCC Statements' in Books Online and I'll cover it in detail in a future post.

"Just restore your database and carry on..."

This one's not really bad except that it's missing the crucial investigation into what happened. Chances are that whatever caused the problem this time will repeat and lead to another outage. Always do RCA - even if its just copying the database off somewhere for later analysis.

"Run CHECKALLOC, then CHECKDB, then CHECKTABLE on all your tables, then..."

It is a common misconception that DBCC CHECKDB does something different from DBCC CHECKALLOC and DBCC CHECKTABLE. It doesn't. At the simplest level, DBCC CHECKDB runs DBCC CHECKALLOC and then runs DBCC CHECKTABLE on every table in the database.
Granted, in SQL Server 2000 and before, it didn't run DBCC CHECKCATALOG, but we fixed that in SQL Server 2005. (Yes, we'll get to DBCC CHECKCONSTRAINTS too at some point).
Another thing I see is advise to run DBCC CHECKALLOC to determine the damage caused by an 823 or 824 error. That won't cut it - given the distribution of page types, the odds are that its not an allocation bitmap that's affected so you should run DBCC CHECKDB.

"Just flick the power switch on and off a few times on one of the drives..."

This is in response to 'how can I create a corrupt database to test by custom DBCC script?'. Nice - not only do you get a corrupt database, you also get a fried drive PSU. I'm sure your hardware support team would love you for that.
One thing that I used to recommend to people in pre-SQL Server 2005 days was to create a test database and manually delete a row from the sysindexes table. This will cause all manner of problems. Now, in SQL Server 2005 that's not really possible any more, so...
Here's the best way to create a test corrupt database: use a hex editor on the raw database files. Shutdown the server so the file isn't locked (don't detach the database because if you corrupt the 'wrong' page you may not be able to attach it again). Pick an offset more than, say, 100 pages into the file (at least 819200 bytes) but make sure its aligned on an 8192 byte boundary (a page boundary). This avoids critical metadata pages and allocation bitmaps so that you'll be able to start the database and run DBCC CHECKDB on it. Write a few bytes of zeroes into the file at the chosen offset and you're almost guaranteed some page header corruption errors.
Now remember, I'm only advocating doing this to create a test corrupt database, don't even think of doing this to anything you value.
If you want to test a specific error, let me know - we have tools that can engineer any single or combination of corruptions (so we can test the check and repair algorithms)

These are the worst ones around disaster recovery. I could go on all day about not shrinking database and not blindly rebuilding all indexes every night but it's lunchtime so I won't go there - yet :-)

Common bad advice around disaster recovery

Additional resources