Can't I ever get a guarantee?

I've been sitting and thinking a lot about guarantees this morning while drinking coffee in the debilitating heat here (85 degrees at 11am) and things get more complicated the more I think about it, so time to stop and post some thoughts.

Over the last few days, the question of guarantees has come up several times in various forums, essentially - what guarantee do I get from a clean CHECKDB run?

The simple answer is 'not much', but let me go into more detail.

Here's an analogy to help illustrate my point. Consider Paul, who works for the Seattle Police Department in traffic control. Paul's in a control room somewhere in the city with a large bank of monitors connected to various highway cameras. Every 1/2 hour Paul cycles through the various cameras feeds looking for problems. At the end of the 1/2 hour cycle, Paul knows that there are no accidents in the city.

Ah - but hold on. Does Paul really know that? No. All Paul knows is that at the point he looked at a particular camera, there was not an accident at that spot in the highway network. The very instant he switches to another camera feed, an accident could happen at spot covered by the previous camera.

Hopefully you're starting to see where I'm going with this analogy. A database page is known good at the point that its read from disk. After that point, anything could happen to it, the very next millisecond after the page read has completed.

So, when a CHECKDB completes cleanly, you know that the set of database pages that were read were clean at the point they were read. You can't even say that the entire database is clean at any instant of time, because its not possible to read an entire database in an instant of time.

Now, this may come as a shock to some people reading this but really it shouldn't - its common sense. Basically, there's no guarantee of any future lack of corruptions. You can say that with good, well-behaving and up-to-date hardware the likelihood of future corruptions is small, but you can't guarantee it.

An extrapolation of this point applies to database backups. A common sequence of events is to run a CHECKDB, take a backup, restore the backup on another system and then run a CHECKDB on the resulting database. Yes, this gives you a good sense that the database was clean before it was backed up, the backup worked, and what's contained in the backup is a clean database - but that's where it ends. At the instant that the final CHECKDB completes, you still don't have any guarantees. Something could happen to the backup file - you could have a double failure of the original database and the backup file. Unlikely, sure, but perfectly possible and I've seen it happen many times.

So now that I've scared the hell out of some of you reading this, I'd better touch on the questions that you're no-doubt asking yourselves:

  1. What's the point of running CHECKDB then?
  2. What can I do to get closer to a guarantee?

#1 is the easier to answer. Although you don't get any future guarantees from a clean CHECKDB run, you do at least know that everything's good at present and that all database pages could be cleanly read from disk. That's a pretty strong reassurance that your hardware is holding up.

How often to run a CHECKDB (or combination of other check commands) is way more complicated, and is a subject for a seperate blog post (yes, all those people who keep asking me to post about this - I'll get to it eventually :-)

#2 has no easy answer and I could easily fill a book writing about high availability and disaster recovery. However, I'm not going to do that yet so here are some simple things off the top of my head for you to think about and investigate and I'll follow up with a bunch of blog postings over the next few months:

  • This is the most important one - if you really have mission-critical data, employ DBAs who have a clue what they're doing - I've seen some scarily naive people running things for large companies
  • Convince those that control your budget of the consequences of not spending a little money to protect the data
  • Ensure that all your drivers and firmware are up-to-date
  • Run stress tests on your hardware to ensure that it can cope with the load that you may drive SQL Server to place on it
  • Turn on page checksums in SQL Server 2005
  • Run regular DBCCs to forcibly verify the integrity of all the database pages
  • Use mirrored backups
  • Use decent hardware to store mission critical data
  • Investigate the various Always-On technologies in SQL Server
  • Practice your disaster recovery strategy

Yes, there are plenty more things I could add to the list but I didn't mean it to be exhaustive.

So, even though its not possible to get a guarantee, I should be able to get some degree of assurance that things are ok and will continue to be that way in the face of any eventuality - the trick is to think things through carefully and make appropriate choices.

More on this early next week hopefully...