CHECKDB (Part 6): Consistency checking options for a VLDB

(Yippee - just finished my certification dives and got my PADI Open Water certification - just in time for our dive trip to Indonesia in December :-)  

This is a question that comes up a lot - in fact 3 times this week already - most recently during a guest lecture I did on DBCC for one of Kimberly's popular "Immersion Events". The question is: how can I run consistency checks on a VLDB?

We're talking hundreds of GBs or 1 TB or more. These databases are now common on SQL Server 2000 and 2005, with more migrations happening all the time. Any sensible DBA knows the value of running consistency checks, even when the system is behaving perfectly and the hardware is rock-solid. The two problems that people have with running a full DBCC CHECKDB on their VLDB are:

  • It takes a long time to run (proportional to the database size and schema complexity).
  • It uses lots of resources.

So it uses lots of resources for a long time. Even with a decent-sized maintenance window, the CHECKDB may well run over into normal operations. There's also the case of a system that's already pegged in one or more resource dimensions (memory, CPU, I/O bandwidth). Whatever the case, there are a number of options:

  • Don't run consistency checks
  • Run a DBCC CHECKDB using the WITH PHYSICAL_ONLY option
  • Use SQL Server 2005's partitioning feature and devise a consistency checking plan around that
  • Figure out your own scheme to divide up the consistency checking work over several days
  • Use a separate system

Let's look at each in turn.

Don't run consistency checks

Don't be daft. Don't even think about using this option. If you absolutely cannot figure out a way to get consistency checks on your system, send me email and I'll help you. Now let's move on to serious options...

Use WITH PHYSICAL_ONLY

A full CHECKDB does a lot of stuff - see my CHECKDB internals series for more details. You can vastly reduce the run-time and resource-usage of CHECKDB by using the WITH PHYSICAL_ONLY option. With this option, CHECKDB will

  • Run the equivalent of DBCC CHECKALLOC (i.e. check all the allocation structures)
  • Read and audit every allocated page in the database

So it skips all the logical checks, inter-page checks, and things like DBCC CHECKCATALOG. The fact that all allocated pages are read means that:

  • Any pages that cannot be read at all (i.e. 823 errors) will be discovered 
  • If page checksums are enabled in SQL Server 2005, any corruptions caused by storage hardware will be discovered (as the page checksum will have changed).

So there's a trade-off of consistency checking depth against runtime and resource usage.
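For reference, here's a minimal sketch of what that looks like - the database name is just a placeholder, and the ALTER DATABASE statement is only needed if page checksums aren't already turned on (note that a checksum is only written the next time a page is changed and written out):

-- Enable page checksums (SQL Server 2005 only; placeholder database name).
ALTER DATABASE MyVLDB SET PAGE_VERIFY CHECKSUM;
GO

-- Lightweight consistency check: allocation checks plus a read-and-audit of
-- every allocated page, skipping the deeper logical checks.
DBCC CHECKDB (N'MyVLDB') WITH PHYSICAL_ONLY;
GO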

Use the partitioning feature to your advantage

One of the obvious ways to reduce the time/resources issue is to partition the load. If you're using the partitioning feature in SQL Server 2005 then you're already set up for this. Given that you've hopefully got your partitions stored on separate filegroups, you can use the DBCC CHECKFILEGROUP command.

Consider this example - you have the database partitioned by date such that the current month is on a read-write filegroup and the past 11 months are on read-only filegroups (data from more than a year ago is on some offline storage medium). The prior months also have multiple backups on various media, so they're considered much 'safer' than the current month. It makes sense then that you don't need to check those filegroups as often as the current month's filegroup, so an example consistency checking scheme would be:

  • Run a DBCC CHECKFILEGROUP on each read-only filegroup every week or two
  • Run a DBCC CHECKFILEGROUP on the read-write filegroup every day or two (depending on the stability of the hardware, the criticality of the data, and the frequency and comprehensiveness of your backup strategy).
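As a rough sketch of what those two jobs might look like - the database and filegroup names here are made up, and DBCC CHECKFILEGROUP takes a filegroup name or ID in the current database:

USE SalesDB;   -- hypothetical partitioned database
GO

-- Weekly/bi-weekly pass over the read-only, per-month filegroups (made-up names).
DBCC CHECKFILEGROUP (N'FG_2006_01');
DBCC CHECKFILEGROUP (N'FG_2006_02');
-- ...one per read-only month...
GO

-- Daily (or every couple of days) pass over the current, read-write filegroup.
DBCC CHECKFILEGROUP (N'FG_Current');
GO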

I know of several companies who've made the decision to move to SQL Server 2005 in part because of this capability to easily divide up the consistency checking.

Figure out your own way to partition the checks

If you're on SQL Server 2000, or you just haven't partitioned your database, then there are ways you can split up the consistency checking workload so that it fits within a maintenance window. Here's one scheme that I've recommended to several customers:

  • Run a DBCC CHECKALLOC twice a week (as part of the Sunday and Thursday steps below)
  • Figure out your largest tables (by number of pages) and split them into 7 buckets, such that there's a roughly equal number of database pages in each bucket (see the sketch after this list for one way to get the page counts).
  • Take all the remaining tables in the database and divide them equally between the 7 buckets (using number of pages again)
  • On Sunday:
    • Run a DBCC CHECKALLOC
    • Run a DBCC CHECKCATALOG
    • Run a DBCC CHECKTABLE on each table in the first bucket
  • On Monday, Tuesday, Wednesday:
    • Run a DBCC CHECKTABLE on each table in the 2nd, 3rd, 4th buckets, respectively
  • On Thursday:
    • Run a DBCC CHECKALLOC
    • Run a DBCC CHECKTABLE on each table in the 5th bucket
  • On Friday and Saturday:
    • Run a DBCC CHECKTABLE on each table in the 6th and 7th buckets, respectively
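Here's a rough sketch of how you might size the buckets and drive each day's checks. The table names are invented, and on SQL Server 2000 you'd get the page counts from dbo.sysindexes rather than the SQL Server 2005 DMV used below:

-- Rank tables by allocated pages so they can be divided into 7 roughly equal
-- buckets (SQL Server 2005; on SQL Server 2000, query dbo.sysindexes instead).
SELECT  OBJECT_NAME (object_id) AS TableName,
        SUM (used_page_count)   AS UsedPages
FROM    sys.dm_db_partition_stats
GROUP BY object_id
ORDER BY UsedPages DESC;
GO

-- Each day's job then just runs DBCC CHECKTABLE over that day's bucket, e.g.:
DBCC CHECKTABLE (N'dbo.OrderDetails');   -- made-up table in bucket 1
DBCC CHECKTABLE (N'dbo.Orders');         -- made-up table in bucket 1
GO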

In pre-RTM builds of SQL Server 2005, DBCC CHECKTABLE couldn't bind to the hidden critical system tables (just as regular T-SQL can't) - but that was fixed by RTM, so the method above lets you cover all the system tables on both SQL Server 2000 and 2005. Here's what I mean:

C:\Documents and Settings\prandal>osql /E
1> select * from sys.sysallocunits
2> go
Msg 208, Level 16, State 1, Server SUNART, Line 1
Invalid object name 'sys.sysallocunits'.
1> dbcc checktable ('sys.sysallocunits')
2> go
DBCC results for 'sysallocunits'.
There are 112 rows in 2 pages for object "sysallocunits".
DBCC execution completed. If DBCC printed error messages, contact your system
administrator.
1>

There's one drawback to this method - a new internal database snapshot is created each time you start a new DBCC command, even for a DBCC CHECKTABLE. If the update workload on the database is significant, then there could be a lot of transaction log to recover each time the database snapshot is created - leading to a long total run-time. In this case, you may need to alter the number of buckets you use to make the total operation fit within your available window.

Use a separate system

This alternative is relatively simple - restore your backup (you are taking regular backups, right?) on another system and run a full CHECKDB on the restored database. This offloads the consistency checking burden from the production system and also allows you to check that your backups are valid (which you're already checking though, right?). There are some drawbacks to this however:

  • You need sufficient disk space on the spare system to restore the backup onto. If the production database is several TB, you need the same several TB on the spare box. This equates to a non-trivial amount of money - initial capital investment plus ongoing storage management costs. (I'm working on this though - I have a patent on consistency checking a database in a backup without restoring it - unclear at this time whether it will make it into Katmai.)
  • If the consistency checks find an error, you don't know for sure that the database is corrupt on the production system. It could be a problem with the spare box that's caused the corruption. The only way to know for sure is to run a consistency check on the production system. This is a small price to pay though, because most of the time the consistency checks on the spare system will be ok, so you know the production database was clean at the time the backup was taken.
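For completeness, here's a minimal sketch of the restore-and-check approach on the spare box - all of the database, file, and path names below are made up:

-- Restore the latest full backup onto the spare system (placeholder names/paths).
RESTORE DATABASE SalesDB_Check
FROM DISK = N'D:\Backups\SalesDB_Full.bak'
WITH MOVE N'SalesDB_Data' TO N'E:\CheckDB\SalesDB_Data.mdf',
     MOVE N'SalesDB_Log'  TO N'E:\CheckDB\SalesDB_Log.ldf',
     REPLACE;
GO

-- Run the full, deep consistency check against the restored copy.
DBCC CHECKDB (N'SalesDB_Check') WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO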

Summary

You've got a bunch of choices to allow you to run consistency checks, so there's really no excuse for not knowing (within a reasonable timeframe) that something's gone wrong with your database. If you need further help working out what to do, or just want a critical eye cast over the plan you've come up with, send me an email at prandal@microsoft.com

(One of the other questions that also keeps coming up is: when are you going to write the whitepaper you promised? Q1CY07 - honestly!)