CHECKDB (Part 4): What does CHECKDB really do? (part 3 of 4)

(OK - I was wrong about the posting frequency - events overtook me and I got bogged down preparing for these TechEds. China is an amazing place and I wish I had more time to soak up some of its culture while I'm here but with the TechEds being so close together there's little time for anything except sessions, email, taxis, flying, sleeping, jellyfish... I did manage to go to the 400 year old Yuyuan Gardens in Shanghai, which are well worth a visit. Now I'm sitting here next to the Pearl River in Guangzhou waiting to do DAT305: Choosing a High Availability Solution - so time to squeeze in another post.)

In the previous posts of this series, I covered the system table checks that have to be done before anything else can be checked by CHECKDB, and the allocation checks that have to be done before any of the logical checks. Now I want to cover the meat of the functionality in CHECKDB - the logical checks. Note - this is a description of what happens for SQL Server 2005 - it's pretty similar in SQL Server 2000.
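
Before diving in, here's a quick sketch of how you'd typically invoke these checks (the database name is just a placeholder):

    -- A full run of the logical checks described below, with informational
    -- messages suppressed and without capping the number of errors reported
    -- per object.
    DBCC CHECKDB (N'MyDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;

    -- For comparison, PHYSICAL_ONLY restricts the run to the page/record
    -- header audits and the allocation checks, skipping most of the logical
    -- checks covered in this post.
    DBCC CHECKDB (N'MyDatabase') WITH PHYSICAL_ONLY;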

Primitive checks of critical system tables
(Part 1...)

Allocation checks

(Part 2...)

 

Logical checks of critical system tables
Logical checks of all tables
Although this section has two titles, the actual logical checks performed in each stage are the same - i.e. everything I'm going to describe below. If any errors are found in the critical system tables (remember these from Part 1?) and:

  • repair is not specified; or,
  • repair is specified, but not all the errors can be repaired

then CHECKDB terminates. An example of an unrepairable system table error is something that would require deleting data from one of the system tables - e.g. a corrupt key value in a clustered index data page of sysallocunits (remember that this is the actual table I'm talking about, not the sys.allocation_units catalog view you may have seen or used).
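
If you're curious, the real system tables are visible (read-only, and strictly for poking around on a test system) over the dedicated administrator connection. A hedged example, assuming a 2005 instance:

    -- Over a normal connection, only the catalog view is visible:
    SELECT TOP (10) * FROM sys.allocation_units;

    -- Over the dedicated admin connection (ADMIN:servername, or sqlcmd -A),
    -- the underlying base table can be selected from as well:
    SELECT TOP (10) * FROM sys.sysallocunits;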

 

So if the system tables are clean, all the tables in the database are checked. This includes indexed views and primary XML indexes (which are both stored as clustered indexes - and as far as the storage engine is concerned are objects in their own right - it's the relational layer that knows that they're not really separate objects).
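
To make that concrete, here's a hedged sketch (the table, view and index names are made up) of the two kinds of objects in question - in both cases it's the clustered index created by the DDL that CHECKDB actually processes:

    -- An indexed view: the unique clustered index is what materializes the
    -- view as a rowset of its own.
    CREATE VIEW dbo.OrderCounts WITH SCHEMABINDING AS
        SELECT CustomerID, COUNT_BIG(*) AS OrderCount
        FROM dbo.Orders
        GROUP BY CustomerID;
    GO
    CREATE UNIQUE CLUSTERED INDEX CIX_OrderCounts ON dbo.OrderCounts (CustomerID);
    GO

    -- A primary XML index: also stored internally as a clustered index
    -- (the base table must already have a clustered primary key).
    CREATE PRIMARY XML INDEX XIX_Documents ON dbo.Documents (DocBody);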

 

The following checks are performed:

  • Validate each table's storage engine metadata
  • Read and check all data, index and text pages, depending on the page type
  • Check all inter-page relationships
  • Check the page header counts
  • Perform any necessary repairs

Let's look at each of these stages in more detail.

 

Validate each table's storage engine metadata

Why do we need to do this? Well, the storage engine metadata is what tells CHECKDB how to crack open records on pages in particular rowsets - so if there's something wrong with it then CHECKDB will generate a bunch of misleading errors or, worse yet, miss some errors. Here's roughly what happens while the metadata is parsed:

  • Loop through all the indexes, and then all the rowsets of each index, building a list of known allocation units for the index (including all the DATA, LOB, and SLOB allocation units). This allows reverse lookups - the information stamped on a page is its allocation unit, so we need a very fast way to convert from an allocation unit back to an index/object ID (you can do the same kind of lookup yourself through the catalog views - see the query after this list).
  • Make sure we skip any indexes that are in the middle of being built/rebuilt online, because they will not have a complete set of data.
  • Build a list of computed columns and generate the necessary code to enable the column values to be recomputed.
  • Build a list of columns that are used in non-clustered indexes
  • Make sure the various USED, DATA, RSVD counts are not negative (see the BOL for DBCC CHECKDB for an explanation of how this can happen - yes, I'm deliberately not explaining here so that you read through the new BOL entry :-)
  • Figure out what each kind of column is (e.g. a regular column, a generated uniquifier, a dropped column)
  • Build a series of mappings to allow conversion between column IDs at different levels of storage abstraction
  • Check that the relational and storage-engine nullability flags for a column agree
  • Make sure the column counters in metadata match what we've just seen
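
CHECKDB builds that reverse-lookup internally from the real system tables, but you can answer the same question yourself from the catalog views. A sketch, using the allocation unit ID from the 8976 error quoted later in this post (run it in the database the error came from):

    SELECT OBJECT_NAME(p.object_id) AS table_name,
           i.name                   AS index_name,
           p.partition_number,
           au.type_desc
    FROM sys.allocation_units AS au
    JOIN sys.partitions AS p
         ON (au.type IN (1, 3) AND au.container_id = p.hobt_id)       -- DATA / SLOB
         OR (au.type = 2       AND au.container_id = p.partition_id)  -- LOB
    JOIN sys.indexes AS i
         ON i.object_id = p.object_id AND i.index_id = p.index_id
    WHERE au.allocation_unit_id = 72057594042318848;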

Read and check all data, index and text pages

No matter what type a page is, the page is audited and then all the records on it are audited.

 

Page audit checks first of all for IO errors when the page is read (e.g. page checksum failures or torn-page detection). If there are any, the page is not processed any further. This is what can lead to errors like 8976 being reported (e.g. Table error: Object ID 132765433, index ID 1, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type DATA). Page (1:3874) was not seen in the scan although its parent (1:3999) and previous (1:3873) refer to it. Check any previous errors.). Then it checks for page header correctness and that the page has an appropriate type for the allocation unit it's in (e.g. a DATA page should not be found in an allocation unit for a non-clustered index).
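
On the subject of IO errors - page checksums are what give CHECKDB (and ordinary reads) the best chance of catching this kind of corruption. A hedged example, with a placeholder database name:

    -- See which page protection option each database is using...
    SELECT name, page_verify_option_desc FROM sys.databases;

    -- ...and switch one to checksums. Existing pages only pick up a checksum
    -- the next time they're written, so the protection phases in gradually.
    ALTER DATABASE MyDatabase SET PAGE_VERIFY CHECKSUM;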

 

Record audits include checking the various fields in the record header and that the various offsets in the record make sense (e.g. the offset to the variable-length columns section of the record should not point off the end of the record). See the post on cracking records using DBCC PAGE for more info on the record format.
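
If you want to see the record format for yourself, the (undocumented, unsupported, but read-only) DBCC PAGE command will crack the records on a page - e.g. the previous page from the 8976 example above, assuming a placeholder database name:

    -- Trace flag 3604 sends DBCC PAGE output to the client rather than the
    -- error log.
    DBCC TRACEON (3604);

    -- Arguments: database, file ID, page ID, print option.
    -- Print option 3 dumps each record cracked into its individual columns.
    DBCC PAGE ('MyDatabase', 1, 3873, 3);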

 

The more complex checks that are done per-page depend on what type the page is. As an example, here's what is done for a DATA page in the leaf level of a clustered index (excluding the inter-page relationships - I'll list those in the next section):

  • Records in a page must be strictly ordered by the defined keys of the index (although the records themselves aren't necessarily stored in sorted order in the data portion of the page, accessing the records through the slot array must yield them in the correct order)
  • No two records can have duplicate key values (remember that non-unique indexes have a hidden, automatically-generated uniquifier column added to the key - to make or extend the composite key - so that record uniqueness is guaranteed)
  • If the index is partitioned, each record is run through the partitioning function to ensure it's stored in the correct partition
  • All the complex columns in each record are checked:
    • Complex columns are those storing legacy text or LOB values (text, ntext, image, XML, nvarchar(max), varchar(max), varbinary(max)) or in-row pointers to variable length columns that have been pushed off-row in rows that are longer than 8060 bytes
    • The column is checked to make sure it's storing the right kind of data - either the value itself, or a text pointer, or some kind of in-row root containing pointers to portions of an off-row value.
    • The linkages between what is stored in-row and the off-row values stored in other pages are eventually checked too
  • Check computed columns
    • If the column is persisted (either because it's defined as a persisted computed column or because it's used as a non-clustered index key), its value is recomputed and checked against the persisted value
    • This is also important when we come to do the non-clustered index cross-checks - as any discrepancy in the stored column values will cause mismatches.
  • Data purity checks
    • The column value is checked to ensure it's within the bounds for its data-type (e.g. the minutes-past-midnight portion of the internal representation of a datetime value cannot be greater than 1440 - 24 hours x 60 minutes). There's an example of running these checks explicitly on an upgraded database just after this list.
  • Non-clustered index cross-checks
    • This is probably my favorite part of CHECKDB and is one of the most complicated bits of code.
    • What CHECKDB is trying to do is make sure that each record in a heap or clustered index has exactly one matching record in each non-clustered index, and vice-versa. The brute-force (n-squared complexity) way to do this is to do a physical lookup of all the matching rows - but that's incredibly time consuming so instead we have a fast algorithm to detect problems.
    • Imagine a table defined by the following DDL:

      CREATE TABLE t (c1 INT, c2 CHAR(10), c3 VARCHAR(MAX));

      CREATE INDEX index1 ON t (c1);

      CREATE INDEX index2 ON t (c2) INCLUDE (c3);

    • Each row in the heap has to have two matching non-clustered index rows, one in each of index1 and index2. But how can we tell without doing the direct lookup or having to store tremendous amounts of state? We use a hash-bitmap algorithm.
    • Imagine a large bitmap - say half a million bits. Initially all the bits are zero.
    • For each record in the heap that we come across, we generate the matching non-clustered index records and hash them to a value. This is very quick because we have all the column values we need. We then take the hash value, map it directly to a bit in the bitmap and flip the bit. So in our example above, each record in the heap will produce two hash values (one for the record in each non-clustered index) and so will cause two bits to be flipped.
    • For each record in each non-clustered index, just hash the record and flip the corresponding bit.
    • The idea is that the bit-flips should cancel each other out and the bitmap should be left with all zeroes at the end of the checks.
    • Taking the view that corruptions are orders of magnitude rarer than clean databases, we went a step further and allowed the checks for multiple tables to use the same bitmap. If you think about it, this won't cause any problems, even if two sets of records map to the same bit in the bitmap - as long as every bit gets flipped an even number of times (i.e. each record really does have its correct matching record) then there should be no problem.
    • Here's the catch - what happens when there's a bit left on in the bitmap? Well, this is where the trade-off comes into play. If there's a bit left on, we can't tell which records in which table or index mapped to it, so we have to re-scan the tables and indexes that used the bitmap to see which records map to the bit. For every one we find, we actually do the physical lookup of the matching record and then do a comparison of all the columns, including any LOB columns used as INCLUDE'd columns in non-clustered indexes. This process is called the deep-dive and can add a significant amount to the run-time of CHECKDB if it occurs. (There's a rough illustration of the hash-and-cancel idea just below.)
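
To make the cancelling-out idea a bit more tangible, here's a rough T-SQL analogue using the table above. This is emphatically not what CHECKDB runs internally (CHECKDB hashes the complete generated index record, including the row locator and any INCLUDE'd columns) - it's just the flavor of "hash both sides and check that everything cancels":

    -- Hash index1's key column once via the heap and once via the
    -- non-clustered index itself; if every index row has exactly one matching
    -- heap row, the two aggregates come out the same. A match doesn't prove
    -- correctness (hash collisions can hide problems - hence the deep-dive),
    -- but a mismatch definitely means something is wrong.
    SELECT CHECKSUM_AGG(BINARY_CHECKSUM(c1)) AS heap_side
    FROM t WITH (INDEX (0));        -- force the heap scan

    SELECT CHECKSUM_AGG(BINARY_CHECKSUM(c1)) AS index_side
    FROM t WITH (INDEX (index1));   -- force the non-clustered index scan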

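Stepping back to the data-purity checks mentioned above: in SQL Server 2005 they're on by default for databases created on 2005, but for a database upgraded from an earlier version you have to run them explicitly once - after a clean run they're included in every subsequent CHECKDB. A sketch with a placeholder database name:

    -- One-off run of the column-value checks on an upgraded database.
    DBCC CHECKDB (N'MyUpgradedDatabase') WITH DATA_PURITY, NO_INFOMSGS;
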
Check all inter-page relationships

Inter-page relationships are relevant for:

  • pages in heaps that have forwarding or forwarded records
  • pages in indexes
  • pages that have records with LOB columns that have their data stored off-row (these can be heap pages, clustered index data pages or non-clustered index leaf pages in SQL Server 2005)
  • text pages that have records with child nodes on other pages

Continuing the example above, here are the checks that are done for index pages:

  • All pages in an index level should be pointed to by a page in the next level higher in the b-tree, and also by the left- and right-hand neighboring pages in the same level of the b-tree. Exceptions are made for the left-most and right-most pages in a level (where the m_prevPage and m_nextPage fields are (0:0)) and for the root page of the b-tree, where the parent page link comes from the storage-engine metadata. The 8976 error message that I referenced above is generated from this set of checks.
  • Key ranges in neighboring pages in a level should not overlap
  • Key ranges of pages should be correctly defined by the parent pages in the next level up in the b-tree (the parent pages contain the minimum key value that can exist on a child page)
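
A handy (undocumented and unsupported, so test-box only) way to eyeball these linkages yourself is DBCC IND, which lists every page in an index along with its level and its previous/next page pointers - placeholder database and table names below:

    -- Arguments: database, table, index ID (1 = the clustered index).
    -- The output includes IndexLevel, PrevPagePID, NextPagePID and PageType
    -- for every page - exactly the web of pointers these checks validate.
    DBCC IND ('MyDatabase', 'MyTable', 1);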

I'll go into details of how the inter-page checks are done in a future post. For now, it's enough to say that it's not an n-squared complexity algorithm.

 

Check the page header counts

The page header contains a bunch of counters - the ones we need to check are:

  • slot count
  • ghost record count
  • free space count

The first two are obvious - count the rows as they're processed and make sure page header counts are valid. The free space count is only checked for text pages and data pages in a heap (the only pages for which free space is tracked in a PFS page).
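
The same DBCC PAGE trick shown earlier works here too - print option 0 dumps just the page header, where these counters live (the page number below is a placeholder):

    DBCC TRACEON (3604);

    -- Look for m_slotCnt, m_ghostRecCnt and m_freeCnt in the header output -
    -- the slot, ghost record and free space counts being validated here.
    DBCC PAGE ('MyDatabase', 1, 153, 0);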

 

Perform any necessary repairs

I want to leave discussing repairs for another post as there's a ton of info there.

 

So, that's the majority of the work in running a DBCC CHECKDB and now it's time for my session on choosing an HA solution in SQL Server 2005. Next time I'll post the next part of the Fragmentation series.

 

(So I mentioned jellyfish above... I was having a meal last night with Scott Schnoll from the Exchange team and we had what we thought were noodles on the side of one dish. Scott asked what kind of noodles they were and then we found out we'd been eating cold, sliced jellyfish - it was really very tasty!)

 

Service Broker checks
Metadata cross-checks
Indexed view and XML index checks
(Part 4...)