Sensitive question – CHECKDB when and how to execute it in large SAP systems

As the data volumes of our customers increase more and more often questions come up whether to run CHECKDB or when to run CHECKDB or how to run CHECKDB. Questions by customers usually get raised when the run time of CHECKDB more or less spawns most of the weekend and comes into conflict with other tasks like full online backup. When we look at our customer base, then we can see customers doing everything from running CHECKDB on production on a weekly basis over running it on a restored copy of production on a monthly basis to running CHECKDB in case of a storage issue only to (unfortunately) never running it. It all is dependent on the systems and resources customers have and to a degree the knowledge and expertise and budget available. But let’s go step by step

Problem with running CHECKDB on large databases

CHECKDB reads all pages of tables and indexes in order to perform consistency checks on the format of the physical structure like pages and extents. It also checks the format of the content as well as allocation consistency and assignment consistency (does an index row point to the correct data row). For this purpose I/O bandwidth and CPU resources plus a bit of memory are used. If I restrict the # of CPUs for CHECKDB, the longer it will run. If I give it more resources, the more I am impacting my production due to its I/O workload and CPU consumption. So for a lot of customers it is difficult to find the sweet spot. Often enough keeping the workload going takes precedence and the throughput of a CHECKDB run lands somewhere in the 200-300GB/h. Looking at a 10TB database of which we have quite a few at our SAP customers, the run time becomes excessive. Sure one could run CHECKDB with the option ‘PHYSICAL_ONLY’ and restrict the number of checks to a subset around checking physical consistency of pages mainly. The reduced number of checks certainly would reduce the run time, but no the I/O workload since still every page needs to be read. Hence we need to find a concept which does minimizes the need for running CHECKDB on the one side, but still assures that physical damage to the page since it got stored on disk is detected not too long after the damage happened. For that we can apply a 3 step procedure

 

Step 1:

Only operate your SAP databases with the database option ‘checksum’ set. This is the default setting for all databases created on SQL Server 2005 and later SQL Server releases. Unless somebody who wanted to shave off a few CPU cycles didn’t disable ‘checksum’, it is on. For database which got upgraded from SQL Server 2000, the ‘checksum’ option needs to be explicitly set as the database option. From the point on where this option is set, every page written to the database, contains a checksum in the database header, which was built reading the page. As soon as a bit on the page changes, the checksum wouldn’t match with the content on the page anymore. Having the checksum option set, SQL Server will do this check with every read from the disk. If the page got physically damaged while on the media or in transfer, SQL Server would give an error. This all works for pages which get read frequently and which have a checksum (all pages which got changed since the move to SQL Server 2005 happened and the time checksum got enabled). For database which got upgraded from SQL Server 2000 or which got installed on later releases w/o enabling checksumming, there is no way from SQL Server side to check which pages since the change to checksumming got actually checksummed. Especially customizing tables in SAP databases are mostly read-only and hardly receive changes. Hence w/o any activity like export/import due to a Unicode migration or rebuild of all the indexes, there can’t be certainty about every and each page being checksummed. As mentioned the case looks very different for SAP systems which got installed under default settings of SQL Server 2005 and later SQL Server Releases. But what about financial data of 9 months ago which only will be touched when the year-end reporting is running? This is where step #2 comes in

 

Step 2:

With the introduction of the database option ‘checksum’ in SQL Server 2005, the backup functionality also was extended to check the checksums of the pages which got written to the backup. Since the normal case is that our SAP customers even with largest databases execute at least one full database backup the week, there is a chance to check whether anyone of the pages got physically damaged. However this check is not enabled by default in anyone of the SQL Server releases SQL Server 2005, 2008 2008R2. The check needs to be enabled with the backup command. It can be seen here in a command I traced backing up a little test database:

BACKUP DATABASE [testl] TO  DISK = N’c:test1_full.bak’ WITH NOFORMAT, NOINIT,  NAME = N’testl-Full Database Backup’, SKIP, NOREWIND, NOUNLOAD,  STATS = 10, CHECKSUM

When looking into the backup wizard, there is a checkbox which enables this option in the options tab of the backup wizard. What does this option exactly do? For every page which is pushed to the backup media, SQL Server calculates the checksum and compares it with the one in the header. Means the physical consistency of a page is getting checked. Usually the thing we worry most about. The rest of the checks as done by CHECKDB are not done. The checksum test during backup is automatically done when leveraging SQL Server 2008 backup compression. In that case it doesn’t need to be specified separately anymore. Means with this step we get confirmation that bits didn’t change since SQL Server calculated the checksum of a page and wrote it to disk. This should detect the most significant source of corruptions we encounter with customers. As another side effect enabling checksum in the backup, SQL Server builds a checksum over the whole backup. This again enables SQL Server in the verify or in the restore process to detect physical corruptions which happened after the data got written to the backup media.

 

Step 3:

With the two steps upfront physical corruption or damages of pages which occurred in the I/O path or in storage locations get detected in short time. The very rare issue of successfully persisting a page which got damaged in memory already however isn’t detected. Sure CHECKDB does a lot of other checks which should be run from time to time, but the reason #1 of page content being changed transferring to the media or by an incident on the media got covered in step 1 and step 2. Now one can lower the frequency of CHECKDB on production to a longer period.

 

When to run CHECKDB as soon as possible?

CHECKDB should be run as soon as possible when storage problems came up. We had a customer a few days back where the SAN device literally crashed. They ran CHECKDB immediately after the SAN was up again and indeed detected issue in the table GLPCA where several pages were physically damaged. Other cases when CHECKDB should be run are cases where you see read and write errors reported in the Windows Event Viewer. Other triggers to run CHECKDB could be severe breaks and issues in the FC or iSCSI cabling or hardware.

 

More luxurious ways of performing CHECKDB

Some of our customers from the beginning opted for not running CHECKDB against the production system. Instead those customers perform a restore exercise every week or month against an older set of server hardware and storage. After the restore is finished they use that database to run CHECKDB against. Some of those customers use this system for the remaining time then as ‘destructive testing’ system. Or they use it to have SAP support looking at issues which they have in production, but which are not reproducible with the older dataset in the test system.

Some of our customers are using Database Mirroring. The mirror would give those customers the possibility to create a database snapshot on the mirror and run CHECKDB against the mirror. This is a fine thing and tells whether the mirror database is fine or not. However it doesn’t give any indication on physical consistency of the principal database. Since SQL Server Database Mirroring only ships the change records to the mirror, eventual damages done to pages on the principal are not transferred to the mirror. This can be very different to Storage Replication where a category of damages could be transferred to the secondary data mirror.

Additionally SQL Server 2008 introduced automatic repair of damaged pages under the condition that synchronous database mirroring is used

 

Backup Verification

Another related issue to the topic is verifying backups not only when performing them (like with checksum), but when they are on the backup media. The only way to check whether a backup is restorable end-to-end is to restore it. In SQL Server we can perform a ‘restore verifyonly’ which tells us as much as the data on the backup media does fulfill the requirements in terms of formats. Restore verifyonly performs the same steps as a normal restore, except copying the bits and bytes of the restore to disk. Means fundamental functionalities around a backup device being readable, the device hardware being able to function is tested. As of the quality what is on the backup device format of extents and pages are checked as well. If ‘backup with checksum’ or backup compression was used, two different checks on checksums are executed as well. First check is the checksum on the backup stream. Second check is as during the backup to check the checksum on each page and compare it with the entry in the page header section. So this ‘restore with verifyonly’ option gives already pretty good indications whether a backup is restorable. But the ultimate test for restore is to execute a real restore. Not only that a real restore can give the staff some exercise or familiarity with the processes around a restore, it also gives some indication of timing and other things to take care of when restoring. What to do with the restored database? As I mentioned earlier, it could be used for running complete CHECKDB on it or perform exercises one wouldn’t even dare to do in a development or test system.