FAQs CHECKDB in SAP SQL Server Databases…Again

Article
10/30/2013

1. What is CHECKDB?

2. What is the reason for yet another post about CHECKDBs on SAP SQL Server databases?

3. What is database corruption?

4. What can cause database corruption?

5. Why should you be concerned about corruption in a database?

6. How does one learn that they have corruption in a database?

7. What should you do when you encounter signs of corruption in an SAP database on SQL Server?

8. How frequently should you perform CHECKDB on SAP SQL Server Databases?

9. Is it necessary to perform BACKUPs and CHECKDB on SAP SQL Server Databases that are already using one of SQL Server’s High Availability (HA) offerings; like Always On, or Mirroring, or Log Shipping?

10. Is the PHYSICAL_ONLY parameter for CHECKDB sufficient?

11. Is there any guidance to offer for very large and highly used SAP systems?

12. If CHECKDB is clean of any signs of any problem, is there any risk that my database might be inconsistent?

13. If CHECKDB responds that there are signs of corruption but that the REPAIR_REBUILD option should be sufficient to remove the corruption, what should I do?

14. If CHECKDB responds that there are signs of corruption but that the REPAIR_ALLOW_DATA_LOSS option is the minimum level to remove the corruption, what should I do?

15. Is it possible for database backup images to become corrupt?

16. Is there anything else I can do to reduce the risk of corruption in my database?

17. Is there anything that can be done to speed up CHECKDB performance?

18. Is CHECKDB even still required?

19. Summary Recommendations

1. What is CHECKDB?

“CHECKDB” refers to the SQL Server database checking utility (DBCC) command DBCC CHECKDB that is documented in SQL Server Books Online like HERE. It is the best way to check the internal consistency of a SQL Server database that SQL Server development provides. Consistency being described HERE.

2. What is the reason for yet another post about CHECKDBs on SAP SQL Server databases?

Some SAP customers still don’t periodically run database consistency checks (e.g. CHECKDB). We in SAP-SQL Server support know this because we continue to encounter corrupt database situations every so often, and in most of these situations CHECKDB was not regularly run. With aging hardware and challenging economic times, databases with corruption problems continue to trickle in and are not trending to nothing. For any number of reasons, there is a wide variance in technical skills that is encountered, and those at the “tail end of the distribution” can benefit from further amplification on this topic.

3. What is database corruption?

Databases exist to store data so that it can be accurately, reliably, and usefully retrieved. Database corruption is a state of a database where it is unable to provide data that had been previously saved. When a database (or any storage repository of data) has no corruption within it, it can be said to be ‘consistent’. Data in digital computers resolves only to streams containing differing sequences of zeros and ones and other dissimilar streams referring to those. When even one of those ‘bits’ is out of its expected sequence, that database can be considered ‘corrupt’. But similar to mutations in DNA, not all such changes lead to a full-blown disease condition. Sometimes the mutations can be benign. But both genetic mutations and data corruptions are unexpected states that should be avoided and preparations laid for their probable occurrence. More information on database consistency and corruption HERE and HERE {Wikipedia}

4. What can cause database corruption?

Anything that can alter the data inside of a database or the information critical to the interpretation of that data can give rise to corruption. Things that affect the physical data itself or the meta-data used to interpret the data in a meaningful manner can cause corruption: natural disaster, environmental problems (e.g. overheating), hardware, firmware, drivers, malware, and human action. On former versions of SQL Server that are no longer supported, as with any RDBMS product, software bugs could occasionally be the source or corruption too. And though we seldom run into such situations, it is something to always keep in mind.

5. Why should you be concerned about corruption in a database?

In most cases, corruption arises from problematic environments. And typically problematic or sick environments don’t generally heal themselves. Consequently, after corruptions start to appear, more will follow. Signs of corruption are symptoms of an unhealthy system. Similar to the human body: criticality can vary, but generally when unhealthy symptoms are detected they should be investigated with the help of trained professionals. The problems should ideally be understood, the root causes eliminated, and procedures followed to restore things back to a normal healthy state. It should also be noted that corruption should be anticipated in the same way that sickness and injury might insult your body. One seldom expects to get hurt or sick, but one should be mindful and plan for the situation when it will happen.

6. How does one learn that they have corruption in a database?

Most commonly corruption is identified by end-users as they interact with data and fail to be able to perform the productive work they are trying to do. Usually there will be some kind of errors that are raised or recorded in log files like SQL Server ERRORLOGs, the Windows System and Application Event logs, or the event logs of the application that encountered the failure (e.g. for SAP Netweaver systems ST22, SM21, or ST11 Dev Trace files). Examples of some errors that usually indicate corruption:

Msg 824, Level 24, State 2

Msg 845, Level 17, State 1

Msg 926, Level 21, State 6

Msg 8909, Level 16, State 1

Error: 3414, Severity: 21, State: 1

7. What should you do when you encounter signs of corruption in an SAP database on SQL Server?

All corruption messages should be taken seriously. The overall steps for handling corruption can be found HERE. When corruption messages are encountered, the first thing that is needed is a current CHECKDB on the affected database.

The quickest way to get a current one is to empty the cached data buffers on that SQL Server instance using the tSQL utility command DBCC DROPCLEANBUFFERS and then get a full DBCC CHECKDB with NO_INFOMSGS on the database.

If there are no overt symptoms to the affected system and the production demands are such that dropping the clean buffers would cause excessive operational problems (which would be a very rare situation), then it is also fine to take a BACKUP of the affected system (after the time of symptom presentation) and restore it onto another SQL Server instance and obtain the full DBCC CHECKDB with NO_INFOMSGS on that server.

8. How frequently should you perform CHECKDB on SAP SQL Server Databases?

As with most things, in the RDBMS world since it functions more like a platform than an application: it depends.

It depends on many things that only the data users, the application deverlopers, and the database and system owners can determine. In general it seems to come down to: How critical is the data contained? And How risk averse are you? The most common frequencies considered are Daily, Weekly, Monthly, or only on system architecture change?

A few of the questions for considering to help you assess the appropriate frequency:

What’s the worst case that might happen?

Hypothetically: you might encounter undetected corruption in a critical database that prevents using your intended restore process

What’s the likelihood that the worst case would happen?

Very very small. But not completely zero.

How many resources do you have?

Do you have spare systems sufficient to house restored images of backups of critical systems? If so, can you use those? If not, why not?

Will you leave your house today?

Things happen. This is an important question for perspective. For this is roughly the same likelihood that is involved here.

No one can answer these things for you.

For myself, I would start with Getting a full CHECKDB Before my first Go Live {to ensure full consistency then}. And then again, soon after Go Live on a RESTORED backup image to also validate the viability of the backup/restore strategy. And I would always run additional CHECKDBs after any changes or alterations to the disk IO subsystem of any kind. After that, it would depend on resources. All SAP customers will have DEV and QA systems. Many will also have Stress Testing servers. And some will even have PRD-1 testing servers to use as mockups before moving things into production. Many of these systems are not used all the time and could be leveraged when not in use to intermittently restore BACKUP images and then run CHECKDBs run on those images. I would try to get a CHECKDB every month, at least.

SQL Server on Windows Azure VMs

Always On

Database Mirroring

Log Shipping

Yes, some SQL Server HA solutions reduce the need for checkdbs (“Always On” and “Mirroring” since SQL Server 2008) because they have some automatic data recovery features built in as described HERE. But High Availability isn’t identical with Disaster Recovery. And even though there are some auto-correcting features in these recent versions, they do not decrease the need to take BACKUPs and to confirm the viability of RESTORE of BACKUPs in the event of catastrophes of any kind. And as part of a fully vetted Disaster Recovery Plan Validation effort, a key part of the assessment of the RESTORE operation should be to run CHECKDB to ensure the consistency of the database.

With “Always On” (pre-release known as HADRON) there is also the ability to perform the CHECKDBs on one of the readable secondaries, as is described HERE.

One must note that all of these HA offerings have the power of transmitting the database changes recorded in the transaction log from the primary source database to the destination secondary HA database. And since most corruptions are physical in nature, this helps prevent corruption from moving onto the HA secondary database. BUT, one must also keep in mind that CHECKDBs performed ON the secondary will only be assessing the consistency of that secondary, and there still could be existing corruption that exists on the primary source database. That is: a clean CHECKDB on a secondary does not guarantee that the primary is clean and consistent. Awareness of this fact can help to more clearly understand where corruption actually exists. CHECKDB on the primary source database can sometimes be called for even though the CHECKDB on a secondary fails to indicate any concern.

10. Is the PHYSICAL_ONLY parameter for CHECKDB sufficient?

It is far better than nothing. It will catch almost all problems. It might not catch all. It depends on risk aversion. A full DBCC CHECK will check some additional things that the PHYSICAL_ONLY won’t. But to date, none of these additional checks are known to have identified any substantial corruption that the PHYSICAL_ONLY check had missed. These additional checks aren’t expected to identify additional instances of severe corruption.

It is recommended to regularly run CHECKDB using the PHYSICAL_ONLY parameter and to intermittently run full CHECKDB. The PHYSICAL_ONLY parameter is believed to be sufficient to catch all signs of substantial corruption.

And it is also recommended, as in point 16 below to be certain to use the CHECKSUM property for all SAP Databases, as has been recommended for years.

11. Is there any guidance to offer for very large and highly used SAP systems?

Have a read on Juergen’s blog HERE.

12. If CHECKDB is clean of any signs of any problem, is there any risk that my database might be inconsistent?

Potentially Yes, if you have ever used RADL or LOGREBUILDING or you have unreliable administrators or others with direct access to the data. And why these activities are discouraged.

Application Logical corruption is significant and can be introduced by any of these activities. More information about Application Logical corruption can be found HERE.

But outside of these caveats and by and large: NO. A clean DBCC CHECKDB on the entire database (typically using the WITH NO_INFOMSGS clause) is the best tool available to ensure that your SQL Server database is consistent.

13. If CHECKDB responds that there are signs of corruption but that the REPAIR_REBUILD option should be sufficient to remove the corruption, what should I do?

Investigate, identify, and eliminate the root cause of the problem.

Backup the database as sound contingency planning

When convenient, stop all application work into the database.

Run DBCC CHECKDB using the REPAIR_REBUILD option

Generally in this situation, these steps work fine, and you will not need to even restore from backup.

14. If CHECKDB responds that there are signs of corruption but that the REPAIR_ALLOW_DATA_LOSS option is the minimum level to remove the corruption, what should I do?

Open a case with support experts. Either an SAP customer support issue in the BC-DB-MSS queue or a Microsoft SQL Server CSS customer case. Or consider opening a case with both. The value of the SAP support issue is to understand the significance of the corruption from the application perspective when possible. The value of the Microsoft SQL Server CSS support issue is to offer 7x24 support in working through problems with the backup/restore process and actual removal of the corruption. When a critical server down situation exists, the 7x24 availability is usually the key. But when the server is up and running it is typically the application dependency on the logical integrity that ends up being more important.

Take note, however, that for SAP customers the standard firmly stated policy to customers is to “Fix the cause” and to RESTORE FROM BACKUP in the event of physical corruption. And that the customer owns the viability of that process. If, for whatever reason, that should prove to be unacceptable {e.g. the backup images themselves can be corrupt} then SAP offers a non-guaranteed service for data recovery that is available at only a HIGH priority level of issue. This is consistent across all RDBMS providers.

The point of emphasis for SQL Server customers running SAP software: First, make sure that your backup strategy is sound and well understood and will satisfy business expectations (in regards to potential loss of data and time to recover and resume normal operations). Second, that SAP will only pursue data recovery efforts as a paid service and at only a HIGH level of issue priority. This is articulated in SAP Note 1597910 [via SAPNet HERE].

15. Is it possible for database backup images to become corrupt?

Yes, just backing up a database doesn’t mean that those backup images are guaranteed to be sound. Or that the backup solution or technology hasn’t encountered additional corrupting events enroute to or while existing on the backup destination.

The only assurance that any backup image is fully reliable is to obtain a separate backup system, restore from the backup image, and obtain full DBCC CHECKDB on that restored image. This should be done periodically. How frequently? The response is similar to previous question.

16. Is there anything else I can do to reduce the risk of corruption in my database?

Stay on current drivers, firmware, hardware and software versions.

Make sure to enable the CHECKSUM option on all SQL Server databases and on all BACKUP images as Cameron mentions near the top HERE. But be aware, that enabling it only adds the CHECKSUM protection for every subsequent time that a page will be written by SQL Server. i.e. pages that are on disk and aren’t written won’t have the checksum total saved for the page simply by enabling the option. That is, CHECKSUM helps greatly, but only for database pages that will get written to disk by SQL Server. All CHECKSUM does is provide quicker corruption notification, it does not prevent corruption. It is still important to run CHECKDB periodically, even if CHECKSUM is set.

CHECKSUM on the database is generally not impactful to performance, but on the BACKUP itself might show some performance impact. Assess that for yourself.

17. Is there anything that can be done to speed up CHECKDB performance?

Have a read on these two Bob Ward blogs on a Faster CHECKDB:

Part I {brief}

Part II {later, more complete explanation}

The last 4 of the 5 recommendations at the bottom of “Part II” are sound recommendations to consider. The first one, which is the most helpful to improve performance, can be problematic for us in the SAP space where MAX Degree of Parallelism (MAXDOP) is generally set to 1 under normal operating conditions.

If a separate, isolated backup system is leveraged for running CHECKDB {typically as part of a robust Disaster Recovery Plan}, then MAXDOP can and should be set to zero. But sometimes such isolated systems are not available and the database server is already resource constrained (I/O constraint being the most common). In such environments there are two other things than can be considered. One option is to dynamically switch the max degree of parallelism back to 0 for the execution of CHECKDB. This of course would need to be done during available windows where other workload is minimal. sp_configure can be used to alter the setting of ‘max degree of parallelism’ away from the normal SAP configured setting of 1 to a value of 0 and back again afterwards. Executing CHECKDB inside of a batch can work well on a server when no other productive work is happening at that time.

And the other option is to set up a separate resource group and assign MAXDOP to that isolated resource group and let it run in that protected context. Cameron Gardiner mentions this at the end of the post HERE.

The MAXDOP for SAP systems should always remain at 1 for optimal performance.

18. Is CHECKDB even still required?

What Robert Dorr wrote in 2009 is still true today.

https://blogs.msdn.com/b/psssql/archive/2009/02/20/sql-server-is-checkdb-a-necessity.aspx

19. Summary Recommendations

1) Get CHECKDBs. At least intermittently. Using the PHYSICAL_ONLY clause is fine.

2) Intermittently validate your Disaster Recovery Plan. i.e. Restore a backup onto another SQL Server instance and get a full DBCC CHECKDB with NO_INFOMSGS on that restored image.

This ensures you know how to restore.

And this ensures that you actually can restore

3) Understand the information in the databases under which you have responsibility.

How critical is the information?

Comprehend the logical integrity needs of the applications that access that data.

Know what the impact of logical corruption is to the applications that use the data.

4) Confirm the tolerable loss of information and the time to recover from any disaster event by the consumers of the data.

Not so much what they want, but what they can realistically tolerate {for this will help scope what should be in place in response to a disaster event}

5) Set the CHECKSUM options in the database and BACKUP, if you haven’t already.

6) Know what you will do when corruption is encountered in your database. Every DBA/SA should expect and plan for it to happen. And simply be pleased that it has not happened….today.

FAQs CHECKDB in SAP SQL Server Databases…Again

Additional resources