I think you will all agree with me that database corruption is one of the nightmares in the life of a DBA. We store precious data in databases and each day the dataset grows bigger and bigger. A database corruption issue not only puts the data at risk, it also threatens to take a toll on businesses and revenues.
Throughout my career in Microsoft, I have seen many corruption issues; and while working with the DBAs for troubleshooting such corruption issues, I have felt the apprehension and anxiety within their minds. Hence, thought of putting down some posts related to database corruption. In my next few posts, I will be discussing a few common corruption scenarios and how to repair the databases in each of these scenarios or how to salvage maximum data off corrupted databases. However, in this post, I will mainly target the most common causes of database corruption and how to prevent database corruption in the first place. Additionally, I would be discussing various best practices to follow, so that we have a fallback plan just in case data cannot be recovered after a corruption.
Before I begin my discussion, I would like all my readers to understand that although Microsoft has written the SQL Server product, Microsoft does not guarantee that databases will never become corrupt or in case of corruption, 100% data can be recovered. However, I would like to assure you that throughout the code of SQL Server, Microsoft has taken utmost care to safeguard your data as much as possible.
Moreover, I would also like my readers to also understand that Microsoft is not a Data Recovery Company. Meaning, Microsoft Product Support Services (PSS) does not guarantee that if you call in with a database corruption issue, PSS would recover all your data. All support that PSS provides in corruption cases is on "best efforts basis", meaning that PSS will provide commercially reasonable efforts to recover your database or data off your corrupted database using documented and undocumented commands and procedures. However, 100% data recovery is not guaranteed.
Before I go any further, I would request my readers to go through the following articles:
=> Overview of the Microsoft third-party storage software solutions support policy
=> Microsoft SQL Server Database Engine Input/Output Requirements
=> Microsoft does not certify that third-party products will work with Microsoft SQL Server
The most common cause of database corruption (more than 95% of all corruption cases) that we in PSS encounter turn out to be caused by a platform issue, which is a layer below the SQL Server. The most common individual cause is a 3rd party driver or firmware bug. The next most common cause is an actual hardware fault and are typically either of Disk, Controller, CPU, or Memory Module(s). As of all Database Management Systems, SQL Server heavily relies on the Disk Subsystem for storing and retrieving data. Any issues in the Disk Subsystem, the Disk Controllers or the Communication Channels or even the Disk Drivers can cause databases to become corrupt.
If you encounter any corruption issue, please engage your hardware vendor(s) immediately. The hardware vendor(s) should primarily check the driver(s), firmware(s), and BIOS versions to ensure that they are running the most current and recommended versions. Additionally, they should run available hardware diagnostic tools to identify potential hardware issues. Each vendor typically has their own in-house utilities designed to detect obvious hardware problems. While OEM diagnostic utilities may be able to detect an obvious hardware failure, they are far from conclusive. In our experience it is common for these diagnostic utilities to produce a clean bill of health for a system that is later found to have a hardware problem.
The Windows Event Logs should also be constantly monitored for any Disk / Hardware related errors. It is not necessary for a corruption to be introduced or detected as soon as a Hardware Error is reported. A corruption case can be detected even months after a Hardware Error is reported. Consider a scenario when a disk suddenly became unresponsive when SQL Server was writing into it. This will obviously cause the data to become corrupt; however, this corruption will be detected only when we try to access the pages that had become corrupt; and this can be months after the corruption was actually introduced. Similarly, a bit flip while writing data to the disk subsystem, caused by a disk driver, can cause corruption, and this might be actually detected at a much later point in time.
Kernel Drivers and softwares like Antivirus, that have the ability to take Kernel Level Handle on files, are also known to cause database corruption issues. In case you have Antivirus Installed on the system, please ensure that you exclude the SQL Server Database Files from Antivirus scans. Please refer to Guidelines for choosing antivirus software to run on the computers that are running SQL Server.
Although very rare, we have seen a few people keep database files in compressed volumes and folders so as to conserve disk space. Having database files in compressed volumes and folders are not supported and can cause database corruption as well. Please avoid storing database files in compressed volumes and folders. In case you are concerned about disk space and would like to compress database files, please make use of data compression, introduced in SQL Server 2008. Additionally, please do not store your backups in compressed volumes and folders. Please utilize Backup Compression introduced in SQL Server 2008.
One last addition to the list of probable causes of database corruption includes Bugs with Microsoft products itself. PSS is always on the lookout and will always be on the lookout for evidence of Microsoft Bugs that can cause databases to become corrupt; and all such identified Bugs are fixed as soon as possible, so that our larger customer base does not get affected by those Bugs. However, it is always recommended to be on the latest builds of SQL Server and Windows to avoid database corruption caused by Microsoft Bugs. Specifically speaking, for each of the products listed below, please upgrade to the mentioned builds:
=> SQL Server 2000: SP4 and QFE build 2245
=> SQL Server 2005: SP2 or SP3 (SP3 recommended)
=> Windows XP: SP2
=> Windows 2000: SP4 + QFE KB 838647
=> Windows 2003: SP1 + QFE KB 940467
=> MSXML with SQL 2000: If MSXML is being used, remove /3GB in BOOT.INI or disable full MSXML garbage collection (see KB 321621)
If you would like to contact Microsoft PSS for a Root Cause analysis for your corruption issues, please document the answers to the following questions:
=> When did corruption first occur?
=> Has the corruption reappeared?
=> If it has, what is the interval in which it has reappeared.
=> Do you see any pattern in the reoccurrences?
=> Are multiple databases affected?
=> Are multiple tables within the same database affected?
Additionally, please share the following data:
=> The Windows Eventlogs exported in TEXT format.
=> The SQL Server Errorlogs.
=> The MSINFO32 output exported in TEXT format.
=> Complete hardware specifications. This should include SAN details in addition to the server’s hardware.
=> DBCC CHECKDB output showing the corruption (if available).
=> Get the output from the msdb..suspect_pages table.
For more details on suspect_pages please refer http://msdn.microsoft.com/en-us/library/ms174425(SQL.90).aspx.
Also, please be prepared to run the SQL Server MPSReports. This utility would be shared by Microsoft PSS once you are in contact with Microsoft PSS.
Finally, before I conclude on this post, I would like to request all my readers to ensure that you take regular backups of the SQL Server Database onto Tapes or other removable media. Since we are relying on hardware for storing data, and hardware is prone to failure, we must have a backup of all the data, so that we can revert back in case of hardware failures. Please take regular backups of all your databases, including the system databases. Backups might include Full Database backups, Differential Database Backups, File and FileGroup Backups and Transaction Log Backups. Microsoft has provided various types of backups in SQL Server, please make use of all of these. I would request all my readers to finalize and implement a backup strategy as soon as possible, if you have not already done so. This would help you to minimize data loss in case of corruption or failure. Moreover, after taking a backup, please ensure that you actually restore the backup on a test system to ensure the validity of the backup and that it can be used in case of database failure.
Additionally, there are various High Availability Solutions introduced by Microsoft, including, but not limited to, Log Shipping, Database Mirroring and Replication; please make use of these High Availability Solutions, to ensure that you have a duplicate database ready in case you encounter a database corruption issue or a database failure issue.
Hope this post is useful. Any comments will be highly appreciated.
Disclaimer: All information provided here is my personal opinion and is neither verified nor approved by Microsoft before it is published. All information, and code samples, if any, is provided "AS IS" with no warranties and confers no rights.