Corruption Handling in SAP Databases

What is database corruption?  Why should you be concerned about corruption in a database?  What should you do when you encounter corruption in an SAP database on SQL Server?  This blog post is relevant for handling a corrupt SAP NetWeaver database on SQL Server. 

 

Physical corruption happens on all database platforms with any Relational Database Management System (RDBMS).    Data is persisted on disks and natural, man-made, and manufacturing disasters happen.  In our support experience with corrupted SAP systems, we have seen physical corruption happen due to: system power failure, bad physical memory, SAN infrastructure problems, etc.  

 

What about SQL Server’s High Availability / Disaster Recovery (HA/DR) solutions:  2012 Always On, Database Mirroring, and Log Shipping?    Yes, those all would usually help because at least one or more extra copies of the physical data is present.   But SAP does not require that customers utilize SQL Server HA/DR solutions so not all systems are protected with multiple physical copies of the data.

 

This blog will discuss four main points:

(1)   How is physical corruption normally detected in a SQL Server Database?

(2)   What is the simplest, best way to recover from corruption?

(3)   What is the difference between physical and logical corruption and why is this so important for an SAP database?

(4)   The workflow to use in order to recover best from a corruption 

 

1. How is physical corruption normally detected in a SQL Server Database?

 

In SQL Server Support for SAP, we periodically process customer messages where a physical corruption has been detected via various errors. The most common ways corruption is detected are via:

  • SQL Server errors which indicate corruption are recorded in the SQL Server Error Log, Windows Event Log or SAP System Log (transaction SM21). The most common error codes are 823, 605, 644, 601, 2511, 8928, 8944, 8952 or 8976. Other error codes may apply.
  • The output of a DBCC CHECKDB or DBCC CHECKTABLE execution states there are substantial physical or logical corruptions that can’t be routinely repaired.

 

2. What is the simplest, best way to recover from corruption?

SAP Note 142731 [via SAPNet HERE]documents that the only officially supported process of handling corruption in SAP Databases on SQL Server is to restore from a known clean backup.    But it also correctly dfferentiates the two classes of corruption that can be encountered:  Those that are non-critical and can be repaired fairly easily and those that cannot be repaired from within the database itself.   The former usually happens when the corruption occurs on index pages and the latter when the corruption resides on data or metadata pages.  When the minimum level of repair needed to fix the corruption is stated as either “repair_fast” or “repair_rebuild” please follow the “Non-critical corruption” guidelines described in that SAP Note.

 

If you find evidence of significant corruption (where repair_allow_data_loss is listed as the minimum repair level needed), you must make every effort to restore from a clean backup. Some customers fear that restoring a full database and applying all the transaction logs is risky action because of the length of time it can take and the complexity. However, it is always far more complex, time consuming, and error-prone to try to repair or recover data manually.

 

Unfortunately, some customers do not have a recent clean backup which they can restore. In these cases, an attempt to repair or recover data is the only option available. Before moving into the repair or recovery steps, we will first explain some important principles about the SAP application that you must understand before trying to repair or recover physically corrupted data.

 

3. What is the difference between physical and logical corruption and why is this so important for an SAP database?

When corruption errors are logged, the first advice is to immediately run a SQL Server Database Consistency Check (or DBCC). When you do this, the DBCC output may direct you to execute a specific repair option. However, before running any repair command you must understand that the complete consequences of running the repair options.

 

The Microsoft documentation about DBCC CHECKDB and its repair options shows that there are really only two DBCC CHECKDB repair options:

  1. REPAIR_REBUILD – will perform repairs that have no possibility of data loss
  2. REPAIR_ALLOW_DATA_LOSS – can result in data loss

The documentation then states that if you execute REPAIR_ALLOW_DATA_LOSS then you should afterwards run DBCC CHECKCONSTRAINTS. The reason is that if your database has used primary and foreign key relationships to ensure application data (or logical) consistency, then any data deleted by the repair which would result in a logical inconsistency will be identified with DBCC CHECKCONSTRAINTS.

 

However, many Enterprise Database Applications (like SAP) have their own proprietary metadata repository (such as the SAP Data Dictionary, or DDIC) that allows for self-management of referential integrity relationships, cluster tables, etc.   The self-management means it does not rely on the RDBMS to provide the Declarative Referential Integrity (DRI) to manage the Primary Key and Foreign Key relationships.  Almost all SAP database tables have Primary Key constraints specified but without any foreign keys that are associated to them.   This is significant in that there is no way for the database to assess the logical consistency between tables and their data when those relationships aren’t defined.   That can only be done within the realm of the SAP Application module itself.   More information on Application and RDBMS consistency and corruption can be found HERE

 

Therefore, if you execute DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS on an SAP NetWeaver database you risk losing data in multiple SAP application areas which cannot be detected afterward by running any database statement.  And the application level logical inconsistencies can be far broader than simple data records.  Any or all of the following examples could happen:

 

  • A metadata reference in DDIC for a particular object can be eliminated; or 
  • Some detail records for a particular header record can be removed; or
  • Many header records could be eliminated leaving several detail records orphaned; or
  • Calculations and accumulations that had been performed and stored are now changed to be inconsistent.  

 

Identifying such application logical inconsistencies is almost impossible to do.   There is no such thing as a DBCC CHECKSAPDB utility to assess the application consistency, or to ‘fix it’ when is broken.

 

In other words, one can try to use REPAIR_ALLOW_DATA_LOSS to ‘clean things up’ and actually end up making things worse. You could remove all of the blatant physical corruption but you would be creating multiple instances of logical corruption for which the application has no awareness. This would result in incorrect business data and a complete loss of confidence in the database store.

 

This is why the first and best advice when you experience corruption in your SAP NetWeaver database is to restore your database from backup. But if no backup is available for restore, you should open an SAP customer Support message in the SQL Server application area (BC-DB-MSS) so that you can get expert assistance.  Be aware though that it is your responsibility to maintain backups of your data. If you have no backup, SAP Support can try to help you recover data but there is no guarantee it can be done and you may have to pay for the data recovery. Read more about this in SAP Note 1597910 [via SAPNet HERE].

 

If you try to fix such corruption either on your own or with the help of a third party, please consider these thoughts as you proceed to work on the situation.   Specifically:

1)      [CONFIRM-BAD] Confirm the corruption actually exists (in some rare situations transient errors can give a false positive impression that corruption exists)

i.e. Don’t restore from a backup onto a database that is actually NOT corrupt.   Always take time to confirm there actually is a problem first.   And understand the landscape too.  SAN replication can deliver unexpected results in terms of corruption being present at one time and not present at another (in these cases the corruption is only in one node of the SAN paired set so depending on which SAN you’re currently reading from, the data is corrupt or not).

 

  • DBCC DROPCLEANBUFFERS and obtaining a full DBCC CHECKDB with the NO_INFOMSGS option is usually the best way to do this.

2)      [PRESERVE] Preserve the current state often, even when thought to be bad.  Never make things worse. For example, backup frequently during your repair efforts as a contingency plan in case a mistake is made.  Don’t restore from a backup onto a production system and then find out the backup is more corrupt than the restored to image, which you now have overwritten.  Backup first.

  • Preserve what you do before you change anything.   And record somewhere all of the steps that you do.

3)      [UNDERSTAND] Understand the root cause of the corruption. Most all such physical corruptions are sourced in hardware, firmware, environment, or human error.   Try to identify the cause and the time when the first problematic symptoms were observed.

  • This is not always easy to do but it must be done before continuing with the repair.  For example, if the corruption was caused by bad memory then you must diagnose this and replace all the memory before proceeding. Otherwise, a new corruption will result at some point.
  • In every corruption case you must engage your hardware partner in order to do root cause analysis. They have tools to test all of the physical components.  Unfortunately, these do not always provide clear evidence of the root cause.
  • Even in corruption occurs in a non-production system you must engage the hardware partner as many components are either shared or the exact same model. For example, most of your systems most likely use the same SAN and if a corruption appears in one database on that SAN there could be corruption in others.

4)      [ELIMINATE] Eliminate the underlying root problem. Don’t waste time fixing corruption on a system where corruption has already happened without eliminating the root cause.  If it has happened once and you don’t understand the root cause, you cannot correct the root cause. This means it will happen again. If the root cause is ‘unknown’ then consider swapping out the entire hardware system if possible.

5)      [FIX] Fix the corrupt state and any symptoms of it. Again, the best way to do this is to restore from a known good, clean backup.  It is the best way to get a critical system back into production as quickly as possible.

Try to never use REPAIR_ALLOW_DATA_LOSS (as described HERE), but if that is your only option:

a)      Open a customer support message to Microsoft SQL Server Support or SAP Support so that you can work with an expert before you execute REPAIR_ALLOW_DATA_LOSS.

  • See the following SAP Notes for more information (SAP Service Marketplace Login required):

 142731 – DBCC checks in SQL Server

1297986 – Backup and Restore strategy for MS SQL Server

1420452 – FAQ: Restore and recovery with MS SQL Server

1597910 – Handling of database corruptions on SQL Server

b)      Identify the affected tables and the type of tables.

  • Some tables may contain technical logging information which is not business critical so you can truncate the table or lose data from it without affecting business operations.

c)      Rescue as much information as possible

  • Get table names and record numbers or rowcounts
  • Get the Primary Keys of the affected tables
  • Get the data if it is possible. The reference HERE documents several ways to do this.
  • Consider data Export/Import.   Although this seldom works when it hits the corrupt pages, it is very useful for getting out all of the unaffected data.  And can be fully helpful when dealing with corruption only on internal metadata structures.

d)      Document the impact.    Have a complete list of all the affected tables where data loss is expected, the size of the tables, and the number of rows believed to be deleted.

e)      For each affected table, open an SAP customer message in the application area component (e.g., an FI support message for FI tables, an SD message for SD tables, etc.).   Sometimes with the preservation of Primary Keys it might be possible for responsible application experts to identify lost records, reconstruct data from other tables in the same system or pull data from a different system in the landscape.

6)      [CONFIRM-GOOD] Get a final CHECKDB to ensure no more corruptions are reported. This is a critical step because the initial CHECKDB output can sometimes only report the first layer of corruptions. Once you have repaired that first layer, sometimes the DBCC check can then progress further and find new corruptions.

This means you might need to execute a DBCC check and repair cycle several times until the DBCC check finally reports no further corruptions. Again, we recommend that you backup frequently so if you repair a first set of corruptions and new ones are reported you should backup the system before starting the second cycle of corruption repairs.

Once the DBCC check reports there are no further corruptions, backup this good state. Then, run any test scripts available to assess the application logical consistency of the database (these might be supplied by the SAP application area colleagues when you open the SAP customer support message).  

Lastly, make sure that CHECKSUM is set for the database and the backup as mentioned near the start HERE.

7)      [REVIEW] Learn from this event in order to either better prevent a corruption recurrence or to establish a disaster recovery plan. For example, frequently customers do not have a clearly defined disaster recovery plan.   At a minimum, you should restore from a production backup at least once a month and run a DBCC check on the restored database. This will confirm that the media is readable and that your backups do not contain corrupted data.  Customers who do this can safely restore a backup when corruption is experienced and can completely avoid the lengthy data repair/rescue process.

Following are some specific suggestions for you to review after a corruption:

  • Determine if one of SQL Server’s HA/DR would have helped in this situation:  (SQL 2012 “Always On” (SAP Note 1772688 [via SAPNet HERE]), Database Mirroring (SAP Note 965908 [via SAPNet HERE]) or Log Shipping (SAP Note 1101017 [via SAPNet HERE]).  If yes, we recommend you implement it.
  • How frequently are you performing Database and Transaction Log Backups?   
  • Are you testing them to ensure the backups are not corrupted?
  • Do you have periodic disaster recovery (DR) drills? Many SAP customers call an unscheduled DR drill once a quarter to (1) ensure data is recoverable and (2) that employees have experience with the DR steps so that an actual DR event is familiar and orderly. This also allows a quarterly update of the DR plan so that it is not outdated.
  • Do the proper employees know the DR plan? Are enough employees familiar with the DR plan in case not every employee is available for a DR?

 

A more generic and detailed list of how to repair database corruptions in Enterprise class databases can be found HERE.

 

SUMMARY RECOMMENDATIONS  

For handling corrupt SAP databases on SQL Server

  • Involve skilled professionals – open an SAP customer support issue (and consider opening a Microsoft SQL Server support issue too)
  • Avoid using RADL or LOGREBUILDING when possible
  • Understand the SAP need for Application Consistency
  • Should RADL need to be used; work to extract at least the Primary Keys
  • Consider the 7 prong approach listed above for handling corrupt SQL Server databases:

CONFIRM-BAD

PRESERVE

UNDERSTAND

ELIMINATE

FIX

CONFIRM-GOOD

REVIEW

  • Consider the blog post HERE for additional general guidelines for repairing enterprise class SQL databases

 

 Thanks to Leslie Moser for help with the text.