General SQL Database Repair for Enterprise Applications

Article
10/06/2013

Physical corruption happens on all platforms with any Relational Database Management System (RDBMS). Data is persisted on disks. Disks exist in nature. Natural, man-made, and environmental disasters happen. This blog post is targeted for when corruption is encountered in SQL Server databases. Should this happen? Ideally: No. Never. Does it happen? Yes. Still. Isn’t technology reliable enough to avoid this? Not completely. Aren’t Enterprises skilled enough to avoid this? Some but not all. Aren’t Database Hosting Entities skilled enough to provide this? Unfortunately, not all the time. Entropy exists, resources are constrained, and technical experience varies.

What about SQL Server’s HA solutions? 2012 Always On, Mirroring, and Log Shipping?? Yes, those all will definitely help. But not all customers choose to stay current on such helpful technology. And even in those situations, external disasters can and will still happen.

The issue of focus here is that in rare situations it may come to your awareness that you have a corrupt Enterprise class SQL Server database. Typically from error messages in applications or in the ERRORLOG.
Sometimes only by using the SQL Server provided Database Consistency Checking (DBCC) utility DBCC CHECKDB.

Corruption is justifiably alarming. You want the corrupt situation to just GO AWAY. And you want more complete information on what to do, before you proceed. You have never implemented any SQL Server High Availability (HA) features: not 2012 “Always On”, not “Mirroring”, not even “Log Shipping”. You have heard some rumors in regards to CHECKDB as a final straw to grasp for.

The purpose of this communication is for the reader to consider and use these guidelines as they are working with consultants or support experts when handling such situations. The hope is that the increased awareness will lead to a better final outcome. It could be used by the individual who is working to resolve the corrupt database problem, to help them keep in mind where things are headed, and some areas they can check against the advice they receive as they are trying to work through this painful situation. Not all experts have the same level of experience. And even experts can unintentionally forget some details too.

This list of guidelines attempts to be fairly complete. But as such, there are items here that can be skipped (for any number of reasons). "Intentionally Skipping" considerations, however, reflects an entirely different level of empowered health than being oblivious or unaware of things that could be done. The intention of this list is not to provide a dogmatic approach that must be followed literally (urgency and time commonly arguing against such an approach) but rather to expand awareness and thinking on this class of operational events for every one to function more collaboratively and efficiently toward a better end state. Hopefully, at least, it can provide a basis for more informed dialog and awareness between support providers and support recipients during the handling of such events.

Trying to resolve corruption in a database can be involved and complicated and usually happens under very stressful conditions. Offered here is a set of considerations for a single, independent corrupt SQL Server database. Some SQL Server systems are integrated into complex and highly interwoven scenarios, and such systems would need to have more complete consideration of the additional business implications for Disaster Recovery in those situations. Details on those more complex environments is outside the bounds of this post. A robust and thorough Disaster Recovery Plan (DRP) should be worked up for all such scenarios.

The customers that seem to usually encounter database corruptions are those that lack the resources (money, time, expertise) to properly prepare for disaster events ahead of time, and as a result are typically the most overwhelmed and under prepared when corruption events happen to their database. Most commonly two reasons for this are encountered:

Workloads are forever pushing Disaster Recovery Plan Validation (DRPV) work down to a lower priority such that it never gets addressed (i.e. Planning and Validating Disaster Recovery processes never actually occurs)
The Database Administrator (DBA) / System Administrator (SA) has left (by choice or request) and their responsibilities have been incompletely picked up by their replacements.

When evidence of corruption is encountered, here are the steps to follow (elaboration of each will follow):

{A few items are underlined for emphasis because they continue to be frequently neglected. And their value is the motivation for this blog.}

1) Confirm there is a problem. {Verify that there actually is a problem}

2) Record actions taken. {Start and maintain a written log of all the steps that are done in a chronological fashion. And maintain distilled summaries of history and current status}

3) Know the system. {Know and be able to fully describe the system involved and impacted: multiple DBs?, Replication?, Distributed Transactions?, other dependencies?}

4) Scope the problem. {Understand the nature of the current problem state, if it is possible. What is corrupt: data file(s) or transaction log file(s)? User or system database? What table(s) or functionality is affected?}

5) Identify the cause. {WHAT and WHEN are key. What happened and when did it happen? Power spike? Signs of problem in the Windows Event Logs? Hardware problems?}

6) Eliminate the cause. {If known, correct the problem causing component. If not…replace the system??? Don’t waste time recovering and restoring on top of a sick system}

7) Plan for contingency . {Preserve your current state. Work to NEVER make things worse through the corruption handling process. Backup frequently.}

8) Fix the problem state. {The root cause has been eliminated or corrected. Now the goal is to return the database back to a consistent state. i.e. Get rid of the corruption.}

8A) Communicate. {Open a case with both the Application vendor (if involved) and directly with the RDBMS support services. Occasionally the corrupt data might be in nonessential tables that can be dropped or truncated?}

8B) Understand consistency {The difference between RDBMS and Application consistency is significant }

8C) Understand RESTORE {Understand the value of restoring from BACKUPs: Full, Differential, and Transaction Log}

8D) Understand the risks and limitations of RADL and TLog Resynthesis

8E) Can you use RESTORE in this case?

8F) Use RESTORE if you can {This is best and fastest solution if it can be used: Then skip the rest of the 8 steps and move to step “9)”}

8G) RESTORE won’t work, so then Make Plans. Consider and assess Parallel actions. {Possible to install clean or QA system to PRD and salvaged data?}

8H) Export all the data you can {Can you export all of the data and objects from the database?}

8I) Scope what data is affected. {Identify likely affected objects}

8J) Get useful details. {Get info on the affected objects (rowcounts and Primary Keys if possible, data if possible too)}

8K) Work to salvage data. {Take steps to repair}

8L) Assess the repair effort

8M) Maximize physical recovery. {Repeat “8G)” through “8K)” if needed}

8N) Correct Application inconsistencies. {Make follow up appointments with Application owners to see if lost data can be resynthesized somehow}

8O) Backup this known, good, clean, repaired state

9) Assess application functionality

10) Resume normal operations

11) Review the Event and Learn

11A) Make plans for implementing SQL Server HA features

11B) DRP for adequacy

11C) Review lessons learned from this event

1) Confirm there is a problem. {Verify that there actually is a problem}

It is important to confirm that a corruption problem actually does exist. Most of the time when evidence of corruption appears, a real corrupting event has occurred. But it is possible, albeit very unlikely, that the corruption was potentially an artifact or a brief memory anomaly where the problem state was not persisted to disk. Handling corruption events are not trivial undertakings. It is a serious and usually stressful situation. Take the time and confirm the fact that you actually do have a problem before moving forward to fix it.

WHY IS THIS IMPORTANT?

If you don’t take the time to confirm you actually have physical corruption, then you may errantly spend much time, labor, and emotion restoring previous database (and transaction log) backups on top of more recent good quality data that was not actually corrupt. Although extremely rare, it is an unfortunate state when it happens.

WHAT SHOULD BE DONE ON SQL SERVER?

Confirm the corruption actually does exist. This is best done either by obtaining a current BACKUP of the database and restoring to a different physical SQL Server {if such is available} and obtaining a fresh DBCC
CHECKDB on this second server. Or executing a DBCC DROPCLEANBUFFERS on the existing database server where corruption is believed to likely exist and obtain a full DBCC CHECKDB {using the WITH NO_INFOMSGS option is fine} and confirming that the signs of physical corruption are still present and even identical with the initial information.

2) Record actions taken. {Start and maintain a written log of all the steps that are done in a chronological fashion. And maintain distilled summaries of history and current status}

This may seem like overkill. And when things are very simple, it is. But when things take unexpected turns or the situation takes longer than expected to resolve, you’ll be glad that you have everything documented.

It frequently seems that a fair amount of time gets unnecessarily spent in revisiting the history of long running problems. Having a concise and distilled description of what happened and the current status of things is helpful when onboarding new individuals into the situation. That is, it is nice to have both the long running chronological log for completeness and a current snapshot of the distilled history and the current status of things underway. The current distilled snapshot will evolve over time as more information continues to come to light. It is acknowledged that this distillation task is difficult to do, in practice, because the facts are seldom definitively known while the situation is underway. And prioritization selects what seems best to get down the road. Nonetheless, the working distillation of both history and current status is valuable and can help keep the responding team focused. Keeping a list of questions and speculations can also aid in flushing out the complete picture of the event, and can free human processing to better center on the high priorities of the moment too.

WHY IS THIS IMPORTANT?

It allows for additional individuals to quickly come up to speed with the current status of the situation and a history of all that has been done and tried or is in progress, up to that point. This is helpful for minimizing the time to resolution when issues get complicated. And it is also beneficial for post mortem assessment as to what could have been done better or differently. Be happy if you get through a corruption handling event and are able to conclude that this step was superfluous.

WHAT SHOULD BE DONE ON SQL SERVER?

Nothing specific to SQL Server. Of course, it is best to record this on a completely separate computer system and also on a white board for easy local review.

3) Know the system. {Know and be able to fully describe the system involved and impacted: multiple DBs?, Replication?, Distributed Transactions?, other dependencies?}

It is important to clearly understand the basic functioning of the system. Not necessarily the inner workings but a functional knowledge to be able to grasp how other processing might be affected by corruption in a particular database in a specific database server instance.

WHY IS THIS IMPORTANT?

To minimize the impact to other dependent processing that corruption handling might affect. Keeping everyone abreast of those dependencies early can be helpful to the decision processing on how best to
move forward in certain situations.

WHAT SHOULD BE DONE ON SQL SERVER?

Nothing specific to SQL Server. Just have the owning technical architects be able to clearly describe the dependencies of relevant technological components if they exist. Examples in the SQL Server domain that might be affected: Replication, Distributed Transactions, CLR functioning, downstream data dependencies (SSIS, SSRS, Biztalk).

It is valuable to know as much detail about the symptoms of the corruption as possible. It can draw attention more quickly to the affected components. Things such as “All Users started seeing message X when they ran transaction Y” or “Users simply couldn’t log onto database Z after time aa:bb” or “The database server instance A failed to even come online and stay online” are all valuable information. SQL Server ERRORLOGs and Windows System and Application Event logs frequently offer explicit messages that are helpful for determining the times and types of corruption encountered. Use some basic troubleshooting isolation skills to scope and bound the extent of the affected problem. How far back do the symptoms extend? At what point in the past were things behaving normally?

WHY IS THIS IMPORTANT?

It is helpful to be able to establish checks for when the problem state will be gone. It is helpful to clearly understand if the problematic state is a Host problem (e.g. multiple servers), a database instance problem (multiple instances), a database problem (multiple tables), or something local to only a few tables or structures. Having this information is helpful for everyone to focus on the proper elements of the problem. This does frequently evolve, as more and deeper information comes to light.

WHAT SHOULD BE DONE ON SQL SERVER?

Provide all explicit error messages that are recorded in the SQL Server ERRORLOGs and that Applications are encountering. Also provide all error messages and final results at the end, from a full DBCC CHECKDB that was performed as a fallout from step “1) ” above. And specifically, if CHECKDB runs to completion and indicates corruption (a.k.a “inconsistencies”) it will also provide a statement of “the minimum repair level” to correct the identified inconsistencies. This needs to be provided too.

5) Identify the cause . {WHAT and WHEN are key. What happened and when did it happen? Power spike? Signs of problem in the Windows Event Logs? Hardware problems?}

Expend some effort to understand the cause of the corruption and when the corrupting event occurred. Is a specific disaster event able to be identified that physically affected things at a certain time. Identifying those situations are rare. Sometimes there are signs of hardware componentry errors or warnings when none had been present before. Knowing the times and the components affected, are very valuable for the next step. It is important to investigate the SQL Server ERRORLOGs and Windows System and Application Event Logs for any telltale signs of system instability. Unfortunately, times for errors and warnings don’t always signal the actual point of corruption. For corruption is typically first identified only at the point when a particular page of data is actually touched.

WHY IS THIS IMPORTANT?

It is helpful in the decision process of future recovery efforts to know, as best as possible, what actually failed and when the failure happened. And when failure cause can be isolated to a specific component, then
eliminating that cause of further corruption is possible.

WHAT SHOULD BE DONE ON SQL SERVER

Review the SQL Server ERRORLOGs and Windows System and Application Event logs to try to determine WHAT actually failed and WHEN the probable time window for the failure event.

This also provides sound justification for backing up older copies of those log files too. Something that better run IT shops always do.

6) Eliminate the cause. {If known, correct the problem causing component. If not…replace the system??? Don’t waste time recovering and restoring on top of a sick system}

There are two main aspects to handling corruption problems:

1) Eliminate the source problem that caused the corruption.

2) Return the database to a consistent state from which normal operation can then continue. i.e. Fix the corrupt state.

With the operational pressure that exists in many production situations where corruption is involved, attention is focused understandably on the second of these two aspects. Conscious effort must be exerted to resist this inclination. Corruption should not happen. But it will; entropy exists. Failing to attend to the first aspect here, in many times will frequently lead to a longer time to final resolution, heightened anxiety, and decreased respect for the operational owners of the involved resources. By and large, most all corruptions I have dealt with center on things outside of the SQL Server and Windows processes. If your efforts in regards to point “5) ” were successful in identifying the root cause of the problem…then fix it before you do anything else. And if you are not able to locate any ‘smoking gun’ in your efforts of point “5) ”, it is this author’s opinion, that {if possible} you replace the entire physical system so that you have new and different physical resources to work with. Many Enterprise data centers have spare resources that can be re-provisioned in situations like this. If you are not able to fix or replace the physical components of a failed system (owing to availability or cost), it is recommended that you redouble your efforts at DRPV because you should expect that additional corrupting events will be coming. If corruption happened once, and a root cause was not identified and the environment left as it is; expect more corruption events to happen.

WHY IS THIS IMPORTANT?

To not waste time working to address the second aspect of the corruption problem before addressing the first. It is very wasteful of time, for all involved, to work on the second aspect and find that consistency cannot be re-established because new and different corruptions keep happening.

WHAT SHOULD BE DONE ON SQL SERVER?

Nothing specific to SQL Server. Eliminate the cause of the corruption.

7) Plan for contingency . {Preserve the current state. Work to NEVER make things worse through the corruption handling process. Backup frequently.}

In working through corruption situations, it is important to Make Contingency Plans. Make sure to preserve the state of what is, before you go forward; under the possibility that by going forward you might actually make things worse. For instance:

Let’s say you find you have corruption on Friday…OK. Business decision makers had told you that full weekly database backups on Sunday would be sufficient for your needs. So you restore your last backup, and run CHECKDB. It’s still corrupt. And you find out that the one corrupt table had been there for 6 weeks. And your CFO tells you he needs all the critical details on an order from Tuesday. But you already restored the backup from last week!

If you had taken the time to take a backup, before restoring, you could have returned back to the original state.

As often as necessary, back things up before you go forward (as a contingency plan in case things go unexpectedly wrong). And label the backup clearly as containing the corruption too.

There are some data salvage things that can be done sometimes to extract clean data out of databases where corruption exists {this would be something for Data Recovery experts to assist with}

WHY IS THIS IMPORTANT?

To avoid the situation where the corruption handling process itself made a bad situation even worse.

WHAT SHOULD BE DONE ON A SQL SERVER?

When the SQL Server process and database are up and available, always try to get an online backup of all databases before restoring on top of those databases. And make sure to label the backups appropriately as ‘bad’ or ‘corrupt’.

If SQL Server or the databases involved are not able to run, then obtain an offline backup of all database data and log files. {An offline backup is just an operating system file copy operation}

8) Fix the problem state. {The root cause has been eliminated or corrected. Now the goal is to return the database back to a consistent state. i.e. Get rid of the corruption.}

You have eliminated the cause of the corruption (the first aspect of point “6) ”). But you still have the corruption in the database that needs to be addressed.

WHY IS THIS IMPORTANT?

This is probably the heart of the reason why this blog is of interest to you. You want to get rid of the corruption and return to normal operations.

WHAT SHOULD BE DONE ON SQL SERVER?

There is only one easy and reliable answer to this: Restore from a known, good clean backup. And for 95% of our customers; this works. And if so, it is the way to go! But there is another 5% that never test their backups and their backups are bad, or their backups are insufficient to satisfy the needs of the business owners and data consumers such that restoring from backup is not a viable option. In short, they never substantially validated their Disaster Recovery Plan. And for those situations, there is no canned set of answers for this. A few of the things that it depends upon: What type of corruption is present? What tables are affected? Are system tables affected? Which ones? Are data files or log files affected?

So in general, what are the next steps to follow to fix a corrupt database when restoring from backup is not an option?

8A) Communicate. {Open a case with both the Application vendor (if one is involved) and directly with the RDBMS support services. Occasionally the corrupt data might be in nonessential tables that can be dropped or truncated?}

You have physical inconsistencies in your database. Your database is corrupt. Engage experts to verify that fact and work with them to discern how best to proceed. Typically this means that you should open a support case with the application owner (or vendor) of the software that interacts with that database and the RDBMS provider of that database (e.g. SQL Server support) and provide all of the information you’ve acquired from steps “1)” through “7)” and why you can’t restore from backup.

WHY IS THIS IMPORTANT?

Some of the following might help you resolve this situation more quickly:

Late breaking news that others are being affected by the same symptoms, where those solutions could be leveraged.
Application owners and vendors can sometimes reveal insights into the importance of affected tables such that some can be safely dropped or truncated.
Tracking all such incidents allows RDBMS vendors and Application experts track details to try to improve the entire community long term.

8B) Understand consistency {The difference between RDBMS and Application consistency is significant }

The primary purpose of a database is to reliably store and retrieve data. How fast it can do its job and the nuances with which it can do that are of secondary importance. Information on consistency and corruption can
be found HERE.

In a note for emphasis: DBCC CHECKDB has always stated it checks both physical and logical integrity of the database. This is a true and accurate statement. It does do that. And it can do that completely, but only so from an end-user perspective if all application metadata logic between all the data in the database is managed within the database. For example, if all inter-table relationships are managed by Declarative Referential Integrity (DRI) inside the database, then the consistency of those relationships {the RDBMSLogCon between those data objects} can be checked. But, to be clear, CHECKDB cannot check those application consistency items that are external to the database. This might seem obvious. But it isn’t to many. And in practice, this fact gets often overlooked. CHECKDB cannot check Application consistency.

Applications and end users care about application consistency. Application consistency demands full RDBMS logical consistency that it uses. RDBMS logical consistency demands RDBMS physical consistency. It is possible to correct both RDBMS physical and logical consistency and still be left with an inconsistent database because of data inconsistencies created in the application layer. And for enterprise class applications, this is more commonly the norm than the exception; because the application consistency layer exists outside the awareness of the RDBMS.

WHY IS THIS IMPORTANT?

To properly grasp and amplify the real value in restoring from backups to an earlier state. By successfully restoring from backup, all levels of consistency are achieved (RDBMS physical, RDBMS logical, and Appplication consistency) and the database will no longer be considered corrupt. It may be moved forward into the future with full confidence of the integrity of the data inside. Note, however, that it is incumbent upon the applications that access that data to be aware of the implications of moving backward to that earlier point in time. Eliminating the corruption by any means other than restoring from backup introduces the possibility of creating application level logical inconsistencies. Because of this, we need to note some impacts for any data repairing efforts:

1) Always try to restore from backup to eliminate corruption in a database. This will deliver the best and most complete pathway to recoverability. It is fastest.

2) If a pathway of restoring data from backup is not possible, and extreme measures would be used like RADL or TLOGREBUILDING, expect that application corruption will probably be introduced.

3) Understand the impact to the applications using the data arising from the introduction of logical corruption. And this is where the communication with the data owners or application vendors comes in, as mentioned in “8A) ”

8C) Understand RESTORE {Understand the value of restoring from BACKUPs: Full, Differential, and Transaction Log.}

RESTORE should be your friend. But as with any friendship, you must spend time together to build awareness and establish trust. Practice and learn how to restore from backup. Do this now, before you encounter corruption events. It is very stressful to be struggling with restore syntax, proper backup image locations, and proper sequence of activities when handling a server down corruption situation. Rather choose to pursue a low stress, planned Disaster Recovery Plan Validation (DRPV) exercise. Spend a couple weekends, get some sufficiently sized systems to mimic production and practice restore scenarios. You can find more out about SQL Server restore in SQL Server Books Online (BOL).

In addition to this, also ensure that you understand the difference between the different types of database backups: Full, Differential, copy only, and transaction log. This is covered extensively in SQL Server BOL. One additional note often goes unrealized, however, is that transaction logs in sequence can span across full and differential backups. So, if by chance a last full backup image unfortunately has corruption in it, if you are taking transaction log backups continually, it is possible to restore from an early full backup and apply all the transaction log backups after that (spanning the corrupt full backup) and move forward to a later point in time {presuming that a complete sequence of all transaction log backups is available}.

WHY IS THIS IMPORTANT?

To build your confidence and familiarity with the normal RESTORE process. You want the RESTORE process to be thought of as a warm blanket you know, trust, and feel very comfortable with.

If the direction from support for a corruption instance comes, “You’ll need to restore from a valid known, good, clean backup.” How will you react to that (both tactically and emotionally)?

You should know the steps to do that. And you should feel something akin to, “That’s unfortunate. But tolerable. We’ve identified the root cause and eliminated that. We’ve backed up current state. And we’ve validated the data loss and time to recovery with our application owners so they can handle the time hit and the slight loss of data. And we have a rough idea how long this will take. We’re OK.” {?? Are you able to think like that now ??}

WHAT SHOULD BE DONE ON SQL SERVER?

You must work this one out yourself. There are many backup strategies and architectures: some provided by hardware partners, some by software vendors. Some approach the subject centrally through corporate policy and standardization. Others approach it simplistically. Whatever the specific solution is doesn't matter much, so long as you (or your DBA and SA) test and validate that it works and know who will be contacted if it doesn’t. I have worked with hundreds of different viable solutions. They all can work and they all can have problems. But those who have practiced and validated the DRP know the steps to get things done smoothly, even when things go awry. The audience for this blog is too broad to expect any narrow positioning on this topic.

At the very least, if you are entirely new to this space: Get a backup of the database. Bing “SQL Server and Backup” and research the topic or at least use SQL Server Management Studio and follow through in the information HERE.

8D) Understand the risks and limitations of RADL and TLOGREBUILDING

If you can’t restore from backup for some reason, and you have corruption, there are some options to go forward, but they are very extreme and should not be undertaken without direction from support professionals. They are not guaranteed to work and typically will result in loss of data and generally will introduce application level corruption. You should read a little bit more about them HERE.

WHY IS THIS IMPORTANT?

To better understand the risks associated with using either of them. Some industry ‘experts’ in such situations can advise on their use without clearly educating on the impacts to application consistency. And they can leave your situation with a perspective that: “You have a clean DBCC CHECKDB so you’re good to go.” Neglecting the potential negative impact to application consistency, which is the end goal to be achieved.

If at all possible, try not to ever use either of these options. Understand that confidence will be shaken as to what the state of application consistency in that database is after their use. Application experts will need to be involved afterwards to assess and work to re-establish application consistency afterwards.

Again, use them only under discussion and direction from support professionals.

8E) Can you use RESTORE in this case?

As mentioned before, first and foremost comes to: “Are your backups such that a database restore of them will return the system back to an acceptable state?”

WHY IS THIS IMPORTANT?

This is typically a binary decision that determines how you’ll need to move forward to eliminate the corruption. This is worth the effort to consider this thoroughly...again.

8F) Use RESTORE if you can { This is best and fastest solution if it can be used: Then skip the rest of the 8 steps and move to step “9)”}

{even consider OLDER backups……remember TLogs can span full and differential backup images}

If the decision was yes you know your restore procedure, and you have backed up the existing database about to be restored {per step “7) ”} as a contingency plan, then do it. Restore.

It is a good idea to take the time and get a full DBCC CHECKDB with NO_INFOMSGS, whenever you perform a restore procedure and before you resume normal operations. This allows for a very definitive statement on the state of that restored image prior to any other application work against it (should subsequent corruptions arise). Most customers when they have successfully identified a root cause and the time of the corrupting event, however, usually feel that prolonging the system outage in order to get another CHECKDB is too painful and choose to not do it and to accept the risk that the backup image might have corruption in it. Other customers conclude that since the system is already down that they can tolerate the additional time for another CHECKDB. Both perspectives are arguable. I have known customers that have skipped the DBCC after restoration and the image had corruption in it and it further delayed things because of not having performed the CHECKDB, and I have known customers where there was heightened stress while the business owners were waiting impatiently for the last minutes for the CHECKDB to conclude that, "all was well".

IF the decision was to restore, and the database is either confirmed to be clean of corruption (or is presumed to be), skip the next points and proceed to step “9) ”

WHY IS THIS IMPORTANT?

It will return the database back to a fully consistent state at a particular point in time by which you can then move forward.

This will be your quickest path to eliminate the corruption and return to normal operations.

WHAT SHOULD BE DONE ON SQL SERVER?

Know roughly the time that it takes to perform DBCC CHECKDB on the affected database and make the decision whether or not to run CHECKDB and confirm the database is clean before moving forward. The purpose of knowing the "time it takes" is to be able to make a more informed decision.

8G) RESTORE won’t work, so then Make Plans. Consider and assess Parallel actions. {Possible to install clean or QA system to PRD and salvaged data?}

For whatever reason it was concluded that restoring the database is not an option. The most common reasons encountered include:

“We have no backups”
“The backups are too old”
“Restoring from backups will cause a loss of too much critical data”
“The corruption exists in all the backup images too”

So, from here plans must be sketched out and considered. Generally, if resources are available, separate pathways can be considered in parallel to obtain a quicker resolution. Sometimes there are ways to get a production system back up and going (e.g. moving a recent QA or DEV system into production) and moving recovered data after-the-fact over to production at a later time. There are too many variables to get into this here. The most critical components are:

The number of spare, appropriately sized machines to be able to work with.
The number of paths forward being considered.

If you get stalled on this point, either consultants or support professionals might help you here. But, by-and-large, in such a situation as this….customers usually find themselves on their own in regards to driving the plans for this.

WHY IS THIS IMPORTANT?

To be able to communicate where you are trying to go and to increase awareness of the all the different paths forward that are being pursued.

Planning and managing these different pathways is not always trivial. It is typically under very stressful conditions. Clear communication is crucial helping to manage that stress.

Go back and reread step “2) ” and "8A) "

8H) Export all the data you can {Can you export all of the data and objects from the database?}

There are many different avenues for exporting all the data from the database. One can sometimes detach-copy-attach and then remove all the corrupt objects. Or use SSIS, SSMS, or an external tool to script and extract all of the data from the database. The end goal is the key. To get as much of the non-corrupt data over into a working database.

WHY IS THIS IMPORTANT?

If it would work perfectly you could get all of the data and metadata out of the database. And on successful reimporting into a new and clean database, you would only be left with the potential for application logical corruption. Secondarily, this also moves over all of the data that is not affected by the corruption.

The end goal is a new clean database that is only missing those objects and data that were involved in the corruption.

8I) Scope what data is affected. Identify likely affected objects

This is a further drill down from step “4)”. You can think about step “4)” as being the 30,000 level perspect. Now your perspective is around 1,000 feet.

More than likely, since there was corruption, the export did not work completely and some objects are missing. The goal here is to identify which objects were not able to be successfully exported without errors.

WHY IS THIS IMPORTANT?

It describes which objects/tables are the ones affected by the corruption in the database.

8J) Get useful details. {Get info on the affected objects (rowcounts and Primary Keys if possible, data if possible too)}

Get whatever useful information you can on the corrupted tables. e.g. rowcounts, structures, indexes, primary keys.

WHY IS THIS IMPORTANT?

It can be useful for the application owners to potentially work at identifying what data was lost, for addressing application inconsistencies. Primary keys can be very useful in this regard.

WHAT SHOULD BE DONE ON SQL SERVER?

sp_spaceused, sp_help, SELECT the Primary Key columns FROM the table ORDER BY the Primary Key columns

8K) Work to salvage data. {Take steps to repair}

Now its time for some creativity. It is time to extract as much data from a corrupt table as possible

Generally the approach here is that you extract and import as much data as possible from the corrupt database into a new clean database. Depending on resources, this working database can either reside on the same SQL Server instance or a different one. And then you try to work piece meal to get as much of the remaining data as possible out of the corrupt tables and into the new working database. And once that work has been done, as best as possible; the salvaged data is moved back into the final target destination. Where depending on the scenario, the salvaged data might either entirely replace the old corrupt database or could be added back into an already function database {depending on what was happening in the original source database while the data salvage effort has been going on}.

WHY IS THIS IMPORTANT?

You’re trying to get as much data out of the corrupt database as possible.

WHAT SHOULD BE DONE ON SQL SERVER?

INSERT…SELECT or BCP are your main tools. Then use search arguments, index hints, and order by clauses can be valuable to this end. Support professionals will be able to elaborate on these, if you get to it.

One source of general information on this can be found HERE

8L) Assess the repair effort

Compare what you extracted with what originally existed. e.g. rowcounts, structures, indexes, primary keys. Where are you in regards to the ideal % data recovery for each table and for all tables? And, if possible, also try to identify the actual lost data, even if it is just a window of lost data.

WHY IS THIS IMPORTANT?

For all to know how things are progressing and the time it is taking to progress.

WHAT SHOULD BE DONE ON SQL SERVER?

Likely the same things as in step “8H) ”

8M) Maximize physical recovery. {Repeat “8G)” through “8K)” if needed}

Repeat the steps “8G) ” to “8J) ”as necessary until complete or until no additional data is able to be salvaged.

WHY IS THIS IMPORTANT?

To come to closure that this is the best that can be done to recover data from this corrupt database.

There may be additional external data recovery experts that might be able to pull out additional data. Those can be pursued if financial justifications warrant it.

8N) Correct Application inconsistencies. {Make follow up appointments with Application owners to see if lost data can be resynthesized somehow}

As much data as possible has been extracted and imported into the new working database. Frequently, at this point the database may be able to be returned to normal production. But that is something for the planning team to assess and to consider the tables that were affected and the impact of adding new data into those tables where some data may be missing. In any regards, the data in this new working database should now be physically and logically consistent from the database perspective. i.e. a DBCC CHECKDB on this database should not return any errors at this point. The database should now be fully consistent from a RDBMS physical and logical consistency perspective. But, depending on the application’s interaction with the database, there may be Application inconsistency that was generated by the incomplete extraction of all the data from the corrupt database. Toward assessing if there is significant Application corruption and trying to correct such inconsistencies, when it is appropriate, support cases should be opened with the Application owners of the data in the affected tables {i.e. whomever is the controlling architect of that application data}.

WHY IS THIS IMPORTANT?

It is the end users that need to have confidence in the data and its uses. And it is the application consistency which provides that end user experience. The RDBMS and the data within it are the platform that the application consistency stands upon. Application consistency needs to be re-established.

8O) Backup this known, good, clean, repaired state

After salvaging as much data from the corrupt database as possible and re-establishing Application consistency, this working database is in the best state possible to move forward into the future. Get a CHECKDB to ensure no signs of additional corruption have arisen. Make sure that the CHECKSUM property is set for the database to help notice further corruption messages more quickly as is described HERE. And get a backup.

WHY IS THIS IMPORTANT?

This was the end goal for handling a database with corruption when there is no viable option to restore from an early backup.

9) Assess application functionality

Very few customers take the time to develop and exploit this. But it is a nice item when it exists. Some customers know the key tables and processes that must work reliably for the system to be considered functionally viable. e.g. A few order headers and order details…insert a few records, report out on them , and delete them afterwards. A batch of code to do this. And perhaps a few more batches to assess other key functionality. Bundle this all up as a test suite and keep assessment metrics on accuracy and performance. Does it catch everything? No, hardly. But it is a nice little QA check to use before resuming normal operations from which one can state, “Base functionality appears to be working as expected”

WHY IS THIS IMPORTANT?

Depends on your operational environment. Highly structured ones, or ones with critical or specific business needs might find value in developing test suites to that end.

10) Resume normal operations

You have now, hopefully, eliminated the source of the corruption, and fixed the symptoms of the corrupting event either by restoring or data recovery efforts. You have followed good contingency planning practices and have been backing up along the way. You should now be ready to resume normal operations.

11) Review the Event and Learn

Problems happen. But seldom are they obligated to be only bad. Redirect this typically stressful and unpleasant corruption experience toward a positive end. Leverage it.

Consider the following:

11A) Make plans for assessing and implementing SQL Server HA features

All of these can reduce the time to recoverability

SQL Server on Windows Azure VMs

Always On

Database Mirroring

Log Shipping

11B) Review DRP and DRPV for adequacy

Do you have a Disaster Recovery Plan (DRP) and do you follow through with DRP Validation (DRPV) to ensure that it works and is known by others?

If your data is important, it deserves to be cared for now and through disaster events (of which corruption are only a small portion)

11C) Review lessons learned from this event

Learn from this event. Use this event to get more funding to better prepare against future events.

If you close out a corruption handling event with increased awareness and sometimes enhanced financial backing towards more DRPV development, then you can rightfully turn the pain of a corruption event into a
more pleasurable future where your stress and anxiety will be lessened, and the quality of the operations center and of your own life will be enhanced.

WHY IS THIS IMPORTANT?

To find ways to exploit every painful, unplanned corruption event, into a more robust operational position to better handle similar events in the future.

Corruption still happens. A surprising number of customers still don’t validate their Disaster Recovery Plans. They don’t know if their backups are good. Some don’t even know IF they have backups or how to restore them. This is NOT something to be struggling to figure out in the middle of the stress of a corruption handling event.

Sometimes there is no viable backup to restore to and using RADL is the only option to get things up and running again, either owing to a corrupt and not validated backup or from the restore leading to a completely unacceptable loss of data. Sometimes a bit of extra thought and effort can help customers substantially. Most commonly this help can be delivered by always preserving the information (not rushing too fast and simply restoring on top of things) and by getting at least the Primary Keys out from corrupt tables when possible, before forcing RDBMSCon when it is called for.

SUMMARY RECOMMENDATIONS

1) Understand the information in the databases that you are responsible. Comprehend the consistency needs of the applications that access that data. Know what the risk for application corruption is for those that use the data.

2) Validate your Disaster Recovery Plan for acceptability from maximal, worse case scenarios.

3) Plan on ONLY restoring from a valid, known good clean backup in the event of corruption. Are you ready to do that???

4) Involve Support professionals if you are not.

5) Don’t use RADL or LOGREBUILDING without involving support professionals.

General SQL Database Repair for Enterprise Applications

Additional resources