Restoring Through Transaction Log Errors

This post comes from a colleague of mine, Grant Carter.  Grant is a dedicated Premier Field Engineer for Microsoft.  He is based in Boise, Idaho and has been working with SQL Server since 2002.  His experience includes database administration, consulting, and support engineering.  Grant’s expertise is in performance tuning, high availability and disaster recovery, and replication.

 

By Grant Carter

Introduction

Recently I was involved in a situation where the database had been corrupted and was corrupted for several weeks. Transaction log backups were available and had been taken on a regular interval.

We knew the date of the last successful dbcc checkdb for the database, so we used that copy of the backup as our base for the recovery and then started applying transaction log backups from that point to recover the database going forward.  We had to apply about 2 weeks of transaction log backups to get the restored copy of the database current. 

The recovery started out well.  The restore of the full database backup was successfully performed and then we started applying transaction log backups in sequence.   We recovered 6 days of transaction log backups when we encountered the following error.

Msg 3456, Level 16, State 1, Server SERVERNAME\INSTANCENAME, Line 1

Could not redo log record (7566:8504507:13), for
transaction ID (1:37422114), on page (6:14348000), database 'DATABASENAME'
(database ID 5). Page: LSN = (7526:114730:14), type = 1. Log: OpCode = 5,
context 1, PrevPageLSN: (7516:7031381:25). Restore from a backup of the
database, or repair the database.

 

A number of questions arose from this error.

  1. Am I able to continue past this error, or will I only be able to restore up to this point-in-time and not continue to move forward?
  2. We only saw one error message printed to the command output, but does that mean only one log record couldn’t be redone?
  3. How do I know what the transaction I skipped was doing?

 

Continue Past Error

Starting in SQL Server 2005, an option was added to the restore command that will allow you to continue the restore past the error and to apply the log as best it can.  The option is CONTINUE_AFTER_ERRROR.

The CONTINUE_AFTER_ERROR option instructs the restore command to continue past errors it encounters with the restore operation, restoring what it can.  Using this option, the error encountered is displayed back to the restore command and also is logged into the SQL Server errorlog.  At the recovery point, the database will come online if it can.   Should you need to restore additional transaction log backups, you will then need to use the CONTINUE_AFTER_ERROR option on each transaction log restore command once you have used it.

Please Note: CONTINE_AFTER_ERROR can lead to data loss and we strongly suggest you only use the CONTINUE_AFTER_ERROR option after all other alternatives have been exhausted.  

We also recommend that you save all logs and restore output once you use the CONTINUE_AFTER_ERROR option so you can investigate what operations were skipped by the command.  

 

How Do You Know Which Records Could Not Be Redone

In our case, there was a single LSN marked that it couldn’t be redone.   A couple of questions arose as a result.

  1. Is only the LSN skipped or is the entire transaction skipped?
  2. Just because I had a single message in the log, does that actually tell me everything that was skipped?

The answer to the first question was that the transaction may not be skipped.  The error only impacts the LSN where the error is encountered.  It will try to redo as much as possible of the transaction before it gives up and moves on.  

The answer to the second question is that only the first redo failure is logged back to the client and the errorlog.  The rest of the errors encountered for that restore command are silently unlogged.  However, there is a traceflag which will expose all redo failures to the errorlog.  If you are going to use the CONTINUE_AFTER_ERROR command, we suggest enabling traceflag 3112.  It will cause all redo failures to be logged to the errorlog.  You will still only see the first failure in the command window, but all failures will be logged in the SQL Server errorlog where the restore is taking place.

 

How do I find what type of action occurred on the specific LSN in question?

The errorlog will tell you which LSN it could not redo.  In the previous example, the log record that could not be redone was (7566:8504507:13).  This is a decimal representation of the LSN value in question.  SQL Server provides a function called fn_dump_dblog that allows you to dump the contents of a transaction log backup.  

Please see the following posts about fn_dump_dblog by Paul Randal (https://www.sqlskills.com/blogs/paul/using-fn_dblog-fn_dump_dblog-and-restoring-with-stopbeforemark-to-an-lsn) and Dimitri Furman (https://blogs.msdn.com/b/dfurman/archive/2009/11/05/reading-database-transaction-log-with-fn-dump-dblog.aspx).  

By converting the decimal string version to a hex string representation, you can use this function to search for the specific LSN if you know which log backup file to search.   The example script below converts the decimal LSN string to a hex LSN string and then queries for the exact LSN inside of a log file provided as a parameter.   If you need to search multiple LSN values, then it would be more efficient to read the entire log into a table and index the [Current LSN] field so queries will be fast.   Please note that fn_dump_dblog is not documented in books online.  

--from errorlog output

declare @lsnDecimalFromLog varchar(20) = '7566:8504507:13'

 

--Hex string value of LSN

declare @lsnHexToFind nvarchar(46)

 

--Full path of the transaction log backup file to search

declare @logFileToSearch nvarchar(4000) = 'C:\Temp\Database1_17082015101502.trn'

 

--convert decimal LSN to Hex string

set @lsnHexToFind =

    substring(master.dbo.fn_varbintohexstr((CONVERT(VARBINARY(8), cast(substring(@lsnDecimalFromLog, 1, 4) as int)))), 3, 1000) + ':'

    + substring(master.dbo.fn_varbintohexstr((CONVERT(VARBINARY(8), cast(substring(@lsnDecimalFromLog, 6, 7) as int)))), 3, 1000) + ':'

    + substring(master.dbo.fn_varbintohexstr((CONVERT(VARBINARY(8), cast(substring(@lsnDecimalFromLog, 14, 4) as int)))), 7, 1000)

 

--search log for specific LSN

select

    *

from

    fn_dump_dblog (

        null, null, N'DISK', 1, @logFileToSearch,

        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,

        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,

        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,

        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,

        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,

        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,

        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,

        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,

        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT)

where [Current LSN] = @lsnHexToFind

 

The output will tell you what type of operation was occurring, what the context of the operation was, which transaction it was associated with, which allocation unit and page the operation occurred on, which spid the transaction was associated with, if it was a user transaction, the time that the transaction began, along with a description of some of the operations.   Using this you should be able to identify if the transaction was part of user transaction, and what type of operation was skipped.   It can be useful to understand if you skipped an operation that might lead to data loss.

 

Perhaps this will help, if you encounter errors on restoring your transaction logs.