SQL Server (2000, 2005, 2008): Recovery/Rollback Taking Longer Than Expected

I have blogged on the various ways to determine the state of recovery but this week I was involved in an interesting discussion around rollback. The transaction has run for 14 hours and then a KILL SPID was issued. The SPID goes into rollback and happed to do so for 2 days and 4 hours.

The natural question is why not 14 hours to rollback?

It is hard to tell without some specific details but if the original query used parallel workers it could have done more than 14 hours of work. For example if the update ran with 4 workers, in parallel, it would have done 56 hours of work or 2.3 days.   Rollback is mostly single threaded so it might take 4x times as long to complete the rollback.   Also, read ahead is often more efficient during execution because the plan is known where rollback has to work from log records and might not be able to drive buffer fetches in the same way.

The next question was: Should I restart the SQL Server service? Would that be faster?

Generally it is not faster because you have to start recovery again from the begin transaction and determine where to start fix-ups of the databases. This means lots of I/O again and this time with a cold cache and probably no other access to the database.

Still Think Recovery/Rollback Is Stuck

If you really think that recovery or rollback should have completed you can issue the following T-SQL commands to capture a dump of the SQL Server process and provide it to Microsoft SQL Server Support for further analysis.

dbcc traceon(-1, 2546, 2551)

dbcc stackdump(1)

dbcc traceoff(-1, 2546, 2551)

This will generate a .mdmp file in the LOG directory. Continue to allow the recovery or rollback to continue and contact Microsoft SQL Server Support for further assistance.

Large Log File

You should also be aware that allowing the log file to grow large and generate huge numbers of VLFs could increase recovery time.

If the database being recovered has a lot of VLFs (Virtual Log Files), the discovery phase that occurs during pre-recovery can take a long time to complete. The number of rows returned by DBCC LOGINFO (dbname) will confirm if this is a possibility. In a typical problem scenario, this query will return several 1000 in a difficult scenario 1.2 million VLFS. This problem could occur even if there are no transactions to recover. The best solution is to shrink the log file to a very small value, confirm that the number of VLFs is few 100s or less, then re-size the log by expanding (or growing ) in a large increment.

Bob Dorr
SQL Server Principal Escalation Engineer