SQL Server (2000, 2005, 2008): Recovery/Rollback Taking Longer Than Expected

I have blogged on the various ways to determine the state of recovery but this week I was involved in an interesting discussion around rollback. The transaction has run for 14 hours and then a KILL SPID was issued. The SPID goes into rollback and happed to do so for 2 days and 4 hours.

The natural question is why not 14 hours to rollback?

It is hard to tell without some specific details but if the original query used parallel workers it could have done more than 14 hours of work. For example if the update ran with 4 workers, in parallel, it would have done 56 hours of work or 2.3 days.   Rollback is mostly single threaded so it might take 4x times as long to complete the rollback.   Also, read ahead is often more efficient during execution because the plan is known where rollback has to work from log records and might not be able to drive buffer fetches in the same way.

The next question was: Should I restart the SQL Server service? Would that be faster?

Generally it is not faster because you have to start recovery again from the begin transaction and determine where to start fix-ups of the databases. This means lots of I/O again and this time with a cold cache and probably no other access to the database.

Still Think Recovery/Rollback Is Stuck

If you really think that recovery or rollback should have completed you can issue the following T-SQL commands to capture a dump of the SQL Server process and provide it to Microsoft SQL Server Support for further analysis.

dbcc traceon(-1, 2546, 2551)

dbcc stackdump(1)

dbcc traceoff(-1, 2546, 2551)

This will generate a .mdmp file in the LOG directory. Continue to allow the recovery or rollback to continue and contact Microsoft SQL Server Support for further assistance.

Large Log File

You should also be aware that allowing the log file to grow large and generate huge numbers of VLFs could increase recovery time.

If the database being recovered has a lot of VLFs (Virtual Log Files), the discovery phase that occurs during pre-recovery can take a long time to complete. The number of rows returned by DBCC LOGINFO (dbname) will confirm if this is a possibility.  In a typical problem scenario, this query will return several 1000 in a difficult scenario 1.2 million VLFS.  This problem could occur even if there are no transactions to recover. The best solution is to shrink the log file to a very small value, confirm that the number of VLFs is few 100s or less, then re-size the log by expanding (or growing ) in a large increment.

Bob Dorr
SQL Server Principal Escalation Engineer

Comments (5)

  1. AspiringGeek says:

    This is a fundamental yet little understood area.  Thanks for adding some clarity by publishing this great post.

  2. ajaymalloc says:

    Really good to know this aspect of ROLLBACK behavour.

    Is there any way to detect how much time the rollback will take. From your blog it seems that ROLLBACK will always take longer than actual work done, because the mechanism like single rollback thread, log read dependency, of SQL Server is supports transactions more than that of Rollback. But for my experience it has worked opposite.

  3. Miguel de la Broadbento says:

    We do full refreshes throughout our DW builds and all the rollbacks do is get in the way. Too bad there isn't an option to

    KILL SPID and just leave the data in place as is. The rerun of our process would then do either table truncate or drop and we'd be on our merry way. In other words, the rollback is superfluous within the context of our local use of SQL Server.

  4. Kailas Pawar says:

    Could  we kill the simple rollback command running on SQL Server 2008.

    Please could you  anser on kailasspawar@gmail.com

    I have some doubt to kill rollback statement. Your help is realy appriciated.


    Kailas Pawar