Slow recovery times and slow performance due to large numbers of Virtual Log Files

In my previous post about database mirroring configuration problems caused by large numbers of virtual log files, I mentioned that such a scenario can have other unwanted side effects, with or without database mirroring being enabled. Here’s an example that demonstrates one of these scenarios: it shows how recovery times after a disaster can be significantly extended by allowing your log file to grow in small increments, thereby creating thousands of VLFs. It also demonstrates the performance impact of doing online file growth during periods of heavy activity. This scenario can affect you with or without mirroring enabled; it’s just quicker to demonstrate when configuring mirroring.

I mentioned before that to fully understand this topic you need to be familiar with the Transaction Log Physical Architecture and also the Factors That Can Delay Log Truncation.

There is one simple lesson to be learnt here, so I’ll write it early on in the article:

Pre-grow your log files (and your data files) in advance. Avoid repeated online auto-growth if at all possible.
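For example, if you have an existing database whose log was created small, you can expand the log to its working size in a single operation rather than letting it auto-grow in tiny increments. A minimal sketch (the database and logical file name vlf / vlf_log match the demo below; substitute your own names and sizes):

--pre-grow the transaction log to its working size in one step
alter database vlf
modify file (name = vlf_log, size = 5GB)
go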

Consider the following example scripts which are based upon the same routines as used in the previous post, but with some small changes.

The scripts perform the following actions; there is one for the good scenario and one for the bad scenario.

1. Create a database named VLF

2. Create a starting backup point

3. Open a transaction in this database to ensure that the transaction log cannot be truncated

4. Fill up the log with thousands of transactions until the log reaches the space limit set for it – in this case we test with a 5GB log. Running the log out of space is used just to guarantee log files of equivalent sizes for the recovery time comparison; the same behaviour holds true for logs that haven’t run out of space.

5. Restart the server without taking a backup or dealing with the log space. This forces SQL Server to recover the entire contents of the active portion of the log, which is, in effect, all of it.

We then just start the SQL Server service again and monitor the recovery times and look in the SQL Server error logs.
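One convenient way to follow those recovery messages from a query window, rather than opening the log files on disk, is the sp_readerrorlog procedure. Like the DBCC LOGINFO command used later, it is undocumented, so treat this as an unsupported convenience; the parameters here are the log number (0 = current), the log type (1 = SQL Server error log) and a search string:

--filter the current SQL Server error log for messages mentioning 'vlf'
exec sp_readerrorlog 0, 1, 'vlf'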

We notice a number of interesting points here:

1. The runtime required to fill up the 5GB log file is significantly shorter when using the pre-grown log file: less than 2 minutes, compared to over 9 minutes for the incrementally auto-grown file. This demonstrates why auto-growths can be bad: they are expensive to manage and can harm your performance. The fact that the log fills quicker indicates that you are pushing transactions through at a faster rate. (For one way to spot auto-growth events on your own systems, see the sketch after this list.)

2. In the bad scenario there is a delay of about 2 minutes during which nothing is logged about the recovery phases of the VLF database. This is the period between the database starting up and the analysis phase beginning. In reality there is a large amount of work going on here; SQL Server is processing the 20,480 VLFs that are present in the log file. This delay grows in proportion to the number of VLFs that you allow your log file to hold. I have seen real-world examples where hundreds of thousands of VLFs have significantly delayed recovery.

3. The actual recovery times are similar (about 2 minutes) once recovery begins. They take this long because the old open transaction we created in the test forces SQL Server to recover the entirety of the log.

4. As well as the pre-grown log being significantly faster to fill (in terms of runtime), indicating that transactions are processed faster, when we look at the error log in detail we see that about 6,500 more transactions were processed in the shorter period. This is further evidence of how much better performance the pre-grown log file gives you: we managed to process roughly 25% more transactions in about a fifth of the time!
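As promised in point 1, here is one way of seeing how often auto-growth is hitting your own systems. This is just a sketch that reads the default trace (enabled out of the box on SQL Server 2005); event classes 92 and 93 are the data file and log file auto-grow events, and Duration is reported in microseconds:

--list recent auto-growth events recorded by the default trace
declare @tracefile nvarchar(260)
select @tracefile = path from sys.traces where is_default = 1
select DatabaseName, FileName, StartTime, Duration/1000 as duration_ms
from fn_trace_gettable(@tracefile, default)
where EventClass in (92, 93) --92 = data file auto grow, 93 = log file auto grow
order by StartTime
go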

The scripts and error log output are shown below. The scripts use the undocumented DBCC LOGINFO command to examine the number of virtual log files in the physical log. Remember, as with all undocumented commands, its use is not supported and you run it at your own risk. The number of rows that this command outputs represents the number of VLFs currently in the physical log.
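If you want to capture the VLF count programmatically rather than counting rows by eye, a minimal sketch is to insert the command’s output into a temporary table. Note that the column list below matches the SQL Server 2005/2008 output of this undocumented command and may differ on other versions:

--capture DBCC LOGINFO output so the VLF count can be queried
create table #loginfo
(
fileid int, filesize bigint, startoffset bigint,
fseqno int, status int, parity tinyint, createlsn numeric(25,0)
)
insert #loginfo exec ('dbcc loginfo (vlf)')
select count(*) as vlf_count from #loginfo
drop table #loginfo
go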

SCRIPT 1 – Bad scenario

--CONNECT to the test server instance
use master
go

if exists (select name from sys.databases where name = 'vlf')
begin
drop database vlf --(if you need to)
end
go

--create a database on the root of C with a small starting log size
--which then grows in small increments
create database vlf
on (name=vlf, filename='c:\vlf.mdf')
log on (name=vlf_log, filename='c:\vlf_log.ldf',size=1MB,maxsize=5GB,filegrowth=1MB)
go

use vlf
go

--create a dummy table to fill up
create table vlf_test
(
a int identity(1,1) not null,
b char(8000) not null,
c int not null
)
go

insert vlf_test values ('vlftest',0)
go

--create a starting point of backups
backup database vlf to disk = 'c:\vlf.bak' with stats = 1
go
backup log vlf to disk = 'c:\vlf1.bak'
go

--open a 2nd new connection and open an explicit uncommitted transaction
--to ensure that the transaction log cannot truncate
use vlf
go
begin tran
create table opentran
(
opentran int
)
go

--switch back to the first connection
--now run the following to grow the log to create lots of VLFs
--run until the log runs out of space - a convenient stopping point
use vlf
go
set nocount on
go
declare @x int
declare @a char(8000)
declare @b char(8000)
set @a = replicate('a',8000)
set @b = replicate('b',8000)
set @x = 1
while 1=1
begin
update vlf_test set b = @b, c = @x where a = 1
update vlf_test set b = @a where a = 1
set @x = @x + 2
end
go

--let this run until the log runs out of space
--on the 2nd connection, use dbcc loginfo to check how many VLFs you have
DBCC LOGINFO (vlf)

--when you have reached 5GB of log file with several thousand VLFs
--simulate a server disaster by shutting SQL Server down ungracefully
--alternatively you could just power off your server
shutdown with nowait
--now restart the SQL service or power on the server
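Whilst the instance is coming back up, you can poll the database state from another connection; it will show RECOVERING until recovery completes and then ONLINE. A minimal check using the sys.databases catalog view:

--check whether the vlf database is still recovering
select name, state_desc from sys.databases where name = 'vlf'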

I get the following results from the error log when I run this on my test server.

Runtime of filling the log: 9 minutes 21 seconds

Number of VLFs when log full: 20480

Time for VLF to recover fully: 3 minutes 40 seconds

Excerpts from the SQL Server log:

Server starts:

2008-05-16 13:22:40.85 Server Microsoft SQL Server 2005 - 9.00.3186.00 (X64)

VLF db starts up

2008-05-16 13:22:47.97 spid18s Starting up database 'vlf'.

A 2-minute pause whilst nothing is logged about the vlf database, then analysis starts. It is during this period that SQL Server is processing the large number of VLFs.

2008-05-16 13:24:34.28 spid18s Analysis of database 'vlf' (19) is 3% complete (approximately 8 seconds remain). This is an informational message only. No user action is required.

Redo has started

2008-05-16 13:24:45.80 spid18s Recovery of database 'vlf' (19) is 3% complete (approximately 108 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.

Note how many transactions have been rolled forward – this is how many we did in the test

2008-05-16 13:26:21.36 spid18s 26533 transactions rolled forward in database 'vlf' (19). This is an informational message only. No user action is required.

Undo has started

2008-05-16 13:26:21.39 spid18s Recovery of database 'vlf' (19) is 99% complete (approximately 0 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.

Recovery complete

2008-05-16 13:26:21.49 spid5s Recovery is complete. This is an informational message only. No user action is required.

SCRIPT 2 – Good scenario

The script is identical to the bad scenario shown above, apart from the fact that we create the log at a pre-expanded size of 5GB, using this command:

create database vlf
on (name=vlf, filename='c:\vlf.mdf')
log on (name=vlf_log, filename='c:\vlf_log.ldf',size=5GB,maxsize=5GB,filegrowth=1MB)

This gives the following results on the identical server.

Runtime of filling the log: 1 minute 54 seconds

Number of VLFs when log full: 16

Time for VLF to recover fully: 2 minutes 2 seconds

Excerpts from the SQL Server log:

Server starts:

2008-05-16 13:40:16.10 Server Microsoft SQL Server 2005 - 9.00.3186.00 (X64)

VLF db starts up

2008-05-16 13:40:22.64 spid18s Starting up database 'vlf'.

A 6-second pause whilst nothing is logged about the vlf database, then analysis starts. It is during this period that SQL Server is processing the VLFs, but in this case, because there are only 16, it completes much more quickly. This is the fundamental difference between the examples.

2008-05-16 13:40:28.99 spid18s Analysis of database 'vlf' (19) is 2% complete (approximately 157 seconds remain). This is an informational message only. No user action is required.

Redo has started

2008-05-16 13:40:44.22 spid18s Recovery of database 'vlf' (19) is 4% complete (approximately 106 seconds remain). Phase 2 of 3. This is an informational message only. No user action is required.

Note how many transactions have been rolled forward – this is how many we did in the test

2008-05-16 13:42:18.69 spid18s 33216 transactions rolled forward in database 'vlf' (19). This is an informational message only. No user action is required.

Undo has started

2008-05-16 13:42:18.69 spid18s Recovery of database 'vlf' (19) is 99% complete (approximately 0 seconds remain). Phase 3 of 3. This is an informational message only. No user action is required.

Recovery complete

2008-05-16 13:42:18.89 spid5s Recovery is complete. This is an informational message only. No user action is required.

I hope this demo gives further weight to the case for pre-growing your SQL Server files in advance. Feel free to post a comment if you have a question.