SQL Server : large RAM and DB Checkpointing

SQL Server : large RAM and DB Checkpointing

 

Hi everyone,

This post’s purpose is to establish a summary of the specific behaviors with relation to DB Checkpoint that may happen within SQL Server when running with a large quantity of allocated memory and when applicable, how to best address them.

SQL Server 2016 improves many such scenarios out of the box, while SQL Server 2012 and 2014 have received partial backports of SQL Server 2016 Improvements thru Cumulative Updates, that require manual activation.

Finally, some scenarios still need special attention across all versions and are addressed in depth.

Read further to get all details. Here’s a kit of quick jumps links for better navigation.

Introduction : the DPM improvement

We will rely on the existing post https://blogs.msdn.microsoft.com/psssql/2016/04/12/sql-2016-it-just-runs-faster-indirect-checkpoint-default/ which gives a detailed insight on the mechanisms at play here, as preliminary read.

The key takeaways being the following :

The ‘legacy’ implementation of database Checkpointing, which we’ll call FlushCache, needs to scan the entirety of SQL Server’s Buffer Pool for the checkpoint of any given Database.

The duration of a checkpoint is usually driven by the number of dirty pages writes it requests, but with the RAM size of typical large production server ever increasing and reaching values of several Terabytes, the scan itself starts to take a meaningful time. Values recently observed were around 12s for a 2 TB server, but could vary a lot per system.

One must understand that the scan has to happen whatever the size of the database, and whether there are dirty pages or not. In ‘worst’ case, a nearly empty DB with zero dirty page will still get a 12s long checkpoint on a 2 TB server, if using the FlushCache path, because of the scan duration.

Note that the scan duration is directly tied to the size of the Buffer Pool. When we say ‘large RAM’ server, we therefore imply that the RAM is allocated to a given SQL Server instance (thru its max server memory setting), and that this instance has reached its nominal memory consumption. A freshly started instance that is still in ramp-up phase will not suffer from the issue as the Buffer Pool is still relatively small, even if the server sports 8 TB of RAM fully dedicated to SQL Server.

The blog entry already cited mentions a new implementation based on Dirty Page Managers (we’ll use the acronym DPM from now on, be mindful not to mix it with Data Protection Manager), on which the new Indirect Checkpoint option relies, and which doesn’t rely on the Buffer Pool scan anymore. SQL Server 2016 will by default create new databases with Indirect Checkpoint enabled, and therefore uses the DPM approach and won’t suffer from the delay tied to large RAM. All is well then and one could ask for the need of this new blog entry.

Well we will try here to bring clarity on what to expect from the previous various version of SQL Server. Indirect Checkpoint was first introduced in SQL Server 2012, but that doesn’t mean that the full DPM implementation is also available along with it.

Also, while the previous blog entry focused on the main “checkpoint” workflow in SQL Server 2016 (= the result of the recovery interval logic deciding that a checkpoint is needed, or the result of a manual or hardcoded checkpoint), which indeed directly leverages the DPM improvement , there are other places in SQL Server 2016 which still rely on the FlushCache implementation and will still hit the scan delay on large RAM servers. As servers with a few TB of RAM are usually important servers, we thought it best that those occurrences don’t catch our esteemed DBAs by surprise -especially if they assumed that using SQL Server 2016 would completely remove the large RAM delay pattern.

 

Exploration Toolbox to identify long Flushcache

 

First thing first, let’s equip ourselves with a way to make 100% sure of the implication of the FlushCache logic in a given scenario.

We’ll rely mainly on this traceflag : 3504, which will report flushcache stats to the Errorlog (SQL Server 2012 and later will in fact generate those messages automatically when they’re very slow). The TF has no other impact than this extra logging, and is therefore safe to enable.

Let’s go ahead and try this on any SQL Server 2012 or 2014 Database:

 

DBCC TRACEON(3504,-1)

Go

checkpoint

 

Result in ErrorLog :

 FlushCache: cleaned up 0 bufs with 0 writes in 1 ms (avoided 0 new dirty bufs) for db 8:0

average throughput:   0.00 MB/sec, I/O saturation: 0, context switches 0

last target outstanding: 2, avgWriteLatency 858993

 

So this is it : every time we see this sequence in the Errorlog under 3504, we know a FlushCache is taking place. Conversely, and this is important, not seeing the message reliably means a FlushCache is not taking place. We’ll use this approach throughout the article to explore key operations.

Also, in the case of large RAM server, the “in 1 ms” is the delay that will grow to several seconds. If like in the example above the checkpoint generated 0 buf write, then you can be confident that the delay reported is mainly tied to the BufferPool scan as no disk I/O was involved.

Users of SQL Server 2014 and later will have another TF available which doesn’t exist for SQL Server 2012 : 3502. This one will reflect a Checkpoint Operation taking place, whether it uses FlushCache or DPM implementation. This is useful because in the case of DPM path, there would be nothing reported under TF 3504 alone, which is always a bit trickier to interpret.

So on a SQL Server 2014 Database, we get :

 DBCC TRACEON(3502,-1)

DBCC TRACEON(3504,-1)

GO

checkpoint

 

Output in Errorlog :

 Ckpt dbid 9 started

About to log Checkpoint begin.

Ckpt dbid 9 phase 1 ended (8)

FlushCache: cleaned up 0 bufs with 0 writes in 1 ms (avoided 0 new dirty bufs) for db 9:0

           average writes per second:   0.00 writes/sec

           average throughput:   0.00 MB/sec, I/O saturation: 0, context switches 0

           bandwidth server limit:         1.00 MB/sec

About to log Checkpoint end.

Ckpt dbid 9 complete

 

SQL Server 2016 has an upgraded TraceFlag 3504 so that it generates either the FLushCache or the DPM message for each checkpoint, therefore we get a meaningful message in all cases with 3504 only, removing the need for 3502.

So, on a SQL Server 2016 we will see :

 DBCC TRACEON(3504,-1)

raiserror('=====================================Separator',10,1) with LOG

use master -- DBID 1 by default not enabled for indirect checkpoint on SQL 2016

checkpoint

raiserror('=====================================Separator',10,1) with LOG

use test -- DBID 10, using default indirect checkpoint

checkpoint

raiserror('=====================================Separator',10,1) with LOG

In errorlog :

 DBCC TRACEON 3504, server process ID (SPID) 54. This is an informational message only; no user action is required.

=====================================Separator

FlushCache: cleaned up 0 bufs with 0 writes in 1 ms (avoided 0 new dirty bufs) for db 1:0

           average writes per second:   0.00 writes/sec

                      average throughput:   0.00 MB/sec, I/O saturation: 0, context switches 0

           last target outstanding: 2, avgWriteLatency 0

=====================================Separator

DirtyPageMgr::ForceCatchupOrFlushCache: cleaned up 0 dirty pages in 0 ms for db 10

=====================================Separator

The checkpoint against first Master (which still uses the legacy logic) and Test (enabled for indirect checkpoint as any new user DB by default on SQL Server 2016) highlight the difference of checkpoint implementation : we’re getting our well know FlushCache in the first case, and the new DPM entry for the UserDB for the second checkpoint.

This is the improvement the previous blog entry detailed. It’s important to note that while the umbrella name for the improvement is Indirect Checkpoint, you would not get a difference for Checkpoint implementation on a SQL Server 2012 or 2014 database by simply activating indirect checkpoint, it would still use the FlushCache way.

 

Other FlushCache Scenarios

Great, so we have our base situation covered ! Now let’s get into some more advanced considerations.

The top-level operation we’ve been discussing so far is the standard Checkpoint, either triggered manually by Checkpoint TSQL command or initiated by the recovery interval logic.

But we have a need for FlushCache in other places. Let’s list some of the main ones in SQL Server 2012.

  • Backup Database
  • Backup LOG
  • Create Database
  • Add file to Database
  • Restore LOG (2 flushcache per restore log without recovery)
  • Recover Database
  • Shrink Database File
  • SQL Server ‘graceful’ Shutdown

So all of those are also candidates for the BufferPool scan delay on large RAM server. Some of them are rare/isolated enough that it shouldn’t be an issue. But some can happen repeatedly.

Available Improvements for SQL Server 2012 and 2014

Before we go any further we now have to mention a very important update : fix 3158396. The KB describe the specific scenario where the several FlushCaches involved by a create database will lead to a surprisingly long duration in the large RAM scenario. The issue is removed by installing the appropriate build, activating Indirect Checkpoint and enabling TraceFlag 3449. How do you think the issue got removed ? Well that’s right, we’re now calling the DPM logic instead.

And the fix is available for SQL Server 2012 (starting with SP3 CU3) and SQL Server 2014 (starting with RTM CU14 or SP1 CU7). Naturally, since SQL Server fixes are cumulative, any later release would also ship them.

But wait, there’s more to it ! The change brought by 3158396 will also upgrade several (but not all) FlushCache calls in SQL Server 2012 and 2014. All with the same TF 3449 as logical switch, and assuming the relevant Databases have been enabled for Indirect Checkpoint. This includes the mainstream checkpoint operation :

So on a SQL Server 2012 with SP3CU3 or later, let’s do something like :

 ALTER DATABASE test SET TARGET_RECOVERY_TIME = 60 SECONDS;

select database_id, name, target_recovery_time_in_seconds from sys.databases

DBCC TRACEON(3504,-1)

Go

use test

Go

checkpoint

DBCC TRACEON (3449,-1)

raiserror('=====================================Separator',10,1) with LOG

checkpoint

raiserror('=====================================Separator',10,1) with LOG

 

We would see in the ERRORLOG :

 Setting database option target_recovery_time to 60 for database 'test'.

DBCC TRACEON 3504, server process ID (SPID) 54. This is an informational message only; no user action is required.

FlushCache: cleaned up 0 bufs with 0 writes in 1 ms (avoided 0 new dirty bufs) for db 8:0

             average throughput:   0.00 MB/sec, I/O saturation: 0, context switches 0

             last target outstanding: 2, avgWriteLatency 858993

DBCC TRACEON 3449, server process ID (SPID) 54. This is an informational message only; no user action is required.

=====================================Separator

=====================================Separator

That’s right, no FlushCache anymore between the two “Separator” entries (that help highlight the absence of message where we used to have one).

With SQL Server 2014, let’s use both 3504 and 3502 and execute the same sequence as above :

ErrorLog :

 Ckpt dbid 9 started

About to log Checkpoint begin.

Ckpt dbid 9 phase 1 ended (8)

FlushCache: cleaned up 0 bufs with 0 writes in 1 ms (avoided 0 new dirty bufs) for db 9:0

           average writes per second:   0.00 writes/sec

                      average throughput:   0.00 MB/sec, I/O saturation: 0, context switches 0

           bandwidth server limit:         1.00 MB/sec

About to log Checkpoint end.

Ckpt dbid 9 complete

DBCC TRACEON 3449, server process ID (SPID) 63. This is an informational message only; no user action is required.

=====================================Separator

Ckpt dbid 9 started

About to log Checkpoint begin.

Ckpt dbid 9 phase 1 ended (8)

About to log Checkpoint end.

Ckpt dbid 9 complete

=====================================Separator

So thanks to Traceflag 3502 we’re positive a checkpoint is taking place in second case, yet we do not see any FlushCache report anymore. This means the DPM logic is at work.

Activation kit of 3158396 for SQL Server 2012 and 2014

 

Let’s repeat the kit to activate the DPM logic on SQL Server 2012/2014 :

  • required build
    • SQL Server 2012 : SP3 CU3 and later
    • SQL Server 2014 : RTM CU14 or SP1 CU7 and later
  • Indirect Checkpoint activated for the relevant Databases
    • ALTER DATABASE Dbname SET TARGET_RECOVERY_TIME = 60 SECONDS
    • Any non-zero value activates Indirect Checkpoint
  • Traceflag 3449 enabled
    • DBCC TRACEON (3449,-1)
    • (the traceflag activation/deactivation has immediate effect without service restart needed)
    • or more likely, startup parameter for the instance (-T).

Version and Builds Summary for main checkpoint scenarios

 

Let’s try to summarize the new landscape we have as of June 2017 across major and minor versions for Databases with Indirect Checkpoint enabled. We’ll then further discuss each scenario.

 

SQL Server Version Manual and automated DB CheckpointService Shutdown Create Database,add DB FileShrink File Backup LogBackup Database Restore Log AAG RedoCheckpoint
SQL 2012 RTM FlushCache FlushCache FlushCache FlushCache FlushCache
SQL 2012 latest + TF3449 DPM DPM DPM FlushCache FlushCache
SQL 2014 RTM FlushCache FlushCache FlushCache FlushCache FlushCache
SQL 2014 latest + TF3449 DPM DPM DPM FlushCache FlushCache
SQL Server 2016 RTM DPM DPM DPM FlushCache FlushCache
SQL Server 2016 latest DPM DPM DPM FlushCache FlushCache

 

SQL Server 2012 and 2014 scenarios improved by 3158396 :

In the context of large RAM server, avoiding extra delays for Create Database, add files, Shrink Database is obviously always welcome, and evaluating the activation of DPM logic is valid. But some scenarios deserve closer scrutiny.

Backup Log

While the extra delay in a regular TSQL “Backup Log“ run from a client application is not going to be leading to errors/issues beyond the extra duration itself, the situation could be very different when 3rd party VDI-based applications are used for backup, which often embed various hardcoded timeout for intermediate operations that could be reached because of the Checkpoint delay.

Therefore if your VDI backup application starts randomly timing out after say, a RAM upgrade of your SQL Server 2012/2014 architecture, and the frequency of issues is progressively increasing after service restart, with following error messages :

 SQL Server Errorlog :

Error: 3041, Severity: 16, State: 1.

BACKUP failed to complete the command BACKUP LOG someDB. Check the backup application log for detailed messages.

 

Backup application log :

 Backup of SomeDB failed.

The configuration of the Virtual Device Set could not be obtained.

Then it would be valid to activate the TF to materialize FlushCache, and if backup errors appear to be associated with long FlushCache operations, apply the changes to use DPM.

SQL Server Service Shutdown

When SQL Server service is gracefully shutdown (net stop, shutdown TSQL command, SSMS operation), it checkpoints all Databases to optimize recovery for the next start. In large RAM + Large number of user Databases context, this graceful shutdown could start taking extensive time. A ‘shutdown with nowait’ would bypass that but carries the risk of a long running recovery at next service start. So using the ‘DPM Kit’ would be useful here.

FCI (Clustered Instance) Move Group

Furthermore, in the case of an AlwaysOn FCI (Failover Cluster Instance), the manual ‘move’ of SQL Server Group would also trigger the graceful shutdown. But here, the extra delay could exceed Windows Cluster’s Operation Timeout and lead to cluster errors, not to mention a very slow failover. Here again a very good candidate for the switch to DPM. Note that an actual cluster failure (which can be also simulated from the Cluster Manager) would not involve a graceful shutdown, so the scenario doesn’t threaten the automatic failover time in case of actual failure.

 

Scenario still relying on Flushcache in SQL Server 2012, 2014 and 2016

As can be seen on the table above, some operations still use FLushCache even on SQL Server 2016 and must therefore be carefully reviewed and understood.

Restore Log

Each restore log without recovery will initiate two Flushcache, with two more for the last log restored with recovery. That might seem innocuous enough, however in the case of very long log backup chain the number of log restores can greatly amplify the overall duration needed to fully restore a given database: In a recent interaction, a 900 Tlog restore reached a 4h30 duration, most of which was spent in the BufferPool scan that each FlushCache generated. The same restore on a commodity hardware SQL Server instance would take a few minutes. Such a delay may negatively impact RTO. The DPM option is not available here, so here are some guidelines that can be applied alone or in combination.

  • Know what to expect. You’ve already reached this by reading this blog. Then simply adapt your expectations and procedures to account for the delay.
  • Quantify the delay per single log restore, and extrapolate the number of log restores that would start breaching your RTO in case of production database restore. From there, tweak your backup strategy so that your potential log restore sequence doesn’t typically reach the problematic numbers, by increasing frequency of full and differential restores, or slightly decreasing frequency of log backups (keeping RPO in mind of course).
  • Reduce SQL Server max memory right before starting the restore log sequence, to reduce the time taken by individual FlushCache scan. The amount of memory by which you should decrease can be evaluated from the duration of a single FLushCache : it will be linearly dependant of the RAM size, and therefore you should be able to establish how many seconds a given memory decrease should save (eg 2s per each 150 GB). Beware of the dirty page flushes when reducing max memory that could pin your server for some time : you may want to do this by chunks, or manually request Checkpoints of your databases before doing so.
  • Execute the restore sequence on a freshly restarted instance, ie. at a time the BufferPool is still small.

 

Always On Availability Groups (AAG)

The AAG scenario deserves our attention as it can be complex. The first important thing is that a secondary replica would redo any checkpoint happening on the Primary. This happens within the RedoCheckpoint routine, which relies on FlushCache and is independent of the Checkpoint routine used on the primary. That means that a AAG Primary DB could be using the DPM logic for its Checkpoint, but the redo of those checkpoint would still use the FlushCache approach on all the secondaries.

A second aspect, which can compound with the first one, is the backup log operation :

  • A backup log on a primary DB may generate up to two checkpoints on the primary that will therefore get redone on the Secondary.
  • A backup log on a secondary will typically not directly generate checkpoint on the secondary, but the backup event will be propagated to the primary, which as a consequence will execute one Checkpoint against the primary replica of the DB that has been backed up on secondary, which in turn will be redone by the secondary (everyone still with me ? :)).
    • So in the case where you emit a backup log on secondary, you would see a FlushCache happening slightly after the backup completes :
 raiserror('=====================================Separator',10,1) with LOG

backup log AGDB1 to disk='c:\work\deleteme.bak' with init

raiserror('=====================================Separator',10,1) with LOG
    • Would give in the errorlog :
 =====================================Separator

Log was backed up. Database: AGDB1, creation date(time): 2015/11/25(16:43:17), first LSN: 802:48727:1, last LSN: 802:48734:1, number of dump devices: 1, device information: (FILE=1, TYPE=DISK: {'c:\work\deleteme.bak'}). This is an informational message only. No user action is required.

=====================================Separator

FlushCache: cleaned up 0 bufs with 0 writes in 3 ms (avoided 0 new dirty bufs) for db 7:0

           average writes per second:   0.00 writes/sec

          average throughput:   0.00 MB/sec, I/O saturation: 0, context switches 0

           target timeframe: 100, target pages/tick 518.32, ahead of schedule: 0

           last target outstanding: 1600, avgWriteLatency 0
    • As you can see the FlushCache is outside the “Separators”, whereas it would be inside on Primary.

We’ve already established that the backup log can be upgraded to DPM logic, and that it will be used by default on SQL Server 2016. But even if you do so, that will not change the logic used by the redo on Secondary, that will still use FlushCache.

Now imagine the situation where one AAG secondary has a much larger RAM sizing (reaching TB range) to account for DWH/Reporting queries, and is hosting a large number of AAG databases, that have a frequent backup log strategy. The primary would run fast DPM checkpoints for Backup Operations (and for ‘regular’ automatic checkpoints) while the secondary would redo every single checkpoint of each DB and hit the FlushCache delay (exacerbated by large RAM) for each of these. Since the FlushCache is part of the AAG Redo mechanism, it could ultimately negatively impact the redo throughput to a large degree.

 

Conclusion

I hope this article will help you better understand some of the ramifications of very large RAM deployments of SQL Server, and put the best action plan in place in the various scenarios that may arise.

It’s important to remember the typical server RAM sizes that were mainstream at the time of the active development of SQL Server version, eg. for SQL Server 2012, that would be a period of several years some time before 2012. The most recent version of SQL Server is preferable when deploying on top-of-the-line hardware with extreme specs to avoid a generation gap between hardware and software. While this article illustrates the improvements that have been recently released to our older versions, SQL Server 2012 and 2014 have likely reached the full extent of what could be done against their existing infrastructure, while SQL Server 2016 may hope for further changes and improvements on the large RAM topic.

And before enabling DPM for all Databases as a standard practice, I would also recommend the read of blog entry https://blogs.msdn.microsoft.com/sql_server_team/indirect-checkpoint-and-tempdb-the-good-the-bad-and-the-non-yielding-scheduler/and any of its follow-up to identify all the considerations that should be taken into account for that decision.

HTH,

Guillaume Fourrat

SQL Server Escalation Engineer