AUDIT does not respect the MAX_ROLLOVER_FILES configuration

In one of the cases that I worked in the past months, customer reported that the SQL Server audit mechanism wasn´t respecting the option MAX_ROLLOVER_FILES.
The context is:

  1. Many SQL Server instances with had AUDIT configured, writing events to files locally.
  2. A view was created in each instance to access the audit files and maintain a standard nomenclature.
  3. A product from another company keeps reading all the data from those files to centralize them in another repository.
  4. When querying the data the tool always specify a timestamp, filtering the data to get only new records since the last time it was executed.
    1. E.g.: "SELECT * FROM dbo.vw_Audit WHERE event_time > '2016-11-30 18:54:00'"
      The issue here is that customer defined a maximum of 20 rollover files, but at some point they could find 50 or 100 files in the path, consuming all disk space.

The issue here is that customer defined a maximum of 20 rollover files, but at some point they could find 50 or 100 files in the path, consuming all disk space.

To simulate the scenario and understand SQL behavior, I created a new database and using the following instructions (not the complete set), configured an audit.

 CREATE SERVER AUDIT Reg3843
TO FILE
(FILEPATH = 'C:\Temp\Audit',
 MAXSIZE= 25MB,
  MAX_ROLLOVER_FILES = 5,
 RESERVE_DISK_SPACE = OFF)
WITH (ON_FAILURE = CONTINUE, QUEUE_DELAY = 1000)
GO

CREATE DATABASE AUDIT SPECIFICATION dbauditest
FOR SERVER AUDIT Reg3843
ADD (SELECT ON dbo.TabelaBase by public) 
GO

CREATE VIEW dbo.vw_Audit
as
    SELECT * 
    FROM sys.fn_get_audit_file('C:\temp\Audit\*',default,default)
GO

After generating some load to fill up audit files, I executed the query  "select * FROM dbo.vw_Audit WHERE event_time > '2016-11-30 18:54:00' ", that is supposed only to access data in the last audit file.

Using PROCMON from SysInternals (https://technet.microsoft.com/en-us/sysinternals/processmonitor.aspx) I was able to see that SQL Server open a handle for each audit file. It only read metadata for the first ones and when it comes to the last (which has the data related to the timestamp), actual content is read from the file, as you can see the offset changing (image 01). The interesting thing to notice here is that the handle for all the files is only released at the end, showed as CloseFile in image 02.

(Image 01)

(Image 02)

For reference:

CreateFile (https://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx): The function returns a handle that can be used to access the file or device for various types of I/O depending on the file or device and the flags and attributes specified.
The file is actually closed when CloseHandle is invoked (https://msdn.microsoft.com/en-us/library/windows/desktop/ms724211(v=vs.85).aspx).

At this point we have a probable cause of the issue, so to simulate it you need to keep constantly querying the audit view, while adding a load to generate entries in the audit file. While this happens you can use PROCMON to capture data and observe the behavior.
In image 03 we can see thread 36612 writing data to the audit files (this is the thread that is recording the audit events). At some point it reaches end of file (due to 25 MB limit), faces a sharing violation when accessing existing files, and then creates a new audit file, continuing to write from offset 0.

On Image 04, I just filtered out the SUCCESS events and let threads 33612 and 25632, so you can see the concurrency aspect of it. 25632 is the one with open handles to all the files while is reading the data from the last audit file, and when 36612 try to access the file with "Desired Access: Generic Write, Read Attributes, ...", the door is closed for this thread (sharing violation).

(Image 03)


(Image 04)

Reading the documentation (why not do it first?!) we can see the exact moments that SQL Server tries to delete the older files (https://msdn.microsoft.com/en-us/library/cc280448.aspx):

    MAX_ROLLOVER_FILES ={ integer | UNLIMITED }
Specifies the maximum number of files to retain in the file system in addition to the current file. The MAX_ROLLOVER_FILES value must be an integer or UNLIMITED. The default value is UNLIMITED. This parameter is evaluated whenever the audit restarts (which can happen when the instance of the Database Engine restarts or when the audit is turned off and then on again) or when a new file is needed because the MAXSIZE has been reached. When MAX_ROLLOVER_FILES is evaluated, if the number of files exceeds the MAX_ROLLOVER_FILES setting, the oldest file is deleted. As a result, when the setting of MAX_ROLLOVER_FILES is 0 a new file is created each time the MAX_ROLLOVER_FILES setting is evaluated. Only one file is automatically deleted when MAX_ROLLOVER_FILES setting is evaluated, so when the value of MAX_ROLLOVER_FILES is decreased, the number of files will not shrink unless old files are manually deleted. The maximum number of files that can be specified is 2,147,483,647.

This clearly explains the behavior: when that thread receives a SHARING VIOLATION, it cannot delete the files that are beyond the MAX_ROLLOVER_FILES limit. But since this keeps happening all the time, due to the process querying audit files, it seems for the DBA that SQL Server is not respecting the configuration, which is not true.

As you should have anticipated by now, the suggested approach to avoid this issues is to query the file which holds the current timestamp, to avoid having open handles to files that are not necessary and just have old audit entries, something similar to the query below (if you use PROCMON, will see only this file being accessed):

 SELECT * FROM sys.fn_get_audit_file('C:\temp\Audit\Reg3843_7BE85D55-C73F-4C5A-8FAE-BC7F83B17E5C_0_131250056119510000.sqlaudit',default,default) WHERE event_time > '2016-11-30 18:54:00'

An important thing to notice is that this is not a product bug, since it does exactly what the documentation states and is intended by product group.

More important than this audit scenario, is that you should feel comfortable with a lot of tools (like the ones in SysInternals suite) when troubleshooting issues, that can help you to pinpoint the problem faster.

Luciano Caixeta Moreira -  [ Luti ]
luticm79@hotmail.com https://www.linkedin.com/in/luticm www.twitter.com/luticm