Discussion About SQL Server I/O

I received a request today for help on how SQL Server I/O behaves.   As I was writing up my e-mail response I thought it would also make an interesting blog post.

Sent: Friday, January 07, 2011 2:53 AM
To: Robert Dorr
Subject: Async IO questions


What we do know :
Considering that in Windows the I/O API's allow for sync requests via calls to the API such as WriteFile that will not return control to the calling code until the operation is complete.    
Async hands the request off to the operating system and associated drivers and returns control to the calling code.   

[[rdorr]] Correct, this is the key to how SQL does the IO so we can hand off the IO and use other resources (CPU) while the IO is being handled by the hardware.

So , when we look at SQL Server which mostly use Async I/O patterns , it  exposes the pending (async) I/O requests in the sys.dm_io_pending_io_requests ,the column ‘io_pending’ provides insight into the I/O request
and who is responsible for it , if the value is TRUE it indicates that 'HasOverlappedIOCompleted' in the Windows the I/O API returned FALSE and the operating system or driver stack has yet to complete the I/O.   

[[rdorr]] Correct and exactly why we exposed the column.   This lets us tell if it is a Non-SQL issue or if the IO has been completed and SQL is not processing it in a timely fashion.

Looking at the io_pending_ms_ticks indicates how long the I/O has been pending ,if the column reports FALSE for io_pending it indicates that the I/O still has been completed at the operating system level and
SQL Server is now responsible for handling the remainder of the request.

[[rdorr]] Correct.  We snap the start time when we post the IO (right before the read or write call) and the column is materialized as the result of HasOverlappedIoCompleted.   If you look at HasOverlappedIoCompleted this is really a macro that checks the INTERNAL member of the OVERLAPPED structures for != STATUS_PENDING (0x103, 259).   So in a dump we already look at the internal status on the OVERLAPPED for pending status value.
What we do NOT know and would like detailed feedback on :
When SQL server hands of the Async I/O request to the windows I/O API ,
A1) What is the order of action to complete a write from the I/O API's side in a SAN environment ?
A2) Lets consider an environment where we have delayed IO as a result of a substandard SAN or SAN configuration
a) In a case where the Windows I/O API could not complete the I/O (write) due to a time out or a missing I/O request below the driver / API stack, would the Windows I/O API try to reissue the I/O (write)
    OR would the Windows I/O API  thru some call back mechanism in the SQLOS layer notify the SQLOS of such a failure and then rely on the SQLOS to reissue the I/O write request ?
           Provide as much detail as possible.

[[rdorr]] The environment does not matter at all to SQL Server.  It could be direct attached, SAN or other Io subsystem.   SQL Server calls the write routine with the OVERLAPPED structure.   This is a kernel transition which builds an IRP and puts it in motion, returning control from the API to SQL Server.  Each of our IOs are placed on a pending list.  This list is associated with the local SQLOS scheduler the task is running on.  Each time the a switch takes place on the scheduler (< 4ms) the I/O list used HasOverlappedIoCompleted to check the status.   If the IO is no longer STATUS_PENDING the completion routine registered with SQL Server is fired.   This is not a completion routine setup with Windows it is an internal function pointer associated with the IO request structure maintained by SQL Server.   The callback routine will check the sanity of the IO (error code, bytes transferred, read check page id, etc..) and then release the EX latch so the page can be used by any requestor.

Not exactly sure what you mean by timeout of the IO.  SQL Server does not allow CancelIO so the IO will stay pending until it completes or returns an error.   From a SQL Server standpoint there is no timeout only success of failure.    If you are talking about HBA level timeouts it is driver specific and the hardware vendor implements the details.   For example a dual HBA system can have a timeout value.  When the HBA can’t service the IO and the timeout is exceeded it will transfer the IO to the alternate HBA controller.   We have seen issues in the past with this pattern where the timeout is not setup properly.  It was supposed to be set to 5 seconds but instead was 45 seconds.  This meant the IO was pending for 45 secs on an HBA that had lost communications with the SAN and would not move the IO to the second controller until that timeout was expired.   The query timeouts from the application were set to 30 seconds so when we went to flush a log record for a commit the SQL Server queries were getting cancelled at 30 seconds because of the underlying hardware setup.

I am not aware of any of the Windows APIs that retry the IO and even if they do SQL Server would not be aware of it.  The OVERLAPPED structure is only updated by the IRP completion routine here.   When the IRP completes one of the last things that occurs is the kernel callback routine if fired.  This callback routine does the basic cleanup, sets the values in the OVERLAPPED structure such as Internal (status value) InternalHigh (bytes transferred) and such so that a caller to GetOverlappedResult or HasOverlappedIoCompleted can obtain the proper state.   It then checks to see if the OVERLAPPED structure contains a valid event and if so will signal it and finally if the IO request is registered with a completion port it will queue the IO to the completion port.    For disk IO SQL Server does not use the completion port.   SQL Server posts the IOs with the OVERLAPPED and an event.  Every IO on the same SQLOS scheduler used the same event.  This allows a scheduler in an idle wait state wake on the event and check the IO list right away.

With that said there are a couple of retry locations in SQL Server but not like you might expect.   When you first attempt to post the IO Read/Write file you can get the error (1450 or 1452) returned which is an out of resources error indicating that the IO could not be started (no IRP created).  In these cases the SQL Server will Sleep for 1 second and attempt to post the IO again.   In these cases the IO is not on the IO list and may not show up in the DMV because the IO is not pending.

For Reads Only if the SQL Server completion routine detects damage (failure of some sort, bad page header, checksum, …) we can retry the same IO up to 4 times before we consider it a failure.   We have found (SQL and Exchange) that if a read fails you can retry the read and sometimes it works properly.  If you do fail 4 times in a row for the same page it is usually damaged and we will log the error.   In either case SQL Server logs information in the error log about this condition.

A failed write will leave the buffer in the buffer pool, hashed with the BUF->berrno set to a value other than 0.  This will essentially, poison the page and for data integrity SQL Server will no longer allow access to the page nor will it write the page to disk.   Since a write is often offline from the query (checkpoint or lazy writer) the original query that dirtied the data is usually not aware of the situation.  However, if that write was for the log records (WAL protocol requires log records to be flushed before commit if valid) not only is the query notified of the error but the database is often marked suspect.   SQL Server has some mechanisms to detect when a log write fails and will even try to take the database offline and bring it online to rectify the situation.  If this fails the database is marked suspect.  If this succeeds SQL Server has cleared a database error condition and allowed runtime but the DBA should be looking closely at the system and their backup strategies.

With all this said there are some situations we have found with snapshot databases that result in OS error 1450 or 1452 that can’t be resolved on Windows 2003.   The error code was changed by Windows 2008 when the spare file limitation is reached and the 1450/1452 can’t be resolved so SQL will stop the retry attempts and be given a clear error message instead.

b) Taking into account the answer above, How many times would the component responsible retry/reissue the I/O (write) request  before it reports a failure up to the calling components ? Sybase = 10 times.

[[rdorr]] Any write request results in a report to the error log.  As I stated the only failure for a write that is generally retried are (1450 and 1452) error conditions.    I went back to the code in SQL Server that handles acquiring the buffer latch (AcquireIoLatch and AcquireLatchActions) on SQL 2008 and it always checks the BUF->berrorno value as I outlined.  There are some special conditions to allow page level restore on enterprise and such but for the most part once a write fails SQL Server won’t retry the write as we can’t trust the data.   Let me give you an example.

If the database is enabled for checkpoint it enables a feature named constant page validation.   When the page is read into memory or was written and is again clean and has a checksum the LW and other actions may validate the checksum the first time the page prepared to become dirty.  If the checksum fails we know the page has been changed while in memory, incorrectly.  Perhaps bad ram a memory scribbler or such activity.   We would never want to flush this page so the BUF->berrno is set to indicate the checksum failure, preventing any further use of the page for any action.

c) and most importantly, can we influence the wait-time before an I/O request gets re-issued ?
d) Finally what is the difference in SQL2000 and SQL2008 wrt IO retries, and the has the IO engine been changed significantly between these two versions, specifically wrt retries, and layer where IO gets handled.

[[rdorr]] Read retry is the only major difference.  SQL 2000 did not retry a failed read because of a page header problem for example.    SQL 2008 can build a write request that is 2x the size of SQL 2000 but that is a perf issue.  Read ahead sizes have always been controlled by SKU.

      We think the answer is YES, so any answer would require some detail discussion on this point, possibly with a follow-up. We will do some tests to validate our suspicions with evidence (Shannon)

B1) Does IO to the ERRORLOG happen in an asynchronous way with queuing or measures to minimise SMP contention as well as Engine sync IO spin-wait?

[[rdorr]] Error log is written sync, it does not use the same IO path as database and log writes.

    We have trace flags switched on which log's each login into the SQL-server in the ERRORLOG (details/...)
    I.E. if IO to the errorlog disk is slow, or happens from 8 engines, could that slow things down significantly.

[[rdorr]] The error log path is a different animal.   The path for handling a report to the log is really quite long.   It has checks for all the dumptrigger behavior, the message has to be formatted each time which involved FormatMessage and the default heap, the messages are reported to the event log with ReportMessage and this path gets longer on a cluster and we write the information to the log in a serialized way.   Also, this requires the workers to go preemptive to avoid holding the OS scheduler so it puts different pressure on the Windows scheduler than you might expect as well.

One way to remove the Event log from the picture as a test is to start SQL Server with the command line parameter of (-n) to avoid writing to the event log.

So I suppose it would be possible that a slow drive could appear to slow SQL Server down but if you could maintain that logic rate the other parts of the logic path are generally much longer than the error log.  For integrated connections you have to talk to Windows (DC usually), you have to validate that login information such as database context, permissions and such.   The longer part of the login would be outside the actual IO to the drive.

Any of this is really easy to review using the public symbols.   You can use the Windows debugger and set a breakpoint on CreateFileW.  When SQL starts up you can see how it opens the file (which flags) as documented by the Windows API on MSDN.com.

Here is a SQL Server 2008 opening the error log

00000000`05727b08 00000000`035831f1 kernel32!CreateFileWImplementation
00000000`05727b10 00000000`035d4a9f sqlservr!DiskCreateFileW+0xf1
00000000`05727b70 00000000`00cab289 sqlservr!DBCreateFileW+0x1bf
00000000`05727e70 00000000`0123560d sqlservr!CErrorReportingManager::InitErrorLog+0x4e9

       FileName = struct _UNICODE_STRING "C:Program FilesMicrosoft SQL ServerMSSQL10.SQL2008MSSQLLogERRORLOG"
      dwDesiredAccess = 0x40000000  = GENERIC_WRITE(
      dwFlagsAndAttributes = 0x000080  
ß Normal (0x80)


By default it allows system cache involvement to avoid some of the performance issues you might be suspecting, but you can force it to use FILE_FLAG_WRITE_THROUGH  (-T3663).   One thing to watch here is registry checkpoints.  We have seen a few system that when the registry is checkpointed the IO path slows it response and if the system drive is shared with the error logs or even database files that are some really limited conditions that can impact performance.

We open the error log with either FILE_ATTRIBUTE_NORMAL(0x00000080) or FILE_FLAG_WRITE_THROUGH(0x80000000) and we do not use FILE_FLAG_OVERLAPPED(0x40000000) so the IO is handled by Windows sync, NOT Async.

Testing this is really easy.  You can produce messages at any rate you want with the following.

raiserror('Bob' ,25,1) with log

00000000`1606b878 00000000`7773328a kernel32!WriteFile
00000000`1606b880 00000000`04013055 kernel32!WriteFileImplementation+0x36  
00000000`1606b8c0 00000000`0173ffb6 sqlservr!DiskWriteFile+0xa5  
00000000`1606b900 00000000`01740b6e sqlservr!CErrorReportingManager::WriteToErrLog+0xc6  
00000000`1606b960 00000000`01746dbb sqlservr!CErrorReportingManager::SendErrorToErrLog+0xb7e  

[[rdorr]] The other thing I will caution you on is server based trace here.  This has to do the IO and when tracing to a file we guarantee no event loss.  So I have seen lots of cases where customers setup a trace of logins for example but the drive should be as good as the LDF drive and they have pointed the trace destination to a network share or something and caused a bottleneck

Bob Dorr - Principal SQL Server Escalation Engineer

Comments (5)

  1. Brent Ozar says:

    GREAT post.  I really appreciate you putting this into the blog.

  2. Ray Herring says:

    I appreciate this information.  I hope you will expand on it in at least one area.  The scenario is multiple databases (16 to 20 including tempdb) with multiple File Groups and Files in each database.  The SAN Administrator insists that a single LUN is suitable because of all of the wonderful distributed storage behind it.

    How does this affect the I/O queuing?  Are all of the I/O requests for all databases queued up sequentially to the single logical drive?

  3. Neil Hambly says:

    This is a most enlightening post and is great to find such an in-depth post about such an important topic, my knowledge has been

    increased as a result, and I thank you for that gift.

  4. rdorr says:

    Thanks to everyone for the kind comments.  Ray, allow me to fill in some more details related to your comment.

    I have talked about this a bit in my Myths post (blogs.msdn.com/…/sql-server-2005-rdtsc-truths-and-myths-discussed.aspx) but allow me to elaborate further.

    For the vast majority of operations SQL Server does not look at the physical I/O characteristics at all.  We allow full flexibility of the installation to get the most out of your application design.

    Area #1:  Database Creation – We sort all files (data and log) by volume identifier and assign a worker to handle all zero file activity for a given volume.   Much of this has been reduced with the instance file initialization feature, with proper permissions the data files no longer need to be zero’ed.

    Area #2: We create a worker per device specified in the backup/restore command to handle each stream.

    The rest of the read and write actions of SQL Server allow the worker, processing the query, to post their own I/O requests.

    Writes are handled in 3 core ways (Checkpoint, lazy writer, eager writes).  The eager writes are when a query is doing bulk operations and once they have dirties ~32 pages they will start issuing writes to help smooth out the impact on the buffer pool and checkpoint.

    Checkpoint is serialized among databases.  Checkpoint requests are queued and the checkpoint worker (single worker) is used to process a checkpoint request for a single database at a time.   So if you share luns between databases on the same instances only a single checkpoint will occur at a time across those databases.   However, additional eager, and lazy writes could be at play and reads are always a possibility.

    The SAN vendors are correct in that you can share databases but from my perspective it is like consolidating virtual machines (Hyper-V).  You don’t consolidate your 2 largest applications onto the same physical hardware.  If each application needs to run at 80% of a single machine it would be hard to consolidate them onto a single machine at 160% of a single machine without the right hardware layout.

    You combine databases that same way in that you have to consider your business needs and peak load requirements. Let me provide some more concrete examples from some cases I have worked.

    Customer was reporting stalled I/O messages but when I started looking at the data it appears to be an influence outside of SQL Server.  After some investigation I found that the SAN was shared with another system.  When the other system would start rebuilding indexes the first system would get I/O starved.  So until the maintenance window was encountered sharing of the resources worked great.  In fact, it reduced costs for the customer as you can imagine.   All they had to do was synchronized the maintenance jobs between the systems and they could share the LUNs.

    Just last week I helped on an issue where the customer was working with the SAN vendor and the vendor indicated they were bursting 8K I/O requests at rates > 1000 sec at the SAN and causing negative response times.   Looking at this closer it was during checkpoint that the delays where encountered.   We found that checkpoint was not posting larger I/Os because the design of the table was such that there were not dirty pages beside each other.   SQL Server will try to combine multiple, contiguous dirty pages into a single, larger write.   What was happening was a data load into a table with a clustered index on a GUID column.   While this reduced contention at a specific insert location it dirtied pages all over the database file so checkpoint and lazy writer would only post 8K I/Os.  They need to look at spreading the table across multiple files in a file group and aligning those files with separate LUNs so they can still maximize the database concurrency and allow the hardware to maximize the throughput as well.

    I know of a vendor that allows late acquisition.  You can create a database, using instance file initialization, and the physical sectors are NOT acquired until you actually write data on the page.   This means you can create many 1TB databases on the same 1TB I/O path but you only use the space the database has actually loaded data for.   You monitor the need to add physical disks in their facilities and SQL Server is none the wiser of what is going on in the I/O path.   This has advantages for DBA maintenance of database sizes as you can see as well.

    So it all comes down to your database needs.  If you can combine them in a way that allows the SAN to effectively use the resources you can save maintenance, time and money.  You just have to understand your database patterns and peak load needs.

    !!! CAUTION !!!:  I would caution you to look closely at the queries when you are looking at any SAN load.   I have seen many a SAN blamed for poor disk performance and lots of money spent on upgrades when a much simpler solution is at hand.  

    SQL Server added parallel query processing years ago but in my general experience 80% of the time I see a parallel query used it is because of a poorly written query or bad construct.   Take the example of a missing index.  SQL Server many have to sort all the data and filter against the predicate.  So it is faster to do this with multiple workers than a single worker so a parallel query is selected.   This can drive I/O loads up very quickly.  If an index was added the query could do a couple of reads and produce the result in less time using less CPU and I/O resources.   You don’t need a larger SAN or multiple LUNs in this case you just need to understand the database needs.   This is why I always suggest the Q/A server run a series of load tests with the application when sp_configure, max degree of parallism is set to 1 to help point out queries that could be tuned before you have to dig them out of a parallel workload that is impacting resources in a way you were not expecting.


    – Bob Dorr

  5. RDORR says:

    A few more questions I received.

    4. Should a log-write  I/O remain outstanding for an extended period of time, to the point where some component decides the I/O failed.

    [RDORR] Never, it is just an async request and should complete in milli-seconds.

    b. We expect that all log-writes are serialized, i.e. log writes will be forced to be one after the other.

    [RDORR] That is not the case.  We flush the log as needed and in buffers and with multiple LDF files it could don’t assume the order or IO posting.   It is usually going to post in order.

    i.e., If spid 101’s log I/O was issued first, what happens (theoretically) to spid 60, who’s physical log flush was scheduled after 101, and the rest scheduled after these two….

    [RDORR] There are lots of moving pieces here but the design is based on a log block and reference counts.   The SPIDs acquires log space *reserves it* and takes out a reference count on the block then starts writing its log changes (small window but the reference count allows multi sessions to use the log block space at the same time).  In your scenario 101 and 60 would both have reserved space and reference counts.  They each can add their log record information to the block without touching each other and when done release the reference count. (Very small windows)   When a commit is encountered a FlushToLSN is called so 101 would ask that all log records up to its commit be flushed.   All log blocks up to that point are flushed to the log.   This means that all reference counts are released so 60 would have written its log record, the block is compressed and written out.   So when 60 asks for a commit it would be fast because the log records where flushed with 101.

    No one session ‘owns’ the log in anyway.  It is all based on blocks of accumulated log records and reference counting under small windows.   No log block is held very long because the reference count is only for the short time the log record is written to the block.

    c. Please elaborate on what you would expect to happen to the ecosystem on a busy server, and the type of issues we should expect if a number of log-I/O writes are delayed for extended times,

    d. and finally exactly what will will happen if such an I/O should fail. This is an equally important question for us.

    [RDORR] Slow log I/Os will result in longer transactions.   The transactions hold locks so you generally see blocking and concurrency issues.  All worker threads start to get tied up because of the blocking, new connections can’t get workers so things start to timeout and the backup occurs.

    A failed log IO will result in the commit failure.   SQL Server will generally take the database offline and back online to cause all activity to rollback and attempt to return the database to a running state.   If that fails the database is marked suspect so no further changes occur.

    – Bob Dorr

Skip to main content