Should I run SQLIOSim? - An e-mail follow-up from SQL PASS 2008

From: Robert Dorr [mailto:rdorr@microsoft.com]
Sent: Thursday, November 20, 2008 11:29 AM
To: A SQL PASS MEMBER - CSS First Aid Station Question About SQLIOSim

As we discussed at SQL PASS – SQLIOSim is an independent utility and has been used as part of the Windows Hardware Compatibility Testing (WHCL) suite for several years. Many of the major hardware manufacturers use it as part of their release criteria testing as well. I recently worked with a major hardware vendor. They were getting ready to release a new HBA controller series and SQLIOSim was reporting errors. After some investigation the root cause was a firmware bug that would have lead to SQL Server corruption. The firmware was corrected before the product was released.

SQLIOSim performs various I/O patterns. These include ones that mimic various SQL Server I/O requests (read ahead, bcp, backup, dbcc, random writers and readers, …) It also performs stress testing by posting un-throttled amounts of I/O and other random patterns. This is why it is perfect for data consistency checking but cannot be used for performance testing.

I had numerous inquiries at PASS about the output of SQLIOSim and I provided the same guidance. If you are getting errors you should consider it an I/O path problem that will lead to SQL Server corruption with the potential of total database loss. Here is a quote from another customer: “Our file system administrators don’t like us to run SQLIOSim because it puts heavy stress on the SAN. However, the last 3 times we have attempted to add a new machine to production it has triggered an error because the machines where incorrectly configured. I won’t put a machine in production if SQLIOSim does not report success.”

The easiest way to make SQLIOSim completes with success is to set the configuration option to STOP ON ERROR. This way if a data integrity error is encountered the utility logs extended details and shutdown.

For your specific issue I took a look at the page dumps and you really need to revisit the hardware and I/O path configuration.

If you have not read my white papers on “SQL Server I/O Basics” Chapter 1 and Chapter 2 please take the time to do so. It outlines how a stale reads, lost writes, torn pages, checksum failures and other I/O failure conditions are detected and what this means to the data.

https://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/sqlIObasics.mspx
https://www.microsoft.com/technet/prodtechnol/sql/2005/iobasics.mspx

SQLIOSim used the secure crypto APIs to create random data patterns on the pages. On the page is a small header containing the page id and the expected checksum value. These are used to help determine the state of the page.

Diag File

Condition/Characteristics

SqlSimErrorDump01863

Classic Stale Read Pattern

 

The PageId between on expected page image and the page read from disk is the only data that is the same. All other data is different. This is usually a condition where the hardware returned a stale image of the page from cache – firmware bug.

 

NOTE: We re-read this from disk 15 times and still received the older buffer. Probably from hardware cache.

 

Data mismatch between the expected disk data and the read buffer:

File: F:sqliosim.mdf

Offset: 0xD2F9CE000

Expected FileId: 0x0

Received FileId: 0x0

Expected PageId: 0x697CE7

Received PageId: 0x697CE7

Expected CheckSum: 0xA6DDED06

Received CheckSum: 0x76E039CE (does not match expected)

Calculated CheckSum: 0x76E039CE

Expected Buffer Length: 0x2000

Received Buffer Length: 0x2000

Synchronous read was not successful after 15 attempts

SqlSimErrorDump01864

Classic Stale Read Pattern

SqlSimErrorDump01864

Classic Stale Read Pattern

SqlSimErrorDump01888

Classic Stale Read Pattern

SqlSimErrorDump01909

Classic Stale Read Pattern

To help track down the root cause of a stale read issue we often disable the hardware I/O cache. If this the test runs successfully it is the handling of the hardware cache which should be investigated closely.

Bob Dorr
SQL Server Principal Escalation Engineer