How It Works: SQLIOSim - Checksums

SQLIOSim, like its predecessor SQLIOStress, is designed to read pages it has written and validate the data.   SQLIOSim does this using a checksum algorithm.

  • When SQLIOSim starts up it creates a set of buffers and uses the Crypto APIs to fill them with random data.  It then calculates the checksum for each of the buffers.  The buffers and their checksums are stored statically.
  • When a page is DIRTIED by SQLIOSim a random seed is calculated.   The seed value MOD (%) the number of available buffers selects one of the pre-generated buffers, and a memcpy copies it into the write buffer.
  • The page header is correctly updated (these fields are not part of the checksum) and the page is written to disk.
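
The three steps above can be sketched roughly as follows. This is an illustration only, not SQLIOSim source: the names (PatternBuffer, BuildPool, DirtyPage, Checksum), the 16-byte header in front of an 8 KB page, the placeholder rotate-and-add checksum, and the use of std::mt19937 in place of the Crypto APIs are all assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <random>
#include <vector>

constexpr std::size_t kPageSize   = 8192;  // assumed page size (0x2000)
constexpr std::size_t kHeaderSize = 16;    // assumed: four DWORD header fields
constexpr std::size_t kBodySize   = kPageSize - kHeaderSize;

struct PatternBuffer {
    std::array<std::uint8_t, kBodySize> data;
    std::uint32_t checksum;  // computed once at startup, then reused
};

// Placeholder checksum (rotate-and-add). NOT the algorithm SQLIOSim uses.
std::uint32_t Checksum(const std::uint8_t* p, std::size_t n) {
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) sum = ((sum << 1) | (sum >> 31)) + p[i];
    return sum;
}

// Startup: fill each buffer with random bytes and store its checksum.
std::vector<PatternBuffer> BuildPool(std::size_t count, std::mt19937& rng) {
    std::vector<PatternBuffer> pool(count);
    std::uniform_int_distribution<int> byte(0, 255);
    for (auto& b : pool) {
        for (auto& v : b.data) v = static_cast<std::uint8_t>(byte(rng));
        b.checksum = Checksum(b.data.data(), b.data.size());
    }
    return pool;
}

// Dirty a page: seed % pool size picks the pattern, memcpy fills the body,
// then the header fields (outside the checksummed region) are updated.
void DirtyPage(std::uint8_t* page, std::uint32_t pageId, std::uint32_t fileId,
               std::uint32_t seed, const std::vector<PatternBuffer>& pool) {
    assert(!pool.empty());  // the modulo below needs at least one buffer
    const PatternBuffer& src = pool[seed % pool.size()];
    std::memcpy(page + kHeaderSize, src.data.data(), kBodySize);
    const std::uint32_t header[4] = {pageId, fileId, seed, src.checksum};
    std::memcpy(page, header, sizeof(header));  // host byte order
}
```

Because the checksums are calculated once at startup, dirtying a page costs only a modulo and a memcpy, and the seed stored in the header is enough to tell later which pattern (and therefore which checksum) the page should contain.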

Auditing reads are issued at various intervals to validate the checksum.  There are two angles to checksum validation.

  • The checksum calculated for the page should match that stored on the page.  (Page physically damaged)
  • The checksum and seed stored on the page should match those of the last write to the page.  (Page physically damaged, or an incorrect version of the page returned, i.e. a stale read)
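
A minimal sketch of the two checks, assuming the tester keeps the seed and checksum of the last write to each page in memory. The names (PageCheck, LastWrite, ValidatePage) are hypothetical and the classification is simplified; the values in the static_assert are taken from the sample dump later in this post.

```cpp
#include <cstdint>

enum class PageCheck { Ok, Damaged, StaleOrWrongVersion };

// What the last write recorded for this page (hypothetical bookkeeping).
struct LastWrite {
    std::uint32_t seed;
    std::uint32_t checksum;
};

// storedSeed/storedChecksum come from the header of the page just read;
// calculatedChecksum is recomputed over the data that was read.
constexpr PageCheck ValidatePage(std::uint32_t storedSeed, std::uint32_t storedChecksum,
                                 std::uint32_t calculatedChecksum,
                                 const LastWrite& expected) {
    // Angle 1: the page must be internally consistent.
    if (calculatedChecksum != storedChecksum) return PageCheck::Damaged;
    // Angle 2: a consistent page must also be the latest version written.
    if (storedSeed != expected.seed || storedChecksum != expected.checksum)
        return PageCheck::StaleOrWrongVersion;
    return PageCheck::Ok;
}

// Sample dump: calculated == received checksum, but both differ from the last write.
static_assert(ValidatePage(0x1A66, 0x31D71152, 0x31D71152,
                           LastWrite{0x391F, 0x28DBAE9B}) ==
                  PageCheck::StaleOrWrongVersion,
              "internally consistent page that is not the last version written");
```

A page that passes the first check but fails the second is intact yet is not the data most recently written, which is the pattern a stale read produces.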

The SQLIOSim error log contains an error message sequence such as the following.

<ENTRY TYPE='ERROR' TIME='17:21:48' DATE='03/04/08' TID='5080' User='CPU Idle User' File='e:yukonsosbranchsqlntdbmsstorengutilsqliosimbuffer.cpp' Line='791' Func='CBUF::ValidateBuffer' HRESULT='0x80070467' SYSTEXT='While accessing the hard disk, a disk operation failed even after retries.'>

<EXTENDED_DESCRIPTION>Buffer validation failed on F:sqliosim.mdx Page: 87366, offset 0x8</EXTENDED_DESCRIPTION>

</ENTRY>

When a significant error is detected a text file is created showing extended details. In this case the error information shows SQLIOSim encountered a checksum failure and attempted 15 retries, sleeping between attempts. After 16 total reads (the initial read plus 15 retries) the problem could not be resolved, so the extended text file dump was generated.
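
The retry behavior can be pictured with a small loop like the one below. It is a sketch under assumptions: the callback readAndValidate, the function name ReadWithRetries and the pause parameter are hypothetical, and the actual sleep interval used by SQLIOSim is not stated here.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

constexpr int kMaxRetries = 15;  // 1 initial read + 15 retries = 16 reads

// readAndValidate is a hypothetical callback: it re-reads the page and
// returns true when checksum validation passes.
// Returns the number of reads issued, or -1 if every attempt failed
// (the point at which the extended dump file would be written).
int ReadWithRetries(const std::function<bool()>& readAndValidate,
                    std::chrono::milliseconds pause) {
    assert(readAndValidate != nullptr);  // a callback must be supplied
    for (int attempt = 0; attempt <= kMaxRetries; ++attempt) {
        if (attempt > 0) std::this_thread::sleep_for(pause);  // sleep before each retry
        if (readAndValidate()) return attempt + 1;
    }
    return -1;
}
```

Retrying with a pause separates a transient condition, where a later read returns the correct data, from a persistent one where all 16 reads fail.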

<ENTRY TYPE='ERROR' TIME='17:21:48' DATE='03/04/08' TID='5080' User='CPU Idle User' File='e:yukonsosbranchsqlntdbmsstorengutilsqliosimpage.cpp' Line='1043' Func='ErrorDumpHandler' HRESULT='0x00000000' SYSTEXT=''>

<EXTENDED_DESCRIPTION>Dump file successfully written: H:SQLIOSimX86SqlSimErrorDump00006.txt</EXTENDED_DESCRIPTION>

</ENTRY>

The text file contains several sections which I have outlined below. To assist in interpreting the output it helps to understand the page header definition.

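      DWORD m_dwPage;      // page number, bytes 0x0-0x3
      DWORD m_dwFile;      // file id, bytes 0x4-0x7
      DWORD m_dwPageSeed;  // seed chosen when the page was dirtied, bytes 0x8-0xB
      DWORD m_dwCheckSum;  // checksum of the page data, bytes 0xC-0xF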

I used the following page to illustrate the information here.

87366 = 0x00015546, or byte swapped (little-endian, as it appears in the dump) 46 55 01 00

This is an example of a stale read (stable media returned a previous version of the page): the page and file ids match, but the seed, the checksum and all the rest of the data differ. Other failures may show up as a single damaged bit on the page, swapped 512-byte sectors or even the wrong page (offset) returned.

Header

Shows the basic information about the dump.

Data mismatch between the expected disk data and the read buffer:
File: F:sqliosim.mdx
Offset: 0x2AA8C000
Expected FileId: 0x0
Received FileId: 0x0
Expected PageId: 0x15546
Received PageId: 0x15546
Expected CheckSum: 0x28DBAE9B
Received CheckSum: 0x31D71152 (does not match expected)
Calculated CheckSum: 0x31D71152
Expected Buffer Length: 0x2000
Received Buffer Length: 0x2000
Synchronous read was not successful after 15 attempts

Data buffer received

The raw dump of the data as read from stable media.

0x000000 46 55 01 00 00 00 00 00 66 1A 00 00 52 11 D7 31 C4 D7 68 54 52 44 98 F1 32 0D 81 F7 49 81 90 D3 FU......f...R..1..hTRD..2...I...
0x000020 21 14 B9 B7 F5 9E AB 77 11 FC 7C 99 47 4B 11 D5 B2 68 3A 86 50 3E 68 CE 95 61 9E BB 7B C1 24 08 !......w..|.GK...h:.P>h..a..{.$.
0x000040 78 54 68 48 73 92 9A 4F BB 79 83 CE B1 FE 68 D4 67 C0 5B 0A 3C 61 AB 04 D1 39 EF CE F5 D9 AB 74 xThHs..O.y....h.g.[.<a...9.....t
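
To read the header out of a raw dump like this one, decode the first 16 bytes as four little-endian DWORDs. The helper below is a small sketch (ReadLE32 and kReceived are names invented for the example); the byte values are copied from the first row of the received buffer above.

```cpp
#include <cstdint>

// Read a 32-bit little-endian value (least significant byte first).
constexpr std::uint32_t ReadLE32(const std::uint8_t* p) {
    return static_cast<std::uint32_t>(p[0]) |
           static_cast<std::uint32_t>(p[1]) << 8 |
           static_cast<std::uint32_t>(p[2]) << 16 |
           static_cast<std::uint32_t>(p[3]) << 24;
}

// First 16 bytes of the received buffer for page 87366.
constexpr std::uint8_t kReceived[16] = {0x46, 0x55, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00,
                                        0x66, 0x1A, 0x00, 0x00, 0x52, 0x11, 0xD7, 0x31};

static_assert(ReadLE32(kReceived + 0) == 87366, "m_dwPage");
static_assert(ReadLE32(kReceived + 4) == 0, "m_dwFile");
static_assert(ReadLE32(kReceived + 8) == 0x00001A66, "m_dwPageSeed");
static_assert(ReadLE32(kReceived + 12) == 0x31D71152, "m_dwCheckSum");
```

The decoded checksum 0x31D71152 is the "Received CheckSum" value reported in the header section of the dump.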

Data buffer expected

The in-memory version of the expected page data.

0x000000 46 55 01 00 00 00 00 00 1F 39 00 00 9B AE DB 28 FF 5B AD F3 21 2E A6 FD 4B 1D 34 AB 7D 04 38 33 FU.......9.....(.[..!...K.4.}.83
0x000020 48 4C 95 23 83 44 56 58 E6 51 DE 07 64 C4 14 78 8E F7 F7 6D 46 18 6F 39 E9 08 69 F1 7F 58 68 A2 HL.#.DVX.Q..d..x...mF.o9..i..Xh.
0x000040 2C B5 E3 34 84 92 61 30 72 3C 9A 85 4E 50 89 2C 48 AE BC 4A 2C 68 C1 5A E3 4D E1 19 DB F7 DB 56 ,..4..a0r<..NP.,H..J,h.Z.M.....V

Data buffer difference

Shows the bitwise differences between the expected and received buffers. Bytes that are identical in both buffers are left blank.

0x000000                         79 23       C9 BF 0C 19 3B 8C C5 A7 73 6A 3E 0C 79 10 B5 5C 34 85 A8 E0
0x000020 69 58 2C 94 76 DA FD 2F F7 AD A2 9E 23 8F 05 AD 3C 9F CD EB 16 26 07 F7 7C 69 F7 4A 04 99 4C AA
0x000040 54 E1 8B 7C F7    FB 7F C9 45 19 4B FF AE E1 F8 2F 6E E7 40 10 09 6A 5E 32 74 0E D7 2E 2E 70 22
0x000060 51 75 A6 E0 AD 71 D3 8F FD 73 70 D2 53 4C 1C AF 0E C9 62 9B 92 01 FA 9B D7 10 D3 FD 2C 87 94 BB
0x000080 4D 28 88 1E 07 56 46 9A BB B6 99 AB DB 34 1D EF 5B 22 AD B7 F8 5B 6A 1C F8 08 A9 8A E0 50 A5 B7
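
In this sample the difference values work out to a byte-wise XOR of the received and expected buffers. The sketch below checks that for the 16 header bytes; Bytes16 and XorDiff are names invented for the example, and the XOR interpretation is inferred from the sample data rather than taken from SQLIOSim documentation.

```cpp
#include <cstdint>

struct Bytes16 {
    std::uint8_t v[16];
};

// Byte-wise XOR of two buffers; a zero byte means "no difference".
constexpr Bytes16 XorDiff(const Bytes16& a, const Bytes16& b) {
    Bytes16 out{};
    for (int i = 0; i < 16; ++i) out.v[i] = static_cast<std::uint8_t>(a.v[i] ^ b.v[i]);
    return out;
}

// First 16 bytes of the received and expected buffers from the dump.
constexpr Bytes16 kReceived = {{0x46, 0x55, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00,
                                0x66, 0x1A, 0x00, 0x00, 0x52, 0x11, 0xD7, 0x31}};
constexpr Bytes16 kExpected = {{0x46, 0x55, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00,
                                0x1F, 0x39, 0x00, 0x00, 0x9B, 0xAE, 0xDB, 0x28}};
constexpr Bytes16 kDiff = XorDiff(kReceived, kExpected);

// Page and file ids match, so their difference bytes are zero.
static_assert(kDiff.v[0] == 0 && kDiff.v[7] == 0, "ids are identical");
// Seed and checksum differ: 79 23 and C9 BF 0C 19, as in the difference dump.
static_assert(kDiff.v[8] == 0x79 && kDiff.v[9] == 0x23, "seed bytes differ");
static_assert(kDiff.v[12] == 0xC9 && kDiff.v[13] == 0xBF &&
              kDiff.v[14] == 0x0C && kDiff.v[15] == 0x19, "checksum bytes differ");
```

A difference that touches nearly every byte after the ids points to a different version of the whole page, while a single non-zero bit would point to a single damaged bit.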

File IO calls history dump

A dump of the current API call ring buffer. Since this is a ring buffer, information about the specific page could have been lost. Use the page number * 8192 combined with the Offset and Bytes values to locate the entries that contain the offset of the damaged page.

Function             Handle  Offset      Bytes   Ret bytes  Start       End         Rslt  Error  TID
GetOverlappedResult  0x334   0x7ba3e000  32768   32768      1527013921  1527013921  1     0      4956
ReadFile             0x334   0xe9f90000  8192    0          1527013921  1527013921  1     0      4956
GetOverlappedResult  0x334   0x1374e000  73728   73728      1527013921  1527013921  1     0      1560
GetOverlappedResult  0x334   0xe9f8a000  8192    8192       1527013921  1527013921  1     0      5092
GetOverlappedResult  0x334   0xe85ce000  98304   98304      1527013921  1527013921  1     0      6028
ReadFile             0x334   0xe9f8a000  8192    0          1527013921  1527013921  1     0      4956
GetOverlappedResult  0x334   0xe9f7a000  8192    8192       1527013921  1527013921  1     0      4740
ReadFileScatter      0x334   0x7ba3e000  32768   32768      1527013921  1527013921  1     0      4740
GetOverlappedResult  0x334   0xdd99a000  212992  212992     1527013921  1527013921  1     0      4736
GetOverlappedResult  0x334   0xb9e50000  32768   32768      1527013921  1527013921  1     0      4736
ReadFileScatter      0x334   0xb9e50000  32768   32768      1527013921  1527013921  1     0      4140
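
The lookup described above is simple arithmetic: the page's byte offset is page number * 8192, and an entry is relevant when that 8 KB range falls inside [Offset, Offset + Bytes). A minimal sketch, with PageOffset and CoversPage as hypothetical helper names:

```cpp
#include <cstdint>

constexpr std::uint64_t kPageSize = 8192;

// Byte offset of a page within the file.
constexpr std::uint64_t PageOffset(std::uint32_t pageId) {
    return pageId * kPageSize;
}

// True when an I/O of `bytes` bytes starting at `ioOffset` overlaps the page.
constexpr bool CoversPage(std::uint64_t ioOffset, std::uint64_t bytes,
                          std::uint32_t pageId) {
    return ioOffset < PageOffset(pageId) + kPageSize &&
           PageOffset(pageId) < ioOffset + bytes;
}

// Page 87366 sits at the Offset reported in the dump header.
static_assert(PageOffset(87366) == 0x2AA8C000, "87366 * 8192");
// First history row (Offset 0x7ba3e000, 32768 bytes) does not touch the page.
static_assert(!CoversPage(0x7ba3e000, 32768, 87366), "unrelated I/O");
```

For page 87366 none of the rows reproduced above overlap offset 0x2AA8C000, which fits the caveat that the ring buffer may no longer hold the entries for the damaged page.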

These types of failures are typically configuration or hardware related. When you encounter such issues be sure to contact your operations and hardware vendors.

 

Bob Dorr
SQL Server Senior Escalation Engineer