I am very hesitant to post anything that I don’t have the full details on. However, with SSD deployments moving so rapidly I thought it might be helpful to share some of my learning's to date.
I make no claims of being an expert in this area. However, I have been doing research that I found eye opening. After reading this blog you are likely to have open questions, just as I do. This blog is intended to help us all ask the right questions when installing and using SSD based solutions.
SSD is not Magically Failure Resistant
I don’t have to go into this in detail but many of you road out the assumptions that 64 bit processors were twice as fast as 32 bit ones. Having to explain over and over again that the processor speeds are the same and in some instances slower than the 32 bit predecessor, etc.
SSDs have picked up a similar ‘assumption’ that they are more reliant than spinning media. While there are advantages to SSD they are still susceptible to many of the same failure patterns as spinning media.
The failure patterns exhibited by SSD are similar to spinning media. After reading this document (https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf) it is easy to see the need for a caching controller and making sure data is flushed properly.
· Bit Corruption, Records exhibit random bit errors
· Flying Writes, Well-formed records end up in the wrong place
· ShornWrites, Operations are partially done at a level below the expected sector size
· Metadata Corruption, Metadata in FTL is corrupted
· Dead Device, Device does not work at all, or mostly does not work
· Unserializability, Final state of storage does not result from a serializable operation order
Recommendation: Treat the SSD storage as you would spinning media making sure the appropriate safeguards are in place for power failure (I.E. battery backed cache, etc.)
Capacitor Power Up
The power outage testing document points out many interesting issues that might occur and the systems need to protect against. I specifically found, the need for the capacitor to ‘power up’ thought provoking. The charging behavior makes those power outages that occur and 10 or 15 seconds later another power flicker occurs very interesting indeed.
Most SSDs report 512 byte sector sizes but use 4K pages inside the 1MB erasure blocks. Using 512 byte aligned sectors for the SQL Server log device can generate more (R)ead (M)odify (W)rite activities which could contribute to slower performance and drive wear.
Recommendation: Make sure the caching controller is aware of the correct page size of the SSD(s) and is able to align physical writes with the SSD infrastructure properly.
The common view of a newly formatted drive is one holding all zeros. It is interesting to note that an erased block of an SSD is all 1’s making a raw read of an erased block all F’s. It is unexpected for a user to read an erased block during normal I/O operations. However, just last week I reviewed a report that seems to align with this behavior.
A technique we have used in the past is to write a known pattern to the entire drive. Then as we execute database activity against that same drive we can detect incorrect behavior (stale read / lost write / read of incorrect offset / etc.) when the pattern unexpectedly appears.
This technique does not work well on SSD based drives. The erasure and (R)ead (M)odify (W)rite (RMW) activities a write destroys the pattern. The SSD GC activity, wear leveling, proportional/set-aside list blocks and other optimizations tend to cause writes to acquire different physical locations unlike spinning medias sector reuse.
Flying Writes / Incorrect FLT Mapping
Like many of us the flying writes seem more like a servo and head movement problem. However, in December I worked on a system were the GPT data (sectors) that should be at the start and end of the volume would show up during a read of the database file. The first part of the database page was all zeros followed by the GPT information as outlined for the GPT in MSDN. This was occurring without a power outage/cycle and we continue to investigate FLT mapping bug possibilities.
As you can imagine non-serialized writes are a database killer. Breaking the WAL protocol and making it difficult at best to diagnosis how the data transitioned to damaged state.
The firmware used in SSD drives tends to be complex when compared to the spinning media counterparts. Many drives use multiple processing cores to handle incoming requests and garbage collection activities. Just last week I was made aware of a firmware fix. The cores shared the same memory area, leading to a race condition corrupting the SQL Server Log File (ldf.)
Recommendation: Make sure you keep the firmware up-to-date on the SSDs in order to avoid known problems.
Read Data Damage / Wear Leveling
The various Garbage collection (GC) algorithms tend to remain proprietary. However, there are some common, GC approaches that tend to be well known. One such activity is to help prevent repeated, read data damage. When reading the same cell repeatedly it is possible the electron activity can leak and cause neighboring cell damage. The SSDs protect the data with various levels of ECC and other mechanisms.
One such mechanism relates to wear leveling. The SSD keeps track of the write and read activity on the SSD. The SSD GC can determine hot spots or locations wearing faster than other locations. The GC may determine a block that has been in read only state for a period of time needs to move. This movement is generally to a block with more wear so the original block can be used for writes. This helps even the wear on the drive but mechanically places read only data at a location that has more wear and mathematically increases the failure chances, even if slightly.
The reason I point this behavior out is not to specifically recommend anything but to make you aware of the behavior. Imagine you execute DBCC and it reports and error and you run it a second time and it reports additional or a different pattern of errors. It would be unlikely but the SSD GC activity could make changes between the DBCC executions.
OS Error 665 / Defragmentation
I have started investigations as to what fragmentation means on an SSD. In general, there is not much to do with fragmentation on an SSD. There are some defragmentation and trimming activities that can be of note: http://www.hanselman.com/blog/TheRealAndCompleteStoryDoesWindowsDefragmentYourSSD.aspx
Spinning media needs to keep blocks near one another to reduce the drives head movement and increase performance. SSDs don’t have the physical head. In fact, many SSDs are designed to allow parallel operations on different blocks in parallel.
SSD documentation always indicates that serial activities are the best I/O patterns to maximize I/O throughput.
I was recently able to reproduce OS Error 665 (File System Limitation) on the database file (MDF). I can cause the same problem using spinning media but it commonly takes far longer to trigger the error. The scenario is a BCP in of 1TB of data but I started the database at 10GB and only allowed auto-grow in 1MB chunks. When I reached ~480GB I started encountering the OS Error 665.
Using utilities such as FSUTIL I was able to see the SSD has millions of fragments. As my previous blogs for OS Error 665 have highlighted the NTFS storage attribute list is finite. Control of the attribute list size depends on the version of Windows and format settings.
I attempted to use Windows Defrag.exe. It completed quickly but the number of fragments didn’t change significantly. What I ended up doing was:
- Take the database offline
- Copy the MDF off the SSD
- Copy the MDF back to the SSD
- Bring database back online
This reduced the fragments by allowing the SSD firmware to detect the serial write activity. I didn’t test a backup and restore sequence but I suspect a backup and restore with replace sequence would result in similar defragmentation like outcome.
Recommendation(s): Use an appropriate, battery backed controller designed to optimize write activities. This can improve performance, reduce drive wear and physical fragmentation levels.
Consider REFS to avoid the NTFS attribute limitations.
Make sure the file growth sizes are appropriately sized.
I am still trying to understand if there are any real impacts from the SSD compression behaviors. Some of the SSD documentation mentions that writes may be compressed by the SSD. The compression occurring for the SSD is part of the write operation. As long as the drive maintains the intent of stable media, compression could elongate the drive life and may positively impact performance.
- Maintain proper backup and disaster recovery procedures and processes.
- Keep your firmware up-to-date.
- Listen closely to your hardware manufactures guidance.
In am learning something new about SSD deployments every day and plan to post updates when appropriate.
Bob Dorr - Principal SQL Server Escalation Engineer