Building the next generation file system for Windows: ReFS

Article
01/16/2012

We wanted to continue our dialog about data storage by talking about the next generation file system being introduced in Windows 8. Today, NTFS is the most widely used, advanced, and feature rich file system in broad use. But when you’re reimagining Windows, as we are for Windows 8, we don’t rest on past successes, and so with Windows 8 we are also introducing a newly engineered file system. ReFS, (which stands for Resilient File System), is built on the foundations of NTFS, so it maintains crucial compatibility while at the same time it has been architected and engineered for a new generation of storage technologies and scenarios. In Windows 8, ReFS will be introduced only as part of Windows Server 8, which is the same approach we have used for each and every file system introduction. Of course at the application level, ReFS stored data will be accessible from clients just as NTFS data would be. As you read this, let’s not forget that NTFS is by far the industry’s leading technology for file systems on PCs.

This detailed architectural post was authored by Surendra Verma, a development manager on our Storage and File System team, though, as with every feature, a lot of folks contributed. We have also used the FAQ approach again in this post.
--Steven

PS: Don't forget to track us on @buildwindows8 where we were providing some updates from CES.

In this blog post I’d like to talk about a new file system for Windows. This file system, which we call ReFS, has been designed from the ground up to meet a broad set of customer requirements, both today’s and tomorrow’s, for all the different ways that Windows is deployed.

The key goals of ReFS are:

Maintain a high degree of compatibility with a subset of NTFS features that are widely adopted while deprecating others that provide limited value at the cost of system complexity and footprint.
Verify and auto-correct data. Data can get corrupted due to a number of reasons and therefore must be verified and, when possible, corrected automatically. Metadata must not be written in place to avoid the possibility of “torn writes,” which we will talk about in more detail below.
Optimize for extreme scale. Use scalable structures for everything. Don’t assume that disk-checking algorithms, in particular, can scale to the size of the entire file system.
Never take the file system offline. Assume that in the event of corruptions, it is advantageous to isolate the fault while allowing access to the rest of the volume. This is done while salvaging the maximum amount of data possible, all done live.
Provide a full end-to-end resiliency architecture when used in conjunction with the Storage Spaces feature, which was co-designed and built in conjunction with ReFS.

The key features of ReFS are as follows (note that some of these features are provided in conjunction with Storage Spaces).

Metadata integrity with checksums
Integrity streams providing optional user data integrity
Allocate on write transactional model for robust disk updates (also known as copy on write)
Large volume, file and directory sizes
Storage pooling and virtualization makes file system creation and management easy
Data striping for performance (bandwidth can be managed) and redundancy for fault tolerance
Disk scrubbing for protection against latent disk errors
Resiliency to corruptions with "salvage" for maximum volume availability in all cases
Shared storage pools across machines for additional failure tolerance and load balancing

In addition, ReFS inherits the features and semantics from NTFS including BitLocker encryption, access-control lists for security, USN journal, change notifications, symbolic links, junction points, mount points, reparse points, volume snapshots, file IDs, and oplocks.

And of course, data stored on ReFS is accessible through the same file access APIs on clients that are used on any operating system that can access today’s NTFS volumes.

Key design attributes and features

Our design attributes are closely related to our goals. As we go through these attributes, keep in mind the history of producing file systems used by hundreds of millions of devices scaling from the smallest footprint machines to the largest data centers, from the smallest storage format to the largest multi-spindle format, from solid state storage to the largest drives and storage systems available. Yet at the same time, Windows file systems are accessed by the widest array of application and system software anywhere. ReFS takes that learning and builds on it. We didn’t start from scratch, but reimagined it where it made sense and built on the right parts of NTFS where that made sense. Above all, we are delivering this in a pragmatic manner consistent with the delivery of a major file system—something only Microsoft has done at this scale.

Code reuse and compatibility

When we look at the file system API, this is the area where compatibility is the most critical and technically, the most challenging. Rewriting the code that implements file system semantics would not lead to the right level of compatibility and the issues introduced would be highly dependent on application code, call timing, and hardware. Therefore in building ReFS, we reused the code responsible for implementing the Windows file system semantics. This code implements the file system interface (read, write, open, close, change notification, etc.), maintains in-memory file and volume state, enforces security, and maintains memory caching and synchronization for file data. This reuse ensures a high degree of compatibility with the features of NTFS that we’re carrying forward.

Underneath this reused portion, the NTFS version of the code-base uses a newly architected engine that implements on-disk structures such as the Master File Table (MFT) to represent files and directories. ReFS combines this reused code with a brand-new engine, where a significant portion of the innovation behind ReFS lies. Graphically, it looks like this:

NTFS.SYS = NTFS upper layer API/semantics engine / NTFS on-disk store engine; ReFS.SYS = Upper layer engine inherited from NTFS / New on-disk store engine

Reliable and scalable on-disk structures

On-disk structures and their manipulation are handled by the on-disk storage engine. This exposes a generic key-value interface, which the layer above leverages to implement files, directories, etc. For its own implementation, the storage engine uses B+ trees exclusively. In fact, we utilize B+ trees as the single common on-disk structure to represent all information on the disk. Trees can be embedded within other trees (a child tree’s root is stored within the row of a parent tree). On the disk, trees can be very large and multi-level or really compact with just a few keys and embedded in another structure. This ensures extreme scalability up and down for all aspects of the file system. Having a single structure significantly simplifies the system and reduces code. The new engine interface includes the notion of “tables” that are enumerable sets of key-value pairs. Most tables have a unique ID (called the object ID) by which they can be referenced. A special object table indexes all such tables in the system.

Now, let’s look at how the common file system abstractions are constructed using tables.

File structures

As shown in the diagram above, directories are represented as tables. Because we implement tables using B+ trees, directories can scale efficiently, becoming very large. Files are implemented as tables embedded within a row of the parent directory, itself a table (represented as File Metadata in the diagram above). The rows within the File Metadata table represent the various file attributes. The file data extent locations are represented by an embedded stream table, which is a table of offset mappings (and, optionally, checksums). This means that the files and directories can be very large without a performance impact, eclipsing the limitations found in NTFS.

As expected, other global structures within the file system such ACLs (Access Control Lists) are represented as tables rooted within the object table.

All disk space allocation is managed by a hierarchical allocator, which represents free space by tables of free space ranges. For scalability, there are three such tables – the large, medium and small allocators. These differ in the granularity of space they manage: for example, a medium allocator manages medium-sized chunks allocated from the large allocator. This makes disk allocation algorithms scale very well, and allows us the benefit of naturally collocating related metadata for better performance. The roots of these allocators as well as that of the object table are reachable from a well-known location on the disk. Some tables have allocators that are private to them, reducing contention and encouraging better allocation locality.

Apart from global system metadata tables, the entries in the object table refer to directories, since files are embedded within directories.

Robust disk update strategy

Updating the disk reliably and efficiently is one of the most important and challenging aspects of a file system design. We spent a lot of time evaluating various approaches. One of the approaches we considered and rejected was to implement a log structured file system. This approach is unsuitable for the type of general-purpose file system required by Windows. NTFS relies on a journal of transactions to ensure consistency on the disk. That approach updates metadata in-place on the disk and uses a journal on the side to keep track of changes that can be rolled back on errors and during recovery from a power loss. One of the benefits of this approach is that it maintains the metadata layout in place, which can be advantageous for read performance. The main disadvantages of a journaling system are that writes can get randomized and, more importantly, the act of updating the disk can corrupt previously written metadata if power is lost at the time of the write, a problem commonly known as torn write.

To maximize reliability and eliminate torn writes, we chose an allocate-on-write approach that never updates metadata in-place, but rather writes it to a different location in an atomic fashion. In some ways this borrows from a very old notion of “shadow paging” that is used to reliably update structures on the disk. Transactions are built on top of this allocate-on-write approach. Since the upper layer of ReFS is derived from NTFS, the new transaction model seamlessly leverages failure recovery logic already present, which has been tested and stabilized over many releases.

ReFS allocates metadata in a way that allows writes to be combined for related parts (for example, stream allocation, file attributes, file names, and directory pages) in fewer, larger I/Os, which is great for both spinning media and flash. At the same time a measure of read contiguity is maintained. The hierarchical allocation scheme is leveraged heavily here.

We perform significant testing where power is withdrawn from the system while the system is under extreme stress, and once the system is back up, all structures are examined for correctness. This testing is the ultimate measure of our success. We have achieved an unprecedented level of robustness in this test for Microsoft file systems. We believe this is industry-leading and fulfills our key design goals.

Resiliency to disk corruptions

As mentioned previously, one of our design goals was to detect and correct corruption. This not only ensures data integrity, but also improves system availability and online operation. Thus, all ReFS metadata is check-summed at the level of a B+ tree page, and the checksum is stored independently from the page itself. This allows us to detect all forms of disk corruption, including lost and misdirected writes and bit rot (degradation of data on the media). In addition, we have added an option where the contents of a file are check-summed as well. When this option, known as “integrity streams,” is enabled, ReFS always writes the file changes to a location different from the original one. This allocate-on-write technique ensures that pre-existing data is not lost due to the new write. The checksum update is done atomically with the data write, so that if power is lost during the write, we always have a consistently verifiable version of the file available whereby corruptions can be detected authoritatively.

We blogged about Storage Spaces a couple of weeks ago. We designed ReFS and Storage Spaces to complement each other, as two components of a complete storage system. We are making Storage Spaces available for NTFS (and client PCs) because there is great utility in that; the architectural layering supports this client-side approach while we adapt ReFS for usage on clients so that ultimately you’ll be able to use ReFS across both clients and servers.

In addition to improved performance, Storage Spaces protects data from partial and complete disk failures by maintaining copies on multiple disks. On read failures, Storage Spaces is able to read alternate copies, and on write failures (as well as complete media loss on read/write) it is able to reallocate data transparently. Many failures don’t involve media failure, but happen due to data corruptions, or lost and misdirected writes.

These are exactly the failures that ReFS can detect using checksums. Once ReFS detects such a failure, it interfaces with Storage Spaces to read all available copies of data and chooses the correct one based on checksum validation. It then tells Storage Spaces to fix the bad copies based on the good copies. All of this happens transparently from the point of view of the application. If ReFS is not running on top of a mirrored Storage Space, then it has no means to automatically repair the corruption. In that case it will simply log an event indicating that corruption was detected and fail the read if it is for file data. I’ll talk more about the impact of this on metadata later.

Checksums (64-bit) are always turned on for ReFS metadata, and assuming that the volume is hosted on a mirrored Storage Space, automatic correction is also always turned on. All integrity streams (see below) are protected in the same way. This creates an end-to-end high integrity solution for the customer, where relatively unreliable storage can be made highly reliable.

Integrity streams

Integrity streams protect file content against all forms of data corruption. Although this feature is valuable for many scenarios, it is not appropriate for some. For example, some applications prefer to manage their file storage carefully and rely on a particular file layout on the disk. Since integrity streams reallocate blocks every time file content is changed, the file layout is too unpredictable for these applications. Database systems are excellent examples of this. Such applications also typically maintain their own checksums of file content and are able to verify and correct data by direct interaction with Storage Spaces APIs.

For those cases where a particular file layout is required, we provide mechanisms and APIs to control this setting at various levels of granularity.

At the most basic level, integrity is an attribute of a file (FILE_ATTRIBUTE_INTEGRITY_STREAM). It is also an attribute of a directory. When present in a directory, it is inherited by all files and directories created inside the directory. For convenience, you can use the “format” command to specify this for the root directory of a volume at format time. Setting it on the root ensures that it propagates by default to every file and directory on the volume. For example:

 D:\>format /fs:refs /q /i:enable <volume>

D:\>format /fs:refs /q /i:disable <volume>

By default, when the /i switch is not specified, the behavior that the system chooses depends on whether the volume resides on a mirrored space. On a mirrored space, integrity is enabled because we expect the benefits to significantly outweigh the costs. Applications can always override this programmatically for individual files.

Battling “bit rot”

As we described earlier, the combination of ReFS and Storage Spaces provides a high degree of data resiliency in the presence of disk corruptions and storage failures. A form of data loss that is harder to detect and deal with happens due to “bit rot,” where parts of the disk develop corruptions over time that go largely undetected since those parts are not read frequently. By the time they are read and detected, the alternate copies may have also been corrupted or lost due to other failures.

In order to deal with bit rot, we have added a system task that periodically scrubs all metadata and Integrity Stream data on a ReFS volume residing on a mirrored Storage Space. Scrubbing involves reading all the redundant copies and validating their correctness using the ReFS checksums. If checksums mismatch, bad copies are fixed using good ones.

The file attribute FILE_ATTRIBUTE_NO_SCRUB_DATA indicates that the scrubber should skip the file. This attribute is useful for those applications that maintain their own integrity information, when the application developer wants tighter control over when and how those files are scrubbed.

The Integrity.exe command line tool is a powerful way to manage the integrity and scrubbing policies.

When all else fails…continued volume availability

We expect many customers to use ReFS in conjunction with mirrored Storage Spaces, in which case corruptions will be automatically and transparently fixed. But there are cases, admittedly rare, when even a volume on a mirrored space can get corrupted – for example faulty system memory can corrupt data, which can then find its way to the disk and corrupt all redundant copies. In addition, some customers may not choose to use a mirrored storage space underneath ReFS.

For these cases where the volume gets corrupted, ReFS implements “salvage,” a feature that removes the corrupt data from the namespace on a live volume. The intention behind this feature is to ensure that non-repairable corruption does not adversely affect the availability of good data. If, for example, a single file in a directory were to become corrupt and could not be automatically repaired, ReFS will remove that file from the file system namespace while salvaging the rest of the volume. This operation can typically be completed in under a second.

Normally, the file system cannot open or delete a corrupt file, making it impossible for an administrator to respond. But because ReFS can still salvage the corrupt data, the administrator is able to recover that file from a backup or have the application re-create it without taking the file system offline. This key innovation ensures that we do not need to run an expensive offline disk checking and correcting tool, and allows for very large data volumes to be deployed without risking large offline periods due to corruption.

A clean fit into the Windows storage stack

We knew we had to design for maximum flexibility and compatibility. We designed ReFS to plug into the storage stack just like another file system, to maximize compatibility with the other layers around it. For example, it can seamlessly leverage BitLocker encryption, Access Control Lists for security, USN journal, change notifications, symbolic links, junction points, mount points, reparse points, volume snapshots, file IDs, and oplocks. We expect most file system filters to work seamlessly with ReFS with little or no modification. Our testing bore this out; for example, we were able to validate the functionality of the existing Forefront antivirus solution.

Some filters that depend on the NTFS physical format will need greater modification. We run an extensive compatibility program where we test our file systems with third-party antivirus, backup, and other such software. We are doing the same with ReFS and will work with our key partners to address any incompatibilities that we discover. This is something we have done before and is not unique to ReFS.

An aspect of flexibility worth noting is that although ReFS and Storage Spaces work well together, they are designed to run independently of each other. This provides maximum deployment flexibility for both components without unnecessarily limiting each other. Or said another way, there are reliability and performance tradeoffs that can be made in choosing a complete storage solution, including deploying ReFS with underlying storage from our partners.

With Storage Spaces, a storage pool can be shared by multiple machines and the virtual disks can seamlessly transition between them, providing additional resiliency to failures. Because of the way we have architected the system, ReFS can seamlessly take advantage of this.

Usage

We have tested ReFS using a sophisticated and vast set of tens of thousands of tests that have been developed over two decades for NTFS. These tests simulate and exceed the requirements of the deployments we expect in terms of stress on the system, failures such as power loss, scalability, and performance. Therefore, ReFS is ready to be deployment-tested in a managed environment. Being the first version of a major file system, we do suggest just a bit of caution. We do not characterize ReFS in Windows 8 as a “beta” feature. It will be a production-ready release when Windows 8 comes out of beta, with the caveat that nothing is more important than the reliability of data. So, unlike any other aspect of a system, this is one where a conservative approach to initial deployment and testing is mandatory.

With this in mind, we will implement ReFS in a staged evolution of the feature: first as a storage system for Windows Server, then as storage for clients, and then ultimately as a boot volume. This is the same approach we have used with new file systems in the past.

Initially, our primary test focus will be running ReFS as a file server. We expect customers to benefit from using it as a file server, especially on a mirrored Storage Space. We also plan to work with our storage partners to integrate it with their storage solutions.

Conclusion

Along with Storage Spaces, ReFS forms the foundation of storage on Windows for the next decade or more. We believe this significantly advances our state of the art for storage. Together, Storage Spaces and ReFS have been architected with headroom to innovate further, and we expect that we will see ReFS as the next massively deployed file system.

-- Surendra

FAQ:

Q) Why is it named ReFS?

ReFS stands for Resilient File System. Although it is designed to be better in many dimensions, resiliency stands out as one of its most prominent features.

Q) What are the capacity limits of ReFS?

The table below shows the capacity limits of the on-disk format. Other concerns may determine some practical limits, such as the system configuration (for example, the amount of memory), limits set by various system components, as well as time taken to populate data sets, backup times, etc.

Attribute	Limit based on the on-disk format
Maximum size of a single file	2^64-1 bytes
Maximum size of a single volume	Format supports 2^78 bytes with 16KB cluster size (2^64 * 16 * 2^10). Windows stack addressing allows 2^64 bytes
Maximum number of files in a directory	2^64
Maximum number of directories in a volume	2^64
Maximum file name length	32K 255 unicode characters (for compatibility this was made consistent with NTFS for the RTM product)
Maximum path length	32K
Maximum size of any storage pool	4 PB
Maximum number of storage pools in a system	No limit
Maximum number of spaces in a storage pool	No limit

Q) Can I convert data between NTFS and ReFS?

In Windows 8 there is no way to convert data in place. Data can be copied. This was an intentional design decision given the size of data sets that we see today and how impractical it would be to do this conversion in place, in addition to the likely change in architected approach before and after conversion.

Q) Can I boot from ReFS in Windows Server 8?

No, this is not implemented or supported.

Q) Can ReFS be used on removable media or drives?

No, this is not implemented or supported.

Q) What semantics or features of NTFS are no longer supported on ReFS?

The NTFS features we have chosen to not support in ReFS are: named streams, object IDs, short names, compression, file level encryption (EFS), user data transactions, sparse, hard-links, extended attributes, and quotas.

Q) What about parity spaces and ReFS?

ReFS is supported on the fault resiliency options provided by Storage Spaces. In Windows Server 8, automatic data correction is implemented for mirrored spaces only.

Q) Is clustering supported?

Failover clustering is supported, whereby individual volumes can failover across machines. In addition, shared storage pools in a cluster are supported.

Q) What about RAID? How do I use ReFS capabilities of striping, mirroring, or other forms of RAID? Does ReFS deliver the read performance needed for video, for example?

ReFS leverages the data redundancy capabilities of Storage Spaces, which include striped mirrors and parity. The read performance of ReFS is expected to be similar to that of NTFS, with which it shares a lot of the relevant code. It will be great at streaming data.

Q) How come ReFS does not have deduplication, second level caching between DRAM & storage, and writable snapshots?

ReFS does not itself offer deduplication. One side effect of its familiar, pluggable, file system architecture is that other deduplication products will be able to plug into ReFS the same way they do with NTFS.

ReFS does not explicitly implement a second-level cache, but customers can use third-party solutions for this.

ReFS and VSS work together to provide snapshots in a manner consistent with NTFS in Windows environments. For now, they don’t support writable snapshots or snapshots larger than 64TB.