Redesigning chkdsk and the new NTFS health model

We’ve written about tons of improvements in the OS kernel, networking, and file system. While for most client PCs, the tried and true chkdsk utility is one we rarely use anymore except in very rare circumstances, we are using Windows 8 as an opportunity to improve this utility. We wanted to focus on rethinking how the utility works to increase availability and reduce downtime due to chkdsk operations. In looking at the real-world usage of chkdsk, we note that corruptions are exceedingly rare though running chkdsk is not. While we’ve worked hard to reduce the manual invocation of disk tools (like defrag), we know many prefer to run them manually “just in case” and so we worked to improve the overall throughput of chkdsk, since running it reduces availability of the machine. With disk capacities becoming extremely large and multi-disk systems more common, we wanted to improve the utility. Kiran Bangalore, a program manager on our core system team, authored this post.
--Steven


In this blog post, I’ll talk about the new NTFS health model for Windows 8 and our redesigned tool for disk corruption detection and fixing, the chkdsk utility.

We’ve all experienced the frustration that can be caused by an unexpected chkdsk that pops up while restarting a computer at home or a server at the office. Beyond the surprise, there’s the interruption while waiting for the process to complete and Windows to be available. With Windows 8, we provide quick resolution to these problems when they arise, putting the user in control and making systems more available and more scalable.

One of our key design goals for Windows 8 was to increase availability and reduce the overall down-time of systems; this feature, along with other storage features such as Storage Spaces and the new ReFS file system, helps reduce the complexity of fixing corruptions and increases the overall availability of the entire system.

The previous chkdsk and NTFS health model

While exceedingly rare, there are a variety of unique causes for disk corruption today. Whether they are caused by media errors from the hard disk or transient memory errors, corruptions can happen in file system metadata (the information used to map physical blocks to that vacation photo you took last year). To maintain access to your data, Windows must isolate and correct these errors, and the way to do this is by running the chkdsk utility.

In past versions, NTFS implemented a simpler health model, where the file system volume was either healthy or not. In that model, the volume was taken offline for as long as necessary to fix the file system corruptions and bring the volume back to a healthy state. Downtime was directly proportional to the number of files in the volume.

Reliable telemetry data from systems all over the world have shown us that, although corruptions are quite rare, when chkdsk is needed, it can take anywhere from a few seconds to a few hours to run, depending on the number of files on the drive, and even longer on larger storage servers.

In Windows Vista and Windows 7, we made significant optimizations to the speed of chkdsk but, as hard disk capacities have continued to double every 18 months and the number of files per volume is increasing at an equal rate, chkdsk has taken longer and longer to complete (even with speed improvements).
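To get a feel for what “doubling every 18 months” means for a scan whose cost tracks the number of files, here is a small arithmetic sketch. The function name and the chosen time spans are only for illustration; the one input taken from the text above is the 18-month doubling period.

```python
def growth_factor(years: float, doubling_period_years: float = 1.5) -> float:
    """Multiplier after `years` for a quantity that doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


# If the file count follows capacity, a scan that visits every file
# grows by the same multiplier over the same period.
for years in (3, 6, 9):
    print(f"after {years} years: x{growth_factor(years):.0f}")
```

In other words, a per-file speedup of a few times is used up within a few years of growth, which is why we looked at changing what the offline phase scales with rather than only making it faster.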

So in Windows 8, we’ve changed the way we approach the health model of NTFS and changed the way we fix corruptions so as to minimize the downtime due to chkdsk. We’ve also introduced a new file system for the future, ReFS, which does not require an offline chkdsk to repair corruptions.

File system health redone

The incredible growth in storage capacity and user data files has necessitated the redesign of the NTFS health model and chkdsk.

There were three important requirements for file system health that our customers made clear:

  1. Downtime caused by file system corruptions must be zero in continuously available configurations and nearly zero in all other configurations.
  2. A User or Administrator must be made aware of the file system health at all times.
  3. A User or Administrator should be able to easily fix their file system, in a scheduled manner, when a corruption occurs.

Our design included changes both in the file system and the chkdsk utility to ensure the best availability. The new design splits the process into the following phases to ensure a coordinated, rapid, and transparent resolution to the corruption.

Flow diagram: Detect corruption (NTFS detects a perceived anomaly in file system metadata) ARROW TO Online self-healing (NTFS attempts to rapidly self-heal; volume remains online) ARROW TO Online verification (NTFS validates if issue is transient or genuine; volume remains online) ARROW TO Online identification and logging (if not self-healed, NTFS identifies and logs corrective actions; user or admin is notified; volume remains online) ARROW TO Precise and rapid correction (user or admin can take the volume offline when convenient, and logged corruptions are then corrected in seconds; with CSV, I/O is transparently paused for rapid correction and then automatically resumed).

We developed a new method of communication that describes types of corruptions as “verbs” that act upon the key components and points of the design – the file system driver (NTFS), the self-healing module, the spot-verification service, and the chkdsk utility. All file system corruptions are classified as needing one of 18 different “verbs” that we’ve defined in Windows 8. We have also left room for possible new verb definitions that can help us diagnose issues even better in the future.

Key design changes to help improve availability:

    1. Online self-healing: The NTFS self-healing feature was introduced in Windows Vista (and in Windows Server 2008) to reduce the need to run chkdsk. Self-healing is a feature built into NTFS that fixes certain classes of corruptions encountered during normal operation, and can make these fixes while still online. If all issues that are detected are self-healed online, there is no need for an offline repair. In Windows 8 we increased the number of issues that can be handled online and hence reduced any further need for chkdsk.
    2. Online verification: Some corruptions are intermittent due to memory issues and may not be a result of an actual corruption on the disk; so we added a new service to Windows 8, called the spot verification service. It is triggered by the file system driver and it verifies that there is actual corruption on the disk before moving the file system along in the health model. This new service runs in the background and does not affect the normal functioning of the system; it does nothing unless the file system driver triggers it to verify a corruption.
    3. Online identification and logging: When an issue is verified, this triggers an online scan of the file system, which runs as a maintenance task in the file system. In Windows 8, scheduled tasks that are for the maintenance of the computer run only when appropriate (during idle time, etc.). This scan can run as a background task while other programs continue to run in the foreground. As the file system is scanned, all issues that are found are logged for later correction.
    4. Precise and rapid correction: At the user or administrator’s convenience, the volume can be taken offline, and the corruptions logged in the previous step can be fixed. The downtime from this operation, called “Spotfix,” is only seconds, and on Windows Server 8 systems with cluster shared volumes, we’ve eliminated this downtime completely. With this new model, chkdsk offline run time is now directly proportional to the number of corruptions, rather than being proportional to the number of files as in the old model.

Bar chart compares chkdsk on Windows Server 2008 R2 vs. on Windows Server 8. On the older system, it takes close to 2 minutes to check and fix 100 million files, 4:48 to check and fix 200 million files, and more than 6 minutes to check and fix 300 million files. On Windows Server 8, it takes less than 2 seconds to spotfix each of these.
Comparison of Windows Server: chkdsk /f vs chkdsk /spotfix

  5. Better manageability – To enable better transparency into the new health model, Windows now exposes the state of the file system via the following interfaces:
    • Action Center – The health of the drive is most visible in the Action Center as the “Drive Status” (see figure below), which tells you when you need to take an action to bring the volume to a healthy state.
    • Explorer: The health state is also exposed in Explorer, under Drive properties.
    • PowerShell: You can also invoke the chkdsk functionality using a new cmdlet in PowerShell, REPAIR-VOLUME, which can be helpful for remote management of file system health.
    • Server Manager: In Windows Server, you can also manage the volume health states directly from the server manager utility.
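To make the scaling claim in item 4 concrete, here is a minimal sketch of the two downtime models. The per-file and per-corruption costs are made-up constants chosen only to show the shape of each curve; they are not measured Windows values, and the function names are hypothetical.

```python
# Illustrative constants only (assumptions, not measurements).
MICROSECONDS_PER_FILE_CHECKED = 1            # hypothetical cost to check one file
MICROSECONDS_PER_CORRUPTION_FIXED = 500_000  # hypothetical cost to apply one logged fix


def full_offline_downtime_seconds(file_count: int) -> float:
    """Old model: the volume stays offline while every file is checked."""
    return file_count * MICROSECONDS_PER_FILE_CHECKED / 1_000_000


def spotfix_downtime_seconds(logged_corruptions: int) -> float:
    """New model: the volume is offline only while logged corruptions are fixed."""
    return logged_corruptions * MICROSECONDS_PER_CORRUPTION_FIXED / 1_000_000


for files in (100_000_000, 200_000_000, 300_000_000):
    old = full_offline_downtime_seconds(files)
    new = spotfix_downtime_seconds(logged_corruptions=2)
    print(f"{files:>11,} files: full offline pass {old:.0f}s, spotfix {new:.0f}s")
```

Under these assumptions the first column grows with the file count while the second stays flat, because the number of logged corruptions does not depend on how many files the volume holds.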

The new file system health model

In the new health model, the file system health status transitions through four states – some that are simply informational, and others that require you to act. The health states are:

  1. Online and healthy
  2. Online spot verification needed
  3. Online scan needed
  4. Spot fix needed

Flow goes from Healthy, to Spot verification needed, to Online scan needed, to Spot fix needed, to Healthy.
Windows 8 file system health states
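One way to picture the four states and the flow between them is as a small state machine. This is only a sketch of the model described in this post, not how NTFS implements it: the event names are hypothetical labels, and the “transient” transition back to healthy is an assumption based on the online verification step.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "Online and healthy"
    SPOT_VERIFICATION_NEEDED = "Online spot verification needed"
    ONLINE_SCAN_NEEDED = "Online scan needed"
    SPOT_FIX_NEEDED = "Spot fix needed"


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (Health.HEALTHY, "corruption_not_self_healed"): Health.SPOT_VERIFICATION_NEEDED,
    (Health.SPOT_VERIFICATION_NEEDED, "corruption_confirmed"): Health.ONLINE_SCAN_NEEDED,
    # Assumption: an issue found to be transient returns the volume to healthy.
    (Health.SPOT_VERIFICATION_NEEDED, "corruption_was_transient"): Health.HEALTHY,
    (Health.ONLINE_SCAN_NEEDED, "scan_logged_fixes"): Health.SPOT_FIX_NEEDED,
    (Health.SPOT_FIX_NEEDED, "spotfix_completed"): Health.HEALTHY,
}


def next_state(state: Health, event: str) -> Health:
    """Return the next health state; events that do not apply leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


state = Health.HEALTHY
for event in ("corruption_not_self_healed", "corruption_confirmed",
              "scan_logged_fixes", "spotfix_completed"):
    state = next_state(state, event)
    print(f"{event} -> {state.value}")
```

Only the last transition in this walk-through involves taking the volume offline; the volume stays online in every other state.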

    1. Online and healthy – In this state there are no detected file system corruptions and there is no action required of you. The file system remains in this state most of the time.

Action Center shows no action needed

    2. Online spot verification needed – The file system stays in this transient state only for a brief instant after the file system finds a corruption that it cannot self-heal; it puts the volume in this state until the spot verification service verifies the corruption. Again, there is no user action required.
    3. Online scan needed – When the spot-verification service confirms the corruption, it puts the file system in the “online scan needed” state. In the next maintenance window, an online scan is performed; there is no user action required. This state is reflected in the Action Center, so you can run the scan manually if you want to do that before the next maintenance window. The scan is run as a background operation, which means that you can continue using the computer while the scan is performed. During this online scan, all verified issues and fixes are logged for later repair. On Windows Server 8 systems, idle time is determined by monitoring the CPU and storage idle times.

Message saying Scan drive for errors, link to Open Action Center

Under Maintenance, it says to scan drive for errors, We found potential errors on a drive, and need to scan it. You can keep using the drive during the scan. Button: Run scan. Action link: Turn off messages about drive status.

    4. Spot fix needed – The file system puts the volume in this state after the online scan is completed, if required, and this state is reflected in the Action Center. On client systems, you can restart the PC to fix all the file system issues logged in the previous step. The restart is quick (adding just a few additional seconds) and the PC is returned to a healthy state. For Windows Server 8 systems, a restart is unnecessary to fix corruptions on data volumes. Administrators can simply schedule a spot fix during the next maintenance window.

Notification says Restart to repair drive errors. Click to restart your PC.

Restart to repair drive errors (important). We found errors on a drive. To repair these errors and prevent loss of data, restart your PC now. Button: Restart.

More advanced users who want to avoid restarting their system to fix a corruption on a non-system volume can open the Properties dialog for the affected volume; on the Tools tab, they’ll see an option to check the drive for file system errors. Corruption on drives that are not currently in use can be fixed without needing a full restart of the computer.

On Tools tab, under Error checking, This option will check the drive for file system errors. Button: Check

Error Checking (Chk2 (D:)) Repair this drive We found errors on this drive. To prevent data loss, repair this drive now. Repair drive You won't be able to use the drive while Windows repairs errors found in the last scan. You might need to restart your computer. Button: Cancel.

Conclusion

In Windows 8, we have made the detection and correction of file system errors more transparent and less intrusive. We believe these changes will be a welcome enhancement for you and we look forward to hearing your feedback.

-- Kiran Bangalore
    Senior Program Manager, Windows Core Storage and File Systems

FAQ

Q) Will the new health model work on removable drives?
Yes, this works on removable drives that report fixed media, like most external hard drives.

Q) How do I enable the new file system health model?
You don’t need to do a thing—the new file system health model is enabled by default.

Q) Will the new file system health model apply to Windows Server?
Yes, the health model is identical for both server and client. One thing that will be different by default is that the data drives will not be checked or fixed during boot of the system – this maintenance will be left to the administrator when time permits.

Q) Can I move between Windows 8 and Windows 7 and not affect the file system health model?
Yes, the file system health model will adapt to whichever operating system version it is mounted on.

Q) Will ReFS need to run chkdsk?
ReFS follows a different model for resiliency and does not need to run the traditional chkdsk utility.

Q) Will I ever need to run the old chkdsk /f?
There are cases where failing hardware can produce such severe corruption as to make the file system un-mountable; in these cases, you should perform a full, offline chkdsk to fix the file system. If for some reason this fails, we recommend that you restore from a backup.

Q) Is a reboot absolutely required to fix non-system volumes?
No, but the Action Center generally provides the simplest experience. If you’re an advanced user, you can fix non-system volumes by opening the properties of the drive, or by running chkdsk /scan <volume>: and chkdsk /spotfix <volume>: from the command line.
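If you script this, a small helper can build the two command lines from the answer above. This is a sketch under stated assumptions: the helper name is hypothetical, it only constructs the argument list and does not run anything, and it accepts a single drive letter as the volume.

```python
import re
from typing import List


def chkdsk_command(drive_letter: str, action: str) -> List[str]:
    """Build the arguments for `chkdsk /scan <volume>:` or `chkdsk /spotfix <volume>:`."""
    if action not in ("scan", "spotfix"):
        raise ValueError("action must be 'scan' or 'spotfix'")
    if not re.fullmatch(r"[A-Za-z]", drive_letter):
        raise ValueError("drive_letter must be a single letter, for example 'D'")
    return ["chkdsk", f"/{action}", f"{drive_letter.upper()}:"]


# Online scan first, then the short offline fix for whatever the scan logged.
print(" ".join(chkdsk_command("d", "scan")))
print(" ".join(chkdsk_command("d", "spotfix")))
```

On a Windows 8 system, the returned list could be passed to Python’s subprocess.run, typically from an elevated prompt; the sketch stops short of that so it can be read and tested anywhere.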

Q) I run chkdsk /f often to check the status of our drives; is that needed anymore?
No, the system will inform you when a corruption is found, and you can then choose to run the chkdsk /scan to detect all the issues. An online chkdsk /scan will not take away from the availability of the drive or system.

Q) I run read-only chkdsk today to check the status of our drives; do I still need to do this?
No, we recommend you run chkdsk /scan instead, since this will also perform all possible online repairs and will also prepare for a spotfix, if needed.