Random musings on the introduction of long file names on FAT


Tom Keddie thinks that the format of long file names on FAT deserves an article. Fortunately, I don't have to write it; somebody else already did.

So go read that article first. I'm just going to add some remarks and stories.

Hi, welcome back.

Coming up with the technique of setting the Read-only, System, Hidden, and Volume attributes to hide LFN entries took a bit of trial and error. The volume label was the most important part, since that alone was enough to get 90% of the programs that did low-level disk access to lay off those directory entries. The other bits were added to push the success rate ever closer to 100%.
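The attribute trick is easy to express concretely. Here's a small sketch (the constant values come from the published FAT on-disk format, where Read-only is 0x01, Hidden is 0x02, System is 0x04, and Volume is 0x08; the function name is mine, not anything from the actual implementation):

```python
ATTR_READ_ONLY = 0x01
ATTR_HIDDEN    = 0x02
ATTR_SYSTEM    = 0x04
ATTR_VOLUME_ID = 0x08
# All four bits together mark an LFN fragment.
ATTR_LONG_NAME = ATTR_READ_ONLY | ATTR_HIDDEN | ATTR_SYSTEM | ATTR_VOLUME_ID  # 0x0F

def is_lfn_entry(attr: int) -> bool:
    """An entry with all four of those bits set is an LFN fragment,
    not a real file; the mask ignores the Archive/Directory bits."""
    return (attr & 0x3F) == ATTR_LONG_NAME
```

A plain volume label (0x08 alone) or an ordinary file doesn't match, so only the deliberately "impossible" combination is claimed for LFN fragments.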

The linked article mentions rather briefly that the checksum is present to ensure that the LFN entries correspond to the SFN entry that immediately follows. This is necessary so that if the directory is modified by code that is not LFN-aware (for example, maybe you dual-booted into Windows 3.1), and the file is deleted and the directory entry is reused for a different file, the LFN fragments won't be erroneously associated with the new file. Instead, the fragments are "orphans", directory entries for which the corresponding SFN entry no longer exists. Orphaned directory entries are treated as if they were free.
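For reference, the checksum itself is a simple rotate-and-add over the eleven bytes of the short name, as documented in Microsoft's FAT specification. A sketch in Python (the sample name below is just for illustration):

```python
def lfn_checksum(short_name: bytes) -> int:
    """Checksum over the 11-byte 8.3 name (8 name bytes plus 3 extension
    bytes, space-padded, no dot), per the FAT specification: rotate the
    running sum right one bit, then add the next byte, modulo 256."""
    assert len(short_name) == 11
    s = 0
    for b in short_name:
        s = (((s & 1) << 7) + (s >> 1) + b) & 0xFF
    return s
```

Each LFN entry stores this one-byte value; if a non-LFN-aware tool deletes the short entry and the slot is later reused for a different name, the stale fragments will almost certainly fail the checksum against the new short name and be treated as orphans.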

The cluster value in an LFN entry is always zero for compatibility with disk utilities that assume that a nonzero cluster means that the directory entry refers to a live file.

The linked article wonders what happens if the ordinals are out of order. Simple: If the ordinals are out of order, then they are invalid. The file system simply treats them as orphans. Here's an example of how out-of-order ordinals can be created. Start with the following directory entries:

(2) "e.txt"
(1) "Long File Nam"
"LONGFI~1.TXT"
(2) "e2.txt"
(1) "Long File Nam"
"LONGFI~2.TXT"

Suppose this volume is accessed by a file system that does not support long file names, and the user deletes LONGFI~1.TXT. The directory now looks like this:

(2) "e.txt"
(1) "Long File Nam"
(free)
(2) "e2.txt"
(1) "Long File Nam"
"LONGFI~2.TXT"

Now the volume is accessed by a file system that supports long file names, and the user renames Long File Name2.txt to Wow that's a really long file name there.txt.

(2) "e.txt"
(4) "e.txt"
(3) "ile name ther"
(2) "really long f"
(1) "Wow that's a "
"WOWTHA~1.TXT"

Since the new name is longer than the old name, more LFN fragments need to be used to store the entire name, and oh look isn't that nice, there are some free entries right above the ones we're already using, so let's just take those. Now if you read down the table, you see that the ordinal goes from 2 up to 4 (out of order) before continuing in the correct order. When the file system sees this, it knows that the entry with ordinal 2 is an orphan.
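The ordinal rule can be modeled in a few lines. This is a deliberately simplified sketch (tuples standing in for on-disk directory entries, with the checksum test omitted), not the actual file-system code:

```python
def collect_long_names(entries):
    """entries: a list of ('lfn', ordinal, text) and ('sfn', name) tuples
    in directory order. LFN fragments are stored highest-ordinal first and
    must count down to 1 immediately before their short-name entry.
    Returns {short_name: long_name}; fragments whose ordinals break the
    sequence are treated as orphans and ignored."""
    result = {}
    pending = []     # fragments gathered so far, in directory order
    expected = None  # the ordinal we expect next (counts down to 1)
    for entry in entries:
        if entry[0] == 'lfn':
            _, ordinal, text = entry
            if expected is not None and ordinal == expected - 1:
                pending.append(text)
            else:
                # Out of sequence: everything gathered so far is orphaned.
                pending = [text]
            expected = ordinal
        else:
            _, name = entry
            if expected == 1:
                # Fragments are stored last-first, so reverse before joining.
                result[name] = ''.join(reversed(pending))
            pending, expected = [], None
    return result
```

Feeding it the final directory listing from the article, the entry with ordinal 2 at the top is discarded as an orphan when ordinal 4 appears, and only "Wow that's a really long file name there.txt" is recovered for WOWTHA~1.TXT.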

One last historical note: The designers of this system didn't really expect Windows NT to adopt long file names on FAT, since Windows NT already had its own much better file system, namely, NTFS. If you wanted long file names on Windows NT, you'd just use NTFS and call it done. Nevertheless, the decision was made to store the file names in Unicode on disk, breaking with the long-standing practice of storing FAT file names in the OEM character set. The decision meant that long file names would take up twice as much space (and this was back in the days when disk space was expensive), but the designers chose to do it anyway "because it's the right thing to do."

And then Windows NT added support for long file names on FAT and the decision taken years earlier to use Unicode on disk proved eerily clairvoyant.

Comments (21)
  1. Anonymous says:

    Using UTF-8 would use half as much space most of the time and be as future-proof as using UTF16-LE was.

    ["Hey, let's design our storage format to use an encoding that hasn't been invented yet!" -Raymond]
  2. Anonymous says:

    Wait, does this mean I can trigger some fun weirdness by picking two LFNs with the same checksum and deleting one of them using a non-LFN aware application?

  3. Anonymous says:

    "Using UTF-8 would use half as much space most of the time and be as future-proof as using UTF16-LE was."

    Unicode codepoints were 16-bit until Unicode 2.0, so UTF-8 didn't exist yet.

  4. Anonymous says:

    the decision taken years earlier to use Unicode on disk proved eerily clairvoyant.

    Heh. No, it proves that designers were very smart and had experienced problems with forward compatibility before. :-) The "right thing" may look expensive but in reality is often the cheapest way of doing things, even when discounting future expenses with a high interest rate…

    @B: I'm not quite sure that "fun" would cover it. But weirdness sounds about right. :-D

  5. Anonymous says:

    Hey, let's design our storage format to use an encoding that hasn't been invented yet!

    Rob Pike says UTF8 was invented in 1992, when was this FAT work done?

    (But the RFC is dated 1996, so maybe it was not widely known until then)

    Using UTF-8 would use half as much space most of the time

    This is, of course, an American perspective. The Japanese perspective might be "Using UTF-8 would use one and a half times as much space most of the time".  Western Europeans would just say "meh" (except that word had not been invented).

  6. Anonymous says:

    Awww. Too much time spent double-checking before being the first with a witty 'time machine' response to the UTF-8 thing. Oh well. Now I'm wondering if there could have been a nearly space-equivalent workaround if they hadn't done the Unicode thing years earlier.

  7. Anonymous says:

    More relevant than the possible existence of UTF-8 when the system was designed is the fact that Unicode was at that time supposed to be a 16-bit character set, so UCS-2 would be a fixed-width encoding. The practical advantages of working with a fixed width per character would have been compelling enough to justify the increased space demand over UTF-8.

    However, once the upper planes came along and UCS-2 became UTF-16, with surrogate pairs and variable-width characters, all of that was negated and one might as well have used UTF-8 (with the added benefit that many more developers would have natural occasion to test that their code was variable-width clean, instead of now, where characters that require surrogates in UTF-16 are all somewhat obscure). Unfortunately, by then it was too late to go back on the fundamental decision.

  8. Anonymous says:

    I used the equivalent of meh long before 1995, except it was pronounced more like ænh.

  9. Yuhong Bao says:

    On that matter, I wonder why in Win9x the conversion from/to Unicode was done in IFSMGR with no interface exposed to user mode that I know of to directly access the Unicode filenames.

  10. Anonymous says:

    @dave, Greeks and Russians may well say "meh" to the difference between UTF-16 and UTF-8, space-wise. But a very large fraction of the characters in a typical text in a Western European language are ASCII, so the space savings in UTF-8 are quite relevant here. (The letters that aren't ASCII are 2 bytes wide in UTF-16 as well as UTF-8, but they are not the majority of characters in a typical text.)
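The size tradeoff the commenters are debating is easy to check directly; a quick illustration (the sample strings are mine):

```python
def byte_counts(s: str):
    """Return (UTF-8 byte count, UTF-16-LE byte count) for a string."""
    return len(s.encode('utf-8')), len(s.encode('utf-16-le'))

# Mostly-ASCII Western European text: UTF-8 is close to half the size.
# "café" -> 5 bytes in UTF-8 (only the é takes two), 8 bytes in UTF-16-LE.
# All-Cyrillic text: both encodings use two bytes per character.
# "привет" -> 12 bytes in UTF-8 and 12 bytes in UTF-16-LE.
```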

  11. Anonymous says:

    I used a column handler shell extension on Windows 2000/XP to see short file names in Details view (for some legacy apps and games that require specifying the path in 8.3 style). I can't on Vista/7 because column handlers are gone. So I have to use dir /x. Just another example where GUI is destroyed/broken and a step back to command line has to be taken. Point to be noted: bring back column handlers.

  12. Anonymous says:

    What really surprises me is how they were able to kludge the long file names into the FAT. Most of the time situations like these require creating a hidden file that contains the new information.

  13. Anonymous says:

    LFN wasn't so fun before FAT32 came out because the root directory in FAT12/16 has a fixed number of file slots. Imagine the fun I had when I tried for the first time to copy a whole bunch of files to a 360K floppy and only 15 of them would go!

    That said, I think that LFN is a wonderful hack; for the most part, it was completely backwards-compatible, didn't mess with the disk structure, and was space-efficient. I salute the team that developed it!

  14. Yuhong Bao says:

    See this Long Filename Specification 0.5 from December 1992 for some more history:

    http://www.osdever.net/…/LongFileName.pdf

  15. Anonymous says:

    @ErikF: Just reformat the disk with a larger root directory size in the disk's boot record first! (Don't forget to reduce the number of clusters appropriately.)

  16. Anonymous says:

    @Neil: I tried that and found out that MS-DOS 6.x will corrupt any disks formatted that way. Win 9x and NT seem to work just fine though.

  17. Anonymous says:

    First BillG demo I ever did was of LFNs and showing how backwards compatible it was with 3.1. That was back when Bill would come to our building and go from office to office watching the demos and asking questions.  

  18. Anonymous says:

    @Joshua: MS-DOS is inclined to ignore bits of the boot sector, depending on what's in the OEM ID. Unfortunately the Windows 95 volume tracker likes to scribble all over the OEM ID.

  19. Anonymous says:

    Thanks to Unicode support in LFNs, it was always fun when we had to deal with floppies from Russia (Windows would display just a bunch of underscores for filenames, and you couldn't access those files, though IIRC either scandisk or chkdsk did "fix" this by renaming the files to garbage names that could then be opened).

  20. Yuhong Bao says:

    ender: I think this is a direct consequence of the codepage conversions being done in IFSMGR. I wonder why it was designed that way.

Comments are closed.