FAT Filesystem Performance Issues for DVR

Time to switch gears a bit away from audio...

First a minor digression into what we were building (which some people may find interesting, and this is still a Multimedia blog ;-). Then I'll get to how this involves the FAT filesystem.

One of the major components of the Tomatin feature pack was a DVR engine to allow OEMs to develop their own digital video recorders for IP video broadcasts. At its heart, a DVR must support the ability to write a video stream to hard disk and at the same time read a video stream back from hard disk for playback, all at real-time speeds and without glitches or dropouts. Other important features include the ability to fast-forward and rewind playback (while still displaying video onscreen).

The core DVR design was ported from another Microsoft team building an NTSC broadcast DVR. I'm never sure how much info I can give out in these blogs about other groups and project code names, so I don't mean to deny credit: those guys (IMHO) did a nice job and delivered to us exactly what they promised (how often can you say that about code that gets dropped in your lap from another team?).

In any case, their DVR implementation included a hardware encoder which took NTSC video and encoded it to an MPEG2 Program Stream. The DVR engine handled writing/reading this stream to/from a hard disk, and also managed a myriad of other details like tracking and indexing key frames for fast forward/rewind/seek, managing A/V synchronization, compensating for drift between the record and playback streams, and a ton of other stuff. The output of the DVR engine was also an MPEG2 Program Stream, for which they had a hardware decoder which would generate and output the final video and audio data.

IPTV systems typically transmit their data as MPEG2 Transport Streams. These are somewhat different from MPEG2 Program Streams, and the two are neither subsets nor supersets of each other. To convert from one to the other you really need to do a full demultiplex/remultiplex operation. The bad news for us was that our DVR was designed to accept Program Streams, not Transport Streams; and we frankly didn't have the time or expertise to rewrite and retest that portion of the code. The good news was that IPTV hardware typically includes a Transport Stream demultiplexer which can produce demultiplexed (AKA elementary) video and audio streams, so we just needed to remultiplex the streams to get a Program Stream. The bad news was that we didn't have such a multiplexer, nor did anyone else inside of Microsoft as far as we could tell.

To make a long story short, one of our developers wrote, from scratch, an MPEG2 Program Stream multiplexer which takes elementary audio and video streams and turns them into an MPEG2 Program Stream suitable for use by the DVR engine. This DShow component shipped as part of the Tomatin feature pack.

For testing purposes, we also ported the desktop's implementation of the MPEG2 Transport/Program Stream Demultiplexer. We used this component strictly for our own testing, and never ran it through its own system/unit test cycle, so this component didn't ship with Tomatin. However, it did turn out to be quite stable and it was made available at https://codegallery.gotdotnet.com/wincedemux.

Back to our story...

At the start of the project our goal was to record and play back standard-definition content (e.g. 640x480). Partway through the project a key OEM came on board and added the requirement to handle HDTV content at 20Mbit/sec. Ugh.

The hardware this had to run on was essentially a low-end x86-compatible processor with hardware acceleration of MPEG2 TS demultiplexing and audio/video decoding and playback. Storage was on a fairly generic 7200RPM 300GB IDE hard disk. All our code had to do (performance-wise) was remultiplex the elementary streams into a program stream, write it to disk, read it from disk, demultiplex it into separate audio and video streams, and send those streams to the decoding/playback hardware. While this might not seem like that much work, at 20Mbits/sec it was pretty tough to do on low-end x86 hardware, and we spent a lot of time optimizing our inner loops (e.g. the code to scan for the next MPEG2 packet start code).
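
A rough idea of that particular inner loop (a simplified sketch, not our actual shipping code): MPEG2 start codes begin with the byte sequence 00 00 01 followed by a start-code/stream ID byte, so scanning for the next packet boils down to a tight byte search over the buffer.

    // Simplified sketch of an MPEG2 start-code scan (not the shipping code).
    // Returns the offset of the next 00 00 01 prefix in buf, or -1 if none is found.
    int FindStartCode(const unsigned char *buf, int cbBuf)
    {
        for (int i = 0; i + 3 < cbBuf; i++)
        {
            if (buf[i] == 0x00 && buf[i + 1] == 0x00 && buf[i + 2] == 0x01)
                return i;       // buf[i + 3] is the start-code/stream ID byte
        }
        return -1;
    }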

The only filesystem that was well tested and available within Microsoft for WinCE was the FAT filesystem. We had heard anecdotal stories of performance issues with FAT on large hard disks, but we didn't have time in the schedule to write our own filesystem (we did spend a bit of time looking around for something elsewhere in the company that we could reuse, but ultimately didn't find anything suitable from a risk/schedule standpoint).

During design and testing we did experience a number of bottlenecks due to the FATFS design, and we got really familiar with the FATFS architecture and its limitations. Ultimately, working with the filesystem team, we got fixes for issues that could be fixed (which have since been released as CE 5.0 QFEs), and developed workarounds in our code for problems inherent in the filesystem. That's what this blog is about. Keep in mind that I'm a multimedia guy, so I'll probably gloss over things a bit. Feel free to comment on anything I get wrong or don't explain clearly.

To understand FAT limitations you need a little background.

Sectors: Sectors are the smallest unit of storage on a disk. A hard disk sector is (typically) 512 bytes. A 300GB hard disk will have approximately 586 million sectors.

Clusters: The cluster is a logical grouping of sectors. For a given FAT disk format, a cluster is made up of a fixed number of sectors which is a power of 2. Filesystems typically organize data at the cluster level. Choosing an appropriate cluster size is a trade-off: the bookkeeping overhead for a file is going to be proportional to the number of clusters it contains, and a file always utilizes an integer number of clusters. If you choose too small a cluster size, the amount of overhead (both storage and CPU) associated with a file may be quite large. If you choose too large a cluster size, you'll end up wasting unused cluster space at the end of each file. (Note: Other non-FAT filesystems have come up with a variety of ways to alleviate this problem, but we're dealing with FAT here).

FAT: The FAT, or File Allocation Table, is an array at the start of the disk where each element in the array is associated with a specific cluster. There are different variants of FAT, distinguished by the size of each FAT entry. We're only interested in FAT32, where each FAT entry is a 32-bit value, and which is the only version appropriate for relatively large hard disks.

For FAT32 the cluster size may vary depending on how the disk was formatted. However, the maximum cluster size for FAT32 is 32KB, which is 64 sectors. On a 300GB hard disk there will be roughly 9 million clusters. The FAT itself will take approximately 37.5MB of disk space, spanning roughly 73 thousand sectors. The FAT filesystem can also keep a backup copy of the FAT, doubling that overhead (I'm not sure whether it does so by default, or whether that's controlled by the FATFS_ENABLE_BACKUP_FAT flag).
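
For the arithmetic-minded, those numbers fall out of a quick calculation like this (the exact figures shift a little depending on whether you treat a gigabyte as 10^9 or 2^30 bytes):

    // Back-of-the-envelope math for a 300GB FAT32 volume with 32KB clusters.
    const unsigned long long diskBytes       = 300ULL * 1000 * 1000 * 1000;  // ~300GB
    const unsigned long long bytesPerCluster = 32 * 1024;                    // 64 sectors of 512 bytes
    const unsigned long long numClusters     = diskBytes / bytesPerCluster;  // ~9.2 million clusters
    const unsigned long long fatBytes        = numClusters * 4;              // ~37MB of 32-bit FAT entries
    const unsigned long long fatSectors      = fatBytes / 512;               // ~72 thousand sectors of FAT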

This is one of the first limitations of the FAT design: the 32k maximum cluster size is too small for such large hard disks, and contributes to the overhead in a variety of ways.

Each entry in the FAT represents a specific cluster on the disk. The set of clusters which make up a specific file is tracked by using the FAT array as a singly linked list: the FAT entry for a given cluster contains the number of the next cluster in the file, and the entry for the last cluster in the file holds an end-of-chain marker (0xFFFFFFFF). To seek to a specific location in a file, one must traverse the file's FAT chain from its first cluster until reaching the cluster associated with that location. A 4GB file will contain roughly 125,000 clusters, so seeking to the end of that file entails traversing roughly 125,000 FAT entries. This is one source of performance bottlenecks.
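
To make that concrete, here's a rough sketch of what a seek costs at the FAT level. For illustration I'm pretending the whole FAT lives in an in-memory array; the real filesystem reads FAT sectors through its FAT cache, which is what makes long chains genuinely expensive.

    #include <windows.h>

    // Illustrative sketch only: walk a file's cluster chain to find the cluster that
    // holds byteOffset. 'fat' is a pretend in-memory FAT32 array; 'firstCluster' comes
    // from the file's directory entry.
    DWORD ClusterForOffset(const DWORD *fat, DWORD firstCluster,
                           ULONGLONG byteOffset, DWORD bytesPerCluster)
    {
        DWORD cluster = firstCluster;
        for (ULONGLONG skip = byteOffset / bytesPerCluster; skip > 0; skip--)
        {
            cluster = fat[cluster];     // follow the singly linked list one hop
            if (cluster >= 0x0FFFFFF8)  // end-of-chain marker: offset is past the end of the file
                return 0;
        }
        return cluster;                 // near the end of a 4GB file this loop runs ~125,000 times
    }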

Unallocated clusters are represented by the value 0 in the cluster's associated FAT entry. There is no mechanism built into the FAT on-disk format to track unallocated clusters, so when allocating a new cluster to a file, the filesystem code must search the FAT until it finds an unallocated one. On a large, mostly allocated hard disk this can take a very long time: yet another bottleneck.

It might help to understand how FAT searches for a free cluster, which is optimized to attempt to allocate clusters contiguously: whenever FAT needs to allocate a new cluster to a file, it searches the FAT table starting at the last cluster it successfully allocated. If you’re doing a single allocation of multiple clusters, it remembers (as a local variable in the loop) this last cluster it allocated to that specific file and will do a good job allocating contiguously from that point for that file. However, if you’re growing the file via individual cluster allocations (e.g. performing lots of relatively small append operations to the file to grow it), each allocation is going to start based on the last cluster allocated on the volume (by any file, not just yours). In this case, if you’ve got multiple threads allocating individual clusters at the same time, there’s a higher probability the files are going to take alternating clusters, causing worse fragmentation.
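
In other words, the allocator behaves roughly like the sketch below (illustrative only, not the actual FATFS source). Because the hint is per-volume rather than per-file, two threads each appending one cluster at a time will leapfrog each other through this loop and end up with interleaved clusters.

    #include <windows.h>

    // Illustrative sketch only (not the actual FATFS source): allocate one cluster by
    // scanning forward from a volume-wide "last allocated" hint. 'fat' is a pretend
    // in-memory FAT32 array where a zero entry means the cluster is free; clusters 0
    // and 1 are reserved, so data clusters start at 2.
    DWORD AllocateCluster(DWORD *fat, DWORD numFatEntries, DWORD *pHint /* >= 2 */)
    {
        DWORD cluster = *pHint;
        for (DWORD i = 0; i < numFatEntries; i++)
        {
            if (++cluster >= numFatEntries)
                cluster = 2;                 // wrap around, skipping the reserved entries
            if (fat[cluster] == 0)           // free cluster found
            {
                fat[cluster] = 0x0FFFFFFF;   // claim it; the caller links it into the file's chain
                *pHint = cluster;            // the next search (for any file) starts here
                return cluster;
            }
        }
        return 0;                            // nothing free: the volume is full
    }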


Moving on to the directory structure: Each directory on a disk consists of a file entry table. Each file within a directory occupies an entry in that table (I'll get to long file names in a bit). Each of these entries includes (among other things) the file's 8.3 filename, the index of the first cluster in the file, and a 32-bit filesize. The use of a 32-bit filesize means that individual files are limited to 4GB. This is another limitation, especially for DVR scenarios where extremely large files may be generated.
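
For reference, the 32-byte short-name entry looks roughly like this (per the published FAT32 on-disk layout; double-check the spec before relying on it). The 32-bit FileSize field at the end is where the 4GB ceiling comes from:

    #include <stdint.h>

    #pragma pack(push, 1)
    struct FatDirEntry              // one 32-byte 8.3 directory entry
    {
        uint8_t  Name[11];          // 8.3 name, space padded, no '.'
        uint8_t  Attr;              // attribute flags (read-only, hidden, directory, ...)
        uint8_t  NTRes;             // reserved
        uint8_t  CrtTimeTenth;      // creation time, 10ms units
        uint16_t CrtTime;
        uint16_t CrtDate;
        uint16_t LstAccDate;
        uint16_t FstClusHI;         // high 16 bits of the first cluster number (FAT32 only)
        uint16_t WrtTime;
        uint16_t WrtDate;
        uint16_t FstClusLO;         // low 16 bits of the first cluster number
        uint32_t FileSize;          // 32-bit file size in bytes: hence the 4GB limit
    };
    #pragma pack(pop)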

File entries within a given directory are not kept in any particular order. When a file is opened, the table for the directory is scanned, starting at the beginning, to search for the file. When a new file is created, the entire table for that directory must be scanned to determine whether the file already exists. When a file is deleted, the entry for that file is cleared, but the remainder of the table is not repacked to reclaim the entry (although the entry may be reused if a new file is created). What all this means is that the time to open or create a file rises linearly with the number of files in the directory, so a directory with a large number of files may hurt performance.

The use of long file names can impact directory performance as well: long file names are stored as additional directory entries, so each file whose name requires one occupies extra entries in the table. On WinCE, a long filename does not need to be generated if the name is 8.3 compatible: all uppercase, no Unicode characters, no more than 8 characters in the base name, and no more than a 3-character extension. For example, FILENAME.TXT would not generate a long filename and uses only one directory entry, whereas FileName.Txt, filename.txt, and FILENAME0.TXT would all generate long filenames. Frankly, this wasn't something we thought of during the DVR project (I didn't think of it until I started working on this blog). Our DVR filenames weren't 8.3 compatible and thus did generate long filenames, but I don't recall us ever observing a problem caused by this.
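
If you do care about avoiding the extra entries, a quick check along these lines tells you whether a name will fit a plain short entry (a simplification; the real 8.3 rules also allow a handful of punctuation characters):

    #include <string.h>
    #include <ctype.h>

    // Simplified sketch: returns true if an ASCII name should fit a plain 8.3 entry
    // without generating long-file-name entries. (The real rules allow a few extra
    // punctuation characters; this errs on the side of caution.)
    bool Fits83(const char *name)
    {
        const char *dot = strchr(name, '.');
        size_t baseLen = dot ? (size_t)(dot - name) : strlen(name);
        size_t extLen  = dot ? strlen(dot + 1) : 0;

        if (baseLen == 0 || baseLen > 8 || extLen > 3)
            return false;
        if (dot && strchr(dot + 1, '.'))                     // more than one '.'
            return false;

        for (const char *p = name; *p; p++)
        {
            if (*p == '.')
                continue;
            unsigned char c = (unsigned char)*p;
            if (!isupper(c) && !isdigit(c) && c != '_')      // lowercase, space, Unicode, etc.
                return false;
        }
        return true;
    }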

As an aside: during our testing we hit a bug in the FAT filesystem code (which has since been fixed via QFE). There was an overflow in the cluster calculation when creating files which used the last cluster before the 4GB limit (i.e. larger than 4GB minus 32KB). The result was that when attempting to create a file of that size, the system would allocate all available clusters on the disk to the file until it ran out of disk space and returned a failure code. At that point the disk would be left in a state where the directory showed the file we created as having 0 bytes allocated to it, yet the disk appeared full. Deleting the file would correct the situation.

Now, to summarize the problems we had, and the fixes or workarounds we came up with:

1. The 4GB filesize limit: There's really nothing that can be done about this. For our project the DVR engine code was already designed to record video streams as a set of relatively small files (e.g. holding a few minutes of data each), rather than one large file. This solved a couple of different problems: it avoided the 4GB filesize limit and it reduced the seek overhead of large files. It also solved a problem peculiar to DVR implementations: when recording to a "temporary" buffer which should only keep the last 60 minutes of video, one needs to conceptually record to the end of the file while simultaneously truncating the file from the beginning. There's no way to do this with a single file, whereas with multiple files it can be done easily by just deleting the earliest file in the set. I believe we ended up with files in the 128MB to 256MB range (this is a registry-tunable parameter). 
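
Conceptually the temporary buffer ends up looking something like the sketch below (made-up names, not the DVR engine's actual code): the recorder keeps appending fixed-size segment files, and once the set exceeds the retention window it deletes the oldest segment, which is how you "truncate from the front".

    #include <windows.h>
    #include <stdio.h>

    // Illustrative sketch only (made-up names): drop the oldest segment files once the
    // set exceeds the retention window. Returns the index of the new oldest segment.
    DWORD TrimRecordingBuffer(const WCHAR *szDir, const WCHAR *szPrefix,
                              DWORD oldestIndex, DWORD newestIndex, DWORD maxSegments)
    {
        while (newestIndex - oldestIndex + 1 > maxSegments)
        {
            WCHAR szPath[MAX_PATH];
            swprintf(szPath, MAX_PATH, L"%ls\\%ls%04u.mpg",
                     szDir, szPrefix, (unsigned)oldestIndex);
            DeleteFileW(szPath);     // discards the earliest few minutes of video
            oldestIndex++;
        }
        return oldestIndex;
    }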


2. Seeking within very large files: When seeking within a file, we have to traverse the file’s FAT chain to find the desired cluster. As the file grows toward the 4GB filesize limit, the number of links that need to be traversed gets very large (for 32KB clusters, up to 2^17 entries). As mentioned above, we partially avoided this issue by not using the full 4GB filesize available to us.

However, at the same time, the filesystem team developed an improved algorithm which should greatly help this situation. The original FAT code we were working with kept a single per-open-file 32-bit cached entry of the last cluster accessed within the file, so sequential reads/writes always hit this cache and didn’t need to retraverse the FAT chain. However, for DVR applications we were sometimes thrashing this entry by simultaneously reading and writing different locations within the file. The recent QFE fixed this by adding additional cache entries so multiple threads shouldn’t thrash against each other. There are probably still pathological situations with perf issues (e.g. repeatedly seeking backward within a very large file). Note: the “cache” entries I’m talking about here are not the same as the FAT cache I'll mention at the end of this blog; that's a cache of the actual sectors comprising the FAT table. Increasing the FAT cache size would help this issue as well, though.
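
Conceptually the per-handle cache is just something like this (illustrative only):

    #include <windows.h>

    // Illustrative only: a cached (file offset, cluster) pair lets a sequential read or
    // write resume the FAT walk from where it left off instead of re-walking the chain
    // from the start. The QFE in effect keeps several of these per open file, so a
    // reader and a writer working at different offsets stop evicting each other's entry.
    struct ClusterCacheEntry
    {
        ULONGLONG byteOffset;   // file offset this entry corresponds to
        DWORD     cluster;      // cluster number that holds that offset
    };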

3. Appending to a file on a very large, almost full disk: When appending to the end of a file, the filesystem needs to search the FAT looking for a free cluster. On very large disks which are almost full, this search can potentially take a very long time. The DVR code already had a solution to this problem: it spins off a low-priority thread which runs out ahead of our disk-write thread and preallocates data files to be used later by the write thread. Since that time the filesystem team has shipped a QFE which keeps an in-memory data structure to track free clusters; this QFE should greatly improve performance in this situation.
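
The shape of that trick is roughly the following (an illustrative sketch with made-up names; the real DVR code is considerably more involved):

    #include <windows.h>

    // Illustrative sketch only (made-up names): a low-priority worker keeps a couple of
    // preallocated segment files queued up so the real-time write thread never pays for
    // the free-cluster search itself.
    struct PreallocContext
    {
        volatile LONG fShutdown;        // set to nonzero to stop the worker
        LONG (*SegmentsReady)();        // hypothetical: how many preallocated files are queued
        void (*PreallocateNext)();      // hypothetical: the CreateFile/SetEndOfFile sequence shown later
    };

    DWORD WINAPI PreallocThreadProc(LPVOID pv)
    {
        PreallocContext *p = (PreallocContext *)pv;
        while (!p->fShutdown)
        {
            if (p->SegmentsReady() < 2)   // stay a couple of segments ahead of the writer
                p->PreallocateNext();
            else
                Sleep(250);               // nothing to do right now
        }
        return 0;
    }

    // Started at below-normal priority so it never competes with the streaming threads:
    //   HANDLE hThread = CreateThread(NULL, 0, PreallocThreadProc, &ctx, 0, NULL);
    //   SetThreadPriority(hThread, THREAD_PRIORITY_BELOW_NORMAL);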

4. Opening/creating a file in a directory filled with a large number of files: When you open or create a file, the filesystem needs to search the directory looking for a file with the requested name (it needs to do this even when creating a file, to ensure it doesn’t create two files with the same name). If the directory has a very large number of files, this search can take a long time. The FAT directory structure isn’t sorted and there’s no efficient way of searching it other than looking at every entry; the worst case is creating a new file, since the entire table must be scanned before concluding that the entry doesn’t already exist. There is no QFE to solve this issue. Increasing the data cache may help, but the best thing is not to create too many files within a single directory.

This was a serious issue for our DVR code, which was creating hundreds of files. However, we had control over the file names (which were typically something like xxxx0000.yyy, xxxx0001.yyy, xxxx0002.yyy) and where they were stored. Our solution was to hash the filename into a directory name and divide the files across multiple subdirectories.
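
Something along these lines gets the idea across (a simplified sketch with made-up names, not our shipping code):

    #include <windows.h>
    #include <stdio.h>

    // Illustrative sketch only (made-up names): spread segment files across a handful
    // of subdirectories keyed off a cheap hash of the filename, so no single
    // directory's entry table grows without bound.
    void BuildSegmentPath(WCHAR *szPath, size_t cchPath,
                          const WCHAR *szRoot, const WCHAR *szFileName)
    {
        DWORD hash = 0;                                  // any cheap hash with decent spread will do
        for (const WCHAR *p = szFileName; *p; p++)
            hash = hash * 31 + *p;

        unsigned bucket = hash % 16;                     // e.g. 16 subdirectories: "00".."15"
        swprintf(szPath, cchPath, L"%ls\\%02u", szRoot, bucket);
        CreateDirectoryW(szPath, NULL);                  // harmless if it already exists
        swprintf(szPath, cchPath, L"%ls\\%02u\\%ls", szRoot, bucket, szFileName);
    }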

Other useful bits of information:


The FAT filesystem code keeps in-memory caches for both the disk FAT and disk data. The sizes of these caches may be configured via the registry:

[HKEY_LOCAL_MACHINE\System\StorageManager\FATFS]

            "FatCacheSize"=0xXXXXX - Size of the FAT Table Cache

            "DataCacheSize"=0xXXXX - Size of the Data Cache


Each value is a number of sectors; it must be a power of 2 and at least 16. If it's 0, FATFS will pick a cache size itself, but the maximum it will choose by default is 256KB, which is far too small for a 120GB (or larger) disk. To see how much memory a given setting costs, multiply the sector count by 512: for example, 0x4000 (16,384 sectors) works out to 16,384 x 512 = 8MB of RAM.

If you do need to preallocate space to a file, keep in mind that the minimum set of calls to do so is something like:

    HANDLE hFile = CreateFile(szFilePath, GENERIC_WRITE, FILE_SHARE_READ,
                              NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);

    LONG FileSizeLow = (LONG)0x80000000;   // e.g. create a 2GB file
    LONG FileSizeHigh = 0;

    // Move the file pointer to the desired size; SetEndOfFile() then allocates
    // clusters out to that point. (Error checking omitted for brevity.)
    DWORD dwRet = SetFilePointer(hFile, FileSizeLow, &FileSizeHigh, FILE_BEGIN);
    BOOL bRet = SetEndOfFile(hFile);

    CloseHandle(hFile);

You do not need to actually write any data (doing so would just be wasted overhead).

Note that the docs on SetFilePointer are a little vague: you must pass the &FileSizeHigh parameter; a side effect of doing so is that SetFilePointer interprets FileSizeLow as an unsigned value rather than a signed one. The other caveat is that if you preallocate the file, you'll need to keep track elsewhere of how much data you've actually written to it (you can't rely on GetFileSize, or on simply seeking to the end of the file and writing, to append data to the file).
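
In practice that just means carrying the logical length around yourself, e.g. something like (illustrative only):

    #include <windows.h>

    // Illustrative only: for a preallocated file, GetFileSize() reports the
    // preallocated size, so the writer tracks the logical "bytes actually written"
    // itself, alongside the handle.
    struct PreallocatedSegment
    {
        HANDLE    hFile;
        ULONGLONG cbValidData;   // advanced after each successful WriteFile
    };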

Finally, I should note that the ExFAT filesystem which recently shipped with Windows CE 6.0 should further alleviate some of these issues. My understanding is that ExFAT supports files larger than 4GB and implements a bitmap-based mechanism for tracking unallocated clusters. I'm not intimately familiar with it, though, and haven't tried it yet, so I can't go into any more detail.

That’s about all I have in me for now. Feel free to leave any comments, good or bad.

-Andy Raffman