Taxes: Files larger than 4GB


Nowadays, a hard drive smaller than 20 gigabytes is laughably small, but it used to be that the capacity of a hard drive was measured in megabytes, not gigabytes. Today, video files and databases can run to multiple gigabytes in size, and your programs need to be prepared for them.

This means that you need to use 64-bit file offsets such as those used by the function SetFilePointerEx (or SetFilePointer, if you're willing to fight with the somewhat roundabout way it deals with the high 32 bits of the offset). It also means that you need to pay attention to the nFileSizeHigh member of the WIN32_FIND_DATA structure. For example, if your program rejects files smaller than a minimum size, and I give you a file that is exactly four gigabytes, and you check only nFileSizeLow, then you will think that the file is too small (its low 32 bits are zero) even though it is actually enormously huge.
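
For instance, a minimal sketch (with hypothetical helper names) of combining the two halves of the size and seeking with the 64-bit API might look like this:

    #include <windows.h>

    // Combine the two 32-bit halves reported by FindFirstFile/FindNextFile.
    // Checking only nFileSizeLow misreports any file whose size is an exact
    // multiple of 4GB.
    ULONGLONG FileSizeFromFindData(const WIN32_FIND_DATA *fd)
    {
        return ((ULONGLONG)fd->nFileSizeHigh << 32) | fd->nFileSizeLow;
    }

    // Seek with a full 64-bit offset in one call, avoiding the
    // lpDistanceToMoveHigh dance of SetFilePointer.
    BOOL SeekTo(HANDLE hFile, ULONGLONG offset)
    {
        LARGE_INTEGER li;
        li.QuadPart = (LONGLONG)offset;
        return SetFilePointerEx(hFile, li, NULL, FILE_BEGIN);
    }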

More indirectly, it means that you can't blindly map an entire file into memory. Many programs simplify file parsing by mapping the entire file into memory and then walking around the file using pointers. This breaks down on 32-bit machines once the file gets to be more than about a gigabyte and a half in size, since the odds of finding that much contiguous free address space are pretty low. You'll have to map it in pieces or use some other parsing method entirely.
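
A sketch of the map-it-in-pieces approach (hypothetical names, error handling trimmed): walk the file in fixed-size views, passing a 64-bit offset to MapViewOfFile for each piece.

    #include <windows.h>

    // Parse a large file one view at a time instead of mapping it all at once.
    // The view offset must be a multiple of the allocation granularity (64KB).
    void WalkFileInViews(HANDLE hFile, ULONGLONG fileSize,
                         void (*ProcessChunk)(const BYTE *, SIZE_T))
    {
        const ULONGLONG viewSize = 64 * 1024 * 1024; // 64MB per view
        HANDLE hMapping = CreateFileMapping(hFile, NULL, PAGE_READONLY, 0, 0, NULL);
        if (!hMapping) return;

        for (ULONGLONG offset = 0; offset < fileSize; offset += viewSize) {
            ULONGLONG remaining = fileSize - offset;
            SIZE_T bytes = (SIZE_T)(remaining < viewSize ? remaining : viewSize);
            const BYTE *view = (const BYTE *)MapViewOfFile(hMapping, FILE_MAP_READ,
                                   (DWORD)(offset >> 32),        // high 32 bits of offset
                                   (DWORD)(offset & 0xFFFFFFFF), // low 32 bits of offset
                                   bytes);
            if (!view) break;                // out of address space or I/O failure
            ProcessChunk(view, bytes);       // records that span a view boundary need extra care
            UnmapViewOfFile(view);
        }
        CloseHandle(hMapping);
    }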

Comments (31)
  1. AB says:

    KB 925336 (http://support.microsoft.com/kb/925336) is a good story of such a blunder.

    There is a function in ADVAPI32 (SaferIdentifyLevel) which checks the digital signature on a file, given its path.

    Unfortunately, it also mapped the file into virtual memory, so it failed with files approaching 1GB in size (in 32-bit processes).

    Without the KB925336 fix, trying to open a large (1GB+) EXE file from a network share results in a cryptic error (unless you disable the signature verification in Group Policy).

    The KB article, however, only mentions a similar Windows Installer issue, which is caused by the same problem.

  2. Hayden says:

    Augh! Memory-mapped files are one of those "I-have-an-infinitely-big-Unix-box" paradigms that people should never learn. Yes, reading in the file in bufferSize chunks and parsing it is a bit harder, but let's be grown-ups here!

  3. Jonathan says:

    AB: I think Visual Studio 2005 SP1 hit this.

    Joe: Once you "overcome" the address space limitation (which produces a fairly quick error message), you hit the physical RAM limitation (which throws you into paging frenzy, from which there’s rarely a way out). I expect that once the former is resolved with 64-bit, we’ll have more of the latter…

  4. Josh says:

    Hayden: Why is memory mapping such a bad idea? I’ll grant, if you write for 32 bit it’s an awful idea, but if you are writing a 64 bit program I’ve heard that it is often more efficient to rely on memory mapping.  It’s not like the OS actually reads the whole file into memory at once.  Memory mapping lets the OS handle caching more efficiently and greatly simplifies many uses of file data.  What’s not to like?

  5. Gabe says:

    Not only are HDs smaller than 20GB laughably small, they are literally laughably small. The only HD I could find (new, not used) smaller than 20GB is 8GB and weighs 13g. Just the other day I was joking with a friend about how the amount of storage that fits on his big toe is orders of magnitude larger than his first HD (20MB).

    I remember how I used to like making files larger than 4GB to see what would break. I couldn’t do this until NT 3.51 came out with file compression because 4GB disks didn’t exist at the time (or they cost thousands of dollars). As I recall, even DIR couldn’t add up files properly.

  6. AB says:

    Josh: Using memory mapping is "OK" in 64-bit programs, if performance is not that critical.

    One of the problems with memory mapping is that the "reads", although cached, are inherently synchronous. Thus, wise use of overlapped asynchronous I/O, combined with application-level caching, is always faster and less resource-intensive, although much more difficult to implement.

    There is a reason SQL Server uses asynchronous overlapped I/O, sets FILE_FLAG_NO_BUFFERING, and uses its own caching scheme instead.
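
    A minimal sketch of that kind of overlapped read (hypothetical helper names, error handling trimmed); note that the 64-bit file offset goes in the OVERLAPPED structure:

        #include <windows.h>

        // Read 'len' bytes at a 64-bit offset from a handle opened with
        // FILE_FLAG_OVERLAPPED, waiting here for completion; a real program
        // would issue several of these and overlap them with other work.
        BOOL ReadAt(HANDLE hFile, ULONGLONG offset, void *buf, DWORD len, DWORD *bytesRead)
        {
            OVERLAPPED ov = {0};
            ov.Offset     = (DWORD)(offset & 0xFFFFFFFF);
            ov.OffsetHigh = (DWORD)(offset >> 32);
            ov.hEvent     = CreateEvent(NULL, TRUE, FALSE, NULL);
            if (!ov.hEvent) return FALSE;

            BOOL ok = ReadFile(hFile, buf, len, NULL, &ov);
            if (!ok && GetLastError() == ERROR_IO_PENDING)
                ok = GetOverlappedResult(hFile, &ov, bytesRead, TRUE);  // wait for the I/O
            else if (ok)
                ok = GetOverlappedResult(hFile, &ov, bytesRead, FALSE); // completed synchronously

            CloseHandle(ov.hEvent);
            return ok;
        }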

  7. JM says:

    @AB: At this point there’s the obligatory pointing out that the only people who write SQL Server are the members of the SQL Server Team. The number of times the rest of the world needs to approach that sort of performance is small enough that saving development time will almost certainly win out.

    Knowing about the performance caveats is good, as long as people don’t use that as an excuse to trot out the vastly more complex solution because "it’s faster", or discard the easy solution altogether because "it might be slow". In particular, the vast majority of applications will not run into performance problems by synchronously accessing files (and indeed, the vast majority do just that).

    Aside from the >4G breakage, another non-performance related concern is that the only way to do error handling for memory-mapped files is to use structured exception handling, which will not give you any opportunity for determining the cause of the error and which may not fit in with the rest of your error-handling strategy.

  8. AB says:

    JM: In fact, you can retrieve the address, the access type, and the error code (NTSTATUS) from an EXCEPTION_IN_PAGE_ERROR (0xC0000006) exception [ http://msdn2.microsoft.com/en-us/library/aa363082(VS.85).aspx ]
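
    A minimal sketch of an exception filter that pulls those details out (hypothetical names; for this exception code, the third ExceptionInformation element holds the underlying NTSTATUS):

        #include <windows.h>

        // Filter for faults raised while touching a mapped view. For
        // EXCEPTION_IN_PAGE_ERROR, ExceptionInformation[0] is the read/write flag,
        // [1] is the faulting address, and [2] is the NTSTATUS of the failed I/O.
        DWORD FilterInPageError(EXCEPTION_POINTERS *ep, LONG *ioStatus)
        {
            if (ep->ExceptionRecord->ExceptionCode == EXCEPTION_IN_PAGE_ERROR) {
                *ioStatus = (LONG)ep->ExceptionRecord->ExceptionInformation[2];
                return EXCEPTION_EXECUTE_HANDLER;
            }
            return EXCEPTION_CONTINUE_SEARCH;
        }

        // Touch every byte of a mapped view, reporting I/O failures instead of crashing.
        BOOL SumMappedBytes(const BYTE *view, SIZE_T size, ULONGLONG *sum, LONG *ioStatus)
        {
            __try {
                ULONGLONG total = 0;
                for (SIZE_T i = 0; i < size; i++) total += view[i];
                *sum = total;
                return TRUE;
            } __except (FilterInPageError(GetExceptionInformation(), ioStatus)) {
                return FALSE;   // e.g. the network share backing the file went away
            }
        }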

    (sorry for going a bit offtopic)

  9. Cooney says:

    > Augh! Memory-mapped files are one of those “I-have-an-infinitely-big-Unix-box” paradigms that people should never learn. Yes, reading in the file in bufferSize chunks and parsing it is a bit harder, but let’s be grown-ups here!

    Hehe, I have a 64 bit linux box – find me a file that’s more than about 8 exabytes :) Sure, it’s a gotcha, but it won’t matter in 5 years.

    [If you’re going to write 64-bit-only software then more power to you. Then again, your 64-bit linux software doesn’t have to worry about the WIN32_FIND_DATA structure anyway, so it’s not clear why you’re commenting here. -Raymond]
  10. Cooney says:

    > I remember how I used to like making files larger than 4GB to see what would break. I couldn’t do this until NT 3.51 came out with file compression because 4GB disks didn’t exist at the time (or they cost thousands of dollars). As I recall, even DIR couldn’t add up files properly.

    This reminds me of Raymond's story about testing the limits of NTFS back in those days. LVM allowed a big pile of RAID boxes to be ganged together to form a single FS, and it only took about 3 days to write a file.

  11. Miral says:

    This is the sort of thing that generally I don’t worry about — I just use the 32-bit versions.  This is because I can count the number of times I’ve needed to support larger files on the fingers of one foot (not a typo).  But then, I don’t write programs that manipulate media files, either.  (Or a shell/directory type program, for that matter.)

    Still, I keep it in the back of my mind, just in case :)

    [Even if you don’t manipulate media files, somebody might just try to File.Open a file bigger than 4GB and you need to be careful not to do anything stupid. That’s my point. -Raymond]
  12. sandman says:

    > your 64-bit linux software doesn’t have to worry about the WIN32_FIND_DATA structure anyway,

    Well, he could be writing it using libwine, maybe for portability. (But there are much better ways.)

    But more usefully: what about Win64-only software? Not that I can find MapViewofFile64() (or equivalent) on MSDN. In that space it would be sensible to remember to use the 64-bit function.

    At least I assume there is one? Or has it been left out for some reason?

    [Um, try MapViewOfFile. -Raymond]
  13. Joe says:

    Didn’t early versions of GNU/Hurd map the entire *filesystem* into memory?

    On a related note, even parsers that don’t map the entire file into memory can have problems, e.g. (pseudo-C):

    while (!end_of_data(file)) {
        if (check_data_validity(file)) add_data_to_array(file);
    }

    Since this amounts to actually loading the entire file into memory (as opposed to just mapping it), in some situations (RAM + pagefile < 4GB) it could run into problems even faster.

  14. Cooney says:

    [If you’re going to write 64-bit-only software then more power to you. Then again, your 64-bit linux software doesn’t have to worry about the WIN32_FIND_DATA structure anyway, so it’s not clear why you’re commenting here. -Raymond]

    It’s not like your blog is inapplicable to unix – it’s mostly windows (and fish) centric, but the topics address specific instances of common problems.

    [But saying “Solve the problem of the constrained 32-bit address space by using a 64-bit address space” doesn’t really help somebody who has a 32-bit system. It’s like responding to an article about how to conserve memory with “Solve the problem of running out of memory by adding more memory.” A true statement which misses the point of the exercise. -Raymond]
  15. Mike Dimmick says:

    @sandman: if Raymond’s terse answer does not satisfy, consider that Microsoft’s aim was to keep 32-bit and 64-bit code source compatible. When the Win64 effort started, a number of new types which are 32-bit on 32-bit systems and 64-bit on 64-bit systems were introduced. Then, Windows APIs were adjusted to use these types where appropriate.

    In new versions of the SDK, after these changes were made, MapViewOfFile was modified so that the dwNumberOfBytesToMap parameter is now a SIZE_T (not a DWORD). In turn that’s a typedef for ULONG_PTR which is a pointer-sized number, i.e. matches the system address size.

    MapViewOfFile already supported 64-bit offsets into the file. This change only affects the number of bytes you can map in one go.

  16. hexatron says:

    I recently ported a program to 64bits. I was really fearing the part where StackWalk was called (to produce a detailed state when something went wrong).

    The 64bit version is StackWalk64, and there are a bunch of other functions that become function64.

    Simply renaming those functions worked–first time.

    (Well, registers are now named Rsp instead of Esp, etc.)

    I was amazed, dumbfounded, delighted, and left early. My heartiest thanks to whoever busted their butts to make this happen.

  17. Cereal says:

    I've always wondered how mmap/CreateFileMapping is supposed to make file parsing any easier. It doesn't solve:

    • Misaligned data on architectures where that matters
    • Struct packing problems
    • I/O errors in situations where exception handling is cumbersome

    That said, file mapping is a good solution to some very narrowly defined problems.

  18. Daniel Cheng says:

    AB: I think SQL Server (and most other database servers) do application-level buffering due to the complex MVCC/ACID requirements. Unless your application is on a dedicated server, it's not a good idea to allocate a large pool of cache at the application level.

  19. AB says:

    @Daniel Cheng: I meant "application-level caching" in its broadest sense.

    While SQL Server caches ‘raw’ database pages themselves, working with any huge data structure without mapping it entirely into address space is also a form of "caching", just implemented differently.

    For media files, for example, you usually only need to "cache" stream metadata.

  20. Dave says:

    Given Microsoft's (very laudable) attempts to maintain transparent backwards compatibility if at all possible, I wonder how long it'll be before some sufficiently large cu$tomer requests a small hack to MapViewOfFile(): just a minor change, won't affect anything else, and we'd really appreciate it… and I wonder how many people in whichever group would have to implement this would plan on taking their annual leave at about that time?

  21. Worf says:

    Actually, mapping files into memory can be used to help solve a class of problems that is a bit more difficult using the stream-of-bytes representation (read/write or ReadFile/WriteFile), just as the stream-of-bytes method can be more useful for a different set of problems.

    First, think of cases where a file actually represents in-memory data. For example, perhaps an operating system wants to load an executable image into memory. Sure it could read the headers, figure out where the sections are, allocate memory, copy the file to RAM… or it can read the headers, map the regions of the file to the right place, and let the MMU handle it.

    Or perhaps you have a big array-like data structure, where bits may change inside it. It's an array because you know it's randomly read and written. So, in your initialization, you can open the file, read it all into memory, and close it; then on shutdown, reopen the file, save it all, and exit. Or use the much simpler method: open the file, map it into memory (note: no storage allocated!), then on shutdown unmap the memory (which also closes the file and commits any unflushed changes to it).

    If you are processing data linearly, say, a video file, the stream-of-bytes method is perfect. If, however, you're going to read the entire file into memory in the first place, maybe memory mapping is better. It can certainly avoid having to allocate that much RAM, and at the very least, random reads and writes all over the place don't mean you need to seek, read, and write continually.

    They're used for different things; sometimes the stream of bytes is the perfect way of doing things, and for other things, seeing the entire file at once is ideal. Especially if all you're doing is copying the file to RAM in the first place.
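
    A minimal sketch of that file-backed-array idea (hypothetical names; it assumes the whole file fits in the address space, which is exactly what breaks for >4GB files on 32-bit systems):

        #include <windows.h>

        // Map an existing file read/write, modify it in place, and let the
        // memory manager write the dirty pages back to the file.
        BOOL ToggleFirstByte(const wchar_t *path)
        {
            HANDLE hFile = CreateFileW(path, GENERIC_READ | GENERIC_WRITE, 0, NULL,
                                       OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
            if (hFile == INVALID_HANDLE_VALUE) return FALSE;

            HANDLE hMapping = CreateFileMappingW(hFile, NULL, PAGE_READWRITE, 0, 0, NULL);
            BYTE *view = hMapping ? (BYTE *)MapViewOfFile(hMapping, FILE_MAP_WRITE, 0, 0, 0)
                                  : NULL;
            if (view) {
                view[0] ^= 1;                 // treat the file as an in-memory array
                FlushViewOfFile(view, 0);     // force the write-back rather than waiting
                UnmapViewOfFile(view);
            }
            if (hMapping) CloseHandle(hMapping);
            CloseHandle(hFile);
            return view != NULL;
        }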

  22. Jim says:

    Worf:

    Don't get too caught up in single-OS/computer thinking.

    I agree that a small set of problems involves directly mapping a file into RAM; however, that should be only for the OS and not applications. Applications should take an agnostic approach and not depend on a certain byte ordering or memory layout.

  23. sandman says:

    > Applications should take an agnostic approach and not depend on a certain byte ordering or memory layout.

    Agreed. But that doesn't make memory mapping useless; you just need to ensure you use wrapper functions to access any binary fields. As a previous poster pointed out, it all depends on your processing model.

    @Mike Dimmick:

    I thought of that, but MSDN lists MapViewofFile() as part of the Win32 API, so I assumed (wrongly) that it used the 32-bit size_t. That will teach me.

    But if 32-bit and 64-bit are compatible, then why are there any taxes to pay? As long as you check that what you're trying to map is sane, which is where we started, I guess. I thought about testing against (size_t)(-1)/4, but that is huge on a 64-bit system, and I'm not sure whether it is unreasonably huge in all cases forever.

    [This tax is only tangentially related to 64-bit Windows. If you test only the nFileSizeLow then you are buggy with files larger than 4GB. You can make this mistake on both 32-bit and 64-bit Windows. I’m sorry I brought up memory-mapped files. Once again, people focus on the colorful details and miss the point of the article. -Raymond]
  24. Bramster says:

    [Even if you don’t manipulate media files, somebody might just try to File.Open a file bigger than 4GB and you need to be careful not to do anything stupid.]

    Indeed. I have to sort files larger than 4GB.

    I have to be very, very careful with my 32-bit programs.

  25. Dean Harding says:

    Hehe, the previous trackback is funny :-)

  26. DriverDude says:

    "Once again, people focus on the colorful details and miss the point of the article. -Raymond"

    Because we all hate taxes.

    Though I'm not sure this is just a "tax" in the manner of, e.g., the power management tax. An app that is ignorant of power management is an inconvenience to users but generally isn't a bug. An app that incorrectly handles files > 4GB has a bug, or can lose data if it incorrectly truncates large files.

    I would say it is a *requirement*, at a minimum, to correctly reject large files if an app doesn’t want to be >4GB clean.

    And we have to worry about legacy file formats too, such as .ZIP, which has a 32-bit file size limit.

  27. Miral says:

    [Even if you don’t manipulate media files, somebody might just try to File.Open a file bigger than 4GB and you need to be careful not to do anything stupid.]

    My point was that in the apps where I've even *got* a File->Open (rare), typical interaction with the file would be "open, read 32 bytes, close, display messagebox 'Invalid file'" (the magic number in the header wouldn't be there, because it can't be one of my files if it's that big). None of which needs to worry about 64-bit file lengths :)

    I realise I’m probably a corner case though.

    (Although a different 64-bit thing did bite me recently.  time_t changed from 32 bits in VC6 to 64 bits in VC8, which broke one of my data structures.  But never mind, I got it sorted out in the end.  My fault for not paying attention to the "breaking changes" notes, probably.)

  28. David Walker says:

    Errors similar to this seem to abound.  I tried to install a patch to a customer contact software program about a year ago, and it failed with a "not enough free disk space" error.  Strange, since the disk had 32 GB free.

    It turns out that if the amount of free space on the installation target disk was close to a multiple of 4 GB, the installer (not MSI but one of the other 2 big ones) would fail saying "not enough free disk space".  

    The workaround was to create a huge file on your C drive to take up space.  The installer was eventually patched to fix the arithmetic.  

    Apparently, big numbers are hard for (some) programmers to understand…

  29. file > mem says:

    Mapping a file to memory is ALWAYS a bad idea. In 16-bit DOS this was a bad idea. In 32-bit windows this is a bad idea. In 64-bit windows this is a bad idea. DONT DO IT. Use buffering/streaming instead.

  30. Dean Harding says:

    "file > mem": Are you one of those people who also think the goto statement is "evil"?

    Memory mapping has its advantages and disadvantages, just like anything else. To say "DON'T DO IT" is premature and rather limiting.

Comments are closed.
