When I tell Windows to compress a file, the compression is far worse than I get if I ask WinZip to compress the file; why is that?

A customer noted that when they took a very large (multiple gigabyte) file and went to the file's Properties and set "Compress contents to save disk space", the file shrank by 25%. And then they needed to copy the file to a USB stick, so they used an old copy of WinZip to compress the file, and the result was half the size of the original.

Why is it that a 10-year-old program can compress files so much better than Windows Server 2012's built-in disk compression? Is the NTFS compression team a bunch of lazy good-for-nothings?

Transparent file compression such as that used by NTFS has very different requirements from archival file compression such as that used by WinZip.

Programs like WinZip are not under time constraints; they can take a very long time to analyze the data in order to produce high compression ratios. Furthermore, the only operations typically offered by these programs are "Compress this entire file" and "Uncompress this entire file". If you want to read the last byte of a file, you have to uncompress the whole thing and throw away all but the last byte. If you want to update a byte in the middle of the file, you have to uncompress it, update the byte, then recompress the whole thing.

Transparent file compression, on the other hand, is under real-time pressure. Programs expect to be able to seek to a random position in a file and read a byte; they also expect to be able to seek to a random position in a file and write a byte, leaving the other bytes of the file unchanged. And these operations need to be O(1), or close to it.

In practice, what this means is that the original file is broken up into chunks, and each chunk is compressed independently by an algorithm that strikes a balance between speed and compression. Compressing each chunk independently means that you can uncompress an arbitrary chunk of a file without having to uncompress any chunks that it is dependent upon. However, since the chunks are independent, they cannot take advantage of redundancy that is present in another chunk. (For example, if two chunks are identical, they still need to be compressed separately; the second chunk cannot say "I'm a copy of that chunk over there.")
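
The chunking idea can be sketched in a few lines of Python. This is only an illustration, not how NTFS is implemented: it uses zlib as a stand-in compressor, and the 4 KiB `CHUNK_SIZE` and the helper names `compress_chunks`, `read_byte`, and `write_byte` are made up for the example.

```python
import zlib

CHUNK_SIZE = 4096  # hypothetical chunk size, chosen only for illustration


def compress_chunks(data: bytes) -> list[bytes]:
    """Compress each fixed-size chunk independently of the others."""
    return [zlib.compress(data[i:i + CHUNK_SIZE])
            for i in range(0, len(data), CHUNK_SIZE)]


def read_byte(chunks: list[bytes], offset: int) -> int:
    """Read one byte by decompressing only the chunk that contains it."""
    index, within = divmod(offset, CHUNK_SIZE)
    return zlib.decompress(chunks[index])[within]


def write_byte(chunks: list[bytes], offset: int, value: int) -> None:
    """Update one byte by recompressing only the chunk that contains it."""
    index, within = divmod(offset, CHUNK_SIZE)
    plain = bytearray(zlib.decompress(chunks[index]))
    plain[within] = value
    chunks[index] = zlib.compress(bytes(plain))


data = bytes(range(256)) * 64           # 16 KiB of sample data (four chunks)
chunks = compress_chunks(data)
last = read_byte(chunks, len(data) - 1)  # touches only the final chunk
write_byte(chunks, 5000, 7)              # touches only the second chunk
```

The cost of a read or write depends on the chunk size rather than the file size, which is what keeps random access close to O(1). Note that the rewritten chunk may come out a different size than before; the sketch sidesteps that by keeping the chunks in a list.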

All this means that transparent file compression must sacrifice compression for speed. That's why its compression looks lousy when compared to an archival compression program, which is under no speed constraints.
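
The cost of independence can be seen with a small experiment. The following is a rough sketch under the same assumptions as before: zlib stands in for both kinds of compressor, and the 4 KiB `CHUNK_SIZE` is hypothetical. The sample data is one block of pseudo-random bytes repeated eight times, so nearly all of its redundancy lies between chunks rather than within them.

```python
import random
import zlib

CHUNK_SIZE = 4096  # hypothetical chunk size, chosen only for illustration

# One pseudo-random block repeated eight times: each chunk is incompressible
# by itself, but every chunk after the first is a copy of an earlier one.
rng = random.Random(0)
block = bytes(rng.getrandbits(8) for _ in range(CHUNK_SIZE))
data = block * 8

# Archival style: one compressor sees the whole file and can refer back
# to the earlier copies of the block.
whole_size = len(zlib.compress(data))

# Transparent style: every chunk is compressed without knowledge of the rest.
chunked_size = sum(len(zlib.compress(data[i:i + CHUNK_SIZE]))
                   for i in range(0, len(data), CHUNK_SIZE))

print(f"original: {len(data)}  whole: {whole_size}  chunked: {chunked_size}")
```

On data like this, the whole-file result should be a small fraction of the chunked result, because the chunked compressor has to pay for the same block eight times over. Real files are rarely this extreme, but the effect points in the same direction.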

Comments (33)
  1. CmdrKeene says:

    Another life lesson: don’t attribute to malice or incompetence what is probably just your own ignorance.

    1. Mr Cranky says:

      Nice corollary to Lazarus Long’s famous axiom.

  2. Jane's Fleet Command says:

    When can we look forward to NTFS supporting the O(-1) space/time performance of the Pied Piper compression algorithm?

  3. Many, many, many moons ago – in the days of Stac drive compression – you could recompress data with a higher compression level, per-file, using a supplied utility. It was something I often did with the help of a 4DOS batch file that checked the dates, and recompressed data that hadn’t been touched for ages.
    Of course, the downside was that if you did so, they took a lot longer to save if you did edit them – as you rightly say.

    I always wondered why the obvious trade-off wasn’t made – allow a high compression method, but it makes the file read-only. (I suppose my batch file could have done that anyway!)

    Such wondering was simply an academic exercise though. A quick glance at ever-tumbling storage costs answered the question – it’s not worth it, because a bigger hard disk is always getting cheaper.

  4. Also, Microsoft Windows can create ZIP files without third-party software since Windows ME. Using this feature, I suppose you get results close to WinZip’s (although I haven’t used third-party compression software since ME).

    1. Yuri Khan says:

      Yes, and the resulting ZIP archives have invalid file name encoding until Windows 7 or some non-default updates for XP.

      1. I think it’s rather unfair to require Windows XP to support a feature that didn’t exist at the time it was written.

        1. Yuri Khan says:

          Hm, you’re right. UTF-8 was added to the ZIP spec as an official feature in version 6.3.0 on 2006-09-29. Before that, the subject of file name encoding was not even discussed in the spec, technically making ZIP an unsuitable format for cross-locale interoperation.

  5. Ben L says:

    I’ve always been impressed with the “Transparent file compression.” It works flawlessly.

  6. DWalker says:

    Yes, these constraints to allow reading a byte from the middle of a file, and changing a byte in the middle of the file, do exist… and we have to allow for that.

    I wonder what percentage of file reads are sequential, full-file reads? I’ll bet it’s a high percentage.

    Aren’t EXE and DLL files always read sequentially, in their entirety? And most data files including Microsoft Office files (Word, Excel, Powerpoint, maybe not Access) are too. Not that this helps the real-time disk compression requirements..

  7. cheong00 says:

    On the other hand, I wonder when the industry will settle the choice of ZIP64 and Deflate64. This makes choosing ZIP format for moving large files around on Windows and *nix systems undesirable.

  8. Ray Koopa says:

    And then I experienced that my disk usage actually went up noticeably when telling it to compress my drive. I think it was like 120GB free of 300GB data, and after applying this disk compression stuff, I had only 80GB free… not sure about exact numbers anymore, I fear it since then and never tried that again. Did I just hallucinate or why is that?

    1. xcomcmdr says:

      I was constantly using it since the XP era with no problem.

      Once I got a SSD, I experienced what you described.

      I don’t think the drive type is related, but since then I stopped using it.

      Plus once a whole drive is compressed, you’ve got a massive case of fragmentation to resolve with Windows’ defragmenter (only a problem for HDDs, obviously), which can take days on XP-era machines.

      1. Ray Koopa says:

        It was quite some time ago, in 2011 or so, I had an HDD back then. It’s crazy it happened to you too. I also think it had something to do with fragmentation…

      2. M Hotchin says:

        Perhaps fragmentation was the problem – if all the files are highly fragmented, the MFT has to grow in order to record the huge number of fragments. Is there a tool that shows the size of the file system metadata, like the MFT and the indexes?

        1. M Hotchin says:

          Ah, it looks like CHKDSK will at least give an overview. Doesn’t give much detail though.
          176524 KB in 57045 indexes.
          0 KB in bad sectors.
          503083 KB in use by the system.
          65536 KB occupied by the log file.

  9. Neil says:

    This of course begs the question as to how the compressed chunks are stored; for example a write operation might increase the compressed size of a chunk.

    1. Chris Long says:

      Apparently, the compressed file is a special case of a sparse file, which allows some magic to happen…:


    2. Medinoc says:

      I’ve worked on something that dealt with compressed files and fragmentation: chunks are stored back-to-back as individual “extents” of a file (which is easy to detect to consider all consecutive extents a single fragment), so my first guess would be that a chunk changing size would have to be re-written somewhere else, fragmenting the file.

  10. MarcK4096 says:

    I bet Windows 2012 R2 deduplication would have been more competitive. I’ve never been a fan of NTFS compression, but I think Windows Server 2012 R2 dedup is pretty good.

  11. Medinoc says:

    On the other hand, I did experience genuinely poor compression on Windows, using a .Net 2.0 System.IO.Compression.GZipStream to compress a TAR archive containing mostly JPEGs (i.e. already-compressed files); I did expect the resulting file to be slightly bigger than the original, but I expected *slightly* bigger, with an overhead in the 5%-10% range. Instead, the resulting tar.gz file was MORE THAN FIFTY PERCENT BIGGER than the original!

    1. cheong00 says:

      Even ZIP or RAR can produce output larger than the original in the worst case; that’s why, when implementing compression functions, you have to compare the resulting file size with the original and, if it’s larger, “compress” the file(s) again with the “store only” compression option.

      1. Medinoc says:

        As I said, I expected that. I just didn’t expect it to be *that* bad, because even the most basic RLE compressor has a code to say “follow X bytes of unaltered data”!

        I would have been OK with an output 10% bigger than the original. *Not* with an output 50% bigger.

    2. Joshua says:

      Don’t use the system copy of GZipStream; use the copy from ICSharpCode.SharpZipLib.dll

  12. Kemp says:

    I wouldn’t expect them to be the same, though I’ve never had to articulate the reasoning behind that intuition. Makes perfect sense with a little thought. On the other hand, the decompression performance of the built-in Zip support is abysmal. *shakes fist at zip team*

    1. Richard says:

      I am pretty sure that the Windows built-in ZIP support is using a Shlemiel-the-painter algorithm for directory listing.

      It is stupendously slow at merely navigating a ZIP containing a few thousand files – it can take almost a minute to do a simple directory listing.

      It’s quite shocking that the QuaZip library can list *and* extract all the files in less time than Windows can drop one level in the directory tree – same hardware, same ZIP file.

  13. jader3rd says:

    Is there an API to call to zip a file in the same way that Windows Explorer zips a file?

  14. Matteo Italia says:

    TRWTF is that people in 2016 still install nagware like WinZip or WinRar to compress files.

  15. DWalker says:

    I know that the file system needs to support “reading a file in the middle”; “reading the last byte”; etc. But I’ll bet that 95% of file reads are from the start to the end.

    Surely, EXE files and DLLs are read from start to finish. And Excel files, Word files, Powerpoint files… maybe not Access databases.

    This doesn’t help the issue here.

    1. Medinoc says:

      Actually, for EXEs and DLLs I doubt it since memory-mapped sections are used: It’s likely parts would be paged on-demand (though I don’t know how prefetch stuff factors into this)…

  16. 640k says:

    I always wished for real zipfile compression (call it transparent if you want) in an OS when storing files, with all expected drawbacks that would be more obvious, instead of some NIH “transparent” compression with leaking abstractions that doesn’t work well anyway, and usually contains a lot of black box surprises that in some cases work like “magic”.

    1. Erik F says:

      I am not a compression expert, but this is my take on why transparent file compression doesn’t use ZIP or other popular compression formats: archival compression doesn’t support modifications. The benefits of ZIP and friends are that they can take advantage of dictionaries to remove redundant data in the entire file, but by doing this it’s next to impossible to modify the resulting data stream. NTFS compression, on the other hand, essentially compresses in chunks of multiples of 8KB (see https://blogs.msdn.microsoft.com/ntdebugging/2008/05/20/understanding-ntfs-compression/).

      What would end up happening? Every write commit would require a brand-new data stream, so your single-byte modification could result in a multi-megabyte transaction (recomputing the dictionary and regenerating the compressed stream.) This does not seem like a very good idea.

      As a read-only file system, I can’t see any problems with standard archival formats (that’s what they are internally!), but they’re not designed for and really shouldn’t be used as a R/W file system.

  17. smf says:

    “However, since the chunks are independent, they cannot take advantage of redundancy that is present in another chunk. (For example, if two chunks are identical, they still need to be compressed separately; the second chunk cannot say “I’m a copy of that chunk over there.”)”

    Actually that shouldn’t be difficult to add, because you must have a central list of pointers to the compressed chunks. If two chunks were identical then you can just have two pointers. Writes become more complicated because you can’t necessarily overwrite a chunk if it changes. However writing is already complicated as chunks are variable length and you really don’t want to rewrite the entire file every time a byte changes. So it might not be that much more complicated.

    A project I’m involved with did this because it needed seekable compression, our changes are stored in a separate file though. That way you can either merge the changes in, or delete them and go back to the original. Or even keep multiple sets of differences.

Comments are closed.
