System.IO.Compression Capabilities [Kim Hamilton]

We often get asked about the capabilities of the .NET compression classes in System.IO.Compression. I'd like to clarify what they currently support and mention some partial workarounds for formats that aren't supported.

The .NET compression libraries support at the core only one type of compression format, which is Deflate. The Deflate format is specified by the RFC 1951 specification and a straightforward implementation of that is in our DeflateStream class.

Other compression formats, such as zlib, gzip, and zip, use deflate as a possible compression method, but may also use other compression methods. In the case that they use deflate, you can think of these formats as a wrapper around deflate: they take bytes generated by deflate compression and tack on header info and checksums.

Our GZipStream class does exactly that – it uses DeflateStream and then adds header info and checksums specific to the gzip format. The gzip format is specified in RFC 1952.

So, out of the box, we support deflate and gzip formats.

Until we provide support for the other formats, which we plan to do soon, there are partial workarounds that may help you out in some situations, but they're definitely not a complete solution.

Working with zlib

The zlib format is specified by RFC 1950. Zlib also uses deflate, plus 2 or 6 header bytes, and a 4 byte checksum at the end. The first 2 bytes indicate the compression method and flags. If the dictionary flag is set, then 4 additional bytes will follow (which explains why the header will be 2 or 6 bytes). Note that in the wild, preset dictionaries aren't very common (and our classes don't support them).

This diagram from RFC 1950 shows the zlib structure:

            0   1
         +---+---+
         |CMF|FLG|   (more-->)
         +---+---+


      (if FLG.FDICT set)

           0   1   2   3
         +---+---+---+---+
         |     DICTID    |   (more-->)
         +---+---+---+---+

         +=====================+---+---+---+---+
         |...compressed data...|    ADLER32    |
         +=====================+---+---+---+---+

This means that to read a zlib file using only the .NET libraries, you can often just chop off the first two bytes and 4 end bytes and use DeflateStream on the rest of the stream as normal. (It would be better to check the dictionary bit and not attempt to read anything in that case).

Going in the opposite direction isn't as trivial, so I'm not really suggesting to generate zlib files this way. However, a couple people have asked in the past so I'll sketch an overview of that.

To start, you need to know which bytes to add at the beginning. With our deflate implementation, those bytes are 0x58 and 0x85. If you're curious about how this is derived from RFC 1950, see section 2.2 "Data format" and note that we use a window size of 8K and the value of FLEVEL should be 2 (default algorithm).

After that, you need to add the Adler-32 checksum at the end. The checksum will depend on the payload that you're compressing so you need to calculate it programmatically. Because of this, the easiest way to generate the checksum is to subclass DeflateStream and override the Write/BeginWrite methods to update the checksum. Steven Toub's NamedGZipStream article (mentioned at the end) shows an example of creating such a subclass for generating named gzip files.

Working with other compression formats

The big format you're probably thinking about is zip. Currently the .NET libraries don't support zip but the J# class libraries do. The following article describes using these libraries with a C# app.

https://msdn.microsoft.com/msdnmag/issues/03/06/ZipCompression/default.aspx

But if you don't want to rely on the J# class libraries, we'll need to provide a better solution.

Now that you're familiar with some compression specifications, let's focus on zip a little more. A zip specification is here:

https://www.pkware.com/documents/casestudies/APPNOTE.TXT

Notice that zip also allows deflate. Again the same principle applies – there are deflate bytes packaged in a header and footer. This may tempt you into writing a zip reader/writer based on DeflateStream (as described above for zlib), but there are two key differences that make zip more complicated.

First, the zip header contains a lot more information than the zlib header. To read a zip file, you'd definitely have to parse the header to figure out how many bytes to skip over because the header contains variable length items such as a file name.

Second, zip tools actively use different compression methods. For example, use Windows compression tool on a very small text file (with just a few words in it) and then a bigger file, say around 20 KB. Chances are it used no compression (yes, that's an option) for the small file and deflate for the 20 KB file.

Because different compression methods are used, an extension of the zlib technique described above may not help you much if you want to use the .NET libraries to read zip files. You'd definitely have to read the compression method to determine how to proceed. If it's deflate, then chop off the header and proceed as above. If it's no compression, chop off the header and read the bytes as a normal stream of bytes. If it's something else, then the .NET libraries have no built-in support for it.

Additional Note: Using WinZip with our GZipStream

Steven Toub observed in an MSDN article that WinZip can't handle our GZipStream because it requires filename info. He's created a NamedGZipStream implementation that generates files readable by WinZip

https://msdn.microsoft.com/msdnmag/issues/05/10/NETMatters/

Our Future Compression Plans

We'd like to address the shortcomings of our compression library in future releases. The following items are our highest priority compression requests:

  • Support for more formats, such as ones described above
  • Better compression ratio
  • Better compression speed

Are there any others you'd like us to address?