More on Zip in .NET [Richard Lee]

First, I’d like to thank everybody for their comments on the Zip APIs. It’s great to know that I’m working on something that a lot of people will hopefully find useful. I’ll try to address the themes that came up in the comments.

Streams

A lot of the comments mentioned support for streams. The API does support creating a ZipArchive with a stream as the backing store, and entry contents can come from streams as well. The following code sample takes the contents of instream and writes a Zip archive to outstream containing just that one file:

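// Requires: using System.IO; using System.IO.Compression;
// instream and outstream are Stream instances opened elsewhere.
using (ZipArchive archive = new ZipArchive(outstream, ZipArchiveMode.Create))
{
    // Add a single entry and copy the input into it.
    ZipArchiveEntry entry = archive.CreateEntry("data.dat");
    using (Stream entryStream = entry.Open())
    {
        instream.CopyTo(entryStream);
    }
}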

This will write out the Zip archive directly to outstream, without buffering the entire contents of the archive in memory or writing to a temporary file. We really think of the methods used above as the core APIs. Methods like CreateEntryFromFile and ExtractToDirectory are purely convenience methods – their main purpose is to make some of the more common scenarios with files easier.
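The reverse direction works the same way. Here is a rough sketch of a round trip through streams, assuming the API surface that shipped in System.IO.Compression. The leaveOpen constructor argument and GetEntry are not mentioned in this post, and the StreamRoundTrip class and its method names are made up for illustration.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class StreamRoundTrip
{
    // Writes 'source' into a one-entry archive on 'destination' (same shape as the sample above).
    public static void WriteSingleEntry(Stream source, Stream destination, string entryName)
    {
        // leaveOpen: true (an assumption about the API) keeps 'destination' usable afterwards.
        using (var archive = new ZipArchive(destination, ZipArchiveMode.Create, leaveOpen: true))
        using (Stream entryStream = archive.CreateEntry(entryName).Open())
        {
            source.CopyTo(entryStream);
        }
    }

    // Copies the decompressed contents of one entry from 'archiveStream' to 'destination'.
    public static void ReadSingleEntry(Stream archiveStream, string entryName, Stream destination)
    {
        using (var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.GetEntry(entryName);
            if (entry == null)
            {
                throw new FileNotFoundException("Entry not found in archive.", entryName);
            }

            using (Stream entryStream = entry.Open())
            {
                entryStream.CopyTo(destination);
            }
        }
    }
}
```

Neither method touches the file system, so the same code works with a FileStream, a MemoryStream, or any other stream that meets the mode's requirements.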

Compression and Encryption

Another common theme was custom encryption and compression algorithms. The vast majority of Zip archives that are meant to be interoperable with the widest range of libraries, tools, and applications use the Deflate compression algorithm without encryption. Our main goal for this API is to be able to read and write such archives. As such, we’re currently planning to support writing Zip archives with Deflate, reading Zip archives that use Deflate or no compression, and to not support encryption. Not only will this enable reading/writing interoperable Zip archives, it also scopes the work to something reasonable that can be delivered during my internship.

If we’re not providing built-in support for additional compression or encryption algorithms, an obvious question is why we don’t provide extensibility hooks so that custom compression or encryption algorithms can be plugged in. We explored doing this, and it turns out to be more complex than you might initially think.

A lot of people mentioned the CryptoStream and ICryptoTransform model as a powerful way to allow for extensibility. This works well because the ICryptoTransform only needs to do one thing – transform a series of bytes into another series of bytes. Unfortunately, encrypting Zip files is a much more complex operation. Fields and flags need to be set to appropriate values and headers need to be encrypted depending on which algorithm is used. Implementing either of the two secure encryption methods mentioned in the Zip specification would require access to essentially all of the metadata in the Zip file. The resulting interface for such an extension would be enormously complex.

Compression is a bit simpler, as only one entry is compressed at a time. However, there are still fields and flags that need to be set in the headers, depending on the algorithm used. Furthermore, the Zip specification defines only a few compression and encryption methods. We don’t want to give the impression that any compression method can be used to produce a Zip file, when that isn’t the case. If you want to use a compression or encryption algorithm that isn’t specified in the Zip spec, you might as well just compress or encrypt the stream yourself.
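To make "encrypt the stream yourself" concrete, here is a minimal sketch that wraps an entry's stream in a CryptoStream. It assumes the shipped System.IO.Compression API plus AES from System.Security.Cryptography. The EncryptedEntryDemo class, the "data.dat.aes" entry name, and the way the key and IV are passed in are all hypothetical, and key management is out of scope. This is not Zip-spec encryption: other Zip tools will extract the entry but see only opaque bytes.

```csharp
using System.IO;
using System.IO.Compression;
using System.Security.Cryptography;

public static class EncryptedEntryDemo
{
    private const string EntryName = "data.dat.aes"; // hypothetical name

    // Encrypts 'plaintext' with AES and stores the ciphertext as a single Zip entry.
    public static byte[] WriteEncrypted(byte[] plaintext, byte[] key, byte[] iv)
    {
        using (var zipBytes = new MemoryStream())
        {
            using (var archive = new ZipArchive(zipBytes, ZipArchiveMode.Create, leaveOpen: true))
            using (Aes aes = Aes.Create())
            {
                aes.Key = key;
                aes.IV = iv;
                using (ICryptoTransform encryptor = aes.CreateEncryptor())
                using (var crypto = new CryptoStream(
                    archive.CreateEntry(EntryName).Open(), encryptor, CryptoStreamMode.Write))
                {
                    // Disposing 'crypto' writes the final block and closes the entry stream.
                    crypto.Write(plaintext, 0, plaintext.Length);
                }
            }
            // The archive has been disposed here, so the central directory is written.
            return zipBytes.ToArray();
        }
    }

    // Reads the entry back and decrypts it with the same key and IV.
    public static byte[] ReadEncrypted(byte[] zip, byte[] key, byte[] iv)
    {
        using (var archive = new ZipArchive(new MemoryStream(zip), ZipArchiveMode.Read))
        using (Aes aes = Aes.Create())
        {
            aes.Key = key;
            aes.IV = iv;
            using (ICryptoTransform decryptor = aes.CreateDecryptor())
            using (var crypto = new CryptoStream(
                archive.GetEntry(EntryName).Open(), decryptor, CryptoStreamMode.Read))
            using (var result = new MemoryStream())
            {
                crypto.CopyTo(result);
                return result.ToArray();
            }
        }
    }
}
```

One trade-off with this ordering: ciphertext looks random, so Deflate gains very little on the entry. If size matters, compressing first and then encrypting the whole archive stream is the other way to layer the two.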

So an extensibility interface would be substantially more complex than something like ICryptoTransform, and would add significant complexity to the ZipArchive and ZipArchiveEntry classes. We don’t think providing extensibility hooks in this way adds enough value to the API to justify the complexity they bring. However, we are open to adding built-in support for encryption or other compression algorithms in the future, based on customer demand. We’ve specifically designed the API in a way that would allow us to do this. If this is something you’re interested in, we’d love to better understand your needs.

Abstract Base Class

Another common question was about providing a base class for archives. We explored this and have decided not to add such an abstraction at this time. The main reason we are holding off is that right now we’re only planning to provide support for Zip archives and have no plans to support other archive formats (such as CAB, RAR, etc.). As a design principle, we try to avoid adding abstractions when there is only one implementation; otherwise we risk getting the abstraction wrong. Then we’re either stuck with the bad abstraction or forced to add a second one when another implementation comes along. All of this adds complexity that we’d like to avoid.

The concerns about test-driven development are certainly valid, but there are ways around this. For example, because archives can be made with streams as the backing store, a FileStream could be mocked and put behind an archive. That may not be totally ideal, but these concerns don’t seem compelling enough to justify adding an abstract base class at this time.
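As a sketch of what that can look like in practice: if the code under test accepts a Stream rather than a file path, a test can hand it an archive built entirely in memory. This assumes the shipped System.IO.Compression API. ArchiveInspector and ArchiveFixtures are hypothetical names, and a MemoryStream stands in here for the mocked FileStream.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

// Hypothetical code under test: it takes a Stream, so the caller picks the backing store.
public static class ArchiveInspector
{
    public static long TotalUncompressedSize(Stream archiveStream)
    {
        using (var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Length is the uncompressed size of each entry.
            return archive.Entries.Sum(e => e.Length);
        }
    }
}

// Hypothetical test helper: builds an archive in memory instead of on disk.
public static class ArchiveFixtures
{
    public static MemoryStream Build(IDictionary<string, byte[]> files)
    {
        var stream = new MemoryStream();
        using (var archive = new ZipArchive(stream, ZipArchiveMode.Create, leaveOpen: true))
        {
            foreach (KeyValuePair<string, byte[]> file in files)
            {
                using (Stream entryStream = archive.CreateEntry(file.Key).Open())
                {
                    entryStream.Write(file.Value, 0, file.Value.Length);
                }
            }
        }
        stream.Position = 0; // rewind so the consumer reads from the start
        return stream;
    }
}
```

A unit test would then call ArchiveFixtures.Build with a few named byte arrays and pass the result straight to ArchiveInspector.TotalUncompressedSize, with no temporary files involved.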

Miscellaneous

There were a couple of comments about treating the Zip archive like the filesystem, and supporting searching for certain kinds of files. We made the decision to treat the archive as a flat container of files because that is how the entries are actually stored. As a library rather than an application, we thought it made more sense to represent the archive as it exists on disk. Also, using LINQ on the entries means getting all of the files in a certain subdirectory that end in .txt is a relatively simple operation.
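For example, the .txt query could be sketched like this, assuming the Entries collection and FullName property from the shipped System.IO.Compression API and that entry names use "/" as the separator. EntryQueryDemo and TextFilesUnder are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO.Compression;
using System.Linq;

public static class EntryQueryDemo
{
    // Returns the full names of all .txt entries under 'directory', including nested folders.
    // The archive must be opened in a mode that allows enumerating entries (Read or Update).
    public static List<string> TextFilesUnder(ZipArchive archive, string directory)
    {
        // Entry names are flat strings such as "docs/sub/c.txt", so a prefix match is enough.
        string prefix = directory.TrimEnd('/') + "/";

        return archive.Entries
            .Where(e => e.FullName.StartsWith(prefix, StringComparison.OrdinalIgnoreCase)
                     && e.FullName.EndsWith(".txt", StringComparison.OrdinalIgnoreCase))
            .Select(e => e.FullName)
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToList();
    }
}
```

Because the archive is a flat list, "subdirectory" here is only a naming convention. To exclude nested folders, add a check that the remainder of the name after the prefix contains no further "/".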

Another interesting comment was that MoveTo was confusing. MoveTo was intended to act like Rename, for renaming entries but keeping everything else about them the same. The method was named MoveTo to mimic the naming for the method on FileInfo. However, because it is confusing and probably has very few compelling usage scenarios, we’re thinking of cutting it from the API.

I hope I addressed some of your comments. We’d love to keep hearing from you, especially if you have extremely compelling use cases for some of the features that we’re not including in this version of the API. Being the designer of the API, I think it would be really cool, too, to support custom compression and encryption, or some of these other features. But at the risk of sounding like a broken record, I’m trying to build a simple, usable API, and these decisions are made with that in mind.