Working with Zip Files in .NET [Richard Lee]


Before getting started, I’ll introduce myself. My name is Richard Lee, and I’m a developer intern on the BCL for the summer. I’ve only been here for a few weeks, but it’s been great working here. The people, the environment, and my project are all great. Speaking of which, my project is to add general purpose .NET APIs for reading and writing Zip files, which we’re considering adding to the next version of the .NET Framework.

The most common Zip tasks are extracting to a directory and archiving a directory. For these mainline scenarios, we have static convenience methods. The following code takes all of the files in the Zip file, photos.zip, and extracts them to a folder on the file system:

ZipArchive.ExtractToDirectory("photos.zip", @"photos\summer2010");

This code does the reverse, putting all of the files in the folder into the Zip file:

ZipArchive.CreateFromDirectory(@"docs\attach", "attachment.zip");

For more sophisticated manipulations of Zip archives, there are two main classes. ZipArchive represents a zip archive, which is a collection of entries, and ZipArchiveEntry represents an archived file entry. The following code extracts only the text files from the given archive into the folder named by the directory variable.

using (var archive = new ZipArchive("data.zip"))
{
    foreach (var entry in archive.Entries)
    {
        if (entry.FullName.EndsWith(".txt", StringComparison.OrdinalIgnoreCase))
        {
            entry.ExtractToFile(Path.Combine(directory, entry.FullName));
        }
    }
}

Zip archives can also be created on the fly. This example creates a new archive with two entries: a readme that is written directly into the archive, with no corresponding file on disk, and a file copied from the file system.

using (var archive = new ZipArchive("new.zip", ZipArchiveMode.Create))
{
    var readmeEntry = archive.CreateEntry("Readme.txt");
    using (var writer = new StreamWriter(readmeEntry.Open()))
    {
        writer.WriteLine("Included files: ");
        writer.WriteLine("data.dat");
    }

    archive.CreateEntryFromFile("data.dat", "data.dat");
}

The ZipArchive class supports three modes:

  1. In Read mode, data is read from the file on demand, using only a small buffer.
  2. In Create mode, data is written directly to the file using only a small buffer. Only one entry may be held open for writing at a time.
  3. In Update mode it is possible to read and write from existing archives, as well as rename or delete entries. This mode requires loading the entire archive into memory, and as such we recommend that it be used only with small archives when this functionality is needed.
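The three modes can be sketched end to end over a MemoryStream, using only the Stream-based constructor and the members that appear in the proposed API. This is a hedged sketch rather than a definitive implementation: the names ZipModesSketch, RoundTrip and WriteText are made up for illustration, and it assumes the types behave as they do in the System.IO.Compression library available in current .NET, which may differ from an API that has not been finalized.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

public static class ZipModesSketch
{
    // Builds an archive in memory, reads one entry back, then edits the archive.
    // Returns the entry names that remain after the update.
    public static string[] RoundTrip()
    {
        using (var buffer = new MemoryStream())
        {
            // Create mode: entries are written one at a time, straight to the stream.
            // leaveOpen = true keeps the MemoryStream usable after the archive is disposed.
            using (var archive = new ZipArchive(buffer, ZipArchiveMode.Create, true))
            {
                WriteText(archive.CreateEntry("Readme.txt"), "Included files: notes.txt");
                WriteText(archive.CreateEntry("notes.txt"), "hello");
            }

            // Read mode: entry data is read on demand when Open() is called.
            buffer.Position = 0;
            using (var archive = new ZipArchive(buffer, ZipArchiveMode.Read, true))
            using (var reader = new StreamReader(archive.GetEntry("notes.txt").Open()))
            {
                Console.WriteLine(reader.ReadToEnd());
            }

            // Update mode: existing entries can be deleted and new ones added.
            buffer.Position = 0;
            using (var archive = new ZipArchive(buffer, ZipArchiveMode.Update, true))
            {
                archive.GetEntry("notes.txt").Delete();
                WriteText(archive.CreateEntry("extra.txt"), "added later");
            }

            // Reopen in Read mode to see what the update left behind.
            buffer.Position = 0;
            using (var archive = new ZipArchive(buffer, ZipArchiveMode.Read, true))
            {
                return archive.Entries.Select(e => e.FullName).ToArray();
            }
        }
    }

    // Hypothetical helper: writes text into an entry and closes it,
    // so that only one entry is ever open for writing at a time.
    private static void WriteText(ZipArchiveEntry entry, string text)
    {
        using (var writer = new StreamWriter(entry.Open()))
        {
            writer.Write(text);
        }
    }
}
```

Calling ZipModesSketch.RoundTrip() should leave an archive holding Readme.txt and extra.txt, with notes.txt removed by the Update step. Nothing touches the file system, which is why the sketch uses the leaveOpen overload: the same MemoryStream is handed to each archive in turn.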

Below is our current thinking on what the public API listing will look like (note that this hasn’t been finalized yet).

namespace System.IO.Compression
{
    public enum ZipArchiveMode { Read, Create, Update }

    public class ZipArchive : IDisposable {
        // Constructors
        public ZipArchive(String path);
        public ZipArchive(String path, ZipArchiveMode mode); 
        public ZipArchive(Stream stream);
        public ZipArchive(Stream stream, ZipArchiveMode mode);
        public ZipArchive(Stream stream, ZipArchiveMode mode, Boolean leaveOpen);

        // Properties
        public ReadOnlyCollection<ZipArchiveEntry> Entries { get; }
        public ZipArchiveMode Mode { get; }
        
        // Instance methods
        public ZipArchiveEntry GetEntry(String entryName);
        public ZipArchiveEntry CreateEntry(String entryName);

        public void Dispose();
        protected virtual void Dispose(Boolean disposing);

        public override String ToString();

        // Instance convenience methods
        public ZipArchiveEntry CreateEntryFromFile(String sourceFileName, String entryName);

        public void ExtractToDirectory(String destinationDirectoryName);

        // Static convenience methods
        public static void CreateFromDirectory(String sourceDirectoryName, String destinationArchive);
        public static void CreateFromDirectory(String sourceDirectoryName, String destinationArchive, Boolean includeBaseDirectory);

        public static void ExtractToDirectory(String sourceArchive, String destinationDirectoryName);
    }

    public class ZipArchiveEntry {
        // Properties
        public DateTimeOffset LastWriteTime { get; set; }
        public String FullName { get; }
        public String Name { get; }
        public Int64 Length { get; }
        public Int64 CompressedLength { get; }
        public ZipArchive Archive { get; }

        // Methods
        public Stream Open();
        public void Delete();
        public void MoveTo(String destinationEntryName);

        // Convenience methods
        public void ExtractToFile(String destinationFileName);
        public void ExtractToFile(String destinationFileName, Boolean overwrite);

        public override String ToString();
    }
}
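To make the ZipArchiveEntry properties in the listing more concrete, here is a small sketch that walks Entries and reports FullName, Name, Length and CompressedLength for each entry. The class and method names (ZipListingSketch, BuildSample, Describe) are hypothetical. The sketch also assumes that Name is the last segment of FullName, as it is in the System.IO.Compression library in current .NET; the listing above does not spell that out.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class ZipListingSketch
{
    // Builds a one-entry archive in memory so the listing has something to read.
    public static MemoryStream BuildSample()
    {
        var buffer = new MemoryStream();
        using (var archive = new ZipArchive(buffer, ZipArchiveMode.Create, true))
        using (var writer = new StreamWriter(archive.CreateEntry("docs/a.txt").Open()))
        {
            // 1000 ASCII characters, so the uncompressed length is 1000 bytes.
            writer.Write(new string('a', 1000));
        }
        buffer.Position = 0;
        return buffer;
    }

    // Returns one line per entry: "FullName (Name): Length bytes, CompressedLength compressed".
    public static List<string> Describe(Stream zipStream)
    {
        var lines = new List<string>();
        using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Read, true))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                lines.Add(entry.FullName + " (" + entry.Name + "): "
                          + entry.Length + " bytes, "
                          + entry.CompressedLength + " compressed");
            }
        }
        return lines;
    }
}
```

Passing the result of ZipListingSketch.BuildSample() to Describe yields a single line for docs/a.txt. The same Describe method would work on a FileStream or any other readable, seekable stream, since it relies only on the Stream constructor.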

We would love to hear what you think of the APIs so far, and how you plan on using them.

Comments (38)

  1. Max says:

    I would make the API based on Stream objects instead of working with files directly. There are a lot of cases where you need to zip something in memory, e.g. for writing to a database or to a network socket.

  2. Jesper says:

    It's unclear from the API whether you'd be able to add an empty folder to the zip file, which could be useful. I second being able to feed a stream into an entry in addition to a file.

  3. Rick says:

    The convenience methods would be much more convenient if they took a standard file mask as the third argument (*.png, *.jpg, etc.).

    Also, not to poo-poo a nice addition, but perhaps support for other compression formats?  At least in the form of creating an abstract base class "Archive" that ZipArchive implements for Zip.  That way we could derive and create for example SevenZipArchive and RarArchive.

  4. Skip says:

    It's not clear to me what ZipArchiveEntry::MoveTo() is used for. Is it used to reuse the object, but point it at a different file in the archive? That seems messy to me, and problematic from a usage standpoint.

  5. Trillian says:

    I definitely +1 the base class or interface suggestion. Zip archives are great and widely supported but it'd be nice to have the flexibility to code implementations for other formats and have a common base.

    Also, I think ZipArchiveEntry.Open should take some kind of FileMode flag to specify what kind of operations are to be supported. Other than that, great! It'll be a nice addition to the framework.

  6. Karl says:

    I also agree with the stream comments.  Also, will the ZipArchive class support multiple zip compression methods, or will this be a “Compressed Folder”?  It would be nice if it supported multiple methods and the ability to specify them, including the support of zip encryption methods.  I work in health care, and this would be a great feature of the API to help comply with security regulations.

  7. SharpZipLib is pretty good, but it'd be great to have structured ZIP support as part of the framework (obviously stream based as above comments; file stuff could just be convenience/extension methods).

  8. Barry Kelly says:

    For the Update mode, I'd prefer to see it based around a streaming approach: that is, having some way of having the archive buffer delta operations, then apply the deltas en masse as a transaction, reading from the source zip and writing to an output zip, with fixed memory usage. But such a lazy / delayed architecture may need a parallel / shadow API to distinguish it from the immediate mode of normal reading operations. You could also consider a log-peeking approach; that is, in your Update mode, you can still read e.g. a file that you just added, but internally it would read the data from the log of deltas, not from the underlying zip file (which wouldn't have been updated yet).

    I am presuming that you already have Stream overloads in mind, as restricting operations to files on disk would be ridiculously limiting. Both the ZIP itself and the entries should be readable and writable through Stream. For robustness, I'd consider writing these streams out to temporary files on disk to avoid them hogging memory when Updating or Adding – an appending Add mode ought not require as much work of rewriting.

    On support for other archive formats, I would advise not being too aggressive with base classes or interfaces, as the risk of over-abstraction, overengineering, interfaceitis, etc. are high. Actually, I would make your zip classes sealed with no general base class or interface at all *until* you have at least two alternate implementations, and the best approach for interface parity is clear.

    Finally, it would be nice to have a .ZIP format with at least rudimentary support for rewriting the zip index, the bit at the end of the zip file, by scanning through the zip entries.

  9. Matt says:

    Please, please don't forget testability. The base class / interface suggestion would be ideal – if you're hard set on static convenience methods just don't lock the rest of us out that unit test our code heavily and use mocking frameworks. It would be nice, for once, to be able to use a BCL offering without having to wrap it in something that allows me to test my code without resorting to things like TypeMock – example: DateTime.UtcNow.

    Also, thanks for being transparent about the design process – good to see!

  10. Larry Smith says:

    sevenziplib.codeplex.com

    Although I haven't looked into it, their examples seem to have the right "feel" to them. I do like the fact that LINQ can be used.

    I second (third, fourth, whatever) an open architecture so that concepts like finding the internal directory, encoding and decoding the contents of an element, and so on, are exposed. IOW, have a CompressedArchive abstract class, of which traditional .zip files are processed by one set of child classes, but other formats (.cab, .rar, etc) could also be implemented (presumably by third parties) and then processed with the same set of APIs.

    Oh, and make sure that long (32K) path names inside the files are supported.

    Also, you don't say what exceptions you might throw. For example, what if the CRC (which you don't have an API to retrieve) doesn't match? And how does CRC mismatch fit into the suggestion for an in-memory Stream retrieval of the data? As much as I like the Stream idea, I don't like the idea of getting to the end of the stream (having updated files/databases/whatever), then finding out that the CRC didn't match and perhaps all the data up to then was flawed.

  11. Ant says:

    The ExtractToFile 'convenience' methods on ZipArchiveEntry seem out of place and overly specific to me.

    Personally I'd rather see more generalized utility methods on the System.IO.File static class, e.g.

    File.WriteAllBytes(string path, Stream stream)  // reads from 'stream' and writes to the file at 'path'

    File.WriteAllBytes(string path, Stream stream, FileMode mode)

    Which could then be used to replace zipArchiveEntry.ExtractToFile(filename) as follows:

    using (Stream s = zipArchiveEntry.Open()) {

       File.WriteAllBytes(filename, s);

    }

  12. Dominik says:

    This approach is too specific because there are plenty of other formats:

    en.wikipedia.org/…/List_of_archive_formats

    It would make more sense if ZipArchive is a subclass of Archive.

  13. David Kemp says:

    I agree with @Dominik -> having an IArchive (or even Archive abstract base class) that ZipArchive, 7zArchive, TarBallArchive, RarArchive (and others) can all implement would be great. Also, something like a CompressedStream might be needed for archives that only have a single stream.

  14. L. says:

    I strongly concur about the suggestions for:

    * making it more Stream-based rather than File-based;

    * deriving from an abstract Archive class (you can choose a better name).

    Since zip files store file names as an array of bytes whereas .Net uses utf-16 for strings, I believe you should specify how you will handle encodings and provide a way to override the defaults (e.g. force encoding of file names to utf-8, or windows-1252).

    Please check your compression ratio compared to other zip implementations.  The current DeflateStream class might need some improvements.  I'm not asking that you match 7zip's deflate or kzip, but you really should compress as well as e.g. Info-zip (with -9).

    Convenience APIs for creating a Package from a ZipArchive (or, better, from an abstract Archive class) would be nice.

    A set of PowerShell cmdlets would be nice, too.

  15. Jaans says:

    Congrats on the new job Richard.

    I do quite a lot of Silverlight development and I often send data from the server side to the client side and this data compresses very well so it's ideal to gain that benefit. Problem is that the System.IO namespace for Silverlight does not include functionality to uncompress data that was compressed by .NET server side.

    So, I'd very much like to see some form of compression + decompression support for the Silverlight BCL.

    Currently I use 3rd party assemblies for both .NET and Silverlight, which is annoying because I've run into a very frustrating scenario where the current compression functionality in .NET is incompatible with the 3rd party SharpZipLib. Not sure who's to blame… don't care… but these things should be more standardised.

    Hope that helps

  16. Okeanos says:

    I find this design concept to be very high-level-API-like. That's okay for fast day-to-day usage and for beginners. But sooner or later more experienced developers might want to return to other compression libraries to have more options and more fine-grained control, and we are where we are today.

    You guys surely know that the things you do with the BCL are carved into stone for a long time so:

    If you decide to add a standard/implement a new feature set into the BCL (!) – which I would really like – do it as complete as possible, as clean as possible (BCL rules, naming, patterns, and so on) and as detailed as possible – even if placed into a child-namespace like ".Advanced" or whatever. If you don't do it that way we end up like System.Web.Mail which have received a complete rewrite within System.Net.Mail and we end up with another dead namespace.

    Things I would want to have, see or changed additionally to the things I already said:

    Support for Passwords

    Support for Compression Options

    especially CompressionLevel (!!) from StoreOnly To Fastest To Slowest with best compression rates.

    ExtractToStream (!)

    Very important. I do not want to extract data to temporary files just to read them in memory again.

    Deleting entries

    I haven't seen any methods for that.

    Naming

    You might consider ZipFile instead of ZipArchive (and ZipFileEntry respectively). I know those "files" are normally called "archives", but I find "File" to be more understandable, cleaner and more consistent with other BCL names.

    Asynchronous Support/Progress reporting/Cancelation support

    I miss any easy-to-use asynchronous support, which is extremely important for compressed files because they tend to be slow to create and slow to extract, and therefore slow to use. I can't have it placing a user of my program into the program-frozen mode, and even if you use a BackgroundWorker (which I normally do) I can't give him any hint of how long it will take or give him a way to cancel the operation.

    The easiest way for you guys to get us that is by giving us the already mentioned ZipFileStream or whatever. With that we can do all that by ourselves. But that would not help beginners or other users of the high-level API. Maybe you could implement some simple-to-use async methods, maybe with progress reporting and cancellation support. You might consider adding something like:

    zipFileEntry.ExtractToDirectory(string directoryName, BackgroundWorker worker);

    Which is using the background worker given to it to provide cancelation and progress report (if respective properties are set to true).

    ZipFileStream

    Like the other compressed streams within the bcl used for lower-level access to zip files and for more fine-grained control.

    A little off-topic but currently a huge problem: ZLibStream implementation

    There are parts of the standard within DeflateStream, but it's not the full standard and not the real thing. Especially creating a ZLibStream with DeflateStream does not work because it is just a small subset. So please add this class to the BCL; I use it really often with content files, e.g. of games or other data-intense applications.

  17. Ed Blackburn says:

    Hi Richard, Here's my two pennies:

    – Where's the Archive abstraction? OCP: Give the option for MS and other devs to support alternative archive formats.

    – I echo the comments about streaming, not files.

    – Both the above will assist in facilitating _proper_ unit testing; why do you _have_ to tie the code directly into the concrete implementation, which is file system and integration heavy?

    Sorry if I sound negative / critical. On the bright side… I love the idea of archiving in the BCL. Good luck and enjoy the project!

  18. Jeff Yates says:

    The API should be extensible to support multiple formats so that we can have interoperability with third party compression utilities. I would imagine something along the lines of the Cryptography APIs that support some standard algorithms but allow others to be provided.

  19. Nice work.

    I was wondering if you're considering the support of more advanced features such as zipping for optimal size, for optimal speed (just store, don't compress). Or splitting in several files.

    This could be achieved by an additional overload receiving an instance of a ZipArchiveSettings class where you'd specify these kinds of options.

    I also like the idea of making it more stream-based. Concerning the support for other archive types such as 7zip or rar, I understand that it may fall a bit out of the immediate scope of your project but creating the previously mentioned hierarchy will reduce the need of big refactorings if this feature ever gets requested.

    Good luck with your project!

  20. Looks like a great start on a new API for ZIP files. Thanks for posting!

    I would suggest that both options should exist: to use File/Folder objects, as well as Stream objects.

    Cheers,

    Trevor Sullivan

  21. Luke Baughan says:

    I agree with all who suggested additional formats especially 7zip (LZMA?) and RAR – although I understand including RAR functionality may be difficult due to licensing issues.

  22. Pavel Minaev [MSFT] says:

    Just wanted to note that the API as described _does_ allow you to extract to a stream. There's no helper method for that, similar to ExtractToFile, but you can get a ZipArchiveEntry, call Open() on that to get a stream, and then use Stream.CopyTo() to extract it to another stream (or just Read() and process in some other way).

    Similarly, the constructors for ZipArchive also include Stream overloads. So it would seem that you can use the API entirely over your own custom streams, both for input and for output, with no on-disk files involved at all. It's just that working with files is slightly more convenient due to helpers.

  23. 13xforever says:

    It seems SharpZipLib is the most popular choice to work with ZIP archives. However, I find it quite lacking in feature set and API realization.

    DotNetZip, however, is pretty sweet. It does everything I want, however I want, and its APIs are clean and easy to use. Fast too.

  24. Ooh says:

    Well, I like the idea and the API you designed – good work!

    For me, there are two open issues:

    * Rename the Dispose method to Close, which is more BCL-like, I think.

    * Also add support for .cab files. Even if you don't directly implement it for the next version, designing it and writing a prototype implementation can help you design a class hierarchy like it has been requested by other commenters.

    Thank you very much and all the very best for your mid and final reviews!

    Ooh

  25. David says:

    Finally! And the API looks usable.

    One thing, since a ZIP is conceptually very much like a folder, it would be great if it would be in netfx as well. So to be more concrete, it could offer more of the functionality that System.IO.Directory offers. For example listing files using a wildcard.

    Regards and keep it up.

    David

  26. Joe White says:

    Sounds like this is something that would be good to put on Codeplex early, so that people can try it out and offer feedback before you ship it in the BCL and permanently freeze the API.

    One more comment: For us, speed is vital, because we zip and unzip large quantities of data. Currently we P/Invoke to an unmanaged library for zipping, and use a managed library for unzipping, just because that's the fastest combination. If the BCL zip library wasn't faster than what we have now, we wouldn't get any benefit from it.

  27. MikeS says:

    A couple of suggestions:

    How about implementing some of the zip operations as extension methods on the Stream class? That way, they'd work with any stream in the same way that Linq extension methods operate on any IEnumerable.

    Secondly, I like the idea of adding file masks (e.g. "*.png,*.jpg") to overloads of the static zip/unzip methods. However, it would also be really flexible to have a static zip/unzip overload that would take a predicate on the file name or the full ZipArchiveEntry, allowing a simple lambda to be used to determine whether to zip/unzip a file:

    ZipArchive.ExtractToDirectory("photos.zip", @"photos\summer2010",
     fileName => fileName.StartsWith("foo", StringComparison.OrdinalIgnoreCase));

    OR

    ZipArchive.ExtractToDirectory("photos.zip", @"photos\summer2010",
     entry => entry.LastWriteTime > someDateTime);

    This would allow a lot of flexibility without the additional code required to create instances of ZipArchive and foreach through them, etc.

    If you wanted the ultimate in flexibility from the static methods, you could have the predicate take a special class that gives access to the ZipArchiveEntry AND supplies the output path or stream and allows the predicate to manipulate it. This would allow scenarios where you might want to extract certain files into different directories, for example.

  28. Aaron B says:

    it would be nice if you could support PKWARE AES 256 encryption.

    maybe you could create an ArchiveStorage class similar to the isolated storage class (msdn.microsoft.com/…/system.io.isolatedstorage.isolatedstoragefile_members.aspx).

  29. zzz says:

    Wanted (and great for testing now and later if more format will be added):

    archiveStreamsCollection.Test() method which opens the stream, compares the header to the currently supported archive types (initially just ZIP, later possibly others) and, if the stream contains a ZIP, reads it in such a way that it extracts the data but throws it away, not consuming memory/disk, while calculating checksums for the files inside.

    When the Test() is running, if the user wants the checksums for each file and their status whether the particular file was determined to be broken, a callback can be subscribed that provides this information.

    The API should be user extendable, so that if we have Proprietary Format archive, we can just call some method to add our format under the same Test() so while MS only supports ZIP, now we can use it to Test/Extract to memory both ZIP + any number of formats we have added support for through a wrapper.

    Nice to have:

    The Zip format has actually changed a bit, I believe. If you try to extract some zips from 15 years ago it might not work. I have no idea how much it changed, but if it's not too much effort to support the older zips as well that would be nice. I know this because when I tried the various .NET zip libraries, they failed to extract/test various older zips. The native Info-Zip library supports older zips as well.

  30. Interestingly, it seems like there are 2 (or more probably, but generally 2) schools of thought on zip libraries, which mirrors the 2 schools of thought on zip in windows more generally. There is the old-school camp which come from the days when the focus was on the structure archive file(s); and there is the new-school camp which has bought into the Microsoft abstraction of a zip archive as just like a folder (as it can be viewed in Windows these days). But in Windows it is still possible to install 7-zip or WinZip or WinRar or whatever and use archives in the old school way. So presumably you might want to design an API (or APIs) that can accommodate both schools of thought.

    I'll second (or third) the idea that the first focus here should be on providing support for compression within the existing streams interfaces and pattern. I would suggest having a look at CryptoStream in System.Security.Cryptography – this is an area of the framework that has already accomplished this in another area. This is also analogous because the CryptoStream can work with many "providers" which give different crypto implementations – so if you mirror the pattern you can have a CompressionStream which can also take one of several providers.

    I'd definitely like to see support for multiple compression providers, specifically .rar and .7z, including as much of their optional functionality as can be supported, such as multiple-volumes.

    Thanks!

  31. Arnshea says:

    Your proposed API looks great.  Kudos for subjecting it to public scrutiny too!

    Areas I'd also like to see covered are:

    Asynchronous, Cancellable operation (already mentioned by others but I've found it helpful too).

    Error Handling – What happens when 1 or more files in a directory are inaccessible during an archival operation?  It'd be nice to have the option of continuing without interruption though it would also be nice to have access to the reason for failure for each file that failed (e.g., fire an Event in case of an error – this could be cancellable or not but the event args should have the Exception(s) responsible).

  32. DBNickel says:

    Zip was already implemented in the J# library. Might want to check that out. It would be nice to not have to include that library for my zip projects in the future. 😉

  33. Charlie says:

    Thank you for doing this!

    We have a web app that allows customers to pick their .jpg files and package / download them.  It would be great if the API:

    1) Allowed developer to set compression ratio (would be zero for our case).

    2) Allow us to pass in a list of strings (into CreateFromDirectory) that represent the names of the files the user selected to zip and download.

  34. Malisa Ncube says:

    I just have to suppose that even though it's called a "Zip" archive, we can select / use different forms of compression resulting in different types of files.

    e.g. rar, iso, .7, .001 and others.

  35. Malisa Ncube says:

    Support for streams is a major requirement. I would recommend checking http://dotnetzip.codeplex.com/ and  sevenziplib.codeplex.com projects.

    Combining features from both would be ideal. I would be happier if the framework would simply use MEF to add compression engines. e.g. 7z, rar, iso and others.

    I'm also against calling it "ZipArchive", is it going to be for .zip files only?

    That's my view.

  36. Omer Mor says:

    The previous comments touched all my pain-points.

    I'll try to sum them up:

    * base-class/interface hierarchy for supporting other archive formats

    * stream support

    * setting compression level

    * password support

    * archive splitting

    * exposing checksums

    * async & cancellation support

    * abstracting an archive as a folder

  37. I think you're making the same mistakes as the current zip libraries, like #ziplib. I would go with a fluent interface approach. The basic helpers for extracting and creating are fine, but once you get into more specific scenarios I think a fluent interface would work really well. Here at work I'm working on such a library myself, using #ziplib as the actual engine, and I think it suits it really well.
