Bug in ZipFile.CreateFromDirectory - your archive can be irreparably corrupted

In this post I want to share my observations of a bug in the ZipFile.CreateFromDirectory method that yields corrupted archives. Luckily this happens only for quite large source directories, so you might not be affected at all. I'll share the size limits that seem to trigger the bug, along with a small code sample that reproduces it.

Our usage of ZipFile.CreateFromDirectory and discovery of the bug

The ZipFile class has some very useful utility methods. We decided to use it for compressing our daily logs (which ran into dozens of GB in cumulative size). We basically zipped the previous day's logs and, if the zipping operation finished successfully, deleted the original directory:

 private static void ArchiveDirectory(string target_dir)
 {
     try
     {
         if (Directory.Exists(target_dir) && !File.Exists(target_dir + ".zip"))
         {
             System.IO.Compression.ZipFile.CreateFromDirectory(target_dir, target_dir + ".zip");
             DeleteDirectory(target_dir);
         }
     }
     catch (Exception)
     { /* This is intentional as there is no place to log during log archival */ }
 }

Here, DeleteDirectory just recursively deletes the original directory with all of its content.
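The DeleteDirectory helper itself is not shown in this post; a minimal sketch of what it might look like (this is an assumption, not the original implementation) is:

```csharp
// Hypothetical sketch of the DeleteDirectory helper mentioned above.
// Directory.Delete with recursive: true removes the directory together
// with all files and subdirectories it contains.
private static void DeleteDirectory(string targetDir)
{
    if (Directory.Exists(targetDir))
    {
        Directory.Delete(targetDir, recursive: true);
    }
}
```

Note that Directory.Delete throws on read-only files, so a production version may need to clear file attributes first.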

This worked great until we crossed a magic line of about 45 GB for the original directory and, at the same time, about 4 GB for the destination zip archive (the size of the destination zip seems to be the more significant factor here). Nothing changed visibly during zipping: ZipFile.CreateFromDirectory created a zip file without throwing any exception, the destination zip files had plausible sizes, and they contained all the source files. Problems started when we wanted to access the original log files; not all of them could be read. The built-in Windows zip support failed (0x80004005 - Unspecified Error), and various third-party archiving tools failed as well (complaining that the zip file was corrupted). So we tried the reciprocal method, ZipFile.ExtractToDirectory. It started unpacking the original zip, calming us down for a minute, but after unpacking those magic ~45 GB (or, more probably, the 4 GB of the zipped folder) it threw:

 System.IO.InvalidDataException: A local file header is corrupt.
at System.IO.Compression.ZipArchiveEntry.ThrowIfNotOpenable(Boolean needToUncompress, Boolean needToLoadIntoMemory)
at System.IO.Compression.ZipArchiveEntry.OpenInReadMode(Boolean checkOpenable)
at System.IO.Compression.ZipArchiveEntry.Open()
at System.IO.Compression.ZipFileExtensions.ExtractToFile(ZipArchiveEntry source, String destinationFileName, Boolean overwrite)
at System.IO.Compression.ZipFileExtensions.ExtractToDirectory(ZipArchive source, String destinationDirectoryName)
at System.IO.Compression.ZipFile.ExtractToDirectory(String sourceArchiveFileName, String destinationDirectoryName, Encoding entryNameEncoding)
at System.IO.Compression.ZipFile.ExtractToDirectory(String sourceArchiveFileName, String destinationDirectoryName)
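Since the corruption only surfaces when an entry is actually opened and decompressed, one defensive measure (a sketch of my own, not from the original archival code) is to fully read every entry of the freshly created archive before deleting the source directory:

```csharp
using System.IO;
using System.IO.Compression;

// Verify an archive by decompressing every entry and discarding the bytes.
// A corrupt local file header is only detected when the entry is opened/read,
// so a successful pass gives some confidence before deleting the originals.
static bool IsArchiveReadable(string zipPath)
{
    try
    {
        using (ZipArchive archive = ZipFile.OpenRead(zipPath))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                using (Stream entryStream = entry.Open())
                {
                    entryStream.CopyTo(Stream.Null); // decompress, throw away output
                }
            }
        }
        return true;
    }
    catch (InvalidDataException)
    {
        return false; // e.g. "A local file header is corrupt."
    }
}
```

This roughly doubles the archival time, but it would have prevented us from deleting logs whose archive was already unreadable.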

Reproducing the bug - you can try it yourself

This turned out to be a very significant problem for us (I'm building a financial platform, where missing logs are unacceptable). So I quickly reported this to Microsoft and got some attention. What probably helped was a very simple repro that can be run without any changes or configuration (you just need about 90 GB of free space and about 30-60 minutes to run it):

 namespace ZipFileTest
{
    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Linq;
    using System.Text;

    class Program
    {
        static void Main(string[] args)
        {
            ZipFileTest zft = new ZipFileTest();
            zft.DoTest();
        }
    }

    public class ZipFileTest
    {
        const string LOGS_DIR = "logs";
        const string LOG_FILE_NAME_PREFIX = "dummyLog";
        private const string LOG_LINES_STRING = @"
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. 
Nam cursus. Morbi ut mi. Nullam enim leo, egestas id, condimentum at, laoreet mattis, massa. 
Sed eleifend nonummy diam. 
Praesent mauris ante, elementum et, bibendum at, posuere sit amet, nibh. Duis tincidunt lectus quis dui viverra vestibulum. Suspendisse vulputate aliquam dui. Nulla elementum dui ut augue. Aliquam vehicula mi at mauris. Maecenas placerat, nisl at consequat rhoncus, sem nunc gravida justo, quis eleifend arcu velit quis lacus. Morbi magna magna, tincidunt a, mattis non, imperdiet vitae, tellus. Sed odio est, auctor ac, sollicitudin in, consequat vitae, orci. 
Fusce id felis. Vivamus sollicitudin metus eget eros.
Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. In posuere felis nec tortor. Pellentesque faucibus. Ut accumsan ultricies elit. 
Maecenas at justo id velit placerat molestie. Donec dictum lectus non odio. Cras a ante vitae enim iaculis aliquam. Mauris nunc quam, venenatis nec, euismod sit amet, egestas placerat, est. 
Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Cras id elit. Integer quis urna. Ut ante enim, dapibus malesuada, fringilla eu, condimentum quis, tellus. Aenean porttitor eros vel dolor. Donec convallis pede venenatis nibh. Duis quam. Nam eget lacus. Aliquam erat volutpat. Quisque dignissim congue leo.
Mauris vel lacus vitae felis vestibulum volutpat. Etiam est nunc, venenatis in, tristique eu, imperdiet ac, nisl. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. In iaculis facilisis massa. Etiam eu urna. Sed porta. Suspendisse quam leo, molestie sed, luctus quis, feugiat in, pede. Fusce tellus. Sed metus augue, convallis et, vehicula ut, pulvinar eu, ante. Integer orci tellus, tristique vitae, consequat nec, porta vel, lectus. Nulla sit amet diam. Duis non nunc. Nulla rhoncus dictum metus. Curabitur tristique mi condimentum orci. Phasellus pellentesque aliquam enim. Proin dui lectus, cursus eu, mattis laoreet, viverra sit amet, quam. 
Curabitur vel dolor ultrices ipsum dictum tristique. Praesent vitae lacus. Ut velit enim, vestibulum non, fermentum nec, hendrerit quis, leo. Pellentesque rutrum malesuada neque.
Nunc tempus felis vitae urna. Vivamus porttitor, neque at volutpat rutrum, purus nisi eleifend libero, a tempus libero lectus feugiat felis. Morbi diam mauris, viverra in, gravida eu, mattis in, ante. 
Morbi eget arcu. Morbi porta, libero id ullamcorper nonummy, nibh ligula pulvinar metus, eget consectetuer augue nisi quis lacus. 
Ut ac mi quis lacus mollis aliquam. Curabitur iaculis tempus eros. 
Curabitur vel mi sit amet magna malesuada ultrices. Ut nisi erat, fermentum vel, congue id, euismod in, elit. Fusce ultricies, orci ac feugiat suscipit, leo massa sodales velit, et scelerisque mi tortor at ipsum. Proin orci odio, commodo ac, gravida non, tristique vel, tellus. Pellentesque nibh libero, ultricies eu, sagittis non, mollis sed, justo. Praesent metus ipsum, pulvinar pulvinar, porta id, fringilla at, est.
Phasellus felis dolor, scelerisque a, tempus eget, lobortis id, libero. Donec scelerisque leo ac risus. Praesent sit amet est. In dictum, dolor eu dictum porttitor, enim felis viverra mi, eget luctus massa purus quis odio. Etiam nulla massa, pharetra facilisis, volutpat in, imperdiet sit amet, sem. 
Aliquam nec erat at purus cursus interdum. 
Vestibulum ligula augue, bibendum accumsan, vestibulum ut, commodo a, mi. Morbi ornare gravida elit. Integer congue, augue et malesuada iaculis, ipsum dui aliquet felis, at cursus magna nisl nec elit. Donec iaculis diam a nisi accumsan viverra. Duis sed tellus et tortor vestibulum gravida. Praesent elementum elit at tellus. Curabitur metus ipsum, luctus eu, malesuada ut, tincidunt sed, diam. Donec quis mi sed magna hendrerit accumsan. Suspendisse risus nibh, ultricies eu, volutpat non, condimentum hendrerit, augue. Etiam eleifend, metus vitae adipiscing semper, mauris ipsum iaculis elit, congue gravida elit mi egestas orci. Curabitur pede.
Maecenas aliquet velit vel turpis. Mauris neque metus, malesuada nec, ultricies sit amet, porttitor mattis, enim. In massa libero, interdum nec, interdum vel, blandit sed, nulla. In ullamcorper, est eget tempor cursus, neque mi consectetuer mi, a ultricies massa est sed nisl. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos hymenaeos. Proin nulla arcu, nonummy luctus, dictum eget, fermentum et, lorem. Nunc porta convallis pede.";

        private string[] _loglines;
        private int _fileCnt = 0;
        Random _rnd = new Random((int)DateTime.Now.Ticks);

        public ZipFileTest()
        {
            _loglines = LOG_LINES_STRING.Split(new[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
        }

        private string CreateLogDirectory()
        {
            string basedir = System.IO.Path.GetDirectoryName(System.Reflection.Assembly.GetExecutingAssembly().Location);
            basedir = basedir ?? string.Empty;

            string targetDir = System.IO.Path.Combine(basedir, LOGS_DIR, DateTime.UtcNow.ToString("yyyy-MM-dd"), Process.GetCurrentProcess().Id.ToString());

            if (!Directory.Exists(targetDir))
            {
                Directory.CreateDirectory(targetDir);
            }

            return targetDir;
        }

        private string RandomString(int size)
        {
            StringBuilder builder = new StringBuilder();

            builder.Append(
                Enumerable.Repeat(0, size)
                          .Select(dummy => Convert.ToChar(Convert.ToInt32(Math.Floor(26*_rnd.NextDouble() + 65))))
                          .ToArray());

            return builder.ToString();
        }

        private void CreateNextLogFile(string logDir)
        {
            FileStream stream = File.Open(
                            System.IO.Path.Combine(logDir, string.Format("{0}_{1}.log", LOG_FILE_NAME_PREFIX, _fileCnt++)),
                            FileMode.Append,
                            FileAccess.Write,
                            FileShare.Read);

            for (int i = 0; i < 500000; i++)
            {
                string logLine = _loglines[_rnd.Next(_loglines.Length)];

                for (int j = 0; j < _rnd.Next(5); j++)
                {
                    logLine = logLine.Insert(_rnd.Next(logLine.Length), RandomString(_rnd.Next(30)));
                }

                byte[] logLineBytes = Encoding.Default.GetBytes(logLine);
                stream.Write(logLineBytes, 0, logLineBytes.Length);
            }

            stream.Close();
        }

        private string ZipLogFiles(string logDir)
        {
            string dirToZip = Directory.GetParent(logDir).FullName;
            System.IO.Compression.ZipFile.CreateFromDirectory(dirToZip, dirToZip + ".zip");
            return dirToZip + ".zip";
        }

        private void UnzipZippedLogs(string zipName, string logDir)
        {
            string dirToUnzip = Path.Combine(Directory.GetParent(logDir).Parent.FullName, "UnzippedLogs");
            System.IO.Compression.ZipFile.ExtractToDirectory(zipName, dirToUnzip);
        }

        public void DoTest()
        {
            string logDir = CreateLogDirectory();
            for (int i = 0; i < 700; i++)
            {
                CreateNextLogFile(logDir);
            }

            string zipName = ZipLogFiles(logDir);
            UnzipZippedLogs(zipName, logDir);
        }
    }
}

You need to reference the System.IO.Compression.FileSystem assembly, so the references section of your csproj file will look like this:

   <ItemGroup>
    ...
    <Reference Include="System.IO.Compression.FileSystem" />
    ...
  </ItemGroup>

Do not pay too much attention to all the magic constants; they are tuned so that you can reliably hit the issue with reasonable time and space requirements for the repro.

When running this code, the unzipping step will throw the above-mentioned exception. The catch is that handling this exception won't help you: the archive was already corrupted during the zipping operation.

Workaround and a fix

I have reported this issue to Microsoft and I'm in contact with the owners of this code. They were able to find the root cause of the issue (my wild guess would be something like 32-bit checksums or offsets, as this repros with archives over 4 GB). There seems to be no remedy for already-archived folders (I'll try to find at least a partial remedy myself, as we simply cannot lose logs). For newly created archives, it's highly recommended to split the original folder into multiple destination archives, each below 4 GB (you'll probably need to do some tuning to find the right source sizes, since the level of compression depends heavily on the patterns in your logs).
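The splitting workaround can be sketched roughly as follows. This is my own illustration, not tested production code: it rolls over to a new archive whenever the accumulated *source* size crosses a threshold, so the threshold must be tuned against your compression ratio to keep each resulting zip under 4 GB.

```csharp
using System.IO;
using System.IO.Compression;

// Workaround sketch: archive a directory into several .partN.zip files,
// starting a new archive once the accumulated source size would exceed
// maxSourceBytesPerArchive. Threshold and naming scheme are assumptions.
static void ArchiveInParts(string sourceDir, long maxSourceBytesPerArchive)
{
    int part = 0;
    long accumulated = 0;
    ZipArchive archive = null;

    try
    {
        foreach (string file in Directory.EnumerateFiles(sourceDir, "*", SearchOption.AllDirectories))
        {
            long fileSize = new FileInfo(file).Length;

            // Roll over to a new archive before the current part grows too large.
            if (archive == null || (accumulated > 0 && accumulated + fileSize > maxSourceBytesPerArchive))
            {
                if (archive != null) archive.Dispose();
                string zipPath = string.Format("{0}.part{1}.zip", sourceDir, part++);
                archive = ZipFile.Open(zipPath, ZipArchiveMode.Create);
                accumulated = 0;
            }

            // Preserve the relative path of the file inside the archive.
            string entryName = file.Substring(sourceDir.Length + 1).Replace('\\', '/');
            archive.CreateEntryFromFile(file, entryName);
            accumulated += fileSize;
        }
    }
    finally
    {
        if (archive != null) archive.Dispose();
    }
}
```

A single oversized file would still end up in one archive on its own, so per-file size also needs watching if your individual logs can approach the limit.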