ZIP-Level Robustness for the Office XML Formats


I already mentioned this in one of my comments, but I wanted to expand on it a bit more. I want to make clear up front though that I’m not trying to claim we were original in using ZIP, or that we found some new cool stuff that no one else knew about. One of the reasons we chose ZIP was that it was already in such wide use that we knew it would be easier for people to work with. If you guys are curious to know more about the reasons behind our decision, let me know and I can go into the different requirements we had when deciding on the container format.


For this post though I want to focus specifically on the robustness of ZIP, since there was some disagreement about how robust ZIP really is. While I’ll agree that ZIP isn’t the most robust technology out there, it’s still extremely robust relatively speaking (especially compared to compound doc). There is a central directory at the end of the entire ZIP package which records the location of each part (file) in the ZIP, but that directory actually isn’t required to open the file; it just makes it easier to do random access to any individual part. The parts inside the ZIP are all written out serially, and each one is preceded by a local header. So, even without the central directory, we could still rebuild the ZIP package just by scanning the file for those headers.


So, this means that if any one part gets corrupted, you can still recover all the other parts. All the compression occurs at the part level, so you don’t need to worry about the corruption of one part interfering with your ability to decompress the other parts. Additionally, if the file gets cut off at some point and you don’t get the central directory, you can still rebuild it just by scanning all the parts you have. Of course you can’t rebuild the parts that are missing, but if all the important parts are at the front of the file, you at least get those.


To take advantage of these properties and make our files robust, we break them out into multiple parts. In Word, for instance, the document properties are stored in a separate part from the document content. The properties go toward the end of the file, and the content goes toward the front. That way, even if there is some type of bit-level corruption or transmission error, we increase our chances of being able to recover the most important data. I’ll go into more detail later about how each application breaks its files out into separate parts. It’s really cool.
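You can see this layout for yourself with any ZIP library, since each entry records its physical offset in the file. A quick sketch with Python’s standard `zipfile` module; rather than assume a document on disk, it builds a toy package with the same part-naming conventions Word uses (real packages contain more parts than this):

```python
import io
import zipfile

# A toy package laid out the way the post describes: the document
# content part near the front, the properties part toward the end.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("[Content_Types].xml", "<Types/>")
    z.writestr("word/document.xml", "<w:document>...</w:document>")
    z.writestr("docProps/core.xml", "<cp:coreProperties/>")

with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as z:
    for info in z.infolist():
        # header_offset is each part's physical position in the file
        print(f"{info.header_offset:6d}  {info.filename}")
```

The offsets confirm that `word/document.xml` physically precedes `docProps/core.xml`, which is exactly why a truncated transfer is more likely to preserve the content than the properties.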


The last big benefit of ZIP from a robustness point of view goes back to its widespread use. There are tons of ZIP tools out there today that deal with corrupt ZIP files. We knew that we could build in good recovery logic, and in addition, people would have the ability to use other tools to try to repair their files if they wanted to.


There are other kinds of corruption that occur as well; this discussion was really just about bit-level corruption. Other corruptions we see are more related to application or user error. I’ll talk more about those in a future post, since they don’t have as much to do with ZIP as with the openness of the formats.


-Brian

Comments (8)

  1. Josh says:

    There have been lots of comments about Microsoft just copying OpenOffice. Can you elaborate on the differences between the two file formats?

  2. snowknight says:

    After reading this, I’m really glad you guys are switching to this new format. I was a work-study student in a college computer lab, and one of the most frequent complaints was a Word document trapped on a dying floppy. Sadly, the disk corruption had a tendency to destroy most if not all of the document. It sounds like this new format has a better chance of surviving that.

  3. lexp says:

    As far as I know, ZIP does not support

    1) Unicode file names

    2) Data Recovery

    3) >2GB file size

    4) High precision date/time

    etc.

    Today WinRAR http://www.win-rar.com obviously has the best compression engine.

    Maybe you guys should just buy this technology and use WinRAR instead of ZIP? You are making a new standard for the next 10+ years, and in 5 years some limitations of ZIP could turn out to limit the usability of the new Office XML format.

  4. Evan Erwin says:

    High precision date/time? Meh. Over 2GB files? Jesus man, if you’re working in -Office- with a file over 2GB, you’re using the wrong tool!

    I also felt this post was fairly redundant. Other than saying it’s more widely supported than any other compressed format (and that’s certainly true), nothing new here.

  5. BrianJones says:

    lexp,

    Those are some valid points (especially around file size). With the growing popularity of large media files (pictures, videos, images), we’re seeing very large Office files being generated. This is especially the case with PowerPoint, where people like to create photo albums with background songs, as well as other uses of rich media. So while it may seem ridiculous at this point to imagine a 2GB file, we’re definitely heading in that direction, and like you said, these formats will be around for a long time.

    We support a good portion of the ZIP spec, version 4.5 (we’ll fully document exactly which parts we don’t support). For the most part, our goal was to be compatible with most of the existing ZIP tools out there. As a result, we actually get a bit more than you mention:

    Size Limits – The original ZIP spec was limited to 64 thousand items, and the size (compressed or uncompressed) of any item or the overall archive was limited to 2 or 4GB (can’t remember for sure). We support a later addition to the ZIP spec called "Zip64," which is already supported by third-party ZIP tools. It raises these limits considerably: about 18 quintillion items instead of 64 thousand, and 16 exabytes instead of 2-4 gigabytes. Our implementation won’t let you hit those astronomical values, but needless to say, we aren’t limited by ZIP itself.
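    [A quick back-of-the-envelope check of where those figures come from: the original spec stores the entry count in a 16-bit field and sizes in 32-bit fields, while Zip64 widens both to 64 bits.]

    ```python
    classic_entries = 2**16 - 1    # 65,535 items in the original spec
    classic_size    = 2**32 - 1    # ~4 GB per item or archive
    zip64_entries   = 2**64 - 1    # ~1.8e19, i.e. "18 quintillion"
    zip64_size      = 2**64 - 1    # ~16 exabytes

    print(f"{zip64_entries:.1e} items, ~{round(zip64_size / 2**60)} EiB")
    ```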

    Unicode – By "file names" I assume you mean the names of items inside the archive (we obviously support Unicode characters in the name of the Office file itself). The ZIP spec has no provision for non-ASCII item names. Some ZIP tools have made up their own schemes for non-ASCII characters, but they are often problematic. The part names inside the archive are actually insignificant, though: they are essentially just URIs that we use to identify each part and build up the relationships. So yes, this is a limitation, but we don’t see it having a big impact.

    -Brian

  6. Rob says:

    Brian, great posts! Can you talk a little about the object models, .NET or otherwise, that will be available to developers to manipulate these ZIP files?
