ZIP-Level Robustness for the Office XML Formats

I already mentioned this in one of my comments, but I wanted to expand on it a bit more. I want to first make clear, though, that I’m not trying to claim that we’re original for using ZIP, or that we found some cool new stuff no one else knew about. One of the reasons we chose ZIP was that it is in such wide use today that we knew it would be easier for people to work with. If you guys are curious to know more about the reasons behind our decision, let me know and I can go into the different requirements we had when deciding on the container format.

For this post, though, I want to focus specifically on the robustness of ZIP, since there was some disagreement on how robust ZIP really is. While I’ll agree that ZIP isn’t the most robust technology out there, it’s still extremely robust relatively speaking (especially compared to compound doc). There is a central directory at the end of the ZIP package which records the location of each part (file) in the ZIP, but it isn’t actually needed in order to get at the contents. The central directory just makes it easier to do random access to any individual part. The parts inside the ZIP are all written out serially, and each one is preceded by its own header. So, even without the central directory, we could still rebuild the ZIP package just by scanning the file looking for each header.
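To make the scanning idea a bit more concrete, here is a minimal Python sketch. It is not the recovery logic Office uses, just an illustration. It assumes the standard ZIP local file header layout (the 4-byte signature "PK\x03\x04" followed by 26 bytes of fixed fields, then the part name). The function name scan_local_headers and the dictionary keys are made up for this example. It is also naive in two ways: it can be fooled if the signature bytes happen to appear inside a part's data, and for entries that store their sizes in a trailing data descriptor, the sizes in the header will read as zero.

```python
import struct

# Local file header: signature, version, flags, method, time, date,
# CRC-32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER_SIG = b"PK\x03\x04"
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")  # 30 bytes, little-endian


def scan_local_headers(data: bytes) -> list:
    """Walk the raw bytes and list every part that has a readable local header.

    The central directory is never consulted, so this also works on a file
    that was cut off before the directory was written.
    """
    parts = []
    pos = data.find(LOCAL_HEADER_SIG)
    while pos != -1:
        if pos + LOCAL_HEADER.size > len(data):
            break  # header itself is cut off; nothing more to recover
        (_sig, _version, flags, method, _time, _date,
         crc, compressed_size, _uncompressed_size,
         name_len, extra_len) = LOCAL_HEADER.unpack_from(data, pos)
        name_start = pos + LOCAL_HEADER.size
        name_end = name_start + name_len
        if name_end > len(data):
            break  # part name is cut off
        # Bit 11 of the flags marks a UTF-8 name; otherwise the name is CP437.
        encoding = "utf-8" if flags & 0x800 else "cp437"
        parts.append({
            "offset": pos,
            "name": data[name_start:name_end].decode(encoding, errors="replace"),
            "method": method,  # 0 = stored, 8 = deflate
            "crc": crc,
            "compressed_size": compressed_size,
            "data_start": name_end + extra_len,
        })
        # Keep looking for the next header after this signature.
        pos = data.find(LOCAL_HEADER_SIG, pos + len(LOCAL_HEADER_SIG))
    return parts
```

Given the bytes of a package whose tail is missing, scan_local_headers(data) returns one entry per part it could find, in the order the parts were written, which is enough information to write out a fresh central directory.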

So, this means that if any one part gets corrupted, you can still recover all the other parts. All the compression occurs at the part level, so you don't need to worry about the corruption of one part interfering with your ability to decompress the other parts. Additionally, if the file gets cut off at some point and you don’t get the central directory, you can still rebuild it just by scanning all the parts you have. Of course you can’t rebuild the parts that are missing, but if all the important parts are at the front of the file, you at least get those.
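Here is a rough Python sketch of what part-level compression buys you. Again, this is only an illustration under some assumptions, not our actual recovery code: the function name extract_part is hypothetical, it only handles stored (method 0) and deflate (method 8) parts, and it gives up on entries whose sizes live in a trailing data descriptor. The point is that each part is decoded using nothing but its own header and its own bytes, so damage in one part can't stop you from reading another.

```python
import struct
import zlib

# Same 30-byte local file header layout as defined by the ZIP format.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def extract_part(data: bytes, offset: int) -> bytes:
    """Decode the single part whose local header starts at `offset`.

    Raises ValueError if the header, the compressed stream, or the CRC is bad.
    Nothing outside this one part is read.
    """
    if offset + LOCAL_HEADER.size > len(data):
        raise ValueError("local header is truncated")
    (sig, _version, flags, method, _time, _date,
     crc, compressed_size, _uncompressed_size,
     name_len, extra_len) = LOCAL_HEADER.unpack_from(data, offset)
    if sig != b"PK\x03\x04":
        raise ValueError("no local file header at this offset")
    if flags & 0x08:
        raise ValueError("sizes are in a data descriptor; not handled in this sketch")

    start = offset + LOCAL_HEADER.size + name_len + extra_len
    raw = data[start:start + compressed_size]
    if len(raw) < compressed_size:
        raise ValueError("part data is truncated")

    if method == 0:        # stored: bytes are kept as-is
        payload = raw
    elif method == 8:      # deflate: raw stream, no zlib wrapper (wbits=-15)
        try:
            payload = zlib.decompress(raw, wbits=-15)
        except zlib.error as exc:
            raise ValueError(f"deflate stream is damaged: {exc}") from exc
    else:
        raise ValueError(f"compression method {method} not handled in this sketch")

    # The CRC-32 in the header covers the uncompressed bytes of this part only.
    if zlib.crc32(payload) != crc:
        raise ValueError("CRC mismatch: part is corrupt")
    return payload
```

If you call extract_part once per header offset and catch ValueError, a damaged part simply gets reported and skipped while every intact part comes back whole.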

To take advantage of these properties and make our files robust, we break them out into multiple parts. In Word, for instance, the document properties are stored in a separate part from the document content. The properties go towards the end of the file, and the content goes towards the front. That way, even if there is some type of bit-level corruption or transmission error, we increase our chances of being able to recover the most important data. I’ll go into more detail later about how each application breaks its files out into separate parts. It’s really cool.

The last big benefit of ZIP from a robustness point of view goes back to its widespread use. There are tons of ZIP tools out there today that deal with corrupt ZIP files. We knew that we could build in good recovery logic and that, in addition, people would have the ability to use other tools to try to repair their files if they wanted to.

There are other kinds of corruption that occur as well; this discussion was really just about bit-level corruption. The other corruptions we see are more related to application or user error. I’ll talk more about those in a future post, since they don’t have as much to do with ZIP as they do with the openness of the formats.

-Brian