Arbitrary content in an OPC package

I was up early this morning playing around with some sample Open XML documents when my friend Pedro, an ISV in Portugal, sent me an email with this question:

I'm trying to store a file inside the zip file (.docx), but when I open the .docx on word 2007 I get the following error message: «the office open xml file test.docx cannot be opened because there are problems with the contents.» I want to store some files inside the .docx (zip file) for use by other applications without affecting how it is used by word 2007.

Great question, Pedro, and coincidentally I need a little break from what I was working on. Let's take a look at how this can be done.

Basic Concepts: Parts, Relationships, and Content Types

First, a few basic concepts. Open XML documents are packages based on the Open Packaging Conventions (OPC) specification. XPS documents are also OPC packages, and you can create OPC packages of your own that have nothing to do with Open XML or XPS if you'd like. If you do that, you can use the .NET 3.0 packaging API (the System.IO.Packaging namespace) to work with your documents, which means the same programming techniques you'll use for Open XML apply to your own custom package.

In Pedro's case, that isn't quite what he wants to do, however. He wants to work with a plain old DOCX as created by Word 2007, but he wants to put some of his own files in the package so that he can use them later. (I know a bit about Pedro's work so I could speculate on why he wants to do that, but such speculation is outside the scope of this post. :-))

To understand the steps involved, consider that OPC packages have three key concepts: parts, relationships, and content types. Each part in an OPC package (like a DOCX) is related to some other part or the package itself, and each part has a defined content type. So we need to do three things to hide a "payload part" in an OPC package:

  1. Put our part in the package somewhere
  2. Define a content type for our part
  3. Define a relationship to our part

The Payload Part

For this little demo I've created a file that is most definitely not a standard file type of any kind. My payload part is named MyFile.doug, and it's a file of 10 bytes, with values 0-9 in that order. It's not text, not XML, not an image, not anything that Word has ever seen before. If we can successfully embed that part in a DOCX so that it stays there in round-trips through editing in Word, then we can embed anything in a DOCX.

By the way, "payload part" isn't a term in the OPC spec. It's just the name I'm using for my custom part since it's essentially payload that comes along for the ride when our DOCX is edited by Word.

Now I need to put this part into a DOCX. I created a small DOCX, renamed it to .ZIP (Open XML documents are just ZIP packages), created a folder named mystuff inside the ZIP package, and dropped a copy of MyFile.doug into that folder. When I rename the file back to .DOCX, I have what Pedro described above: a DOCX that appears corrupt to Word because it's not yet a valid OPC package. So we still have a couple of details to work out.
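
If you'd rather script this step than rename files by hand, here's a minimal sketch in Python using the standard zipfile module. The add_part helper and the two constants are names made up for this illustration, and the sketch works on the package as bytes in memory. It only covers step 1, so the result still lacks a content type and a relationship, just like the hand-built file described above.

```python
import io
import zipfile

# Hypothetical names for this sketch; the part name and bytes match the demo.
PAYLOAD_NAME = "mystuff/MyFile.doug"
PAYLOAD = bytes(range(10))  # ten bytes with values 0 through 9


def add_part(package_bytes: bytes, name: str, data: bytes) -> bytes:
    """Return a copy of a ZIP package with one extra entry appended."""
    buffer = io.BytesIO(package_bytes)
    # Mode "a" keeps the existing entries and appends the new one.
    # ZIP has no real folders: the "mystuff/" prefix in the entry name
    # is what makes the file show up inside a mystuff folder.
    with zipfile.ZipFile(buffer, mode="a", compression=zipfile.ZIP_DEFLATED) as package:
        package.writestr(name, data)
    return buffer.getvalue()
```

Assuming a document saved as test.docx, you could read it with open("test.docx", "rb").read(), pass the bytes to add_part along with PAYLOAD_NAME and PAYLOAD, and write the result back out under a new name.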

What's a ".doug" file?

That's at the heart of Word's complaints about this file: what's this ".doug" content type? We need to define a content type for this part, because the OPC spec says that every part must have a defined content type. In this case, we're not asking Word to actually render the part in any way, so we don't need to use a content type that Word understands, but we need to give our part some content type. So I made up a new content type for parts with an extension of "doug" and put it in the [Content_Types].xml part as follows:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="doug" ContentType="mycontenttypes/dougfile"/>
  <!-- ... other content types as defined by Word ... -->
</Types>
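
The same edit can be sketched in Python with the standard xml.etree.ElementTree module. The function name add_default_content_type is made up for this example, and the choice to leave an existing mapping for the same extension untouched is an assumption on my part, not something the OPC spec or Word requires.

```python
import xml.etree.ElementTree as ET

# Namespace of the [Content_Types].xml part.
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"


def add_default_content_type(content_types_xml: bytes, extension: str, content_type: str) -> bytes:
    """Add a <Default> mapping for an extension unless one already exists."""
    # Serialize with a default namespace (xmlns="...") instead of an ns0: prefix.
    ET.register_namespace("", CT_NS)
    root = ET.fromstring(content_types_xml)
    for default in root.findall(f"{{{CT_NS}}}Default"):
        # Extensions are compared case-insensitively in this sketch.
        if default.get("Extension", "").lower() == extension.lower():
            return content_types_xml  # already mapped; leave the part as-is
    new_default = ET.Element(
        f"{{{CT_NS}}}Default",
        {"Extension": extension, "ContentType": content_type},
    )
    root.insert(0, new_default)
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)
```

One caveat: ElementTree writes its own XML declaration, so the standalone="yes" attribute from the original part is not carried over.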

By the way, when you make up a new content type like this, it needs to conform to the media type syntax in the IETF's RFC 2616 (the HTTP/1.1 specification, section 3.7). Here's my one-sentence summary of that syntax: come up with a category, put a slash after it, then put something indicating the specific content type, and don't put any spaces in it. Done.
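
For a slightly stricter check than that one-sentence summary, here's a rough sketch of a validator built on the RFC 2616 "token" rule (any printable ASCII character except the listed separators). The name looks_like_media_type is made up, and the sketch deliberately ignores the optional ";parameter" suffix that the RFC also allows.

```python
import re

# An RFC 2616 token: printable ASCII minus the separators
# ( ) < > @ , ; : \ " / [ ] ? = { } and whitespace.
_TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
_MEDIA_TYPE = re.compile(rf"{_TOKEN}/{_TOKEN}")


def looks_like_media_type(value: str) -> bool:
    """True if value has the shape type/subtype, with no parameters."""
    return _MEDIA_TYPE.fullmatch(value) is not None
```

With this check, "mycontenttypes/dougfile" passes, while a value containing a space or a second slash does not.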

Relationship Management

The final piece of the puzzle is adding a relationship to our payload part. We'll do this in the .rels part in the root _rels folder in the package, because that's where package-level relationships are defined. In other words, we're relating this payload to the package itself, and not to any specific part within the package.

One concept that applies here is implicit versus explicit relationships. An explicit relationship is referenced from inside the body of the document at a specific location, and Word will attempt to render the related part's content at that location. But in this case, we're going to set up an implicit relationship, which means that we'll define a relationship, but that relationship won't be mentioned anywhere in the body of the document. Word doesn't know what to do with a mycontenttypes/dougfile part, and that's fine as long as we don't ask Word to try to render it or use it in any way.

So here's what the _rels/.rels part looks like after I add the relationship to my payload part:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="myRel1" Type="https://doug.com/myfiletype" Target="mystuff/MyFile.doug"/>
  <!-- ... other relationships defined by Word ... -->
</Relationships>
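
Here's a hedged Python sketch of the same change, again using only the standard library. The add_relationship name is hypothetical. The duplicate-Id check is there because relationship Ids have to be unique within a single .rels part.

```python
import xml.etree.ElementTree as ET

# Namespace of every .rels part in an OPC package.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def add_relationship(rels_xml: bytes, rel_id: str, rel_type: str, target: str) -> bytes:
    """Append one <Relationship> element to the XML of a .rels part."""
    # Serialize with a default namespace (xmlns="...") instead of an ns0: prefix.
    ET.register_namespace("", RELS_NS)
    root = ET.fromstring(rels_xml)
    existing_ids = {rel.get("Id") for rel in root.findall(f"{{{RELS_NS}}}Relationship")}
    if rel_id in existing_ids:
        raise ValueError(f"relationship Id already in use: {rel_id}")
    ET.SubElement(
        root,
        f"{{{RELS_NS}}}Relationship",
        {"Id": rel_id, "Type": rel_type, "Target": target},
    )
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)
```

For the demo file, the call would pass "myRel1", "https://doug.com/myfiletype", and "mystuff/MyFile.doug" along with the bytes of the existing _rels/.rels part.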

Mission Accomplished

And that's all there is to it. We now have a DOCX with our special payload part embedded in it, and Word can open the document no problem. When Word saves changes to the document it will keep this part, preserving its name and contents and location in the mystuff folder, because there's a package-level relationship that tells Word this part is part of the package, even if it happens to be a part that Word doesn't understand.
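
Since the whole point is for other applications to use the payload later, here's a minimal sketch of how such an application might find it again after Word has saved the document. The read_payload name is made up, and looking the part up by relationship Type rather than by Id is my own choice for the sketch, since nothing above says whether Word keeps the Id unchanged.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def read_payload(package_bytes: bytes, rel_type: str) -> bytes:
    """Return the bytes of the first part that a package-level relationship
    of the given type points to."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        rels = ET.fromstring(package.read("_rels/.rels"))
        for rel in rels.findall(f"{{{RELS_NS}}}Relationship"):
            if rel.get("Type") != rel_type:
                continue
            if rel.get("TargetMode") == "External":
                continue  # points outside the package, nothing to read
            # Package-level targets resolve against the package root, so a
            # leading slash can simply be dropped to get the ZIP entry name.
            return package.read(rel.get("Target", "").lstrip("/"))
    raise KeyError(f"no package-level relationship of type {rel_type}")
```

For the demo file, read_payload(data, "https://doug.com/myfiletype") should hand back the same ten bytes that went in, assuming the part survived the round-trip as described.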

The file I constructed for this little demonstration is attached, and if you have the RTM build of Word 2007 you can open it to see how this all works.

For a great example of a creative application of Open XML's ability to embed arbitrary content in document packages, check out the Mindjet video on Channel 9's "In The Office" show, in which Michael Scherotter explains how MindManager works with Open XML documents. By embedding various resources in a DOCX, MindManager is able to create documents that can be edited in Word and then returned to MindManager later with all the information needed to "rehydrate" the original mindmap. This is powerful stuff, and folks like Michael and Pedro are on the leading edge of what's sure to become a common approach for integrating all sorts of content into Open XML documents.

Hope that helps, Pedro. And now I have another Open XML sample I can use for explaining the details of OPC. Obrigado!

PayloadPartDemo.docx