I wanted to get an example document posted so people get a chance to look through the new Office 12 XML formats and see what the similarities and differences are with the Word 2003 XML format. I took a basic document and saved it out in the new format, as well as in Word 2003's XML format. This is still very early code, so a number of the structures could still change, but I'm pretty confident this is close to what the final version will look like. Also, the majority of the file size is taken up by an embedded picture, so you won't see a significant file size saving with the new format compared to the current binary formats.
You will see right away that it's just pure XML representing the file. I read a post on a blog today where the author mistakenly thought these new formats weren't XML, but instead just XML-based. I guess if that's referring to the fact that we use ZIP as a container it would be true, but other than ZIP, everything else is pure XML following the W3C XML 1.0 standard. I still remember when we decided to go with ZIP as the container... it was a pretty straightforward decision. There were already a number of other formats out there using XML and ZIP, so we figured that would be the best way to go if we wanted people to have an easier time working with our files. Using a single flat XML file wasn't really ever given serious consideration just because of the file size bloat. This was especially true for PowerPoint, where presentations often contain tons of pictures, and having to encode those to store in a single XML file just didn't make a lot of sense.
So anyone want to see an example of the format? If you download the following zip file: http://jonesxml.com/resources/BasicDocument.zip you will find three documents that have identical content, but in different formats. There is a binary document (.doc) you can open in Word, and you'll see some text and a picture. There is then an equivalent .xml file that was saved from Word 2003 in the XML format. The third file is a .docx file that I saved using the latest build of Word 12. That's the file you guys will find the most interesting. Open the file using any ZIP tool, and you can start to explore. Let me give you a basic description of what you are seeing:
If you are using the shell's ZIP support (just rename the file to have a .zip extension), you'll see that at the root level of the package there is an XML file called [Content_Types].xml, and three folders: "_rels", "docProps", and "word".
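If you'd rather look at the package from code than from the shell, here is a minimal Python sketch that lists what sits at the root of a package. It builds a tiny stand-in package in memory so that it runs on its own; the part names inside the folders (such as docProps/core.xml) are hypothetical placeholders, not a statement of what the real file contains.

```python
import io
import zipfile


def top_level_entries(package):
    """Return the sorted names of the files and folders at the package root.

    `package` is a path or a binary file-like object.
    """
    with zipfile.ZipFile(package) as archive:
        return sorted({name.split("/", 1)[0] for name in archive.namelist()})


# Stand-in package built in memory; the nested part names are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for name in (
        "[Content_Types].xml",
        "_rels/.rels",
        "docProps/core.xml",
        "word/wordDocument.xml",
    ):
        archive.writestr(name, "<placeholder/>")

buffer.seek(0)
print(top_level_entries(buffer))
```

Python's zipfile module doesn't look at the file extension, so you can pass it the path to the .docx directly without renaming anything.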
If you haven't read through the first part of the Metro Spec, I would recommend it. Office uses the same ZIP conventions that the Metro folks do, as I described in an earlier post. We worked together on designing a logical model for documents, and then mapped that into ZIP. Since ZIP doesn't have a content type property on each part, we instead use the [Content_Types].xml part to describe the content types that appear in the package. By reading this part (which always has the same URI, "/[Content_Types].xml") you can quickly see what type of content the file consists of. There is a default mapping for extensions, as well as overrides for specific URIs.
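To make the "defaults plus overrides" idea concrete, here is a rough Python sketch of the lookup. It assumes the element and attribute names used by the Open Packaging Conventions as later published (Default with Extension, Override with PartName, both carrying ContentType); this early build may differ, so the code matches on local names and ignores the namespace. The sample XML, its namespace, and the override content type are made up for illustration.

```python
import xml.etree.ElementTree as ET


def content_type_for(content_types_xml, part_uri):
    """Look up a part's content type: overrides first, then the extension default."""
    defaults, overrides = {}, {}
    for element in ET.fromstring(content_types_xml):
        kind = element.tag.rsplit("}", 1)[-1]  # drop any "{namespace}" prefix
        if kind == "Default":
            defaults[element.get("Extension", "").lower()] = element.get("ContentType")
        elif kind == "Override":
            overrides[element.get("PartName")] = element.get("ContentType")
    if part_uri in overrides:
        return overrides[part_uri]
    extension = part_uri.rsplit(".", 1)[-1].lower() if "." in part_uri else ""
    return defaults.get(extension)


# Hypothetical sample; the namespace and the override content type are made up.
SAMPLE = """<Types xmlns="urn:example:content-types">
  <Default Extension="jpg" ContentType="image/jpeg"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/wordDocument.xml"
            ContentType="application/x-example-main-document+xml"/>
</Types>"""

print(content_type_for(SAMPLE, "/word/media/image0.jpg"))
print(content_type_for(SAMPLE, "/word/wordDocument.xml"))
```

Checking overrides before defaults is what lets one specific URI carry a more precise type than the generic one its extension would give it.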
The _rels folders are where you go to find the relationships for any given part. To find the relationships for a part, you just look for the _rels folder that is a sibling of your part. If the part has relationships, the _rels folder will contain a file that has your original part name with ".rels" appended to it. For example, if the content types part had any relationships, there would be a file called "[Content_Types].xml.rels" inside the _rels folder.
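The naming rule is mechanical enough to write down in a few lines. This is just a sketch of the convention as described above, and the function name is my own invention:

```python
import posixpath


def rels_path_for(part_uri):
    """Return the URI of the relationships file for a part.

    The file lives in a "_rels" folder next to the part and is named
    after the part with ".rels" appended.
    """
    folder, name = posixpath.split(part_uri)
    return posixpath.join(folder, "_rels", name + ".rels")


print(rels_path_for("/word/wordDocument.xml"))
print(rels_path_for("/[Content_Types].xml"))
```

Passing "/" (the package itself, which has an empty part name) yields "/_rels/.rels", the same shape as the root-level relationships part.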
The root level _rels folder always contains a part called ".rels". This URI ("/_rels/.rels") and "/[Content_Types].xml" are the only two reserved URIs for parts in files that adhere to our conventions. This is where the "package relationships" are located. Whenever you open a file using these conventions, you always start by going to the _rels/.rels file. All relationship files are represented with XML. If you open it in a text editor you'll see a bunch of XML that outlines each relationship for that part. In this example document, the top level parts are two metadata parts, and the wordDocument.xml part. That's what we'll look at next.
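Here is a small Python sketch of that "always start at _rels/.rels" step. It assumes each relationship is an element named Relationship with Id, Type, and Target attributes, as in the Open Packaging Conventions as later published; the early build may use different names. The relationship types and the docProps part names in the stand-in package are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def package_relationships(package):
    """Return (id, type, target) for each relationship in /_rels/.rels."""
    with zipfile.ZipFile(package) as archive:
        root = ET.fromstring(archive.read("_rels/.rels"))
    return [
        (rel.get("Id"), rel.get("Type"), rel.get("Target"))
        for rel in root
        if rel.tag.rsplit("}", 1)[-1] == "Relationship"
    ]


# Stand-in package; the types and the docProps part names are hypothetical.
ROOT_RELS = """<Relationships xmlns="urn:example:relationships">
  <Relationship Id="rId1" Type="urn:example:main-document" Target="word/wordDocument.xml"/>
  <Relationship Id="rId2" Type="urn:example:core-metadata" Target="docProps/core.xml"/>
  <Relationship Id="rId3" Type="urn:example:app-metadata" Target="docProps/app.xml"/>
</Relationships>"""

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("_rels/.rels", ROOT_RELS)

buffer.seek(0)
for rel_id, rel_type, target in package_relationships(buffer):
    print(rel_id, rel_type, target)
```

Note that the entry name inside the ZIP has no leading slash, even though the part URI is written "/_rels/.rels".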
The wordDocument.xml part is the main part for any Word document. If you crack it open in an XML editor (I just use IE to view it), you'll see a pretty basic XML file. This is where you'll start to see the differences between the new format and the Word 2003 XML format. A bunch of the stuff that was at the beginning of the document in 2003 is now broken out into separate parts. The body of the document is what's contained in this part. As you look around in this part, there are a couple of things I want to call out.
Notice that the picture isn't embedded in the XML like it was in Word 2003. You'll see there is some markup describing how the picture is laid out, but the picture data itself isn't there. Instead, there is the following tag:
<v:imagedata w:rel="rId5" o:title="bulls" />
This is the reference to the image file. In the new format, all references are done via relationships. The wordDocument.xml part has a relationship to the image part. In order to find the image, we just need to go to the relationships file for wordDocument.xml and find the relationship id "rId5". Looking back at the ZIP package, notice that there is a _rels folder in the same directory as the wordDocument.xml part. Open that folder and you'll see a file called wordDocument.xml.rels. If you open this up in a text editor you'll see that "rId5" is a relationship of type "http://schemas.microsoft.com/office/2006/relationships/image", and it points to the file image0.jpg in the media folder.
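The lookup from "rId5" to the actual image part can be sketched in Python roughly like this. It assumes the relationship elements carry Id, Target, and (for external links) TargetMode attributes as in the Open Packaging Conventions as later published, and that relative targets are resolved against the folder of the source part. The sample XML, its namespace, the hyperlink type, and the example.com URL are placeholders, and I'm assuming the media folder sits under "word".

```python
import posixpath
import xml.etree.ElementTree as ET


def resolve_relationship(rels_xml, source_part_uri, relationship_id):
    """Return the part URI (or external URL) that a relationship id points to."""
    for rel in ET.fromstring(rels_xml):
        if rel.get("Id") != relationship_id:
            continue
        target = rel.get("Target")
        if rel.get("TargetMode") == "External":
            return target  # external targets are used as-is
        # Internal targets are resolved relative to the source part's folder.
        base = posixpath.dirname(source_part_uri)
        return posixpath.normpath(posixpath.join(base, target))
    return None


# Hypothetical relationships file for /word/wordDocument.xml.
SAMPLE_RELS = """<Relationships xmlns="urn:example:relationships">
  <Relationship Id="rId4" Type="urn:example:hyperlink"
                Target="http://example.com/blog" TargetMode="External"/>
  <Relationship Id="rId5"
                Type="http://schemas.microsoft.com/office/2006/relationships/image"
                Target="media/image0.jpg"/>
</Relationships>"""

print(resolve_relationship(SAMPLE_RELS, "/word/wordDocument.xml", "rId5"))
```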
I'll talk more about relationships in future posts, but I hope the basic usefulness is clear. The relationships files allow you to quickly navigate through the package without having to open up each part. If I wanted to find all images that are referenced in the wordDocument, I don't even need to open the wordDocument.xml part. I just open the relationships file and look for all relationships that are of type "http://schemas.microsoft.com/office/2006/relationships/image". If I want to change this to point at a different image, I just edit the relationship, and don't need to modify the application level XML. This is especially useful for external relationships, as described next.
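As a rough illustration of finding every image without touching the main part, here is a Python sketch that filters a relationships file by type. The Type and Target attribute names are assumed from the Open Packaging Conventions as later published, and the sample XML (including its namespace, the hyperlink type, and the second image) is invented for the example.

```python
import xml.etree.ElementTree as ET

IMAGE_TYPE = "http://schemas.microsoft.com/office/2006/relationships/image"


def targets_of_type(rels_xml, relationship_type):
    """Return the targets of all relationships with the given type."""
    return [
        rel.get("Target")
        for rel in ET.fromstring(rels_xml)
        if rel.get("Type") == relationship_type
    ]


# Hypothetical contents of wordDocument.xml.rels.
SAMPLE_RELS = f"""<Relationships xmlns="urn:example:relationships">
  <Relationship Id="rId4" Type="urn:example:hyperlink"
                Target="http://example.com/blog" TargetMode="External"/>
  <Relationship Id="rId5" Type="{IMAGE_TYPE}" Target="media/image0.jpg"/>
  <Relationship Id="rId6" Type="{IMAGE_TYPE}" Target="media/image1.jpg"/>
</Relationships>"""

print(targets_of_type(SAMPLE_RELS, IMAGE_TYPE))
```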
Back in the wordDocument.xml, notice the inline markup for the hyperlink. The tag is just <w:hyperlink w:rel="rId4" w:history="1">. It doesn't actually have the URL inline. Just as references to other parts in the ZIP use relationships, so do external references. If you go back to the relationships file for wordDocument.xml, you'll see that rId4 is a relationship of type hyperlink, and it points to my blog. This is true not just for hyperlinks, but for any external reference: linked images, templates, etc. This makes it much easier to do link fix-up if you're moving files from one server to another. Or if you want to remove all external references for security reasons, you just edit the relationships.
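A link fix-up pass could look something like the following Python sketch, which rewrites only the relationships file and never touches the document XML. It assumes external relationships are marked with TargetMode="External", as in the Open Packaging Conventions as later published; the server names, the hyperlink type, and the sample XML are all hypothetical.

```python
import xml.etree.ElementTree as ET


def retarget_external(rels_xml, old_prefix, new_prefix):
    """Rewrite external targets that start with old_prefix; return the new XML."""
    root = ET.fromstring(rels_xml)
    for rel in root:
        target = rel.get("Target", "")
        if rel.get("TargetMode") == "External" and target.startswith(old_prefix):
            rel.set("Target", new_prefix + target[len(old_prefix):])
    return ET.tostring(root, encoding="unicode")


# Hypothetical relationships file with one link on the server being retired.
SAMPLE_RELS = """<Relationships xmlns="urn:example:relationships">
  <Relationship Id="rId4" Type="urn:example:hyperlink"
                Target="http://old.example/docs/spec.htm" TargetMode="External"/>
  <Relationship Id="rId5" Type="urn:example:image" Target="media/image0.jpg"/>
  <Relationship Id="rId6" Type="urn:example:hyperlink"
                Target="http://other.example/page" TargetMode="External"/>
</Relationships>"""

print(retarget_external(SAMPLE_RELS, "http://old.example/", "http://new.example/"))
```

ElementTree may rewrite the namespace prefix when it serializes the result, which is still equivalent XML. Because the relationship ids stay the same, markup such as w:rel="rId4" in the document keeps working after the targets change.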
There are a bunch of other things I want to talk about with this file, but the post is already getting too long. The main thing I wanted to get across here was how the different pieces of the files are laid out, and how you go about navigating them. Please play around with the file a bit. Let me know what areas of the formats you'd like me to describe in greater detail.