Example Office 12 XML File


I wanted to get an example document posted so people get a chance to look through the new Office 12 XML formats and see what the similarities and differences are with the Word 2003 XML format. I took a basic document and saved it out in the new format, as well as in Word 2003’s XML format. This is still very early code, so a number of the structures could still change, but I’m pretty confident this is close to what the final version will look like. Also, the majority of the file size is taken up by an embedded picture, so you won’t see a significant file size saving with the new format compared to the current binary formats.


You will see right away that it’s just pure XML representing the file. I read a post on a blog today where the author mistakenly thought these new formats weren’t XML, but instead just XML-based. I guess if that’s referring to the fact that we use ZIP as a container it would be true, but other than ZIP, everything else is pure XML following the W3C XML 1.0 standard. I still remember when we decided to go with ZIP as the container… it was a pretty straightforward decision. There were already a number of other formats out there using XML and ZIP, so we figured that would be the best way to go if we wanted people to have an easier time working with our files. Using a single flat XML file wasn’t really ever given serious consideration just because of the file size bloat. This was especially true for PowerPoint, where presentations often contain tons of pictures, and having to encode those to store in a single XML file just didn’t make a lot of sense.


So anyone want to see an example of the format? If you download the following zip file: http://jonesxml.com/resources/BasicDocument.zip you will see 3 embedded documents that have identical content, but in different formats. There is a binary document (.doc) you can open in Word, and you’ll see some text and a picture. There is then an equivalent .xml file that was saved in Word 2003 with the XML format. The third file is a .docx file that I saved using the latest build of Word 12. That’s the file you guys will find the most interesting. Open the file using any ZIP tool, and you can start to explore. Let me give you a basic description of what you are seeing:


Root Folder



If you are using the shell’s ZIP support (just rename the file to have a .zip extension), you’ll see that at the root level of the package there is an xml file called [Content_Types].xml, and three folders: “_rels”, “docProps”, and “word”.
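If you would rather look at the package from code than from the shell, here is a minimal Python sketch that does the same thing using only the standard library. The function names and the idea of marking folders with a trailing slash are illustrative assumptions, not part of the format.

```python
import zipfile


def top_level_entries(part_names):
    """Return the sorted top-level files and folders for a list of ZIP entry names."""
    entries = set()
    for name in part_names:
        first, slash, _rest = name.partition("/")
        # A trailing "/" marks a folder so it is easy to tell apart from a file.
        entries.add(first + "/" if slash else first)
    return sorted(entries)


def list_package(path):
    """Open a package as an ordinary ZIP archive and list its root level."""
    with zipfile.ZipFile(path) as package:
        return top_level_entries(package.namelist())
```

Calling list_package on the .docx from the download (whatever path you saved it to) should show the one XML file and the three folders described above.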


[Content_Types].xml



If you haven’t read through the first part of the Metro Spec, I would recommend it. Office uses the same ZIP conventions that the metro folks do, as I described in this earlier post. We worked together on designing a logical model for documents, and then mapped that into ZIP. Since ZIP doesn’t have a content type property on each part, we instead use this XML part to describe the content types that appear in the package. By reading this part (which always has the same URI “/[Content_Types].xml”) you can quickly see what type of content the file consists of. There is a default mapping for extensions, as well as overrides for specific URIs.
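To make the lookup concrete, here is a minimal Python sketch of resolving the content type for a part: an override for the exact URI is checked first, and the default for the file extension is the fallback. The element and attribute names (Default, Override, Extension, PartName, ContentType) and the "override wins" order are assumptions based on the later published packaging conventions, so verify them against the actual part in this early build.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix so the sketch does not depend on a namespace URI."""
    return tag.rsplit("}", 1)[-1]


def read_content_types(xml_bytes):
    """Return (defaults by extension, overrides by part URI) from the content types part."""
    defaults, overrides = {}, {}
    for child in ET.fromstring(xml_bytes):
        kind = local_name(child.tag)
        if kind == "Default":
            defaults[(child.get("Extension") or "").lower()] = child.get("ContentType")
        elif kind == "Override":
            overrides[child.get("PartName")] = child.get("ContentType")
    return defaults, overrides


def content_type_of(part_uri, defaults, overrides):
    """Look up one part: exact-URI override first, then the extension default."""
    if part_uri in overrides:
        return overrides[part_uri]
    file_name = part_uri.rsplit("/", 1)[-1]
    extension = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
    return defaults.get(extension)
```

The XML to feed in is whatever you read from the "[Content_Types].xml" entry of the ZIP.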


_rels Folder



The _rels folders are where you go to find the relationships for any given part. To find the relationships for a part, you just look for the _rels folder that is a sibling of your part. If the part has relationships, the _rels folder will contain a file that has your original part name with “.rels” appended to it. For example, if the content types part had any relationships, there would be a file called “[Content_Types].xml.rels” inside the _rels folder.
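The naming rule is mechanical enough to write down as a small function. This is only a sketch of the convention as described here; the function name is made up for illustration.

```python
import posixpath


def rels_path_for(part_name):
    """Return where the relationships file for a part would live.

    The file sits in a "_rels" folder next to the part and is named after
    the part with ".rels" appended.
    """
    folder, file_name = posixpath.split(part_name)
    return posixpath.join(folder, "_rels", file_name + ".rels")
```

For example, rels_path_for("/word/wordDocument.xml") gives "/word/_rels/wordDocument.xml.rels", and passing "/" for the package itself gives "/_rels/.rels".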


_rels/.rels



The root level _rels folder always contains a part called “.rels”. This URI (“/_rels/.rels”) and “/[Content_Types].xml” are the only two reserved URIs for parts in files that adhere to our conventions. This is where the “package relationships” are located. Whenever you open a file using these conventions, you always start by going to the _rels/.rels file. All relationship files are represented with XML. If you open it in a text editor you’ll see a bunch of XML that outlines each relationship for that part. In this example document, the top level parts are two metadata parts, and the wordDocument.xml part. That’s what we’ll look at next.
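As a rough Python sketch of that "always start at _rels/.rels" step: read the package relationships and list what each one points at. The Relationship element and its Id, Type and Target attributes are assumptions based on the later published packaging conventions; open the file in a text editor to confirm what this build actually writes.

```python
import xml.etree.ElementTree as ET
import zipfile


def read_relationships(rels_xml):
    """Parse a relationships part into a list of (id, type, target) tuples."""
    found = []
    for element in ET.fromstring(rels_xml).iter():
        # Match on the local name only, so no namespace URI has to be assumed.
        if element.tag.rsplit("}", 1)[-1] == "Relationship":
            found.append((element.get("Id"), element.get("Type"), element.get("Target")))
    return found


def package_relationships(path):
    """Open the package and return its top-level (package) relationships."""
    with zipfile.ZipFile(path) as package:
        # ZIP entry names have no leading slash, so "/_rels/.rels" is "_rels/.rels".
        return read_relationships(package.read("_rels/.rels"))
```

Each tuple returned is one top-level part to visit next, with the type telling you what kind of part it is.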


word/wordDocument.xml



This is the main part for any Word document. If you crack it open in an XML editor (I just use IE to view it), you’ll see a pretty basic XML file. This is where you’ll start to see the differences between the new format, and the Word 2003 XML format. A bunch of the stuff that was at the beginning of the document in 2003 is now broken out into separate parts. The body of the document is what’s contained in this part. As you look around in this part, there are a couple of things I want to call out.


Embedded picture



Notice that the picture isn’t embedded in the XML like it was in Word 2003. You’ll see there is some markup describing how the picture is laid out, but the picture data itself isn’t there. Instead, there is the following tag:


<v:imagedata w:rel="rId5" o:title="bulls" />


This is the reference to the image file. In the new format, all references are done via relationships. The wordDocument.xml part has a relationship to the image part. In order to find the image, we just need to go to the relationships file for wordDocument.xml and find the relationship id “rId5”. Looking back at the ZIP package, notice that there is a _rels folder in the same directory as the wordDocument.xml part. Open that folder and you’ll see a file called wordDocument.xml.rels. If you open this up in a text editor you’ll see that “rId5” is a relationship of type “http://schemas.microsoft.com/office/2006/relationships/image”, and it points to the file image0.jpg in the media folder.
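The lookup just described (take the id from the markup, find it in the part's relationships file, follow the target) can be sketched in a few lines of Python. The Id and Target attribute names, and the assumption that a target is relative to the folder of the part that owns the relationship, come from the later published conventions rather than from this post.

```python
import posixpath
import xml.etree.ElementTree as ET


def target_for_id(rels_xml, relationship_id):
    """Return the Target of the relationship with the given Id, or None if absent."""
    for element in ET.fromstring(rels_xml).iter():
        is_relationship = element.tag.rsplit("}", 1)[-1] == "Relationship"
        if is_relationship and element.get("Id") == relationship_id:
            return element.get("Target")
    return None


def resolve_part(source_part, target):
    """Resolve a relative Target against the folder of the part that references it."""
    folder = posixpath.dirname(source_part)
    return posixpath.normpath(posixpath.join(folder, target))
```

With the relationships described above, looking up "rId5" and resolving it against "/word/wordDocument.xml" should lead to the image in the media folder.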


I’ll talk more about relationships in future posts, but I hope the basic usefulness is clear. The relationships files allow you to quickly navigate through the package without having to open up each part. If I wanted to find all images that are referenced in the wordDocument, I don’t even need to open the wordDocument.xml part. I just open the relationships file and look for all relationships that are of type “http://schemas.microsoft.com/office/2006/relationships/image”. If I want to change this to point at a different image, I just edit the relationship, and don’t need to modify the application level XML. This is especially useful for external relationships, as described next.
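Finding every image without touching the document part might look like the following Python sketch. The type URI is the one quoted above; the Type and Target attribute names are assumed, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Relationship type for images, as quoted in the post for this early build.
IMAGE_TYPE = "http://schemas.microsoft.com/office/2006/relationships/image"


def targets_of_type(rels_xml, relationship_type):
    """Return the targets of all relationships of one type, in file order."""
    return [
        element.get("Target")
        for element in ET.fromstring(rels_xml).iter()
        if element.tag.rsplit("}", 1)[-1] == "Relationship"
        and element.get("Type") == relationship_type
    ]
```

Passing the contents of wordDocument.xml.rels and IMAGE_TYPE returns just the image targets; wordDocument.xml itself is never opened.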


Hyperlink



Back in the wordDocument.xml, notice the inline markup for the hyperlink. The tag is just <w:hyperlink w:rel="rId4" w:history="1">. It doesn’t actually have the URL inline. Just as references to other parts in the ZIP use relationships, so do external references. If you go back to the relationships file for wordDocument.xml, you’ll see that rId4 is a relationship of type hyperlink, and it points to my blog. This is true not just for hyperlinks, but for any external reference: linked images, templates, etc. This makes it much easier to do link fix-up if you’re moving files from one server to another. Or if you want to remove all external references for security reasons, you just edit the relationships.
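A link fix-up along those lines could be sketched in Python as below. One loud assumption: external relationships are recognized by a TargetMode="External" attribute, which is how the later published conventions mark them and may not be what this early build writes. The function name and the prefix-swap approach are illustrative only.

```python
import xml.etree.ElementTree as ET


def rewrite_external_targets(rels_xml, old_prefix, new_prefix):
    """Point external relationships that start with old_prefix at new_prefix instead.

    Returns the updated XML as bytes plus the ids that were changed.
    """
    root = ET.fromstring(rels_xml)
    changed = []
    for element in root.iter():
        if element.tag.rsplit("}", 1)[-1] != "Relationship":
            continue
        target = element.get("Target", "")
        # Assumption: external references carry TargetMode="External".
        if element.get("TargetMode") == "External" and target.startswith(old_prefix):
            element.set("Target", new_prefix + target[len(old_prefix):])
            changed.append(element.get("Id"))
    # ElementTree may rename namespace prefixes when serializing; the XML stays equivalent.
    return ET.tostring(root), changed
```

The application-level XML is untouched: the w:rel="rId4" in wordDocument.xml still refers to the same relationship id, which now points somewhere else.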


There are a bunch of other things I want to talk about with this file, but the post is already getting too long. The main thing I wanted to get across here was how the different pieces of the files are laid out, and how you go about navigating them. Please play around with the file a bit. Let me know what areas of the formats you’d like me to describe in greater detail.


-Brian

Comments (39)

  1. orcmid says:

    Nifty! Nice choice of examples about keeping relationships outside of the content parts.

  2. Mario Goebbels says:

    Just a nitpick, why are the attributes prefixed with the same namespace as the parent element? If you prefix the parent element, the attribute assumes that same prefix by default. Example line of what I mean:

    <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800"

    w:header="720" w:footer="720" w:gutter="0" />

    Just pointing it out, removing the redundant prefixes saves diskspace 🙂

  3. lexp says:

    Will Word2005 file format use more clear element and attribute names as opposed to Word2003 file format (e.g. "Paragraph" instead of "p")?

  4. Keith says:

    Thanks for all the early examples.

    As someone who’s doing a lot of work on WML transformations to other formats I’m particularly interested in the ways in which those transformations will have to be updated for the new schema. Therefore, a snapshot of any beta schemas would be greatly appreciated.

    Thanks for adding line breaks and pretty formatting. You’ve made the XML human-readable in vim at least.

    On relationships: It seems to me that breaking the image data out of the XML is a great idea–kudos for that. However, I’m very concerned about the naming conventions for the relationships. Is everything just going to use the w:rel="rId[###]" format? What will determine the order of the Id numbering in the document–is it just sequential? I’d really be much happier with a more human-readable/meaningful w:rel name, or at least some way to distinguish between relationship types from simply the w:rel name itself. Imagine (just like I am), a document with 100 hyperlinks and 100 images and a _rels/wordDocument.xml.rels file that hasn’t put the rIDs in order (like the one you gave, which goes 3 2 1 5 4). w:rel="rIdImg20" looks much more appealing than w:rel="rId59". I dunno if that’s achievable or desirable (I understand some of the reasons why it wouldn’t be). Maybe others can comment.

    If it’s easy for you to create these files, I’d love to see a file with some character styles, numbered lists (maybe some nesting?), and a simple table.

    Again, thanks for all the info,

    Keith

  5. Tom says:

    TO MARIO:

    I thought that attributes are not a part of any namespace by default. You have to prefix them explicitly.

  6. Bruce Rindahl says:

    Thanks for posting the examples! I will be waiting for a similar post on your Excel blog

  7. BrianJones says:

    Mario – You’re right that in most cases people don’t qualify their attributes, and just assume the namespace of the parent element applies to the attribute. In Word 2003 schemas we qualified the attributes, and we currently are doing it for the 12 schemas. We may decide to drop that though and just go with unqualified attributes as you suggest (not really a big deal either way).

    LexP – Currently the naming conventions are very similar to what they were in 2003. I was thinking about changing this because now that we use ZIP compression, there shouldn’t be a big impact on file size. There is a big impact though on performance, as longer names require more parsing. Because of that, for elements that occur often in the files, we try to use very short names. For elements that only occur a couple times though and are more rare, we will often use more verbose names.

    I had been thinking about providing a tool that could make this easier, but I wasn’t sure how useful it would be. Maybe you can give me your opinion… I was going to have someone build a simple XSLT that converted every element from the short tag names to longer more verbose names. I would create a group of “debug” namespaces that matched the Office namespaces and allow people to transform between the two. These “debug” namespaces wouldn’t be supported by the applications, instead they would just be for putting the document into a temporary state that’s more readable. Does that sound useful? Or is it already getting too inconvenient?

    Keith – I think we’ll be able to provide the first draft of the schemas a bit before beta 1. I’m currently thinking it will be around the time of PDC which is the 2nd week in September, but I’m not positive.

    It’s funny that you bring up the pretty printing, because we were actually arguing about it yesterday. The current plan is that we will not pretty print our XML parts. The example I posted has pretty printing, but it wouldn’t in the final version. The reason for this is that there is actually a significant performance hit when you have to take the additional time to pretty print the files. Since these formats are going to be the default, we need to make sure they have fast open and save times. By not pretty printing though, it makes them harder to work with if you are editing the XML directly by hand. Many XML editors currently have pretty printing functionality though (FrontPage & VS for example), so I’m hoping it won’t be that big of a deal. What do you think?

    Your relationship ID point is interesting. The IDs just need to be unique. If you were creating a file from scratch, you could call them anything you want. There is a type attribute on each relationship as well though, which allows you to understand more about how it’s used. While there is nothing to help you know what order they are used in, you will know when a relationship is pointing at an image vs. a stylesheet (just as an example). I’d like to hear from more folks on their first impressions of relationships.

    Bruce – I’ll try to get some Excel stuff pulled together soon. There are more significant differences from the SpreadsheetML to the new Excel format than what you see with Word, so I’ll need to start with something simple and make sure I explain everything properly. Excel has done a lot of work to make the new format a faster more efficient format, which means far less information on each cell, and instead separate collections of properties that reference the grid. As an example, with a named range, instead of having the name on each cell, you would instead have a separate named range element that references the range in the grid that it applies to. I’ll get an example together to make this more clear.

    Everyone – Keep the comments coming. We announced this early so we could get more feedback from folks on the formats. Let me know what you liked or disliked about the 2003 schemas. What would you like to see changed? What do you think about this first look at the new formats?

    -Brian

  8. Keith says:

    Brian said:

    <blockquote>

    It’s funny that you bring up the pretty printing, because we were actually arguing about it yesterday. The current plan is that we will not pretty print our XML parts. […] The reason for this is that there is actually a significant performance hit when you have to take the additional time to pretty print the files.

    </blockquote>

    Your speed argument seems to trump human-readability quite completely. I’m fine with that. Developers should be able to get pretty printing on their end pretty easily.

    <blockquote>

    Your relationship ID point is interesting. The IDs just need to be unique. If you were creating a file from scratch, you could call them anything you want.

    </blockquote>

    Would hand naming be preserved after opening and saving using Word?

    <blockquote>

    There is a type attribute on each relationship as well though, which allows you to understand more about how it’s used.

    </blockquote>

    Sure, it seems pretty reasonable at this point. I’m just quivering because I work too much with XML & other stuff that auto-generates number strings for all XRefs (cross-references). For example, I can ask Word 2003 to "Toggle Field Codes" on a simple XRef and I get the not very helpful "REF _Ref107126818 h". Another program gives me nice DocBook XML strings like <pre><indexterm id="IXT-33-296798"><primary>algorithms</primary></indexterm></pre>

    Because cross references tend to break for us, we’ve gone to some lengths to get meaningful names for IDs so that we can fix them later. Instead of something like the above, here’s what we’ve gotten Word 2003 to say: "REF XREF70988_Figure_111 h", which is a cross reference to the paragraph with the text "Figure 1-11" (implemented with auto-numbering).

    Any human-readable text that finds its way into IDs would seem like a big improvement from my perspective.

    <blockquote>

    I’d like to hear from more folks on their first impressions of relationships.

    </blockquote>

    Ditto, I may be in the minority.

    Thanks,

    Keith

  9. lexp says:

    Frankly I don’t understand such an economy on element names. Excel2003 already has readable and verbose XML format with Pascal casing, while word2003 has camel old-style "economy" naming convention.

    Please pay attention to unification of:

    1) verboseness of XML format (p vs Paragraph)

    2) naming convention (camel, Pascal)

    3) attribute vs. element usage in sample scenarios (<CoreProperties Title="" .. /> vs. <CoreProperties><Title>… )

    As to me, I prefer XAML-style naming convention with clear element and attribute names.

  10. Ryan says:

    I may be out of place here as an IT manager rather than a developer, but I’m wondering if these format changes will result in better, more reliable handling of things like long tables in Word, cell formatting in Excel, etc. For example, we’ve used Excel to catalog/sort/analyze an email exchange in a corporate litigation scenario, using a single cell for the message body (we would use Access, but it doesn’t support rich-text). Word tables choke on this type of task altogether. But Excel has some limitations on cell size, etc. And even though a single cell can hold quite a bit of information, Excel seems only to recognize the first 1024 characters when it comes to things like autofit and other formatting. There probably is a better program out there for this kind of task, but folks often turn to the programs they know, and it requires a lot of clean-up afterwards when they find out Word/Excel has unforeseen limitations. The Word table issue–the program seems to slow down dramatically or crash on large tables–I know is one we’ve experienced many times.

    Any thoughts? Will the new formats help with these problems?

  11. Ignace says:

    For relID readable names I think an option to change the Id name inside Word (properties or so) should be enough.

    Pretty printing should be an option, a checkbox, in the ‘save as’ dialog.

    Will I notice the speed difference in saving and loading binary format compared to new format, especially with large files?

  12. Paul says:

    Brian,

    thanks for the early insights. Nice picture as well.

    Nevertheless, with reference to xml, people will have to cry about every little thing:

    For once, it bugs me when it comes to the vector markup schema in Office 12. MS has always been very reserved in promoting vml. There’s little known documentation available, it got almost completely erased from the msdn dvds and the schema wouldn’t ship with the Office2003 schemas at first?!?

    I realize that MS is tempted to avoid svg vs. vml tumults. The office team has always been doing vml over svg and preferred not to talk about it. Since you started talking, I’d expect you to come down on one or other side of the fence. Please explain your decision and stand in for your solution…

    The pity of it is that they don’t have Longhorns in San Miguel de Allende;-))

  13. walter b says:

    I’m really glad you guys have opened up the format. And now that the format is open, i’d like to add other relevant tags to a word doc. The question is if i added my own tags from my namespace will these get stripped out when i resave the document from within Word?

  14. orcmid says:

    Walter b: Are you talking about a "foreign" XML element comingled among/under/within the MS Office Word 12 ones? That’s an interesting topic. I notice that OASIS OpenDocument specifies default behavior for those and I wonder if that is what you have in mind.

    I *think* the OASIS OpenDocument foreign elements/attributes are handled a little like the HTML rules where foreign attributes are ignored and any foreign-element content leaks into the surrounding understood element content. I don’t know if there’s a way to control preservation as the result of editing of the containing understood element. (These things need to be thought through before a default behavior is invented, or else invent a default behavior that is easy to change in an upward compatible way later [;<).

    Part of the problem is not knowing whether the editing has an impact on what the function of the foreign element is. You can even have this problem with something simple like adding a document property to the property sheet. The Office 12 scheme of things could have a way (via attribute) for the keep/discard/fail behavior required when the containing element is foreign to the processing application, but I don’t know if that would solve your problem.

    [Wishing there was a newsgroup as a better place for discussions like this … ]

  15. orcmid says:

    I should point out that the foreign element/attribute problem applies more widely in any scheme where an XML Document can be extended by "foreign" elements/attributes pretty much anywhere. WebDAV is kinda/sorta snarled up in this too. Someone else can say what Sharepoint does about it.

  16. Hi Brian,

    It looks like the relationships tags are going to be the key to user-navigation of these files, so yes, I think more descriptive tags would be better, and/or some sort of organisation of them in the relationships file. I think it would also be immensely helpful if MS could knock together a tool that would make it easy for us to navigate around these files, editing the XML as we go. E.g. a simple text editor, where every relationship link can be clicked to get to the definition, which can be clicked again to get to the target.

    I’d like to also add my vote to a ‘Verbose’ or ‘Debug’ option in the Save As dialog, which would run the xml through your ‘Flesh out the tags’ transform, with a preprocessor to do the reverse when the files are loaded.

    Stephen

  17. y says:

    Brian,

    Please consider adopting just one convention from the OpenDocument standard: that of placing a "mimetype" file first in the zip archive, uncompressed, whose contents are the MIME type of the entire document. This makes it easy to identify documents by examining the first few bytes.

    See Section 17.4 of the OpenDocument v1.0 standard, http://www.oasis-open.org/committees/download.php/12572/OpenDocument-v1.0-os.pdf

  18. Kaleb says:

    Brian,

    : There are more significant differences from

    : the SpreadsheetML to the new Excel format

    : than what you see with Word, so I’ll need to

    : start with something simple and make sure I

    : explain everything properly.

    As one who is familiar with many of the issues you had to face when developing the new Excel format, I can certainly appreciate this.

    However, some of us are already pretty familiar with the earlier Excel formats and the architecture they’re based on, and we’d like to start getting our hands dirty. Would it be possible to post a sophisticated workbook in the new format (and including the BIFF equivalent) for our benefit?

    We here in my office are excited about this new file format and what it means for the customers you and we have in common.

  19. BrianJones says:

    Keith – Hand naming of the relationships most likely would not be preserved. It’s important to understand the target use of relationships. Preservation of IDs on objects is something that really needs to be part of the application runtime, and not just the file format. The parts that we break our files into often aren’t truly separate objects in the applications memory structures. We are asked to provide IDs on different types of objects, like embedded documents, images, tables, paragraphs, etc. Those are things we could decide to persist via the relationship ID rather than other inline markup, but it would really depend on the object itself, and not the more generic concept of parts and relationships.

    Your point about cross references is very valid, and that is something that is more of an applications feature level thing, as opposed to something directly related to parts and relationships in the file format. Let me know what you’d like to see out of the references features.

    Ryan – You definitely shouldn’t feel out of place. I’ve had some fairly technical posts so far, but I also plan to touch on a lot of topics more related to IT folks. I also will have some posts that are more at an “intro to XML” level. These new formats do make us more stable, but not necessarily in the way you describe. The new formats also remove a lot of the limitations we had in previous formats, which allows us to explore some of the constraints we had in the older versions. In Word, we’re also looking at the problems that come with longer documents and large tables, but there isn’t anything specific I can say right now as far as new functionality goes.

    Ignace – Many of the relationships and parts aren’t exposed at runtime. They are only generated when we go persist the file. We don’t have performance numbers yet in comparing the old binary formats with the new XML formats (it’s still a bit too early).

    Paul – VML has been around for a long time. The reason we didn’t have the schemas for VML when we first released the 2003 schemas was that we had to go back and create it. All the code for generating VML had been done long before XSD came about. We had XSDs for all the other schemas because we’d actually built them directly into our build process. Our code would pull the tag names to be used directly from the XSD files at build time. It was a lot easier for us to clean those up and make them available publicly. For VML we had to get someone to generate it from scratch based on the implementations.

    Orcmid – We’ll probably wait until we get closer to the betas to get a newsgroup setup, but it may come sooner. I don’t really have enough time to monitor a newsgroup myself, and it would be best to not distract the rest of the development team until we get closer to ship. Like I said though, maybe it will make sense to set something up sooner.

    Stephen – I’ve been looking around for some resources to put together a tool like you suggest for navigating the files. Not sure if we’ll get it together though or if we’ll have to rely on a third party to do it. It would definitely be very cool.

    Kaleb – I’ll try to get something together soon. Most likely I’ll start with some really simple files though. The reason I posted the Word example file first is that the Word format is the furthest along. I’ll check the latest builds of Excel and see if it’s in a state where an example file would be useful.

  20. Mark Focas says:

    Is there any merit in being able to have multiple documents in the same zip, so that shared style information, shared images etc are not duplicated?

  21. BrianJones says:

    That’s a great question Mark. A number of people naturally wonder if this new format means we could create V2 of the "binder." The binder was essentially a document format that would allow for multiple files to be stored in one file. The scenario was more around having a project that had some Word documents, Excel spreadsheets, and PPT presentations, and you wanted to have them all stored together.

    The ZIP container does lend itself well to that concept, but it’s not something we’re planning on doing this version. It is something we kept in mind while we were architecting the logical model for our documents though, so there isn’t anything that would prevent us from moving in that direction if we decided it was worthwhile.

    -Brian

  22. Marco Antonio Sanchez says:

    It would be better if the values of the w:lang element attributes of the /word/wordDocument.xml part followed the RFC 1766 standard: that is, a lowercase two-letter language code and an uppercase two-letter country-region code.

    This way you can interpret these values as culture names with existing classes and frameworks.

  23. Visual Studio Team System User Education – Process Planning Guide

    David has written a nice guideline…

  24. This post is for those of you interested in learning the basics behind WordprocessingML. That’s the schema…

  26. Format Comparison Between ODF and MS XML

    by Carrera, D’Arcus, Eisenberg

    http://groklaw.net/article.php?story=20051125144611543
