Example Word ’12’ document from Beta 1 with hyperlink and image


I just posted another example document if anyone is interested. http://jonesxml.com/resources/hyperlinkandimage.docx For those of you that got a copy of Beta 1, the file will be compatible with your build, so you can open it and take a look. This is an extremely simple file that has a simple paragraph, another paragraph with a hyperlink, and an image. I posted this to show you guys a few things:


Open Packaging Conventions


As I’ve mentioned before, we use a simple set of conventions for structuring a document within a ZIP. This file has some text, a hyperlink, and a picture, and the Open Packaging Conventions are used to tie that all together.


Go ahead and rename the file to have a “.zip” extension and open it up. You’ll notice there is a file there called [Content_Types].xml. That file describes what the content types of the other parts within the package are. Look at the _rels folder. The file _rels/.rels is the first place you go to start parsing the file. It’s an xml file that describes all the root level relationships, and if you open it you can see that the first part you need to parse in order to read the document is “document.xml”.
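If you would rather do that first step in code than by renaming the file, here is a minimal Python sketch. It is only an illustration: the relationship Id "rId1", the target "word/document.xml", and the assumption that _rels/.rels uses the same relationships namespace as the part-level relationship files are mine, and the demo package is built in memory as a stand-in for the real file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Relationships namespace as it appears in this Beta 1 sample (assumed to
# apply to _rels/.rels as well).
REL_NS = "http://schemas.microsoft.com/package/2005/06/relationships"


def root_relationships(package):
    """Return (Id, Type, Target) for every relationship in _rels/.rels."""
    with zipfile.ZipFile(package) as zf:
        root = ET.fromstring(zf.read("_rels/.rels"))
    return [
        (rel.get("Id"), rel.get("Type"), rel.get("Target"))
        for rel in root.findall(f"{{{REL_NS}}}Relationship")
    ]


def build_demo_package():
    """Build a tiny stand-in package in memory (contents are assumed)."""
    rels = (
        f'<Relationships xmlns="{REL_NS}">'
        '<Relationship Id="rId1" '
        'Type="http://schemas.microsoft.com/office/2006/relationships/officeDocument" '
        'Target="word/document.xml"/>'
        "</Relationships>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("_rels/.rels", rels)
        zf.writestr("word/document.xml", "<doc/>")
    buf.seek(0)
    return buf


demo_rels = root_relationships(build_demo_package())
print(demo_rels)
```

To try it on the real sample, pass the path of the downloaded .docx to root_relationships instead of the in-memory stand-in.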


Use of relationships


Open the “document.xml” part and take a look:


<w:wordDocument xmlns:r="http://schemas.microsoft.com/office/2005/11/relationships" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w="http://schemas.microsoft.com/office/word/2005/10/wordml">
  <w:body>
    <w:p>
      <w:r>
        <w:t>Hello World!</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:hyperlink r:id="rId2">
        <w:r>
          <w:rPr>
            <w:color w:val="0000FF"/>
            <w:u w:val="single" />
          </w:rPr>
          <w:t>Click here for Brian Jones’ blog.</w:t>
        </w:r>
      </w:hyperlink>
    </w:p>
    <w:p>
      <w:r>
        <w:pict>
          <v:shape id="_x0000_i1025" type="#_x0000_t75" style="width:250; height:200">
            <v:imagedata r:id="rId4"/>
          </v:shape>
        </w:pict>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>


The first paragraph isn’t really all that interesting, but the next two definitely are. Look at the attributes called r:id. Those are relationship references. Any reference from one part in the file to another part has to be done via a relationship. The really cool part is that all relationships live out on their own, so you can quickly scan a package and figure out all the parts that make up that document and how they relate without having to actually go into the application xml.


Are you interested to know where those relationships actually point to? Well, the name of this part is “document.xml”, so that means the relationship file is going to be called “_rels/document.xml.rels”. That’s how you find the relationships for any part, just go to the _rels folder that is in the same folder as the part, and find a part with the same name but with “.rels” at the end. This is all described in the Open Packaging Conventions, but it’s pretty straightforward.
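The naming rule is easy to express in code. Here is a small Python sketch of it; the function name rels_part_for is just something I made up for illustration.

```python
import posixpath


def rels_part_for(part_name):
    """Return the name of the relationships part that belongs to part_name."""
    # Part names inside the ZIP use forward slashes, so posixpath fits.
    folder, name = posixpath.split(part_name)
    return posixpath.join(folder, "_rels", name + ".rels")


print(rels_part_for("document.xml"))
print(rels_part_for("word/document.xml"))
```

The same rule also gives you the root relationships part: an empty part name maps to "_rels/.rels".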


Here’s what the relationship file looks like:


<Relationships xmlns="http://schemas.microsoft.com/package/2005/06/relationships">

  <Relationship
    Id="rId2"
    Type="http://schemas.microsoft.com/office/2006/relationships/hyperlink"
    Target="http://blogs.msdn.com/brian_jones"
    TargetMode="External" />

  <Relationship
    Id="rId4"
    Type="http://schemas.microsoft.com/office/2006/relationships/image"
    Target="image1.jpg"/>

</Relationships>


You can see that even external references are done via relationships. This means that if you want to do link fix-up, or even just quickly scan a document to see what it points at, you don’t need to parse all the application XML, but instead just quickly scan the relationship files. You can also obviously modify the relationships just as easily if you wanted to change a server name or something. Every relationship has an Id, Type, and Target. If the relationship points to an external source, then it also has the TargetMode attribute set to “External”.
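To make the link fix-up idea concrete, here is a rough Python sketch that lists the external targets in a relationships file and rewrites a server name. The helper names and the replacement server "http://example.com" are hypothetical; the XML is the relationship file from this sample.

```python
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.microsoft.com/package/2005/06/relationships"

RELS_XML = """<Relationships xmlns="http://schemas.microsoft.com/package/2005/06/relationships">
  <Relationship Id="rId2"
    Type="http://schemas.microsoft.com/office/2006/relationships/hyperlink"
    Target="http://blogs.msdn.com/brian_jones"
    TargetMode="External" />
  <Relationship Id="rId4"
    Type="http://schemas.microsoft.com/office/2006/relationships/image"
    Target="image1.jpg"/>
</Relationships>"""


def external_targets(rels_xml):
    """Map relationship Id -> Target for relationships marked External."""
    root = ET.fromstring(rels_xml)
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.iter(f"{{{REL_NS}}}Relationship")
        if rel.get("TargetMode") == "External"
    }


def retarget_server(rels_xml, old_prefix, new_prefix):
    """Return new relationships XML with matching external targets rewritten."""
    ET.register_namespace("", REL_NS)  # keep the default namespace on output
    root = ET.fromstring(rels_xml)
    for rel in root.iter(f"{{{REL_NS}}}Relationship"):
        target = rel.get("Target", "")
        if rel.get("TargetMode") == "External" and target.startswith(old_prefix):
            rel.set("Target", new_prefix + target[len(old_prefix):])
    return ET.tostring(root, encoding="unicode")


links = external_targets(RELS_XML)
moved = external_targets(
    retarget_server(RELS_XML, "http://blogs.msdn.com", "http://example.com")
)
print(links)
print(moved)
```

Note that neither function ever touches document.xml; the relationship file alone is enough.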


Another thing you may notice from this is that the actual part names don’t really matter. We output all of our files in a fairly nice structure with folders, etc., but we don’t require that structure. The only thing you need to worry about is the relationship structure. You could change that folder called “word” to be “myownfolder”, and as long as the relationships were updated to account for this, everything would continue to work. That means if you want to replace the picture I put in there with another one, you could just drop it into the package, and then update the document.xml.rels file to point at your new picture instead of the old one.
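Here is a hedged Python sketch of that picture swap. Everything about the layout is an assumption made for the demo: the part names under "word/", the new file name "mypicture.jpg", and the idea that the relationship Target can simply be set to whatever string the caller passes in. The old image is left in the package, and any content type entry the new part might need in [Content_Types].xml is not handled.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.microsoft.com/package/2005/06/relationships"


def swap_image(package_bytes, rels_name, rel_id, new_part, new_target, new_data):
    """Return a copy of the package with new_part added and rel_id retargeted."""
    ET.register_namespace("", REL_NS)
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == rels_name:
                # Only the relationship file changes; document.xml is untouched.
                root = ET.fromstring(data)
                for rel in root.iter(f"{{{REL_NS}}}Relationship"):
                    if rel.get("Id") == rel_id:
                        rel.set("Target", new_target)
                data = ET.tostring(root, encoding="utf-8")
            dst.writestr(item.filename, data)
        dst.writestr(new_part, new_data)
    return out.getvalue()


def build_demo():
    """Build a small stand-in package in memory (layout is assumed)."""
    rels = (
        f'<Relationships xmlns="{REL_NS}">'
        '<Relationship Id="rId4" '
        'Type="http://schemas.microsoft.com/office/2006/relationships/image" '
        'Target="image1.jpg"/>'
        "</Relationships>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/_rels/document.xml.rels", rels)
        zf.writestr("word/image1.jpg", b"old image bytes")
    return buf.getvalue()


updated = swap_image(
    build_demo(), "word/_rels/document.xml.rels", "rId4",
    "word/mypicture.jpg", "mypicture.jpg", b"new image bytes",
)
with zipfile.ZipFile(io.BytesIO(updated)) as zf:
    names = zf.namelist()
    new_rels = ET.fromstring(zf.read("word/_rels/document.xml.rels"))
new_target = new_rels.find(f"{{{REL_NS}}}Relationship").get("Target")
print(names, new_target)
```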


Formatting


Now take a look at the second paragraph with the hyperlink. When you open this in Word, the text will have a blue color and underline applied to make it look like a hyperlink. That isn’t because it has a hyperlink applied, but instead because it has that formatting applied to it directly. If you were to create this file directly from Word, it would have used a style instead of direct formatting, but I wanted to show the difference between styles and direct formatting.


In Word, there are a number of different ways you can apply formatting to a document. One way is with styles. There are all kinds of styles: paragraph styles; list styles; table styles; and character styles. If some text has a style applied to it, then the WordprocessingML for it would look something like this:


<w:r>
  <w:rPr>
    <w:rStyle w:val="Hyperlink"/>
  </w:rPr>
  <w:t>Click here for Brian Jones’ blog.</w:t>
</w:r>


If that were the case, then there would also be a styles.xml part in the package that described the Hyperlink style. In that part, there would be a style definition that would look like this:


<w:style w:type="character" w:styleId="Hyperlink">
  <w:name w:val="Hyperlink" />
  <w:rPr>
    <w:color w:val="0000FF"/>
    <w:u w:val="single" />
  </w:rPr>
</w:style>


A character style (like this one) has an ID which it’s referenced with, as well as a name, which is the friendly display name. They are usually both the same, but at times they need to be different (internationalization, etc).
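If you are reading a styled run in code, you follow the style ID from the run to the style definition. Here is a minimal Python sketch of that lookup. The "w:styles" wrapper element and the rule that direct formatting on the run simply wins over the style are assumptions of mine for illustration; the style and run fragments are the ones shown above.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2005/10/wordml"

# The w:styles root element is an assumed wrapper for the style definition.
STYLES_XML = f"""<w:styles xmlns:w="{W}">
  <w:style w:type="character" w:styleId="Hyperlink">
    <w:name w:val="Hyperlink" />
    <w:rPr>
      <w:color w:val="0000FF"/>
      <w:u w:val="single" />
    </w:rPr>
  </w:style>
</w:styles>"""

STYLED_RUN = f"""<w:r xmlns:w="{W}">
  <w:rPr><w:rStyle w:val="Hyperlink"/></w:rPr>
  <w:t>Click here</w:t>
</w:r>"""


def local_name(element):
    """Strip the {namespace} prefix from an element tag."""
    return element.tag.split("}")[1]


def style_run_properties(styles_root, style_id):
    """Return {property: value} from the rPr of the style with this styleId."""
    for style in styles_root.iter(f"{{{W}}}style"):
        if style.get(f"{{{W}}}styleId") == style_id:
            rpr = style.find(f"{{{W}}}rPr")
            if rpr is None:
                return {}
            return {local_name(p): p.get(f"{{{W}}}val") for p in rpr}
    return {}


def effective_run_properties(run, styles_root):
    """Start from the referenced style, then layer direct formatting on top."""
    props = {}
    rpr = run.find(f"{{{W}}}rPr")
    if rpr is None:
        return props
    style_ref = rpr.find(f"{{{W}}}rStyle")
    if style_ref is not None:
        props.update(style_run_properties(styles_root, style_ref.get(f"{{{W}}}val")))
    for prop in rpr:
        if local_name(prop) != "rStyle":
            props[local_name(prop)] = prop.get(f"{{{W}}}val")
    return props


styles = ET.fromstring(STYLES_XML)
resolved = effective_run_properties(ET.fromstring(STYLED_RUN), styles)
print(resolved)
```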


As I already said though, in my example, I used direct formatting rather than styles. That’s really not a call we make in Word, it’s up to the user and the template author. If people use styles, then there won’t be any formatting stored in the document.xml part and instead it will all be in the styles.xml part. If they use direct formatting though, the formatting will of course be stored right on the text in the document.xml part. If you aren’t aware of the difference between direct formatting and styles, it’s pretty straightforward. If you use the style picker to apply a style like “emphasis”, or “heading”, then we store that style name on the text, and the formatting information is stored with the style itself. If you instead press the “B” button to make the text bold, or choose a color to apply, then you haven’t applied a style. Instead, you’ve specified that the text selected should have those specific properties stored.


That’s what I did with this example. I applied formatting properties directly on the text, so instead of the style reference on the run, and then the color and underline values stored on the style, I just took the entire <w:rPr> tag from the style and moved it down to the text run, so it looks like this:


<w:r>
  <w:rPr>
    <w:color w:val="0000FF"/>
    <w:u w:val="single" />
  </w:rPr>
  <w:t>Click here for Brian Jones’ blog.</w:t>
</w:r>


The Word structure is actually really simple. You have a few core objects: p (paragraphs), r (text runs), tbl (tables), as well as other things like sections, table rows, table cells, text boxes, etc. These core objects are represented with their associated XML element, and any formatting or other properties that are applied to that object are stored in the object’s property bag: pPr (paragraph properties), rPr (run properties), and tblPr (table properties). If you want to apply formatting to a run of text, all you do is edit the rPr tag for that run.
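As a quick illustration of editing the property bag, here is a small Python sketch that sets a property on a run, creating the rPr element if the run does not have one yet. The helper name set_run_property is hypothetical, and placing rPr ahead of the text simply mirrors the examples in this post.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2005/10/wordml"


def set_run_property(run, name, value):
    """Set <w:NAME w:val="VALUE"/> in the run's rPr, creating rPr if needed."""
    rpr = run.find(f"{{{W}}}rPr")
    if rpr is None:
        rpr = ET.Element(f"{{{W}}}rPr")
        run.insert(0, rpr)  # keep the property bag ahead of the run's text
    prop = rpr.find(f"{{{W}}}{name}")
    if prop is None:
        prop = ET.SubElement(rpr, f"{{{W}}}{name}")
    prop.set(f"{{{W}}}val", value)
    return run


plain_run = ET.fromstring(f'<w:r xmlns:w="{W}"><w:t>Hello World!</w:t></w:r>')
set_run_property(plain_run, "color", "0000FF")
set_run_property(plain_run, "u", "single")
set_run_property(plain_run, "color", "FF0000")  # replaces the earlier color

ET.register_namespace("w", W)  # serialize with the familiar w: prefix
print(ET.tostring(plain_run, encoding="unicode"))
```

Only the color and underline properties from the sample are used here; any other property name would be passed in the same way.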


We’ll cover much more of this over the coming months, but I want to make it clear that as you look at the WordprocessingML format, understand the core structures. Everything else is just a property of one of those structures.


-Brian

Comments (18)

  1. Todd Knarr says:

    This looks like it confirms my first impression of the MS XML formats: they’re not so much document formats as XML encodings of the internal Word object structures. And with 5 years experience with XML I have to say my first impression is that this structure’s going to be a real pain to manipulate. For example, the fact that the target of the hyperlink in the document doesn’t appear anywhere in the document, you need to read another file to find it. An XSL transform of this example into HTML, for example, looks to be a lot more complicated than it should be.

    It’s a wonderful format for serializing objects for later deserialization by the same program, but it looks far from optimal for arbitrary manipulation as XML.

  2. orcmid says:

    "Another thing you may notice from this is that the actual part names don’t really matter. We output all of our files in a fairly nice structure with folders, etc. but we require that structure."

    I’m hoping you meant to say "… but we DON’T require that structure." [;<).

    I love the business of having relationship specifics defined external to the content files. It reminds me a little of what I thought was great about HyTime. One cool aspect is that the location-sensitivity of the content has been abstracted out and it is now easy to work with material that has been relocated without rewriting the content (especially if digital signatures and other important things need to be preserved).

    Your great example has me wonder what this does to the idea of monikers and reference to fragments within a part.

  3. y says:

    So, suppose I’m just handed this document, and I don’t know a priori that it’s a Word document, and I’m trying to decide what to do with it. First I have to examine the zip table of contents, and notice that there’s a _rels/.rels file, so it’s an open packaging document but not necessarily an Office document. Then I have to uncompress and parse _rels/.rels, and guess that the top-level relationship I’m looking for is the one of type http://schemas.microsoft.com/office/2006/relationships/officeDocument, so I pretty much have to be looking for an Office document to start with (it’s the only one in this case, but presumably there usually will be others, e.g. for metadata).

    Then to figure out what type of Office document it is, I have to uncompress and parse [Content_Types].xml, and notice that the type of document.xml is application/vnd.ms-word.main+xml, so it’s a Word document (this MIME type has changed since the last sample you gave out, but maybe it’ll be stable now). Alternatively I could uncompress and parse document.xml, and notice that the top-level element is w:wordDocument (probably more reliable, but probably also requires handling many more bytes).

    Conclusion: in order to determine anything about one of these documents, it’s necessary to know ahead of time what types you’re looking for, and even then you need to process and parse large chunks of the document just to get started.

  4. omz says:

    >><w:hyperlink r:id="rId2">

    i cant see the point of this.. why not just put the "plain" link, avoiding this indirection ? like HTML or XHTML does ? just wondering, thanks

  5. BrianJones says:

    y, one of our goals was to make this a format you could easily create with existing tools (ZIP and XML). We wanted folks to have the ability to just take some XML files and use something like winzip to zip them up and have a valid document. That means that the method for declaring the doc type needs to be at the level of the files in the ZIP (rather than extending ZIP, or using some workaround like the ZIP comment block).

    So, the first way you can tell what kind of file it is would be to just use the extension (that’s the easiest and most common). If that isn’t an option for some reason, then you can first identify that it’s a ZIP package by looking at the header. Then, like you said, you look at the .rels part (which actually probably won’t be compressed, so it’s really easy to get at). Then you can either open the start part itself and look at the namespace; or you go to the [Content_Types].xml part which by default will not be compressed, and look at that. It’s pretty straightforward. If you have an easier way you’d like to see it done that doesn’t actually break away from the ZIP spec and is easy to do with most ZIP libraries let me know.

    omz – There are always reasons for consolidating things like this. Often it has to do with easy access. Other reasons are to save space (you can define a specific target once but use it multiple times). Performance may also be an issue. This is commonly done with styles, for instance, where a style is defined in one place and then just referenced every time it is used.

    We wanted there to be a main place you go to see all external resources a file uses. We’d already designed relationships for referencing parts, and decided that it also made sense to do this with external references. It makes managing links a lot easier. You don’t need to parse through all the document XML, just the relationships. Link management is pretty important, and this makes it easier.

    This now gets into a point also raised by Todd. If you are used to operating on a single XML file, this approach seems like overkill, and adds extra complexity. You need to view this as a new model for documents though. Other XML document formats do similar things. Look at the StarOffice format. They have styles defined in a separate XML file which you need to get at. We do the same thing with our styles. The best way to work with these documents is to understand the Open Packaging Conventions. If you have access to WinFx, that also makes things easier, as it allows you to quickly query for relationships on any part (just pass in the relID, and you’ll be returned the target).

    -Brian

  6. anon says:

    "Other reasons are to save space (you can define a specific target once but use multiple times). Performance may also be an issue."

    Hmm, semantics is not the same. If you define a style externally, then changing the style definition will consequently project onto all objects referencing that style. Whereas an inline style will only apply to the current object (and assuming the inline style works in a top-down manner). We are talking a different instantiation. The way you store this information may prevent you from doing meaningful or deterministic changes in the future, only because of the way you decided to store styles. Or anything else.

    For instance, if you do that in html using a tool like Dreamweaver, you run into those troubles because the WYSIWYG tool never make the user really control those things. And you end up editing html with notepad.

    So part of the developer being in control is the ability to decide what becomes a rel, what remains "stand-alone".

    If a developer updates a docx file using code and simply ignores the above, then the results are unpredictable.

  7. BrianJones says:

    Anon, I’m sorry, but I don’t follow your point. I’m on the road right now so maybe it’s just the jetlag 🙂

    Let me attempt to restate my point and see if that clears it up (or shows you how I’ve misunderstood your point). I said there are numerous reasons people designing file formats have chosen to write information in a separate location instead of inline. That doesn’t necessarily mean it has to be in a separate file, it could be in the same file but just in a different branch of the tree.

    The hyperlink’s path (which we were discussing here) is stored locally (in the file), just in a different branch instead of inline. I also used styles as an example, because that’s another case where people have done that. Another example would be a string table, which is useful when you have a large grid of data and you don’t want to have to repeat strings, and instead just store a collection of strings and then point to the string from the grid.

    Your point about the developer remaining in control by deciding whether a rel is used or not just seems to add more complexity to the file. It means that anyone parsing the file has to handle two methods now instead of just one. Is that what you were suggesting or did I misunderstand you?

    -Brian

  8. anon says:

    "Another example would be a string table, which is useful when you have a large grid of data and you don’t want to have to repeat strings, and instead just store a collection of strings and then point to the string from the grid."

    I am not talking about how easy or not it is to store then parse strings from a shared dictionary. That I guess is as easy as the API wrapping the whole rel+XML thing is easy to use, and then it comes down to performance issues. That’s not my point. (that said, any plan to ship a small native zero-dependency dll for that matter?)

    My point is semantics. Let’s take the repeated strings example and assume I as a developer want to update a docx document such that one word of the shared dictionary gets updated, for instance "word1" becomes "word2". Well then, I am saying the resulting docx document has unpredictable content because the developer wanted to change a single occurrence of word "word1" somewhere in the document, however making the change in the shared dictionary causes all occurrences of the word to change. That is also just as bad for now, and for future updates, either with code or using Office 2006.

    The point is, when you start sharing stuff, be it content, styles, hyperlinks, whatever, you make a clear assumption on how those are supposed to be used. However, a developer out there who was just tasked to make a change to a docx document without knowing it will make changes that may break the very intentions of the original document content.

    In the other post, I took the analogy of Dreamweaver. This WYSIWYG tool lets you make changes to both inline and external style definitions. But without telling you when you do so. The result is that some of your changes will affect say all objects with a given style although you wanted to change only one particular object (the correct operation is : create a new inline style and attach it to the object). After testing the tool, you usually end up going back to editing html and styles with Notepad. Because in Notepad you see how things relate together, you control the underlying semantics.

    Hope this clarifies.

  9. Brian Kemp says:

    That is the ugliest, most non-readable piece of data storage in XML I’ve ever seen. I can’t figure out what the heck that’s supposed to look like on paper.

  10. John Mertl says:

    Hello,

    I have been reading these posts for quite some time now. I really believe that the Office team is making the Office file formats open in such a way that we will be able to create them even in Notepad. This is a good thing, but it does not solve the problems we currently have with DOC or XLS files: that only the latest version of Microsoft Office is able to open those files. It is about competition, but also, it is about the fact that MS on purpose claims that it is UNABLE to create a format that will be compatible with future versions. Also, if MS were serious about interoperability, they would create (or make it possible to create) a free converter from DOC to DOCX and vice versa.

    I am a long-time network manager of Windows networks and consider Microsoft software a good piece to work with, especially with terms of user friendliness. But those "embrace and extend" policies are really bad things, especially for the users which always are talked about in Microsoft speeches.

    I think this big talk will end as always. The documents created in Office 12 will be viewable and editable only in Office 12 and future versions. Microsoft does not want to make a format which will be compatible and interoperable. It would harm their business and monopoly power. It would be nice if they at least say it.

    John

  11. Patrick says:

    If I am correct, the docx format will be backported to earlier versions of MS-Office. I believe I read it on this blog somewhere.(?)

    If so it would be easy to convert older documents yourself with the current version of your Office application.

  12. Neo says:

    Err what?

    BTW Thailand is deserting M$ too. It’s for political reasons, but based on this it could be on technical considerations too.

    That is some UGLY code. Maybe you should learn why XML is useful, and why this doesn’t fit into that framework. Not only is M$ evil, it seems institutionally stupid too!

  13. y says:

    "Then, like you said, you look at the .rels part (which actually probably won’t be compressed, so it’s really easy to get at). Then you can either open the start part itself and look at the namespace; or you go to the [Content_Types].xml part which by default will not be compressed, and look at that. It’s pretty straightforward. If you have an easier way you’d like to see it done that doesn’t actually break away from the ZIP spec and is easy to do with most ZIP libraries let me know."

    My understanding is that most zip libraries won’t compress files under a certain size, or those for which compression doesn’t save space, although criteria may differ here. In your particular example, everything but the jpeg is compressed. It would be helpful if the .rels and [Content_types].xml files were not compressed, but I’m not sure how to guarantee that. I’ll have to take a look at what Office 12 actually does.

    I still think my earlier suggestion of a "mimetype" file is worthwhile. If the contents of the file are just the mimetype, it will be quite short, and I believe zip libraries generally will not compress it. If it is specified first in the list of files to be zipped, then it should end up first in the zip archive. Some experimentation suggests that this is easy to achieve. If necessary, the mimetype file can be optional, since it can be derived from the main document.

    Speaking of the main document, how is it again that one determines which is the "start part"?

  14. Joe Mele says:

    If automation is truly important, will microsoft hand out a freely deployable command line conversion or one that is com or .net?? or all of the above?

    I can see that the doc format would then be accepted more, as in resumes, because a mechanical import of the information would be possible.
