Introduction to Word documents


Now that folks have had a chance to work with Beta 1 for a few months, I wanted to take some time to give a high-level overview of the three different document formats. Today I’m going to focus on Word. Word obviously has a huge set of features and functionality, so I won’t be able to do much more than scratch the surface today (but hopefully this will be a good start).

Document

A large number of pieces of information go into constructing a Word document. If you focus just on the pieces that actually provide the content, though, you can break the document out into a collection of subdocuments. We call those subdocuments ‘stories’, and there are six top-level stories that make up a document:

  • The main story – this is the core body of the document, and is really the only one that’s required to make a document.
  • Headers & Footers – There can be one or more of these, and they are tied to a section.
  • Footnotes & Endnotes – Anchors for the footnotes and endnotes live in the body, but the actual content is stored separately.
  • Subdocuments – There is a feature that allows for the document to be broken out into a collection of subdocuments.
  • Frames
  • Comments

Once you have the collection of stories, you can focus on the other parts of the file, which specify the properties that should be used for those stories (e.g. layout, formatting, etc.). For the most part, all the stories in a document share a common set of properties. These properties are contained within:

  • Style information
  • Bullets and numbering information
  • Font information
  • Document settings

Style Information

A style defines a specific set of formatting properties that can then be referenced by a content object. A great example of a style would be the “Normal” paragraph style, which in Word 2003 is defined as having the following properties: Font = Times New Roman; Font Size = 12 point; Justification = Left; Line Spacing = Single.

Word supports five different style types:

  • Paragraph Styles
  • Character Styles
  • Linked Styles (both paragraph and character)
  • Table Styles
  • List Styles

Style cascading (or inheritance) is a fairly important and complex area. Multiple style types can be applied to the same part of a file, so the properties must be applied in a specific order. It’s possible for a property set by one style type to actually be removed or supplemented by other style types that follow it.

Styles of any given type can also inherit from other styles of that type. For example, the Heading 1 paragraph style is based on (and inherits from) the Normal paragraph style.

Here is a diagram that shows a simple view of how style information is applied. There are some additional complexities not outlined here, but this covers most cases.

If you look at the above diagram, you’ll see that the first type applied is the Table style type. This will affect Tables, Paragraphs, and Characters (or runs) within that paragraph. The next level is the List style type. This affects the paragraph properties. A list style can also bring in a paragraph style, but that’s a bit more complexity than I want to get into today. Paragraph and then Character styles are the next two applied, and the final piece is direct formatting, which will override everything else. That’s why folks involved in more complex documents like to avoid direct formatting if at all possible, since you can then manage the styles, and don’t have to worry about direct formatting overriding those styles.
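To make that ordering concrete, here is a minimal Python sketch of the cascade. The property names and values are made up for illustration, and it ignores the additional complexities mentioned above (for example, one style type removing a property set by another). It only shows that each later layer overrides the earlier ones.

```python
# Order in which formatting sources are applied, as described above.
# Later entries override earlier ones.
CASCADE_ORDER = ["table", "list", "paragraph", "character", "direct"]


def resolve_formatting(sources):
    """Merge property dictionaries in cascade order; later layers win."""
    effective = {}
    for layer in CASCADE_ORDER:
        effective.update(sources.get(layer, {}))
    return effective


# Hypothetical property sets for one run of text inside a table cell.
example = {
    "table": {"font": "Arial", "size": 10},
    "paragraph": {"size": 12, "justification": "left"},
    "character": {"italic": True},
    "direct": {"size": 14},
}

effective = resolve_formatting(example)
print(effective)
```

In this sketch the direct formatting wins for `size`, while `font` still comes from the table style because nothing later in the chain sets it.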

Now let’s talk about this at the XML level, and how a style is applied. The properties of the style are contained in the style definitions:

And the paragraph then just references the style via the style ID:
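As a rough sketch of that lookup, the Python below parses a small style definition and a paragraph that references it by style ID, then follows the "based on" chain. The element names and namespace are taken from the published Open XML (ECMA-376) schema, so treat them as an assumption here: the Beta 1 markup may differ in details. The helper names (`find_style`, `style_chain`, `effective_half_points`) are mine, not part of any API.

```python
import xml.etree.ElementTree as ET

# Namespace from the published Open XML schema (assumption; Beta 1 may differ).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def w(name):
    """Return the fully qualified name for a w:-prefixed element or attribute."""
    return f"{{{W}}}{name}"


STYLES_XML = f"""
<w:styles xmlns:w="{W}">
  <w:style w:type="paragraph" w:styleId="Normal">
    <w:name w:val="Normal"/>
    <w:rPr><w:sz w:val="24"/></w:rPr>
  </w:style>
  <w:style w:type="paragraph" w:styleId="Heading1">
    <w:name w:val="heading 1"/>
    <w:basedOn w:val="Normal"/>
    <w:rPr><w:b/><w:sz w:val="32"/></w:rPr>
  </w:style>
</w:styles>
"""

PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}">
  <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
  <w:r><w:t>Introduction</w:t></w:r>
</w:p>
"""


def find_style(styles_root, style_id):
    """Return the w:style element with the given style ID, or None."""
    for style in styles_root.findall("w:style", NS):
        if style.get(w("styleId")) == style_id:
            return style
    return None


def style_chain(styles_root, style_id):
    """Follow w:basedOn links, returning style IDs from nearest to most basic."""
    chain = []
    while style_id is not None and style_id not in chain:
        style = find_style(styles_root, style_id)
        if style is None:
            break
        chain.append(style_id)
        based_on = style.find("w:basedOn", NS)
        style_id = based_on.get(w("val")) if based_on is not None else None
    return chain


def effective_half_points(styles_root, style_id):
    """Return the font size (in half-points) from the nearest style that sets it."""
    for sid in style_chain(styles_root, style_id):
        size = find_style(styles_root, sid).find("w:rPr/w:sz", NS)
        if size is not None:
            return int(size.get(w("val")))
    return None


styles_root = ET.fromstring(STYLES_XML)
paragraph = ET.fromstring(PARAGRAPH_XML)

style_id = paragraph.find("w:pPr/w:pStyle", NS).get(w("val"))
chain = style_chain(styles_root, style_id)
size = effective_half_points(styles_root, style_id)
print(style_id, chain, size)
```

The paragraph itself carries nothing but the style ID; everything else comes from walking the style definitions.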

Bullets and Numbering

Although it’s not always obvious, any bullet/numbering definition consists of nine levels, each of which has Paragraph properties (e.g. margins) and Item properties (e.g. bullet vs. numbering, numbering type, etc.) defined. The behavior of the numbering is specified in two parts: the Bullets & numbering definition, and the actual Bullets & numbering instance, which is a specific instance of a given definition.

The Bullets and numbering definition specifies the properties for any or all of the nine levels. The instance then specifies the properties for a specific numbering instance, which include a reference to a definition and any additional overrides for one or more levels.

Let’s get into an example of how this would look in XML. Here is what a numbering definition looks like:

Then, after the numbering definition, there is a numbering instance that references the definition, and itself has an ID.

And the paragraph then just references the numbering instance via the list property settings.
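Here is a hedged Python sketch of that three-step chain: paragraph to numbering instance to numbering definition, with an instance-level override applied on top. The markup uses the element names from the published Open XML schema (`w:abstractNum`, `w:num`, `w:numPr`), which is an assumption on my part; the Beta 1 names for the list property settings may be different. The `resolve_level` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace from the published Open XML schema (assumption; Beta 1 may differ).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def w(name):
    """Return the fully qualified name for a w:-prefixed element or attribute."""
    return f"{{{W}}}{name}"


NUMBERING_XML = f"""
<w:numbering xmlns:w="{W}">
  <w:abstractNum w:abstractNumId="0">
    <w:lvl w:ilvl="0">
      <w:start w:val="1"/>
      <w:numFmt w:val="decimal"/>
      <w:lvlText w:val="%1."/>
      <w:pPr><w:ind w:left="720" w:hanging="360"/></w:pPr>
    </w:lvl>
  </w:abstractNum>
  <w:num w:numId="1">
    <w:abstractNumId w:val="0"/>
    <w:lvlOverride w:ilvl="0">
      <w:startOverride w:val="5"/>
    </w:lvlOverride>
  </w:num>
</w:numbering>
"""

PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}">
  <w:pPr>
    <w:numPr>
      <w:ilvl w:val="0"/>
      <w:numId w:val="1"/>
    </w:numPr>
  </w:pPr>
  <w:r><w:t>First item</w:t></w:r>
</w:p>
"""


def resolve_level(numbering_root, paragraph):
    """Resolve the numbering format and start value for a list paragraph."""
    num_pr = paragraph.find("w:pPr/w:numPr", NS)
    num_id = num_pr.find("w:numId", NS).get(w("val"))
    ilvl = num_pr.find("w:ilvl", NS).get(w("val"))

    # Step 1: the paragraph points at a numbering instance.
    instance = next(n for n in numbering_root.findall("w:num", NS)
                    if n.get(w("numId")) == num_id)

    # Step 2: the instance points at a numbering definition.
    abstract_id = instance.find("w:abstractNumId", NS).get(w("val"))
    definition = next(a for a in numbering_root.findall("w:abstractNum", NS)
                      if a.get(w("abstractNumId")) == abstract_id)

    # Step 3: read the level from the definition.
    level = next(l for l in definition.findall("w:lvl", NS)
                 if l.get(w("ilvl")) == ilvl)
    result = {
        "format": level.find("w:numFmt", NS).get(w("val")),
        "start": int(level.find("w:start", NS).get(w("val"))),
    }

    # Step 4: apply any override the instance declares for this level.
    for override in instance.findall("w:lvlOverride", NS):
        start = override.find("w:startOverride", NS)
        if override.get(w("ilvl")) == ilvl and start is not None:
            result["start"] = int(start.get(w("val")))
    return result


numbering_root = ET.fromstring(NUMBERING_XML)
paragraph = ET.fromstring(PARAGRAPH_XML)
resolved = resolve_level(numbering_root, paragraph)
print(resolved)
```

The definition says the level starts at 1, but this particular instance overrides the start value, so the list begins at 5 while keeping the decimal format from the definition.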

Font Information

Often, you can’t rely on a specific font being on a user’s machine. To make sure a document being passed around still looks good on a machine that doesn’t have a font used in the document, additional information can be stored in the document. This is done in two ways: via the font embedding functionality, and via the font type data that we write out. The font type data specifies characteristics of the font, which are used to find a suitable replacement when the specified font is unavailable.

Document Settings

All settings that are pertinent to the document are stored in separate parts within the document package. The settings can really be divided into two groups: those that affect presentation, and those that are just pure application settings.

The settings that affect presentation are things like compatibility options (e.g. lay out tables the way Word 97 did), as well as web settings such as div behaviors or frameset data. The pure application settings are things like view or zoom state. They may affect how the document appears within the application, but not the actual layout of the document.

Story Content

So, let’s get back into the concept of “stories” serving as the main building blocks of the document. Within each story, there is the actual content, which consists of block level structures:

  • Paragraphs
  • Tables
  • Structured Document Tags (custom XML; smart tags; content controls)
  • Range Permissions

And within each paragraph, there is a collection of inline structures:

  • Runs
  • Structured Document Tags (same as at the block level)
  • Comments, tracked changes, bookmarks
  • Drawings
  • Fields
  • Hyperlinks

There are a few basic structural rules in play here. First, all text in a word-processing document is contained within a run. A run is a region of text with a common set of properties. The second rule is that all runs must be contained within a paragraph. A paragraph, of course, is a collection of one or more runs that is displayed as a unit (this is analogous to the HTML <p> tag).

So let’s look at an example. The following text:

The quick brown fox.

would look like this in XML:

Notice that a paragraph is just a flat list of runs. There is no additional nesting, which is different from the HTML <span> model. I’m not saying one is better than the other, just pointing out that it’s different.
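To illustrate the flat run model, here is a small Python sketch. It assumes the sentence is bold throughout with "brown" also in italics, and it uses element names from the published Open XML schema, which may not match the Beta 1 markup exactly. The `describe_runs` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace from the published Open XML schema (assumption; Beta 1 may differ).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# One paragraph, three sibling runs. Formatting is repeated on each run
# rather than nested, so the bold-italic run lists both properties itself.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}">
  <w:r>
    <w:rPr><w:b/></w:rPr>
    <w:t xml:space="preserve">The quick </w:t>
  </w:r>
  <w:r>
    <w:rPr><w:b/><w:i/></w:rPr>
    <w:t>brown</w:t>
  </w:r>
  <w:r>
    <w:rPr><w:b/></w:rPr>
    <w:t xml:space="preserve"> fox.</w:t>
  </w:r>
</w:p>
"""


def describe_runs(paragraph):
    """Return (text, sorted property names) for each run in the paragraph."""
    runs = []
    for run in paragraph.findall("w:r", NS):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        run_props = run.find("w:rPr", NS)
        if run_props is not None:
            # Strip the "{namespace}" part of each tag to get the local name.
            props = sorted(child.tag.split("}")[1] for child in run_props)
        else:
            props = []
        runs.append((text, props))
    return runs


paragraph = ET.fromstring(PARAGRAPH_XML)
runs = describe_runs(paragraph)
full_text = "".join(text for text, _ in runs)
print(runs)
print(full_text)
```

Because the runs are siblings, getting the plain text back is just a matter of concatenating them in order; there is no tree of nested formatting to walk.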

A paragraph may be at any location that allows for block level content. For example, it could be at the top level within a story (e.g. header, footer, main document); nested within a table cell; or nested within a structured document tag or some other structured markup.

Tables

Tables in Word (at least at the base level) are fairly similar to tables in HTML. A Word table consists of a table element which can have a set of properties assigned to it. Then within the table element is a collection of rows, and within each row is a collection of cells. Here is a basic example of a table in WordprocessingML:

Individual table cells can contain block level content. This means a table cell can contain not just a paragraph, but also another table. This allows for tables to be nested in other tables.
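A minimal Python sketch of the table / row / cell structure follows, including one table nested inside a cell. The element names (`w:tbl`, `w:tr`, `w:tc`) follow the published Open XML schema, which is an assumption relative to Beta 1, and the fragment is trimmed down for readability rather than validated against the schema. The `table_rows` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace from the published Open XML schema (assumption; Beta 1 may differ).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

TABLE_XML = f"""
<w:tbl xmlns:w="{W}">
  <w:tblPr><w:tblW w:w="0" w:type="auto"/></w:tblPr>
  <w:tr>
    <w:tc><w:p><w:r><w:t>A1</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>B1</w:t></w:r></w:p></w:tc>
  </w:tr>
  <w:tr>
    <w:tc><w:p><w:r><w:t>A2</w:t></w:r></w:p></w:tc>
    <w:tc>
      <w:tbl>
        <w:tr><w:tc><w:p><w:r><w:t>inner</w:t></w:r></w:p></w:tc></w:tr>
      </w:tbl>
      <w:p/>
    </w:tc>
  </w:tr>
</w:tbl>
"""


def table_rows(table):
    """Return the text of each cell, row by row.

    Only paragraphs that are direct children of a cell are read, so the
    text of a nested table is not mixed into its parent cell.
    """
    rows = []
    for row in table.findall("w:tr", NS):
        cells = []
        for cell in row.findall("w:tc", NS):
            text = "".join(t.text or "" for t in cell.findall("w:p/w:r/w:t", NS))
            cells.append(text)
        rows.append(cells)
    return rows


outer = ET.fromstring(TABLE_XML)
nested = outer.find("w:tr/w:tc/w:tbl", NS)

outer_rows = table_rows(outer)
nested_rows = table_rows(nested)
print(outer_rows)
print(nested_rows)
```

The last cell of the outer table holds a whole table plus an empty paragraph, which is why its own text comes back empty while the nested table reports its cell separately.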

Custom Defined XML

The custom defined XML support allows users to embed their own XML within a WordprocessingML file. For example, if you wanted to have the following structure in your document:

You could just insert that XML using a custom XML tag:

That gives you additional structure in your document, and allows you to parse the file looking for your structures.

Sections

Sections in a word-processing document specify a number of properties. By default, a document contains one section, but additional sections can be inserted to either change some of those properties for a specific portion of the document, or even just to create some additional structure (such as a page break).

The types of information that live with a section are:

  • Page properties (page size; page orientation; margins)
  • Header/footer references
  • Footnote/endnote properties
  • Column properties
  • Line numbering
  • Text direction (RTL vs. LTR; top-to-bottom vs. bottom-to-top)

There are four types of sections: Continuous; Next page (start this section on the next page); Even (start on the next even page); and Odd (start on the next odd page).

The last section of the document (which for the most part is the only section) is stored at the end of the body. All other additional sections inserted are stored as a paragraph property.
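The storage rule above can be sketched in Python like this: earlier sections are picked up from paragraph properties, and the final one from the end of the body. Element and attribute names (`w:sectPr`, `w:type`, `w:pgSz`) are from the published Open XML schema and may differ from Beta 1. Treating a missing section type as "nextPage" and a missing orientation as portrait is also an assumption based on that published schema. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace from the published Open XML schema (assumption; Beta 1 may differ).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def w(name):
    """Return the fully qualified name for a w:-prefixed element or attribute."""
    return f"{{{W}}}{name}"


BODY_XML = f"""
<w:body xmlns:w="{W}">
  <w:p><w:r><w:t>First section text.</w:t></w:r></w:p>
  <w:p>
    <w:pPr>
      <w:sectPr>
        <w:pgSz w:w="12240" w:h="15840"/>
      </w:sectPr>
    </w:pPr>
  </w:p>
  <w:p><w:r><w:t>Second section text.</w:t></w:r></w:p>
  <w:sectPr>
    <w:type w:val="oddPage"/>
    <w:pgSz w:w="15840" w:h="12240" w:orient="landscape"/>
  </w:sectPr>
</w:body>
"""


def list_sections(body):
    """Return section property elements in document order."""
    sections = []
    # Every section except the last is stored as a paragraph property.
    for paragraph in body.findall("w:p", NS):
        section = paragraph.find("w:pPr/w:sectPr", NS)
        if section is not None:
            sections.append(section)
    # The last section is stored at the end of the body.
    final = body.find("w:sectPr", NS)
    if final is not None:
        sections.append(final)
    return sections


def describe(section):
    """Return (section type, page orientation) for one section."""
    type_el = section.find("w:type", NS)
    # Assumed defaults when the markup does not say otherwise.
    section_type = type_el.get(w("val")) if type_el is not None else "nextPage"
    orientation = section.find("w:pgSz", NS).get(w("orient"), "portrait")
    return (section_type, orientation)


body = ET.fromstring(BODY_XML)
summary = [describe(section) for section in list_sections(body)]
print(summary)
```

A document with only one section would have no section properties on any paragraph, and `list_sections` would return just the element at the end of the body.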

Headers and Footers

There are three types of headers and footers. The main one is the Odd page header. If that’s the only one that exists, then it is applied to all pages of the document. Optionally, an override header can exist for the even pages, as well as for the first page. Headers are specified for each section, so if you want a different header used, you’ll need to create a new section.

Headers and footers are stored in separate parts within the package. There is one part for each header and each footer. Each section then refers to its header(s) and footer(s) by an explicit relationship reference:

The type of the header or footer is actually declared at the root of the part.

Closing

Well, that was probably enough for one day. I know that I kept this still at a relatively high level. I’ll definitely try to dig deeper into the details on the areas that folks are more interested in.

I’m going to be offline for the next week or so, but hopefully I’ll have time to at least check comments every once in a while.

-Brian

Comments (32)

  1. Tate, Jeffrey T. says:

    Great information.

    So are you telling us that if a table will "live" in the document, we should start with the Table Style first and format in the order illustrated above? Is there harm when a Paragraph style is applied to a cell?

    Again, great information!

    Jeffrey

  2. Ian Easson says:

    Now that the external (XML) representation of styles has been rationalized, does that mean that their internal representation (data structures) have been similarly rationalized? By that, I mean will Word 12 no longer suffer from the endless corruption of styles (particularly list styles) that all previous versions of Word are prone to due to the inadequate styles data structures?

  3. Chris Nahr says:

    Ian’s question is mine as well. The main reason I’m using Word only for other people’s documents (and FrameMaker for my own) is Word’s chronic lack of reliability. Ribbons or not, I’ll not bother to upgrade unless the internal document structure has been fixed.

  4. So Word 12 doc’s are broken up into XML modules. Does this affect what we had in Office 2003 VBA? Will these modules be aggregated into a single XML property such that Range.XML() remains unchanged?

  5. Tristan Davis says:

    Bryan: We’re definitely not changing the result of Range.XML this release – it will continue to return WordprocessingML that matches the Word 2003 XML schemas. There will also be a method to return an XML serialized version of the new file format, whose schemas are different.

  6. I have had a crazy week, so am just now catching up on some of my blog reading. A few things worth reading: Brian Jones has a nice introduction to using styles with the XML underlying Microsoft Word. Ed Dodds did some…

  7. Daniello says:

    Only a question: Will MathML be natively supported in Microsoft Word 12?

  8. nchamp says:

    All great stuff.

    Can I request a further article on how the different files fit together in a package, especially the customXML stuff? I’ve worked out (I think) the four places where the name of the xml data files is defined and referenced, but I’m interested in hearing it from the horse’s mouth, so to speak. I’d also like to see an example with more than one xml datastore in the package (and with more meaningful names than item1.xml, too!)

    Thanks in advance

  9. Very interesting… wish I was in the beta crowd too!

    A question on Word and XML and datafiles which I didn’t see touched on.  Word 11 can take in data for a merge from a wide variety of formats including CSV, RTF, straight from a database, etc… but not XML.  Will Word 12 import XML data directly?  Without the need for ASP or other work arounds?

    Thanks for the intriguing look at Word internals.

  10. Daniel Schierbeck says:

    I’m curious here: why do you use the `w’ prefix for attributes on elements that are already in the namespace referred to by `w’?

     <foo:bar foo:bur="…"/>

    is exactly the same as

     <foo:bar bur="…"/>

    though obviously shorter. Do you use a real namespace-aware parser, or are the prefixes just mapped statically, i.e. the namespace name doesn’t mean anything to the parser, only the prefix does? While harmless, it seems a bit unprofessional.

    This is not just a problem with Office, OpenOffice does the same when saving files in the OpenDocument format.

  11. Randy Brown says:

    I currently use XML Schemas in Word XML files that I create in Word.  I then use these files as templates from within my ASP.NET applications to load the XML into a DOM, cycle the XML Schema elements, and infuse SQL data into the document to produce tailored Word Documents and save to Disk as Word XML files.

    I now have the need to save these Word XML files as PDF files and cannot find a good tool to do this programmatically from the ASP.NET application.  If you know of a third party tool that can perform this operation, I’d love to know what it is.

    But my real question is: with all the new XML and PDF functionality, will there be a way to perform the task above in an easier fashion AND to save the Word XML to a PDF programmatically?

    Thanks in advance, keep up the good work!

  12. Johan says:

    XML formats aside; will the new Office version provide a COHERENT AND UNDERSTANDABLE system for Paragraph Formats?

    In current Word versions, different documents behave very differently, depending on whether "Automatically update paragraph format" is active or not, and it appears there are also some other mysterious configuration options. I have not found a sure way to change a paragraph format and have the change applied to all paragraphs of that type.

    The result has been confusing the hell out of users, and extremely few people seem able to master the MS Word paragraph format complexity, not to mention the "art" of creating a table-of-contents based on headings.

    FrameMaker got this right 10+ years ago!!

    In my opinion, MS Word needs a redesign for USABILITY. Adding more features is NOT the answer.

  13. JayV says:

    Hey Brian.

    Good stuff, but one of the things that I haven’t seen mentioned yet is Word’s OfficeArt.  Currently, Word seems to still use vml to describe autoshapes and vector objects, while Excel and PowerPoint both leverage the new OfficeArt schemas.  Is Word planning to continue using vml, or will this eventually be changed before the final release so that even Word uses oartml?

    Thanks, and my condolences regarding the super bowl.  Maybe next year, buddy.

  14. BrianJones says:

    Thanks for all the comments everyone. Sorry for not replying sooner. I was down in Cupertino for Ecma meetings when I made this post, and then I was on vacation last week. In between I had to suffer through the Super Bowl…

    Jeffrey,

    You can only apply a paragraph style directly to a paragraph, not to a cell, or table style. A table style can specify that every paragraph within a specific cell, row, column should have certain paragraph properties assigned, but not the style.

    Ian,

    This is just the physical representation of the styles on disk. The internal style behavior is not directly affected. That said, there is some work that has been done to make working with styles easier (but I’m not sure if it will affect the problem you are talking about).

    Daniello,

    You will be able to copy and paste using MathML, but the XML persistence in the file format will be a bit different from MathML. We’ll also provide transforms for going between the two.

    nchamp,

    I’ll try to pull together an example with more than one datastore file. The short of it is that all you need to do is create a relationship to the XML part you want to use as a datastore item (and give it the right relationship type). You don’t need to reference it directly anywhere (unless you want to create a mapping or something).

    Terry,

    There aren’t really any changes being made to the types of sources allowed for a Mail Merge.

    Daniel,

    This is a discussion we’ve had internally a number of times. When we first made the decision to do this back with SpreadsheetML in Office XP (about 7 years ago), it was because it looked like the best way to go. With the XML parsers we were using, if you asked for the namespace of an attribute, it would just return null if you didn’t qualify the attributes. Of course you could go and look at the parent element to figure out the namespace, but it seemed easier just specifying it on the attribute.

    Now, looking back on it, it might have been nicer to go the other way, but it wasn’t that big of a deal.

    In answer to your question about using a namespace aware parser though, we definitely do that. You can change the prefixes if you want and we’ll still properly parse the files (assuming you’ve made the proper namespace declaration for that new prefix).

    Randy,

    There are a few tools out there that go from WordprocessingML to XSL-FO. From there you can go to PDF. I’ve blogged about a couple of them, and I’ll try to dig into some more.

    Johan,

    The "automatically update paragraph format" option is really something that only a template author should use. It’s extremely confusing for any end user, and that’s why it isn’t on by default for any of the styles or templates we ship. Unfortunately there isn’t a Word blog out there right now, but if you’ve seen the Beta, you’ll notice that there are a large number of improvements to Styles, Tables of Contents, and other pieces of functionality that we’ve seen people have trouble with. We are very aware of the usability issues (just look at Jensen’s blog to get an idea of all the work we do there: http://blogs.msdn.com/jensenh/)

    Jay,

    It was a great year anyway (although that game was about as painful of a game as I’ve watched in a while).

    Word will still use VML for most shapes. For new diagrams, charts, and pictures it will use the newer drawingML.

    -Brian

  15. FARfetched says:

    Aside from Ian’s and Chris’s question(s) about stability and corruption issues, which chased me away from Word after 12 productive years, I have a specific question about the "<b>The quick <i>brown</i> fox.</b>" example under "Story Content."

    In the source XML, I noticed that there were no explicit spaces between the individual text runs… yet the italic text run has spaces on either side in the display. If that example is correct, how do you express an example like "do not do this <i>yet</i>." where an italicized "yet" abuts the plain-text period immediately following?

    I once learned (the hard way) that trailing spaces in RTF are significant, so I’m not assuming anything….

  16. BrianJones says:

    Well, in the WordprocessingML, it would be more like this:

    <r>

     <t xml:space='preserve'>do not do this </t>

    </r>

    <r><rPr><i/></rPr>

     <t>yet</t>

    </r>

    <r>

     <t>.</t>

    </r>

    Notice that on the first text node, it specifies that leading and trailing space should be preserved. If that wasn’t there, then when you opened the file "this" and "yet" would have appeared as one word (where the 2nd part is italicized).

    -Brian

  17. Randy Brown says:

    Hey Brian, thanks for keeping up with the posts.

    I just got off yet ANOTHER contract where the client was wanting to produce data infused documents and send out as PDFs.  They also need to have the ability to alter the document template as business needs dictate.  Not sure about others, but I’m finding this requirement on just about every workflow related project that I get involved with lately.

    Word’s ability to work with custom XML schemas is awesome for programmatically infusing data into Word templates stored on disk. Unfortunately, the lack of a programmatic way to go from WordML to PDF is preventing the above scenario from becoming a reality. It forces us to scour the web looking for the final piece of the puzzle. So far I have found only one solution that claims to do the conversion of WordML to PDF, but it is cost prohibitive ($1600, which is ridiculous in my opinion, at 4 times the cost of Word itself).

    You mentioned going from WordML to XSL-FO, and then to PDF. Could you elaborate on this, and on what is needed to go from WordML – to XSL-FO – to PDF? Also, if I’m barking up the wrong tree and you’re not the one to talk to about generating PDFs from WordML (programmatically from .Net apps), then let me know. I looked through Cindy’s blog (now defunct) and the other guys’, and have seen my same question asked several times by others, but have yet to see any quality answers on the topic.

    If MS recognizes the importance of embedding PDF generation into the entire Office suite, then I would think that they would recognize the importance of exposing this same functionality to developers for workflow related applications.

    If there was a specific blog on this topic, I would be willing to bet that there would be substantial interest.

    Thanks a lot for your time and comments.

  18. Ian Bradbury says:

    Brian,  Please can you tell me.  Will the ability to add my own custom xml tags be available across the whole Office 12 range?

    Or just the professional set (as is currently)?  

    I would like to use custom xml tags to define business structures and then process that xml "in document" via some custom code.

  19. BrianJones says:

    Hey Ian, it will be available in all SKUs: http://blogs.msdn.com/brian_jones/archive/2005/09/20/472146.aspx

    -Brian

  20. Shane Wilson says:

    Hey Brian;

    I have a xml doc.  I have a dotx template with embedded custom xml using a predefined xsd.  Now I want to create a new document by merging the xml and the dotx to create a docx that ultimately will be PDF’ed.  The question is – or am I just dumb – how do I get the xml merged with the template?

  21. Links to blog posts that contain useful technical information for developers.  Open XML is a new standard, but there’s some good information already available if you know where to look.

  22. Andrew says:

    Is there any XSL to convert WordML to RTF?

    Please?

    With image support would be nice……

  23. I thought it might be worthwhile to give a bit of an overview of the WordprocessingML model that you
