Introduction to Word documents

02/02/2006

Now that folks have had a chance to work with Beta 1 for a few months, I wanted to take some time to give a high level overview of the three different document formats. Today I'm going to focus on Word. Obviously there is a huge set of features and functionality in Word, and I won't really be able to do much more than just scratch the surface today (but hopefully this will be a good start).

Document

A large number of pieces of information go into constructing a Word document. If you focus only on the pieces that actually provide the content, though, you can break the document out into a collection of subdocuments. We call those subdocuments 'stories', and there are six top-level stories that make up a document:

The main story - this is the core body of the document, and is really the only one that's required to make a document.
Headers & Footers - There can be one or more of these, and they are tied to a section.
Footnotes & Endnotes - Anchors for the footnotes and endnotes live in the body, but the actual content is stored separately.
Subdocuments - There is a feature that allows for the document to be broken out into a collection of subdocuments.
Frames
Comments

Once you have the collection of stories, you then focus on the other parts of the file that specify the properties to be used for those stories (i.e., layout, formatting, etc.). For the most part, all the stories in a document share a common set of properties. These properties are contained within:

Style information
Bullets and numbering information
Font information
Document settings

Style Information

A style defines a specific set of formatting properties that can then be referenced by a content object. A great example of a style would be the "Normal" paragraph style, which in Word 2003 is defined as having the following properties: Font = Times New Roman; Font Size = 12 point; Justification = Left; Line Spacing = Single.

Word supports five different style types:

Paragraph Styles
Character Styles
Linked Styles (both paragraph and character)
Table Styles
List Styles

Style cascading (or inheritance) is a fairly important and complex area. Multiple style types can be applied to the same part of a file, so the properties must be applied in a specific order. It's possible for a property set by one style type to actually be removed or supplemented by other style types that follow it.

Styles of any given type can also inherit from other styles of that type. For example, the Heading 1 paragraph style is based on (and inherits from) the Normal paragraph style.
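To make the "based on" idea a bit more concrete, here is a minimal sketch in Python. It is not how Word implements this; the `STYLES` table, the `resolve_style` helper, and the Heading 1 property values are all hypothetical and only illustrate walking up the inheritance chain and applying parent properties first.

```python
# Hypothetical style table: each style names an optional parent ("based_on")
# and lists only the properties it sets itself.
STYLES = {
    "Normal": {
        "based_on": None,
        "props": {"font": "Times New Roman", "size_pt": 12,
                  "justification": "left", "line_spacing": "single"},
    },
    "Heading1": {
        "based_on": "Normal",
        # Illustrative values only, not taken from the article.
        "props": {"font": "Arial", "size_pt": 16, "bold": True},
    },
}


def resolve_style(style_id, styles=STYLES):
    """Return the effective properties of a style, parents applied first."""
    chain = []
    seen = set()
    while style_id is not None:
        if style_id in seen:
            raise ValueError(f"circular based-on chain at {style_id!r}")
        seen.add(style_id)
        chain.append(style_id)
        style_id = styles[style_id]["based_on"]

    effective = {}
    # Walk from the root ancestor down so that children override parents.
    for sid in reversed(chain):
        effective.update(styles[sid]["props"])
    return effective


heading = resolve_style("Heading1")
```

In this sketch `heading` picks up its justification and line spacing from Normal, while its own font and size win over the inherited ones.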

Here is a diagram that shows a simple view of how style information is applied. There are some additional complexities not outlined here, but this covers most cases.

If you look at the above diagram, you'll see that the first type applied is the Table style type. This will affect Tables, Paragraphs, and Characters (or runs) within that paragraph. The next level is the List style type. This affects the paragraph properties. A list style can also bring in a paragraph style, but that's a bit more complexity than I want to get into today. Paragraph and then Character styles are the next two applied, and the final piece is direct formatting, which will override everything else. That's why folks involved in more complex documents like to avoid direct formatting if at all possible, since you can then manage the styles, and don't have to worry about direct formatting overriding those styles.
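The ordering described above can be sketched as a simple layered merge. This is a simplification under stated assumptions: the layer names and the `effective_formatting` helper are hypothetical, and the sketch only models a later layer overriding an earlier one, not the cases where a property is removed or a list style brings in a paragraph style.

```python
# Cascade order as described in the text, lowest priority first.
CASCADE_ORDER = ["table", "list", "paragraph", "character", "direct"]


def effective_formatting(layers):
    """Merge property dicts in cascade order; later layers override earlier ones."""
    result = {}
    for layer in CASCADE_ORDER:
        result.update(layers.get(layer, {}))
    return result


formatting = effective_formatting({
    "table": {"font": "Arial", "shading": "gray"},
    "paragraph": {"font": "Times New Roman", "size_pt": 12},
    "character": {"italic": True},
    "direct": {"size_pt": 14},  # direct formatting wins over every style
})
```

Here the table style's font is replaced by the paragraph style's font, and the directly applied size overrides the size from the paragraph style, which is exactly why heavy direct formatting makes styles harder to manage.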

Now let's talk about this at the XML level, and how a style is applied. The properties of the style are contained in the style definitions:

And the paragraph then just references the style via the style ID:
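As a rough illustration of both pieces, the Python sketch below parses a small style definition and a paragraph that points at it. The markup is an assumption on my part: it uses the element names and namespace of the later published WordprocessingML schema (`w:style`, `w:styleId`, `w:pStyle`), which may differ in detail from the Beta 1 markup, and the sample text is made up.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def w(name):
    """Return the fully qualified name of a w: attribute or element."""
    return f"{{{W}}}{name}"


# Assumed markup: a style definition holding the formatting properties.
STYLES_XML = f"""
<w:styles xmlns:w="{W}">
  <w:style w:type="paragraph" w:styleId="Heading1">
    <w:name w:val="heading 1"/>
    <w:basedOn w:val="Normal"/>
    <w:rPr><w:b/><w:sz w:val="32"/></w:rPr>
  </w:style>
</w:styles>
"""

# Assumed markup: the paragraph carries only the style ID.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}">
  <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
  <w:r><w:t>Introduction</w:t></w:r>
</w:p>
"""

styles = ET.fromstring(STYLES_XML)
paragraph = ET.fromstring(PARAGRAPH_XML)

# Index the style definitions by ID, then follow the paragraph's reference.
style_by_id = {s.get(w("styleId")): s for s in styles.findall("w:style", NS)}
style_id = paragraph.find("w:pPr/w:pStyle", NS).get(w("val"))
style = style_by_id[style_id]

# w:sz is expressed in half-points, so 32 means 16 pt.
half_points = int(style.find("w:rPr/w:sz", NS).get(w("val")))
```

The point to notice is that the paragraph itself stays small: changing the style definition changes every paragraph that references it.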

Bullets and Numbering

Although it's not always obvious, any bullet/numbering definition consists of nine levels, each of which has Paragraph properties (e.g. margins) and Item properties (e.g. bullet vs. numbering, numbering type, etc.) defined. The behavior of the numbering is specified in two parts: the Bullets & numbering definition, and then the actual Bullets & numbering instance, which is a specific instance of a given definition.

The Bullets and numbering definition specifies the properties for any or all of the nine levels. The instance then specifies the properties for a specific numbering instance: a reference to a definition, plus any additional overrides for one or more levels.

Let's get into an example of how this would look in XML. Here is what a numbering definition looks like:

Then, after the numbering definition, there is a numbering instance that references the definition, and itself has an ID.

And the paragraph then just references the numbering instance via the list property settings.
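The three-step chain (definition, instance, paragraph reference) can be sketched in Python as follows. The markup is assumed rather than taken from this article: it uses the names from the later published schema (`w:abstractNum` for the definition, `w:num` for the instance, `w:numPr` on the paragraph). The `level_format` helper is hypothetical and ignores per-instance level overrides.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def w(name):
    return f"{{{W}}}{name}"


# Assumed markup: one definition (only level 0 shown) and one instance of it.
NUMBERING_XML = f"""
<w:numbering xmlns:w="{W}">
  <w:abstractNum w:abstractNumId="0">
    <w:lvl w:ilvl="0">
      <w:start w:val="1"/>
      <w:numFmt w:val="decimal"/>
      <w:lvlText w:val="%1."/>
    </w:lvl>
  </w:abstractNum>
  <w:num w:numId="1">
    <w:abstractNumId w:val="0"/>
  </w:num>
</w:numbering>
"""

# Assumed markup: the paragraph references the instance and a level.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}">
  <w:pPr>
    <w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr>
  </w:pPr>
  <w:r><w:t>First item</w:t></w:r>
</w:p>
"""

numbering = ET.fromstring(NUMBERING_XML)
paragraph = ET.fromstring(PARAGRAPH_XML)


def level_format(paragraph, numbering):
    """Follow paragraph -> instance -> definition and return the number format."""
    num_pr = paragraph.find("w:pPr/w:numPr", NS)
    num_id = num_pr.find("w:numId", NS).get(w("val"))
    ilvl = num_pr.find("w:ilvl", NS).get(w("val"))

    # Step 1: find the instance and the definition it points to.
    abstract_id = None
    for num in numbering.findall("w:num", NS):
        if num.get(w("numId")) == num_id:
            abstract_id = num.find("w:abstractNumId", NS).get(w("val"))
            break
    if abstract_id is None:
        raise KeyError(f"no numbering instance with id {num_id}")

    # Step 2: find the requested level inside that definition.
    for abstract in numbering.findall("w:abstractNum", NS):
        if abstract.get(w("abstractNumId")) == abstract_id:
            for lvl in abstract.findall("w:lvl", NS):
                if lvl.get(w("ilvl")) == ilvl:
                    return lvl.find("w:numFmt", NS).get(w("val"))
    raise KeyError(f"no level {ilvl} in definition {abstract_id}")


fmt = level_format(paragraph, numbering)
```

Because the paragraph only stores an instance ID and a level, several lists can share one definition while still restarting or overriding levels independently.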

Font Information

Often, you can't rely on a specific font being on a user's machine. To make sure a document being passed around still looks good on a machine that doesn't have a font used in the document, additional information can also be stored in the document. This is done in two ways: via the font embedding functionality, and via the font type data that we write out. The font type data specifies characteristics of the font, which are used to find a suitable replacement when the specified font is unavailable.

Document Settings

All settings that are pertinent to the document are stored in separate parts within the document package. The settings can really be divided into two groups: those that affect presentation, and those that are just pure application settings.

The settings that affect presentation are things like compatibility options (e.g., lay out tables like Word 97), as well as web settings such as div behaviors or frameset data. The pure application settings are things like view or zoom state. They may affect how the document appears within the application, but not the actual layout of the document.

Story Content

So, let's get back into the concept of "stories" serving as the main building blocks of the document. Within each story, there is the actual content, which consists of block level structures:

Paragraphs
Tables
Structured Document Tags (custom XML; smart tags; content controls)
Range Permissions

And within each paragraph, there is a collection of inline structures:

Runs
Structured Document Tags (same as at the block level)
Comments, tracked changes, bookmarks
Drawings
Fields
Hyperlinks

There are a few basic structural rules that are in play here. First, all text in a word-processing document is contained within a run. A run is a region of text with a common set of properties. The second rule is that all runs must be contained within a paragraph. A paragraph, of course, is a collection of one or more runs that is displayed as a unit (this is analogous to the HTML <p> tag).

So let's look at an example. The following text:

The quick brown fox.

would look like this in XML:
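The formatting of the sample sentence did not survive here, so as a stand-in the Python sketch below assumes that only the word "quick" is bold. The markup uses the element names of the later published schema (`w:p`, `w:r`, `w:rPr`, `w:t`), which is an assumption and may not match the Beta 1 markup exactly.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Assumed markup: three sibling runs; only the middle one has run properties.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">The </w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:t>quick</w:t></w:r>
  <w:r><w:t xml:space="preserve"> brown fox.</w:t></w:r>
</w:p>
"""

paragraph = ET.fromstring(PARAGRAPH_XML)

# Each run is a direct child of the paragraph: (text, is_bold) pairs.
runs = [
    (r.find("w:t", NS).text, r.find("w:rPr/w:b", NS) is not None)
    for r in paragraph.findall("w:r", NS)
]

# Concatenating the run text in order gives back the sentence.
text = "".join(run_text for run_text, _ in runs)
```

Every formatting change starts a new sibling run rather than wrapping text in a nested element.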

Notice that a paragraph is just a flat list of runs. There is no additional nesting, which is different from the HTML <span> model. I'm not saying one is better than the other, just pointing out that it's different.

A paragraph may be at any location that allows for block level content. For example, it could be at the top level within a story (i.e., header, footer, main document); nested within a table cell; or nested within a structured document tag or some other structured markup.

Tables

Tables in Word (at least at the base level) are fairly similar to tables in HTML. A Word table consists of a table element which can have a set of properties assigned to it. Then within the table element is a collection of rows, and within each row is a collection of cells. Here is a basic example of a table in WordprocessingML:
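As a stand-in sketch, the Python below parses a two-by-two table and reads it back row by row. The markup is assumed from the later published schema (`w:tbl`, `w:tblPr`, `w:tr`, `w:tc`) and may differ from the Beta 1 form; the cell text is made up.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Assumed markup: table properties, then rows, then cells holding paragraphs.
TABLE_XML = f"""
<w:tbl xmlns:w="{W}">
  <w:tblPr><w:tblW w:w="0" w:type="auto"/></w:tblPr>
  <w:tr>
    <w:tc><w:p><w:r><w:t>A1</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>B1</w:t></w:r></w:p></w:tc>
  </w:tr>
  <w:tr>
    <w:tc><w:p><w:r><w:t>A2</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>B2</w:t></w:r></w:p></w:tc>
  </w:tr>
</w:tbl>
"""

table = ET.fromstring(TABLE_XML)

# Rows are direct children of the table; cells are direct children of a row.
grid = [
    ["".join(t.text or "" for t in tc.iter(f"{{{W}}}t"))
     for tc in tr.findall("w:tc", NS)]
    for tr in table.findall("w:tr", NS)
]
```

The row and cell loop mirrors the table/tr/td shape familiar from HTML.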

Individual table cells can contain block level content. This means a table cell can contain not just a paragraph, but also another table. This allows for tables to be nested in other tables.

Custom Defined XML

The custom defined XML support allows users to embed their own XML within a WordprocessingML file. For example, if you wanted to have the following structure in your document:

You could just insert that XML using a custom XML tag:
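The original sample structure is not reproduced here, so the Python sketch below uses a made-up one: an `invoice` element containing a `customer` element, in a hypothetical `urn:example:invoice` namespace. The wrapper markup (`w:customXml` with `w:uri` and `w:element` attributes) is assumed from the later published schema, and `find_custom` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def w(name):
    return f"{{{W}}}{name}"


# Assumed markup: custom XML tags wrap ordinary block-level content.
BODY_XML = f"""
<w:body xmlns:w="{W}">
  <w:customXml w:uri="urn:example:invoice" w:element="invoice">
    <w:customXml w:uri="urn:example:invoice" w:element="customer">
      <w:p><w:r><w:t>Contoso</w:t></w:r></w:p>
    </w:customXml>
    <w:p><w:r><w:t>Thank you for your order.</w:t></w:r></w:p>
  </w:customXml>
</w:body>
"""

body = ET.fromstring(BODY_XML)


def find_custom(root, uri, element_name):
    """Return the text inside every custom XML tag with the given name."""
    matches = []
    for node in root.iter(w("customXml")):
        if node.get(w("uri")) == uri and node.get(w("element")) == element_name:
            matches.append("".join(t.text or "" for t in node.iter(w("t"))))
    return matches


customers = find_custom(body, "urn:example:invoice", "customer")
```

A consumer that only cares about the business data can look for its own element names and ignore the surrounding formatting markup.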

That gives you additional structure in your document, and allows you to parse the file looking for your structures.

Sections

Sections in a word-processing document specify a number of properties. By default, a document contains one section, but additional sections can be inserted to either change some of those properties for a specific portion of the document, or even just to create some additional structure (such as a page break).

The types of information that live with a section are:

Page properties (page size; page orientation; margins)
Header/footer references
Footnote/endnote properties
Column properties
Line numbering
Text direction (RTL vs. LTR; top-to-bottom vs. bottom-to-top)

There are four types of sections: Continuous; Next page (start this section on the next page); Even (start on the next even page); and Odd (start on the next odd page).

The last section of the document (which for the most part is the only section) is stored at the end of the body. All other additional sections inserted are stored as a paragraph property.
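The storage rule above can be sketched in Python: collect the section properties hanging off paragraphs, then the final one at the end of the body. The markup is assumed from the later published schema (`w:sectPr`, `w:type`), and the document content is made up.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Assumed markup: one section stored as a paragraph property, plus the
# last section stored as the final child of the body.
BODY_XML = f"""
<w:body xmlns:w="{W}">
  <w:p><w:r><w:t>Cover page</w:t></w:r></w:p>
  <w:p>
    <w:pPr>
      <w:sectPr><w:type w:val="nextPage"/></w:sectPr>
    </w:pPr>
  </w:p>
  <w:p><w:r><w:t>Main text</w:t></w:r></w:p>
  <w:sectPr><w:type w:val="continuous"/></w:sectPr>
</w:body>
"""

body = ET.fromstring(BODY_XML)

# Sections stored as paragraph properties, in document order.
sections = body.findall("w:p/w:pPr/w:sectPr", NS)

# The last section is a direct child of the body.
final_section = body.find("w:sectPr", NS)
if final_section is not None:
    sections.append(final_section)

section_types = [
    s.find("w:type", NS).get(f"{{{W}}}val") for s in sections
]
```

A document with a single section has no paragraph-level section properties at all, so the first lookup simply comes back empty.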

Headers and Footers

There are three types of headers and footers. The main one is the Odd page header. If that's the only one that exists, then it is applied to all pages of the document. Optionally, an override header can exist for the even pages, as well as for the first page. Headers are specified for each section, so if you want a different header used, you'll need to create a new section.
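The fallback rule described above can be written as a small Python function. This is only a sketch of the rule as stated here; `pick_header` and its dictionary keys are hypothetical names, and it does not model any document settings that may also control when the overrides are used.

```python
def pick_header(page_number, is_first_page_of_section, headers):
    """Choose a header for a page.

    headers maps "odd" (the main header) and, optionally, "even" and
    "first" to header content for one section.
    """
    if is_first_page_of_section and "first" in headers:
        return headers["first"]
    if page_number % 2 == 0 and "even" in headers:
        return headers["even"]
    # The odd page header is the default for every other case.
    return headers["odd"]


only_odd = {"odd": "Main header"}
with_overrides = {"odd": "Main header", "even": "Even header", "first": "Title header"}

first_page = pick_header(1, True, with_overrides)
second_page = pick_header(2, False, with_overrides)
second_page_default = pick_header(2, False, only_odd)
```

When only the odd page header exists, every page falls through to it, which matches the behavior described for the simple case.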

Headers and footers are stored in separate parts within the package. There is one part for each header and each footer. Each section then refers to its header(s) and footer(s) by an explicit relationship reference:
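As a rough sketch of how such a reference resolves, the Python below maps a section's header reference to a part name through the relationships list. The markup is assumed from the later published schema and package conventions, not taken from the Beta 1 files; note that in that later schema the reference itself carries a `w:type` attribute, which may differ from the beta markup. The part name `header1.xml` is made up.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"
HEADER_TYPE = R + "/header"
NS = {"w": W}

# Assumed markup: the relationships part that sits alongside the main document.
DOCUMENT_RELS_XML = f"""
<Relationships xmlns="{PKG_REL}">
  <Relationship Id="rId1" Type="{HEADER_TYPE}" Target="header1.xml"/>
</Relationships>
"""

# Assumed markup: the section points at the header by relationship ID.
SECTION_XML = f"""
<w:sectPr xmlns:w="{W}" xmlns:r="{R}">
  <w:headerReference w:type="default" r:id="rId1"/>
</w:sectPr>
"""

rels = ET.fromstring(DOCUMENT_RELS_XML)
section = ET.fromstring(SECTION_XML)

# Map relationship IDs to the parts they target.
target_by_id = {
    rel.get("Id"): rel.get("Target")
    for rel in rels.findall(f"{{{PKG_REL}}}Relationship")
}

reference = section.find("w:headerReference", NS)
header_part = target_by_id[reference.get(f"{{{R}}}id")]
```

The section never names the header file directly; it only holds the relationship ID, so parts can be renamed or moved by updating the relationships list alone.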

The type of the header or footer is actually declared at the root of the part.

Closing

Well, that was probably enough for one day. I know that I still kept this at a relatively high level. I'll definitely try to dig deeper into the details on the areas that folks are more interested in.

I'm going to be offline for the next week or so, but hopefully I'll have time to at least check comments every once in a while.

-Brian