Intro to Word XML Part 2: Simple Formatting

If you read Part 1 of the Word XML Introduction, you saw the basics behind a Word document, as well as how basic formatting can be applied. The Word XML schemas were designed to closely map the structures that Word uses internally to represent a document. A Word document is essentially a collection of text runs, and each text run has a collection of properties that describe how that text should be displayed. Often a text run is very long and is only broken when the paragraph ends. If the formatting changes partway through a stretch of text, though, the run has to be split at that point so the new formatting gets a run of its own. The reason is that Word doesn't apply formatting in a cascading way, as HTML does. In Word, for the most part, the formatting applied to text comes either from properties assigned to the paragraph or from properties assigned to the text directly.

Let's take an example to try to make this clear. As we saw in Part 1 of the Word XML Introduction, simple text like this: "My name is Brian Jones" would look like this in WordprocessingML:

<w:p>
<w:r>
<w:t>My name is Brian Jones</w:t>
</w:r>
</w:p>

I also showed how you could apply bold formatting to that entire run of text. What if we wanted to apply bold formatting to just a couple of words, though, so that only "name is" is bold? This is where you'll see differences between Word XML and HTML. In HTML, there would just be a <b> tag thrown around the text "name is". In Word, by applying that formatting, we've now created three separate runs of text. The WordprocessingML for this will look like:

<w:p>
<w:r>
<w:t>My</w:t>
</w:r>
<w:r>
<w:rPr>
<w:b/>
</w:rPr>

<w:t>name is</w:t>
</w:r>
<w:r>
<w:t>Brian Jones</w:t>
</w:r>
</w:p>

Go ahead and try opening this in Word. Make sure you also include the wordDocument and body tags like we used in Part 1 of the Word XML Introduction.
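For reference, a complete file for this experiment could look something like the sketch below. It wraps the three runs from above, unchanged, in the wordDocument and body tags. The XML declaration and the mso-application processing instruction are my additions rather than something taken from the sample above; the processing instruction is optional and only hints to Windows that the .xml file belongs to Word.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Optional addition (an assumption, not part of the sample above):
     tells Windows to open this .xml file with Word. -->
<?mso-application progid="Word.Document"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p>
      <!-- Run 1: plain text -->
      <w:r>
        <w:t>My</w:t>
      </w:r>
      <!-- Run 2: bold, via the run properties (rPr) -->
      <w:r>
        <w:rPr>
          <w:b/>
        </w:rPr>
        <w:t>name is</w:t>
      </w:r>
      <!-- Run 3: plain text again -->
      <w:r>
        <w:t>Brian Jones</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>
```

Save it with an .xml extension and open it in Word 2003 or later.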

Did you notice any problems when you opened that file in Word? If you look closely (or maybe you see it clearly), there are no spaces between the runs, so the text looks like this: "Myname isBrian Jones". We just need to add some trailing whitespace to the first two runs, and also specify that the XML parser should preserve our whitespace. Update the XML so that it looks like this:

<w:p>
<w:r>
<w:t xml:space="preserve">My </w:t>
</w:r>
<w:r>
<w:rPr>
<w:b/>
</w:rPr>
<w:t xml:space="preserve">name is </w:t>
</w:r>
<w:r>
<w:t>Brian Jones</w:t>
</w:r>
</w:p>

You can ask for whitespace to be preserved on just the specific w:t elements where it matters, as above, or you can declare it once globally by putting the xml:space="preserve" attribute on the root element. It's up to you.
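As a rough sketch of the global approach, the attribute can sit on the root wordDocument element, and every w:t below it inherits the setting. This is the same paragraph as above with the per-run attributes dropped; the layout and indentation are my own, and Word only cares about the whitespace inside the w:t elements.

```xml
<!-- xml:space="preserve" on the root is inherited by all descendants,
     so the trailing spaces inside each w:t are kept. -->
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
                xml:space="preserve">
  <w:body>
    <w:p>
      <w:r>
        <w:t>My </w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:b/>
        </w:rPr>
        <w:t>name is </w:t>
      </w:r>
      <w:r>
        <w:t>Brian Jones</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>
```

The per-run form is handy when you generate only a fragment of a document, while the global form saves typing when you control the whole file.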

That's how you apply formatting directly to text. Another way of applying formatting is by creating a style and referencing that style from the paragraph or from the run of text. Let's create a paragraph style that has red font coloring, and we'll then reference that style from the paragraph by updating the paragraph properties:

<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
<w:styles>
<w:style w:type="paragraph" w:styleId="myCustomStyle">
<w:name w:val="myCustomStyle" />
<w:rPr>
<w:color w:val="FF0000" />
</w:rPr>
</w:style>

</w:styles>
<w:body>
<w:p>
<w:pPr>
<w:pStyle w:val="myCustomStyle" />
</w:pPr>

<w:r>
<w:t xml:space="preserve">My </w:t>
</w:r>
<w:r>
<w:rPr>
<w:b/>
</w:rPr>
<w:t xml:space="preserve">name is </w:t>
</w:r>
<w:r>
<w:t>Brian Jones</w:t>
</w:r>
</w:p>
</w:body>
</w:wordDocument>

Now if you open that file up in Word, you should see "My name is Brian Jones" in red, with "name is" in bold. We're moving in baby steps here folks...
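The other option mentioned earlier, referencing a style from a run instead of from the paragraph, can be sketched in much the same way. The style name "myRunStyle" is made up for this illustration. The style is declared with a type of "character", and the run points at it with an rStyle element inside its run properties.

```xml
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:styles>
    <!-- "myRunStyle" is a hypothetical name used only for this sketch. -->
    <w:style w:type="character" w:styleId="myRunStyle">
      <w:name w:val="myRunStyle" />
      <w:rPr>
        <w:b/>
        <w:color w:val="FF0000" />
      </w:rPr>
    </w:style>
  </w:styles>
  <w:body>
    <w:p>
      <w:r>
        <w:t xml:space="preserve">My </w:t>
      </w:r>
      <!-- Only this run references the character style. -->
      <w:r>
        <w:rPr>
          <w:rStyle w:val="myRunStyle" />
        </w:rPr>
        <w:t xml:space="preserve">name is </w:t>
      </w:r>
      <w:r>
        <w:t>Brian Jones</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>
```

Opened in Word, only "name is" should come out bold and red here, whereas the paragraph style version above colors the whole paragraph.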

Here are a few final things you should notice from the paragraph style example:

  1. Properties: Notice that almost all of the XML that describes how your file is presented is expressed as properties. The properties on the run, in combination with the properties on the paragraph, are what gave us the rich formatting. WordprocessingML is actually a fairly simple, flat schema. There isn't a lot of nesting, and it's pretty straightforward to determine the properties that are applied to any given run. There are of course areas where it starts to get more complex, and I'll try to dig into those areas in future posts. If you have a particular area that's giving you a lot of trouble, let me know and I'll try to pull something together.
  2. Styles: The styles used in a document are all declared up at the top. This example was pretty simple, and there were actually some other styles we should have declared that we didn't. You should probably define the Normal style, which exists in every file, and then specify that "myCustomStyle" is based on the Normal style. That's not required, though, as you can see from our example. Also notice that in the style, the font color is specified within an rPr tag, just as you would if you were applying it directly to the run. Certain properties like bold and font color are considered text properties. They can be specified on a paragraph style, but not on a paragraph directly.
  3. No Mixed Content: Unlike HTML, where our example would have been something like <p>My <b>name is</b> Brian Jones.</p>, we don't use any mixed content. All text is contained within a <w:t> tag, and a text node is the only valid child of that tag. The tags within a paragraph are for the most part just a flat collection of runs. There are ways of introducing more hierarchy, but that usually comes as a result of things like the support for custom defined schemas.
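To make the suggestion in the Styles point a bit more concrete, here is a minimal sketch of a styles block that declares the Normal style and bases "myCustomStyle" on it. The Normal style is left empty on purpose; a real file saved by Word would carry more properties there. Treat this as an illustration of the shape, not as the exact markup Word writes out.

```xml
<w:styles>
  <!-- The default paragraph style; w:default="on" marks it as the default. -->
  <w:style w:type="paragraph" w:default="on" w:styleId="Normal">
    <w:name w:val="Normal" />
  </w:style>
  <!-- The custom style from the example, now explicitly based on Normal. -->
  <w:style w:type="paragraph" w:styleId="myCustomStyle">
    <w:name w:val="myCustomStyle" />
    <w:basedOn w:val="Normal" />
    <w:rPr>
      <w:color w:val="FF0000" />
    </w:rPr>
  </w:style>
</w:styles>
```

This block would replace the styles element in the example above, with the body left as it is.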

That's it for now. I'm still trying to decide what to talk about next. Most likely it will either be another approach to opening XML data in Excel (different from Part 2), or working with custom defined schema in Word. I've been focusing a lot on Office 2003 since that's what is available today, but if there are other topics you'd like to hear more about in relation to the Office 12 formats, let me know.

-Brian