Intro to Word XML Part 2: Simple Formatting


If you read Part 1 of the Word XML Introduction, you saw the basics behind a Word document, as well as how basic formatting can be applied. The Word XML schemas were designed to closely map the structures that Word uses internally to represent a document. A Word document is essentially a collection of text runs, and each run has a collection of properties that describe how its text should be displayed. A run can be very long, and often it ends only where the paragraph ends. If the formatting changes partway through a stretch of text, though, the run has to be broken at that point to account for the change. The reason is that Word doesn’t apply formatting in a cascading way, as HTML does. In Word, for the most part, the formatting applied to text comes either from properties assigned to the paragraph or from properties assigned to the text directly.


Let’s take an example to try to make this clear. As we saw in Part 1 of the Word XML Introduction, simple text like this: “My name is Brian Jones” would look like this in WordprocessingML:


<w:p>
    <w:r>
        <w:t>My name is Brian Jones</w:t>
    </w:r>
</w:p>


I also showed how you could apply bold formatting to that entire run of text. What if we wanted to apply bold formatting to just a couple of words, though, so that it looked like this: “My name is Brian Jones” (with only “name is” in bold)? This is where you’ll see differences between Word XML and HTML. In HTML, there would just be a <b> tag thrown around the text “name is”. In Word, by applying that formatting, we’ve now created three separate runs of text. The HTML for this will look like [edit: “HTML” should have been “WordprocessingML”]:


<w:p>
    <w:r>
        <w:t>My</w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>

        <w:t>name is</w:t>
    </w:r>
    <w:r>
        <w:t>Brian Jones</w:t>
    </w:r>
</w:p>


Go ahead and try opening this in Word. Make sure you also include the wordDocument and body tags, like we used in Part 1 of the Word XML Introduction.
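
If you would rather not wrap the fragment by hand, here is a minimal Python sketch of that step. It assumes the only wrapping needed is the wordDocument and body tags with the Word 2003 namespace; the helper name `wrap_in_document` and the output file name `brian.xml` are made up for illustration, and Part 1 may include extra pieces (such as processing instructions) that this sketch leaves out.

```python
import xml.etree.ElementTree as ET

# Word 2003 WordprocessingML namespace (the same one used later in this post).
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"

# The three-run paragraph from above, exactly as shown (no namespace declared yet).
PARAGRAPH = """
<w:p>
    <w:r><w:t>My</w:t></w:r>
    <w:r><w:rPr><w:b/></w:rPr><w:t>name is</w:t></w:r>
    <w:r><w:t>Brian Jones</w:t></w:r>
</w:p>
"""


def wrap_in_document(paragraph_xml):
    """Hypothetical helper: surround a fragment with wordDocument and body tags."""
    return (
        f'<w:wordDocument xmlns:w="{W_NS}">\n'
        "  <w:body>\n"
        f"{paragraph_xml.strip()}\n"
        "  </w:body>\n"
        "</w:wordDocument>\n"
    )


document = wrap_in_document(PARAGRAPH)

# Parsing the result is a cheap well-formedness check before handing it to Word.
root = ET.fromstring(document.encode("utf-8"))

if __name__ == "__main__":
    # "brian.xml" is an arbitrary file name chosen for this example.
    with open("brian.xml", "w", encoding="utf-8") as f:
        f.write(document)
```

The parse step only proves the file is well-formed XML; whether Word is happy with it is still something you have to check by opening it.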


Did you notice any problems when you opened that file in Word? If you look closely (or maybe you see it clearly), there are no spaces between the runs, so the text looks like this: “Myname isBrian Jones”. We just need to add some trailing whitespace to the first two runs, and also specify that the XML parser should preserve our whitespace. Update the XML so that it looks like this:


<w:p>
    <w:r>
        <w:t xml:space="preserve">My </w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t xml:space="preserve">name is </w:t>
    </w:r>
    <w:r>
        <w:t>Brian Jones</w:t>
    </w:r>
</w:p>


You can specify xml:space="preserve" on the specific text elements where it matters, or you can declare it once on an enclosing element so that it applies to everything inside. It’s up to you.
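
To see why the trailing spaces matter, here is a small Python sketch that glues the run text together the same way the paragraph is assembled: one run after another, with nothing added in between. The function name `paragraph_text` is just for this example. Note that Python’s ElementTree keeps the whitespace inside a text element whether or not xml:space is present; the attribute is there for consumers like Word, so this sketch only illustrates the concatenation, not Word’s whitespace handling.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.microsoft.com/office/word/2003/wordml}"
NS_DECL = 'xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"'


def paragraph_text(paragraph_xml):
    """Concatenate every w:t in document order, inserting nothing between runs."""
    p = ET.fromstring(paragraph_xml)
    return "".join(t.text or "" for t in p.iter(W + "t"))


# First attempt: no whitespace at the run boundaries.
WITHOUT_SPACES = f"""<w:p {NS_DECL}>
    <w:r><w:t>My</w:t></w:r>
    <w:r><w:rPr><w:b/></w:rPr><w:t>name is</w:t></w:r>
    <w:r><w:t>Brian Jones</w:t></w:r>
</w:p>"""

# Fixed version: trailing spaces live inside the first two w:t elements.
WITH_SPACES = f"""<w:p {NS_DECL}>
    <w:r><w:t xml:space="preserve">My </w:t></w:r>
    <w:r><w:rPr><w:b/></w:rPr><w:t xml:space="preserve">name is </w:t></w:r>
    <w:r><w:t>Brian Jones</w:t></w:r>
</w:p>"""

print(paragraph_text(WITHOUT_SPACES))  # prints: Myname isBrian Jones
print(paragraph_text(WITH_SPACES))     # prints: My name is Brian Jones
```

The indentation between the tags never shows up in the output because it sits outside the text elements; only what is inside a text element counts as document text.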


That’s how you apply formatting directly to text. Another way of applying formatting is by creating a style and referencing that style from the paragraph or from the run of text. Let’s create a paragraph style that has red font coloring, and we’ll then reference that style from the paragraph by updating the paragraph properties:


<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
    <w:styles>
        <w:style w:type="paragraph" w:styleId="myCustomStyle">
            <w:name w:val="myCustomStyle" />
            <w:rPr>
                <w:color w:val="FF0000" />
            </w:rPr>
        </w:style>
    </w:styles>
    <w:body>
        <w:p>
            <w:pPr>
                <w:pStyle w:val="myCustomStyle" />
            </w:pPr>
            <w:r>
                <w:t xml:space="preserve">My </w:t>
            </w:r>
            <w:r>
                <w:rPr>
                    <w:b/>
                </w:rPr>
                <w:t xml:space="preserve">name is </w:t>
            </w:r>
            <w:r>
                <w:t>Brian Jones</w:t>
            </w:r>
        </w:p>
    </w:body>
</w:wordDocument>


Now if you open that file up in Word, you should see the following: “My name is Brian Jones”, all in red, with “name is” in bold. We’re moving in baby steps here, folks…
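
If it helps to see the two sources of formatting side by side, here is a rough Python sketch that works out the properties for each run in the example: the text properties from the paragraph’s style, overlaid with whatever is set directly on the run. The names `props` and `effective_runs` are made up for this example, and the logic is a deliberate simplification. It ignores run styles, style inheritance, document defaults and toggle values, so treat it as an illustration rather than a description of how Word itself resolves formatting.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.microsoft.com/office/word/2003/wordml}"

DOC = """<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
    <w:styles>
        <w:style w:type="paragraph" w:styleId="myCustomStyle">
            <w:name w:val="myCustomStyle" />
            <w:rPr><w:color w:val="FF0000" /></w:rPr>
        </w:style>
    </w:styles>
    <w:body>
        <w:p>
            <w:pPr><w:pStyle w:val="myCustomStyle" /></w:pPr>
            <w:r><w:t xml:space="preserve">My </w:t></w:r>
            <w:r><w:rPr><w:b/></w:rPr><w:t xml:space="preserve">name is </w:t></w:r>
            <w:r><w:t>Brian Jones</w:t></w:r>
        </w:p>
    </w:body>
</w:wordDocument>"""


def props(rpr):
    """Map an rPr element to {local name: w:val}, or True when there is no w:val."""
    if rpr is None:
        return {}
    return {child.tag[len(W):]: child.get(W + "val", True) for child in rpr}


def effective_runs(doc_xml):
    """Return (text, properties) per run: style properties first, run properties on top."""
    root = ET.fromstring(doc_xml)
    styles = {s.get(W + "styleId"): props(s.find(W + "rPr"))
              for s in root.iter(W + "style")}
    result = []
    for p in root.iter(W + "p"):
        p_style = p.find(W + "pPr/" + W + "pStyle")
        base = styles.get(p_style.get(W + "val"), {}) if p_style is not None else {}
        for r in p.findall(W + "r"):
            merged = {**base, **props(r.find(W + "rPr"))}
            result.append((r.findtext(W + "t", default=""), merged))
    return result


for text, formatting in effective_runs(DOC):
    print(repr(text), formatting)
```

Every run picks up the color from the style, and only the middle run adds bold on top of it. There is one lookup on the paragraph and one on the run, with no chain of nested tags to walk.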


Here are a couple final things you should notice from this example:



  1. Properties: Notice that almost all of the XML describing how your file is presented takes the form of properties. The properties on the run in combination with the properties on the paragraph are what gave us the rich formatting. WordprocessingML is actually a fairly simple flat schema. There isn’t a lot of nesting and it’s pretty straightforward how you determine the properties that are applied to any given run. There are of course areas where it starts to get more complex, and I’ll try to dig into those areas in future posts. If you have a particular area that’s giving you a lot of trouble, let me know and I’ll try to pull something together.
  2. Styles: The styles used in a document are all declared up at the top. This example was pretty simple, and there were actually some other styles we should have declared but didn’t. You should probably define the Normal style, which exists in every file, and then specify that “myCustomStyle” is based on the Normal style. That’s not required though, as you can see from our example. Also notice that in the style, the font color is specified within an rPr tag, just as it would be if you applied it directly to the run. Certain properties like bold and font color are considered text properties. They can be specified on a paragraph style, but not on a paragraph directly.
  3. No Mixed Content: Unlike HTML where our example would have been something like: <p>My <b>name is</b> Brian Jones.</p> we don’t use any mixed content. All text is contained within a <w:t> tag, and a text node is the only valid child of that tag. The tags within a paragraph are for the most part just a flat collection of runs. There are ways of introducing more hierarchy, but that usually comes as a result of things like the support for custom defined schemas.
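
To make the “no mixed content” point concrete, here is a minimal Python sketch that turns the flat list of runs back into the HTML-style mixed content shown in the third item. The function name `paragraph_to_html` is made up for this example, and it only understands bold; it is meant to show the shape of such a conversion, not to be a real converter.

```python
import xml.etree.ElementTree as ET
from html import escape

W = "{http://schemas.microsoft.com/office/word/2003/wordml}"

PARAGRAPH = """<w:p xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
    <w:r><w:t xml:space="preserve">My </w:t></w:r>
    <w:r><w:rPr><w:b/></w:rPr><w:t xml:space="preserve">name is </w:t></w:r>
    <w:r><w:t>Brian Jones</w:t></w:r>
</w:p>"""


def paragraph_to_html(paragraph_xml):
    """Walk the flat runs and emit HTML mixed content (bold only)."""
    p = ET.fromstring(paragraph_xml)
    parts = []
    for r in p.findall(W + "r"):
        # All of a run's text lives in its w:t children.
        text = escape("".join(t.text or "" for t in r.findall(W + "t")))
        rpr = r.find(W + "rPr")
        bold = rpr is not None and rpr.find(W + "b") is not None
        parts.append(f"<b>{text}</b>" if bold else text)
    return "<p>" + "".join(parts) + "</p>"


print(paragraph_to_html(PARAGRAPH))  # prints: <p>My <b>name is </b>Brian Jones</p>
```

Because each run carries its own properties, the loop never has to track open tags or nesting depth; every run is handled on its own. The trailing space ends up inside the bold tag simply because that is where it sits in the run.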

That’s it for now. I’m still trying to decide what to talk about next. Most likely it will either be another approach to opening XML data in Excel (different from Part 2), or working with custom defined schema in Word. I’ve been focusing a lot on Office 2003 since that’s what is available today, but if there are other topics you’d like to hear more about in relation to the Office 12 formats, let me know.


-Brian

Comments (11)

  1. Darryl Hover says:

    Thanks Brian. Having things broken down into little chunks like this makes it easier to soak in.

    The last sentence in the third paragraph reads "The HTML for this will look like:". I think you meant "The WordprocessingML…", no?

    Good stuff.

    Darryl

  2. BrianJones says:

    Thanks Darryl, I updated the original post to note that it should have read "WordprocessingML".

  3. Eugen Bacic says:

    If you’re taking a poll, I’ll vote for "working with custom defined schema in Word".

    And thanks for these articles, it does make things easier to sort through than the schemas themselves.

  4. BrianJones says:

    Thanks Eugen, I’ll show some basic stuff with custom defined schemas in the next "Intro to Word XML" post.

  5. Wow… I must say, I am itchin to get this onto my desk. 🙂

    I run several websites using TextPattern ( http://www.textpattern.com/ ). TXP uses its own formatting called TexTile. It is great for formatting, but when I have dozens of articles to publish, I like to use my word processor (MS Word). Unfortunately, I can’t produce clean HTML that can be easily parsed to TexTile.

    With this new XML format, I can easily see a simple convert to reformat a document with any tags I want. Custom X/HTML, CSS and HTML, TexTile, etc.

  6. Ryan Ackley says:

    How do you plan on dealing with creating a Table of Contents in XML? Right now, a user must manually update the Table Of Contents (press F9) when the document changes or write a macro to keep it accurate.

    So let’s assume a TOC was created in the new XML. The way Word works currently, the first time that document was opened, the user would have to select the TOC and press F9 before the TOC would contain accurate page numbers.

    Will Office 12 automatically update the TOC on open? Will earlier versions of Office be retro-fitted to deal with this?

  7. BrianJones says:

    Eric, that’s one of the many great things you can do with this new format. I’ve already seen a number of filters (aka transforms) that go from WordML into other formats and back into WordML. Now that the formats are open and fully documented, anyone can do just that.

    Ryan, that’s a great question. I’ve been looking at adding a way to specify on a field that it should be updated on open without the user needing to take any actions. There are obviously security issues anytime you do an automatic update so it would have to only be on specific types of fields (TOC is a great example).

    -Brian

  8. Zander Westendarp says:

    We are trying to use Word as our XML editor, and we apply our own schema. Our schema uses mixed content for our <para> element, enabling authors to tag runs with emphasis or inline URLs, for example. When saved as Data-only, we lose the significant whitespace around the contained elements. We then have to add it back programmatically in the XSLT, which is very messy with foreign languages!

    Correct me if I’m wrong, but what I hear you saying is that Office 12 will continue with this "data loss" implementation. That means our choices are:

    1. Switch to WordProcessingML

    2. Switch to a different XML editor

  9. Jasper says:

    Great information… I need to know how to get tabspacing values in Open Office XML format……