Intro to Word XML Part 1: Simple Word Document


This post is for those of you interested in learning the basics behind WordprocessingML. That’s the schema that we built for Word 2003. You can save any Word document as XML, and we will use this schema to fully represent that document as XML. The new default XML format for Word 12 is going to look very similar to the WordprocessingML schema in Word 2003. The big differences are really around the use of ZIP as a container, and breaking the file out into pieces so that it’s no longer one large XML file. If you are interested in the Office 12 formats, it would be really valuable to first get familiar with the XML formats from Office 2003. Over the coming months I’ll provide more details about the 12 formats, but for the time being, I would suggest learning WordprocessingML. This post will serve as a simple introduction.


If your first exposure to Word’s XML schema came from taking an existing document and saving it out using the XML format, you probably had a bad experience. First off, we don’t pretty print our files, so if you opened it in a text editor you probably had no chance of making anything out. We also save out a processing instruction (I’ll post more on this later) that tells IE and the shell that it’s a Word XML file. This means that if you try to open the file in IE to get their XML view, it instead will launch Word.
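

To make that concrete, here is roughly what the top of a file saved by Word 2003 looks like once it has been pretty printed. The second line is the processing instruction that routes the file to Word. This is a trimmed sketch rather than a full saved file: the body is a placeholder, and a real file carries many more namespaces and elements.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- The next line is the processing instruction that tells IE and the shell to hand the file to Word -->
<?mso-application progid="Word.Document"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
    <w:body>
        <!-- Placeholder content for illustration only -->
        <w:p>
            <w:r>
                <w:t>Placeholder text</w:t>
            </w:r>
        </w:p>
    </w:body>
</w:wordDocument>
```

If you just want to read a file in IE's XML view, working on a copy with that processing instruction removed is one way to keep it from launching Word.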


The other issue is that in Word, we maintain all sorts of information about the files that you may not care about. As a result, there are a ton of XML elements saved out that can at first make the file itself look a bit intimidating. This is the difference between a full featured format, and one that isn’t. We can’t lose any functionality by moving to these XML formats, so as a result, we have to be able to represent everything as XML. I will show in future posts that it’s also possible to save into a non-full featured format by using XSLT on the way out. This would allow you to get a simpler file when you save, but it has the side effect of losing some functionality. That’s why we would never do a non-full featured format as a default. Instead it’s an optional thing.
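

As a small illustration of the XSLT idea, the sketch below is an identity transform that copies a WordprocessingML document through unchanged except for one kind of element. Picking w:proofErr (the spelling and grammar markers) as the thing to drop is just an assumption for the example, not a recommendation about what a simplified format should contain.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">

    <!-- Identity rule: copy every node and attribute through as-is -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- Empty rule: anything matched here produces no output, so the
         proofing markers are dropped (element choice is illustrative) -->
    <xsl:template match="w:proofErr"/>

</xsl:stylesheet>
```

The more specific w:proofErr rule wins over the generic node() rule, so everything else in the file survives. Adding further empty templates strips more, at the cost of whatever functionality those elements carried.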


Word’s XML format is actually fairly simple if you’re only trying to do simple things. You only need to expose yourself to the complex side if you’re trying to do something more complex. Often times, the functionality of a feature in Word is extremely complex, so as a result, the representation as XML of that feature is also complex. In future posts I’ll drill into some of the areas where people have had more problems and try to better explain the mapping from the internal feature to the XML representation. For now though, let’s just make a simple Word document.


Step 1: Root Element


Just like in our simple Excel Example, the first thing we need to do is to create the root element for the Word document. The root element is “wordDocument”, and the root namespace is “http://schemas.microsoft.com/office/word/2003/wordml”. So, we should start with the following in our XML file:


<?xml version="1.0"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
</w:wordDocument>


There are three things that we just did. The first was declaring at the top that it’s an XML file following the 1.0 version of the W3C XML standard (<?xml version="1.0"?>). The second was that we declared that the “w:” prefix maps to the Word namespace (xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"). And the third thing we did was to create the root element wordDocument in the Word namespace (<w:wordDocument>).


Step 2: Document body and first paragraph


OK, so we have a skeleton document, but there is nothing in it yet. Similar to HTML, the content of the Word document is contained within a “body” tag. Within the body tag, we can have paragraphs and tables. Let’s also create a paragraph element, so that our file now looks like this:


<?xml version="1.0"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
    <w:body>
        <w:p>
        </w:p>
    </w:body>
</w:wordDocument>


We now have a Word document with one empty paragraph.


Step 3: Add some content


Since this is just a simple introduction, let’s keep it that way, and make this into a “hello world” example. Internally in Word, we assign formatting to text by breaking everything in the document into a flat list of runs. Each run then has a set of formatting properties associated with it. We do the same in WordprocessingML. A paragraph is made up of one or more runs of text. So, to make this Word document say “hello world”, we need to add a run tag and a text tag inside our paragraph. The “hello world” text will then go inside that text tag:


<?xml version="1.0"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
    <w:body>
        <w:p>
            <w:r>
                <w:t>Hello World</w:t>
            </w:r>
        </w:p>
    </w:body>
</w:wordDocument>


Go ahead and open that file in Word, and you’ll see your text. Not too exciting yet, but it’s a start. For the last part, let’s make the text bold.


Step 4: Add basic formatting


As I already mentioned, all text in a Word document is stored as a collection of runs with properties associated with them. We already created one run of text in that first paragraph, but it just used the default formatting. Let’s add one more tag (<w:rPr>) inside of that run, which allows us to specify properties for that text:


<?xml version="1.0"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
    <w:body>
        <w:p>
            <w:r>
                <w:rPr>
                    <w:b/>
                </w:rPr>
                <w:t>Hello World</w:t>
            </w:r>
        </w:p>
    </w:body>
</w:wordDocument>


Now we’ve said that that run of text has bold formatting (<w:b/>) applied to it. Not the most exciting example, but we have to start somewhere. In later posts we’ll go into how to create a more complicated set of formatting using multiple runs of text, as well as working with lists, tables, images, etc. It’s a bit different from other document formats out there, so I want to step through everything carefully.
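

As a small preview of the multiple-run idea, here is a sketch of one paragraph split into two runs, each with its own properties. The split point and the use of italic (<w:i/>) on the second run are just choices made for this illustration. The space sits in the middle of the second run's text rather than at a run boundary, so the example doesn't depend on how leading or trailing whitespace is handled.

```xml
<?xml version="1.0"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
    <w:body>
        <w:p>
            <!-- First run: bold -->
            <w:r>
                <w:rPr>
                    <w:b/>
                </w:rPr>
                <w:t>Hello</w:t>
            </w:r>
            <!-- Second run: italic, with its own separate property set -->
            <w:r>
                <w:rPr>
                    <w:i/>
                </w:rPr>
                <w:t>, World</w:t>
            </w:r>
        </w:p>
    </w:body>
</w:wordDocument>
```

Each run carries its complete set of properties. Nothing is inherited from the run before it, which is why the second run is italic but not bold.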


-Brian

Comments (18)

  1. Mike DeKoning says:

    Brian-

    Thanks for all of the info you’re posting. In the last code snip I believe you’re missing the slash to close the <w:rPr> tag surrounding the bold formatting. XML is just SO picky! 😉

  2. BrianJones says:

    Thanks Mike! I was going to wait to do any formatting examples until a later post, but decided to add that in at the last minute. I didn’t even bother trying to make sure it worked… my bad 🙂

    I just updated the original post so it should work now, thanks again.

    -Brian

  3. Glen Starrett says:

    Will there be a sample XSLT that might help remove the grammar and spellcheck elements? They wrap the text and create quite a mess in a technical document.

  4. For now, this looks like a good place to start reading. Brian Jones: Office XML Formats

    The following articles are up at the moment.

    Excel: Introduction to Excel…

  5. If you read Part 1 of the Word XML Introduction, you saw the basics behind a Word document, as well as…

  6. When we built the support for customer defined schemas into Word 2003 there were a couple scenarios we…

  7. Slashdot: MS Office XML Format Now in TextEdit

    I saw this the other day on slashdot. I have to admit…

  8. Colin Fox says:

    Is this snippet correct? If so, then I don’t see how the bold tag will apply to the hello world text, since it is in a nested rPr scope prior to the Hello World:

    <w:rPr>

    <w:b/>

    </w:rPr>

    <w:t>Hello World</w:t>

    Shouldn’t it be:

    <w:rPr>

    <w:b><w:t>Hello World</w:t></w:b>

    </w:rPr>

    ?

  9. BrianJones says:

    Hi Colin, yes, the snippet is correct. Don’t think of it as the "bold tag applying to the text". Instead, think of the bold tag as a property of the text run.

    This is how formatting works in Word. The document is a collection of text runs contained within paragraphs. Formatting can either be inherited from the paragraph, or can be applied directly. When formatting is applied directly to text, that will cause the selected text to be broken out into its own run.

    The <w:r> is the actual "object", where the <w:rPr> stores the properties and the <w:t> stores the text for that run.

    Does that make sense?

    -Brian

  10. Hi Brian,

    I’m playing with the new format and the examples you gave because I want to prepare a filter which would allow me to go from WordML to XLIFF and back, possibly using a relatively small XSLT.

    The problem is that with the Polar Word2000 AddIn I get a nasty XML, cluttered with duplicate formatting tags:

    <w:r><w:rPr><w:sz w:val="27"/></w:rPr>

    <w:t>With</w:t></w:r>

    <w:r><w:rPr><w:sz w:val="27"/></w:rPr>

    <w:t> </w:t></w:r>

    <w:r><w:rPr><w:sz w:val="27"/></w:rPr>

    <w:t>your</w:t></w:r>

    I understand that this could be reformatted (without losing the formatting) to:

    <w:r><w:rPr><w:sz w:val="27"/></w:rPr>

    <w:t>With your</w:t></w:r>

    This transformation would allow me to simply extract the contents of w:t with same formatting (same contents of w:r) because this would greatly reduce the number of tags inside XLIFF file.

    Now, my questions:

    1. Is it easy to do such cleaning (reformatting) using XSLT? I tried different methods but I couldn’t make it.

    2. Are Word 2003 and Office ML likely to produce the same kind of output as the Polar WordML add-in? Maybe I’m looking for a solution to an artificial problem…

    Thanks,

    Marcin

  11. BrianJones says:

    Hey Marcin, I haven’t used Polar’s add-in, so I’m not sure why those runs are broken out like that.

    To clean that up, you could use an XSLT, but I’m not sure how difficult it would be. You would need to continue to look ahead and do a compare for each run and grab all the text from future runs that match. I think it would actually be really easy using XPath 2.0, but you’d have to find a tool that supported that.

    For Word 2003, we try to optimize our output so that runs aren’t broken out like that. I think we still miss some cases, but for the most part the output should be closer to your second example.

    -Brian

  12. The Monster says:

    Why in the heck is the formatting going into a sister element to the text element, instead of the nested <b> make this bold </b> construct we’re all familiar with from (x)html (and for old-school types, in Word Perfect codes that make so much more sense)? This would seem to force nested formatting information to be repeated for each ‘run’, unnecessarily bloating the file with the same attributes over and over.

  13. BrianJones says:

    Actually, there are a number of reasons it was done this way and it has a number of benefits. The first benefit is that there is no mixed content. There is just a flat list of runs that have properties associated with them. When you get into the mixed content world it actually can become really ugly if there are other complex structures your application supports. For example, when you try to mix in additional structures like custom defined schema you need to do a lot of work to maintain well formedness. You need to know how many different tags you should close out before you write the start tag for the custom structure, and then do the same when you write the closing tag. For us, it’s always the same procedure, the single run tag is closed, and the custom schema tag is then inserted.

    One of the main reasons we do it this way is that it’s the way the formatting is stored internally in Word. It makes for a very simple structure, and there is a very short list of areas you need to look to understand what formatting is applied to any given location. There are only the run properties and the paragraph properties. There are no other nested structures that can apply formatting.

    If you are interested, I can definitely go into more details on this.

    -Brian

  14. Sir,

    I have one question. I attached a schema to a Word document and saved the document as an XML file. So far so good. But when I tried to open the copied Word document on a new system, the URI and alias name were shown as unavailable. Is there any way to keep the schema along with the Word document wherever I open it? Please help me solve this issue.

    Thanks in advance

    Robin Joseph Varughese

    robinjoseph.v@arisglobal.co.in

  15. Mahesh says:

    Hi Sir,

    Is it possible in WordprocessingML to insert the content of another Word document into a WordML target file?

    I will have an XML file that contains the URL of another Word document. Using XSLT I will convert it to WordML format, but I can’t work out how to insert the content of that other document.
