Intro to Word XML Part 1: Simple Word Document

This post is for those of you interested in learning the basics behind WordprocessingML. That’s the schema that we built for Word 2003. You can save any Word document as XML, and we will use this schema to fully represent that document as XML. The new default XML format for Word 12 is going to look very similar to the WordprocessingML schema in Word 2003. The big differences are really around the use of ZIP as a container, and breaking the file out into pieces so that it’s no longer one large XML file. If you are interested in the Office 12 formats, it would be really valuable to first get familiar with the XML formats from Office 2003. Over the coming months I’ll provide more details about the 12 formats, but for the time being, I would suggest learning WordprocessingML. This post will serve as a simple introduction.

If your first exposure to Word’s XML schema came from taking an existing document and saving it out using the XML format, you probably had a bad experience. First off, we don't pretty print our files, so if you opened it in a text editor you probably had no chance of making anything out. We also save out a processing instruction (I'll post more on this later) that tells IE and the shell that it's a Word XML file. This means that if you try to open the file in IE to get their XML view, it instead will launch Word.
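For reference, here is a rough sketch of what the top of a file saved by Word 2003 typically looks like. The second line is the processing instruction mentioned above. The exact attributes on the XML declaration are from my recollection of Word's output, so treat them as an assumption rather than a spec:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<!-- the root element of the document follows here -->
```

If you delete that second line from a copy of the file, IE should fall back to its plain XML view instead of launching Word.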

The other issue is that in Word, we maintain all sorts of information about the files that you may not care about. As a result, there are a ton of XML elements saved out that can at first make the file itself look a bit intimidating. This is the difference between a full-featured format and one that isn't. We can't lose any functionality by moving to these XML formats, so we have to be able to represent everything as XML. I will show in future posts that it's also possible to save into a non-full-featured format by using XSLT on the way out. This would allow you to get a simpler file when you save, but it has the side effect of losing some functionality. That's why we would never make a non-full-featured format the default. Instead it's an optional thing.

Word's XML format is actually fairly simple if you're only trying to do simple things. You only need to expose yourself to the complex side if you're trying to do something more complex. Oftentimes, the functionality of a feature in Word is extremely complex, so the representation of that feature as XML is also complex. In future posts I'll drill into some of the areas where people have had more problems and try to better explain the mapping from the internal feature to the XML representation. For now though, let's just make a simple Word document.

Step 1: Root Element

Just like in our simple Excel example, the first thing we need to do is to create the root element for the Word document. The root element is "wordDocument", and the root namespace is "http://schemas.microsoft.com/office/word/2003/wordml". So, we should start with the following in our XML file:

<?xml version="1.0"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
</w:wordDocument>

There are three things that we just did. The first was declaring at the top that it's an XML file following the 1.0 version of the W3C XML standard (<?xml version="1.0"?>). The second was that we declared that the "w:" prefix maps to the Word namespace (xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"). And the third thing we did was to create the root element wordDocument in the Word namespace (<w:wordDocument>).
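One point worth a quick sketch: the "w" prefix itself is only a shorthand. To an XML parser, what identifies the element is the namespace URI, so the skeleton below, which uses a default namespace instead of a prefix, describes the same root element. I'm assuming here that Word, like any namespace-aware consumer, treats the two forms the same way. The rest of this post sticks with the "w:" prefix, since that is what Word itself writes out.

```xml
<?xml version="1.0"?>
<!-- Same root element, declared with a default namespace instead of the "w:" prefix -->
<wordDocument xmlns="http://schemas.microsoft.com/office/word/2003/wordml">
</wordDocument>
```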

Step 2: Document body and first paragraph

OK, so we have a skeleton document, but there is nothing in it yet. Similar to HTML, the content of the Word document is contained within a "body" tag. Within the body tag, we can have paragraphs and tables. Let's also create a paragraph element, so that our file now looks like this:

<?xml version="1.0"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p>
    </w:p>
  </w:body>
</w:wordDocument>

We now have a Word document with one empty paragraph.

Step 3: Add some content

Since this is just a simple introduction, let's keep it that way, and make this into a "hello world" example. Internally in Word, we assign formatting to text by breaking everything in the document into a flat list of runs. Each run then has a set of formatting properties associated with it. We do the same in WordprocessingML. A paragraph is made up of one or more runs of text. So, to make this Word document say "hello world", we need to add a run tag and a text tag inside our paragraph. The "hello world" text will then go inside that text tag:

<?xml version="1.0"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p>
      <w:r>
        <w:t>Hello World</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>

Go ahead and open that file in Word, and you'll see your text. Not too exciting yet, but it's a start. For the last part, let's make the text bold.
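Before we do that, one quick aside on how this scales. Here is a sketch with a second paragraph added. The extra text is made up for illustration; the point is only that each paragraph is its own w:p element inside the body, with its own run and text tag:

```xml
<?xml version="1.0"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <!-- first paragraph -->
    <w:p>
      <w:r>
        <w:t>Hello World</w:t>
      </w:r>
    </w:p>
    <!-- second paragraph; the text is a made-up example -->
    <w:p>
      <w:r>
        <w:t>This is a second paragraph.</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>
```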

Step 4: Add basic formatting

As I already mentioned, all text in a Word document is stored as a collection of runs with properties associated with them. We already created one run of text in that first paragraph, but it just used the default formatting. Let's add one more tag (<w:rPr>) inside of that run, which allows us to specify properties for that text:

<?xml version="1.0"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p>
      <w:r>
        <w:rPr>
          <w:b/>
        </w:rPr>
        <w:t>Hello World</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>

Now we've said that this run of text has bold formatting (<w:b/>) applied to it. Not the most exciting example, but we have to start somewhere. In later posts we'll go into how to create a more complicated set of formatting using multiple runs of text, as well as working with lists, tables, images, etc. It's a bit different than other document formats out there, so I want to step through everything carefully.
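As a small preview of the multiple-run idea, here is a sketch of a paragraph with two runs where only the first one is bold. The second run's text is made up for illustration, and it has no w:rPr at all, so it should pick up the default formatting:

```xml
<?xml version="1.0"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p>
      <!-- first run: bold -->
      <w:r>
        <w:rPr>
          <w:b/>
        </w:rPr>
        <w:t>Hello World</w:t>
      </w:r>
      <!-- second run: no rPr, so default formatting applies -->
      <w:r>
        <w:t>, and welcome to WordprocessingML.</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>
```

Opened in Word, this should show "Hello World" in bold, followed by the rest of the sentence in regular text on the same line.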

-Brian