Manage your documents: Using Office XML formats and XSLTs

One of the main reasons we announced this new file format so early was that we wanted to give people an opportunity to start building different types of solutions on top of the file formats. I’m pushing for an early release of the schemas (sometime before Beta 1), but that still leaves us with a few months before they would be out. So, in the meantime, the best way to start playing around with potential solutions is using Office 2003. There is already a ton of XML support in that product.

While the announcement of these new default XML formats is a big deal, it is definitely not the first time we’ve worked with XML. In Office 2000 (which we started developing in 1997) we built an HTML format that leveraged XML for representing things like document properties and other application-specific information. This was done because HTML didn’t support all of our features and we didn’t want people to lose information when saving as a web page. It was unfortunate because it didn’t look like “pure” HTML, but it was necessary to support our customers’ data. Starting in 1999 we began building the SpreadsheetML format that shipped with Excel in Office XP. Then in 2001 we started working on the WordprocessingML format, which is now available in Word 2003. So, as you can see, we’ve been doing stuff with XML in Office for the past 8 years.

Why the brief history lesson, you ask? It’s important to understand that the new formats coming out with Office 12 are based on the work we’ve done up through Office 2003. So, if you build solutions on top of Word 2003’s XML, those will map fairly easily into the new file formats. For Word, the only big difference with the new format is that we break the single XML file into multiple files and wrap them all up in a ZIP package (we’ve actually designed a logical model for structuring documents from multiple pieces, which we then mapped into ZIP). Today I want to show an example of something you can do with WordprocessingML in Word 2003.

There have been a number of questions around support for other XML formats (there are tons of them out there). As I’ve described, since the formats are XML and fully documented, anyone can build transforms to go from our format into another (or vice versa). I decided I would post a really simple transform that runs against Word 2003 XML just to give folks an example. This transform will get rid of all the tracked changes and comments in a file. It does the exact same thing as if you were editing the file directly in Word and chose to accept all revisions.

This transform is something that people could leverage as part of a workflow process. Imagine if you had documents you wanted to publish and you wanted to make sure there weren’t any deletions or comments in the files. I’m sure you’ve heard of people getting burned by posting documents on a server that had deletions in them. Oftentimes the end user didn’t realize the deletions were still there, and there wasn’t an easy way for administrators to write an automated process to remove those deletions. Well, using XML, it’s easy to write solutions that manipulate Office documents without having to run the applications themselves.

Here are the steps for trying this solution out:

  1. Download this ZIP file and put the two enclosed files on your desktop.
  2. Open the file called “FileFormatDev.xml” in Word 2003. Notice that there are a bunch of comments and deletions.
  3. Open the file in Internet Explorer (or any text / XML editor) and look at its contents. There is a ton of XML there, but you only really need to care about certain parts. Do a search for “aml:” and you’ll see all the tags we use for representing those comments and revisions.
  4. Now open the XSLT “acceptRevisionsAndDeleteComments.xslt” in a text editor or IE, and take a look. It’s a pretty simple transform. If you are familiar with XSLT, you’ll see that all this does is rewrite all the WordprocessingML except for the comments and revisions. It strips those out.
  5. You’ll now need to apply the XSLT to the Word document to remove those comments and revisions. There are a number of ways you can do this. Most XML parsers out there can do this for you. You can also use Word to do this directly. In Word, we allow you to save XML files through transforms as well as open them through transforms. That’s what we’ll do just to keep it simple.
  6. Open the Whitepaper in Word again and go to the Save As… dialog (File -> Save As…). The file type in that dialog should be “XML Document (.xml)”.
  7. Notice that there is a checkbox in the dialog called “Apply Transform”. Go ahead and select that, and you will then have the “Transform…” button enabled.
  8. Click on the “Transform…” button and go find the transform that you downloaded in Step 1.
  9. You’ve now told Word to save the file as an XML document, and then after the save is done, apply the specified transform. That means that if the XSLT does its job right, you should get a WordXML file that has all the comments and revisions removed.
  10. Rename the file so you can compare the results, then press the “Save” button. There will be a warning letting you know you are saving through a transform and that some of the document information might be lost. Go ahead and press “OK”. Once the file is saved, go ahead and shut down Word.
  11. Open the file in Word again, and you’ll see that the comments and revisions are now gone. Remember that while in these steps we applied the XSLT with Word, you can do it anywhere. You don’t need to have Word on the machine. You could use any XML parser that supports XSLT and apply it to your documents.

    1. As a quick aside, you may notice that the XML file that you saved doesn’t open as easily in IE or a text editor. There are two things going on here. The first is that we put the following PI (processing instruction) at the top of our files: <?mso-application progid="Word.Document"?>. We actually have a shell handler that sees this in the XML and associates the file with Word (even though the extension is just .xml). We do something similar with our HTML files. The problem is that if you try to open it in Internet Explorer, it will see that PI and hand the file off to Word. You can open the file in Notepad and delete the PI, and then it will open in IE.
    2. The other thing you may notice if you open the file in an XML or text editor is that it’s just one long stream of text. We don’t “pretty print” our XML files, so if you look at them as plain text, they’re hard to read. We do this because it improves the performance of saving and loading. It makes it a bit more difficult to work with though. One option is to open it in IE (after removing the PI) since IE will apply a transform to lay it out better. Another option is to use an XML editor (Visual Studio, FrontPage, etc.) that gives you the option to format the file. That will apply “pretty printing” to the document for easier reading.
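
If you would rather script those two cleanup steps than do them by hand, here is a minimal sketch in Python using only the standard library. It removes the mso-application PI and returns an indented copy of the XML. The function name `readable_copy` and the one-paragraph sample document are made up for illustration, and the output is meant for reading only, since the added whitespace changes the file.

```python
from xml.dom import minidom


def readable_copy(xml_source):
    """Return an indented copy of a Word 2003 XML document without the Word PI.

    xml_source may be bytes (as read from disk) or str.
    """
    doc = minidom.parseString(xml_source)
    # The PI sits at document level, before the root element.
    for node in list(doc.childNodes):
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "mso-application"):
            doc.removeChild(node)
    # toprettyxml() adds line breaks and indentation between elements.
    return doc.toprettyxml(indent="  ")


# Hypothetical one-paragraph document standing in for a real saved file.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<?mso-application progid="Word.Document"?>'
    '<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">'
    '<w:body><w:p><w:r><w:t>Hello world</w:t></w:r></w:p></w:body>'
    '</w:wordDocument>'
)

if __name__ == "__main__":
    print(readable_copy(SAMPLE))
```

For a real file you would read the bytes from disk and pass them in, for example `readable_copy(open("FileFormatDev.xml", "rb").read())`.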

So, that’s just one example of writing a tool that manipulates a Word document. If you were going to try to do something like this with the binary formats, it would have been extremely difficult. Most people that are trying to do this today usually end up writing code that automates the Office applications. The advantage with the XSLT is that you don’t need to have the Office applications involved (in the demo we had Word apply the XSLT, but you could have used any number of tools to do it).
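
To make the "no Office applications involved" point concrete, here is a rough sketch of the same kind of cleanup done in Python with the standard library. This is not the XSLT from the download; it is a plain tree walk swapped in for illustration. It assumes the Word 2003 namespace URIs shown in the code and the annotation types `Word.Insertion`, `Word.Deletion`, `Word.Formatting`, and `Word.Comment*`. The names `accept_revisions` and `SAMPLE` are made up, and the sketch does not merge paragraphs when a deleted paragraph mark joins two of them.

```python
import xml.etree.ElementTree as ET

# Namespace URIs assumed for Word 2003 XML.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
AML = "http://schemas.microsoft.com/aml/2001/core"
ET.register_namespace("w", W)
ET.register_namespace("aml", AML)

ANNOTATION = f"{{{AML}}}annotation"
CONTENT = f"{{{AML}}}content"
TYPE = f"{{{W}}}type"


def _clean(parent):
    """Rebuild parent's child list with revisions accepted and comments dropped."""
    kept = []
    for child in list(parent):
        _clean(child)  # handle nested annotations first
        if child.tag != ANNOTATION:
            kept.append(child)
            continue
        kind = child.get(TYPE, "")
        if kind == "Word.Insertion":
            # Accept the insertion: keep its content, drop the wrapper.
            content = child.find(CONTENT)
            if content is not None:
                kept.extend(list(content))
        elif kind in ("Word.Deletion", "Word.Formatting") or kind.startswith("Word.Comment"):
            continue  # drop deletions, formatting changes, and comments
        else:
            kept.append(child)  # leave any other annotation type alone
    parent[:] = kept


def accept_revisions(xml_source):
    """Return the cleaned document as a string.

    Note: ET.fromstring() discards processing instructions, so the
    mso-application PI is not carried over to the result.
    """
    root = ET.fromstring(xml_source)
    _clean(root)
    return ET.tostring(root, encoding="unicode")


# Hypothetical paragraph with one deletion, one insertion, and a comment marker.
SAMPLE = (
    f'<w:wordDocument xmlns:w="{W}" xmlns:aml="{AML}"><w:body><w:p>'
    '<w:r><w:t>Keep this. </w:t></w:r>'
    '<aml:annotation aml:id="0" w:type="Word.Deletion"><aml:content>'
    '<w:r><w:delText>Drop this. </w:delText></w:r></aml:content></aml:annotation>'
    '<aml:annotation aml:id="1" w:type="Word.Insertion"><aml:content>'
    '<w:r><w:t>Added this.</w:t></w:r></aml:content></aml:annotation>'
    '<aml:annotation aml:id="2" w:type="Word.Comment.Start"/>'
    '</w:p></w:body></w:wordDocument>'
)

if __name__ == "__main__":
    print(accept_revisions(SAMPLE))
```

A script like this could run on a server as part of a publishing workflow, with no copy of Word on the machine.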

Let me know if you guys have any questions or if the XSLT doesn’t work for you. I think in my next post I’ll talk more about the Word schema and how we designed it. At first glance it’s a fairly intimidating schema, but as you learn about it, it’s pretty basic and straightforward. There are just a ton of features in Word, so we had to create XML to represent them all. That doesn’t mean that you need to deal with them all though if you’re just trying to do something simple. Also, does anyone feel like it would be useful to have some posts talking about more of the basics around XML? Or does everyone feel like they are already up to speed on everything I’ve discussed and just want to see more technical posts?


Comments (24)

  1. Bob ?:-) says:

    Thank you for that example Brian.

    After doing a quick run through the steps in Word 2003 Pro, I was surprised that after processing the XML file with your XSLT file to strip *out* the track-change (aml) revisions, the resulting XML file grew from 498K to 648K. (Haven’t had a chance to look into ‘why’ at this point <g>)

    On the other hand, your original zip at only 38K does show how future .DOCX files, for example can often be smaller than original .doc files.

    BTW, it was mentioned that one of the reasons for ZIP was to leverage that built in capability in WinXP. Since Word 2000 and 2002’s can be used on Win98, NT, ME and 2000 will the add-in ‘patches’ to those earlier versions of Word have a ‘hook’ to call a ZIP utility of some type? Also will the Office apps/patches be able to handle zips that use higher compression rather than the ‘default’ one?

    Bob Buckland ?:-)

  2. Gene Myers says:

    Hi Brian-

    Great blog.

    I should point out that in point 11, you said, "Open the file in Word again, and you’ll see that the comments and revisions are now gone." The name of the transform was acceptRevisions original.xslt. It has stripped out just the tracked changes, but not the comments.



  3. BrianJones says:

    Sorry everyone, I posted the wrong XSLT. It will still remove the tracked changes, but it leaves the comments.

    In addition, it’s a bit more complex than it needs to be. I’m at home right now so I can’t update it. Feel free to still play with it though, as it still does most of what I had described. I’ll post a more up to date one when I get in tomorrow morning.

  4. BrianJones says:

    OK, I just updated it. The XSLT should now remove comments. It’s also a bit easier to look at and figure out what’s going on.

    Bob – Office isn’t using the ZIP technology that comes with WinXP, so we aren’t limited to only that platform.

  5. Bruce Rindahl says:

    Two questions. Will the new XML format for Excel be similar to SpreadsheetML or a totally new schema? Also how are you going to handle the ZIP/XML package format in MSXML? For example, to open an XML document in XSLT you use the document() function which is expecting a well-formed single XML document. How are you going to expose an Office Open XML to this kind of call?? If there is no support for this in XSLT then you have to unzip the files to get at the particular XML file you need. This pretty much cancels out the benefit of the compression or the format.

  6. BrianJones says:

    Hey Bruce. The new schema for Excel will be different from the existing SpreadsheetML schema. There will be some similarities, but it will be much more aligned with how Excel internally represents the grid.

    Your second question is a great question. There are a couple alternatives here. If you want to operate on the file as a single XML file, we will have a serialization method you can run that will convert the ZIP package into a single XML file. This is what you would do if you just wanted to run a single XSLT against the thing. As you say though, that cancels out the packaging and compression benefits of the format.

    Alternatively, you can use System.IO.Packaging provided in the WinFx SDK to navigate the ZIP package and relationships to access each part that makes up the file. If you are building a solution on top of the format, this will often be the better way to go. It’s also just ZIP, so you could use any existing ZIP library out there if you didn’t want to use the WinFx SDK. If you instead just want to apply a single XSLT, though, then the serialization format would be what you want.
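
As a small illustration of the "it’s just ZIP" point, here is a sketch in Python using the standard `zipfile` module rather than System.IO.Packaging. The part names `main.xml` and `props.xml` are placeholders invented for the demo, not the real part names of the new format, and the sketch ignores the relationship model entirely.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def list_parts(package_bytes):
    """Return the names of all parts stored in a ZIP-based package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.namelist()


def read_part(package_bytes, part_name):
    """Parse one XML part straight out of the package, without unzipping to disk."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        with package.open(part_name) as part:
            return ET.parse(part).getroot()


def make_demo_package():
    """Build a tiny in-memory package with two made-up XML parts."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("main.xml", "<doc><p>Hello</p></doc>")
        package.writestr("props.xml", "<props><title>Demo</title></props>")
    return buffer.getvalue()


if __name__ == "__main__":
    demo = make_demo_package()
    print(list_parts(demo))
    print(read_part(demo, "main.xml").find("p").text)
```

Only the part you ask for is decompressed, so a tool that needs one piece of a large document does not have to load the rest.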


  7. Jan Fransen says:


    I’m catching up on your posts as I take a break from working my way through a real-world WordML/XSLT application. I’d love to see as much information and as many examples as you’d care to post showing how XSLT can be used to manipulate WordML.

    Of course, everyone building Word solutions will want more documentation on the parts of the schema that address their own specific needs. So I might as well share just a bit about the specific solution I’m working on: We’re sending a Word document to PDF format and taking tracked changes along for the ride as PDF Comments. Our latest approach is to use XSLT to transform the document twice: once to mark each revision in the document by either changing the font color (insertions) or adding a 1-pixel character (deletions), and once to provide an XML document with information about each revision. From there another process will read the PDF and use the colors, marks, and metadata to add the annotations.

    So far, we haven’t hit anything we couldn’t figure out by inspecting the WordML (although marking deleted rows within tables is a bit tricky). But examples of how you and others are using and transforming WordML, such as the one you just posted, help us figure things out that much faster.

    Jan Fransen


  8. Darryl Hover says:

    Hi Brian.

    All this is very good and exciting!

    This particular post brings up an interesting thought…what I’d like to see is information on using Word to create the XSL files for the transforms. Perhaps it’s already available in the SDK somewhere, and I’ve just missed it?



  9. BrianJones says:

    Darryl, what are you thinking you’d like from the XSLTs? Are you looking for XSLTs to go from WordML into another format? Or do you want XSLTs to go from your XML into WordML?


  10. orcmid says:

    This is getting too exciting. OK, I have extended my Geek Saturday Morning a bit and then I have my regular Geezer Student duties to perform.

    I did the download and worked through the example. Here are snippets from my notes that might not be of interest to other beginners with this:

    2. {Opening the XML document in Word 2003} [Check. I opened it in IE (my default for .xml files) first. Then used the Open With … right-click option to Open it in Word. If you’re on-line there is a ton of network activity while Word opens the document. If you turn off the network first, it opens just fine and a bit quicker. ?!]

    11. {opening the XSLT-transformed result} [check. Minor tweak: It is the newly-saved version that is to be opened, hopefully saved under a different name. My result is bigger. I found out why: It’s in UTF-16 whereas the input is in UTF-8.]

    11.1 {Overcoming the processing instructions.} [check. dh:2005-06-18 I don’t see any processing instructions (I checked with Notepad too). Interesting. The new file opens in IE easily. Hmm. I think this is a configuration-controlled deal on whether or not we want XML/HTML to open in the creation application or not. Apparently the PI isn’t written when we have that turned off. Heh. Sometimes von Clueless lucks out.]

    11.2 {Examining/Pretty-printing the raw XSLT output} [check: Yup. I tried out FrontPage 2003 on the saved document, taking the options to reformat and verify for well-formedness. The odd thing is that FrontPage changed the XML-declaration encoding from UTF-16 to Unicode — no foul in this case but that was a little presumptuous. I also notice that the standalone attribute is dropped from the original document’s XML declaration.]

    OK, done playing for now.
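
The encoding observation above is easy to check. Here is a minimal Python sketch, using a made-up run of ASCII-only markup, that compares the byte size of the same text in UTF-8 and UTF-16. For text that is pure ASCII, UTF-16 takes two bytes per character plus a two-byte byte order mark, so markup-heavy files can grow even after content has been removed.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for the same text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16"))


# Hypothetical ASCII-only markup, repeated to make the sizes visible.
SAMPLE = "<w:r><w:t>The quick brown fox jumps over the lazy dog.</w:t></w:r>" * 1000

utf8_size, utf16_size = encoded_sizes(SAMPLE)

if __name__ == "__main__":
    print("UTF-8 bytes: ", utf8_size)
    print("UTF-16 bytes:", utf16_size)
```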

  11. Darryl Hover says:

    XSLTs to go from my XML into WordML. But it just struck me that this is the purpose of the XSLT Inference tool.


  12. BrianJones says:

    orcmid – Thanks for your feedback Dennis. It looks like the PI isn’t preserved when applying the XSLT, so that’s why there wasn’t any problem opening the file in IE. If you save the file without applying the XSLT, then the PI should be there.

    Darryl – Yes, that’s the reason we created the inference tool. It was clear that writing the XSLTs from scratch was pretty difficult and a lot of work for most people, so we built the inference tool to give folks a head start. I’m planning on writing something up in the next few days that will walk people through using the tool.


  13. Stephen Bullen says:

    Hi Brian

    You keep talking about us using the stuff in System.IO.Packaging. I guess that assumes we’re all coding in .NET. Is the same functionality provided anywhere that can be called from VBA/VB6/VBScript? It would be a travesty if we had to rely on third-party libraries to get into these files from VBA.


    Stephen Bullen

  14. Bob ?:-) says:

    Hi Brian,

    Okay, tried the updated XSLT.

    1. File size result: Orcmid had the same result I did (the result file, after accepting changes and removing comments and deletions, was a larger .XML or .DOC file than the original). I see also that the new XML file is UTF-16 although both your original and XSLT show UTF-8. Not sure what I’m missing in this case 🙂

    2. With the Word 2000->2003 add-ins available when processing this (save as XML & apply transform), would there then have been a choice to save as XML or as OffXML (i.e. end up with a zipped package) with a checkbox, similar to the choice to specify a Transform during Save As?

    3. One Word-side issue that could confuse folks doing this type of save via the U.I.

    If you were working in a .doc file, make changes and use File=>Save as (.doc with new name) the document on your screen matches the ‘as saved’ condition and the file name & path on the top your open Word window reflects the new name.

    If you save as xml and apply a transform the Word window changes to reflect the new file name, but what you see on the screen is still the ‘pre transformed’ flavor of the document. If you close and reopen (from the MRU on the File menu in Word) then you get the ‘saved’ version. Seems like a ‘refresh’ (reopen?) choice would be needed in Word to keep things ‘in synch’ (from the person used to working on .doc) files.

    4. Yes – examples, more please 🙂

    Bob Buckland ?:-)

  15. Darryl Hover says:

    An Inference Tool walk through would be sweet. Could you please include in your examples info on how to use a single element in multiple locations in the XSLT?

    For example, the XML File contains:





    The final document is to contain multiple references to the <last_name> element as in a letter as follows:

    Darryl Hover

    9999 My Street

    My City, MY 12345

    Dear Mr. Hover



  16. BrianJones says:

    Stephen, the only solid plans right now for APIs are the managed ones I’ve been referring to. It’s just ZIP and XML though, so anyone can build a tool for accessing the files. I agree with you though that it would be nice to have something simple that is available through VBA, and like you said it would be nice to not have to rely on third-party technologies. I’m looking into what we can do, and will probably have an update on this later on in the summer or early fall.

    Bob, you’re right about there being user confusion when saving through an XSLT. We actually don’t see that as being much of an end user scenario though. Ideally the save through XSLT would be leveraged as part of a larger solution that has specific types of XML it wants out of Word.

    Darryl, I’m not sure when I’ll get the example pulled together, but I’ll try to include your suggestion.


  17. Merinda says:

    So if I have a document that is being generated on the fly in Word using XML, every time the webpage is run, the document is re-generated. Does your XSLT have to be used every time I run the webpage, or will it recognize that the XSLT has been used?

  18. Andrew Savikas says:

    Hi Brian,

    I was confused about the part in your stylesheet about handling deletions that span paragraphs. As best I can tell, when a deletion begins in one paragraph and ends in another, it shows up in the XML as two separate aml elements. For example:




    <aml:annotation aml:id="0" aml:author="Andrew Savikas" aml:createdate="2005-06-29T16:48:00Z" w:type="Word.Deletion" />




    <w:t>The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the</w:t>


    <aml:annotation aml:id="1" aml:author="Andrew Savikas" aml:createdate="2005-06-29T16:48:00Z" w:type="Word.Deletion">



    <w:delText> lazy dog.</w:delText>








    <aml:annotation aml:id="2" aml:author="Andrew Savikas" aml:createdate="2005-06-29T16:48:00Z" w:type="Word.Deletion" />



    <aml:annotation aml:id="3" aml:author="Andrew Savikas" aml:createdate="2005-06-29T16:48:00Z" w:type="Word.Deletion">



    <w:delText>The quick brown fox jumps over the lazy dog. The quick bro</w:delText>




    <!-- end example -->

    As such, your cleanup stylesheet could be reduced to the following (adapted from Hack 96 in "Word Hacks"):

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
        xmlns:aml="http://schemas.microsoft.com/aml/2001/core">

    <!-- By default, recursively copy everything through -->
    <xsl:template match="@*|node( )">
        <xsl:copy>
            <xsl:apply-templates select="@*|node( )"/>
        </xsl:copy>
    </xsl:template>

    <!-- Remove all comments and comment references -->
    <xsl:template match="aml:annotation[starts-with(@w:type, 'Word.Comment')]"/>

    <!-- Remove all deletions -->
    <xsl:template match="aml:annotation[@w:type='Word.Deletion']"/>

    <!-- Remove all formatting changes -->
    <xsl:template match="aml:annotation[@w:type='Word.Formatting']"/>

    <!-- Remove all insertion marks -->
    <xsl:template match="aml:annotation[@w:type='Word.Insertion']">
        <!-- Process content, but do not copy -->
        <xsl:apply-templates select="aml:content/*"/>
    </xsl:template>

    </xsl:stylesheet>


    On all the files I tried this on (including the one you posted), it worked fine without any well-formedness errors. It would seem to me that it would actually be impossible to have an annotation element span two paragraphs, unless both were children of the paragraph.

    As for the issue of pretty printing, are there plans to include, or at least make available, something like the XML Toolbox Add-in? That’s been adequate for me for extracting pretty-printed WordprocessingML (that’s quite a mouthful, btw).


    Andrew Savikas

  19. BrianJones says:

    Thanks for the post Andrew.

    The issue of the deletion spanning the paragraphs is that you also need to account for the fact that that paragraph mark at the end of the first paragraph needs to be removed. You need to merge the remainders of those two paragraphs into one.

    Take this as an example:

    "First paragraph

    Second One"

    If you selected from "paragraph" to "Second" and hit delete, the result would be one paragraph that says "First One".

    That is what I was accounting for with the added complexity of the XSLT. If you do the above example and apply the XSLT you suggest, it will result in this:

    "First

    One"

    Instead of this:

    "First One"

    Make sense?

    As for the pretty printing, it’s not clear yet how we’ll package it. Most likely there will be a separate tool to apply pretty printing, rather than built-in functionality.


  20. Rajiv says:

    Hi Brian,

    I have a requirement to use Word 2003 as an XML editor. Is it possible to use macros dynamically in Smart Documents, i.e. the macros are downloaded from a server when the instance is started?

    Would it be possible to demonstrate this with an example?

    Also, I would like to achieve this transformation through code, i.e. trigger the transform on some event. Can I have an example of this?