Manage your documents: Using Office XML formats and XSLTs

One of the main reasons we announced this new file format so early was that we wanted to give people an opportunity to start building different types of solutions on top of the file formats. I’m pushing for an early release of the schemas (sometime before Beta 1), but that still leaves us with a few months before they would be out. So, in the meantime, the best way to start playing around with potential solutions is using Office 2003. There is already a ton of XML support in that product.

While the announcement of these new default XML formats is a big deal, it is definitely not the first time we’ve worked with XML. In Office 2000 (which we started developing in 1997) we built an HTML format that leveraged XML for representing things like document properties and other application-specific information. This was done because HTML didn't support all of our features and we didn't want people to lose information when saving as a web page. It was unfortunate because it didn't look like "pure" HTML, but it was necessary to support our customers' data. Starting in 1999 we began building the SpreadsheetML format that shipped with Excel in Office XP. Then in 2001 we started working on the WordprocessingML format, which is now available in Word 2003. So, as you can see, we’ve been doing stuff with XML in Office for the past 8 years.

Why the brief history lesson, you ask? It’s important to understand that the new formats coming out with Office 12 are based on the work we’ve done up through Office 2003. So, if you build solutions on top of Word 2003’s XML, those will map fairly easily into the new file formats. For Word, the only big difference with the new format is that we break the single XML file into multiple files and wrap them all up in a ZIP package (we’ve actually designed a logical model for structuring documents from multiple pieces, which we then mapped into ZIP). Today I want to show an example of something you can do with WordprocessingML in Word 2003.

There have been a number of questions around support for other XML formats (there are tons of them out there). As I’ve described, since the formats are XML and fully documented, anyone can build transforms to go from our format into another (or vice versa). I decided I would post a really simple transform that runs against Word 2003 XML just to give folks an example. This transform gets rid of all the tracked changes and comments in a file. It does the exact same thing as if you were editing the file directly in Word and chose to accept all revisions and delete all comments. This transform is something that people could leverage as part of a workflow process. Imagine you had documents you wanted to publish and you wanted to make sure there weren’t any deletions or comments in the files. I’m sure you’ve heard of people getting burned by posting documents on a server that had deletions in them. Oftentimes the end user didn’t realize the deletions were still there, and there wasn’t an easy way for administrators to write an automated process to remove those deletions. Well, using XML, it’s easy to write solutions that manipulate Office documents without having to run the applications themselves.
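
To make the workflow idea a bit more concrete, here is a minimal Python sketch of a pre-publish check that counts the revision and comment markup still sitting in a file. It is not part of the download. The namespace URIs, the "Word.Deletion" / "Word.Insertion" type values, the sample fragment, and the count_annotations helper are all assumptions for illustration, so check them against your own Word 2003 XML before relying on it.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed Word 2003 namespace URIs; verify against a real saved file.
AML = "http://schemas.microsoft.com/aml/2001/core"
W = "http://schemas.microsoft.com/office/word/2003/wordml"


def count_annotations(xml_text):
    """Count aml:annotation elements, grouped by their w:type attribute."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for node in root.iter(f"{{{AML}}}annotation"):
        counts[node.get(f"{{{W}}}type", "unknown")] += 1
    return counts


# Hypothetical, heavily trimmed fragment in the shape of a Word 2003 XML file.
SAMPLE = (
    f'<w:wordDocument xmlns:w="{W}" xmlns:aml="{AML}"><w:body><w:p>'
    '<w:r><w:t>Kept text. </w:t></w:r>'
    '<aml:annotation aml:id="0" w:type="Word.Deletion"><aml:content>'
    '<w:r><w:delText>Deleted text. </w:delText></w:r>'
    '</aml:content></aml:annotation>'
    '<aml:annotation aml:id="1" w:type="Word.Insertion"><aml:content>'
    '<w:r><w:t>Inserted text.</w:t></w:r>'
    '</aml:content></aml:annotation>'
    '</w:p></w:body></w:wordDocument>'
)

if __name__ == "__main__":
    # Report each annotation type found; an empty report means nothing was flagged.
    for kind, total in sorted(count_annotations(SAMPLE).items()):
        print(kind, total)
```

A publishing script could refuse to post any file where this count is non-zero for deletions or comments. Note that other features may also be stored as annotations, so a real check would probably want to look only at the types it cares about.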

Here are the steps for trying this solution out:

  1. Download this ZIP file and put the two enclosed files on your desktop (https://jonesxml.com/resources/trackChangesExample.zip)
  2. Open the file called "FileFormatDev.xml" in Word 2003. Notice that there are a bunch of comments and deletions.
  3. Open the file in Internet Explorer (or any text / XML editor) and look at its contents. There is a ton of XML there, but you only really need to care about certain parts. Do a search for “aml:” and you’ll see all the tags we use for representing those comments and revisions.
  4. Now open the XSLT "acceptRevisionsAndDeleteComments.xslt" in a text editor or IE, and take a look. It’s a pretty simple transform. If you are familiar with XSLT, you’ll see that all it does is rewrite all of the WordprocessingML except for the comments and revisions, which it strips out.
  5. You’ll now need to apply the XSLT to the Word document to remove those comments and revisions. There are a number of ways you can do this. Most XML parsers out there can do this for you. You can also use Word to do this directly. In Word, we allow you to save XML files through transforms as well as open them through transforms. That’s what we’ll do just to keep it simple.
  6. Open the whitepaper ("FileFormatDev.xml") in Word again and go to the Save As… dialog (File -> Save As…). The file type in that dialog should be “XML Document (.xml)”.
  7. Notice that there is a checkbox in the dialog called “Apply Transform”. Go ahead and select that, and you will then have the “Transform...” button enabled.
  8. Click on the “Transform...” button and go find the transform that you downloaded in Step 1.
  9. You’ve now told Word to save the file as an XML document, and then after the save is done, apply the specified transform. That means that if the XSLT does its job right, you should get a Word XML file that has all the comments and revisions removed.
  10. Rename the file so you can compare the results, then press the “Save” button. There will be a warning letting you know you are saving through a transform and that some of the document information might be lost. Go ahead and press “OK”. Once the file is saved, go ahead and shut down Word.
  11. Open the file in Word again, and you’ll see that the comments and revisions are now gone. Remember that while in these steps we applied the XSLT with Word, you can do it anywhere. You don’t need to have Word on the machine. You could use any XML parser that supports XSLT and apply it to your documents.
    1. As a quick aside, you may notice that the XML file that you saved doesn't open as easily in IE or a text editor. There are two things going on here. The first is that we put the following PI (processing instruction) at the top of our files (<?mso-application progid="Word.Document"?>). We actually have a shell handler that sees this in the XML and associates the file with Word (even though the extension is just .xml). We do something similar with our HTML files. The problem is that if you try to open the file in Internet Explorer, it will see that PI and hand the file off to Word. You can open the file in Notepad and delete the PI, and then it will open in IE.
    2. The other thing you may notice if you open the file in an XML or text editor is that it's just one long stream of text. We don't "pretty print" our XML files, so if you look at them as plain text, they're hard to read. We do this because it improves the performance of saving and loading. It makes the files a bit more difficult to work with though. One option is to open the file in IE (after removing the PI), since IE will apply a transform to lay it out better. Another option is to use an XML editor (Visual Studio, FrontPage, etc.) that gives you the option to format the file. That will apply "pretty printing" to the document for easier reading.
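
If you'd rather not hand-edit the file in Notepad, the two asides above can be sketched in a few lines of Python using only the standard library. This is just an illustration: the strip_mso_pi and pretty_print helpers, the namespace URI, and the sample string are assumptions, not something that ships with Word or with the download.

```python
import re
import xml.dom.minidom

# Assumed Word 2003 namespace URI; verify against a real saved file.
W = "http://schemas.microsoft.com/office/word/2003/wordml"


def strip_mso_pi(xml_text):
    """Remove the first <?mso-application ...?> processing instruction."""
    return re.sub(r"<\?mso-application\b.*?\?>\s*", "", xml_text,
                  count=1, flags=re.DOTALL)


def pretty_print(xml_text):
    """Return an indented copy of the XML, intended for reading only."""
    return xml.dom.minidom.parseString(xml_text).toprettyxml(indent="  ")


# Hypothetical one-line file in the shape Word writes (no pretty printing).
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<?mso-application progid="Word.Document"?>'
    f'<w:wordDocument xmlns:w="{W}"><w:body><w:p><w:r>'
    '<w:t>Hello</w:t></w:r></w:p></w:body></w:wordDocument>'
)

if __name__ == "__main__":
    print(pretty_print(strip_mso_pi(SAMPLE)))
```

Treat the indented output as a reading copy. Indenting adds whitespace between elements, so it's safer to keep the original file as the one you actually open in Word.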

So, that's just one example of writing a tool that manipulates a Word document. If you were going to try to do something like this with the binary formats, it would have been extremely difficult. Most people who are trying to do this today end up writing code that automates the Office applications. The advantage with the XSLT is that you don’t need to have the Office applications involved (in the demo we had Word apply the XSLT, but you could have used any number of tools to do it).
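
To show what "no Office applications involved" can look like, here is a rough Python sketch that does a similar cleanup directly on the XML tree. To be clear, this swaps the XSLT for plain tree manipulation with Python's standard library, and it is not the transform from the download. The namespace URIs, the w:type values, the aml:content wrapper, and the accept_revisions helper are assumptions about the Word 2003 markup, and the sketch ignores harder cases such as deleted paragraph marks and formatting changes.

```python
import xml.etree.ElementTree as ET

# Assumed Word 2003 namespace URIs; verify against a real saved file.
AML = "http://schemas.microsoft.com/aml/2001/core"
W = "http://schemas.microsoft.com/office/word/2003/wordml"
ANNOTATION = f"{{{AML}}}annotation"
CONTENT = f"{{{AML}}}content"
TYPE = f"{{{W}}}type"

# Keep the familiar prefixes when the tree is written back out.
ET.register_namespace("w", W)
ET.register_namespace("aml", AML)


def _accept(elem):
    """Rebuild elem's child list, accepting revisions and dropping comments."""
    kept = []
    for child in list(elem):
        if child.tag == ANNOTATION:
            kind = child.get(TYPE, "")
            if kind == "Word.Deletion" or kind.startswith("Word.Comment"):
                continue  # deleted text and comment markup are thrown away
            if kind == "Word.Insertion":
                # Keep the inserted runs, lose the annotation wrapper.
                for content in child.findall(CONTENT):
                    _accept(content)
                    kept.extend(list(content))
                continue
        # Everything else (including other annotation types) is preserved.
        _accept(child)
        kept.append(child)
    elem[:] = kept


def accept_revisions(xml_text):
    """Return the document with insertions accepted, deletions and comments gone."""
    root = ET.fromstring(xml_text)
    _accept(root)
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily trimmed fragment in the shape of a Word 2003 XML file.
SAMPLE = (
    f'<w:wordDocument xmlns:w="{W}" xmlns:aml="{AML}"><w:body><w:p>'
    '<w:r><w:t>Kept. </w:t></w:r>'
    '<aml:annotation aml:id="0" w:type="Word.Deletion"><aml:content>'
    '<w:r><w:delText>Deleted. </w:delText></w:r></aml:content></aml:annotation>'
    '<aml:annotation aml:id="1" w:type="Word.Insertion"><aml:content>'
    '<w:r><w:t>Inserted.</w:t></w:r></aml:content></aml:annotation>'
    '<aml:annotation aml:id="2" w:type="Word.Comment.Start"/>'
    '</w:p></w:body></w:wordDocument>'
)

if __name__ == "__main__":
    print(accept_revisions(SAMPLE))
```

For anything beyond an experiment, running the actual XSLT through an XSLT processor is the better route, since the transform is the piece that was written against the real schema.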

Let me know if you guys have any questions or if the XSLT doesn’t work for you. I think in my next post I’ll talk more about the Word schema and how we designed it. At first glance it’s a fairly intimidating schema, but as you learn about it, it’s pretty basic and straightforward. There are just a ton of features in Word, so we had to create XML to represent them all. That doesn't mean that you need to deal with them all though if you're just trying to do something simple. Also, does anyone feel like it would be useful to have some posts talking about more of the basics around XML? Or does everyone feel like they are already up to speed on everything I've discussed and just want to see more technical posts?

-Brian