Parsing XML from an Open XML Document


The first problem that we're going to tackle is retrieving some specific text from an Open XML Word document.  The document contains paragraphs that have the style "Code".  We want to find all consecutive paragraphs that have this style and retrieve their text.  Also, when authoring the document, we add Word comments on the first paragraph of each block of lines styled "Code".  These comments contain metadata that tells how to build the code, run the code, and validate the output, so we also want to retrieve the text of the comments.

Further, if there are multiple separate blocks of text in the Word document that are styled Code, we want to grab each block as a separate chunk of text, along with its associated comments.
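To make the goal concrete, here is a minimal Python sketch of the grouping step alone.  It is not the query this tutorial develops.  It assumes the paragraphs have already been reduced to hypothetical (style name, text) pairs in document order, and the function name `code_blocks` is made up for illustration.

```python
from itertools import groupby


def code_blocks(paragraphs, style="Code"):
    """Group consecutive paragraphs that have `style` into separate blocks.

    `paragraphs` is an iterable of (style_name, text) pairs in document order.
    Each returned block is a list of paragraph texts.
    """
    blocks = []
    # groupby starts a new group whenever the key changes, so each run of
    # adjacent Code paragraphs becomes its own group.
    for is_code, run in groupby(paragraphs, key=lambda p: p[0] == style):
        if is_code:
            blocks.append([text for _, text in run])
    return blocks


# Hypothetical stand-in for paragraphs pulled out of a document.
paragraphs = [
    ("Normal", "Intro text."),
    ("Code", "int x = 1;"),
    ("Code", "Console.WriteLine(x);"),
    ("Normal", "Explanation."),
    ("Code", "1"),
]

blocks = code_blocks(paragraphs)
print(blocks)
```

The two runs of Code paragraphs come back as two separate blocks because a paragraph with a different style sits between them.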

For those who are interested, the reason I needed this code is that I commonly have code snippets in documents.  Each time I modify the code or receive a new drop of the libraries that the code uses, I want to automatically test all of these snippets and make sure that they still work.  What I have done is to put build instructions (written in XML) in Word comments on the code in the docs.  Then, I run a program that extracts the snippets and uses the build instructions to compile and build the code.  The code tester then runs the code and verifies the output.  In most cases, the expected output is also in the Word doc, styled as Code, with its own comment that identifies it.  (I would have preferred a different style name for the output, but didn't have a choice.)  Validation then makes sure not only that the code compiles, but also that when readers of the docs run the snippet, they are going to see the output that is shown in the docs.  Of course, it is a little more complicated than this.  For instance, there are facilities to specify a language for each snippet, to copy files that the snippet requires in order to run, to validate against a file that the snippet writes, and so on.  But once we have the query to retrieve the text styled as Code and the comments, the rest of the code tester is pretty simple stuff.

To accomplish this task, we'll start by writing a simple query, then enhance it using additional standard query operators and a couple of extension methods that help us retrieve exactly what we want from the docs.

The native file format for the 2007 Microsoft Office system is Office Open XML (commonly called Open XML).  Open XML is an XML-based format that is an ECMA and ISO/IEC standard.  The markup language for word processing files within Open XML is called WordprocessingML.  This tutorial uses WordprocessingML as input for the examples.
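For readers who have not seen WordprocessingML, the following Python sketch parses a small hand-written fragment and reads each paragraph's style and text.  The element names (`w:p`, `w:pPr`, `w:pStyle`, `w:r`, `w:t`) and the namespace are those of WordprocessingML.  The sample fragment and the helper names `paragraph_style` and `paragraph_text` are assumptions made for illustration, and real documents carry much more markup than this.

```python
import xml.etree.ElementTree as ET

# The WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written sample, much simpler than what Word actually saves.
SAMPLE = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:pPr><w:pStyle w:val="Code"/></w:pPr>
      <w:r><w:t>Console.WriteLine("Hello");</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>Plain paragraph.</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def paragraph_style(p):
    """Return the paragraph's style id, or None if it has no w:pStyle."""
    style = p.find("w:pPr/w:pStyle", NS)
    return style.get(f"{{{W}}}val") if style is not None else None


def paragraph_text(p):
    """Concatenate the text of every w:t element inside the paragraph."""
    return "".join(t.text or "" for t in p.iterfind(".//w:t", NS))


root = ET.fromstring(SAMPLE)
styled = [(paragraph_style(p), paragraph_text(p)) for p in root.iterfind(".//w:p", NS)]
print(styled)
```

A paragraph without a `w:pStyle` element has no explicit style, which the sketch reports as `None`.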

If you are using Microsoft Office 2003, you can load and save documents in Open XML if you have installed the Microsoft Office Compatibility Pack for Word, Excel, and PowerPoint 2007 File Formats.  You can download the compatibility pack here.

For more information, and for links to the ECMA Office Open XML specification, see the OpenXML Developer Web site.

Open XML documents consist of various XML and binary parts stored in a zip file, called a package.  This is documented in Part 2 of the ISO/IEC 29500 Open XML specification, the Open Packaging Conventions.  It is also documented in Part 2 of the ECMA-376 standard.
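Because a package is a zip file, any zip library can list and read its parts.  The Python sketch below builds a tiny stand-in package in memory so that it runs without a real document.  The part names `word/document.xml` and `word/comments.xml` are the ones Word typically uses, but treating them as fixed is an assumption: a robust reader would locate parts through the package relationships instead.  The helper names are hypothetical.

```python
import io
import zipfile


def build_stand_in_package():
    """Create a tiny zip in memory that mimics the layout of a Word package.

    The part contents are placeholders, not valid WordprocessingML.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("[Content_Types].xml", "<Types/>")
        z.writestr("word/document.xml", "<document/>")
        z.writestr("word/comments.xml", "<comments/>")
    return buf.getvalue()


def list_parts(package_bytes):
    """Return the names of all entries stored in the package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as z:
        return z.namelist()


def read_part(package_bytes, name):
    """Return the content of one part, decoded as UTF-8 text."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as z:
        return z.read(name).decode("utf-8")


package = build_stand_in_package()
print(list_parts(package))
print(read_part(package, "word/document.xml"))
```

With a real document, the same two functions could be given the bytes of a .docx file read from disk.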
