Accessing Open XML Document Parts with the Open XML SDK

About a month ago, version 1.0 of the Open XML SDK (June 08 update) was finally released after a long stretch as a CTP. The SDK provides strongly typed access to the document parts of Word 2007, Excel 2007 and PowerPoint 2007 documents. I installed it last week, started playing around with it, and found it really easy to use after briefly looking at the documentation. The How Do I section is a great place to start.

Upgrading the Letter Generator

I decided to upgrade my Word 2007 letter generator program to use the SDK to manipulate the packages. Remember that Office 2007 documents are really just ZIP archives, so if you rename one to .ZIP you can take a look at the contents of the package. The Open XML Package spec defines a set of XML files that hold the content and define the relationships between all of the document parts stored in a single package. To manipulate them programmatically you can use the raw System.IO.Packaging namespace, but the SDK's DocumentFormat.OpenXml.Packaging namespace is much easier to work with.
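To make the "it's just a package of parts" idea concrete, here's a minimal sketch that lists every part in a package using the raw System.IO.Packaging API. The module and function names (PackageInspector, ListParts) are my own for illustration, and it assumes your project references WindowsBase and that a MyDocument.docx file sits in the current directory.

```vb
Imports System.Collections.Generic
Imports System.IO
Imports System.IO.Packaging   'Requires a reference to WindowsBase

Module PackageInspector
    'Hypothetical helper: returns "uri (content type)" for every part in the package
    Function ListParts(ByVal path As String) As List(Of String)
        Dim result As New List(Of String)
        Using p As Package = Package.Open(path, FileMode.Open, FileAccess.Read)
            For Each part As PackagePart In p.GetParts()
                result.Add(part.Uri.ToString() & " (" & part.ContentType & ")")
            Next
        End Using
        Return result
    End Function

    Sub Main()
        'MyDocument.docx is an assumed sample file in the current directory
        For Each entry As String In ListParts(CurDir() & "\MyDocument.docx")
            Console.WriteLine(entry)
        Next
    End Sub
End Module
```

For a Word document you should see the main document part among the entries, along with whatever styles, settings and theme parts Word saved with it.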

My mail merge program uses XML literals to construct XML for the document part of a Word 2007 file based on data in the Northwind database. The LINQ query was a piece of cake compared to figuring out how to manipulate the .docx package in order to replace the document.xml part (called the MainDocument part). Not that the final code is particularly long; it was just a pain to figure out. The SDK not only saved me a few lines of code, it also made the code much more readable, and the new version took only a few minutes to write. (I updated the code for the WordMailMerge program on Code Gallery.)

Getting Started with the Open XML SDK

Let's take another simple example that constructs a MainDocument part using XML literals and then replaces it in a .docx package using the SDK. This time I'll focus on the code that manipulates the Open XML package with the SDK, not on the particulars of XML literals. The first thing I recommend is to install the VSTO Power Tools so you can open Office 2007 documents and manipulate the parts directly in the Visual Studio IDE with the Open XML Package Editor, like I showed in my last post.

Of course you'll also need to install the SDK, which places the DocumentFormat.OpenXml.dll assembly into your GAC. Add a reference to this assembly in your project. As an aside, when x-copy deploying to a machine that already has the .NET Framework on it, just deploy the DocumentFormat.OpenXml.dll assembly alongside your application so you don't have to install the SDK on the target machine. The easiest way is to select "Show All Files" in the Solution Explorer, expand References, and in the Properties for the DocumentFormat.OpenXml reference set "Copy Local" = True. This places a private copy of the assembly next to your application when it's built.

Now create a new Word 2007 document with some simple text in it (for instance, type "This is my document"), then save it and add the .docx file to your Visual Basic project. Double-click on it and it opens in the Open XML Package Editor.

We can manipulate the parts through this editor if we want to, but what I really want to do is replace document.xml with one we create ourselves using XML literals and embedded expressions. Double-click document.xml to open the MainDocument part in the XML Editor. (If the XML all shows up on one line with no breaks, select all the contents, cut them, and paste them back into the editor, and it will put the proper line breaks in for you: Ctrl+A, Ctrl+X, Ctrl+V.)

For this simple example, let's place the executing user's name into the document. Create the XML Literal and an embedded expression by pasting the document.xml into the VB Editor and adding an expression to print out the executing user's name:

 Dim myDoc = <?xml version="1.0" encoding="utf-8" standalone="yes"?>
            <w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"
               xmlns:o="urn:schemas-microsoft-com:office:office"
               xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
               xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
               xmlns:v="urn:schemas-microsoft-com:vml"
               xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
               xmlns:w10="urn:schemas-microsoft-com:office:word"
               xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
               xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">
               <w:body>
                   <w:p w:rsidR="00DD17EB" w:rsidRDefault="00361264">
                       <w:r>
                           <w:t>This is <%= Environment.UserName %>'s document</w:t>
                       </w:r>
                   </w:p>
                   <w:sectPr w:rsidR="00DD17EB" w:rsidSect="00DD17EB">
                       <w:pgSz w:w="12240" w:h="15840"/>
                       <w:pgMar w:top="1440" w:right="1440" w:bottom="1440"
                           w:left="1440" w:header="720" w:footer="720" w:gutter="0"/>
                       <w:cols w:space="720"/>
                       <w:docGrid w:linePitch="360"/>
                   </w:sectPr>
               </w:body>
           </w:document>
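If the full document literal feels like a lot to take in, here's a stripped-down sketch of the same technique: one paragraph built with an XML literal and one embedded expression. The BuildParagraph function and LiteralSketch module are hypothetical names for illustration. Note that the namespace URI has to match the WordprocessingML one character for character (it starts with http://), otherwise Word won't recognize the elements.

```vb
Imports System.Xml.Linq
Imports <xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">

Module LiteralSketch
    'Hypothetical helper: builds one WordprocessingML paragraph whose text is supplied at run time
    Function BuildParagraph(ByVal text As String) As XElement
        Return <w:p>
                   <w:r>
                       <w:t><%= text %></w:t>
                   </w:r>
               </w:p>
    End Function

    Sub Main()
        Dim para As XElement = BuildParagraph("This is " & Environment.UserName & "'s document")
        'Prints the paragraph element, including the w namespace declaration
        Console.WriteLine(para)
    End Sub
End Module
```

The embedded expression is evaluated when the literal is constructed, so the resulting XElement just contains plain text at that spot.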

Replacing the MainDocument Part

Before the SDK, to replace the MainDocument part in the package we had to figure out the right content type and write the code that deleted the old part and then added the new one. We also needed to add a reference to WindowsBase (a .NET 3.0 assembly) in order to access the System.IO.Packaging namespace.

Imports System.IO.Packaging
Imports System.IO
...

'**** Without OpenXML SDK
Dim uri As New Uri("/word/document.xml", UriKind.Relative)
Dim contentType = "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
Dim docFile = CurDir() & "\MyDocument.docx"

Using p As Package = Package.Open(docFile)
    'Delete the current document.xml file
    p.DeletePart(uri)

    'Replace that part with our XDocument
    Dim replace As PackagePart = p.CreatePart(uri, contentType)
    Using sw As New StreamWriter(replace.GetStream())
        myDoc.Document.Save(sw)
    End Using
End Using
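The code above hard-codes "/word/document.xml", which is where Word puts the part by default. Here's a hedged sketch of how you could look the part up through the package relationships instead, still with the raw API. FindMainDocumentUri and MainPartLocator are names I made up for illustration; the relationship type string is the standard officeDocument one.

```vb
Imports System.IO
Imports System.IO.Packaging   'Requires a reference to WindowsBase
Imports System.Linq

Module MainPartLocator
    'Relationship type that links the package root to the main document part
    Const OfficeDocumentRelType As String = _
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"

    'Hypothetical helper: returns the URI of the main document part, or Nothing if there is none
    Function FindMainDocumentUri(ByVal p As Package) As Uri
        Dim rel As PackageRelationship = _
            p.GetRelationshipsByType(OfficeDocumentRelType).FirstOrDefault()
        If rel Is Nothing Then Return Nothing

        'Package-level relationship targets are resolved against the package root "/"
        Return PackUriHelper.ResolvePartUri(New Uri("/", UriKind.Relative), rel.TargetUri)
    End Function

    Sub Main()
        'MyDocument.docx is an assumed sample file in the current directory
        Using p As Package = Package.Open(CurDir() & "\MyDocument.docx", FileMode.Open, FileAccess.Read)
            Console.WriteLine(FindMainDocumentUri(p))
        End Using
    End Sub
End Module
```

This is exactly the kind of plumbing the SDK hides behind a strongly typed property.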

For this example it's pretty easy. However, if you add or remove parts it's up to you to update the relationships in the package, and that isn't an easy task with this raw API. Enter the Open XML SDK. Now we don't need a reference to WindowsBase, only to DocumentFormat.OpenXml, and we import the Packaging namespace contained within. Then our code can access the parts of the document in a strongly-typed way:

Imports DocumentFormat.OpenXml.Packaging
Imports System.IO
...

'***** Use the OpenXML SDK for easier access to parts
Dim docFile = CurDir() & "\MyDocument.docx"

Dim wordDoc = WordprocessingDocument.Open(docFile, True)
Using wordDoc
    'Replace the document part with our XML
    Using sw As New StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create))
        myDoc.Document.Save(sw)
    End Using
End Using

After we run this code you'll see that the MainDocument part now has the user name in the document body as described by our XML literal.
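If you'd rather check the result in code than open the file in Word, here's a small sketch that reads the body text back out with the SDK. ReadBodyText and the ReadBack module are hypothetical names; the sketch assumes the same MyDocument.docx file and simply concatenates every w:t element in the main part.

```vb
Imports System.IO
Imports System.Linq
Imports System.Xml.Linq
Imports DocumentFormat.OpenXml.Packaging
Imports <xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">

Module ReadBack
    'Hypothetical helper: returns the text of every run in the main document part, concatenated
    Function ReadBodyText(ByVal docFile As String) As String
        'False opens the package read-only
        Using wordDoc As WordprocessingDocument = WordprocessingDocument.Open(docFile, False)
            Using sr As New StreamReader(wordDoc.MainDocumentPart.GetStream())
                Dim doc As XDocument = XDocument.Load(sr)
                Return String.Concat(doc...<w:t>.Select(Function(t) t.Value).ToArray())
            End Using
        End Using
    End Function

    Sub Main()
        'MyDocument.docx is an assumed sample file in the current directory
        Console.WriteLine(ReadBodyText(CurDir() & "\MyDocument.docx"))
    End Sub
End Module
```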

Using LINQ with the Open XML SDK

Using the SDK we can also write LINQ queries over the part collections. For instance if we want to select all the top level parts and any of their sub-parts, we can write a query like so:

Using wordDoc
    Dim parts = From part In wordDoc.Parts _
                Select part.OpenXmlPart, _
                       part.RelationshipId, _
                       part.OpenXmlPart.RelationshipType, _
                       SubParts = _
                       ( _
                        From subPart In part.OpenXmlPart.Parts _
                        Select subPart.OpenXmlPart, _
                               subPart.RelationshipId, _
                               subPart.OpenXmlPart.RelationshipType _
                       ).ToList

    'The outer query runs lazily, so enumerate or bind the results here,
    'before End Using closes the package
End Using

This query returns similar information to what you get with the Open XML Package Editor if we look at the same document. If we display the query results in two related DataGridViews we'll see that the MainDocument part contains additional parts for things like themes, styles and settings.
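If you don't want to wire up grids just to look at the results, here's a minimal sketch that walks the same two levels of parts and collects them as indented text lines. DescribeParts and PartTree are hypothetical names, and the sketch relies on Option Infer being on (the default for new Visual Basic 2008 projects) for the loop variables. Everything is read inside the Using block, because the part collections can't be enumerated once the package is closed.

```vb
Imports System.Collections.Generic
Imports DocumentFormat.OpenXml.Packaging

Module PartTree
    'Hypothetical helper: one line per top-level part, followed by indented lines for its sub-parts
    Function DescribeParts(ByVal docFile As String) As List(Of String)
        Dim lines As New List(Of String)
        'False opens the package read-only
        Using wordDoc As WordprocessingDocument = WordprocessingDocument.Open(docFile, False)
            For Each pair In wordDoc.Parts
                lines.Add(pair.RelationshipId & "  " & pair.OpenXmlPart.Uri.ToString())
                For Each subPair In pair.OpenXmlPart.Parts
                    lines.Add("    " & subPair.RelationshipId & "  " & _
                              subPair.OpenXmlPart.Uri.ToString())
                Next
            Next
        End Using
        Return lines
    End Function

    Sub Main()
        'MyDocument.docx is an assumed sample file in the current directory
        For Each entry As String In DescribeParts(CurDir() & "\MyDocument.docx")
            Console.WriteLine(entry)
        Next
    End Sub
End Module
```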

If we want to access the actual XML content of each OpenXmlPart, we can call the GetStream method on the part we want, wrap the resulting stream in a StreamReader, and use that reader to load an XDocument object.

Using wordDoc
    Dim parts = From part In wordDoc.Parts _
                Select Doc = XDocument.Load(New StreamReader(part.OpenXmlPart.GetStream())), _
                       part.OpenXmlPart, _
                       part.RelationshipId, _
                       part.OpenXmlPart.RelationshipType, _
                       SubParts = _
                       ( _
                        From subPart In part.OpenXmlPart.Parts _
                        Select Doc = XDocument.Load(New StreamReader(subPart.OpenXmlPart.GetStream())), _
                               subPart.OpenXmlPart, _
                               subPart.RelationshipId, _
                               subPart.OpenXmlPart.RelationshipType _
                       ).ToList
End Using
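One thing to watch out for: XDocument.Load will throw on a part that isn't XML, such as an embedded image. Here's a hedged sketch of one way to guard against that by checking each part's content type first. The names IsXmlContentType, LoadTopLevelXmlParts and XmlPartLoader are mine, and treating "ends with +xml or /xml" as "is XML" is an assumption that works for the usual Office content types rather than an official rule.

```vb
Imports System.Collections.Generic
Imports System.IO
Imports System.Xml.Linq
Imports DocumentFormat.OpenXml.Packaging

Module XmlPartLoader
    'Assumption: XML parts have a content type ending in "+xml" or "/xml"
    Function IsXmlContentType(ByVal contentType As String) As Boolean
        Return contentType.EndsWith("+xml", StringComparison.OrdinalIgnoreCase) _
            OrElse contentType.EndsWith("/xml", StringComparison.OrdinalIgnoreCase)
    End Function

    'Hypothetical helper: loads every top-level XML part, keyed by the part's URI
    Function LoadTopLevelXmlParts(ByVal docFile As String) As Dictionary(Of String, XDocument)
        Dim docs As New Dictionary(Of String, XDocument)
        'False opens the package read-only
        Using wordDoc As WordprocessingDocument = WordprocessingDocument.Open(docFile, False)
            For Each pair In wordDoc.Parts
                Dim part = pair.OpenXmlPart
                If IsXmlContentType(part.ContentType) Then
                    'Dispose the reader as soon as the XML is loaded
                    Using sr As New StreamReader(part.GetStream())
                        docs(part.Uri.ToString()) = XDocument.Load(sr)
                    End Using
                End If
            Next
        End Using
        Return docs
    End Function
End Module
```

Because the XDocument objects are fully loaded inside the Using block, they stay usable after the package is closed.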

Loading and Querying the XDocument from the Package

Let's say we have a case where we can't use XML literals and embedded expressions. Instead, we want to pull out the MainDocument part and find and replace text inside it. We can do this using XML axis properties. This can get pretty tricky because there may be a lot of formatting information in the document. An easier way may be to use content controls, which you can give an alias so that they're easier to query. For this example, though, it's a pretty simple query to find our body text and rewrite it so that the word "my" becomes the user's name.

Imports <xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
...

Dim docFile = CurDir() & "\MyDocument.docx"
Dim wordDoc = WordprocessingDocument.Open(docFile, True)
Dim myDoc As XDocument

Using wordDoc
    Using xr As New StreamReader(wordDoc.MainDocumentPart.GetStream())
        'Load the MainDocument part's XML
        myDoc = XDocument.Load(xr)
    End Using

    'Find the only line of text in this document
    Dim element = (From item In myDoc...<w:t>)(0)

    'Replace the value of the element
    element.Value = <s>This is <%= Environment.UserName %>'s document</s>.Value

    Using sw As New StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create))
        'Save the modified XML back to the MainDocument part
        myDoc.Save(sw)
    End Using

End Using
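For the content control approach mentioned earlier, here's a minimal sketch of what the query could look like. It works on an XDocument you've already loaded from the MainDocument part. FillContentControl, ContentControlSketch and the "UserName" alias are all made up for illustration; the w:sdt, w:sdtPr, w:alias and w:sdtContent elements are the WordprocessingML markup Word writes for a content control.

```vb
Imports System.Linq
Imports System.Xml.Linq
Imports <xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">

Module ContentControlSketch
    'Hypothetical helper: sets the text of every content control with the given alias.
    'Returns the number of matching controls.
    Function FillContentControl(ByVal doc As XDocument, ByVal controlAlias As String, _
                                ByVal value As String) As Integer
        Dim controls = (From sdt In doc...<w:sdt> _
                        Where sdt.<w:sdtPr>.<w:alias>.@w:val = controlAlias _
                        Select sdt).ToList()

        For Each sdt In controls
            Dim texts = sdt.<w:sdtContent>...<w:t>.ToList()
            If texts.Count > 0 Then
                'Put the new value in the first run and blank out any others
                texts(0).Value = value
                For i As Integer = 1 To texts.Count - 1
                    texts(i).Value = ""
                Next
            End If
        Next
        Return controls.Count
    End Function

    Sub Main()
        'A tiny stand-in for a real MainDocument part
        Dim doc = <?xml version="1.0" encoding="utf-8"?>
                  <w:document>
                      <w:body>
                          <w:sdt>
                              <w:sdtPr><w:alias w:val="UserName"/></w:sdtPr>
                              <w:sdtContent>
                                  <w:p><w:r><w:t>placeholder</w:t></w:r></w:p>
                              </w:sdtContent>
                          </w:sdt>
                      </w:body>
                  </w:document>

        FillContentControl(doc, "UserName", Environment.UserName)
        Console.WriteLine(doc)
    End Sub
End Module
```

The nice part of this design is that the query keys off a name you chose in Word rather than the position of a run, so it keeps working when the formatting around the control changes.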

One of the cool things about using the Open XML SDK is that you don't have to have Office installed to run any of this code, so it's a great alternative to slow COM automation for manipulating documents.

As I explore Open XML in Office 2007 more and more I'll post more realistic business examples using LINQ to XML and Visual Basic. For now, you may want to sink your teeth into Ken Getz's Advanced Basics March 2008 article in MSDN Magazine: Office 2007 Files and LINQ. This article also shows off some important XML namespace features of Visual Basic.

Enjoy!