Open XML SDK and LINQ to XML

In this post, I’m presenting some code that uses the Open XML SDK and LINQ to XML to query an Open XML document. I’ve posted on this before, but this version is cleaner and smaller. It also is an example of code written in the functional style.

This blog is inactive.
New blog: EricWhite.com/blog

Blog TOC

 

The most common need is to retrieve the paragraphs of a word processing document. But I've also had the need to dig into a spreadsheet, and get the data out for a row, a set of rows, or all the cells in a column. You can think of the code presented here as a (very) small library on top of the Open XML SDK.

 

Note: The Open XML API Version 2.0 will contain additional functionality that enables easier processing of the XML contents of parts. However, it's not available yet. It will be the best way to query documents in the future, but for now, the approach presented in this post is interesting.

 

I have a set of goals for this code:

  • The code should be as short and simple as possible. You should be able to understand it or change it with an absolute minimum of effort (provided that you are familiar with LINQ and LINQ to XML).
  • For robustness and simplicity of code, it should be based on the Open XML SDK. Another previous version was based on System.IO.Packaging (approx 500 lines long, this one is much shorter). Annotations, newly added to the SDK, make the SDK easier to work with.
  • It should be written in a pure functional style, using LINQ queries where appropriate.
  • The queries should be lazy. This little library should not maintain any state outside of the XDocument XML trees corresponding to each part.

 

To accomplish these goals, I used the following approach:

 

Using Nominal Tuple Types as Part of the API 

 

In my functional programming tutorial, I talked about the idea of using nominal types for tuples instead of anonymous types. I use this approach in this code. Using types that are defined using the automatic property syntax helps in shortening the code.

 

I use the approach that is in the spirit of a dynamic language - to shorten the code, make all properties public, and use automatic properties. This makes the properties read/write. However, all the code is written in a functional style - no objects are mutated after creation and initialization. So, for example, here is the definition of the class that encapsulates a comment for a paragraph:

 

public class Comment

{

    public int Id { get; set; }

    public string Text { get; set; }

    public string Author { get; set; }

}

 

Annotating Each Part with its XDocument 

 

This code uses the approach that each part in the package is annotated with its XDocument. There is a small extension method to get an XDocument for a part. If the annotation already exists, we avoid de-serializing, and just return the annotation.

 

public static XDocument GetXDocument(this OpenXmlPart part)

{

    XDocument xdoc = part.Annotation<XDocument>();

    if (xdoc != null)

        return xdoc;

    using (StreamReader sr =

           new StreamReader(part.GetStream()))

        using (XmlReader xr = XmlReader.Create(sr))

            xdoc = XDocument.Load(xr);

    part.AddAnnotation(xdoc);

    return xdoc;

}

 

Then, any time we have a part, we can call GetXDocument(), knowing that we aren’t causing any extra work for deserializing:

 

return (string)mainDocument

        .StyleDefinitionsPart

        .GetXDocument()

        .Root

        .Elements(w + "style")

        .Where(style =>

          (string)style.Attribute(w + "type") ==

            "paragraph" &&

          (string)style.Attribute(w + "default") ==

            "1")

        .First()

        .Attribute(w + "styleId");

 

An Extension Method on MainDocumentPart 

 

To retrieve the paragraphs of a document, I implemented an extension method on the MainDocumentPart. The signature is:

 

public static IEnumerable<Paragraph> Paragraphs(

  this MainDocumentPart mainDocument)

 

Using the Open XML SDK, you navigate to the MainDocumentPart. Once you have it, you can call the Paragraphs method:

 

using (WordprocessingDocument doc =

    WordprocessingDocument.Open(filename, false))

{

    foreach (var p in doc.MainDocumentPart.Paragraphs())

    {

        Console.WriteLine("Style: {0} Text: >{1}<",

            p.StyleName.PadRight(16), p.Text);

        foreach (var c in p.Comments())

            Console.WriteLine(

              " Comment Author:{0} Text:>{1}<",

              c.Author, c.Text);

    }

}

 

The Paragraphs method returns a collection of Paragraph objects. The Paragraph class contains some public properties, such as Text and Style. It also contains a method, Comments, which returns a collection of comments for each paragraph. (A paragraph, of course, can have more than one comment.) It also contains a public property ParagraphElement, which contains the <w:p> XElement for the paragraph. The code takes care of the housekeeping necessary to aggregate multiple <w:t> text nodes into a single string containing the text of the paragraph.

 

If you run this code on the small document that is attached to this post, you see:

 

Printing Paragraphs

===================

Style: Normal Text: >Paragraph 1<

  Comment Author:Eric White (DPE) Text:>Comment 1<

  Comment Author:Eric White (DPE) Text:>Comment 2<

Style: Normal Text: >Paragraph 2<

Style: Normal Text: >Paragraph 3<

  Comment Author:Eric White (DPE) Text:>Comment 3<

Style: Normal Text: >Paragraph 4<

 

There is also an extension method on the MainDocumentPart to retrieve the default style for the document. Its signature:

 

public static string DefaultStyle(this MainDocumentPart mainDocument)

 

The Paragraphs method uses this extension method, but you are free to use it too.

 

The code to retrieve the paragraphs and comments for a Wordprocessing document is 88 lines long.

 

An Extension Method on WorksheetPart 

 

To retrieve the rows and cells of a spreadsheet, I implemented an extension method on the WorksheetPart. The signature is:

 

public static IEnumerable<Row> Rows(

  this WorksheetPart worksheetPart)

 

The Row class contains a method, Cells, which returns a collection of the cells for the row. Its signature:

 

public IEnumerable<Cell> Cells()

 

Using the Open XML SDK, you navigate to the WorksheetPart. Once you have it, you can iterate through the rows, and for each row, you can iterate through the cells in the row:

 

Console.WriteLine("Contents of Spreadsheet");

Console.WriteLine("=======================");

using (SpreadsheetDocument doc =

    SpreadsheetDocument.Open(filename, false))

{

    WorksheetPart worksheet =

        (WorksheetPart)doc.WorkbookPart.GetPartById("rId1");

 

    foreach (var row in worksheet.Rows())

    {

        Console.WriteLine(" RowId:{0}", row.RowId);

        Console.WriteLine(" Spans:{0}", row.Spans);

        foreach (var cell in row.Cells())

        {

            Console.WriteLine(" Column:{0}", cell.Column);

            Console.WriteLine(" ColumnId:{0}", cell.ColumnId);

            if (cell.Type != null)

                Console.WriteLine(" Type:{0}", cell.Type);

            if (cell.Value != null)

                Console.WriteLine(" Value:{0}", cell.Value);

            if (cell.Formula != null)

                Console.WriteLine(" Formula:>{0}<", cell.Formula);

            if (cell.SharedString != null)

               Console.WriteLine(" SharedString:>{0}<", cell.SharedString);

        }

    }

    Console.WriteLine();

}

 

When run on the small spreadsheet attached to this post, you see:

 

Contents of Spreadsheet

=======================

  RowId:1

  Spans:1:3

    Column:A1

      ColumnId:A

      Value:1

    Column:B1

      ColumnId:B

      Value:2

    Column:C1

      ColumnId:C

      Value:3

  RowId:2

  Spans:1:3

    Column:A2

      ColumnId:A

      Type:s

      Value:0

      SharedString:>a<

    Column:B2

      ColumnId:B

      Type:s

      Value:1

      SharedString:>b<

    Column:C2

      ColumnId:C

      Type:s

      Value:2

      SharedString:>c<

 

If you want to retrieve just the cells for a specific column, you can just tack on a Where method call to the Cells method call:

 

WorksheetPart worksheet =

    (WorksheetPart)doc.WorkbookPart.GetPartById("rId1");

 

foreach (var row in worksheet.Rows())

{

    Console.WriteLine(" RowId:{0}", row.RowId);

    Console.WriteLine(" Spans:{0}", row.Spans);

    foreach (var cell in row.Cells().Where(c => c.ColumnId == "B"))

    {

        Console.WriteLine(" Column:{0}", cell.Column);

        Console.WriteLine(" ColumnId:{0}", cell.ColumnId);

        if (cell.Type != null)

            Console.WriteLine(" Type:{0}", cell.Type);

        if (cell.Value != null)

            Console.WriteLine(" Value:{0}", cell.Value);

        if (cell.Formula != null)

            Console.WriteLine(" Formula:>{0}<", cell.Formula);

        if (cell.SharedString != null)

            Console.WriteLine(" SharedString:>{0}<", cell.SharedString);

    }

}

 

The code to get rows and cells for spreadsheets is 69 lines long. This meets my definition of “short”.

LtxOpenXml.zip