Validate Open XML Documents using the Open XML SDK 2.0

Open XML developers create new documents in a variety of ways: by transforming an existing document into a new one, or by programmatically altering an existing document and saving it back to disk.  In either case, you can use the Open XML SDK 2.0 to determine whether the new or altered document, spreadsheet, or presentation contains invalid markup.

Validation was particularly useful when I was writing the code to accept tracked revisions, and the Open XML WordprocessingML markup simplifier.  I wrote a small program that iterates through all documents in a directory tree, programmatically alters or transforms each document, and then validates the result.  This allowed me to run the code on thousands of documents and confirm that it did not create invalid documents.
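
A directory-tree run like the one described above might look roughly like the following.  This is only a sketch, not the program I actually used: the BatchValidator class and FindInvalidDocuments method are hypothetical names, it handles only .docx files, and the step that alters or transforms each document is left as a comment.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Validation;

public static class BatchValidator
{
    // Hypothetical helper: returns the paths of all .docx files under
    // rootDirectory that either fail validation or cannot be opened.
    public static List<string> FindInvalidDocuments(string rootDirectory)
    {
        List<string> invalid = new List<string>();
        OpenXmlValidator validator = new OpenXmlValidator();
        foreach (string path in Directory.GetFiles(
            rootDirectory, "*.docx", SearchOption.AllDirectories))
        {
            try
            {
                using (WordprocessingDocument wordDoc =
                    WordprocessingDocument.Open(path, false))
                {
                    // The code that alters or transforms the document would
                    // go here (open with 'true' instead of 'false' if it
                    // modifies the document in place).
                    if (validator.Validate(wordDoc).Any())
                        invalid.Add(path);
                }
            }
            catch (Exception)
            {
                // Deliberately broad for a sketch: a file that cannot be
                // opened as a package is reported along with invalid ones.
                invalid.Add(path);
            }
        }
        return invalid;
    }
}
```

Calling BatchValidator.FindInvalidDocuments on the root of a test corpus returns the files that need a closer look; an empty list means every document opened and validated cleanly.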

Using the validator is simple:

  • Open your document/spreadsheet/presentation as usual using the Open XML SDK.
  • Instantiate an OpenXmlValidator object (from the DocumentFormat.OpenXml.Validation namespace).
  • Call the OpenXmlValidator.Validate method, passing the open document.  This method returns a collection of ValidationErrorInfo objects.  If the collection is empty, then the document is valid.  You can validate before and after modifying the document.

Here is the simplest code to validate a document.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Validation;
using DocumentFormat.OpenXml.Wordprocessing;

class Program
{
    static void Main(string[] args)
    {
        using (WordprocessingDocument wordDoc =
            WordprocessingDocument.Open("Test.docx", false))
        {
            OpenXmlValidator validator = new OpenXmlValidator();
            var errors = validator.Validate(wordDoc);
            if (errors.Count() == 0)
                Console.WriteLine("Document is valid");
            else
                Console.WriteLine("Document is not valid");
        }
    }
}
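
The same pattern applies to spreadsheets and presentations, because the Validate method accepts any open package.  As a rough sketch (the PackageValidation class and its method names are hypothetical, and choosing the document class by file extension is an assumption made for illustration), a helper could look like this:

```csharp
using System;
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Validation;

public static class PackageValidation
{
    // Hypothetical helper: opens the file read-only using the SDK class
    // that matches its extension.
    public static OpenXmlPackage OpenReadOnly(string path)
    {
        switch (Path.GetExtension(path).ToLowerInvariant())
        {
            case ".docx":
                return WordprocessingDocument.Open(path, false);
            case ".xlsx":
                return SpreadsheetDocument.Open(path, false);
            case ".pptx":
                return PresentationDocument.Open(path, false);
            default:
                throw new NotSupportedException(
                    "Unsupported file extension: " + path);
        }
    }

    // Returns true if the validator reports no errors for the package.
    public static bool IsValid(string path)
    {
        using (OpenXmlPackage package = OpenReadOnly(path))
        {
            OpenXmlValidator validator = new OpenXmlValidator();
            return !validator.Validate(package).Any();
        }
    }
}
```

With this in place, PackageValidation.IsValid("Test.docx") gives the same yes/no answer as the program above, and the same call works for an .xlsx or .pptx file.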

While debugging your code, it is helpful to know exactly where each error is.  You can iterate through the errors, printing:

  • The content type for the part that contains the error.
  • An XPath expression that identifies the element that caused the error.
  • An error message.

Here is code to do that:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Validation;
using DocumentFormat.OpenXml.Wordprocessing;

class Program
{
    static void Main(string[] args)
    {
        using (WordprocessingDocument wordDoc =
            WordprocessingDocument.Open("Test.docx", false))
        {
            OpenXmlValidator validator = new OpenXmlValidator();
            var errors = validator.Validate(wordDoc);
            if (errors.Count() == 0)
                Console.WriteLine("Document is valid");
            else
                Console.WriteLine("Document is not valid");
            Console.WriteLine();
            foreach (var error in errors)
            {
                Console.WriteLine("Error description: {0}", error.Description);
                Console.WriteLine("Content type of part with error: {0}",
                    error.Part.ContentType);
                Console.WriteLine("Location of error: {0}", error.Path.XPath);
            }
        }
    }
}
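
When a document produces many errors, one line per error is easier to scan or to write to a log.  The following is a sketch only: the ErrorReport class and the line layout are my own invention for illustration, and the null checks are a precaution in case an error is not tied to a specific part or element.  It also prints the ErrorType and Id properties of ValidationErrorInfo.

```csharp
using System;
using System.Collections.Generic;
using DocumentFormat.OpenXml.Validation;

public static class ErrorReport
{
    // Hypothetical helper: formats each validation error as a single line
    // containing the part URI, the XPath, the error type, the error id,
    // and the description.
    public static List<string> Describe(IEnumerable<ValidationErrorInfo> errors)
    {
        List<string> lines = new List<string>();
        foreach (ValidationErrorInfo error in errors)
        {
            // Precaution: fall back to a placeholder if the error carries
            // no part or no path.
            string partUri = error.Part != null
                ? error.Part.Uri.ToString()
                : "(no part)";
            string xpath = error.Path != null
                ? error.Path.XPath
                : "(no path)";
            lines.Add(string.Format("{0} | {1} | {2} | {3} | {4}",
                partUri, xpath, error.ErrorType, error.Id, error.Description));
        }
        return lines;
    }
}
```

In the program above, the foreach loop could then be replaced by printing each string returned from ErrorReport.Describe(errors).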

As a developer, you will want to open a document, modify it in some fashion, and then validate that your modifications were correct.  The following example opens a document for writing, modifies it to make it invalid, and then validates.  To make an invalid document, it adds a text element (w:t) as a child element of a paragraph (w:p) instead of a run (w:r).

This approach to document validation works if you are using the Open XML SDK strongly-typed object model.  It also works if you are using another XML programming technology, such as LINQ to XML.  The following example shows the document modification code written using two approaches.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Validation;
using DocumentFormat.OpenXml.Wordprocessing;

public static class MyExtensions
{
    public static XDocument GetXDocument(this OpenXmlPart part)
    {
        XDocument partXDocument = part.Annotation<XDocument>();
        if (partXDocument != null)
            return partXDocument;
        using (Stream partStream = part.GetStream())
        using (XmlReader partXmlReader = XmlReader.Create(partStream))
            partXDocument = XDocument.Load(partXmlReader);
        part.AddAnnotation(partXDocument);
        return partXDocument;
    }

    public static void PutXDocument(this OpenXmlPart part)
    {
        XDocument partXDocument = part.GetXDocument();
        if (partXDocument != null)
        {
            using (Stream partStream = part.GetStream(FileMode.Create, FileAccess.Write))
            using (XmlWriter partXmlWriter = XmlWriter.Create(partStream))
                partXDocument.Save(partXmlWriter);
        }
    }
}

class Program
{
    static void Main(string[] args)
    {
        using (WordprocessingDocument wordDoc =
            WordprocessingDocument.Open("Test.docx", true))
        {
            // Open XML SDK strongly-typed object model code that modifies a document,
            // making it invalid.
            wordDoc.MainDocumentPart.Document.Body.InsertAt(
                new Paragraph(
                    new Text("Test")), 0);

            // LINQ to XML code that modifies a document, making it invalid.
            // The two approaches are alternatives; in your own code, use one
            // or the other on a given part, not both.
            XDocument d = wordDoc.MainDocumentPart.GetXDocument();
            XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
            d.Descendants(w + "body").First().AddFirst(
                new XElement(w + "p",
                    new XElement(w + "t", "Test")));
            wordDoc.MainDocumentPart.PutXDocument();

            OpenXmlValidator validator = new OpenXmlValidator();
            var errors = validator.Validate(wordDoc);
            if (errors.Count() == 0)
                Console.WriteLine("Document is valid");
            else
                Console.WriteLine("Document is not valid");
            Console.WriteLine();
            foreach (var error in errors)
            {
                Console.WriteLine("Error description: {0}", error.Description);
                Console.WriteLine("Content type of part with error: {0}",
                    error.Part.ContentType);
                Console.WriteLine("Location of error: {0}", error.Path.XPath);
            }
        }
    }
}

When you run this example, it produces the following output:

Document is not valid

Error description: The element has invalid child element
'http://schemas.openxmlformats.org/wordprocessingml/2006/main:t'.
List of possible elements expected:
<http://schemas.openxmlformats.org/wordprocessingml/2006/main:pPr>.
Content type of part with error:
application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml
Location of error: /w:document[1]/w:body[1]/w:p[1]
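
To see the validator report the problem and then confirm a fix, you can validate, repair, and validate again.  The following sketch makes several assumptions: it builds a throwaway document in memory rather than opening Test.docx, the RepairSketch class and its method are hypothetical names, and the repair simply wraps each stray text element in a new run.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Validation;
using DocumentFormat.OpenXml.Wordprocessing;

public static class RepairSketch
{
    // Hypothetical helper: returns { errors before repair, errors after repair }
    // for an in-memory document whose only paragraph has a w:t child.
    public static int[] ErrorCountsBeforeAndAfterRepair()
    {
        using (MemoryStream stream = new MemoryStream())
        using (WordprocessingDocument wordDoc = WordprocessingDocument.Create(
            stream, WordprocessingDocumentType.Document))
        {
            // Same invalid markup as in the example above: w:t directly in w:p.
            MainDocumentPart mainPart = wordDoc.AddMainDocumentPart();
            mainPart.Document = new Document(
                new Body(
                    new Paragraph(
                        new Text("Test"))));

            OpenXmlValidator validator = new OpenXmlValidator();
            int before = validator.Validate(wordDoc).Count();

            // Repair: move every w:t whose parent is a w:p into a new w:r
            // placed at the same position.
            List<Text> strayTexts = mainPart.Document
                .Descendants<Text>()
                .Where(t => t.Parent is Paragraph)
                .ToList();
            foreach (Text text in strayTexts)
            {
                Run run = new Run();
                text.InsertBeforeSelf(run);
                text.Remove();
                run.AppendChild(text);
            }

            int after = validator.Validate(wordDoc).Count();
            return new int[] { before, after };
        }
    }
}
```

The first count should be greater than zero, matching the error shown in the output above, and the second should be zero once the text element sits inside a run.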