Using Annotations to Transform LINQ to XML Trees in an XSLT Style (Improved Approach)

You can use LINQ to XML to transform XML trees with the same level of power and expressiveness as XSLT, and in many cases with more.


One of the reasons that XSL is so powerful is that you can write multiple rules to transform a node.  The rule that most specifically matches is the one that is applied.

To make this clear, consider the following source document:

<Parent>
  <Heading>Heading 1 text</Heading>
  <Heading>Heading 2 text</Heading>
</Parent>

We can specify a transform like this:

 1 <?xml version='1.0'?>
 2 <xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>
 3   <xsl:template match='/Parent'>
 4     <Root>
 5       <xsl:apply-templates/>
 6     </Root>
 7   </xsl:template>
 8   <xsl:template match='Heading[1]'>
 9     <SpecialHeading>
10       <xsl:value-of select='.'/>
11     </SpecialHeading>
12   </xsl:template>
13   <xsl:template match='Heading'>
14     <H1>
15       <xsl:value-of select='.'/>
16     </H1>
17   </xsl:template>
18 </xsl:stylesheet>

When this stylesheet is applied to the source document, we see:

<Root>
  <SpecialHeading>Heading 1 text</SpecialHeading>
  <H1>Heading 2 text</H1>
</Root>

The template that starts on line 8 is the one applied to the first <Heading> element, even though the template that starts on line 13 also matches it.  The pattern on line 8 is more specific, so it takes precedence.  This is the power of XSL – you apply transforms to nodes based on the patterns they match, and the specificity of each pattern is significant.  This allows you to write powerful transformations where you first handle exception cases, and then impose rules that handle all other cases in a general way.
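To see this precedence from C#, the stylesheet above can be run with XslCompiledTransform.  The following is a minimal sketch rather than code from the attachment: the class name XsltPriorityDemo and its Run method are made up for illustration, and the source document and stylesheet are inlined as strings.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Xsl;

// Hypothetical demo class; only the XSLT and LINQ to XML calls are real APIs.
public static class XsltPriorityDemo
{
    const string Stylesheet = @"<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>
  <xsl:template match='/Parent'><Root><xsl:apply-templates/></Root></xsl:template>
  <xsl:template match='Heading[1]'><SpecialHeading><xsl:value-of select='.'/></SpecialHeading></xsl:template>
  <xsl:template match='Heading'><H1><xsl:value-of select='.'/></H1></xsl:template>
</xsl:stylesheet>";

    public static string Run()
    {
        var source = XDocument.Parse(
            "<Parent><Heading>Heading 1 text</Heading><Heading>Heading 2 text</Heading></Parent>");

        // Compile the stylesheet.
        var xslt = new XslCompiledTransform();
        using (XmlReader reader = XmlReader.Create(new StringReader(Stylesheet)))
            xslt.Load(reader);

        // Write the result of the transform directly into a new LINQ to XML tree.
        var result = new XDocument();
        using (XmlWriter writer = result.CreateWriter())
            xslt.Transform(source.CreateReader(), writer);

        return result.Root.ToString(SaveOptions.DisableFormatting);
    }
}
```

Calling XsltPriorityDemo.Run() should return the <Root> element with the first heading as <SpecialHeading> and the second as <H1>, serialized on a single line.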

Another reason that XSL is so powerful is that you can apply a transformation to a specific node, and use the <xsl:apply-templates> element to indicate that child nodes should be transformed per their own rules.

If we have this source document:

<Parent>
  <Heading>
    <Text>This is some text</Text>
  </Heading>
</Parent>

And transform it with this stylesheet:

<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>
  <xsl:template match='/Parent'>
    <Root>
      <xsl:apply-templates/>
    </Root>
  </xsl:template>
  <xsl:template match='Heading'>
    <H1>
      <xsl:apply-templates/>
    </H1>
  </xsl:template>
  <xsl:template match='Text'>
    <t>
      <xsl:value-of select='.'/>
    </t>
  </xsl:template>
</xsl:stylesheet>

It results in this XML:

<Root>
  <H1>
    <t>This is some text</t>
  </H1>
</Root>

We were able to specify separate transforms for the Heading and Text elements, and by using the <xsl:apply-templates> element, the template to transform the heading doesn't have to concern itself with transforms of the child Text element.

Some time ago, I blogged on a technique for using annotations to transform LINQ to XML trees in this same style – the style of XSLT.  Ever since that time, I've been mulling over the approach, thinking about how to improve it.  This post summarizes and shows my current thoughts about this approach to performing document-centric transformations using LINQ to XML.

The example presented here transforms an Open XML word processing document to XHTML in less than 100 lines of code (not counting the infrastructure code that enables the transformation).  The transformation that I present converts paragraphs styled as Heading1 and Heading2 to h1 and h2 nodes, and also handles hyperlinks and bold text.  It even includes a rudimentary transformation of a Word table to an XHTML table.

Note: all of the code mentioned in this post is attached to this page.

Here is the Word document that I transform:

Here is the rendering of the resulting XHTML:

The code presented here is not a complete, full fidelity transform.  However, it will serve to demonstrate the technique that I'm presenting here.

Note: I have plans to enhance this code (over time) so that this transformation is more complete. In particular, I plan on enhancing this code so that I can transform a DOCX into XHTML for my blog posts. I'd really like code presented in a blog post to have an automatically inserted "Copy Code" button above each code snippet.

The code presented here has the following features:

  • It allows for transformation of elements, text nodes, and comment nodes.  To transform attributes, you specify a transform for the parent element node.
  • It allows for deletion of nodes that you don't care about.  You can specify that a node not be carried over into the new tree.
  • It supports "mode", à la XSL transforms.  This allows you to define multiple transformations of a single tree, and then perform the transform separately for each mode.  You may want to have one transformation that produces a "table of contents" for the transformed document, and another that transforms the main document part.  You then assemble the results of both transforms into a single document that contains both the table of contents and the contents of the document.  This is a common technique in XSL transformations.
  • It contains a model for specifying the equivalent of the <xsl:apply-templates> element.  With the approach presented here, you can specify the nodes to transform – the equivalent of specifying the "select" attribute of the <xsl:apply-templates> element.  In XSL, you specify the select attribute using an XPath expression; using this approach, you specify the TransformSelect property of the ApplyTransforms class using a LINQ expression.

Document-Centric Transforms 

Some XML documents are "document-centric".  With such documents, you don't necessarily know the shape of the child nodes of an element.  For instance, an element that contains text may look like this:

<text>A phrase with <b>bold</b> and <i>italic</i> text.</text>

For any given <text> element, there may be any number of child <b> and <i> elements, interleaved with text.
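As a small illustration of what this mixed content looks like in LINQ to XML, the sketch below lists the child nodes of that <text> element.  The class name MixedContentDemo is hypothetical; Parse and Nodes are the standard LINQ to XML calls.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical demo class showing how mixed content appears as child nodes.
public static class MixedContentDemo
{
    public static string[] Run()
    {
        XElement text = XElement.Parse(
            "<text>A phrase with <b>bold</b> and <i>italic</i> text.</text>");

        // Nodes() returns text nodes and elements in document order.
        return text.Nodes()
            .Select(n => n is XElement e ? "element:" + e.Name.LocalName : "text")
            .ToArray();
    }
}
```

The element has five child nodes: three text nodes with a <b> and an <i> element between them.  Code that assumes a fixed sequence of children cannot handle this in general, which is why rule-based transformation is a good fit.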

Open XML documents contain document-centric markup.  For example, the body of the document can contain any number of paragraphs; tables are siblings to paragraphs; each paragraph can contain any number of formatted text runs; hyperlinks are expressed as sibling elements to text run elements.  One of the primary characteristics of document-centric XML is that you do not know exactly which child elements any particular element will have, or in what order they will appear.

If you want to transform nodes in a tree where you don't necessarily know which particular children an element may have, then this approach that uses annotations is an effective approach.  This approach allows you to specify the transformation in a minimum amount of code.

Overview of the Approach 

The summary of the approach is:

  • First, annotate nodes in the tree with an object of type TransformAnnotation (a type introduced in the code attached to this page).  The TransformAnnotation.Replace property contains the new, transformed node.  If TransformAnnotation.Replace == null, then the node is removed from the transformed tree.  TransformAnnotation.Mode contains a string that specifies the transform mode, analogous to mode in XSL.
  • Second, your code calls a function that iterates through the entire tree, creating a new tree where each annotated node is replaced with the node specified in its TransformAnnotation.Replace property.  The code presented here implements the iteration and creation of the new tree in an extension method on XNode named Transform.  This is a pretty simple method – only about 115 lines long.

In detail, the approach consists of:

  • Execute one or more LINQ to XML queries that return the set of nodes that you want to transform from one shape to another.  For each node in the query, add a new TransformAnnotation object as an annotation to the node.  The TransformAnnotation object contains a node that will replace the annotated node in the new, transformed tree.
  • For convenience, I've defined some extension methods on XNode (TransformRemove, and TransformReplace) that add the appropriate annotation.  This results in cleaner code that specifies the rules of the transformation.
  • The new element (contained in a property of the annotation) can contain new child nodes; it can form a sub-tree with any desired shape.
  • You can add a "pseudo node" as a child node of the replacement node.  This pseudo node tells the transform code to apply further transformations.  It serves the same purpose as the <xsl:apply-templates> element in an XSL sequence constructor.  To allow inserting the pseudo node into the children of the element, the ApplyTransforms class derives from the XText class.  (Deriving from XText is artificial.  I would have derived from XNode, but XNode contains an internal abstract method, CloneNode, which prevents derivation outside of the assembly.)  This special node isn't transformed into the new tree.  Instead, it indicates to the transform code that further transformations should be performed and that their results should be inserted into the new tree at that position.  ApplyTransforms contains a property TransformSelect of type IEnumerable<XNode>.  If TransformSelect is not null, then the nodes in the TransformSelect collection are transformed and inserted.  This allows us to write a query that evaluates to a collection of descendant nodes that should be transformed recursively.  Alternatively, if TransformSelect is null, then the child nodes of the source element are iterated, transformed, and inserted.  ApplyTransforms also contains a string property, Mode.  As mentioned previously, Mode serves the same purpose as in XSL.

This is analogous to the specification of transforms in XSL.  The query that selects a set of nodes is analogous to the XPath expression for a template.  The code to create the new node in TransformAnnotation.Replace is analogous to the sequence constructor in XSL, and as mentioned, the ApplyTransforms node is analogous in function to the <xsl:apply-templates> element in XSL.
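The pieces described above can be sketched in a few dozen lines.  This is not the code attached to this page: the type and member names follow the post, but every body below is my own assumption about how such infrastructure could work, and the Demo class is purely illustrative.

```csharp
#nullable disable
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Sketch only: names mirror the post, implementations are assumptions.
public class TransformAnnotation
{
    public XNode Replace { get; set; }  // null means "remove the annotated node"
    public string Mode { get; set; }    // null is the default mode
}

// Placeholder that marks where transformed source nodes are inserted.
public class ApplyTransforms : XText
{
    public IEnumerable<XNode> TransformSelect { get; }
    public string Mode { get; }

    public ApplyTransforms(IEnumerable<XNode> select = null, string mode = null) : base("")
    {
        TransformSelect = select;
        Mode = mode;
    }
}

public static class TransformExtensions
{
    public static void TransformReplace(this XNode node, XNode replacement, string mode = null) =>
        node.AddAnnotation(new TransformAnnotation { Replace = replacement, Mode = mode });

    public static void TransformRemove(this XNode node, string mode = null) =>
        node.AddAnnotation(new TransformAnnotation { Replace = null, Mode = mode });

    public static XNode Transform(this XNode node, string mode = null)
    {
        // The first annotation added for this mode wins; later ones are ignored.
        var rule = node.Annotations<TransformAnnotation>().FirstOrDefault(a => a.Mode == mode);
        if (rule != null)
            return rule.Replace == null ? null : Expand(rule.Replace, node);

        // No rule: copy an element and transform its children.  Other nodes pass
        // through (XElement clones a node that already has a parent when adding it).
        if (node is XElement e)
            return new XElement(e.Name, e.Attributes(), e.Nodes().Select(n => n.Transform(mode)));
        return node;
    }

    // Copies the replacement template, swapping each ApplyTransforms placeholder
    // for the transformed nodes it selects (default: children of the source node).
    private static XNode Expand(XNode template, XNode source)
    {
        if (!(template is XElement t)) return template;
        return new XElement(t.Name, t.Attributes(), t.Nodes().SelectMany(n =>
            n is ApplyTransforms at
                ? (at.TransformSelect ?? ChildNodes(source)).Select(c => c.Transform(at.Mode))
                : new[] { Expand(n, source) }));
    }

    private static IEnumerable<XNode> ChildNodes(XNode node) =>
        node is XContainer c ? c.Nodes() : Enumerable.Empty<XNode>();
}

// Hypothetical usage: a specific rule, a general rule, and a removal.
public static class Demo
{
    public static string Run()
    {
        XElement doc = XElement.Parse("<doc><h>One</h><h>Two</h><x>drop</x></doc>");
        doc.Elements("h").First().TransformReplace(new XElement("Special", new ApplyTransforms()));
        foreach (XElement h in doc.Elements("h"))
            h.TransformReplace(new XElement("H1", new ApplyTransforms()));
        doc.Element("x").TransformRemove();
        return doc.Transform().ToString(SaveOptions.DisableFormatting);
    }
}
```

In this sketch, Demo.Run() turns the first <h> into <Special>, the second into <H1>, and drops <x>.  Nodes without any annotation are copied, with their children transformed in turn.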

One primary advantage of this approach is that, as you formulate queries, you are always writing queries on the unmodified source tree.  You don't need to concern yourself with how modifications to the tree affect the queries that you are writing.

Another primary advantage of this approach is that you can specify that any node found throughout the source tree be transformed according to the specified rule without concerning yourself with the specific child nodes of the node.  Those child nodes can have their own rules that specify their transformation.

As mentioned at the top of this post, in XSL, it's possible to define multiple rules that apply to any specific node.  The semantics of XSL specify that the most specific match found is the transform that is applied.  This allows you to define very specific transforms for certain nodes.  You can then define a more general transform that applies in all other cases.  The approach presented here has analogous semantics – the first annotation added is the one that is used for the transform.  You can add other annotations to the node, but the subsequent annotations are simply ignored by the transformation.  The first annotation added is the effective one.
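The "first annotation wins" behavior falls out of how LINQ to XML annotations work: Annotation<T>() returns the first annotation of the requested type.  The sketch below shows this with a made-up marker type (Rule) rather than the TransformAnnotation class from the attachment.

```csharp
#nullable disable
using System.Xml.Linq;

// Hypothetical demo; Rule is a stand-in annotation type used only here.
public static class FirstAnnotationDemo
{
    private sealed class Rule
    {
        public string Name;
    }

    public static string Run()
    {
        var heading = new XElement("heading", "Overview of the Technique");

        heading.AddAnnotation(new Rule { Name = "SpecialHeading" }); // specific rule, added first
        heading.AddAnnotation(new Rule { Name = "H1" });             // general rule, added later

        // Both annotations are stored, but the first one added is returned.
        return heading.Annotation<Rule>().Name;
    }
}
```

FirstAnnotationDemo.Run() returns "SpecialHeading".  This is why the specific rules must be registered before the general ones.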

The following is a simple example that shows how to transform a tree.  It uses a special rule to transform the first heading to the element <SpecialHeading>.  Other heading elements are transformed to <H1> elements.  This demonstrates that the transform that we specified for the first heading takes precedence over transforms that were subsequently specified.

Example 1:

XElement sourceDocument = XElement.Parse(
    @"<document>
        <body>
          <heading>Overview of the Technique</heading>
          <t>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</t>
          <heading>The Technique in Detail</heading>
          <t>Nunc viverra imperdiet enim. Fusce est. Vivamus a tellus.</t>
          <heading>Summary</heading>
          <t>Pellentesque habitant morbi tristique.</t>
        </body>
      </document>");

// transform body to Body
sourceDocument
    .Element("body")
    .TransformReplace(new XElement("Body", new ApplyTransforms()));

// transform the first heading in a special way
sourceDocument
    .Descendants("heading")
    .First()
    .TransformReplace(new XElement("SpecialHeading", new ApplyTransforms()));

// transform heading to H1 (ignored for the first heading, which already has a rule)
foreach (var item in sourceDocument.Descendants("heading"))
    item.TransformReplace(new XElement("H1", new ApplyTransforms()));

Console.WriteLine(sourceDocument.Transform());

This example produces the following output:

<document>
  <Body>
    <SpecialHeading>Overview of the Technique</SpecialHeading>
    <t>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</t>
    <H1>The Technique in Detail</H1>
    <t>Nunc viverra imperdiet enim. Fusce est. Vivamus a tellus.</t>
    <H1>Summary</H1>
    <t>Pellentesque habitant morbi tristique.</t>
  </Body>
</document>

The following example demonstrates the use of modes.  It uses the same source document as the above example.  It defines two transforms, one where the mode = "TOC", which transforms the document into a table of contents.  The second transform passes no argument to the Transform method, which means that it matches when mode = null.  This transforms the document into a different form for the body of the new document.

Example 2:

XElement sourceDocument = XElement.Parse(
    @"<document>
        <body>
          <heading>Overview of the Technique</heading>
          <t>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</t>
          <heading>The Technique in Detail</heading>
          <t>Nunc viverra imperdiet enim. Fusce est. Vivamus a tellus.</t>
          <heading>Summary</heading>
          <t>Pellentesque habitant morbi tristique.</t>
        </body>
      </document>");

// define the root transform for the table of contents
sourceDocument.TransformReplace(
    new XElement("TableOfContents",
        new ApplyTransforms(sourceDocument.Element("body").Elements("heading"), "TOC")),
    "TOC");

// define the transform of each heading element for the table of contents
foreach (var item in sourceDocument.Descendants("heading"))
{
    item.TransformReplace(new XElement("TocItem", (string)item), "TOC");
}

// define the transform of the document body
sourceDocument.Element("body").TransformReplace(
    new XElement("Body",
        new ApplyTransforms(sourceDocument.Element("body").Elements())
    )
);

// define the transforms of heading elements for the document body
foreach (var item in sourceDocument.Descendants("heading"))
{
    item.TransformReplace(new XElement("H1", new ApplyTransforms()));
}

// define the transforms of t elements for the document body
foreach (var item in sourceDocument.Descendants("t"))
{
    item.TransformReplace(new XElement("Text", new ApplyTransforms()));
}

// assemble the new document with both TOC and body
XElement newDoc = new XElement("Root",
    sourceDocument.Transform("TOC"),
    sourceDocument.Element("body").Transform()
);

Console.WriteLine(newDoc);

This example produces:

<Root>
  <TableOfContents>
    <TocItem>Overview of the Technique</TocItem>
    <TocItem>The Technique in Detail</TocItem>
    <TocItem>Summary</TocItem>
  </TableOfContents>
  <Body>
    <H1>Overview of the Technique</H1>
    <Text>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</Text>
    <H1>The Technique in Detail</H1>
    <Text>Nunc viverra imperdiet enim. Fusce est. Vivamus a tellus.</Text>
    <H1>Summary</H1>
    <Text>Pellentesque habitant morbi tristique.</Text>
  </Body>
</Root>

The final example presented here transforms an Open XML document into XHTML.  It defines a number of transforms by annotating a variety of nodes.  At the end, it adds annotations to every node in the tree indicating that the node should be deleted from the transformed tree.  But this rule that deletes nodes is ignored for all nodes that have already been annotated.

Note: this code uses the Open XML SDK, which is available here.
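The DocxToHtml listing also calls two helper extension methods, GetXDocument and StringConcatenate, that come from the attached code and are not shown in this post.  As a rough guide only, StringConcatenate could look something like the following; this body is my assumption, not the attached implementation.

```csharp
using System.Collections.Generic;
using System.Text;

// Assumed shape of the helper: joins a sequence of strings with no separator.
public static class StringExtensions
{
    public static string StringConcatenate(this IEnumerable<string> source)
    {
        var sb = new StringBuilder();
        foreach (string s in source)
            sb.Append(s);
        return sb.ToString();
    }
}
```

With a helper like this, the text of all w:t elements in a run can be collapsed into a single string, which is how the listing uses it.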

DocxToHtml:

using (WordprocessingDocument wordDoc = WordprocessingDocument.Open("Test.docx", true))
{
    XDocument doc = wordDoc.MainDocumentPart.GetXDocument();

    XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    XNamespace r = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";
    XNamespace h = "http://www.w3.org/1999/xhtml";

    // transform the document root element to the XHTML root element
    doc.Root.TransformReplace(
        new XElement(h + "html",
            new XElement(h + "head",
                new XElement(h + "title", "Test.docx")
            ),
            new ApplyTransforms(doc.Root.Elements(w + "body"))
        )
    );

    // transform the w:body element to the XHTML h:body element
    doc.Element(w + "document").Element(w + "body").TransformReplace(
        new XElement(h + "body", new ApplyTransforms()));

    // transform every hyperlink in the document to the XHTML h:A element
    foreach (var item in doc.Descendants(w + "hyperlink"))
    {
        item.TransformReplace(
            new XElement(h + "A",
                new XAttribute("href",
                    wordDoc.MainDocumentPart
                        .ExternalRelationships
                        .Where(x => x.Id == (string)item.Attribute(r + "id"))
                        .First()
                        .Uri
                ),
                new XText(item.Elements(w + "r")
                    .Elements(w + "t")
                    .Select(s => (string)s).StringConcatenate())
            )
        );
    }

    // transform every Heading1 styled paragraph to the XHTML h:h1 element
    foreach (var item in doc.Descendants(w + "p")
        .Where(z => (string)z.Elements(w + "pPr")
            .Elements(w + "pStyle")
            .Attributes(w + "val")
            .FirstOrDefault() == "Heading1"))
    {
        item.TransformReplace(new XElement(h + "h1", new ApplyTransforms()));
    }

    // transform every Heading2 styled paragraph to the XHTML h:h2 element
    foreach (var item in doc.Descendants(w + "p")
        .Where(z => (string)z.Elements(w + "pPr")
            .Elements(w + "pStyle")
            .Attributes(w + "val")
            .FirstOrDefault() == "Heading2"))
    {
        item.TransformReplace(new XElement(h + "h2", new ApplyTransforms()));
    }

    // transform every text run that is styled as bold to the XHTML h:b element
    foreach (var item in doc.Descendants(w + "r")
        .Where(z => z.Elements(w + "rPr").Elements(w + "b").Any()))
    {
        item.TransformReplace(
            new XElement(h + "b",
                item.Elements(w + "t")
                    .Select(e => (string)e).StringConcatenate()));
    }

    // transform every text run that is not styled as bold to a text node that contains the
    // text of the run.
    foreach (var item in doc.Descendants(w + "r")
        .Where(z => !z.Elements(w + "rPr").Elements(w + "b").Any()))
    {
        item.TransformReplace(
            new XText(item.Elements(w + "t").Select(e => (string)e).StringConcatenate()));
    }

    // transform w:p to h:p (ignored for paragraphs that already have a heading rule)
    foreach (var item in doc.Descendants(w + "p"))
    {
        item.TransformReplace(new XElement(h + "p", new ApplyTransforms()));
    }

    // transform w:tbl to h:table
    foreach (var item in doc.Descendants(w + "tbl"))
    {
        item.TransformReplace(
            new XElement(h + "table",
                new XAttribute("border", 1),
                new ApplyTransforms()
            )
        );
    }

    // transform w:tr to h:tr
    foreach (var item in doc.Descendants(w + "tr"))
    {
        item.TransformReplace(new XElement(h + "tr", new ApplyTransforms()));
    }

    // transform w:tc to h:td
    foreach (var item in doc.Descendants(w + "tc"))
    {
        item.TransformReplace(new XElement(h + "td", new ApplyTransforms()));
    }

    // the following removes any nodes that haven't been replaced.
    foreach (var item in doc.DescendantNodes())
    {
        item.TransformRemove();
    }

    XElement newDoc = (XElement)doc.Root.Transform();
    newDoc.Save("test.html");
}

When run using the document attached to this post, it produces the following:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Test.docx</title>
</head>
<body>
<h1>LINQ to XML Transformations in the Style of XSLT</h1>
<h2>Styled Text</h2>
<p>Some <b>bold</b> text.</p>
<p>Some normal text.</p>
<h2>Hyperlinks</h2>
<p>See my <A href="https://blogs.msdn.com/ericwhite">blog</A>.</p>
<h2>Tables</h2>
<p>This text introduces the following tables:</p>
<table border="1">
<tr>
<td>
<p>
<b>Order Number</b>
</p>
</td>
<td>
<p>
<b>Order Date</b>
</p>
</td>
<td>
<p>
<b>Amount</b>
</p>
</td>
</tr>
<tr>
<td>
<p>124245</p>
</td>
<td>
<p>10/24/2008</p>
</td>
<td>
<p>42.55</p>
</td>
</tr>
<tr>
<td>
<p>147867</p>
</td>
<td>
<p>10/31/2008</p>
</td>
<td>
<p>88.99</p>
</td>
</tr>
</table>
<p />
<p>Item Detail for Order 124245</p>
<table border="1">
<tr>
<td>
<p>
<b>Line Number</b>
</p>
</td>
<td>
<p>
<b>Item</b>
</p>
</td>
<td>
<p>
<b>Quantity</b>
</p>
</td>
</tr>
<tr>
<td>
<p>1</p>
</td>
<td>
<p>HH242</p>
</td>
<td>
<p>3</p>
</td>
</tr>
<tr>
<td>
<p>2</p>
</td>
<td>
<p>TY149</p>
</td>
<td>
<p>8</p>
</td>
</tr>
<tr>
<td>
<p>3</p>
</td>
<td>
<p>ZZTXT</p>
</td>
<td>
<p>4</p>
</td>
</tr>
</table>
<p />
</body>
</html>

Thanks to Dirk Myers who suggested that this approach could support modes.

DocxToHtml.zip