Querying LINQ to XML Nodes in Reverse Document Order with Better Performance

Article
06/24/2009

Occasionally I need to query LINQ to XML nodes in reverse document order. I’m currently writing some LINQ to XML queries over Open XML documents where I need to select paragraph nodes based on content in the immediately preceding paragraph. However, nodes in LINQ to XML are forward-linked only. We can see evidence of this in the XNode.NodesBeforeSelf and the XElement.ElementsBeforeSelf methods - these methods return collections of nodes in document order, not reverse document order. This was by design – LINQ to XML was designed to provide great performance for the vast majority of scenarios with the minimum memory footprint possible. The need to process nodes in reverse document order is rare, so the designers of LINQ to XML decided that it was more important to reduce memory footprint than to allow for good performance in the few scenarios that require processing in reverse document order, and of course it was a good decision. But the need does exist.

This blog is inactive.
New blog: EricWhite.com/blog

Blog TOC(Update June 25, 2009 - fixed bugs in event handlers associated with deleting last node and inserting node at beginning of list)

In my scenario (a functional transform that processes Open XML document revisions), it is possible that I would need to process 80,000 (or more) paragraphs. If we use the XNode.PreviousNode property, we won’t have acceptable performance. There is an easy work-around that provides us the ability to query in reverse document order in a way that performs well.

We define a new class, PreviousNodeAnnotation , that contains one public field, public XNode PreviousNode;.
We add instances of this class as annotations on nodes that we need to query in reverse document order.

In the following small example, I select nodes based on previous node value for a document size of 50,000. The slow version exhibits performance of O(n²). I limited the sample size in the example to 50,000 nodes. When I increased the doc size to 80,000 nodes (the size of one of my documents that I need to query), the execution time of the slow version exceeded my patience. In any case, it is clear that I can’t use XNode.PreviousNode for my scenario.

using System;
using System.Linq;
using System.Xml.Linq;

class PreviousNodeAnnotation
{
public XNode PreviousNode;
public PreviousNodeAnnotation(XNode prev) { PreviousNode = prev; }
}

class Program
{
static int DocumentSize = 50000;

static void SlowPreviousNodeAccess()
{
// create a tree with lots of nodes
XElement root = new XElement("Root",
Enumerable.Range(0, DocumentSize).Select(i => new XElement("Child", i)));

// query for all elements where the previous element has a value of 1000
DateTime start = DateTime.Now;
var q = root
.Elements()
.Where(e =>
{
XElement p = e.PreviousNode as XElement;
return (string)p == "1000";
});
var q2 = q.ToList(); // force iteration
TimeSpan duration = DateTime.Now - start;
Console.WriteLine(duration);
}

static void FastPreviousNodeAccess()
{
// create a tree with lots of nodes
XElement root = new XElement("Root",
Enumerable.Range(0, DocumentSize).Select(i => new XElement("Child", i)));

// initialize previous node annotations
XElement prev = null;
foreach (var item in root.Elements())
{
item.AddAnnotation(new PreviousNodeAnnotation(prev));
prev = item;
}

// query for all elements where the previous element has a value of 1000
DateTime start = DateTime.Now;
var q = root
.Elements()
.Where(e =>
{
XElement p = e
.Annotation<PreviousNodeAnnotation>()
.PreviousNode as XElement;
return (string)p == "1000";
});
var q2 = q.ToList(); // force iteration
TimeSpan duration = DateTime.Now - start;
Console.WriteLine(duration);
}

static void Main(string[] args)
{
FastPreviousNodeAccess();
SlowPreviousNodeAccess();
}
}

On my old slow laptop, the execution time of these two queries is .015 seconds and 30 seconds, respectively.

We can expand on this technique a bit by declaring three extension methods:

public static XNode PreviousNodeFast(this XNode node)
public static IEnumerable<XNode> NodesBeforeSelfFast(this XNode node)
public static IEnumerable<XElement> ElementsBeforeSelfFast(this XElement element)

It’s convenient to use these methods in queries.

In addition, while I’m partial to pure transforms with no side effects, we can declare two event handlers that keep the previous node annotations in sync when adding or deleting nodes. Here’s an example that includes these extension methods and event handlers:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;

classPreviousNodeAnnotation
{
publicXNode PreviousNode;
public PreviousNodeAnnotation(XNode prev) { PreviousNode = prev; }
}

publicstaticclassExtensions
{
publicstaticXNode PreviousNodeFast(thisXNode node)
{
return node.Annotation<PreviousNodeAnnotation>().PreviousNode;
}

publicstaticIEnumerable<XNode> NodesBeforeSelfFast(thisXNode node)
{
XNode currentNode = node;
while (true)
{
XNode prevNode = currentNode.PreviousNodeFast();
if (prevNode == null)
yieldbreak;
else
yieldreturn prevNode;
currentNode = prevNode;
}
}

publicstaticIEnumerable<XElement> ElementsBeforeSelfFast(thisXElement element)
{
return NodesBeforeSelfFast(element).OfType<XElement>();
}
}

classProgram
{
staticvoid ValidatePreviousNodes(XElement element)
{
XNode prev = null;
foreach (XNode node in element.Nodes())
{
if (node.PreviousNodeFast() != prev)
{
Console.WriteLine("ERROR: previous nodes are invalid");
Environment.Exit(0);
}
prev = node;
}
Console.WriteLine("Validated");
}

staticvoid Main(string[] args)
{
// create a tree with lots of nodes
XElement root = newXElement("Root",
Enumerable.Range(0, 100).Select(i => newXElement("Child", i)));

// setup previous nodes after tree creation
XElement prev = null;
foreach (var item in root.Elements())
{
item.AddAnnotation(newPreviousNodeAnnotation(prev));
prev = item;
}

// add event handlers to take care of adding / deleting nodes
root.Changed += newEventHandler<XObjectChangeEventArgs>((o, e) =>
{
if (e.ObjectChange == XObjectChange.Add)
{
Console.WriteLine("Add");
XNode node = o asXNode;

// o could be an XAttribute, in which case it's not applicable
if (node != null)
{
node.AddAnnotation(newPreviousNodeAnnotation(node.PreviousNode));
if (node.NextNode != null)
{
node.NextNode.RemoveAnnotations<PreviousNodeAnnotation>();
node.NextNode.AddAnnotation(newPreviousNodeAnnotation(node));
}
}
}
});
root.Changing += newEventHandler<XObjectChangeEventArgs>((o, e) =>
{
if (e.ObjectChange == XObjectChange.Remove)
{
Console.WriteLine("Remove");
XNode node = o asXNode;

// o could be an XAttribute, in which case it's not applicable
if (node != null)
{
if (node.NextNode != null)
{
node.NextNode.RemoveAnnotations<PreviousNodeAnnotation>();
node.NextNode
.AddAnnotation(newPreviousNodeAnnotation(node.PreviousNode));
}
}
}
});

ValidatePreviousNodes(root);

root.Elements().ElementAt(3).AddAfterSelf(
newXElement("NewChild", 999)
);
ValidatePreviousNodes(root);

root.Nodes().ElementAt(2).Remove();
ValidatePreviousNodes(root);

root.Nodes().ElementAt(3).NodesBeforeSelfFast().Remove();
ValidatePreviousNodes(root);

ValidatePreviousNodes(root);
root.AddFirst(
newXElement("ANode", 1)
);
ValidatePreviousNodes(root);
root.Add(
newXElement("ANode", 2)
);
ValidatePreviousNodes(root);
root.Nodes().Last().NodesBeforeSelfFast().Remove();
ValidatePreviousNodes(root);
root.Add(
newXElement("ANode", 2)
);
ValidatePreviousNodes(root);
root.Add(
newXElement("ANode", 2)
);
root.Add(
newXElement("ANode", 2)
);
root.Add(
newXElement("ANode", 2)
);
ValidatePreviousNodes(root);
root.Nodes().First().Remove();
ValidatePreviousNodes(root);
root.Add(
newXElement("ANode", 2)
);
ValidatePreviousNodes(root);
root.Nodes().Last().Remove();

root.Add(Enumerable.Range(0, 100).Select(i => newXElement("Child", i)));

XElement last = root.Elements().ElementAt(50);
foreach (var item in last.NodesBeforeSelfFast())
{
Console.WriteLine(item);
}
}
}

I would personally only define these extension methods in the module where I need good performance of reverse document order queries. It would be messy to have these extension methods in scope for modules that don't set up the annotations.

Code is attached.

PrevNode.zip

Querying LINQ to XML Nodes in Reverse Document Order with Better Performance

Additional resources