Equality Semantics of LINQ to XML Trees

Article
01/27/2009

In certain scenarios, it is important to be able to compare two XML trees for equivalence. For example, if you are writing a web service that serves results of queries, and you want to cache query results so that duplicate queries use previously cached results instead of always accessing the underlying database. However, the senders of those queries may potentially be using a variety of tools to generate the queries, and these tools may introduce trivial differences into the XML. The intent of the queries may be identical, but XNode.DeepEquals returns false if you compare the XML that contains semantically equivalent, but trivially different queries.

This blog is inactive.
New blog: EricWhite.com/blog

Blog TOC(July 1, 2009 - Updated - Normalize empty elements to use a self-closing tag.)

This post describes an approach and presents code for normalizing LINQ to XML trees. After normalizing XML trees, you can call XNode.DeepEquals with a greater chance that semantically equivalent XML trees will evaluate as equivalent. The approach presented in this post makes use of the assistance that XSD can provide when normalizing XML trees. If an XML document has been validated using XSD, and the XML tree has been annotated with the Post Schema Validation Infoset (PSVI), then we can use that PSVI to help normalize the tree. Much of this post is based on the OASIS specification Schema Centric XML Canonicalization Version 1.0 (C14N), which describes a variety of ways to normalize an XML tree, including through the use of PSVI.

Let’s look at a few common cases where XNode.DeepEquals reports that equivalent XML trees are unequal. In the following example, the two trees have exactly the same content. The only difference is that one of them uses a default namespace, and the other uses the same namespace using a namespace prefix. XNode.DeepEquals will report that the two trees are different.

XElement root1 = XElement.Parse(
@"<Root xmlns='https://www.northwind.com'>
<Child>1</Child>
</Root>");

XElement root2 = XElement.Parse(
@"<n:Root xmlns:n='https://www.northwind.com'>
<n:Child>1</n:Child>
</n:Root>");

if (XNode.DeepEquals(root1, root2))
Console.WriteLine("Equal");
else
Console.WriteLine("Not Equal");

Here’s another case where XNode.DeepEquals returns false:

XElement root1 = XElement.Parse(
@"<Root a='1' b='2'>
<Child>1</Child>
</Root>");

XElement root2 = XElement.Parse(
@"<Root b='2' a='1'>
<Child>1</Child>
</Root>");

if (XNode.DeepEquals(root1, root2))
Console.WriteLine("Equal");
else
Console.WriteLine("Not Equal");

These two are equivalent, but the attributes are not ordered. However, because you can’t have two attributes in an element that have the same qualified name, we can sort the attributes by namespace and name, and then compare.

The situation gets more interesting when we consider normalizing a document that has been validated using XSD. Consider the following simple schema:

<xsd:schemaxmlns:xsd='https://www.w3.org/2001/XMLSchema'>
<xsd:elementname='Root'
type='xsd:double'/>
</xsd:schema>

Let’s say that we are comparing two small XML documents:

Document 1:

Document 2:

These two documents have essentially the same content, but the ‘+’ before the 25 in the second document will cause the two documents to compare as not equals. However, if we have first validated the documents and annotated the tree with a PSVI using the XDocument.Validate extension method, we know that the data type of the element is double, and can normalize the value of the element. Once normalized, the two documents will compare as equals.

Going further, a schema can declare default attributes and elements. If an XML document in one case has an attribute with the default value, and in another case, the XML document is missing the default attribute, the attribute can be added when validating, and the two trees will compare as equivalent.

For illustration, consider this schema and two documents:

<xsd:schemaxmlns:xsd='https://www.w3.org/2001/XMLSchema'>
<xsd:elementname='Root'>
<xsd:complexType>
<xsd:simpleContent>
<xsd:extensionbase='xsd:string'>
<xsd:attributename='ADefaultBooleanAttribute'
default='false'/>
</xsd:extension>
</xsd:simpleContent>
</xsd:complexType>
</xsd:element>
</xsd:schema>

Document 1:

<Root/>

Document 2:

The above two documents will compare as equivalent after validation and adding PSVI.

In the remainder of this post, I’ll present a list of areas where it would be possible to normalize a tree before comparison. Then, I’ll discuss a simple implementation of normalization using LINQ to XML.

Normalization Issues

The following is a list of issues that can be addressed when normalizing an XML tree:

Insignificant white space should not exist in a normalized tree.
Namespace prefixes and the use of default namespaces should not be significant. It is sufficient to compare qualified names while disregarding whether the namespaces are serialized by a prefix, or as the default namespace.
Missing default elements and attributes should be added to the XML tree when normalizing.
Values of elements and attributes of certain data types can be normalized. Types that can be normalized include xsd:boolean, xsd:dateTime, xsd:decimal, xsd:double, xsd:float, xsd:hexBinary,and xsd:language.
The attributes xsi:schemaLocation and xsd:noNamespaceSchemaLocation exist only to give hints to a schema processor about the location of the schema. We can discard these attributes when normalizing an XML tree.
We can order attributes alphabetically by namespace and name, eliminating insignificant ordering differences.
Comments and processing instructions are not semantically significant when comparing trees. We can remove them when normalizing.
Empty elements (<Tag></Tag>) can be normalized to use a self-closing tag (<Tag/>). These two forms are semantically equivalent per the XML specification. Further, if an element is declared to be EMPTY, only a self-closing tag will do, whereas any empty element can be converted to a self-closing tag, so it's always safe to convert to self-closing. This is the default behavior of LINQ to XML.

One important point about normalization is that you can define application-specific normalization rules. It would be easy to modify the normalization code presented in this post to, say, look for a particular named complex type in the PSVI, and normalize that element per specific rules, perhaps converting attribute or element values to upper or lower case, or some other bespoke normalization.

LINQ to XML Implementation of Normalization

The approach that I took when normalizing is to generate a new, normalized XML tree (instead of modifying the existing tree to normalize it.) This has the advantage that we can write this code in the pure functional style. The resulting code will be small and easy to maintain. In the example code that I present in this post, the pure functional transform to generate a new normalized XML tree is about 190 lines of code. For more information on this style of cloning, see the blog post “Manually Cloning LINQ to XML Trees”.

Instead of optimizing for processing speed, I optimized for compactness and readability of code. This is probably a good trade-off. This code will be pretty fast in any case, and other processes in our solution, such as database access, can be multiple orders of magnitude slower, so it probably doesn’t matter if there are ways to write the code so that it will execute slightly faster.

XML Trees Must Validate before Normalization

The first and most important point is that this code relies on the XML validating successfully before normalizing. The code allows you to specify no schema, but in this case, it only does the normalization that doesn’t require PSVI. But if you specify a schema, and the code doesn’t throw an exception, then the document(s) were valid per the schema.

White Space

The default behavior of LINQ to XML is to discard insignificant white space when populating an XML tree. No text nodes for insignificant white space are materialized. This makes it easy for us – the code presented here assumes that insignificant white space is not in the XML tree.

Namespace Prefix Normalization

Because LINQ to XML has a simple model for namespaces, normalization is easy. The full namespace is included in every XName object. There can be XAttribute objects that contain namespace declarations, however, these “attributes” only have an effect upon serialization. For the purposes of normalization, we can simply remove all XAttribute objects that are namespace declarations. We can determine whether an attribute is a namespace declaration using the XAttribute.IsNamespaceDeclaration property.

Adding Default Elements and Attributes

One of the overloads of the Validate extension methods for XDocument allows us to pass a Boolean parameter that indicates that the XML tree be populated with the PSVI. As part of this population of the PSVI, default elements and attributes will be added to the tree. They will, then, be cloned in the new tree.

Normalizing Values of Elements and Attributes of Certain Data Types

For five data types (xsd:boolean, xsd:decimal, xsd:double, xsd:dateTime, and xsd:float), we can take advantage of one particular aspect of LINQ to XML semantics: LINQ to XML is lax when validating values that are passed to constructors, and is strict when serializing values. For instance, if we know that an attribute is type xsd:double, we can create a new normalized attribute like this:

return new XAttribute(a.Name, (double)a);

And in a similar way, we can create a normalized xsd:double element like this:

return new XElement(element.Name,
NormalizeAttributes(element, havePSVI),
(double)element);

Elements and attributes of type xsd:hexBinary and xsd:language are case insensitive. We can normalize by converting them to lower case.

Removing Comments, Processing Instructions, and the Attributes xsi:schemaLocation and xsi:noNamespaceSchemaLocation

This is straightforward – the code trims these while cloning.

Ordering Attributes by Name

While cloning the tree, it is straightforward to order attributes by name.

Normalize Empty Elements to Self-Closing Elements

The default behavior of LINQ to XML is to serialize an empty element as a self-closing element.

Normalization Issues Not Handled

Schema Centric XML Canonicalization Version 1.0 (C14N) describes a number of other possibilities for normalization that I’ve not handled:

When using an XML programming interface other than LINQ to XML, there are some things we could do to normalize entities from a DTD. However, because LINQ to XML has eliminated entity support, and all entities are expanded before the LINQ to XML tree is populated, we can disregard normalization issues with entities.
Data stored in base64Binary can have white space interjected at very specific points. Normalization could include making sure that this white space conforms to the specification. I left this as an exercise for the reader. This can be accomplished in a few lines of code each for elements and attributes. Feel free to post your solution in a comment below. J
This code does no character modeling normalization other than any normalization done by XmlReader when deserializing the XML.
An XML document can contain an element or attribute that contains an XPath expression. This expression will probably rely on the use of particular namespace prefixes. This code does not attempt to normalize XPath expressions that use specific namespace prefixes. If the XML that you are normalizing contains XPath expressions, it will be necessary to add in the XAttribute namespace definitions so that the XML will serialize with the correct namespace prefix. The code presented here doesn’t deal with these issues.
There are rules for normalization of white space within attributes for certain data types. This code doesn’t do any normalization of this white space.
The C14N document states that if a complex element contains only child elements (mixed content is not allowed), and if the element validates against a model group whose compositor is xsd:all, then the order of child elements is not significant. It could be possible to sort them by namespace and name when normalizing. Initially, I wrote the code to do this normalization. However, I think it’s possible that code can implement differing behavior based on ordering, so I elected to not implement this. If you believe that this normalization is or isn’t valid, please feel free to advise me in a comment. J

About the Code

This post presents two methods: Normalize and DeepEqualsWithNormalization.

Normalize

This method creates and returns a new, cloned, normalized XDocument.

Because it relies on schema validation, it normalizes an XDocument, not an XElement. However, it is easy to create an XDocument from an XElement if you need to normalize an XML tree rooted in an XElement.

It is valid to pass null for the schema parameter, in which case the method will only do the normalizations that are possible without using PSVI.

Signature:

public static XDocument Normalize(XDocument source, XmlSchemaSet schema)

DeepEqualsWithNormalization

This method compares two XDocument objects after normalization.

It is valid to pass null for the schema parameter, in which case the method will only do the normalizations that are possible without using PSVI.

Signature:

public static bool DeepEqualsWithNormalization(XDocument doc1, XDocument doc2,
XmlSchemaSet schemaSet)

Test Cases

In the attached code, along with the public and private methods to do the normalization, etc, there are about a dozen test cases that cause full code coverage of the normalization code. The sample code runs each test case, and reports whether the new normalized trees are equivalent.

The Code

The following listing shows the code used for XML tree normalization and comparison, including an extensions class and the Normalize and DeepEqualsWithNormalization methods:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Schema;

public static class MyExtensions
{
public static string ToStringAlignAttributes(this XDocument document)
{
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
settings.OmitXmlDeclaration = true;
settings.NewLineOnAttributes = true;
StringBuilder stringBuilder = new StringBuilder();
using (XmlWriter xmlWriter = XmlWriter.Create(stringBuilder, settings))
document.WriteTo(xmlWriter);
return stringBuilder.ToString();
}
}

class Program
{
private static class Xsi
{
public static XNamespace xsi = "https://www.w3.org/2001/XMLSchema-instance";

public static XName schemaLocation = xsi + "schemaLocation";
public static XName noNamespaceSchemaLocation = xsi + "noNamespaceSchemaLocation";
}

public static XDocument Normalize(XDocument source, XmlSchemaSet schema)
{
bool havePSVI = false;
// validate, throw errors, add PSVI information
if (schema != null)
{
source.Validate(schema, null, true);
havePSVI = true;
}
return new XDocument(
source.Declaration,
source.Nodes().Select(n =>
{
// Remove comments, processing instructions, and text nodes that are
// children of XDocument. Only white space text nodes are allowed as
// children of a document, so we can remove all text nodes.
if (n is XComment || n is XProcessingInstruction || n is XText)
return null;
XElement e = n as XElement;
if (e != null)
return NormalizeElement(e, havePSVI);
return n;
}
)
);
}

public static bool DeepEqualsWithNormalization(XDocument doc1, XDocument doc2,
XmlSchemaSet schemaSet)
{
XDocument d1 = Normalize(doc1, schemaSet);
XDocument d2 = Normalize(doc2, schemaSet);
return XNode.DeepEquals(d1, d2);
}

private static IEnumerable<XAttribute> NormalizeAttributes(XElement element,
bool havePSVI)
{
return element.Attributes()
.Where(a => !a.IsNamespaceDeclaration &&
a.Name != Xsi.schemaLocation &&
a.Name != Xsi.noNamespaceSchemaLocation)
.OrderBy(a => a.Name.NamespaceName)
.ThenBy(a => a.Name.LocalName)
.Select(
a =>
{
if (havePSVI)
{
var dt = a.GetSchemaInfo().SchemaType.TypeCode;
switch (dt)
{
case XmlTypeCode.Boolean:
return new XAttribute(a.Name, (bool)a);
case XmlTypeCode.DateTime:
return new XAttribute(a.Name, (DateTime)a);
case XmlTypeCode.Decimal:
return new XAttribute(a.Name, (decimal)a);
case XmlTypeCode.Double:
return new XAttribute(a.Name, (double)a);
case XmlTypeCode.Float:
return new XAttribute(a.Name, (float)a);
case XmlTypeCode.HexBinary:
case XmlTypeCode.Language:
return new XAttribute(a.Name,
((string)a).ToLower());
}
}
return a;
}
);
}

private static XNode NormalizeNode(XNode node, bool havePSVI)
{
// trim comments and processing instructions from normalized tree
if (node is XComment || node is XProcessingInstruction)
return null;
XElement e = node as XElement;
if (e != null)
return NormalizeElement(e, havePSVI);
// Only thing left is XCData and XText, so clone them
return node;
}

private static XElement NormalizeElement(XElement element, bool havePSVI)
{
if (havePSVI)
{
var dt = element.GetSchemaInfo();
switch (dt.SchemaType.TypeCode)
{
case XmlTypeCode.Boolean:
return new XElement(element.Name,
NormalizeAttributes(element, havePSVI),
(bool)element);
case XmlTypeCode.DateTime:
return new XElement(element.Name,
NormalizeAttributes(element, havePSVI),
(DateTime)element);
case XmlTypeCode.Decimal:
return new XElement(element.Name,
NormalizeAttributes(element, havePSVI),
(decimal)element);
case XmlTypeCode.Double:
return new XElement(element.Name,
NormalizeAttributes(element, havePSVI),
(double)element);
case XmlTypeCode.Float:
return new XElement(element.Name,
NormalizeAttributes(element, havePSVI),
(float)element);
case XmlTypeCode.HexBinary:
case XmlTypeCode.Language:
return new XElement(element.Name,
NormalizeAttributes(element, havePSVI),
((string)element).ToLower());
default:
return new XElement(element.Name,
NormalizeAttributes(element, havePSVI),
element.Nodes().Select(n => NormalizeNode(n, havePSVI))
);
}
}
else
{
return new XElement(element.Name,
NormalizeAttributes(element, havePSVI),
element.Nodes().Select(n => NormalizeNode(n, havePSVI))
);
}
}
}

Program.cs