Indexing XML, XML ids and a better GetElementById method on the XmlDocument class

One question that I get is “How can you index an XML document? If you had indexes, wouldn't all your lookups be many times faster than having to walk the entire XML document?” The XML 1.0 specification has no way to index a document based upon unique node identity, so when you search with XPath, it starts at the beginning of the document and walks it top-down, left to right (that is, in document order) to find the matching nodes. XSLT does come to the rescue with the xsl:key element, which allows you to create name-value pairs that can be used as a lookup in a stylesheet. However, often you simply want to find a value in a document without having to resort to XSLT, the universal cure for all XML problems.
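To make the cost concrete, here is a minimal sketch of the plain XPath route on XmlDocument. The class name XPathScanDemo, the helper FindChildText, the prefix w and the inline sample XML are all assumptions made for illustration. Because the sample elements sit in a default namespace, the query needs an XmlNamespaceManager, and each call evaluates the expression against the tree rather than consulting any index.

```csharp
using System;
using System.Xml;

public static class XPathScanDemo
{
    // Hypothetical sample document; element and attribute names are illustrative.
    const string SampleXml =
        "<root xmlns='urn:webdata'>" +
        "<child id='1'>text1</child>" +
        "<child id='2'>text2</child>" +
        "<child id='3'>text3</child>" +
        "</root>";

    // Returns the text of the child element whose id attribute matches, or null.
    // Assumes the id value contains no single quote (it is pasted into the XPath).
    public static string FindChildText(string id)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(SampleXml);

        // The elements are in the urn:webdata namespace, so XPath needs a prefix for it.
        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("w", "urn:webdata");

        // No index is involved: candidate nodes are tested in document order.
        XmlNode node = doc.SelectSingleNode("/w:root/w:child[@id='" + id + "']", ns);
        return node == null ? null : node.InnerText;
    }

    public static void Main()
    {
        Console.WriteLine(FindChildText("3"));
        Console.WriteLine(FindChildText("9") == null ? "(not found)" : "found");
    }
}
```

With the sample above, the first call finds the third child and the second finds nothing; either way the work grows with the number of child elements.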

You then discover the GetElementById method, which seems to do exactly what you want - jump to a location in a document based upon an id value. Unfortunately this all breaks down horribly, because you need to define a DTD or an XML schema to indicate which attributes are of type ID, which is more pain than it is worth. What you really want is just to have values in your document that you know are ids. This is a very application-specific requirement. There has been some discussion recently within the W3C on the subject of the ID-ness of a node, based upon a set of xml:id requirements that were published in August 2003. This is somewhat useful, although it is really only interesting in the context of a query language that works over in-memory documents, such as via the XPath id() function. It will be interesting to see whether there is any demand for a standard definition of an id value on a node in the absence of a schema to indicate this. My feeling is that, like the newly recommended XML 1.1, this will take a long time, if ever, to be adopted within parsers and by users, and you are best off simply creating your own solution.
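For comparison, this is roughly what the DTD route looks like. It is a sketch only: the class name DtdIdDemo, the helper FindById and the inline document are assumptions for illustration, and it relies on XmlDocument.LoadXml parsing the internal DTD subset, which is my understanding of the default behaviour. Note that the id values had to change as well, since values of type ID must be XML names and so cannot start with a digit.

```csharp
using System;
using System.Xml;

public static class DtdIdDemo
{
    // Hypothetical document with an internal DTD subset that declares
    // the id attribute of <child> to be of type ID.
    const string SampleXml =
        "<!DOCTYPE root [" +
        "<!ELEMENT root (child*)>" +
        "<!ELEMENT child (#PCDATA)>" +
        "<!ATTLIST child id ID #REQUIRED>" +
        "]>" +
        "<root>" +
        "<child id='c1'>text1</child>" +
        "<child id='c2'>text2</child>" +
        "</root>";

    // Returns the text of the element with the given ID value, or null.
    public static string FindById(string id)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(SampleXml);

        // GetElementById only recognises attributes that the DTD typed as ID.
        XmlElement element = doc.GetElementById(id);
        return element == null ? null : element.InnerText;
    }

    public static void Main()
    {
        Console.WriteLine(FindById("c2") ?? "(not found)");
        Console.WriteLine(FindById("1") ?? "(not found)");
    }
}
```

Every document you want to look up this way has to carry, or reference, such declarations, which is exactly the overhead described above.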

So how can you do this? If you need index-like lookup today, it is simple to implement on the XmlDocument class in .NET. Take this example XML document, called “demo.xml”, with id attributes defined on the element nodes named “child”:

<?xml version="1.0" ?>
<root xmlns="urn:webdata">
    <child id="1">text1</child>
    <child id="2">text2</child>
    <child id="3">text3</child>
    <child id="4">text4</child>
</root>

First derive your own CustomDocument type with a Hashtable lookup and register a handler for the NodeInserted event; then, at Load() time, the handler builds up the Hashtable of id values with the corresponding “parent” element. The GetElementByIdAttribute() method on your CustomDocument class then uses the id string to look up the corresponding XmlElement and return it. Instant indexes on your XML document.

using System;
using System.Xml;
using System.Collections;

namespace XmlDocumentGetElementById
{
    public class CustomDocument : XmlDocument
    {
        // Maps an id attribute value to the element that carries it.
        Hashtable idContainer;
        string idAttributeName;
        string elementName;

        public CustomDocument( string elementName, string idAttributeName ) : base()
        {
            this.elementName = elementName;
            this.idAttributeName = idAttributeName;
            this.idContainer = new Hashtable();
            // Fires for every node added to the document, including during Load().
            this.NodeInserted += new XmlNodeChangedEventHandler( NodeInsertedHandler );
        }

        public void NodeInsertedHandler( object sender, XmlNodeChangedEventArgs args )
        {
            // Only index the chosen attribute on the chosen element.
            if( args.Node.NodeType == XmlNodeType.Attribute
                && args.NewParent.Name == elementName
                && args.Node.Name == idAttributeName )
            {
                string id = args.Node.Value;
                if( idContainer[id] == null )
                {
                    idContainer[id] = args.NewParent;
                }
                else
                {
                    throw new InvalidOperationException( "Id already present: " + id );
                }
            }
        }

        // Returns the indexed element, or null if no element has this id.
        public XmlElement GetElementByIdAttribute( string id )
        {
            return (XmlElement)idContainer[id];
        }
    }
}

Using this CustomDocument is very easy. Create one, providing the name of the element that you want to index (here called “child”) and the name of the attribute that acts as the index key (here called “id”), load the document, and then simply call the GetElementByIdAttribute() method to return the indexed element. No DTD or XML schema is necessary.

 

using System;
using System.Xml;

namespace XmlDocumentGetElementById
{
    class Program
    {
        static void Main(string[] args)
        {
            CustomDocument td = new CustomDocument("child", "id");
            td.PreserveWhitespace = true;
            td.Load("demo.xml");

            Console.WriteLine("Outputting Information using the ID attribute information. ");
            Console.WriteLine();

            // demo.xml defines ids 1 through 4.
            for (int temp = 1; temp < 5; temp++)
            {
                XmlElement ele = td.GetElementByIdAttribute(temp.ToString());
                OutputNode(ele);
            }
        }

        static void OutputNode(XmlNode node)
        {
            // Nothing to print for a missing id or for the document node itself.
            if (node == null || node.NodeType == XmlNodeType.Document)
            {
                return;
            }

            Console.Write("Name: {0,-20}", node.Name);
            Console.Write("NodeType: {0,-10}", node.NodeType);
            Console.WriteLine("Value: {0,-10}", node.InnerText);
        }
    }
}

As an exercise for the reader, make this more useful by providing a custom XPath function called idkey() that simply wraps this functionality within an XPath query, thereby providing XSLT key()-like functionality for your document. See the Extreme XML article “Adding Custom Functions to XPath” for how to do this.