Indexing XML, XML ids and a better GetElementByID method on the XmlDocument class


One question that I get is “How can you index an XML document? If you had indexes, all your lookups would be many times faster rather than having to walk the entire XML document?” The XML 1.0 specification has no ability to index a document based upon unique node identity and hence when you do searches with XPath, it searches from the start of the document in a top down, left traversal manner to find the matching nodes. XSLT does come to the rescue with the xsl:key keyword which allows you to create name value pairs that can be used as a lookup in a stylesheet. However often you simply want to find the value in a document without having to resort to XSLT, the universal cure for all XML problems.


You then discover the GetElementById method that seems to do exactly what you want - jump to a location in a document based upon an id value. Unfortunately this all horribly breaks down as you need to define a DTD or an XML schema to indicate which attributes are of type id, which provides more pain than it is worth. What you really want is just to have values in your document that you know are ids. This is a very application specific requirement. There has been some discussion recently within the W3C on the subject of the idness of a node based upon a set of xml:id requirements that were published in August 2003. This is somewhat useful, although it is really only interesting to use this in the context of a query language that works over in-memory documents, such as via the XPath id() function. It will be interesting to see whether there is any demand for a standard definition of an id value on a node in the absence of a schema to indicate this. My feeling is that like newly recommended XML 1.1 this will take a long time, if ever, to be adopted within parsers and by users and you are best of simply creating your own solution.


So how can you do this? If you need index like lookup today it is simple to implement on the XmlDocument class in .NET. Given this exmaple XML document, called “demo.xml” with id attributes defined on “child” named element nodes;


<?xml version="1.0" ?>


<root xmlns="urn:webdata">


      <child id="1">text1</child>


      <child id="2">text2</child>


      <child id="3">text3</child>


      <child id="4">text4</child>


</root>


First derived your own specific CustomDocument type with a Hashtable lookup, register for the NodeInserted event handler and then at Load() time build up the Hashtable of id values with the cooresponding “parent“ element. The GetElementByIdAttribute() method on your CustomDocument class then uses the id string to index the cooresponding XmlElement value and return them. Instant indexes on your XML document.


using System;


using System.Xml;


using System.IO;


using System.Collections;


 


namespace XmlDocumentGetElementById


{


      public class CustomDocument : XmlDocument


      {


        Hashtable idContainer;


        string idAttributeName;


        string elementName;


   


        public CustomDocument( string elementName, string idAttributeName ) : base()


        {


            this.elementName = elementName;


            this.idAttributeName = idAttributeName;


            this.idContainer = new Hashtable();


            this.NodeInserted += new XmlNodeChangedEventHandler( NodeInsertedHandler );


        }


 


        public void NodeInsertedHandler( object sender, XmlNodeChangedEventArgs args )


        {


            if( args.Node.NodeType == XmlNodeType.Attribute


                && args.NewParent.Name == elementName


                && args.Node.Name == idAttributeName


                ) {


                string id = args.Node.Value;       


                if( idContainer[id] == null ) {


                    idContainer[id] = args.NewParent;


                }


                else {


                    throw new Exception( "Id already present" );


                }


            }


        }


 


        public XmlElement GetElementByIdAttribute( string id )


        {


            return (XmlElement)idContainer[id];


        }


    }


}


 


Using this CustomDocument becomes very easy. Create one, provide the name of the element that you want to index (here called “child”) and the attribute name that acts as the index key (here called “id“) and then simply call the GetElementByIdAttribute() method to return the index value. No DTD or XML schema necessary.


 


static void Main(string[] args)


{


      CustomDocument td = new CustomDocument("child", "id");


      td.PreserveWhitespace = true;


      td.Load("demo.xml");


      Console.WriteLine("Outputting Information using the ID attribute information. ");


      Console.WriteLine();


      int temp = 1;


      while (temp < 5)


      {


            XmlElement ele = td.GetElementByIdAttribute(temp.ToString());


            OutputNode(ele);


            temp++;


      }


}


 


static void OutputNode ( XmlNode node)


{


    if( node == null )


    {


        return;


    }


    if( node.NodeType == XmlNodeType.Document ) {


        return;


    }


        Console.Write("Name: {0,-20}",node.Name);


        Console.Write("NodeType: {0,-10}", node.NodeType);


        Console.WriteLine("Value: {0,-10}", node.InnerText);


    }


}


 


As an exercise for the reader, make this more useful by providing a custom XPath function called idkey()that simply wraps this functionality within an XPath query, thereby providing XSLT key() like functionality for your document. See the Adding Custom Functions to XPath Extreme XML article on how to do this.

Comments (11)
  1. This would be a very useful feature and will be best extended not only to store an id attribute’s value but to generate and maintain an id for every type of node.

    Yes, I need something with the functionality of the standard XSLT function generate-id() with time complexity much better than the O(N^2) exhibited by the current .Net XSLT implementation.

    Therefore, why not just fix generate-id() itself and also the current xsl:key implementation?

    Let me assure you that this is really very much needed by all XSLT programmers and not just an exercise for the curious of us.

    Cheers,

    Dimitre Novatchev [XML MVP],

    FXSL developer, XML Insider,

    http://fxsl.sourceforge.net/ — the home of FXSL

    Resume: http://fxsl.sf.net/DNovatchev/Resume/Res.html

  2. RichB says:

    {pedant}

    You have used invalid URN syntax.

    rfc2141 (URN Syntax) states:

    All URNs have the following syntax (phrases enclosed in quotes are REQUIRED):

    <URN> ::= "urn:" <NID> ":" <NSS>

    {/pedant}

  3. Mark Fussell says:

    Yes, the implementation of xsl:key in the .NET XSLT 1.0 and 1.1 processor had a bad bug. In the next Service Pack (SP) due to be released for .NET 1.0 and .NET 1.1 this has been fixed. I am sure that you know of this KB http://support.microsoft.com/default.aspx?scid=kb;EN-US;324478

    The generate-id function although useful for generating id values for HTML anchors via href, cannot be used with the xsl:key or xsl:id functions which is a limitation. However adding a new user defined method to XSL to do this is a good idea. I will also look into the 0^2 behavior. As I mentioned at PDC 2003 we have done a lot of work to increase the perf of the XSLT processor in .NET V2.0

  4. generate-id() *can* be used with xsl:key (e.g. in the "use" attribute) and correspondingly with the key() function and this use is not at all rare.

    If you need concrete examples, I will be glad to provide them.

    As for the coming fix, I knew this a year ago — the problem is the fix is still not available.

    Please, understand me correctly — we are accustomed with the unprecedented quality and performance of MSXML4 and would expect the same from dotNet XslTransform, especially taking into account that it is much more easier to write an XSLT processor in a modern, garbage-collected language than in C++.

    Thank you,

    Dimitre Novatchev.

  5. Steve Saxon says:

    This is a pretty neat way to work around the DTD/schema limitation.

    Probably worth pointing out it would be really easy to build much the same thing for XPathDocument (now that the DOM is dead!).

    You would derive from XPathDocument, and have a custom XmlReader that worked together with the reader. You hook the reader’s Read() method to trap when you’re on an element, and as one flys by, you look for the id attribute and pass it back to your XPathDocument-derived class through some internal method.

    If you then added the XPath function extension to bind the id, that would work for this too, by the power of XPathNavigator.

    BTW, the link for the XPath function extensions should be;

    http://msdn.microsoft.com/library/en-us/dnexxml/html/xml10212002.asp

  6. little things... says:

    Missing from here:

    – how to handle node removal from the document?

    – how to handle node insertion in a document fragment?

  7. For a pure XSLT solution of a closely related problem — to assign ids to nodes — see the thread "Counting nodes efficiently" in the xsl-list:

    http://aspn.activestate.com/ASPN/Mail/Message/XSL-List/2002866

    The generated "_id" attribute can be easily used with a corresponding definition of an xsl:key and then with the key() function.

  8. Here I go again with another experimental creature from my XML Bestiary: IndexingXPathNavigator. This one was inspired by Mark Fussell’s "Indexing XML, XML ids and a better GetElementByID method on the XmlDocument class". I’ve been experimenting with Mark’s extended XmlDocument, played a bit with XPathDocument and "idkey()" extension function Mark…

  9. In related news – The XML Core Working Group has released the First Public Working Draft of xml:id Version 1.0. The specification introduces a predefined attribute name that can always be treated as an ID and hence can always be recognized. What can be said? At last! Finally! xml:id Version…

Comments are closed.

Skip to main content