Streaming with LINQ to XML - Part 2

In the first post in this series we gave some background to a problem the LINQ to XML design team has been working on for some time: how to easily yet efficiently work with very large XML documents.  In today's world, developers have a somewhat unpleasant choice between doing this efficiently with fairly difficult APIs such as the XmlReader/XmlWriter or SAX, and doing this easily with DOM or XSLT and accepting a fairly steep performance penalty as documents get very large.

Let's consider a real world example - Wikipedia abstract files.  Wikipedia offers free copies of all content to interested users, on an immense number of topics and in several human languages.  Needless to say, this requires terabytes of storage, but entries are indexed in abstract.xml files in each directory in a hierarchy arranged by language and content type.  There doesn't seem to be a published schema for these abstract files, but each has the basic format:

<feed>
<doc>
<title></title>
<url></url>
<abstract></abstract>
<links>
<sublink linktype="nav"><anchor></anchor><link></link></sublink>
<sublink linktype="nav"><anchor></anchor><link></link></sublink>
</links>
</doc>
.
. [lots and lots more "doc" elements]
.
</feed>

Something one might want to do with these files is to find the URLs of articles that might be interesting given information in the <title> or <abstract> elements.  For example, here is a conventional LINQ to XML program that will open an abstracts file and print out the URLs of entries that contain 'Shakespeare' in the <abstract>.  (If you want to run this, it is best to copy a small subset of a real Wikipedia file such as abstract.xml to the appropriate local directory.  I do NOT recommend clicking on this link: the file is about 10 MB of XML, which will keep your browser busy for a while and possibly choke up your internet connection.)

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using System.IO;

namespace SimpleStreaming
{
    class Program
    {
        static void Main(string[] args)
        {
            XElement abstracts = XElement.Load(@"abstract.xml");
            IEnumerable<string> bardQuotes =
                from el in abstracts.Elements()
                where el.Element("abstract").Value
                   .Contains("Shakespeare")
                select (string)el.Element("url");
            foreach (string str in bardQuotes)
            {
                Console.WriteLine(str);
            }
        }
    }
}

Note that this is a typical LINQ to XML program - we query over the top level elements in the tree of abstracts for those which contain an <abstract> subelement with a value that contains the string "Shakespeare", then print out the values of the <url> subelements.   Of course, actually running this program with a multi-megabyte input file will consume a lot of time and memory; it would be more efficient to query over a stream of top level elements in the raw file of abstracts, and perform the very same LINQ subqueries (and transformations, etc.) that are possible when querying over an XElement tree in memory. 
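One detail in the query is worth a second look: el.Element("abstract") returns null when a <doc> has no <abstract> child, so reading .Value would throw a NullReferenceException on such an entry. The following is a minimal sketch of a more defensive form of the same query. The class name AbstractQueries and the helper FindUrls are hypothetical names used only for illustration; the sketch relies on the explicit string conversion of XElement, which yields null for a missing element.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class AbstractQueries
{
    // Hypothetical helper (not part of LINQ to XML): returns the <url> values
    // of the <doc> children of 'feed' whose <abstract> text contains 'term'.
    public static IEnumerable<string> FindUrls(XElement feed, string term)
    {
        return from el in feed.Elements("doc")
               // The explicit string conversion yields null, rather than
               // throwing, when the <abstract> child is missing.
               let text = (string)el.Element("abstract")
               where text != null && text.Contains(term)
               select (string)el.Element("url");
    }
}
```

Calling AbstractQueries.FindUrls(abstracts, "Shakespeare") on the loaded tree should give the same URLs as the query above whenever every <doc> has an <abstract>, and simply skips entries that lack one.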

As noted in the earlier post, we did not manage to find a design that would do this in a generic, discoverable, easy to use, yet efficient way.  Instead, we hope to teach you how to do this in a custom, understandable, easy to use, and efficient way... with just a bit of code you can tailor to your particular data formats and use cases.  In other words, to abuse the old cliche, rather than giving you a streaming class and feeding you for a day, we'll teach you to stream and let you feed yourself for a lifetime. [groan]  But seriously folks, with just a little bit of learning about the XmlReader and some powerful features of C# and .NET, you can extend LINQ to XML to process huge quantities of XML almost as efficiently as you can with pure XmlReader code, but in a way that any LINQ developer can exploit without knowing the implementation details.

The key is to write a custom axis method that functions much like the built-in axes such as Elements(), Attributes(), etc. but operates over a specific type of XML data.  An axis method typically returns a collection such as IEnumerable<XElement>. In the example here, we read over the stream with an XmlReader, turn each matching element into an XElement with the ReadFrom method (a static method that XElement inherits from XNode), and return the collection by using yield return. This provides the deferred execution semantics necessary to make the custom axis method work well with huge data sources, but allows the application program to use ordinary LINQ to XML classes and methods to filter and transform the results.
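To make the deferred execution point concrete, here is a small sketch that is deliberately simpler than the axis method developed in this post: it parses XML fragments from strings rather than reading a stream, and counts how many fragments it has actually parsed. The class DeferredDemo, the method ParseEach, and the ElementsParsed counter are all hypothetical names introduced for illustration.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

public static class DeferredDemo
{
    // Instrumentation only: how many fragments have been parsed so far.
    public static int ElementsParsed;

    // An iterator method: the body does not run when ParseEach is called,
    // only as the caller enumerates the result, one element at a time.
    public static IEnumerable<XElement> ParseEach(IEnumerable<string> fragments)
    {
        foreach (string fragment in fragments)
        {
            ElementsParsed++;
            yield return XElement.Parse(fragment);
        }
    }
}
```

Calling ParseEach by itself parses nothing, and asking for only the first result (for example with First()) parses only the first fragment. The same property is what lets a streaming axis method hand elements to a LINQ query one at a time instead of building the whole tree first.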

Specifically, we will modify only a couple of lines in the application:

XElement abstracts = XElement.Load(@"abstract.xml");

goes away, because we do not want to load the big data source into an XElement tree.  Let's replace it with a simple reference to a big data source:

string inputUrl = @"https://download.wikimedia.org/enwikiquote/20070225/enwikiquote-20070225-abstract.xml";

Next,

from el in abstracts.Elements()

morphs into a call to the custom axis method we are going to write, passing the URL of the data to process and the element name that we expect to stream over:

from el in SimpleStreamAxis(inputUrl, "doc")

Writing the custom axis method is a bit trickier (but not as scary as the name might sound), and requires only a bare minimum of knowledge about the XmlReader class (and Intellisense will help with that). The key steps are to:

a) Create a reader over the inputUrl file:

    using (XmlReader reader = XmlReader.Create(inputUrl))

b) Move to the content of the file and start reading:

    reader.MoveToContent();
    while (reader.Read())

c) Pay attention only to XML element content (ignore processing instructions, comments, whitespace, etc. for simplicity ... especially since the Wikipedia files don't contain this stuff):

    switch (reader.NodeType)
    {
        case XmlNodeType.Element:

d) If the element has the name that we were told to stream over, read that content into an XElement object and yield return it:

            if (reader.Name == matchName)
            {
                XElement el = XElement.ReadFrom(reader) as XElement;
                if (el != null)
                    yield return el;
            }
            break;

e) Close the XmlReader when we're done:

    reader.Close();

That's not so hard, is it?  The simple example program is now:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using System.IO;

namespace SimpleStreaming
{
    class Program
    {
        static IEnumerable<XElement> SimpleStreamAxis(
                       string inputUrl, string matchName)
        {
            using (XmlReader reader = XmlReader.Create(inputUrl))
            {
                reader.MoveToContent();
                while (reader.Read())
                {
                    switch (reader.NodeType)
                    {
                        case XmlNodeType.Element:
                            if (reader.Name == matchName)
                            {
                                XElement el = XElement.ReadFrom(reader) 
                                                      as XElement;
                                if (el != null)
                                    yield return el;
                            }
                            break;
                    }
                }
                reader.Close();
            }
        }

        static void Main(string[] args)
        {
            string inputUrl = 
               @"https://download.wikimedia.org/enwikiquote/20070225/enwikiquote-20070225-abstract.xml";
            IEnumerable<string> bardQuotes =
                from el in SimpleStreamAxis(inputUrl, "doc")
                where el.Element("abstract").Value.Contains("Shakespeare")
                select (string)el.Element("url");

            foreach (string str in bardQuotes)
            {
                Console.WriteLine(str);
            }
        }
    }
}
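One caveat about the read loop above: XNode.ReadFrom leaves the reader positioned on the node that follows the element it has just read, and the next call to reader.Read() then moves past that node. When whitespace separates the <doc> elements, the node that gets passed over is just whitespace, but if matching elements sit directly next to each other the loop would skip every other one. The following is a sketch of one way to avoid that, not a drop-in replacement. The names StreamingAxes and StreamElements are hypothetical, and the method takes an XmlReader that the caller creates and disposes, which is an assumption made here so the method can be exercised against an in-memory string as well as a URL.

```csharp
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

public static class StreamingAxes
{
    // Hypothetical variant of SimpleStreamAxis: advances the reader itself
    // only when the current node was not consumed by XNode.ReadFrom.
    public static IEnumerable<XElement> StreamElements(
                   XmlReader reader, string matchName)
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element
                && reader.Name == matchName)
            {
                // ReadFrom consumes the whole element and leaves the reader
                // on the node that follows it, so no extra Read() is needed.
                XElement el = XNode.ReadFrom(reader) as XElement;
                if (el != null)
                    yield return el;
            }
            else
            {
                // Any other node: step forward by exactly one node.
                reader.Read();
            }
        }
    }
}
```

A caller could wrap it as using (XmlReader reader = XmlReader.Create(inputUrl)) and then query over StreamingAxes.StreamElements(reader, "doc") exactly as in Main above, enumerating the results inside the using block so the reader is still open.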
  

The actual results contain more than just Shakespeare quotes; feel free to add whatever logic it takes to exploit Wikipedia's conventions in a more sophisticated way.  Likewise, you might wish to experiment with other LINQ to XML techniques to transform the matching elements into RSS or HTML data.  Or you might wish to experiment with a more sophisticated query language, e.g. an XPath subset, rather than using the simple name matching scheme here.  The possibilities are endless!  We'll explore some in a bit more depth in the next installment, and address a question that the LINQ to XML design team wrestled with for a long time: how to handle documents with a more complex structure, such as a header containing contextual data that needs to be preserved, or more deeply nested documents where you want to stream over multiple levels of the hierarchy.
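As one illustration of the transformation idea, here is a minimal sketch that turns a sequence of <doc> elements into a bare-bones RSS skeleton using functional construction. The class RssTransform and the method ToRss are hypothetical names, the mapping of <title>, <url> and <abstract> onto RSS item fields is an assumption made for the example, and the output is not meant to be a complete or validated feed.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RssTransform
{
    // Hypothetical helper: wraps each <doc> element in an RSS-style <item>.
    public static XElement ToRss(IEnumerable<XElement> docs, string channelTitle)
    {
        return new XElement("rss",
            new XAttribute("version", "2.0"),
            new XElement("channel",
                new XElement("title", channelTitle),
                // The query is enumerated while the tree is constructed,
                // producing one <item> per incoming <doc>.
                from doc in docs
                select new XElement("item",
                    new XElement("title", (string)doc.Element("title")),
                    new XElement("link", (string)doc.Element("url")),
                    new XElement("description", (string)doc.Element("abstract")))));
    }
}
```

The docs argument can be any IEnumerable<XElement>, including the filtered output of a streaming axis method. Note that this builds the whole result tree in memory, which is fine when the matches are few; for very large outputs, XStreamingElement in System.Xml.Linq may be worth a look.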
