Streaming with LINQ to XML - Part 1

 

This is the first of a multi-post series (#2 is now online) on how to use LINQ to XML in scenarios that require streaming over a large input and/or output data source rather than loading a document into memory, processing it, and saving it. Many XML users face a dilemma: they are asked to process very large data sources, but their tools (such as DOM and XSLT) assume that the data can be loaded into an in-memory tree. The option of simply writing smaller files may not be available or appropriate. For example, XML is increasingly used as a format for database dumps and logfiles, which are intrinsically large. The only currently supported option in the .NET environment is to use a pull parser -- XmlReader or XmlTextReader. This, however, generally requires much more work (and a new set of skills to be learned) compared with tree-based technologies.

 

In other words, the target audience for LINQ to XML will sometimes encounter large documents or arbitrary streams of XML; they want the ease of use that LINQ to XML offers, but they don't want to have to load an entire data source into an in-memory tree before starting to work with it. They could use XmlReader, of course, but that is a considerably lower-level API that requires attention to all sorts of details of XML syntax that mainstream developers don't want to worry about. The LINQ to XML design team has spent a lot of effort over the last year or so wrestling with alternative ways to provide streaming functionality in a way that is consistent with the API's overall philosophy.  The May 2006 CTP included support for streaming output (but not input) via an XStreamingElement class that essentially allowed XML-like trees of IEnumerable<T> instances that could be lazily consumed while being saved as XML text.  This was removed from later CTPs to "keep the slate clean" while incompatible options were considered.

 

Requirements Investigated by Design Team

Specifically, the design team looked at how to support the following requirements:
• It should be possible to work with large XML data sources (potentially infinite streams) using essentially the same concepts, classes, and methods that are used to process small documents in memory.
- It should be possible to use the LINQ query operators to filter the input stream.
- It should be as easy to transform the results to a new structure as with the tree API.
- Operations such as sorting that intrinsically require an entire dataset to be examined need not be supported.
• There should be a way to stream output as well as input so that it is not necessary to build an entire XML tree before writing it out.
• The approach should be more declarative than imperative. The obvious way to support streaming is imperatively, much as the Java XML APIs StAX or XOM do: the user writes a filter function or subclass that the XML API uses to determine which elements in the data source to pass through to the calling application. We think, however, that the better way is to do it more declaratively -- specifying what to do rather than how to do it. We encourage the use of the LINQ query operators and functional construction in the tree-oriented parts of the API, so it would offer a better user experience if this approach could be easily employed with the streaming parts of the API as well.
• It is not necessary to support streaming over arbitrary XML structures, only those regularly-repeating, shallowly nested, element-centric structures commonly found in logfiles, database dumps, RSS/Atom feeds, etc.
• It is desirable to allow relatively small contextual elements to be loaded into the tree but to stream over repeating content below or within the context. For example, given an RSS feed with the structure below, users may wish to load the <channel> content but stream over the <item> sub-elements one at a time:

<rss>
  <channel>
    <title>My News Feed</title>
    <description>All the buzz that’s fit to blog</description>
    <language>en-us</language>
    <item>
      <title>Blah</title>
      <pubDate></pubDate>
      <description>Some blah blah</description>
      <link>https://example.com/blah1</link>
    </item>
    <item>
      <title>Blah Blah</title>
      <pubDate></pubDate>
      <description>More blah blah</description>
      <link>https://example.com/blah2</link>
    </item>
    .
    .
    .
    <item>
      <title>Last Blah</title>
      <pubDate></pubDate>
      <description>No more blah blah</description>
      <link>https://example.com/blah999999999</link>
    </item>
  </channel>
</rss>
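
To make the last requirement concrete, here is a minimal sketch of one way it could be approached on top of the public API. It is not the design team's sample. It assumes XmlReader plus the XNode.ReadFrom method in System.Xml.Linq, and the helper names ReadChannelHeader and StreamItems are hypothetical, as is the tiny in-memory feed that stands in for a large one. It is written with C# top-level statements for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

// Small in-memory stand-in for a feed that would normally be too large to load.
const string feed = @"<rss>
  <channel>
    <title>My News Feed</title>
    <language>en-us</language>
    <item><title>Blah</title><link>https://example.com/blah1</link></item>
    <item><title>Blah Blah</title><link>https://example.com/blah2</link></item>
  </channel>
</rss>";

// Hypothetical helper: loads the small elements that precede the first <item>
// into a tree, leaving the reader positioned on that <item> (if there is one).
XElement ReadChannelHeader(XmlReader xml)
{
    var channel = new XElement("channel");
    if (!xml.ReadToFollowing("channel")) return channel;
    xml.Read(); // step inside <channel>
    while (!xml.EOF && xml.NodeType != XmlNodeType.EndElement)
    {
        if (xml.NodeType != XmlNodeType.Element) { xml.Read(); continue; }
        if (xml.Name == "item") break;
        channel.Add(XNode.ReadFrom(xml)); // reads one element and moves past it
    }
    return channel;
}

// Hypothetical helper: lazily yields one <item> at a time as a small XElement.
IEnumerable<XElement> StreamItems(XmlReader xml)
{
    while (!xml.EOF)
    {
        if (xml.NodeType == XmlNodeType.Element && xml.Name == "item")
            yield return (XElement)XNode.ReadFrom(xml); // already advances the reader
        else
            xml.Read();
    }
}

using var input = XmlReader.Create(new StringReader(feed));
XElement header = ReadChannelHeader(input);
Console.WriteLine((string)header.Element("title"));

// Ordinary LINQ operators filter and reshape the stream; each <item> is only
// materialized when the foreach loop asks for it.
IEnumerable<string> links = StreamItems(input)
    .Where(item => ((string)item.Element("title") ?? "").Contains("Blah Blah"))
    .Select(item => (string)item.Element("link"));

foreach (string link in links)
    Console.WriteLine(link);
```

Only the channel header and one <item> are held in memory at any moment. The price is that the helpers are tied to this particular document shape, which is the kind of trade-off the requirements above are trying to balance.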

Outcome of Design Discussions

After literally months of discussion and prototyping, we decided to address these requirements by (a) putting XStreamingElement back in the supported API as it was in the May 2006 CTP, and (b) NOT pushing any streaming input API into Orcas RTM, but instead releasing one or more implementations of the ideas we’ve discussed as code samples that can be implemented on top of the public API. For example, Ralf Lämmel has presented some very powerful ideas about how LINQ to XML could be extended to support streaming in a highly functional manner, and that prototype was implemented on top of the public API, not in the implementation code.

We need XStreamingElement in the LINQ to XML library for a number of reasons, the principal one being that it would be rather difficult to implement streaming output at the application level without its features in the public API. The main reason for releasing the streaming input feature as a code sample rather than a supported API is that we can’t find a single API that covers a wide range of use cases while maintaining discoverability and usability. 
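
As a rough illustration of what streaming output looks like, the sketch below builds an XStreamingElement over a lazy query and only pulls the records while the XML is being saved. The Records generator and the log/entry element names are made up for the example, and the produced counter exists only to show when evaluation happens. It is written with C# top-level statements for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

int produced = 0; // counts how many records have actually been generated

// Hypothetical data source: a lazy sequence standing in for a large record set.
IEnumerable<int> Records(int count)
{
    for (int i = 1; i <= count; i++)
    {
        produced++;
        yield return i;
    }
}

// The query is stored, not evaluated: no <entry> elements exist yet.
var log = new XStreamingElement("log",
    from id in Records(3)
    select new XElement("entry", new XAttribute("id", id), "event " + id));

Console.WriteLine("Records produced before Save: " + produced);

// Saving walks the query, creating and writing one <entry> at a time.
var output = new StringWriter();
log.Save(output);

Console.WriteLine("Records produced after Save: " + produced);
Console.WriteLine(output.ToString());
```

Swapping XStreamingElement for XElement in this sketch would enumerate the query in the constructor and build every <entry> in memory before anything is written.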

One thing we noted in all the feedback on proposed streaming input designs is that experienced developers’ intuitions about how this feature would work were at odds with the designs that we knew would actually work. The realities of streaming (e.g., you can only traverse the stream once, and you can’t backtrack/consume it out of order) lead to counter-intuitive behavior that even a designer of Anders' caliber was not able to figure out how to hide behind a simple API. Likewise, many people intuitively expected this functionality to be exposed on the XStreamingElement class, but doing so leads to essentially all the XElement functionality being duplicated in XStreamingElement (or else there are all sorts of kludgy clones and casts exposed to the user).
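
The traverse-once constraint is easy to trip over. The following sketch, which assumes a hypothetical StreamEntries helper built on XmlReader and XNode.ReadFrom, shows a query that looks reusable but silently yields nothing the second time it is enumerated, because the underlying reader has already reached the end of its input.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

const string logXml = "<log><entry>a</entry><entry>b</entry><entry>c</entry></log>";

// Hypothetical helper: lazily yields each <entry> from a forward-only reader.
IEnumerable<XElement> StreamEntries(XmlReader xml)
{
    while (!xml.EOF)
    {
        if (xml.NodeType == XmlNodeType.Element && xml.Name == "entry")
            yield return (XElement)XNode.ReadFrom(xml); // already advances the reader
        else
            xml.Read();
    }
}

using var source = XmlReader.Create(new StringReader(logXml));
IEnumerable<XElement> entries = StreamEntries(source);

int firstPass = entries.Count();  // consumes the reader all the way to the end
int secondPass = entries.Count(); // restarts the iterator, but the reader is at EOF

Console.WriteLine(firstPass + " entries on the first pass, " + secondPass + " on the second");
```

With an in-memory tree, for example one loaded via XElement.Parse, both counts would be the same, and that is the intuition most developers bring to a streaming API.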

We believe it will be easier to show people how to write specialized code for their use cases than to explain the various abstractions we would need to invent to cover a wider range of plausible corner cases. Released as sample code, the streaming input work gives a very compelling illustration of the power of the underlying LINQ and System.Xml APIs and offers the bare bones of a toolkit that does indeed solve the majority of streaming scenarios that we know of, with code written by those who know the API better than anyone.

I can understand the frustration of people who just want an 80:20 version of the streaming input feature to work "out of the box." We'll present some pretty straightforward code to handle the most basic use cases such as huge logfiles with a very flat XML structure. Maybe some of these examples can migrate into a "LINQ to XML Power Tools" library of some sort (hint, hint, MVPs?). Anyway, stay tuned for the details, and please let us know what works for you and what you still need to work with XML easily and efficiently.