Effective Xml Part 2: How to kill the performance of an app with XPath…

09/26/2011

XPath expressions are pretty flexible, and this flexibility allows for very creative ways of using XPath. Unfortunately, some of them are suboptimal and cause bad application performance. This is especially visible in Xslt transformations, where stylesheets contain tens if not hundreds of XPath expressions. Here is a list of the most common bad practices (or even anti-patterns) I have seen:

1. // (descendant-or-self axis)

This is a very common pattern that very often leads to serious performance problems. The way it works is that it flattens the whole subtree (the most common usage I have seen flattens the whole Xml document) and then looks for the specified elements. The .NET Framework does not have any specific optimizations for this pattern, and using it *is costly*. I tried to figure out in what scenarios using // is really useful or necessary, and the only real case I found was Xml designed to have elements with the same name on different levels (e.g. a recursive Xml document describing a file system, where a <folder> element can have <folder> child elements). Other than that, I could not find any usage that could not be replaced with a more efficient counterpart. Usually, being as specific as possible in your query helps (and is enough). Why do people use // then?

“It’s easy” – I agree, but for me that is not an argument. Sure, it would be great if fast apps took little effort to write, but that is rarely the case. Put in some extra effort and you will see tangible results.
“If I use // it works otherwise it does not” – check your XPath expression. Go step by step and see if you have any typos. Finally, look at namespaces. This is the most common cause I have noticed for this kind of usage: the elements are in namespaces, so “nothing but //*[local-name() = '{put-element-name-here}'] works”. Spend a day with Xml namespaces. I know it is not an easy topic, but understanding Xml namespaces will help you a lot if you work with Xml.
“I don’t know the structure of the Xml document” – now you are in trouble. If you don’t know the structure of the Xml documents you are processing, how do you know you will not capture some extra elements with //? Using // in this kind of situation feels like introducing bugs.
“The structure of my document changes very often” – (this is pretty similar to the one above) you should be updating your XPath expressions accordingly and consciously. Otherwise you may start processing nodes you did not really mean to process or, if some elements were removed entirely, your application will continue to run an expensive query just to find out that there is nothing to process.
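
To make the namespace point from the list above concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree (which implements only a subset of XPath, so treat it as an illustration rather than the .NET behavior). The catalog document, the namespace URI and the prefix "c" are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: every element lives in a default namespace.
XML = """
<catalog xmlns="urn:example:catalog">
  <book><title>XPath Basics</title></book>
  <book><title>Xslt in Practice</title></book>
</catalog>
"""

root = ET.fromstring(XML)

# An unqualified path matches nothing: "book" in no namespace is a
# different name than "book" in urn:example:catalog.
unqualified = root.findall("book/title")

# Binding a prefix (the prefix itself is arbitrary) to the namespace URI
# keeps the path specific, with no need for //*[local-name() = '...'].
ns = {"c": "urn:example:catalog"}
titles = [t.text for t in root.findall("c:book/c:title", ns)]

print(len(unqualified))  # 0
print(titles)            # ['XPath Basics', 'Xslt in Practice']
```

The same idea applies in any XPath API: register the namespace once, then write the path you actually mean.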

Yes, I know that many books, articles and blogs that teach Xslt use //. I suppose the reason is that when you start learning Xslt you probably don’t know XPath yet, and // is the simplest thing that works. Unfortunately, even after people have learned XPath they continue to use // all over the place and then complain that Xml/Xslt/XPath is slow. (Real story: I was refactoring a slow stylesheet in a web application where the transformation took 2 minutes. Just getting rid of all the // occurrences, none of which was necessary, brought the transformation time down to 5 seconds.)
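
As a small illustration of what a descendant search returns compared to a specific path, here is a sketch using Python's xml.etree.ElementTree. The file-system document is hypothetical, and ElementTree only accepts paths relative to an element, so ".//file" stands in for "//file".

```python
import xml.etree.ElementTree as ET

# Hypothetical recursive file-system document.
XML = """
<drive>
  <folder name="docs">
    <file name="a.txt"/>
    <folder name="old">
      <file name="b.txt"/>
    </folder>
  </folder>
  <folder name="tmp">
    <file name="c.txt"/>
  </folder>
</drive>
"""

root = ET.fromstring(XML)

# Descendant search: every <file> anywhere below the root is visited
# and returned, including the one in the nested folder.
everywhere = [f.get("name") for f in root.findall(".//file")]

# Specific path: only files directly inside top-level folders.
top_level = [f.get("name") for f in root.findall("folder/file")]

print(everywhere)  # ['a.txt', 'b.txt', 'c.txt']
print(top_level)   # ['a.txt', 'c.txt']
```

If only the top-level files are wanted, the descendant search both does more work and returns an element that was never meant to be processed.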

2. Absolute XPath Expressions Everywhere

While it may sometimes be beneficial to use an absolute XPath expression (i.e. one starting from the document root, /), this is usually not the case. Yet I have seen Xslt stylesheets that used only absolute XPath expressions. This is not only costly (though not as costly as //), but in a lot of cases it is just plain wrong. If you are positioned on an element and want to process all of its child elements, an absolute path will work only if that element has no sibling elements with the same name. Otherwise you will end up processing the child elements of all the same-named siblings as well.
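
The sibling problem can be sketched as follows. This is Python with xml.etree.ElementTree, not Xslt, so querying from the root element only approximates an absolute path such as /order/section/item; the order document is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with two same-named sibling elements.
XML = """
<order>
  <section name="books"><item>XPath Basics</item></section>
  <section name="music"><item>Album A</item><item>Album B</item></section>
</order>
"""

root = ET.fromstring(XML)

relative = {}
from_root = {}
for section in root.findall("section"):
    name = section.get("name")
    # Relative to the element we are positioned on: only its own items.
    relative[name] = [i.text for i in section.findall("item")]
    # Re-evaluated from the top every time: the items of ALL sections.
    from_root[name] = [i.text for i in root.findall("section/item")]

print(relative["books"])   # ['XPath Basics']
print(from_root["books"])  # ['XPath Basics', 'Album A', 'Album B']
```

The query from the top gives the same three items for every section, which is both wasted work and the wrong answer.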

3. Filter as Close as You Can

Take a look at the following expression: a/b/c[../../@id = 3]. What is happening here? We drill into the tree only to go back up and check a condition on one of the ancestor elements. As a result we may process hundreds of <b> and <c> elements for no reason (because the condition is not satisfied). In addition, the expression is not really readable. How about this expression: a[@id = 3]/b/c? The result should be the same as in the case above, but it can be much faster because now we will not process elements we are not really interested in. It is also much more readable: I don’t need to mentally walk the Xml document to be able to tell which element the condition applies to.
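
Here is a rough sketch of the difference, again in Python with xml.etree.ElementTree. ElementTree does not support going up the tree inside a predicate, so the late filter is emulated with a hand-written loop that counts the <c> elements it touches. It also compares attribute values as strings, hence '3' instead of 3. The sample document is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: only one <a> has id 3.
XML = """
<root>
  <a id="1"><b><c>x</c><c>y</c></b></a>
  <a id="3"><b><c>hit</c></b></a>
  <a id="2"><b><c>z</c></b></a>
</root>
"""

root = ET.fromstring(XML)

# Early filter: the condition sits on <a>, so non-matching subtrees
# are never entered.
early = [c.text for c in root.findall("a[@id='3']/b/c")]

# Late filter, emulated by hand: visit every a/b/c first and only then
# check the condition on the ancestor.
visited = 0
late = []
for a in root.findall("a"):
    for c in a.findall("b/c"):
        visited += 1
        if a.get("id") == "3":
            late.append(c.text)

print(early, late)  # ['hit'] ['hit']
print(visited)      # 4
```

Both approaches return the same single element, but the late filter touched four <c> elements to find it.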

4. Don’t Call the document() Function in a Loop

The document() function has to access an external file or a network resource, and accessing the file system (let alone a network resource) is much more expensive than accessing memory. So instead of calling it in a loop (by loop I mean either xsl:for-each or an xsl:template invoked more than once), select what you really need once, store it in a variable, and use the variable instead of calling document() again.
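
The "load once, reuse the variable" idea is not specific to Xslt, so here is a sketch of it in Python. The load_lookup function is a hypothetical stand-in for document(); it parses an in-memory string and counts its calls so the effect is visible. The documents and names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: a lookup document and the document being processed.
LOOKUP_XML = "<codes><code id='PL'>Poland</code><code id='US'>United States</code></codes>"
ORDERS_XML = "<orders><order country='PL'/><order country='US'/><order country='PL'/></orders>"

load_count = 0


def load_lookup():
    """Stand-in for document(): pretend this hits the disk or the network."""
    global load_count
    load_count += 1
    return ET.fromstring(LOOKUP_XML)


orders = ET.fromstring(ORDERS_XML)

# Load the external document once, before the loop, and keep it in a variable.
lookup = load_lookup()

names = []
for order in orders.findall("order"):
    country = order.get("country")
    # Only the in-memory variable is queried inside the loop.
    names.append(lookup.find(f"code[@id='{country}']").text)

print(names)       # ['Poland', 'United States', 'Poland']
print(load_count)  # 1
```

Calling load_lookup() inside the loop would produce the same names with three loads instead of one. In a stylesheet the equivalent move would be an xsl:variable that holds the result of document() and is declared outside the xsl:for-each or template.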

5. Learn Axes and Use Axes

XPath has 13 different axes. From what I have noticed, many people don’t know about XPath axes even though they use them in almost every XPath expression they write. The reason is that the default axis is the child axis, and it can be omitted. As a result, /child::a/child::b (child:: denotes the child axis) becomes /a/b. Similarly, . is a shortcut for self::node() and .. is a shortcut for parent::node() (so parent::node()/child::b becomes ../b). With these shortcuts (and //, which is another shortcut, this time for /descendant-or-self::node()/, and which you should not be using without a good reason) people are able to address most of their needs without even having to know about axes. It’s more like “this is how XPath works” than “I really want to navigate the tree this way” (although in most cases the result of the two will be pretty much the same).

Here is an example: I want to access the next element on the same level as the element I am positioned on. It’s pretty easy with the following-sibling axis: following-sibling::*[1]. Without the following-sibling axis the expression gets complicated: you would probably have to capture the current position in a variable, go to the parent element and then address the element you are interested in using an index calculated from that variable.
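
To show how much bookkeeping the axis saves, here is a sketch of the manual approach in Python. xml.etree.ElementTree does not implement the following-sibling axis, so the hypothetical next_sibling helper below does by hand what following-sibling::*[1] expresses in one step; the sample document is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with three elements on the same level.
XML = "<list><item>one</item><item>two</item><note>three</note></list>"
root = ET.fromstring(XML)


def next_sibling(parent, current):
    """Hand-rolled equivalent of following-sibling::*[1]."""
    children = list(parent)              # go to the parent, list its children
    position = children.index(current)   # capture the current position
    if position + 1 < len(children):
        return children[position + 1]    # address the next one by index
    return None                          # no following sibling


current = root.find("item[2]")           # positioned on the second <item>
nxt = next_sibling(root, current)

print(nxt.tag, nxt.text)                 # note three
print(next_sibling(root, nxt))           # None
```

With a full XPath engine all of next_sibling collapses into the single step following-sibling::*[1].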

If you happen to use any of the above patterns, or you think you are not using the right axes, take a closer look at your stylesheets and think about whether you could improve things and give a nice boost to your queries or transformations. Have I mentioned that you should avoid using //?

Pawel Kadluczka