Transforming XML data into a rowset using the nodes() method


Usually there are a couple of scenarios in which you want to take XML data and convert it into a regular relational rowset.  In SQL 2000, you probably used OpenXML for this, which works pretty well if you are only working on a single piece of XML data, but as soon as you want to convert a rowset of XML instances into a rowset of relation columns things begin to break down pretty quickly.


In SQL Server 2005, the solution is to use the XML datatype’s nodes() method.  The method works like a table valued function that accepts a single parameter, which is an XQuery statement which will result in a set of nodes.  We then return references (basically a pointer) from the nodes() TVF to each node which is selected by the XQuery statement.  You can then apply the XML datatype “scalar” functions such as exist(), value(), and query() to these references.


By way of an example, lets unpack some of this.  Lets say I have a piece of XML which represents some kind of contact list.  It has a bunch of people, their various phone numbers, and some metadata about the people (in this case, their age). 


What I want to do is query out the names and ages of each person.  I will use the nodes() method to get a reference to each <Person /> element, and then use value() method on each of these to pull out the properties that I am interested in.


create table contacts (pk int primary key identity(1,1), contactXml xml)


insert into contacts values (‘
<People>
 <Person age=”35″>
  <Name>Pete</Name>  
  <Phone>
   <Mobile>555-555-1234</Mobile>
   <Home>555-555-0001</Home>
  </Phone>
 </Person>
 <Person age=”40″>
  <Name>Paul</Name>  
  <Phone>
   <Mobile>555-555-4567</Mobile>
  </Phone>
 </Person>
 <Person age=”24″>
  <Name>Susan</Name>  
  <Phone>
   <Home>555-555-2323</Home>
  </Phone>
 </Person>
</People>
‘)


select
 pk,
 person.ref.value(‘(Name/text())[1]’, ‘nvarchar(30)’) as [name],
 person.ref.value(
‘@age’, ‘smallint’) as [age]
from contacts c
cross apply c.contactXml.nodes(‘/People/Person’) person(ref)


The cool thing about nodes() method is that it is also composable.  I can continue to drill into the data to get the phone numbers for each contact as well by applying the nodes() method again to the references which are returned by the first nodes() invocation.


select
 pk,
 person.ref.value(‘(Name/text())[1]’, ‘nvarchar(30)’) as [name],
 person.ref.value(
‘@age’, ‘smallint’) as [age],
 phoneNumbers.ref.value(‘local-name(.)’, ‘nvarchar(30)’) as [numberType],
 phoneNumbers.ref.value(‘text()[1]’, ‘nvarchar(30)’) as [number]
from contacts c
cross apply c.contactXml.nodes(‘/People/Person’) person(ref)
cross apply person.ref.nodes(‘Phone/*’) phoneNumbers(ref)


The result of this query is the following rowset:


pk          name                           age    numberType                     number
———– —————————— —— —————————— ——————————
1           Pete                           35     Mobile                         555-555-1234
1           Pete                           35     Home                           555-555-0001
1           Paul                           40     Mobile                         555-555-4567
1           Susan                          24     Home                           555-555-2323


The nodes() method also allows us to filter, and since we are doing a cross apply, this filter will reduce the total number of rows that we see output (as opposed to outer apply, which is more like an outer join).  So if we only want to see people that have Mobile numbers, then we can alter the XQuery statement in the second nodes() invocation from ‘Phone/*’ to ‘Phone/Mobile’. 


In general, nodes() method is a great way to manipulate XML data both when it comes into the server for persisting as a relation rowset, or for converting your already stored XML data into a rowset for returning to the client.  Using XML Indexes are also a great way to increase nodes() performance since the reference that is returned via the nodes() method is part of the primary key (along with the primary key of the base table) of our clustered XML Index.  As a consequence of this, if you are using nodes() heavily in your application, you definitely want to look into utilizing an XML Index to increase query performance. 

Comments (5)

  1. Scott says:

    What if your example:

    1) Had 100,000 person elements

    and

    2) The XML was coming into in as a @parameter, in other words, it’s being passed into the database for shredding and bulk insertion into a Person table.

    Would you still use a nodes(‘/People/Person) approach?

  2. Well, I guess it would depend on my performance expectations. In this particular case, I threw together a quick test without really optimizing anything – default database settings, fully logged, in other words nothing really tweaked out for performance. I got what I would consider to be pretty acceptable times.

    What other approach were you thinking about?

  3. Scott says:

    Well, I was working on a small project to learn some of the new 2005 features.

    I built ‘yet another’ app to import IIS logs to a database. I have logs in an XML format.

    Performance wasn’t a top priority, but I was glad to see the XML approach import 300-400 records per second on very modest hardware. I could import 60,000 log entries in under 5 minutes, which is great for a batch task running in the middle of the night.

    Problem is, I’ve found the application useful for day to day use, but now the database has 2 million records and the import takes 8 hours!

    I’ve done maintenance (reindex / defrag / etc), but can’t get back to a reasonable speed.

    I’m just wondering if I’m using the feature in a way it was not intended. Plenty of other options exist to transform XML into records, but I’m just curious what I’ve done wrong here so I can learn from the experience.

  4. Scott says:

    What I’ve done is redo the import with a SQLCLR sproc. The sproc uses an XmlDocument and SelectSingleNodes to tear apart the XML and issue an insert for each element. Import time for 60,000 records is now under 2 minutes, while the nodes() approach was around 10 hours :/

  5. JohnGallardo says:

    Scott, I think it would be beneficial to move this over to the MSDN forums for SQL Server XML functionality. The url is http://forums.microsoft.com/MSDN/ShowForum.aspx?ForumID=89&SiteID=1. If you could provide some code samples there and describe what you are seeing it will have more visibility to the SQL engine team as a whole. Thanks!