Using XML value() method to take advantage of a relational index

The Performance Optimizations for XML Data Type in SQL Server 2005 whitepaper written by Shankar gives some pretty good advice for writing your XQuery statements so that they take advantage of an available XML Index.  However, oftentimes you will have to deal with incoming XML data which is represented as a variable and hence is not indexed.  In such a scenario, it is common to map some incoming values their corresponding representations within your database.  For example, you may want to map a particular customer name in an XML instance to their corresponding customer id in your relational store for the purpose of updating some record.  For mapping this value, you have basically two options, use exist() to perform the comparison in the XQuery context, or use value() to perform the comparison within the SQL context. 

After reading the performance optimization paper you will probably be inclined to use exist() method, however you might be surprised at the performance benefit you will get by using value() method in some cases.  Lets outline one of those cases now.

My scenario is pretty simple, I have a table with customer information, and another table which stores calls that this customer has made.  Imagine I am a cell-phone provider or something.  I will receive chunks of XML which indicate the customer that called (referenced by some billing code), the phone number the customer called, and the number of minutes the call lasted.  Each packet of XML will contain multiple such records.  I store the data relationally in the following schema:

create table [customers] (
[id] int primary key identity(1,1),
[billing_code] varchar(32) unique,
[name] varchar(32)
) ;

create table [calls] (
[call_id] int primary key identity(1,1),
[customer_id] int foreign key references [customers] (id),
[number] varchar(32),
[minutes] int
) ;

insert [customers] ([billing_code], [name]) values ('JOHN123', 'John')
insert [customers] ([billing_code], [name]) values ('DENN456', 'Denny')

Note that the unique index on the [billing_code] column in the [customers] table will implicitly create a nonclustered index on that column.

The goal now is to write a stored procedure which will accept the XML data, and perform the appropriate insertions into the relational data.  My first attempt will use nodes() and exist() method to map the [billing_code] to the appropriate [customer_id] for insertion into the [calls] table.  Also, let me show you an example instance which I will be parsing:

declare @callSummary xml
set @callSummary =
N'<CallSummary>
<Call billingCode="JOHN123" callTime="5" number="555 123-4567" />
<Call billingCode="DENN456" callTime="1" number="555 456-0000" />
<Call billingCode="DENN456" callTime="10" number="555 456-0010" />
<Call billingCode="DENN456" callTime="15" number="555 456-0001" />
</CallSummary>'

The resulting stored procedure which uses nodes() and exist() is the following:

create proc parseSummaryExists
@callSummary xml
as
begin
insert [calls] ([customer_id], [number], [minutes])
select [cust].[id],
[call].[ref].value(<'@number'>, 'varchar(32)'),
[call].[ref].value(<'@callTime'>, 'int')
from @callSummary.nodes('/CallSummary/Call') [call]([ref])
join [customers] [cust]
on 1 = [call].[ref].exist('@billingCode[.=sql:column("cust.billing_code")]')
end

By looking at the plan that is generated for this update, it can be seen that we aren't really taking advantage of our index on the [billing_code] column.  Instead, we are simply scanning that index and sucking all of the values out, and joining that with each billingCode attribute from our xml instance.  This is not a solution that will scale well as we get more customers!  One problem that we have is since the comparison is being done in XQuery, we have to use XQuery comparison semantics.  This comparison is not collation-sensitive, and since it is possible for the billingCode attribute to have more than 32 characters, we have to perform the comparison using an nvarchar(max).  In other words, we simply don't have enough information in our index to perform this comparison in XQuery efficiently.

Rather then trying to perform the comparison in XQuery, why not perform the comparison using T-SQL and the value() method?  Here is a similar stored proc which uses the value() method:

create proc parseSummaryValue
@callSummary xml
as
begin
insert [calls] ([customer_id], [number], [minutes])
select [cust].[id],
[call].[ref].value(<'@number'>, 'varchar(32)'),
[call].[ref].value(<'@callTime'>, 'int')
from @callSummary.nodes('/CallSummary/Call') [call]([ref])
join [customers] [cust]
on [cust].[billing_code] = [call].[ref].value(<'@billingCode'>, 'varchar(32)')
end

Now, looking at the plan here, we can see a nested loops join where for each <Call> element in the incoming xml, we perform an index seek on the [customers] table to find the appropriate customer id.  This solution will scale in terms of the size of the incoming data, rather than the total number of customers.

There are a couple of key points here to recognize.  The first is that the two queries are not the same.  One truncates the XML text node to 32 characters for the purposes of being able to perform a quick index lookup.  Depending on your scenario this may not be desireable, and you will want to perform the comparison using XQuery semantics.  In this case, you should use exist() but be aware that you may not be able to take much advantage of your existing relational indexes.  Another point is that value() is not just for "top level" projects of data from your XML instances, but can also be a useful tool for performing comparisons in the SQL context.  Finally, The last point that I am trying to make is that there is no silver bullet for getting the most performance out of your XQuery statements.  It is just as wrong to say "use exist()" as it is to say "use value()" for comparisons because they do different things.  Instead, you should carefully analyze your scenario by looking at the plans that are generated and make your decisions based on that.