Seeking on the XML Value Index

This might be long, so I am going to cut to the chase right now. If you are using predicates which involve non-string types in your XQuery statements, you want to use typed data. There, I said it. Now lets try to figure out why.

The issue basically boils down to how we index our XML data in SQL Server 2005. The big clue is that our XML Index format is exactly the same regardless of whether or not you have an XML Schema associated with the XML column. In the XML Index, there are three columns which form the “value” if a particular node: [value], [lvalue] and [lvaluebin]. For our purposes, we are going to focus on the [value] column, which contains the first 128 bytes of the data in a particular node as a sql_variant. This storage works in our favor in some cases, because it allows us to store both numerical data and string data in as compact a form as possible without losing fidelity. However, it can prove to be complicated for purposes of comparisons. Here is an example which demonstrates this:

 create table t3 (acol sql_variant)
insert t3 values ('f')
insert t3 values (ascii('f'))
go

select *
from t3 [1]
join t3 [2] on ([1].acol >= [2].acol)

On my machine, I get the following results:

 acol       acol
---------- ----------
f          f
102        f
102        102

Ok, so obviously there are some gotchas when it comes to comparing sql_variants, and it is because of these that we have to be very careful when performing comparisons on data stored in the XML Index so that we preserve the proper XQuery comparison semantics.

Before moving any further, we should talk about how we decide what to actually store in the [value] column of the XML index. The representation of the value that is stored is completely dependent on the associated typing information (or lack thereof). In the case where there is no typing information, our XML shredder returns string data (nvarchar(max)). So we store the first 64 characters in the [value] column as an nvarchar(64), and the whole data will get stored in the [lvalue] column. In the case of typed data, it depends on the primitive type of the value. In many cases, the primitive type will map to a numeric sql type. In other words, the [value] column for a typed instance will contain a mix of various primitive sql types.

With that knoweledge in hand, we can begin to look at the query side of things. Of course, our XQuery implementation is aware of the storage mechanism of the XML Index, and we use this information, along with the static typing information that we infer to decide whether or not we have to perform any extra casting in order to get proper comparison semantics. This is probably best demonstrated by an example. Lets begin with this:

 create table sample (pk int primary key identity(1,1), xmlcol xml)
create primary xml index i_xml on sample(xmlcol)
create xml index i_xml_value on sample(xmlcol)
    using xml index i_xml
   for value
go

insert sample values (
'<foo>
   <bar>5</bar>
</foo>')

insert sample values (
'<foo>
    <bar>5.00</bar>
</foo>')

insert sample values (
'<foo>
 <bar>100</bar>
</foo>')
go

select count(*)
from sample
where xmlcol.exist('/foo/bar/text()[.<=5]') = 1

Now, make sure you can see the plan that is generated (there are various ways, I like to use the “Include Actual Plan” button in Management Studio to return a graphical execution plan along with the results). There should be a single nested loops join, with the inner operator being a filter over a clustered index seek. This seek is pulling all of the XML index rows for a given instance (and filtering so we only get <bar />’s text node) and returning the [value], [lvalue] and [hid] columns. Remember, that this instance is untyped, so our XML Index is going to contain the string “5” for the value of the element <bar /> in the [value] column. Our comparison though, is for the numerical value 5. Per XQuery general comparison semantics, we promote the atomic value of the path expression to a double and then perform the comparison. The lexical value-space of the numerical value 5 is actually a superset of the lexical value-space of the string “5” (in other words, we want to match things like the string value “5.0” as well as the string value “5”). In order to do this, we have to convert the string values which are stored in the XML Index into numerical values (note that we don’t necessarily have to convert all of them, we might be able to use other indexes to prune down the number of conversions we perform) and then perform the comparison. To perform the comparison, you will see that we actually have a special “xsd_cast” intrinsic which performs our XQuery specific casts (this is part of the “Filter” node). This conversion will prevent us from performing an index seek operation on the XML VALUE index. Now, run the following query:

 select count(*)
from sample
where xmlcol.exist('/foo/bar/text()[.="5"]') = 1

Look at the plan, you should see something quite different. In this case we are able to use the XML VALUE (i_xml_value) index to seek for all of the nodes which have the string value “5” and evaluate the query starting there. Moreover, since the value index also has the [hid] column in it, which represents the path to the node, we are only looking at data in which the path “/foo/bar/text()” has the value “5”. The result is a query that is really taking advantage of our XML Index and will perform quite well.

So how can we improve things for the query with the numerical value 5? Well, by typing the instance, of course! Run the following DDL to alter the column to be typed and recreate our XML indexes:

 drop index i_xml on sample
go

create xml schema collection sample_schema as N'
<xs:schema xmlns:xs="https://www.w3.org/2001/XMLSchema">
  <xs:element name="foo">
       <xs:complexType>
          <xs:all>
              <xs:element name="bar" type="xs:decimal" />
           </xs:all>
     </xs:complexType>
 </xs:element>
</xs:schema>
'
go

alter table sample alter column xmlcol xml(document sample_schema)
go

create primary xml index i_xml on sample(xmlcol)
create xml index i_xml_value on sample(xmlcol)
  using xml index i_xml
   for value
go

Now, when we evaluate queries on this XML data, our XQuery compiler has extra information which it can use to optimize the queries. In our case, it will know that the value of the <bar /> element must be an xs:decimal and that it will be stored as such in the XML Index. Try executing the following query:

 select count(*)
from sample
where xmlcol.exist('/foo/bar[.>=5]') = 1

From looking at the plan, it can be determined that we are using the XML VALUE index (note the index seek on the inner side of the nested loop join). From the query optimizers perspective, the reason that it is safe (and more performant) to use the XML VALUE index is because we are not applying the conversion operation ontop of it. The XQuery compiler does not put the conversion operator because it can reason statically over the data due to the presence of the XML Schema and knoweledge of our storage mechanism. Astute readers will also notice that I have removed the /text() node test. This is actually for a good reason, and perhaps I will delve into that detail a bit more in a future blog entry.

Hopefully this has shed a little bit of light into XML VALUE index selection, and how you can type your data in order to take advantage of this indexing option that we have provided in SQL Server 2005.

I am going to leave with a question though… Why doesn’t the query optimizer put the index seek on the outer (top) portion of the nested loops join? Would it be faster? If so, under what conditions?