Spatial Indexing in SQL Server 2012 ("Denali")

I attended the recent PASS Summit in Seattle, staffed the "Experts Pod" dealing with application development, specialized in Entity Framework questions obviously, but also got to answer lots of other questions: T-SQL, jdbc drivers, etc.

While there I was fortunate to be able to attend a presentation by Michael Rhys on spatial indexing in SQL Server 2012. This was a fascinating subject, and had lots of geek-interest for myself. So warning: this post is about to geek-out into some mathematics and geometry that I, having majored in Math in college, find interesting.

If you need to create a spatial index, read Spatial Indexing Overview (or if you want more detail than I provide here). If you are interested generally in what is new in SQL Server 2012 spatial features, read this white paper by Ed Katibah, who also has a blog.

The bottom line is that spatial indexing really does make a difference, of about an order of magnitude: Michael had queries that took a couple of minutes without an index, but which ran in just a few seconds with an index. This isn't surprising to anyone who has done database development: it's precisely why you index your tables. So that was pretty much the result I would have expected.

What was fascinating was exactly how the indexes were created. Most indexes on tables are simply one-dimensional lists of key-value pairs that enable you to look up some character string ("CompanyName" perhaps) very quickly, which enables efficient joins between tables. But a spatial object can represent a 2-dimensional geographic shape. In some cases you can avoid using a spatial object: for example to find all the restaurants in Seattle, if the Restaurant table has a column called City, then you just search on that. But suppose we want a list of all the hiking trail heads within 5 miles of Interstate 90, between Seattle and Snoqualmie Pass, that are also within the Alpine Lakes Wilderness Area. Obviously (?) you need to have a spatial object that represents the Wilderness Area, and if you create an index on it, queries will run much faster.

So how do you get a one-dimensional index out of a two-dimensional object? SQL Server uses a "Hilbert space-filling curve" to do this. David Hilbert was a mathematician active around 1900, who compiled a list of 23 remaining outstanding problems in math that hadn't been solved yet (some, like the Riemann hypothesis, still haven't been). This was around the same time that many physicists felt that the major outstanding questions of physics had been solved , and that the future would consist mainly of refinements rather than new breakthroughs (then Einstein, Bohr and others came along...).

One of Hilbert's discoveries was that you could fill a 2-dimensional space by creating a space-filling curve, by means of an infinite series of recursive steps (i.e. a fractal): basically, any point in the space would reside on the curve after n iterations of a recursive process. Many people found this idea rather disturbing, and mathematicians at that time came up with a variety of strange "pathological" curves. For example, if you studied parabolas and other smooth curves in college algebra, you are likely accustomed to seeing "continuous curves". Some mostly continuous curves have points that are non-continuous, for example the graph of a tangent where x = 0, the tangent approaches infinity on one side of 0, and negative infinity on the other side of 0. The graph of the exponential function does the same thing around 0. Anyway, someone came up with a connected curve that wasn't continuous anywhere, which my recollection is that you can't actually draw.

Anyway, getting back to software engineering, when creating a spatial index, SQL Server apparently creates a path through a geographical area which is  a specified number of steps along the recursive process of filling the area. The above article ( Spatial Indexing Overview) illlustrates the process in terms of successively finer grids that cover the object to be indexed, and describes how the numbering of the grid cells is used to create an index entry.

So for me, this presentation was yet another instance of software engineering suddenly colliding with what had in college seemed like a pretty abstract area of knowledge. Anyone know of any applications of Algebraic Topology in software engineering?

(Reassuring note: subsequent blog posts will be a bit more directly related to Entity Framework topics)