Nearest Neighbors

Article
10/22/2008

Hi Folks,

Spatial users often want to find the object nearest a given point. This operation, usually referred to as nearest neighbor search, is remarkably common in many areas of computer science. In general, we may wish to find not only the nearest, but the k-nearest neighbors.

How can we accomplish this with SQL Server? Here we’ll look at finding the single nearest neighbor; the extension to k-nearest neighbors is relatively straight forward.

First, let's examine the naive method for accomplishing this: simply order the table by distance and restrict the results. For all of these examples, we’ll assume a table T with a spatial column g, as well as a parameter @x containing the search point:

 SELECT TOP(1) *

 FROM T

 ORDER BY g.STDistance(@x) ASC

This solution certainly has simplicity on its side, but consider the work that needs to be done. The entire table must be scanned, and the distance of each to the search point must be calculated. Ouch.

We could conceivably improve on this by restricting our search space to to the immediate region around the target point:

 DECLARE @region geography = @x.STBuffer(10000)

 SELECT TOP(1) *

 FROM T

 WHERE g.Filter(@region) = 1

 ORDER BY g.STDistance(@x) ASC

But this solution requires that we know our data very well: if there are no rows in the region, then we will fail to find the nearest neighbor; if there are too many, then we will again be left with a rather inefficient query.

How do we escape this morass? We can do so by starting with a very small region---so small that we can be certain not to encounter too many results---and then keep enlarging it until we find something. Doing this with a loop is not hard, but Steven Hemingray showed me how to do this with entirely declarative syntax:

 DECLARE @start FLOAT = 1000;

 WITH NearestPoints AS

    SELECT TOP(1) WITH TIES *,  T.g.STDistance(@x) AS dist

    FROM Numbers JOIN T WITH(INDEX(spatial_index))

    ON T.g.STDistance(@x) < @start*POWER(2,Numbers.n)

    ORDER BY n

 SELECT TOP(1) * FROM NearestPoints

 ORDER BY n, dist

This requires some explanation. First, the @start parameter gives the initial region to search. I’ve chosen one kilometer, but this can be adjusted downward if your data is very dense. Second, you’ll notice that we make use of a Numbers table, which just contains the numbers 1 through n. This just contains a long list of integers, which is is useful in many situations.

The inner query examines a set of exponentially-expanding regions. The ORDER BY clause along with the TOP(1) allows the query to stop as soon as it finds the smallest non-empty region. The WITH TIES statement makes sure that all of the objects in that region will be in the result set.

Once the inner query returns a list of potential results, the outer query examines them to find which is actually nearest. With this approach, we can select a start area small enough to keep the cost low in dense data, but also be guaranteed to find a distant nearest neighbor.

Incidentally, if you don’t already have a numbers table, you can create one quite quickly with some mildly-black magic like this:

 SELECT TOP 100000 IDENTITY(int,1,1) AS n

 INTO numbers

 FROM MASTER..spt_values a, MASTER..spt_values b

 CREATE UNIQUE CLUSTERED INDEX idx_1 ON numbers(n)

This isn’t a particularly pretty solution, but to proactively answer a question, we didn’t add a method for this primarily because we ran out of time. Look for something more built-in the next go around.

Cheers,

-Isaac

Nearest Neighbors

Additional resources