K-nearest neighbor spatial search

I apologize to everyone for the hiatusĀ - I realise a post is more than a little overdue and will try to be more regular in the future. I appreciate the support that I've received for the blog and continuing it.

Consider the following problem: you're starting a website for a chain of restaurants, and you want to have a "Locations" feature where a person can enter their address and get a list of the 10 outlets closest to their location, in order of distance from their locationĀ (as the crow flies, for simplicity). Moreover, each time they hit the "next" button, they should get the next 10 locations in order of distance.

More abstractly, say you have a list of locations on a 2D map, each specified as a point with an x and y coordinate. You can do any sort of preprocessing you like of this list of points. Then I ask you to efficiently answer: given some point, which is not necessarily one of the points in the list, produce a list of the k points closest to that point, in order of their distance from that point.

This problem, called the k-nearest neighbor problem, is very well-studied and has a variety of solutions. Here we describe one very simple but effective incremental algorithm first proposed by Hjaltason and Samet in their "Ranking in Spatial Databases" in 1995 (Citeseer). A spatial database is a database specialized to perform this type of query over its (potentially very large) data set.

Consider the following very simple but inefficient algorithm: we take each point, one at a time, and add it to a priority queue with its key equal to the distance from the search point. We then extract minimum repeatedly to obtain the points in order of their distance from the search point. Once we've extracted the first k, we're done, and with a good data structure this requires only O(k log n) time. Unfortunately, even with the best priority queue data structures based on Fibonacci heaps, adding all n points requires amortized linear (O(n)) time and space, which for a very large number of points as one would expect to find in a spatial database is prohibitive.

One approach to narrowing down the points we have to examine is to split the space into larger regions. The idea is that points close to the search point should be found in regions close to the search point. In more detail, we first divide the space into two rectangular regions, each containing some of the points. Each of these regions is, in turn, subdivided into two rectangular regions, and we continue in this manner until all the resulting regions have some small constant number of points in them. The result is a tree of regions, where each node is a subregion of its parent node, and each point lies in one of the leaves. It really doesn't matter how we choose to divide up the space - the authors used PMR quadtrees, but you could use kd-trees, simple quadtrees, or whatever you like. All that matters is that in the end each leaf region contains only a small number of points.

Creating the regions requires examining all the points, but the key is that this part does not depend on the search point, and so only has to be done once in preprocessing. The regions may require rebalancing as points are added or removed, but this can be done efficiently as well.

However, there's a trick to this: if the search point lies near the boundary of a region, its closest points may actually lie in a different, adjacent region. How can we correctly locate the nearest point without spuriously gathering all the points in all the regions around the point?

Here's the key insight: our priority queue will contain two types of objects, single points and regions. The key of each point is set to its distance from the search point, while the key of each region is set to the minimum distance of that region from the search point. Each time we extract a point from the queue, we output it; each time we extract a region from the queue, we expand it by adding its subregions to the queue, or if it's a leaf region, the points in that region. Because the region's key is set to the minimum distance of that region from the search point, any point that is outputted before the region is expanded is closer to the search point than any point in the region. This ensures correctness.

k-nearest neighbor spatial search example

For example, consider the search problem shown to the right. There are 9 points, and we wish to find the 5 closest points to the big red search point in order. The space is recursively subdivided into regions as shown in the tree.

For the purpose of discussion we will refer to each region by the set of points it contains. Initially, the queue contains only the large region containing all points.

Current queue in order: {1,2,3,4,5,6,7,8,9}

This contains the red search point, so its minimum distance from the search point (its key) is zero. We remove it from the queue and add its two children {1,3,4,8} and {2,5,6,7,9}. The region {2,5,6,7,9} again contains the red search point and so has key zero. The minimum distance to the region {1,3,4,8} is shown by the red line extending out of the red point and to the left and is larger than zero.

Current queue in order: {2,5,6,7,9}, {1,3,4,8}

We extract and expand {2,5,6,7,9}, getting leaf region {2} and region {5,6,7,9}. The distance to region {2} is shown by the red line extending upward from the red point, which is clearly longer than the distance to region {1,3,4,8}. The region {5,6,7,9} has distance zero.

Current queue in order: {5,6,7,9}, {1,3,4,8}, {2}

We extract and expand {5,6,7,9} to get the regions {5,9} and {6,7}. The distance to region {6,7} is shown by the red line extending right from the red point, the longest distance so far encountered. The region {5,9} contains the red point and has distance zero.

Current queue in order: {5,9}, {1,3,4,8}, {2}, {6,7}

Next, we extract and expand region {5,9}. This is a leaf region, so we add the points 5, 9 to the queue. Point 5 is only slightly closer to the red point than region {2}, but farther than region {1,3,4,8}. Point 9 is farther than everything so far encountered.

Current queue in order: {1,3,4,8}, 5, {2}, {6,7}, 9

We extract and expand {1,3,4,8} to get {1,3,4}, at the same distance as {1,3,4,8}, and {8}, the distance of which is shown by the diagonal red line.

Current queue in order: {1,3,4}, {8}, 5, {2}, {6,7}, 9

We extract and expand {1,3,4}, which is a leaf region, and add the three points. Point 4 is closest, point 3 lies between {6,7} and 9 in distance, and point 1 is slightly closer than point 9.

Current queue in order: 4, {8}, 5, {2}, {6,7}, 3, 1, 9

We extract and output point 4, the closest point. We expand region {8} and add point 8, which is slightly farther than point 3.

Current queue in order: 5, {2}, {6,7}, 3, 8, 1, 9

We extract and output point 5. We expand region {2} and add point 2, which is slightly closer than point 8.

Current queue in order: {6,7}, 3, 2, 8, 1, 9

We extract and expand region {6,7}, adding points 6 and 7. Point 6 is farthest yet and point 7 even farther.

Current queue in order: 3, 2, 8, 1, 9, 6, 7

We've now expanded all regions and the remaining points in the queue are the remaining points not yet outputted in order.

The nice thing about this particular algorithm is that it's fully incremental: we could have stopped at any time once we had enough points, and pick up later when we want more. This makes it quite appealing in database queries, where you don't really know how many points you need in advance, because you might want to filter the results according to additional attributes. In practice, in a typical spatial database, one only needs to expand a small fraction of the regions in order to find a small number of closest points. The idea can be extended into higher-dimensional regions, spatial objects other than points such as areas and line segments, and even more general problems where one can easily compute the minimum key of a subtree of objects.

I hope you found this informative and I appreciate your comments.