Geocoding semi-imaginary data points

As improbable (and humorous) as it may seem, this is an exercise that I recently engaged in: geocoding semi-imaginary data points.

For those of you who are familiar with the original AdventureWorks database, you’ll remember that there are a large number of fictitious addresses mixed into the database. For the updated schema AdventureWorks2008, we added a GEOGRAPHY column to contain the lat/long for each address to show off one of the new spatial types… which is the point at which it landed on my desk. Thump. Heh.

Although I fully intended to construct Service Broker-enabled sample built on triggers and CLR stored procedures that leverage the MapPoint web service to keep the Person.Address.SpatialLocation evergreen, I ran out of runway when CTP6 got it into its purty little head to go early… Doh! I still plan to go back and finish the skeleton solution that I started, but y’all know what happens when timelines get compressed…

Fortunately, the Virtual Earth and MapPoint SDK came to my rescue. Formatting the files for uploading isn’t clearly documented in the SDK, IMO, but if I can muddle through and figure it out, I’m sure that you can, too! Heh. Using the web-based Virtual Earth developer tools (which you can sign up for to use for free), I extracted all of the AdventureWorks2008 addresses to a flat file (which was much easier in the new, more normalized model that put all of the addresses in one table) and uploaded it to MapPoint for geocoding. Which is when the big surprise arrived!

MapPoint reported that it had matched more than 2% of our known imaginary addresses AT THE ADDRESS LEVEL… Whoops! That shouldn’t happen. My spidey sense was tingling because I already spend waaay too much time talking to legal as it is. In the immortal words of Scooby Doo, "Ruh-roh!"

I was hoping for matching at the city or postal code level at best. Fortunately, upon further examination, they all seemed to be false positives on street names with very similar spellings or other similar "close but not quite" matches. Relieved that I didn’t have a legal issue on my hands, I turned to the "lazy database developers tool" (a.k.a. Microsoft Office Excel 2007) to crack open the 20k rows of flat file that I needed to produce a raft of update statements similar to:

UPDATE Person.Address SET SpatialLocation = geography::Point(42.0231369319282, -118.535265856701, 4326) WHERE AddressID = 1;

Because the vast majority of my data was geocoded only to the city or postal code, it wouldn’t look very "purty" (a technical term) on a map to have all the points in one big dollop for each city or postal code, so a small amount of randomness was in order. (In retrospect, 15% might be a bit too much potential randomness.) Add a formula to my table, Fill|Down and Excel 2007 did it’s little magic for all 19,614 rows:

="UPDATE Person.Address SET SpatialLocation = geography::Point(" & G2+(RANDBETWEEN(-15,15)/100)*G2 & ", " & I2+(RANDBETWEEN(-15,15)/100)*I2 & ", 4326) WHERE AddressID = " & A2 & ";"

Where G2 is the original latitude of the city or postal code and I2 is the longitude. (At least as long as the order is lat/long and not long/lat.) And, of course, A2 is the integer identity of the row in Person.Address.

The +/-15% randomness may result (in this iteration) in addresses that fall inside water boundaries or other undesirable locations; however, time constraints limited my options and ability to visually check the alignment. For future releases, time-permitting, we’ll consider taking the time to lay them all out with MapPoint and adjusting for a desired level of "purtiness."

I did manage to encounter an interesting CLR error during my testing. After running the same script a few too many times in my Virtual PC environment with an early pre-CTP6 build, I apparently managed to exhaust the RAM available to the CLR (and thereby to the GEOGRAPHY data type):

Msg 10316, Level 16, State 1, Line 11521

The app domain with specified version id (100) was unloaded due to memory pressure and could not be found.

Whoops! With a pre-CTP6 build 10.00.1098.11, bouncing the service didn’t help, but restarting the virtual environment allowed me to recover (of course). FWIW, I have not been able to kill the CTP6 build 10.00.1300.13 the same way (yet)… I’ll keep trying, though. Heh.

Comments (2)

  1. Reed Me says:

    One of the shiny new features added to Management Studio for 2008 allows you to plot spatial data for