In his post last year, Integrating Virtual Earth and GeoNames, Johannes described how to load the GeoNames database into SQL Server using the "Import Data" Wizard in SQL Server Management Studio. In his scenario, Johannes did not need to take advantage of the spatial type support in SQL Server 2008, since there were columns of latitude and longitude available for use by Virtual Earth. But what if you wanted to load the GeoNames locations into SQL Server Spatial and use the new geospatial features? This post describes how to do that in 6 easy steps...
- Download GeoNames data
- Prepare GeoNames data
- Create a database table for the GeoNames data
- Load the GeoNames data
- Add, update and index the spatial data column
- Sample spatial query
So, let's get started...
Download GeoNames Data
You can find GeoNames data organized by country (and a few other organizational categories) by browsing the GeoNames Download page. If you wanted to download the GeoNames data for the United States, you would choose US.zip. For this exercise, we are going to download the allCountries.zip file, containing the full worldwide GeoNames database. Unzipping the downloaded file will produce the file allCountries.txt.
Note: This version of GeoNames data was downloaded on January 13, 2009 and contained 6,906,334 records.
Prepare GeoNames Data
The GeoNames data is available as a tab-delimited, UTF-8 encoded text file. Each record is terminated with a newline character ('\n'). SQL Server 2008 does not support UTF-8; it expects such data in UTF-16 encoding, with each record delimited by a carriage return/newline combination ('\r\n'). Additionally, the file needs a Byte Order Marker (BOM) at its beginning. Johannes introduced NAnt as a way to convert UTF-8 data to UTF-16. I chose another way. Using the Windows text editor, EditPad Pro, I converted the allCountries.txt file into a form compatible with SQL Server 2008. Here is the workflow I used with EditPad Pro:
Note: While this conversion could conceivably be done in Word 2007, Word cannot open and operate efficiently on files of this size. EditPad Pro can reasonably handle data of this magnitude.
a. Open allCountries.txt file in EditPad Pro.
b. Set Record Delimiter. In the "Convert" menu (main menu), select "To Windows (CR LF)". This will set the record delimiter to '\r\n'.
This took several minutes to complete in EditPad Pro after the "To Windows (CR LF)" option was selected - presumably loading the file into memory and performing the requested operation...
c. Set text encoding. While still in the "Convert" menu, choose "Text Encoding".
In the Text Encoding menu note the Original Encoding, set to Unicode, UTF-8. Select the "Encode the original data with another character set." button. Under New Encoding, choose "Unicode, UTF-16 little endian".
After hitting "OK", my humble dual proc machine with 2GB of RAM squawked about low virtual memory, but continued to process the file...
d. Set the Byte Order Marker (BOM). In the "Options" menu (main menu), select "Configure File Types...".
Choose the "Encoding" tab. Under File Types, make sure "Text Document" is selected. Make sure that the Text Encoding is set as follows:
Note on the Default Line Break Style: While the line break appears to have been set already under step b. (above), setting it in both places is required, though it is not obvious why...
Note on the Byte Order Marker (BOM): You can see the BOM as the FF FE bytes at the start of the file by pressing Ctrl+H in EditPad Pro (this switches EditPad Pro to hex display mode). If your file does not include the BOM, SQL Server will complain, when loading data, that the "...file does not have a Unicode signature."
e. Create new file copy, correctly encoded for SQL Server 2008. In the "File" menu (main menu), select "Save Copy As...".
Save the file with the desired name (I used allCountries_utf16.txt):
The file, allCountries_utf16.txt, is now a Unicode UTF-16, BOM encoded text file with tab-delimited ('\t') fields and CR/LF row terminators ('\r\n'), ready for SQL Server 2008.
Before we leave this section, here are some statistics:
Create a database table for the GeoNames data
Here is the T-SQL to create a table called GeoNames...
CREATE TABLE GeoNames(
geonameid int NOT NULL,
name nvarchar(200) NULL,
asciiname nvarchar(200) NULL,
alternatenames nvarchar(max) NULL,
latitude float NULL,
longitude float NULL,
feature_class char(2) NULL,
feature_code nvarchar(10) NULL,
country_code char(3) NULL,
cc2 char(60) NULL,
admin1_code nvarchar(20) NULL,
admin2_code nvarchar(80) NULL,
admin3_code nvarchar(20) NULL,
admin4_code nvarchar(20) NULL,
population int NULL,
elevation int NULL,
gtopo30 int NULL,
timezone char(31) NULL,
modification_date date NULL
)
Note on the alternatenames column: This column holds the GeoNames field which requires Unicode (UTF-8, UTF-16).
Load the GeoNames data
I used the BULK INSERT command to load the allCountries_utf16.txt file. Note the DATAFILETYPE = 'widechar' parameter - this is required for loading UTF-16 data.
BULK INSERT GeoNames
FROM 'C:\allCountries_utf16.txt' -- substitute the path to your converted file
WITH (
DATAFILETYPE = 'widechar',
FIELDTERMINATOR = '\t',
ROWTERMINATOR = '\n'
)
--(6906334 row(s) affected) (00:05:44)
Note on the ROWTERMINATOR: While I have changed the row terminator in the allCountries_utf16.txt file to '\r\n', SQL Server requires the ROWTERMINATOR to be set to '\n'. The reason: when '\n' is specified as the row terminator for a bulk import, SQL Server implicitly expects the carriage return/line feed combination ('\r\n').
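Before moving on, a quick sanity check on the load is worthwhile (this check is my own addition; the expected count comes from the download noted above):

```sql
-- Verify the load: the row count should match the source file (6,906,334 records)
SELECT COUNT(*) AS row_count FROM GeoNames

-- Eyeball a few rows to confirm the columns lined up with the tab-delimited fields
SELECT TOP 5 geonameid, name, latitude, longitude FROM GeoNames
```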
Add, update and index the spatial data column
The current table does not contain a column to hold native spatial data. Since the data is in WGS84 geographic coordinates, I used the geography data type for the new column:
ALTER TABLE GeoNames
ADD geog GEOGRAPHY NULL
To populate the new geography column (geog), I used the STGeomFromText constructor to create a POINT instance for each GeoNames row. Since this is a text-based constructor, the longitude and latitude values, coming from the columns of the same names, need to be converted into text using the CAST function. The SRID argument of the STGeomFromText constructor is set to 4326, the EPSG code for WGS84 coordinates.
UPDATE GeoNames
SET geog = GEOGRAPHY::STGeomFromText
('POINT(' + CAST(longitude AS CHAR(20))
+ ' ' + CAST(latitude AS CHAR(20)) + ')', 4326)
--(6906334 row(s) affected) (00:09:40)
Note on coordinate order: Since this is an OGC-based operator, the WKT is formed as: 'POINT (<longitude> <latitude>)'.
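As an aside, SQL Server 2008 also offers the static Point() method on the geography type, which takes numeric arguments directly and so avoids the CAST calls; a sketch of the equivalent update (note that, unlike WKT, Point() takes latitude first):

```sql
-- Equivalent population of the geog column using the Point() method,
-- which takes (latitude, longitude, SRID) - the reverse of WKT coordinate order
UPDATE GeoNames
SET geog = GEOGRAPHY::Point(latitude, longitude, 4326)
```

Either approach should produce the same point instances.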
In order to create a spatial index, the table must have a primary key. While I could have created the primary key in the CREATE TABLE DDL, this would have meant that the data was loaded into an existing index, slowing down the load process considerably.
ALTER TABLE GeoNames
ADD CONSTRAINT pk_geonames_geonameid
PRIMARY KEY (geonameid)
Here is the DDL to create a spatial index on the geography column, geog:
CREATE SPATIAL INDEX geonames_mmmm16_sidx
ON GeoNames(geog)
USING GEOGRAPHY_GRID
WITH (
GRIDS = (MEDIUM, MEDIUM, MEDIUM, MEDIUM),
CELLS_PER_OBJECT = 16,
PAD_INDEX = ON
)
Sample spatial query
As an example of the type of queries which can now be supported using the GeoNames table, consider the following: "Find all place names within 1 kilometer of the Travis County (Texas) boundary line":
Note: the distance parameter for the STBuffer() method (geography data type) is in meters...
DECLARE @g GEOGRAPHY
DECLARE @h GEOGRAPHY
DECLARE @i GEOGRAPHY
DECLARE @j GEOGRAPHY
SELECT @g = geog FROM US_Counties WHERE NAME = 'Travis'
--SQL Server Execution Times: Elapsed time = 8 ms
SELECT @h = @g.STBuffer(1000) -- Positive buffer around Travis County
--SQL Server Execution Time: Elapsed time = 16 ms
SELECT @i = @g.STBuffer(-1000) -- Negative buffer around Travis County
--SQL Server Execution Time: Elapsed time = 16 ms
SELECT @j = @h.STDifference(@i) -- Difference Polygon
--SQL Server Execution Time: Elapsed time = 4 ms
SELECT name FROM GeoNames WHERE geog.STIntersects(@j)=1
ORDER BY name ASC
--91 rows affected
--SQL Server Execution Time: Elapsed time = 823 ms.
Here is the list of place names returned by the query:
Cypress Elementary School
McNeil High School
Here is the visual result of the place names locations, within the county boundary buffer, presented in Management Studio:
Here is a more detailed view (the actual point locations are represented by the black dots in the center of each point symbol):
And there you have it - sub-second performance against 6.9 million rows with a complex spatial object on a very modest machine...
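As a further sketch of what the indexed table now supports, here is a simple proximity query. The coordinates are illustrative (roughly downtown Austin), not taken from the original post:

```sql
-- Find all place names within 5 km of a point; STDistance works in meters
-- for the geography type, and this predicate form can use the spatial index
DECLARE @p GEOGRAPHY
SET @p = GEOGRAPHY::STGeomFromText('POINT(-97.7431 30.2672)', 4326)

SELECT name, geog.STDistance(@p) AS meters
FROM GeoNames
WHERE geog.STDistance(@p) <= 5000
ORDER BY meters
```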