Loading GeoNames Data Into SQL Server 2008 (yet another way...)

In his post last year, Integrating Virtual Earth and GeoNames, Johannes described how to load the GeoNames database into SQL Server using the "Import Data" Wizard in SQL Server Management Studio. In his scenario, Johannes did not need to take advantage of the spatial type support in SQL Server 2008, since there were columns of latitude and longitude available for use by Virtual Earth.  But what if you wanted to load the GeoNames locations into SQL Server Spatial and use the new geospatial features?  This post describes how to do that in 6 easy steps...

  1. Download GeoNames data
  2. Prepare GeoNames data
  3. Create a database table for the GeoNames data
  4. Load the GeoNames data
  5. Add, update and index the spatial data column
  6. Sample spatial query

So, let's get started...

Download GeoNames Data 

You can find GeoNames data organized by country (and a few other organizational categories) by browsing the GeoNames Download page. If you wanted to download the GeoNames data for the United States, you would choose US.zip.  For this exercise, we are going to download the allCountries.zip file, containing the full worldwide GeoNames database.  Unzipping the downloaded file will produce the file allCountries.txt.

Note: This version of GeoNames data was download on January 13, 2009 and contained 6,906,334 records.

Prepare GeoNames Data

The GeoNames data is available as a tab-delimited, UTF-8 encoded text file. Each record is terminated with newline character ('\n').  SQL Server 2008 does not support UTF-8, expecting such data in UTF-16 encoding with each record delimited with a carriage return, newline combination ('\r\n').  Additionally, it needs a Byte Order Marker (BOM) at the beginning of the file.  Johannes introduced NAnt as a way to convert UTF-8 data to UTF-16.  I chose another way.  Using the Windows text editor, EditPad Pro, I converted the allCountries.txt file into a form compatible with SQL Server 2008. Here is the workflow I used with EditPad Pro:

Note: While this conversion could conceivably be done in Word 2007, Word cannot open and operate efficiently on files of this size. EditPad Pro can reasonably handle data of this magnitude.

a. Open allCountries.txt file in EditPad Pro.

b. Set Record Delimiter. In the "Convert" menu (main menu), select "To Windows (CR LF)".  This will set the record delimiter to '\r\n'.

image

This took several minutes to complete in EditPad Pro after the "To Windows (CR LF)" option was selected - presumably loading the file into memory and performing the requested operation...

c. Set text encoding. While still in the "Convert" menu, choose "Text Encoding". 

image

In the Text Encoding menu note the Original Encoding, set to Unicode, UTF-8.  Select the "Encode the original data with another character set." button.  Under New Encoding, choose "Unicode, UTF-16 little endian".

image

After hitting "OK", my humble dual proc machine with 2GB of RAM squawked about low virtual memory, but continued to process the file...

d. Set the Byte Order Marker (BOM).   In the "Options" menu (main menu), select "Configure File Types...".

 image

Choose the "Encoding" tab.  Under File Types, make sure "Text Document" is selected. Make sure that the Text Encoding is set as follows:

image 

Note on the Default Line Break Style: While it appears that the line break was already set under stop b. (above), it appears that setting the line break in both places is required, though it is not obvious, why...

Note on the Byte Order Marker (BOM): You can see the BOM as the FF FE bytes at the image start of the file by pressing Crtl+H in EditPad Pro (this switches EditPad Pro to hex display mode). If your file does not include the BOM, SQL Server will complain, when loading data, that the "...file does not have a Unicode signature."

 

 

e.  Create new file copy, correctly encoded for SQL Server 2008. In the "File" menu (main menu), select "Save Copy As...".

image

Save the file with the desired name (I used allCountries_utf16.txt):

image

The file, allCountries_utf16.txt, is now a Unicode UTF-16, BOM encoded text file with tab-delimited ('\t') fields and CR\LF row terminators ('\r\n'), ready for SQL Server 2008.

Before we leave this section, here are some statistics:

FILE                SIZE
allCountries.zip      174,594 KB
allCountries.txt      783,093 KB
allCountries_utf16.txt   1,563,976 KB

Create a database table for the GeoNames data

Here is the T-SQL to create a table called GeoNames...

CREATE TABLE GeoNames(
geonameid int NOT NULL,
name nvarchar(200) NULL,
asciiname nvarchar(200) NULL,
alternatenames nvarchar(max) NULL,
latitude float NULL,
longitude float NULL,
feature_class char(2) NULL,
feature_code nvarchar(10) NULL,
country_code char(3) NULL,
cc2 char(60) NULL,
admin1_code nvarchar(20) NULL,
admin2_code nvarchar(80) NULL,
admin3_code nvarchar(20) NULL,
admin4_code nvarchar(20) NULL,
population int NULL,
elevation int NULL,
gtopo30 int NULL,
timezone char(31) NULL,
modification_date date NULL
)
GO

Note on the alternatenames column: This column holds the GeoNames field which requires Unicode (UTF-8, UTF-16).

Load the GeoNames data

I used the BULK INSERT command to load the allCountries_utf16.txt file.  Note the DATAFILETYPE = 'widechar' parameter - this is required for loading UTF-16 data.

BULK
INSERT GeoNames
FROM 'C:\temp\allCountries_utf16.txt'
WITH(
DATAFILETYPE = 'widechar',
FIELDTERMINATOR = '\t',
ROWTERMINATOR = '\n'
)
GO
--(6906334 rows(s) affected) (00:05:44)

Note on the ROWTERMINATOR: While I have changed the Row Terminator in the allCountries_utf16.txt file to '\r\n', SQL Server, curiously, requires the Row Terminator to be set to '\n'. There is no good explanation for this...

Add, update and index the spatial data column

The current table does not contain a column to hold native spatial data.  Since the data is in WGS84 geographic coordinates, I used the geography data type for the new column:

ALTER TABLE GeoNames
ADD geog GEOGRAPHY NULL
GO

To populate the new geography column (geog), I used the STGeomFromText constructor to create a POINT feature for each GeoNames row.  Since this is a text-based constructor, the longitude and latitude data, coming from columns of like name, will need to be converted into text using the CAST function.  The SRID for the STGeomFromText constructor is set to 4326, the EPSG value for WGS84 coordinates.

UPDATE GeoNames
SET geog = GEOGRAPHY::STGeomFromText
    ('POINT(' + CAST(longitude AS CHAR(20))
    + ' ' + CAST(latitude AS CHAR(20)) + ')',4326)
GO
--(6906334 rows(s) affected) (00:09:40)

Note on coordinate order: Since this is an OGC-based operator, the WKT is formed as: 'POINT (<longitude> <latitude>)'.

In order to create a spatial index, the table must have a primary key.  While I could have created the primary key in the CREATE TABLE DDL, this would have meant that the data was loaded into an existing index, slowing down the load process considerably.

ALTER TABLE GeoNames
ADD CONSTRAINT pk_geonames_geonameid
PRIMARY KEY (geonameid )
GO

Here is the DDL to create a spatial index on the geography column, geog:

CREATE SPATIAL INDEX geonames_mmmm16_sidx
ON GeoNames(geog)
USING GEOGRAPHY_GRID
WITH (
GRIDS = (MEDIUM, MEDIUM, MEDIUM, MEDIUM),
CELLS_PER_OBJECT = 16,
PAD_INDEX = ON

   )
GO
--(00:10:30)

Sample spatial query

As an example of the type of queries which can now be supported using the GeoNames table, consider the following: "Find all place names within 1 kilometer of the Travis County (Texas) boundary line":

Note: the distance parameter for the STBuffer() method (geography data type) is in meters...

DECLARE @g GEOGRAPHY
DECLARE @h GEOGRAPHY
DECLARE @i GEOGRAPHY
DECLARE @j GEOGRAPHY
SELECT @g = geog FROM US_Counties WHERE NAME = 'Travis'
--SQL Server Execution Times: Elapsed time = 8 ms
SELECT @h = @g.STBuffer(1000) -- Positive buffer around Travis County
--SQL Server Execution Time: Elapsed time = 16 ms
SELECT @i = @g.STBuffer(-1000) -- Negative buffer around Travis County
--SQL Server Execution Time: Elapsed time = 16ms
SELECT @j = @h.STDifference(@i)-- Difference Polygon
--SQL Server Execution Time: Elapsed time = 4 ms
SELECT name FROM GeoNames WHERE geog.STIntersects(@j)=1
ORDER BY name ASC
--91 rows affected
--SQL Server Execution Time: Elapsed time = 823 ms.

Here is the list of place names returned by the query:

Anderson Mill Anderson Mill Elementary School Apple Spring Apple Spring Hollow Austin Air Ranch Airport Austin Raceway Park Balcones Country Club Balcones Country Club Lake Balcones Country Club Lake Dam Bear Creek Best Western Southgate Inn Ste Blessed Juan Diego Catholic Church Bonnet Cemetery Boultinghouse Mountain Bratton Cemetery Bridgeway Community Church Cambrian Creek Canyon Vista Middle School Canyon Vista Pool Cedar Park High School Center Union Church Chalk Knob Chalk Knob Branch Child Evangelism Fellowship Church Church of Christ - Pond Springs Clarence Washington Farm Dam Clarence Washington Farm Lake Cottonwood Branch Crestwood Suites - Austin Cypress Creek Baptist Church

Cypress Elementary School Devils Hollow Esa Austin-Round Rock-South Fall Creek Cemetery Fitzhugh Fitzhugh Cemetery Garfield Pumping Station Gateway Community Church Gay Hollow Hamilton Hill Hammetts Crossing Hampton Inn Austin Round Rock Harris Branch Haynie Flat Cemetery Hilton Garden Inn Round Rock Hope Presbyterian Church Huddleston Cemetery Ingram Cemetery Jollyville Jollyville Cemetery Jollyville Elementary School KGTN-AM (Creedmoor) Koenig Ranch La Frontera in Round Rock La Quinta Inn & Suites Round Rock South Manchaca Optimist Youth Sports Complex Manchaca Springs Marriott Austin North Martin Hill McNeil

McNeil High School Merrelltown Muleshoe Bend Muleshoe Bend Trail New Hope Community Church Niederwald Cemetery North Creek Park Pond Springs Presbyterian Church of the Hills Purple Sage Elementary School Rattan Creek Trail Red Bluff Creek Residence Inn By Marriott Austin Round Rock Rhodes Cemetery Rim Rock Trail Round Rock Korean Presbyterian Church Round Rock Opportunity Center Roy Creek Saint Vincent de Paul Catholic Church Shingle Hills Signal Hill Spicewood Elementary School Spillar Ranch Studio 6 Austin Northwest Tanglewood Spring Texas No Name Number 9 Dam The Marbridge School Turkey Bend Turkey Bend Trail Vasquez Cemetery Woods Hill

Here is the visual result of the place names locations, within the county boundary buffer, presented in Management Studio:

image

Here is a more detailed view (the actual point locations are represented by the block dots in the center of each point symbol):

image

And there you have it - sub-second performance against 6.9 million rows with a complex spatial object on a very modest machine...

 

Technorati Tags: GeoNames,Unicode,UTF-8,UTF-16,SQL Server,Spatial,Data Loading,EditPad