Working with Invalid Data and the SQL Server 2008 Geography Data Type, Part 2

In a long past post, Working with Invalid Data and the SQL Server 2008 Geography Data Type, Part 1b (August 2008), I discussed techniques for dealing with invalid data and the geography data type. In my last post, I listed the new geography validation functions in the SQL Server Spatial Tools project on CodePlex.  In this post, I'm going to use three of these new functions:

  • IsValidGeographyFromGeometry
  • MakeValidGeographyFromGeometry
  • MakeValidGeographyFromText

For this exercise, I'm going to use the Zillow Neighborhood Boundaries .  Since the neighborhood data (polygons) are delivered as shapefiles, I used the well respected Shape2SQL data loader from SharpGIS. Note that there are 42 individual state-based shapefiles which will need to be individually downloaded and then loaded into SQL Server. Since the goal of this exercise is to end up with geography features, the data must be in geographic coordinates to start with. The projection file (.prj) associated with each shapefile confirmed this, containing the following information:

GEOGCS["GCS_North_American_1983",DATUM["D_North_American_1983",SPHEROID["GRS_1980",6378137.0,298.257222101]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]]

This is EPSG:4269, so that's the SRID I used in Shape2SQL.

My goal is to load this data into a column of type geography so it would make sense to load the data directly into a geography column.  I'll save you the suspense and tell you right now that that some of this data will not load directly into a geography column - big surprise, right?  Consequently, I  loaded each shapefile's spatial data directly into a table, ZillowNeighborhoods, with a geometry column named GEOM.  After the data loading was completed, I added a column of type geography named GEOG:

ALTER TABLE ZillowNeighborhoods
ADD GEOG geography NOT NULL

Being bold, I next attempted to use the new MakeValidGeographyFromGeometry function to update the new geography column:

UPDATE ZillowNeighborhoods
SET GEOG = dbo.MakeValidGeographyFromGeometry(GEOM)

Well, suffice-it-to-say, it wasn't quite that easy, even with our new fangled function, which returned the following error message:

Msg 6522, Level 16, State 1, Line 1
A .NET Framework error occurred during execution of user-defined routine or aggregate "MakeValidGeographyFromGeometry":
System.ArgumentException: 24200: The specified input does not represent a valid geography instance.

So, what to do next? Why not use the IsValidGeographyFromGeometry function so identify the candidate group of which geometries contained invalid data from a geography-perspective:

SELECT ID FROM ZillowNeighborhoods WHERE dbo.IsValidGeographyFromGeometry(GEOM) = 0
--Results:
--3499
--4319
--4328
--4428
--4458
--5182
--5201
--5214
--5224

Now I needed to figure out which of these geometries could not be converted, so I used the following query, once for each ID:

SELECT dbo.MakeValidGeographyFromGeometry(GEOM) FROM ZillowNeighborhoods WHERE ID = 3499

The queries which failed allowed me to isolate the problem geometries to the following rows (identified by the ID field): 

3499, 4328, 5224

Before I go any further, let's update the geography column, sans the problematic polygons:

UPDATE ZillowNeighborhoods
SET GEOG = dbo.MakeValidGeographyFromGeometry(GEOM)
WHERE ID NOT IN (3499, 4328, 5224)

Now, let's investigate the problem geometries to see if we can determine what the issues are that prevent conversion. Here is the first troublesome geometry (ID = 3499):

SELECT GEOM FROM ZillowNeighborhoods WHERE ID = 3499

Here is the returned geometry in the Spatial results tab in Management Studio:

clip_image001

A careful visual analysis didn't find any obvious defects in the geometry.

Then, I decided to fiddle with the original geometry using the Reduce method with very small tolerances. Interestingly enough, I was able to get the geometry to convert to a geography using the following query:

SELECT dbo.MakeValidGeographyFromGeometry(GEOM.Reduce(.0000001))
  FROM ZillowNeighborhoods WHERE ID = 3499

So what is happening here?  First of all, the tolerance used in the Reduce method is very small. So how many vertex points were removed from the original geometry by Reduce?  It turns out that 6 vertices were removed. A quick discussion with the SQL Server spatial developers revealed that during current validation processing some very close vertices can shift slightly and cause edges to cross/overlap (remember that geographic edges are not straight lines...).  In the next release of SQL Server (SQL11), we have changed the underlying methodology and eliminated this issue.  For now, we will have to workaround this issue using Reduce with very small tolerances. 

By chance, I happened to look at the comments on my August 2008 post, referenced above, and found that I was not the first one to discover this technique:

-----------------------------------------------------------------------

# Craig Tadlock said on February 16, 2009 9:36 PM:
This is an excellent post. It works for 99.9% of the geometries Ive had to deal with. For the other .1%, try adding this..

Reduce the geometry value by a very small amount 1st. I'm not actually sure what this does internally, but it's solved most my issues.

Example...

select geography::STGeomFromWKB(geom.MakeValid().Reduce(.00000001).STUnion(geom.STStartPoint()).MakeValid().STAsBinary(), 4326) from tl_2008_us_csa

-----------------------------------------------------------------------

So, my hat's off to you, Craig, a true invalid geography pioneer!

-----------------------------------------------------------------------

NOTE: I've subsequently done a little more investigation and found that Alastair (author of the first published SQL Server Spatial book , Beginning Spatial with SQL Server 2008) had already figured this trick out in his excellent post, Fixing Invalid Geography Data  (Oct. 18, 2008). Nothing like "discovering" what everybody else already knew ;-)

-----------------------------------------------------------------------

To complete the investigation, let's see if we can isolate the vertexes which were removed by Reduce. To do this I buffered the difference polygons between the original and generalized polygons using the following query:

--Find problem areas and mark with a buffer polygon
DECLARE @g0 GEOMETRY
DECLARE @g1 GEOMETRY

SELECT @g1 = GEOM.Reduce(.00000001) FROM ZillowNeighborhoods WHERE ID = 3499
SELECT @g0 = GEOM FROM ZillowNeighborhoods WHERE ID = 3499

SELECT @g0.STDifference(@g1).STBuffer(.001)

Using Safe Software's Feature Manipulation Engine's (FME) Universal Viewer, I've created the following visualization where I’ve overlaid the original and generalized geometry over each other with the modified areas indicated by the buffer regions generated with the above query. I then marked the obvious vertexes which were removed by the Reduce in the Windows Paint program. The remaining vertexes where so close to other vertexes that it was not possible to actually identify them, though it is clear where they are.

image

Needless-to-say, the generalized polygon with this tiny tolerance did not materially change the polygon's basic geometric quality. So I would posit that this technique is a valid method.  To complete this sequence, I updated the row where ID=3499 with the following query:

UPDATE ZillowNeighborhoods
SET GEOG = dbo.MakeValidGeographyFromGeometry(GEOM.Reduce(.00000001))
WHERE ID = 3499

So what about the other two polygons which would not convert (ID's: 4328, 5224)?  Well it turns out the "Reduce" technique works for these too.  But there is a twist ... it wouldn't be any fun if I was done, right?

It turns out that when run through the Reduce-based technique, each of these geometries produces GeometryCollections:

-----------------------------------------------------------------------

NOTE: The SQL Server Spatial Team has entertained many inquiries on why Reduce, under certain circumstances, would take higher dimension objects (such as polygons) and create lower dimension objects (such as lines and points) from them (most other spatial databases maintain the dimensionality of the spatial object when generalizing). The following examples are a good illustration of how to capitalize on the behavior of Reduce when it encounters extremely thin polygons.  

-----------------------------------------------------------------------

Here is the Well-Known Text (WKT) for ID = 5224 from the following query:

SELECT GEOM.Reduce(.0000001).STAsText() FROM ZillowNeighborhoods WHERE ID = 5224

GEOMETRYCOLLECTION (LINESTRING (-86.7028556317091 36.198983579874039, -86.702864557504654 36.198975399136543), LINESTRING (-86.72271192073822 36.195485174655914, -86.7227138876915 36.195484921336174), LINESTRING (-86.725671902298927 36.1651765704155, -86.7256792485714 36.165166437625885), POLYGON ((-86.724177449941635 36.164932250976562, -86.72216959297657 36.165002837777138, -86.720163896679878 36.165325194597244, ... , -86.725447744131088 36.164976820349693, -86.724177449941635 36.164932250976562)))

So what's with the LINESTRING's?  Let's take a look.

First, let's create geometry instances from each of the LINESTRINGs.  Since the LINESTRINGs are very small, let's buffer them to help make the area where there occur along the polygon visible:

DECLARE @g1 VARCHAR(MAX) = 'LINESTRING (-86.7028556317091 36.198983579874046, -86.702864557504654 36.19897539913655)'
DECLARE @g2 VARCHAR(MAX) = 'LINESTRING (-86.72271192073822 36.195485174655914, -86.7227138876915 36.195484921336167)'
DECLARE @g3 VARCHAR(MAX) = 'LINESTRING (-86.725671902298942 36.1651765704155, -86.7256792485714 36.165166437625885)'

SELECT GEOMETRY::STGeomFromText(@g1,4269).STBuffer(.001)
UNION ALL
SELECT GEOMETRY::STGeomFromText(@g2,4269).STBuffer(.001)
UNION ALL
SELECT GEOMETRY::STGeomFromText(@g3,4269).STBuffer(.001)
UNION ALL
SELECT GEOM FROM ZillowNeighborhoods WHERE ID = 5224

Here is the view from Management Studio:

image

Let's "zoom" in on a single problem area:

-----------------------------------------------------------------------
-- Isolate one of the problem areas which is identified by the linestring ...
-----------------------------------------------------------------------
-- Define linestring
DECLARE @g1 VARCHAR(MAX) = 'LINESTRING (-86.7028556317091 36.198983579874046, -86.702864557504654 36.19897539913655)'
-- Clip the original geometry instance by intersecting a buffer around the linestring with the original geometry instance
SELECT GEOMETRY::STGeomFromText(@g1,4269).STBuffer(.0001).STIntersection(GEOM)
FROM ZillowNeighborhoods WHERE ID = 5224
UNION ALL
SELECT GEOMETRY::STGeomFromText(@g1,4269).STBuffer(.00001)

Here is how this looks in Management Studio:

image

So the LINESTRING is a "spike". The "line" in the brown buffer region is actually an extremely thin extension of the base polygon. It turns out that the rest of the LINESTRINGs are also spikes.  The last geometry (ID = 4328) was likewise festooned with spikes.  A quick recap on why spikes cause validation issues, from above: "...during current validation processing some very close vertices can shift slightly to cause edges to cross/overlap".

To create the final geometry objects, ready for conversion to geography, simply extract the POLYGON WKT from the GEOMETRYCOLLECTIONS returned by the Reduce-based processing and create geometry instances which can then be used to update the original geometry after conversion by MakeValidGeographyFromText (note the use this new function):

DECLARE @geom VARCHAR(MAX) = POLYGON ((-86.724177449941635 36.164932250976562, -86.72216959297657 36.165002837777138, ..., -86.725447744131088 36.164976820349693, -86.724177449941635 36.164932250976562))'

UPDATE ZillowNeighborhoods
SET GEOG = dbo.MakeValidGeographyFromText(@geom,4269)
WHERE ID = 5224

Let's fix the last neighborhood polygon, ID=4328.  A quick check shows that the core neighborhood polygon is actually a MULTIPOLYGON:

SELECT * FROM ZillowNeighborhoods WHERe ID = 4328

image

A quick check of the structure of the MULTIPOLYGON with the following query...

SELECT dbo.MakeValidGeographyFromGeometry(GEOM.Reduce(.00000001)).STAsText()
FROM ZillowNeighborhoods
WHERE ID = 4328

...reveals that there are 4 polygons (expected) and 1 line (a "spike") in the GEOMETRYCOLLECTION.  I removed the LINESTRING from the WKT and, using the following query, updated the row:

DECLARE @geom VARCHAR(MAX) = 'GEOMETRYCOLLECTION (POLYGON ((-73.8411675542593 40.795716315507896, ..., -73.8411675542593 40.795716315507896)), POLYGON ((-73.853304713964462 40.7884576357901, ..., -73.853304713964462 40.7884576357901)), POLYGON ((-73.85657873749733 40.78686719387769, ..., -73.85657873749733 40.78686719387769)), POLYGON ((-73.85907343775034 40.785239797085531, ..., -73.85907343775034 40.785239797085531)))'
DECLARE @geog GEOGRAPHY
SELECT @geog = dbo.MakeValidGeographyFromText(@geom,4269)
UPDATE ZillowNeighborhoods
SET GEOG = @geog
WHERE ID = 4328

This completes the exercise, updating the ZillowNeighborhoods table with a column of type geography.

Technorati Tags: SQL Server,2008,Spatial,MakeValidGeographyFromGeometry,MakeValidGeographyFromText,IsValidGeographyFromGeometry,Shape2SQL,SharpGIS,CodePlex,Geography,Validation,Zillow,Neighborhoods,Safe Software,Feature Manipulation Engine,FME,Universal Viewer