SQL Spatial - Tips on Storage

This post covers ways to store Spatial Reference data in SQL Server in order to improve the performance of your map displays.

Overview: There are two broad categories of Spatial Data.

  1. Spatial Reference data, the "stuff" you draw maps with; Geo-Political Boundaries (States, Postcodes), Rivers, Roads etc. It tends to be changed infrequently & typically refreshed via a batch process.
  2. Spatial Attributes. These are properties that help describe some other data, eg: the Location column in a Customer Address record, or the location of a Service Van at a point in time. These records are quite volatile.

We will only look at the first group "Spatial Reference data". The second group is just standard database design & tuning. 

Warning: The following section outlines some changes you can make to your database schema to improve performance. Perhaps you don't need to. As with all SQL performance tuning, it is worth testing to get a baseline, and then testing for benefit, before putting hints in your code or racing off to create complexity for your operations staff.

 

Storing your Spatial Reference Data in SQL

1. Consider using a Read-Only filegroup

If you have a lot of data & it is largely static, this can simplify maintenance operations of your database. After you've loaded the data & backed it up once, it doesn't have to be backed up again until you change it. This reduces the time it takes to run your "normal" backups & also reduces their file size. Restore operations are faster as SQL "knows" that the pages haven't been modified.

This is invisible to your application; none of your queries need to change. (Unless they are trying to modify the table, in which case this is unsuitable.)

Typically you would update the tables on a READ-ONLY filegroup via a batch job. The batch would: (i) swap the filegroup to READ_WRITE, (ii) change the data, (iii) swap back to READ_ONLY & then (iv) run a BACKUP.
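A minimal sketch of that batch, assuming the SpatialDB database & fgSpatialRO filegroup created in the sample further below (the refresh statement & backup path are just placeholders):

 -- ==< Sketch: Batch job to refresh data on a READ-ONLY filegroup >==
-- (i) Swap the filegroup to READ_WRITE
ALTER DATABASE SpatialDB MODIFY FILEGROUP fgSpatialRO READ_WRITE
go
-- (ii) Change the data (your refresh logic goes here)
UPDATE dbo.Roads SET Geom = 2 WHERE ID = 42
go
-- (iii) Swap back to READ_ONLY
ALTER DATABASE SpatialDB MODIFY FILEGROUP fgSpatialRO READ_ONLY
go
-- (iv) Back it up. A filegroup backup is enough, as nothing else changed
BACKUP DATABASE SpatialDB FILEGROUP = 'fgSpatialRO' TO DISK = 'c:\temp\SpatialDB_fgSpatialRO.bak'
go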

Alternatively, you could partition your table over multiple filegroups. New or changing rows would be inserted into the partition on a READ-WRITE filegroup, & the historical records would remain unchanged in the partition residing on the Read-Only filegroup.
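A sketch of the partitioning approach, assuming a hypothetical fgSpatialRW filegroup (with its own file) alongside fgSpatialRO, & a hypothetical EditDate column used to split current rows from historical ones:

 -- ==< Sketch: Partition a table across READ-WRITE & READ-ONLY filegroups >==
-- Create this while both filegroups are still READ_WRITE, then mark fgSpatialRO READ_ONLY
CREATE PARTITION FUNCTION pfRoadsByDate (DATETIME)
    AS RANGE RIGHT FOR VALUES ('20120101')   -- rows on/after this date are "current"
go
CREATE PARTITION SCHEME psRoadsByDate
    AS PARTITION pfRoadsByDate TO (fgSpatialRO, fgSpatialRW)   -- historical -> RO, current -> RW
go
CREATE TABLE dbo.RoadsPartitioned (
   ID INT IDENTITY(1,1),
   EditDate DATETIME NOT NULL,
   Geom INT  -- Normally this would be a geography type
) ON psRoadsByDate (EditDate)
go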

 -- ==< Sample code to Manipulate a READ-ONLY Filegroup >==
-- This assumes you've created a database called SpatialDB --
USE SpatialDB
go

-- < Add a File & FileGroup to the database >--
ALTER DATABASE SpatialDB ADD FILEGROUP fgSpatialRO
go

ALTER DATABASE SpatialDB 
              ADD FILE (NAME = spatialfile1, FILENAME = 'c:\temp\spatialfile1')
              TO FILEGROUP fgSpatialRO
go
-- < Create a table and associate it to a filegroup >--
CREATE TABLE dbo.Roads (
   ID INT PRIMARY KEY IDENTITY(1,1),
   Geom INT -- Normally this would be a geography type
) ON fgSpatialRO
go

-- Insert 13,000 rows with different identity values
INSERT INTO dbo.Roads (Geom) VALUES (1)
go 13000  

-- ==< Mark the filegroup read-only >==
ALTER DATABASE SpatialDB MODIFY FILEGROUP fgSpatialRO READ_ONLY --  swap back with READ_WRITE
go

-- --< Backup the Database >--
-- BACKUP DATABASE SpatialDB .....

Note: Shared locks are taken when querying a table located on a Read-Only filegroup. It is a common misconception that they are not.

Bonus code. If you'd like to compare the locks taken when a filegroup is RO vs RW, or when a database is RO vs RW, use this code.

 -- run a transaction with repeatable read isolation
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ
BEGIN TRAN
SELECT * FROM dbo.Roads WHERE (ID % 5 = 0) AND ID BETWEEN 7000 AND 9000

-- Look at the locks that are being held
EXEC sp_lock @@spid

-- Free the shared locks
ROLLBACK TRAN
--COMMIT TRANSACTION

2. Consider using a Read-Only database

In addition to the benefits offered by a RO filegroup, Read-Only databases offer a performance benefit: when you query them almost no locks are taken. This can free memory to be used for other purposes, perhaps to hold more data in RAM. Of course, the transaction log never grows, & recovery is faster for the same reasons as read-only filegroups.

The disadvantage is the potential to disrupt user processing when swapping a database from Read-Only to Read-Write or vice versa. Sure, it is only 1 command which runs in milliseconds, but all connections to that database will need to be dropped. This issue can be minimised by being intelligent with your connections. eg:

    1. "Connect to" or "Use" another database, perhaps the one you are using for other data & then refer to the ReadOnly spatial database using a 3 part naming scheme.

      eg: SELECT ... FROM ROSpatialDB.Zoom3.Suburbs

    2. Don't hold a connection open if you aren't using it. Have an error handler that reconnects & retries the query if you lose the server.

    3. Block out a time when the database will be unavailable due to maintenance. Increasingly companies are demanding 24*7, but maybe few people are looking at maps at 2am on a Sunday morning, so they will tolerate needing to reconnect if you accidentally disconnect them.


-- ==< Sample: How to make a database Read Only >==
-- Ensure you aren't using the Database you're changing or you need to reconnect.
USE master
go
-- Change the database Option
ALTER DATABASE SpatialDB SET READ_ONLY;
--ALTER DATABASE SpatialDB SET READ_WRITE;
go
-- Go back to your "original" db
USE SpatialDB
go
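If other connections are holding the database open, the ALTER will wait. One option, to be tested carefully before using in production, is to force those sessions off as part of the switch. You can also check the current state in sys.databases:

 -- ==< Sketch: Force the switch & check the result >==
-- ROLLBACK IMMEDIATE kicks out other sessions & rolls back their open transactions
ALTER DATABASE SpatialDB SET READ_ONLY WITH ROLLBACK IMMEDIATE;
go
-- is_read_only: 1 = read-only, 0 = read-write
SELECT name, is_read_only FROM sys.databases WHERE name = 'SpatialDB'
go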

3. Can't I just get better concurrency & lower memory overhead by using a "With NOLOCK" hint?

Maybe, but just like the "Cut off your legs to lose weight" diet, you really want to think it through. The links below may help your research. But the real rule is: check whether your schema & queries are optimal first, then try it as a last resort. Test to see if it makes a difference & if you go with it, document everything; why, the version & build you ran your tests on, etc.

Previously committed rows might be missed if NOLOCK hint is used

Timebomb - The Consistency problem with NOLOCK / READ UNCOMMITTED

Using NOLOCK and READPAST table hints in SQL Server
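For reference, the hint itself is trivial to apply; the hard part is proving it is safe for your data. A minimal example against the dbo.Roads sample table above, plus the row-versioning alternative that is often a better fit:

 -- ==< Sketch: NOLOCK & an alternative >==
SELECT ID, Geom
FROM dbo.Roads WITH (NOLOCK)   -- may return uncommitted rows, or miss/duplicate committed ones
WHERE ID BETWEEN 7000 AND 9000
go
-- Often a better option: row versioning at the database level (needs exclusive access to switch on)
-- ALTER DATABASE SpatialDB SET READ_COMMITTED_SNAPSHOT ON;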

 

Common Spatial Data Store Designs (Background on optimising your table schema)

Can I get a better user experience if I pre-compute objects, remove some of the detail & put them in lower-detail tables? Often the answer is yes. The following may give you some ideas.

Option 1. Split your data into Layers

Option 1a: Create a Table for each layer

Each layer of your map; Roads, Rivers, Features, gets its own table. eg: WA.Railways, WA.MainRoads, WA.Suburbs, WA.Rivers.

This is the most common approach. Possibly because it was the easiest to create 30 years ago. Possibly because that is how GIS data vendors sell their data & no-one spent a lot of time thinking about alternatives. Possibly because it is the easiest way to offer users display flexibility: as users check the option to display/hide a layer, ie: Railways, it is easy to query the relevant table & display it.

An advantage of this is that all the spatial data in a table is of the same dimensionality (ie: Railways are typically LineStrings, Countries are Polygons), so this may let you optimise your code.
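A sketch of one layer table, using hypothetical names; the geography column holds LineStrings & gets a spatial index with default grid settings:

 -- ==< Sketch: One table per layer >==
CREATE TABLE dbo.Railways (
   ID INT PRIMARY KEY IDENTITY(1,1),
   RailwayName NVARCHAR(100),
   Geom GEOGRAPHY   -- every row in this layer is a LineString
)
go
CREATE SPATIAL INDEX six_Railways_Geom ON dbo.Railways (Geom)
go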

Option 1b: Create a Schema for each major region

If you have huge amounts of detail & each business unit is only focused on a limited geographic area (eg: a State Manager), it may be worthwhile to create a schema for each major region, then create tables for each layer within each schema. eg: Create a schema for each state, containing all the roads, land etc within that state. The Roads table would be split into: NSW.Roads, VIC.Roads, WA.Roads ...
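A sketch, with hypothetical schema & column names; each region gets its own schema containing the same table names:

 -- ==< Sketch: One schema per region >==
CREATE SCHEMA NSW AUTHORIZATION dbo
go
CREATE SCHEMA VIC AUTHORIZATION dbo
go
CREATE TABLE NSW.Roads (ID INT PRIMARY KEY IDENTITY(1,1), RoadName NVARCHAR(100), Geom GEOGRAPHY)
go
CREATE TABLE VIC.Roads (ID INT PRIMARY KEY IDENTITY(1,1), RoadName NVARCHAR(100), Geom GEOGRAPHY)
go
-- A state manager's queries only ever touch "their" schema
SELECT COUNT(*) AS RoadCount FROM NSW.Roads
go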

Option 1c: Create a Schema for each "Zoom" level

Storing multiple versions of your map data, each with a reduced level of detail, can significantly reduce the load on your system, both for data retrieval & also for client rendering. eg: zoom1.Roads may only contain the major highways that stretch across the state, zoom3.Roads might be the major roads in each city, zoom5.Roads might show all roads but only as lines. Perhaps you might go all the way in to "Zoom9" before you display all the exit ramps, turning lanes & minute details.

Pre-computing your reference data & storing the same tablename with a different schema makes writing your application simpler when needing to zoom in. Changing the schema name from say, 1 to 9, is cleaner than using conditional code to change table names ie: Highways, MajorRoads, Roads, RoadsWithDetails.

It also makes it simpler to refresh data as it is updated. You could just truncate all tables of that feature & regenerate them all with the new data. NB: You may not need to regenerate ALL zoom levels; it is rare that changes are so significant they are visible beyond the "City" level.
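A sketch of regenerating one zoom level from the full-detail layer, assuming hypothetical zoom1 & zoom9 schemas, a RoadClass column, & a guessed Reduce() tolerance:

 -- ==< Sketch: Regenerate a low-detail zoom level >==
TRUNCATE TABLE zoom1.Roads
go
INSERT INTO zoom1.Roads (RoadName, Geom)
SELECT RoadName,
       Geom.Reduce(500)          -- tolerance is a guess; pick one that suits the zoom level
FROM   zoom9.Roads
WHERE  RoadClass = 'Highway'     -- only major roads survive at this zoom
go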

Option 1d: Create an extra column for each "Zoom" level

This option doesn't have the benefit of reducing the number of rows returned, & it has the disadvantage of creating really long rows. But it makes sense in situations where your table doesn't have many rows but contains very detailed polygons & lots of other supporting columns. This is typical of thematic maps which mainly look at the same regions coloured with different data. Sometimes you view a dashboard showing N little copies (ie: Violent Crimes, Minor Crimes, etc); occasionally you expand one map to the entire screen.

The advantage is it saves you from maintaining a set of Zoom Tables (like Option 1c).
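A sketch of the wide-row shape, with hypothetical names; each row carries the full-detail shape plus pre-reduced copies:

 -- ==< Sketch: One geography column per zoom level >==
CREATE TABLE dbo.Suburbs (
   ID INT PRIMARY KEY IDENTITY(1,1),
   SuburbName NVARCHAR(100),
   GeomFull  GEOGRAPHY,   -- full detail
   GeomZoom5 GEOGRAPHY,   -- pre-computed, eg: GeomFull.Reduce(100)
   GeomZoom1 GEOGRAPHY    -- pre-computed, eg: GeomFull.Reduce(1000)
)
go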

Option 1e: Add a "ZoomLevel" column & duplicate each row.

If your rows don't have a lot of other columns, or you can normalize your design to put the associated data in a separate table, it is probably easier to add a "ZoomLevel" column & duplicate each row, pre-computed to a lower resolution.
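A sketch with a hypothetical composite key of (RiverID, ZoomLevel); the application simply filters on the level it needs:

 -- ==< Sketch: ZoomLevel column, one row per shape per level >==
CREATE TABLE dbo.Rivers (
   RiverID   INT NOT NULL,
   ZoomLevel TINYINT NOT NULL,   -- 1 = least detail ... 9 = full detail
   RiverName NVARCHAR(100),
   Geom GEOGRAPHY,
   PRIMARY KEY (RiverID, ZoomLevel)
)
go
SELECT RiverName, Geom FROM dbo.Rivers WHERE ZoomLevel = 3
go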

Option 1f: Add a "Reduce by value" column & compute on the fly.

Adding a column containing a default or suggested "Reduce by" value to pass to a .Reduce() method saves the overhead of calculating an appropriate value. This may let you define different values for different rows, saving disk space. A variation of this approach is to create a "ZoomLevel" table with the appropriate value for each "level" you offer in your app. Of course, computing this at query time will hit your CPU harder than if you pre-Reduce the row.
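A sketch of the stored suggestion being applied at query time (table & column names are hypothetical):

 -- ==< Sketch: Store a suggested Reduce tolerance per row >==
CREATE TABLE dbo.Coastline (
   ID INT PRIMARY KEY IDENTITY(1,1),
   ReduceBy FLOAT NOT NULL,   -- suggested tolerance for this shape
   Geom GEOGRAPHY
)
go
-- Simplify on the fly: cheaper on disk, dearer on CPU
SELECT ID, Geom.Reduce(ReduceBy) AS GeomSimplified
FROM   dbo.Coastline
go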

Option 1g: Add an "Envelope" column to each object.

Especially for Geography types, you may find it useful to store the extreme edges of your shape: most-left Long, most-right Long, lowest Lat, highest Lat. These 4 values can be stored as 4 float columns or combined into one geography column containing a 2 point LineString (Lower Left, Upper Right). This persists the STEnvelope values to help quickly calculate an optimal value to Reduce by. Handy if your app offers the users a Pan & Zoom capability. Less useful where a report is just displaying a static, thematic map.
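A sketch of persisting the extremes as 4 float columns on the hypothetical dbo.Suburbs table from Option 1d. Geography instances don't expose STEnvelope() directly, so this round-trips through the geometry type; treat it purely as an illustration:

 -- ==< Sketch: Persist the bounding box of each shape >==
ALTER TABLE dbo.Suburbs ADD MinLong FLOAT, MinLat FLOAT, MaxLong FLOAT, MaxLat FLOAT
go
-- STEnvelope() returns a rectangle; point 1 is its lower-left corner, point 3 its upper-right
UPDATE dbo.Suburbs
SET MinLong = geometry::STGeomFromWKB(GeomFull.STAsBinary(), GeomFull.STSrid).STEnvelope().STPointN(1).STX,
    MinLat  = geometry::STGeomFromWKB(GeomFull.STAsBinary(), GeomFull.STSrid).STEnvelope().STPointN(1).STY,
    MaxLong = geometry::STGeomFromWKB(GeomFull.STAsBinary(), GeomFull.STSrid).STEnvelope().STPointN(3).STX,
    MaxLat  = geometry::STGeomFromWKB(GeomFull.STAsBinary(), GeomFull.STSrid).STEnvelope().STPointN(3).STY
go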

 

Option 2. Split your data into Geographic regions

 

Option 2a: Create Tiles.

Divide your world up into little sections. Possibly based on a functional unit of analysis (ie: Postcode or Suburb boundaries) OR possibly squares of a certain size, (NB: This is how Virtual Earth / Google Maps do it).

Typically you would pre-compute entire "sections" of a map, perhaps storing them as GeometryCollections.

Advantage

Very efficient for panning around a map. You can improve speed by pre-fetching just the non-visible area around the edge of the map. And it is easy to filter out details not viewed.
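A sketch of a tile table keyed by zoom level & tile coordinates (the key scheme & names are hypothetical; the Shapes column would hold pre-computed GeometryCollections):

 -- ==< Sketch: Pre-computed tiles >==
CREATE TABLE dbo.MapTiles (
   ZoomLevel TINYINT NOT NULL,
   TileX     INT NOT NULL,
   TileY     INT NOT NULL,
   Shapes    GEOMETRY,   -- a pre-computed GeometryCollection for this tile
   PRIMARY KEY (ZoomLevel, TileX, TileY)
)
go
-- Fetch the visible tiles plus a 1-tile border, so panning feels instant
DECLARE @MinX INT = 10, @MaxX INT = 13, @MinY INT = 20, @MaxY INT = 22   -- the viewport, in tile coordinates
SELECT Shapes
FROM   dbo.MapTiles
WHERE  ZoomLevel = 5
  AND  TileX BETWEEN @MinX - 1 AND @MaxX + 1
  AND  TileY BETWEEN @MinY - 1 AND @MaxY + 1
go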

Option 3. Do both

Create Layered Tiles, & maybe create a Schema for each Zoom Level.

Final Thought

Source: my young son. Sure it is lame, but if you know some pre-teens & tell it right, it cracks them up.

Two cupcakes are sitting in an oven.

One says to the other "It's really hot in here"

The other says "Whoa, A talking cupcake!"

Thanks

Please give feedback. Is this series of articles useful? Did it save you time? What was good? What could be better? Did you notice any errors? What would you like me to cover? All thoughts, comments, suggestions welcome.

Technorati Tags: SQL Server,SQL Server Spatial,Spatial,TSQL