Conor vs. Data Warehouses (an introduction)

I received this question from an internal (as in within Microsoft but not in the SQL Server team) user of SQL Server.  The basic question has to do with how you set up a data warehouse and whether or not you should create foreign keys in one.

OK, so what’s a data warehouse?  It’s a database that has a particular schema layout/pattern and a particular query pattern.  I am making up this example, but let’s say that I want to track data about every newspaper issue ever published.

 

Paper                    Date          Editor             Category    Pages
The Daily Prophet        Jan 25 2010   Bob Jones          News        23
Dallas Cowboys Weekly    Jan 25 2010   Conor Cunningham   Sports      10
Bug Collectors Tribune   Jan 25 2010   Sarah Smith        Reference   5
The Daily Prophet        Jan 26 2010   Bob Jones          News        21

 

(I will apologize for the names and such, but it’s Friday and my brain does what it wants sometimes :))

 

With 3 or 4 rows, you can store this data in a text file and stop reading now.  You could also put it in a database and not think about it any further: as long as the data is small, you are optimizing for your coding time rather than for the performance of your queries.

However, let’s say that you wanted to store LOTS of rows, for example a few billion.  Now you really need to think about things, because the storage for these fields adds up quickly.  If I have a row for every issue of a newspaper, each row could take hundreds of bytes, and I end up paying a lot of money for storage space.  A data warehouse deals with this by normalizing pretty much everything it can out of this table.  Instead of storing the strings in each row, you create additional tables and use ID fields to link them all together.  So, you have a new table like this:

PaperID   Paper
1         The Daily Prophet
2         Dallas Cowboys Weekly
3         Bug Collectors Tribune

(We repeat this for every column in this table that holds data of any size, such as Editor and Category)

Your original table now looks like:

PaperID   Date          EditorID   CategoryID   Pages
1         Jan 25 2010   1          1            23
2         Jan 25 2010   2          2            10
3         Jan 25 2010   3          3            5
1         Jan 26 2010   1          1            21

This representation has far less data per row.  So, if I want to have a few billion rows, my storage costs have been reduced because I’ve avoided unnecessary data duplication.  The main table is called the “fact” table (each row contains one or more facts, such as the number of pages), and the other tables are called “dimension” tables.  This layout is a standard star schema (more complex warehouse layouts, called snowflake schemas, also exist).
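
As a rough sketch, the tables above could be declared along these lines in T-SQL.  The names Fact and PaperDimension match the query later in this post; EditorDimension, CategoryDimension, the IssueDate column name, and all of the column types are assumptions made for illustration only.

```sql
-- Dimension tables: one row per distinct value, keyed by a small integer ID.
CREATE TABLE PaperDimension (
    PaperID INT           NOT NULL PRIMARY KEY,
    Paper   NVARCHAR(100) NOT NULL
);

CREATE TABLE EditorDimension (      -- hypothetical table name
    EditorID INT           NOT NULL PRIMARY KEY,
    Editor   NVARCHAR(100) NOT NULL
);

CREATE TABLE CategoryDimension (    -- hypothetical table name
    CategoryID INT          NOT NULL PRIMARY KEY,
    Category   NVARCHAR(50) NOT NULL
);

-- Fact table: only IDs, the issue date, and the measure (Pages).
CREATE TABLE Fact (
    PaperID    INT  NOT NULL,
    IssueDate  DATE NOT NULL,       -- the "Date" column; renamed here (assumption)
    EditorID   INT  NOT NULL,
    CategoryID INT  NOT NULL,
    Pages      INT  NOT NULL
);

-- Sample rows taken from the tables in this post.
INSERT INTO PaperDimension (PaperID, Paper) VALUES
    (1, N'The Daily Prophet'),
    (2, N'Dallas Cowboys Weekly'),
    (3, N'Bug Collectors Tribune');

INSERT INTO Fact (PaperID, IssueDate, EditorID, CategoryID, Pages) VALUES
    (1, '20100125', 1, 1, 23),
    (2, '20100125', 2, 2, 10),
    (3, '20100125', 3, 3, 5),
    (1, '20100126', 1, 1, 21);
```

Each fact row now carries a few integers and a date instead of three strings, which is where the per-row savings come from.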

OK, now I can answer the original question.  The connections between the fact and dimension tables are classic foreign key relationships.  So, you can create a foreign key and get the database system to enforce the relationship for you.  (The system can also leverage that information to sometimes better optimize queries, as I have described in an earlier post on foreign key join elimination).  Now, why would a customer want to get rid of the foreign keys?  Well, in SQL Server they rely on indexes: the referenced column must be backed by a primary key or unique index, and the referencing column in the fact table usually needs an index of its own if the constraint is to be checked efficiently.  Indexes take space.  If you build your data warehouse properly, you can guarantee that the relationship between the two tables is correct yourself (especially if the data is read-only, which many data warehouses can be).  So, given that each index takes space, you might want to save some money on disks by just not defining those foreign keys.  The queries still work, right?  So, index space savings is the primary reason.
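
To make the two options concrete, here is a minimal sketch.  It assumes a Fact table with a PaperID column and a PaperDimension table whose PaperID is its primary key; the constraint name FK_Fact_Paper is made up for this example.

```sql
-- Option 1: declare the foreign key and let SQL Server enforce it.
-- FK_Fact_Paper is a hypothetical constraint name.
ALTER TABLE Fact
    ADD CONSTRAINT FK_Fact_Paper
    FOREIGN KEY (PaperID) REFERENCES PaperDimension (PaperID);

-- Option 2: skip the constraint and validate the data yourself after a load.
-- Returns one row per PaperID in Fact that has no match in PaperDimension;
-- an empty result means the relationship holds.
SELECT f.PaperID, COUNT(*) AS OrphanRows
FROM Fact AS f
LEFT JOIN PaperDimension AS p
    ON p.PaperID = f.PaperID
WHERE p.PaperID IS NULL
GROUP BY f.PaperID;
```

With option 2 the responsibility for correctness moves into your load process, which is easiest to justify when the warehouse is read-only between loads.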

The particular optimization I referenced in the prior post dealt with existence checks against the foreign key, i.e. cases where the system does the join only to make sure that there is a valid row on the other side (example: when doing an insert to the fact table).  Such checks aren’t really the primary workload of a data warehouse once it is built, so the specific optimization may not be that important to you.  SQL Server also contains logic to understand that the rows in the fact table generally reference rows in the dimension table.  Additionally, your average warehouse query is an aggregate over the fact table with filters on the dimension tables:

 

SELECT SUM(Fact.Pages) FROM Fact INNER JOIN PaperDimension ON (Fact.PaperID = PaperDimension.PaperID) WHERE PaperDimension.Paper = 'The Daily Prophet';

The join in this kind of query can’t be eliminated, because the filter needs the Paper column, which only exists in the dimension table.
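
For contrast, here is a sketch of a query shape where the join may be removed.  It assumes a trusted foreign key from Fact.PaperID to PaperDimension.PaperID and that Fact.PaperID is declared NOT NULL.  Whether the optimizer actually drops the join depends on your SQL Server version and on the constraint being trusted, so check the query plan rather than taking this on faith.

```sql
-- No PaperDimension column is used outside the join condition, so the join
-- only verifies that a matching dimension row exists. A trusted foreign key
-- already guarantees that, which is what makes the join a candidate for removal.
SELECT SUM(f.Pages) AS TotalPages
FROM Fact AS f
INNER JOIN PaperDimension AS p
    ON f.PaperID = p.PaperID;

-- List the foreign keys defined on Fact and whether each one is trusted
-- (is_not_trusted = 0 means trusted).
SELECT name, is_not_trusted
FROM sys.foreign_keys
WHERE parent_object_id = OBJECT_ID(N'Fact');
```

Without the foreign key, the first query still returns the right answer as long as your data is clean; the system just has to perform the join.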

The FK is still very useful to the system when optimizing, but I hope this gives you insight into why a warehouse might choose to do without its benefits :)

 

Happy Querying!

Conor Cunningham