Conor vs. Data Warehouses (an introduction)

Article
01/29/2010

I received this question from an internal (as in within Microsoft but not in the SQL Server team) user of SQL Server. The basic question has to do with how you set up a data warehouse and whether one should or should not create foreign keys in a data warehouse.

OK, so what’s a data warehouse? It’s a database that has a particular schema layout/pattern and a particular query pattern. I am making up this example, but let’s say that I want to track data about every newspaper issue article ever written.

Paper	Date	Editor	Category	Pages
The Daily Prophet	Jan 25 2010	Bob Jones	News	23
Dallas Cowboys Weekly	Jan 25 2010	Conor Cuningham	Sports	10
Bug Collectors Tribune	Jan 25 2010	Sarah Smith	Reference	5
The Daily Prophet	Jan 26 2010	Bob Jones	News	21

(I will apologize for the names and such, but it’s Friday and my brain does what it wants sometimes :))

With 3 or 4 rows, you can store this data in a text file and stop reading now. However, you can also put it in a database and forget about it as well – as long as the data is small, you are optimizing for your coding time instead of the specific performance of your queries.

However, Let’s say that you wanted to store LOTS of rows… For example, a few billion rows. All of a sudden, you really need to think about things because the storage for these fields can really start to add up quickly. If I have a row for every issue of a newspaper, each row could take hundreds of bytes. All of a sudden I am paying a lot of money for storage space. So, a data warehouse deals with this by normalizing pretty much everything it can from this table. So, instead of storing strings for each row, you create additional tables and have ID fields to link them all together. So, you have a new table like this:

PaperID	Paper
1	The Daily Prophet
2	Dallas Cowboys Weekly
3	Bug Collectors Tribune

(We repeat this for all data of any size in this table)

Your original table now looks like:

PaperID	Date	EditorID	CategoryID	Pages
1	Jan 25 2010	1	1	23
2	Jan 25 2010	2	2	10
3	Jan 25 2010	3	3	5
1	Jan 26 2010	1	1	21

This representation has far less data per row. So, if I want to have a few billion rows, my storage costs have been reduced because I’ve avoided unnecessary data duplication. The main table is called the “fact” table (each row contains a one or more facts, such as the number of pages), and the other tables are called “dimension” tables. This is a standard star schema (there are more complex warehouse schemas called snowflake schemas).

OK, now I can answer the original question. The connection between the fact and dimension tables are classic foreign key relationships. So, you can create a foreign key and get the database system to enforce the relationship for you. (The system can also leverage that information to sometimes better optimize queries, as I have described in an earlier post on foreign key join elimination). Now, why would a customer want to get rid of the foreign keys? Well, in SQL Server they are implemented using indexes. Indexes take space. If you build your data warehouse properly, you can guarantee that the relationship between the two tables is correct yourself (especially if the data is read-only, which many data warehouses can be). So, given that each index takes space, you might want to save some money on disks by just not defining those foreign keys. The queries still work, right? So, index space savings is the primary reason.

The particular optimization I referenced in the prior post dealt with existence checks against the foreign key – ie the system does the join to make sure that there is a valid row on the other side (example: when doing an insert to the fact table). This isn’t really the primary purpose of a data warehouse once it is built, so the specific optimization may not be that important to you. SQL Server does contain logic to understand that the rows in the fact table generally reference rows in the dimension table. Additionally, your average warehouse query is an aggregate over the fact table with filters on the dimension tables:

SELECT SUM(pages) FROM Fact INNER JOIN PaperDimension ON (Fact.PaperID = PaperDimension.PaperID) WHERE Paper = ‘The Daily Prophet’;

This kind of pattern can’t eliminate the join.

The FK is still very useful to the system when optimizing, but I hope this gives you insight into why a warehouse might choose to do without its benefits :)

Happy Querying!

Conor Cunningham

Conor vs. Data Warehouses (an introduction)

Additional resources