Surrogate keys

After you have been building databases for a while, surrogate keys become second nature but many people do not know about them so I thought I should devote some time to them. In addition, a constant challenge is choosing between using an int or a uniqueidentifier (i.e. GUID) as a surrogate key and I thought this would be a great forum to get some feedback on when you should use one or the other.

Surrogate keys

One of the more useful database design strategies is to use surrogate keys. The idea is that instead of using, say, a social security number as an ID for people you use a completely artificial key (usually an int , bigint or a uniqueidentifier) called a surrogate key. In the database application, you may display and search on the natural key (social security number in this case) but all joins use the surrogate key.

There are a number of reasons for using surrogate keys but the main reason is to deal with change. Natural keys like social security numbers often change. If you do not plan carefully you may assume that, say, a person’s social security number will never change. Then reality sets in—a person’s social security number can change. Suddenly, you have to propagate that change to everywhere that you refer to a person. What happens when we run out of social security numbers and need to change the format? Suddenly, you have to convert a large amount of data and resize many rows. What happens when your company expands overseas where they have their own national ID system? Suddenly you have to rewrite a lot of code. However, if you had used a surrogate key instead then each of these scenarios would be much easier to deal with.

The other common reasons for using surrogate keys are to do with efficiency. In general, surrogate keys are significantly smaller than natural keys. Thus table rows and indexes will be smaller and comparisons quicker (although bookmarks complicate the picture). Since more data is in each disk page, table scans, index lookups and index scans will all go faster. Similarly, surrogate keys replicate faster and reduce backup times.

However, there are disadvantages to surrogate keys, which some people think outweigh their advantages. Surrogate keys can lead to more joins since a join is required back to a master table to get the natural key for a report. Surrogate keys make ad hoc queries harder to write since more joins are required and users must have a deeper understanding of the data. In addition, surrogate keys require the use of an extra uniqueness constraint instead of just the primary key constraint with an associated loss of efficiency.

I tend to favor the surrogate key approach unless there is a natural key that I think will never change and the key is a reasonable size. However, even in the space of representing detailed information about compiled programs, good natural keys are rare.

Should a surrogate key be an int or uniqueidentifier?

I still struggle with this issue. In a naïve way, an int or a bigint is the obvious choice at 4 or 8 bytes instead of 16 bytes with some nice support for table partitioning in SQL Server 2005. Also, integer based keys can be added sequentially and table appends are typically faster than table inserts. However, replication and database merging complicate the issue.

If there is no row GUID then merge replication will add one. Thus, if you are going to pay the cost anyway it is tempting to use a uniqueidentifier as the key. The disadvantage is that the row size of every table the ID occurs in will increase.

Related to the merge replication issue is the issue of generating keys in the middle or client tier. This is easier with a uniqueidentifier based key than an integer based key. There are numerous strategies for assigning integer key ranges but they complicate the logic and make the system less robust.

I tend to favor using bigint based keys unless I am sure the number of rows will be low. Although it hurts every time I do it. The databases I am working on may contain an unusually large number of rows so int based columns should be sufficient for most people.

Your thoughts please

Does all this match up with your experience? I would be interested in hearing people’s views on surrogate keys and in particular, the int or uniqueidentifier issue.

This posting is provided "AS IS" with no warranties, and confers no rights.