Artificial Keys

How big?

I’ve started noticing a lot of DBs (sometimes turning up in embarrassing places) that neglect some pretty easy scaling techniques. An often-overlooked consideration is designing a table to support efficient indexing. Unless there’s a really good overriding argument, I default to creating an artificial key of the smallest numeric type that gives me plenty of room for growth. That usually means an integer (32 bits on most architectures), but sometimes a long (64 bits on most architectures), set to be an identity. This artificial key is then almost always a great choice for the clustered index.

Some out there are grumbling, “but real users almost never query directly on an artificial key – why waste your one and only clustered index and miss out on all the sequential goodness?” In a lot of cases, though, that consideration is overridden by the fact that your nonclustered indexes use your clustered index key as the pointer back to the row. The smaller your clustered index key, the smaller your nonclustered index entries. The smaller your nonclustered index entries, the more of them fit in a page, and the fewer I/O operations the DB has to perform on an index lookup. Since I/O is often the bottleneck for SQL operations, especially large ones, this can have a major impact on performance.

Besides, since SQL Server 2005 we’ve had another arrow in our quiver: physical partitioning. Consider using your partition function to maximize physical disk proximity of your records while keeping a small artificial key as your clustered index, and get the best of both worlds!

Be warned: if I see your database using a string or GUID in a clustered index, expect questions. If you’re clustering on a string with low cardinality (few distinct values), consider creating a numeric code column populated by a lookup table or hash function. If it’s a GUID, sit back and think about whether you really need a GUID (say, for a complex multi-node replication architecture), or whether you’re just accommodating a lazy programmer. As for scaling, it would take a lot of effort to convince me a long won’t scale high enough; 2^64 records should be enough for most any application.
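To make that concrete, here’s a minimal sketch of the pattern against a hypothetical dbo.Orders table (the table and column names are purely illustrative):

CREATE TABLE dbo.Orders
(
    OrderId     int IDENTITY(1,1) NOT NULL,  -- small artificial key with plenty of room to grow
    CustomerId  int               NOT NULL,
    OrderDate   datetime          NOT NULL,
    Amount      decimal(18,2)     NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderId)  -- 4-byte clustering key
);

-- Each entry in this nonclustered index carries the 4-byte OrderId as its row locator.
-- Cluster on a 16-byte GUID instead and every entry in every nonclustered index grows by 12 bytes.
CREATE NONCLUSTERED INDEX IX_Orders_OrderDate
    ON dbo.Orders (OrderDate);

-- For a low-cardinality string you might otherwise be tempted to cluster on,
-- a tiny numeric code backed by a lookup table does the job.
CREATE TABLE dbo.OrderStatus
(
    OrderStatusId tinyint     NOT NULL,
    StatusName    varchar(50) NOT NULL,
    CONSTRAINT PK_OrderStatus PRIMARY KEY CLUSTERED (OrderStatusId)
);

Nothing fancy, but it keeps every secondary index lean.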

How small?

My elders would castigate me for the wasted bytes, but I rarely go smaller than an integer; in my experience the room to grow is worth the extra bits. With tables small enough to fit a 1- or 2-byte key, the index size difference is too small to matter anyway. For example: if you go with a 2-byte clustered key instead of a 4-byte one and max out the key space, you’re looking at 2^16 index entries, each 2 bytes smaller than they would otherwise be. That works out to 131,072 bytes (65,536 × 2), or only sixteen 8KB pages, and an index that size is going to end up in fast memory cache with any sort of regular usage. Also, don’t forget that your nonclustered index column selection contributes to the nonclustered index size, and a column of any significant size will quickly drown out the gains from going smaller than 4 bytes.
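If you’re curious how few pages are actually at stake, sys.dm_db_index_physical_stats (available since SQL Server 2005) will show you the real page counts; this example assumes the hypothetical dbo.Orders table sketched above:

SELECT  i.name AS index_name,
        ips.index_type_desc,
        ips.page_count  -- 8KB pages
FROM    sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Orders'), NULL, NULL, 'LIMITED') AS ips
JOIN    sys.indexes AS i
    ON  i.object_id = ips.object_id
    AND i.index_id  = ips.index_id;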

Agree? Disagree? Think I’m nuts? Are you on the development team of a widely distributed product and want to justify its DB design decisions? Comment below.

Technorati Tags: SQL Server,DB Design,Performance