T-SQL Anti-pattern of the day: Prefix codes stored in (n)varchar fields

In the course of my work as a SQL Server Premier Field Engineer, we tend to see a lot of patterns in code. Some of these patterns do not lend themselves to optimal performance. This post is the first in a series of such ‘how not to code’ tips which would hopefully help some T-SQL developers out there.

Scenario

OK. So imagine a situation where there’s a table like this:

CREATE TABLE Orders
(
    OrderNumber nvarchar(50) PRIMARY KEY NOT NULL,
    ShipAddress nvarchar(255),
    IsDelivered    bit
)

Now in itself the table design looks innocuous except that the OrderNumber is encoded as per a scheme which looks like this:

First 3 characters: BranchId which generated this order
Next 12 characters: Date and time encoded as YYYYMMDDHHmm
Next 35 characters would accommodate a running number reset each day

Because of the fact that the PRIMARY KEY constraint is by default implemented as a clustered index, we get the following index on OrderNumber:

PK__Orders__0AD2A005 clustered, unique, primary key located on PRIMARY OrderNumber

Problem

SELECT queries often run on this table which list the orders by branch. Those typically look like this:

SELECT OrderNumber, ShipAddress
FROM Orders
WHERE SUBSTRING(OrderNumber, 1, 3) = N’004′

Not surprisingly we get a table scan, though there is an index on OrderNumber. Note the parallelism gather streams operator also. That’s because the cost of the overall query plan exceeded the default ‘cost threshold for parallelism’ (5).

This is a very common pattern I observe, especially on transactional tables which have a large number of enquiries (SELECTs) on them. The resultant logical reads due to the index scans cause poor response times. In many cases the cost of the scan exceeds the cost threshold for parallelism and further compounds the problem.

Root Cause

So, what is the problem here? Evidently it is a problem of database design. Storing such encoded values in a field is a violation of the First Normal Form in database design (http://support.microsoft.com/kb/283878). So the right solution would involve taking this encoded BranchId and placing it in another field.

Workaround

However, it is not easy to implement this overnight in most systems which are in production. For the above specific query, there is a way to improve performance, though. The SUBSTRING is used to strip the first 3 characters and then compare with an input parameter. This can be replaced with an equivalent query such as:

SELECT OrderNumber, ShipAddress
FROM Orders
WHERE OrderNumber LIKE N’004%’

The resultant execution plan looks like this, with a dramatically reduced cost now that the index is being used to do a seek:

Of course there are other possibilities involving indexed views / computed columns but those I will examine in another post.

Conclusion

Database design conforming to the normal forms is generally a good idea to drive usage of indexes and have acceptable response times. In some cases where queries use a predicate comparing the prefix of values stored in a field, it may be worthwhile examining the use of LIKE as compared to a SUBSTRING.

Arvind Shyamsundar's technical blog

Arvind Shyamsundar is a Principal PM @ MSFT Azure Data, working on Azure SQL. Data geek. Apache Accumulo and Fluo PMC. SQL MCM, ex-Principal PFE (MSFT Services). These are my own opinions and not those of Microsoft.

T-SQL Anti-pattern of the day: Prefix codes stored in (n)varchar fields

Leave a comment Cancel reply

Share this:

Related

Leave a comment Cancel reply