Deadlock Troubleshooting, Part 3

Here’s an example of the classic scenario that is usually used to introduce the concept of a deadlock in a database:

Process A

Process B

1. Begin Transaction

1. Begin Transaction

2. Update Part table

2. Update Supplier table

à

3. Update Supplier table

3. Update Part table

ß

4. Commit Transaction

4. Commit Transaction

If Process A and Process B each reached step #3 in their respective transactions at approximately the same time, it’s easy to see how they could end up blocking each other. The most obvious solution to this deadlock is to change the order of the UPDATE statements in one of the transactions, so that lock resources are acquired in a consistent order.

Instead of this overly simplistic deadlock, let’s take a closer look at the deadlock scenario demonstrated in Deadlock Troubleshooting, Part 2. In that case, these two stored procedures ended up deadlocked:

       CREATE PROC p1 @p1 int AS

       SELECT c2, c3 FROM t1 WHERE c2 BETWEEN @p1 AND @p1+1

       GO

       CREATE PROC p2 @p1 int AS

       UPDATE t1 SET c2 = c2+1 WHERE c1 = @p1

       UPDATE t1 SET c2 = c2-1 WHERE c1 = @p1

       GO

There’s no UPDATE, DELETE, or INSERT in the first proc; it consists of a single SELECT. And even the two UPDATE statements in the second proc aren’t wrapped in an outer BEGIN TRAN/COMMIT TRAN. Both UPDATEs ran within their own autocommit transaction, which means that only one of them could have been involved in the deadlock. Clearly this doesn’t fit the stereotypical “modify A then modify B / modify B then modify A” deadlock model described above. This isn’t an edge case, by the way. We actually see this type of deadlock – where one or both of the participants are in the middle a single-query, autocommit transaction – more often than easy-to-understand deadlock scenarios involving two multi-statement transactions that just modify two tables in a different order.

So, what would you do if DTA hadn’t automagically recommended a new index that prevented this deadlock? To craft your own solution by hand, you need a deeper understanding of the deadlock than we have at the moment.

What caused this deadlock?

We’ll need to refer back to the deadlock summary that was distilled from the -T1222 output (see Deadlock Troubleshooting, Part 1 for a refresher on decoding -T1222):

            Spid X is running this query (line 2 of proc [p1], inputbuffer “… EXEC p1 4 …”):
SELECT c2, c3 FROM t1 WHERE c2 BETWEEN @p1 AND @p1+1
Spid Y is running this query (line 2 of proc [p2], inputbuffer “EXEC p2 4”):
UPDATE t1 SET c2 = c2+1 WHERE c1 = @p1

The SELECT is waiting for a Shared KEY lock on index t1.cidx. The UPDATE holds a conflicting X lock.
The UPDATE is waiting for an eXclusive KEY lock on index t1.idx1. The SELECT holds a conflicting S lock.

First, let’s look at the query plan for the SELECT query. To view the plan, execute “SET STATISTICS PROFILE ON”, then run “EXEC p1 4”.

   SELECT c2, c3 FROM t1 WHERE c2 BETWEEN @p1 AND @p1+1

  |--Nested Loops(Inner Join, OUTER REFERENCES:([Uniq1002], [t1].[c1]))

   |--Index Seek(OBJECT:([t1].[idx1]), SEEK:([t1].[c2] >= [@p1] AND [t1].[c2] <= [@p1]+(1)) ORDERED FORWARD)

  |--Clustered Index Seek(OBJECT:([t1].[cidx]), SEEK:([t1].[c1]=[t1].[c1] AND [Uniq1002]=[Uniq1002]) LOOKUP ORDERED FORWARD)

A Nested Loop join executes its first child operator once, and executes the second child operator for each row returned by the first child (see this post for details). In this case, the first child is a nonclustered Index Seek to find the rows “WHERE c2 BETWEEN @p1 AND @p1+1”. For each qualifying row in the nonclustered index, a second seek is done on the clustered index to look up the whole data row. This clustered index seek is necessary because the nonclustered index does not cover the query. If you’re running SQL 2000, you’ll see a different-looking plan:

  SELECT c2, c3 FROM t1 WHERE c2 BETWEEN @p1 AND @p1+1

  |--Bookmark Lookup(BOOKMARK:([Bmk1000]), OBJECT:([t1]))

  |--Index Seek(OBJECT:([t1].[idx1]), SEEK:([t1].[c2] >= [@p1] AND [t1].[c2] <= [@p1]+1) ORDERED FORWARD)

For practical purposes, these two plans are identical. The purpose of the Bookmark Lookup operator in the SQL 2000 plan is to visit the clustered index to retrieve the full set of columns for a row identified by a nonclustered index. In SQL 2005 this same operation is expressed as a loop join between the nonclustered index and the clustered index. For this deadlock, it’s simply important to note that both plans calls for a seek from the nonclustered index, then a seek from the clustered index.

Now let’s look at the UPDATE:

  UPDATE t1 SET c2 = c2+1 WHERE c1 = @p1

  |--Clustered Index Update(OBJECT:([t1].[cidx]), OBJECT:([t1].[idx1]), SET:([t1].[c2] = [Expr1004]))

   |--Compute Scalar(DEFINE:([Expr1013]=[Expr1013]))

      |--Compute Scalar(DEFINE:([Expr1004]=[t1].[c2]+(1), [Expr1013]=CASE WHEN CASE WHEN ...

         |--Top(ROWCOUNT est 0)

            |--Clustered Index Seek(OBJECT:([t1].[cidx]), SEEK:([t1].[c1]=[@p1]) ORDERED FORWARD)

The UPDATE has a fairly simple query plan. The two most significant operators are the first and the last one. The Clustered Index Seek locates the rows that quality for the “WHERE c1 = @p1” predicate. Once a qualifying row has been found, the Clustered Index Update operator acquires an eXclusive key lock on the clustered index and modifies the row.

We now have a full understanding of how the UPDATE blocks the SELECT: the UPDATE acquires an X lock on a clustered index key, and that lock blocks the SELECT’s bookmark lookup on the clustered index. But the other half of the deadlock – the reason that the SELECT blocks the UPDATE – isn’t quite so obvious. The -T1222 told us “The UPDATE is waiting for an eXclusive KEY lock on index t1.idx1. The SELECT holds a conflicting S lock.” It’s not very apparent from the plan, but the UPDATE needs an X lock on the nonclustered index [idx1] because the column it is updating ([c2]) is one of the non-clustered index’s key columns. Any change to an index key column means that a row in the index must be relocated, and that relocation requires an X lock.

This is a key point to remember when trying to understand many deadlocks: the access path to find the qualifying rows is important, but index updates implied by the columns being modified can be just as important. To make things more confusing, sometimes you’ll see explicit “Index Update” or “Index Delete” operators in the plan for each nonclustered index that needs to be updated, while other times these don’t show up in the plan. (For more info on this check out Wide vs. Narrow Plans.)

To summarize: the SELECT used the nonclustered index to find a qualifying row. While holding a Shared lock on the nonclustered index, it needs to jump over to the clustered index and retrieve some columns that aren’t part of the nonclustered index. While it’s doing this, the UPDATE is busy doing a seek on the clustered index. It finds a row, locks it and modifies it. But because one of the columns being modified is a key column in the nonclustered index, it then has to move to the nonclustered index and update that index, too. This requires a second X key lock on the nonclustered index. So, the SELECT ends up blocked waiting for the UPDATE to release his X lock on the clustered index, while the UPDATE winds up blocked and waiting for the SELECT to release his S lock on the nonclustered index.

Hopefully it’s clear that even though each participant in this deadlock is just a single query, this is still a problem caused by out-of-order resource access patterns. The SELECT statement locks a key in the nonclustered index, then locks a key in the clustered index. The problem is that the UPDATE needs to lock the same two resources, but because of its query plan, it tries to lock them in the opposite order. In a sense, it’s really the same problem as the simple deadlock scenario described at the beginning of this post.

The locks acquired by a query aren’t acquired all at once. A query plan is like a little program. It wouldn’t be terribly inaccurate, for example, to think of a nested loop join as a FOR loop. Each iteration of the loop acquires a key lock on the outer table, then holds that lock while looking up (and locking) matching rows in the inner table. Deadlocks like this one are a little harder to figure out because the order of resource access within a single query depends on the query plan, and can’t be determined just by looking at the T-SQL.

How did DTA’s new index avoid the deadlock?

Here’s an index that will prevent this deadlock:

            CREATE INDEX idx2 ON t1 (c2, c3)

This index “covers” the query “SELECT c2, c3 FROM t1 WHERE c2 BETWEEN @p1 AND @p1+1”, which is just another way of saying that the index includes all of the columns referenced by the query. SQL will use this index instead of the [idx1] index because the plan based on the covering index is cheaper. The fact that the index covers the query means that the bookmark lookup against the clustered index is no longer necessary. And since the SELECT no longer needs to access the clustered index, it won’t get blocked by the UPDATE’s lock on the clustered index.

What other solutions are available?

All deadlocks boil down to out-of-order resource access patterns. In the simple deadlock scenario described at the beginning of this post, the solution is obvious: just reverse the two UPDATE statements in one of the transactions, and you won’t end up deadlocked. But in the more complex scenario that we just explored, it’s not so clear how to change the order in which locks are acquired. Each deadlock participant is running a single-query, autocommit transaction, so you can’t just swap the order of two queries to acquire resources in a different order. SQL is a language designed to express high-level set operations; the specifics of how the database should go about retrieving and updating the specified set of data is generally left up to the SQL engine, with good reason. However, you do have some options for either influencing which lock resources a query needs, or modifying the order in which it acquires the locks. Below are six different possible solutions to this deadlock. Some of these are not ideal for this particular deadlock, but they are still worth exploring since the approach to deadlock avoidance that they illustrate may be the best possible solution for some other deadlock you encounter.

  • The new index is arguably the simplest and most elegant solution. This deadlock occurs because two queries take different paths to the same resource. The new index avoids the deadlock by eliminating any need for the SELECT to access the row in the clustered index. As a happy side effect, it also speeds up the SELECT query.
           CREATE INDEX idx2 ON t1 (c2, c3)
  • If you’re running SQL 2005, you could use the new SNAPSHOT or READ_COMMITTED_SNAPSHOT isolation levels.
           ALTER DATABASE deadlocktest SET READ_COMMITTED_SNAPSHOT ON
  • Adding a NOLOCK hint to the SELECT will avoid the deadlock, but be cautious of this solution -- dirty reads can cause runtime errors and will expose you to uncommitted data.
           ALTER PROC p1 @p1 int AS
    SELECT c2, c3 FROM t1 WITH (NOLOCK) WHERE c2 BETWEEN @p1 AND @p1+1
  • As mentioned above, this deadlock occurs because two queries take different paths to the same resource. By forcing one of the queries to use the same index as the other query, you can prevent the deadlock. However, SQL chose query plans that used two different indexes because those were the most efficient plans available for the two queries. By forcing a different index path, you are actually slowing down one of the queries. This may be OK since it does avoid the deadlock, but you should test to make sure the cost is acceptable.
           ALTER PROC p1 @p1 int AS
    SELECT c2, c3 FROM t1 WITH (INDEX=cidx) WHERE c2 BETWEEN @p1 AND @p1+1
    If this query was coming from an application as an ad hoc query (not part of a stored proc), you could either modify the app to specify the index hint or use a plan guide with OPTION (USE PLAN...) if modifying the app wasn't possible. Plan guides are available in SQL 2005 and later.
  • One way to look at this deadlock is as a problem that arises because there’s an index on a frequently-updated column. Dropping the nonclustered index [idx1] will avoid the deadlock by (a) depriving the SELECT of its alternate access path to the row, and (b) preventing the UPDATE from having to update the nonclustered index row when it updates the [c2] column. Like the prior solution, however, this will slow down the SELECT and any other queries that use this index.
    DROP INDEX t1.idx1
  • You could force one of the transactions to block at an earlier point, before it has had an opportunity to acquire the lock that ends up blocking the other transaction. In the example below, the SELECT proc has been modified to run a new query that acquires and holds a lock on the clustered index before it accesses the nonclustered index. In effect, this changes the order of resource access from (nonclustered, clustered) to (clustered, nonclustered). Since that’s the same order that the UPDATE uses, the deadlock is no longer an issue.
           ALTER PROC p1 @p1 int AS
    BEGIN TRAN
    DECLARE @x int
    SELECT @x = COUNT(*) FROM t1 WITH (HOLDLOCK, UPDLOCK) WHERE c1 = @p1
    SELECT c2, c3 FROM t1 WHERE c2 BETWEEN @p1 AND @p1+1
    COMMIT TRAN

If you can think of any other solutions, please share them in a comment.