More on TOP

Last week I wrote about a special case of the TOP operator known as ROWCOUNT TOP.  This week I'll take a look at some other interesting TOP scenarios.  In general, TOP is a fairly mundane operator.  It simply counts and returns the specified number of rows.  SQL Server 2005 does include two enhancements to TOP that were not present in SQL Server 2000.  First, in SQL Server 2000, we can only specify an integer constant for the number of rows to return.  In SQL Server 2005, we can specify an arbitrary expression, including an expression containing T-SQL variables or parameters.  Second, SQL Server 2000 only permits TOP in a SELECT statement (though it does support ROWCOUNT TOP in INSERT, UPDATE, and DELETE statements).  SQL Server 2005 permits TOP with SELECT, INSERT, UPDATE, and DELETE statements.

In this post, I'm going to focus only on some simple examples involving SELECT statements.  Let's create a small table to get started:

CREATE TABLE T (A INT, B INT)
CREATE CLUSTERED INDEX TA ON T(A)

SET NOCOUNT ON
DECLARE @i INT
SET @i = 0
WHILE @i < 100
  BEGIN
    INSERT T VALUES (@i, @i)
    SET @i = @i + 1
  END
SET NOCOUNT OFF

The plan for the simplest possible TOP query should not need any explanation:

SELECT TOP 5 * FROM T

Rows   Executes
5      1        |--Top(TOP EXPRESSION:((5)))
5      1             |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]))

TOP is often used in conjunction with ORDER BY.  Combining TOP with ORDER BY adds determinism to the set of rows returned.  Without the ORDER BY, the set of rows returned depends on the query plan and may even vary from execution to execution.  If we have an index to provide order, we continue to get a simple query plan (notice the ORDERED FORWARD keywords):

SELECT TOP 5 * FROM T ORDER BY A

Rows   Executes
5      1        |--Top(TOP EXPRESSION:((5)))
5      1             |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]), ORDERED FORWARD)

If we do not have an index to provide order, SQL Server must sort the data:

SELECT TOP 5 * FROM T ORDER BY B

Rows   Executes
5      1        |--Sort(TOP 5, ORDER BY:([tempdb].[dbo].[T].[B] ASC))
100    1             |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]))

Note that without an index to find the top 5 rows, SQL Server must scan all 100 rows in the input table.  Also note that the sort is actually a "top sort."  A top sort generally uses less memory than a regular sort as it only needs to identify and sort the top few rows rather than sorting the entire input.

Now let's consider what happens if we request the top 5% of rows.  To figure out the actual number of rows to return, SQL Server must count all of the rows and calculate 5%.  This makes queries that use TOP PERCENT less efficient than queries that use TOP with an absolute row count.

SELECT TOP 5 PERCENT * FROM T

Rows   Executes
5      1        |--Top(TOP EXPRESSION:((5.000000000000000e+000)) PERCENT)
5      1             |--Table Spool
100    1                 |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]))

Like the prior example, SQL Server must scan all 100 rows in the input table.  In this example, SQL Server uses an eager spool which reads and counts all of the input rows before returning any rows.  The top then requests the row count from the spool, calculates 5%, and proceeds like any other top.

If SQL Server must sort, the sort can also provide the count of input rows.  However, only a regular sort can provide this count.  A top sort must know the number of rows to return from the start.

SELECT TOP 5 PERCENT * FROM T ORDER BY B

Rows   Executes
5      1        |--Top(TOP EXPRESSION:((5.000000000000000e+000)) PERCENT)
5      1             |--Sort(ORDER BY:([tempdb].[dbo].[T].[B] ASC))
100    1                  |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]))

TOP WITH TIES is also incompatible with top sort.  TOP WITH TIES cannot know for certain how many rows will be returned until the number of ties is determined.  For this example, let's add a tie for the fifth row returned:

INSERT T VALUES (4, 4)

SELECT TOP 5 WITH TIES * FROM T ORDER BY B

Rows   Executes
6      1        |--Top(TOP EXPRESSION:((5)))
7      1             |--Sort(ORDER BY:([tempdb].[dbo].[T].[B] ASC))
101    1                  |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]))

Although the above plan does not seem to reflect that we have a TOP WITH TIES, the argument column of SHOWPLAN_ALL or STATISTICS PROFILE includes this information:  "TIE COLUMNS:([T].[B])".  This information is also available in the graphical and XML plans on SQL Server 2005.  Note that the TOP actually returns one extra row.  When a TOP N WITH TIES reaches the Nth row, it keeps a copy of the tie columns for this row (in this example B==4) and compares each subsequent row to that row.  As long as there are rows that match, it keeps returning them.  Because the top must compare subsequent rows until it finds a non-match to the Nth row, the top retrieves one more row from the sort than it returns.

Finally, there are a couple of special cases.  The optimizer knows that TOP 0 and TOP 0 PERCENT never return any rows and replaces any query plan with a TOP 0 with a constant scan:

SELECT TOP 0 * FROM T

  |--Constant Scan

The optimizer also knows that TOP 100 PERCENT always returns all rows and removes the top operator from the query plan:

SELECT TOP 100 PERCENT * FROM T

  |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T].[TA]))

Both of these optimizations require that the number of rows is a constant.  For example, using an expression that includes a T-SQL variable or parameter will prevent the optimization.  Both optimizations also work with INSERT, UPDATE, and DELETE statements.

Note that using TOP to work around SQL language restrictions on the use of ORDER BY in sub-selects or in views or to force specific plan evaluation orders is not recommended.