TOP 100 Percent ORDER BY Considered Harmful.

(Updated 2006-27-03 9:00am Pacfiic Time - at the bottom)

SQL is a declarative language.  That means that the language declares the form of the output but not the method used to generate those results.  There are cases, however, where the language is not quite rich enough to describe what customers want.  There are also cases where it is possible to over-infer guarantees that are not really in the language at all.

ORDER BY is an example that I'd like to discuss in this post.  In ANSI SQL, the language has the ability to specify the output order of a query (the "presentation order").   This applies to query results and to cursor results, if a cursor is being used.  This is exposed via ORDER BY, and this is really only legal in the ANSI spec on the outer-most query block of a query.  Microsoft SQL Server allows this in more places than the spec indicates (which is something we can do and still be in-line with the specification).  Specifically, we allow the use of ORDER BY in sub-selects or in view definitions to help define the set of rows that qualify with a TOP operation (TOP is not in ANSI SQL, by the way).

The default plan implementation for this code happens to sort the rows as part of performing the TOP operation.  Often this meant that the results happened to be returned in sorted order, and this led customers to believe that there was a guarantee that rows were sorted.  This is actually not the case.  If you want rows to be returned to the user in sorted order, you need to use an ORDER BY on the outermost query block (per ANSI) to guarantee the output presentation order. 

In SQL Server 2005, you can see how the output order is *not* guaranteed through the following example:

use tempdb

create table t1 (col1 int, col2 int)
declare @i int
set @i = 0
while @i < 20
begin
insert into t1(col1, col2) values (@i, rand()*1000)
set @i = @i + 1
end

create view v as (select top 100 percent * from t1 order by col1 desc)

set showplan_text on
select * from v

The output from this example is:

StmtText
----------------------------------------------
|--Table Scan(OBJECT:([tempdb].[dbo].[t1]))

col1 col2
----------- -----------
0 443
1 418
2 291
3 726
4 948
5 315
6 835
7 247
8 755
9 78
10 88
11 906
12 640
13 876
14 422
15 746
16 528
17 909
18 186
19 868

You'll notice that the original table was created as a heap (no clustered index) and that no secondary indexes are defined.  So, if there were an ordering guarantee for the query based on the order by in the view, the rows should be sorted in descending order and the query plan would need to have a sort in it to make that happen.  However, you'll notice that the query plan contains only a Table Scan, which will return the rows back in the order they happen to be read from disk.  In this particular case, the optimizer recognizes that TOP 100 PERCENT qualifies all rows and does not need to be computed at all.  It gets removed from the query plan, and there is no other reason to do an intermediate sorting operation.  As such, the output isn't returned in any particular order.

So, please do not assume that nested sub-selects will guarantee order.  The optimizer will consider rewrites that invalidates this assumption.  If you need rows returned in a particular order, please add that to the outermost block of your SELECT statement.

Thanks,

Conor Cunningham

PS: Update... One comment I received seeks a bit more detail on why the ORDER BY is "ignored" in this case.  I'll try to expand a bit more to see if that helps. 

From the semantics of the query, the optimizer only really honors the ORDER BY as part of the evaluation of the TOP in that same scope.  The syntax is a bit unfortunate because it causes people to believe that things "will be ordered".  However, it really only says "I want this set of rows".  Presentation orders only apply to the output of the query, not intermediate nodes.  Since we can reorder operations, you can't actually view this as a procedural guarantee "first I sort, then I do whatever is 'above' the sort".  You don't need an optimizer if that were the case, as you aren't asking declarative questions anymore. 

The bottom line is that even if we do the sort as part of the TOP operation in a sub-select, it does NOT guarantee anything about the output order of the query. 

SELECT TOP 99 PERCENT * FROM T ORDER BY col1

is not the same as:

SELECT * FROM (SELECT TOP 99 PERCENT * FROM T ORDER BY col1) AS A

The top query guarantees the output order of the query.  The bottom query does not (even if the rows happen to come back in sorted order)