Implicit Conversions

In my last couple of posts, I wrote about how explicit conversions can lead to errors.  In this post, I'm going to take a look at some issues involving implicit conversions.  SQL Server adds implicit conversions whenever you mix columns, variables, and/or parameters with different (but compatible) data types in a single expression.  For example, if you try to compare INT and FLOAT columns, the INT must be converted to a FLOAT.  If you write "C_INT = C_FLOAT", SQL Server rewrites this expressions as "CONVERT(FLOAT,C_INT) = C_FLOAT".

When performing implicit conversions, SQL Server will try to choose the conversion that is least likely either to fail due to an overflow or to lose precision.  For example, a SMALLINT will be converted to an INT since all SMALLINTs can be converted to INTs without any data loss.  On the other hand, an INT will be converted to a REAL since all INTs can be converted to REALs but not vice-versa.  However, the conversion is potentially lossy since some INTs contain more digits than can be represented by a REAL.

Note that some data types can only be combined in a single expression by using an explicit conversion and that some data types are not compatible at all with or without an explicit conversion.  This Books Online page includes a compatibility matrix showing all of the possible data type combinations.  For the remainder of this post, I'm going to focus only on implicit conversions.

To get started, I'd like to show an example that did not work so well on SQL Server 2000:

CREATE TABLE T (C INT)
CREATE UNIQUE CLUSTERED INDEX TC ON T(C)

INSERT T VALUES (1000000000)
INSERT T VALUES (1000000001)

First, let's consider a query that uses a clustered index scan:

DECLARE @V REAL
SET @V = 1E9
SELECT * FROM T WITH (INDEX(0)) WHERE C = @V

 C           
----------- 
1000000000
1000000001

At first glance, this may seem like an incorrect result.  The second row does not appear to match the predicate.  However, if we check the plan, we see that SQL added an implicit convert to the predicate:

  |--Clustered Index Scan(OBJECT:([T].[TC]), WHERE:(Convert([T].[C])=[@V]))

Moreover, if we check the results of the conversion, we can see that it is lossy.  Both INTs convert to the same REAL value.  The above results now make more sense:

SELECT *, CONVERT(REAL,C) REAL_C FROM T

 C           REAL_C                   
----------- ------------------------ 
1000000000  1.0E+9
1000000001  1.0E+9

Now, let's consider the same query with a clustered index seek:

DECLARE @V REAL
SET @V = 1E9
SELECT * FROM T WITH (INDEX(1)) WHERE C = @V

 C           
----------- 
1000000000

Hold it!  What happened to the second row?  Let's check the plan:

  |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1002], [Expr1003], [Expr1004]))
       |--Compute Scalar(DEFINE:([Expr1002]=Convert([@V])-1, [Expr1003]=Convert([@V])+1, [Expr1004]=If (Convert([@V])-1=NULL) then 0 else 6|If (Convert([@V])+1=NULL) then 0 else 10))
       |    |--Constant Scan
       |--Clustered Index Seek(OBJECT:([T].[TC]), SEEK:([T].[C] > [Expr1002] AND [T].[C] < [Expr1003]),  WHERE:(Convert([T].[C])=[@V]) ORDERED FORWARD)

This plan looks rather complicated, but aside from unfortunately getting a wrong answer (at least according to the conversion semantics we just discussed), it is really quite simple.  To perform an index seek, SQL Server must have key values that match the data type stored in the index.  SQL Server cannot perform an index seek on an INT index using a REAL key.  So, SQL Server must convert the REAL variable to an INT.  Because, as we've already seen, conversions can be lossy, SQL Server also expands the range of values returned by the index seek by subtracting and adding one to the result of the conversion.  Basically, after substituting the value of the parameter, this plan is the same as running the following query:

SELECT * FROM T WHERE (C > 999999999 AND C < 1000000001) AND CONVERT(REAL,C) = 1E9

Notice that the original predicate with the convert is retained in case the index seek returns too many rows and some must be filtered out.  In most cases, this algorithm works just fine; I've deliberately selected a scenario that does not work so well.  In addition to potentially omitting results, this algorithm may also harm performance.  For instance, consider this example:

CREATE TABLE T (C NUMERIC(12,4))
CREATE UNIQUE CLUSTERED INDEX TC ON T(C)

INSERT T VALUES (0.0001)
INSERT T VALUES (0.0002)
INSERT T VALUES (0.0003)
...

DECLARE @P REAL
SET @P = 1.0000
SELECT * FROM T WITH (INDEX(1)) WHERE C = @P

Suppose that table T includes many values in the range 0.0000 to 2.0000.  In this example, the conversion is lossless.  Only rows with the value 1.0000 should be returned.  Unfortunately, the index seek still returns all of the rows in the range and the filter with the convert discards all of them except for those with the value 1.0000.

Now let's see how SQL Server 2005 (and 2008) handle these same queries.  Let's start with the clustered index scan:

DECLARE @P REAL
SET @P = 1E9
SELECT * FROM T WITH (INDEX(0)) WHERE C = @P

  |--Clustered Index Scan(OBJECT:([T].[TC]), WHERE:(CONVERT_IMPLICIT(real(24),[T].[C],0)=[@P]))

This is the same plan that we got from SQL Server 2000 and we get the same result (both rows).  The only difference is that SQL Server 2005 indicates that the convert is implicit and provides the type of the conversion.  Next let's try the clustered index seek:

DECLARE @P REAL
SET @P = 1E9
SELECT * FROM T WITH (INDEX(1)) WHERE C = @P

  |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1005], [Expr1006], [Expr1004]))
       |--Compute Scalar(DEFINE:(([Expr1005],[Expr1006],[Expr1004])=GetRangeThroughConvert([@P],[@P],(62))))
       |    |--Constant Scan
       |--Clustered Index Seek(OBJECT:([T].[TC]), SEEK:([T].[C] > [Expr1005] AND [T].[C] < [Expr1006]),  WHERE:(CONVERT_IMPLICIT(real(24),[T].[C],0)=[@P]) ORDERED FORWARD)

This plan is similar to the SQL Server 2000 plan with one key difference.  The conversion of the seek key from REAL to INT is performed by an internal function GetRangeThroughConvert which correctly determines the exact range of values to seek.  A little experimentation will show that, in this example, the range works out to all values between 999,999,968 and 1,000,000,479.  On the other hand, for the example with the numeric column, the "range" works out to exactly 1.0000.  This solution is both more accurate and more efficient than the SQL Server 2000 algorithm.

Finally, let's take a look at the performance implications of the new plan: 

CREATE TABLE T1 (C_INT INT, C_REAL REAL)
CREATE CLUSTERED INDEX T1_C_REAL ON T1(C_REAL)

CREATE TABLE T2 (C_INT INT, C_REAL REAL)
CREATE CLUSTERED INDEX T2_C_INT ON T2(C_INT)

SET NOCOUNT ON
DECLARE @I INT
SET @I = 0
WHILE @I < 100000
  BEGIN
    INSERT T1 VALUES (@I, @I)
    INSERT T2 VALUES (@I, @I)
    SET @I = @I + 1
  END

Here are three join plans.  I'm using a join hint to ensure that I do not get a hash join which uses a completely different algorithm.  I'm using COUNT(*) to eliminate the overhead of transferring output rows to the client and an OPTION hint to avoid parallelism.  The first plan joins the two INT columns.  The second plan joins the two REAL columns.  The third plan joins the INT and REAL columns and uses the same algorithm with the GetRangeThroughConvert function as the simpler query above.

SELECT COUNT(*)
FROM T1 INNER LOOP JOIN T2 ON T1.C_INT = T2.C_INT
OPTION(MAXDOP 1)

  |--Compute Scalar(DEFINE:([Expr1008]=CONVERT_IMPLICIT(int,[Expr1012],0)))
       |--Stream Aggregate(DEFINE:([Expr1012]=Count(*)))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T1].[C_INT], [Expr1011]) WITH UNORDERED PREFETCH)
                 |--Clustered Index Scan(OBJECT:([T1].[T1_C_REAL]))
                 |--Clustered Index Seek(OBJECT:([T2].[T2_C_INT]), SEEK:([T2].[C_INT]=[T1].[C_INT]) ORDERED FORWARD)

SELECT COUNT(*)
FROM T2 INNER LOOP JOIN T1 ON T2.C_REAL = T1.C_REAL
OPTION(MAXDOP 1)

  |--Compute Scalar(DEFINE:([Expr1008]=CONVERT_IMPLICIT(int,[Expr1012],0)))
       |--Stream Aggregate(DEFINE:([Expr1012]=Count(*)))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T2].[C_REAL], [Expr1011]) WITH UNORDERED PREFETCH)
                 |--Clustered Index Scan(OBJECT:([T2].[T2_C_INT]))
                 |--Clustered Index Seek(OBJECT:([T1].[T1_C_REAL]), SEEK:([T1].[C_REAL]=[T2].[C_REAL]) ORDERED FORWARD)

SELECT COUNT(*)
FROM T1 INNER LOOP JOIN T2 ON T1.C_REAL = T2.C_INT
OPTION(MAXDOP 1)

  |--Compute Scalar(DEFINE:([Expr1008]=CONVERT_IMPLICIT(int,[Expr1014],0)))
       |--Stream Aggregate(DEFINE:([Expr1014]=Count(*)))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([T1].[C_REAL]))
                 |--Clustered Index Scan(OBJECT:([T1].[T1_C_REAL]))
                 |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1012], [Expr1013], [Expr1011]))
                      |--Compute Scalar(DEFINE:(([Expr1012],[Expr1013],[Expr1011])=GetRangeThroughConvert([T1].[C_REAL],[T1].[C_REAL],(62))))
                      |    |--Constant Scan
                      |--Clustered Index Seek(OBJECT:([T2].[T2_C_INT]), SEEK:([T2].[C_INT] > [Expr1012] AND [T2].[C_INT] < [Expr1013]),  WHERE:([T1].[C_REAL]=CONVERT_IMPLICIT(real(24),[T2].[C_INT],0)) ORDERED FORWARD)

I measured the times using SET STATISTICS TIME ON.  Here are the results I got on my system:

SQL Server Execution Times:   CPU time = 750 ms,  elapsed time = 741 ms.

SQL Server Execution Times:   CPU time = 735 ms,  elapsed time = 742 ms.

SQL Server Execution Times:   CPU time = 1609 ms,  elapsed time = 1605 ms.

Notice that each of first two queries with the joins on the matching types complete in 740ms while the third query with the join on the mismatched types takes more than twice as long at 1600ms!  The moral of this story?  Try to stick to matching types whenever possible.