A common problem for some SQL Server applications is data that typically ascends, for example a datetime column that records the current date. SQL Server builds statistics on the assumption that future data will, by and large, resemble existing data. When data typically ascends, however, most newly inserted values fall outside the range previously captured by the statistics. This can lead to poorly performing plans, because filters that select recent data appear to exclude the entire relation when in fact they include a significant number of rows.
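
One way to observe the situation described above is to compare the upper end of a histogram with the data actually in the table. The following is a minimal sketch; the table `dbo.Orders`, its column `OrderDate`, and the index `IX_Orders_OrderDate` are hypothetical names used only for illustration.

```sql
-- Hypothetical objects: table dbo.Orders with an index IX_Orders_OrderDate on OrderDate.
-- Show the histogram; the last RANGE_HI_KEY is the highest value the statistics know about.
DBCC SHOW_STATISTICS ('dbo.Orders', IX_Orders_OrderDate) WITH HISTOGRAM;

-- Compare with the highest value actually present in the table.
SELECT MAX(OrderDate) AS max_order_date
FROM dbo.Orders;
```

If the maximum in the table is well beyond the last histogram step, filters on recent values are being estimated outside the range the statistics describe.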

Trace flags 2389 and 2390, both new in SQL Server 2005 SP1, can help address this problem. SQL Server 2005 SP1 begins to track the nature of columns across successive statistics updates. When the statistics are seen to increase three times, the column is branded ascending. If trace flag 2389 is set, a column is branded ascending, and a covering index exists with the ascending column as the leading key, then the statistics will be updated automatically at query compile time. A statement is compiled to find the highest value, and a new step is added at the end of the existing histogram to model the recently added data.

Trace flag 2390 enables the same behavior even if the ascending nature of the column is not known. As long as the column is the leading column of an index, the optimizer will refresh the statistic (with respect to the highest value) at query compile time. Never use 2390 alone, since the logic would then be disabled as soon as the ascending nature of the column became known.

-- enable auto-quick-statistics update for known ascending keys

**dbcc traceon( 2389 )**

-- enable auto-quick-statistics update for all columns, known ascending or unknown

**dbcc traceon( 2389, 2390 )** -- never enable 2390 alone
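
As a small companion sketch to the commands above, the flags can be checked and switched off again with the standard `DBCC TRACESTATUS` and `DBCC TRACEOFF` commands. Without the `-1` argument these commands apply to the current session only.

```sql
-- Report whether the two flags are currently enabled.
DBCC TRACESTATUS (2389, 2390);

-- Turn both flags off again for the current session.
DBCC TRACEOFF (2389, 2390);
```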

Trace flag 2301, available in SQL Server 2005 SP1, enhances the modelling ability of the query optimizer to better handle complex statements. Improved modelling can lead to dramatically faster performing query plans in some cases. These extensions to the query processor modelling abilities can lead to increased compile time and so should only be used by applications which compile infrequently. The model extensions are as follows:

Integer Modelling

Normally, histogram modelling assumes that values between histogram steps are equally distributed to every numerical double code point. This modelling extension remembers, for integer base types, that values can only occur on integer code points and this improves cardinality estimates for inequality filters.
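
The difference between the two models can be illustrated with simple arithmetic. The sketch below is only an illustration of the idea, not the optimizer's actual formula: it assumes a histogram step covering the open range (10, 20) on an integer column and a filter of the form `col > 18`.

```sql
-- Illustrative arithmetic only; the bounds and the filter value are assumptions.
DECLARE @lo float, @hi float, @bound float;
SET @lo = 10;     -- lower step boundary (exclusive)
SET @hi = 20;     -- upper step boundary (exclusive)
SET @bound = 18;  -- filter: col > 18

SELECT
    -- Continuous model: any real value in (10, 20) is possible, so (18, 20) is 2/10 of the range.
    (@hi - @bound) / (@hi - @lo)         AS continuous_fraction,
    -- Integer model: only 11..19 can occur (9 values), and only 19 satisfies col > 18.
    (@hi - @bound - 1) / (@hi - @lo - 1) AS integer_fraction;
```

Under these assumptions the continuous model attributes 20% of the step's rows to the filter, while the integer model attributes one ninth (about 11%).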

Comprehensive Histogram Usage

Normally, histograms are ignored when the cardinality of a relation drops below the number of steps in a histogram. This is a heuristic which captures the likelihood that a histogram continues to describe a relation. This modelling extension applies the histogram in cardinality estimation regardless of the cardinality estimate for the relation.

Base Containment Assumption

Normally, when two relations are joined, we assume that X distinct code points in the same key range on input relation R will join with Y distinct code points in the same key range on input relation S such that MIN(X,Y) will find matches. This assumption is called Simple Containment: we assume that the smaller number of distinct code points all match code points from the other side. This modelling ignores the relative population of distinct code points in the base forms of R and S, and also ignores any filtering that has occurred on the base forms of R and S before joining. Base containment applies the containment assumption only to the base relations and uses probabilistic methods to compute the degree of joining. In addition, implied filters are modelled correctly, since their behavior is very different from that of orthogonal filters.

Comprehensive Density Remapping

Normally, when columns are CONVERTed only a small number of densities involving such columns are remapped to the new column definitions. Note that operations like convert rarely change the density of a column. Density is the measure of the number of duplicate values for each distinct value. With this modelling extension, all such remappings are applied which makes possible subsequent density matching for the purposes of cardinality estimation. In some cases, this can lead to excessive use of memory.

Comprehensive Density Matching

Normally, densities are matched when the very same base column is filtered or joined. With this modelling extension, the notion of equivalence of columns as a result of equi-joins is applied leading to more complete density matching for the purposes of improved cardinality estimation.

These extensions were all developed to address customer-found problems relating to poorly performing query plans. If customers experience such plans and one or more of the above extensions may help, then trace flag 2301 may be applied. It is important to note that compile times will increase, and in some cases memory consumption can increase dramatically. Thus, it is important to apply this trace flag with care and to test exhaustively before using it in production.

SQL Server 8.0 did not perform cardinality estimates based on the comparison of two constants. Instead, SQL Server 8.0 guessed at the resulting selectivity. The reasoning was that one or more of the constants may be statement parameters, which could change from one execution of the statement to the next. SQL Server 9.0 reversed this behavior and does estimate operator cardinality based on the comparison of two constants when both values are available at compile time. Constant values are available at compile time when 1) they are literal constants or 2) they are parameters to a stored procedure or are otherwise set at the nesting level above the compilation of the statement. This change can be problematic for statements that are optimized for one set of values but run with other sets of values. Applications that see adverse plan changes resulting from the change in behavior can revert to the behavior of the previous release by enabling trace flag **2328**. For example, to revert the behavior server-wide for the current invocation of SQL Server, one could run the following command:

dbcc traceon( 2328, -1)
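
To make the case of parameters compared with constants concrete, the following sketch shows a stored procedure whose filter compares a parameter with a literal. The procedure `dbo.GetOrders` and the table `dbo.Orders` are hypothetical names chosen for illustration.

```sql
-- Hypothetical procedure and table names, for illustration only.
CREATE PROCEDURE dbo.GetOrders
    @IncludeAll int,
    @CustomerID int
AS
BEGIN
    SELECT OrderID, CustomerID, OrderDate
    FROM dbo.Orders
    WHERE @IncludeAll = 1            -- comparison of two constants: a parameter and a literal
       OR CustomerID = @CustomerID;
END;
GO

-- The plan is compiled using the parameter values of the first call
-- and may then be reused for calls with different values.
EXEC dbo.GetOrders @IncludeAll = 1, @CustomerID = 0;
EXEC dbo.GetOrders @IncludeAll = 0, @CustomerID = 42;
```

A statement like this is a candidate for the adverse plan changes described above, because a plan optimized for one value of `@IncludeAll` may run with the other.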

If running SQL Server 8.0, or SQL Server 9.0 with trace flag 2328 enabled, the optimizer will guess the selectivity arising from the comparison of two constants. The guess used is the same as for comparing a column to a constant (where either the column distribution or the constant value, or both, is unknown). Since the selectivity of a comparison of two constants is either 0 or 1, one might expect the guess for such a comparison to be 50%. Although there is a lot of logic in that expectation, for historic reasons the guess used by SQL Server was very different.

Guessing the selectivity of a column-constant comparison attempts to model several phenomena. First, when there are many rows, the selectivity is likely smaller because there could be more values from which to choose. Second, when there are many such conjuncts, it is more likely that subsequent conjuncts will be correlated with previous conjuncts. SQL Server therefore expresses the guessed selectivity as a negative power of the input cardinality, and the reduction factor shrinks with each additional guess until it levels off. The following table shows the number of conjuncts guessed and the resultant cardinality and selectivity as a function of an input table cardinality of N:

| # Conjuncts | Cardinality | Selectivity |
|---|---|---|
| 1 | N^(3/4) | N^(-1/4) |
| 2 | N^(11/16) | N^(-5/16) |
| 3 | N^(43/64) | N^(-21/64) |
| 4 | N^(171/256) | N^(-85/256) |
| 5 | N^(170/256) | N^(-86/256) |
| 6 | N^(169/256) | N^(-87/256) |
| 7 | N^(168/256) | N^(-88/256) |
| ... | ... | ... |
| 175 | N^(0/256) | N^(-1) |
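
To get a feel for these guesses, the exponents from the table can be evaluated for a concrete cardinality. The sketch below assumes an input cardinality of N = 1,000,000, which is an arbitrary value chosen for illustration, and evaluates a few of the rows.

```sql
-- Evaluate selected rows of the guess table for an assumed cardinality N.
DECLARE @N float;
SET @N = 1000000;  -- assumed input cardinality, for illustration only

SELECT g.conjuncts,
       POWER(@N, -g.exponent)    AS guessed_selectivity,
       POWER(@N, 1 - g.exponent) AS guessed_cardinality
FROM (
    SELECT 1 AS conjuncts, CAST(1 AS float) / 4 AS exponent
    UNION ALL SELECT 2,   CAST(5 AS float) / 16
    UNION ALL SELECT 3,   CAST(21 AS float) / 64
    UNION ALL SELECT 4,   CAST(85 AS float) / 256
    UNION ALL SELECT 175, CAST(1 AS float)
) AS g
ORDER BY g.conjuncts;
```

With N = 1,000,000, a single guessed conjunct corresponds to a selectivity of about 0.0316, or roughly 31,600 rows, while 175 guessed conjuncts reduce the estimate to a single row.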

This behavior has changed in SQL Server 2005. Applications that are being ported from SQL Server 2000 to SQL Server 2005 should re-examine whether their UDFs can be schema bound, and make them schema bound if they are not already so marked.
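
A UDF is marked schema bound with the `WITH SCHEMABINDING` option. The following is a minimal sketch; the function `dbo.AddRaise` is a hypothetical example and does not come from any particular application.

```sql
-- Hypothetical scalar UDF, marked schema bound.
CREATE FUNCTION dbo.AddRaise (@amount money)
RETURNS money
WITH SCHEMABINDING
AS
BEGIN
    -- The function references no tables, so binding it to the schema is straightforward.
    RETURN @amount * 1.10;
END;
```

For functions that do reference tables or views, schema binding requires two-part names for the referenced objects and prevents those objects from being changed in ways that would affect the function.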

__Background on Halloween Protection.__

"Halloween protection" in database systems refers to a solution for a problem that can occur in update queries. The problem occurs when an update itself affects which rows are selected for update. For example, imagine a company wants to give every employee a 10% pay increase. If the update query is unlucky enough to walk the salary index, a row may be selected and updated, the update moves the row ahead in the index, and the row is then redundantly selected again and again. This problem is corrected by isolating the rows chosen from the effects of the update itself. For example, a SPOOL operation, which stores all the rows to be updated outside the context of any index, can provide the necessary isolation. SORTs are also sufficient for isolation purposes.

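The pay-increase example can be written as the following sketch. The table `dbo.Employee`, its `Salary` column, and the assumption that an index exists on `Salary` are hypothetical. Displaying the estimated plan is one way to look for an isolating operator such as a spool or sort, although the exact plan shape depends on the schema and the optimizer's choices.

```sql
-- Hypothetical table dbo.Employee(EmployeeID, Salary), assumed to have an index on Salary.
SET SHOWPLAN_TEXT ON;
GO

-- With SHOWPLAN_TEXT on, the statement is not executed; only its plan is returned.
UPDATE dbo.Employee
SET Salary = Salary * 1.10;
GO

SET SHOWPLAN_TEXT OFF;
GO
```
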
Ascending key columns, such as IDENTITY columns or datetime columns representing real-world timestamps, can cause inaccurate statistics in tables with frequent INSERTs because new values all lie outside the histogram. Consider updating statistics on such columns frequently with a batch job if your application seems to be getting inadequate query plans for queries that have a condition on the ascending key column. How often to run the batch job depends on your application. Consider daily or weekly intervals, or more often if needed for your application. Alternatively, trigger the job based on an application event, such as after a bulk load or after a certain number of INSERT operations.
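
The body of such a batch job can be as small as a single `UPDATE STATISTICS` statement. In the sketch below, the table `dbo.Orders` and the statistics name `IX_Orders_OrderDate` are hypothetical. `WITH FULLSCAN` is one possible choice; a sampled update may be sufficient and cheaper on large tables.

```sql
-- Refresh the statistics on the ascending key column of a hypothetical table.
UPDATE STATISTICS dbo.Orders IX_Orders_OrderDate WITH FULLSCAN;
```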

When a query requires statistics on the result of a

SELECT * FROM Lineitem l WHERE EXISTS

( SELECT * FROM Region1 r1 WHERE r1.C = l.C and r1.S = l.S

UNION SELECT * FROM Region2 r2 WHERE r2.C = l.C and r2.S = l.S )

In order for the EXISTS to be optimized accurately, it may be necessary to have a multi-column statistic on columns C and S for each of Lineitem, Region1 and Region2. If the multi-column statistic does not exist in either Region1 or Region2, then the remaining statistics cannot be used to optimize this query.
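
As a sketch, the three multi-column statistics could be created as follows. The statistics names are hypothetical placeholders; the table and column names are those used in the query above.

```sql
-- One multi-column statistics object on (C, S) per table involved in the EXISTS.
CREATE STATISTICS stat_Lineitem_C_S ON Lineitem (C, S);
CREATE STATISTICS stat_Region1_C_S  ON Region1 (C, S);
CREATE STATISTICS stat_Region2_C_S  ON Region2 (C, S);
```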

When a query has a multi-column condition, consider creating multi-column statistics if you suspect that the optimizer is not producing the best plan for the query. You get multi-column statistics as a by-product of creating a multi-column index, so if there is already a multi-column index that supports the multi-column condition, there is no need to create statistics explicitly. Auto create statistics only creates single-column statistics, never multi-column statistics. So if you need multi-column statistics, create them manually, or create a multi-column index.

Consider a query that accesses the AdventureWorks.Person.Contact table, and contains the following condition:

FirstName = 'Catherine' AND LastName = 'Abel'

To make selectivity estimation more accurate for this query, create the following statistics object:

CREATE STATISTICS

This statistics object will be useful for queries that contain predicates on LastName and FirstName, as well as LastName alone. In general, the selectivity of a predicate on any prefix of the set of columns in a multi-column statistics object can be estimated using that statistics object.
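
A statistics object matching this description, with LastName as the leading column, could be created as in the following sketch. The name `stat_Contact_LastName_FirstName` is a hypothetical placeholder.

```sql
-- Multi-column statistics with LastName leading, so they also serve predicates on LastName alone.
CREATE STATISTICS stat_Contact_LastName_FirstName
ON Person.Contact (LastName, FirstName);

-- Inspect the header, density vector, and histogram of the new statistics object.
DBCC SHOW_STATISTICS ('Person.Contact', stat_Contact_LastName_FirstName);
```

The density vector in the output lists one entry for LastName and one for the combination (LastName, FirstName), which corresponds to the prefixes the object can support.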

For a statistics object to fully support a multi-column condition, a *prefix* of the columns in the multi-column statistics object must contain the columns in the condition. For example, a multi-column statistics object on columns (a,b,c) only partially supports the condition a=1 AND c=1; the histogram will be used to estimate the selectivity for a=1, but the density information for c will not be used since b is missing from the condition. Multi-column statistics on (a,c) or (a,c,b) would support the condition a=1 AND c=1, and density information could be used to improve selectivity estimation.

For a large majority of SQL Server installations, the most important best practice is to use auto create and auto update statistics database-wide. Auto create and auto update statistics are on by default. If you observe bad plans and suspect that missing or out of date statistics are at fault, verify that auto create and auto update statistics are on.
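
One way to verify and, if necessary, enable these settings is sketched below. The AdventureWorks database is used as the example; substitute your own database name.

```sql
-- Check the current settings (1 = on, 0 = off).
SELECT name, is_auto_create_stats_on, is_auto_update_stats_on
FROM sys.databases
WHERE name = 'AdventureWorks';

-- Turn the options on if they were found to be off.
ALTER DATABASE AdventureWorks SET AUTO_CREATE_STATISTICS ON;
ALTER DATABASE AdventureWorks SET AUTO_UPDATE_STATISTICS ON;
```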

An alternative to enabling auto create statistics is to create statistics manually using CREATE STATISTICS or **sp_createstats**. Note that auto-statistics will not work for read-only databases.
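
A minimal sketch of the manual route follows. The statistics name `stat_Contact_EmailAddress` is a hypothetical placeholder, and the column is chosen only as an example.

```sql
USE AdventureWorks;
GO

-- Create single-column statistics on all eligible columns of all user tables in the current database.
EXEC sp_createstats;

-- Or create statistics on one specific column.
CREATE STATISTICS stat_Contact_EmailAddress
ON Person.Contact (EmailAddress);
```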

In some cases, a SQL statement can be simplified by using procedural logic. Instead of issuing one complex query to cover multiple cases with

Limit use of multi-statement table valued functions (TVFs) and table variables in situations where getting a high-performance plan is required. Both multi-statement TVFs and table variables have no statistics, so the optimizer must guess the size of their results. Similarly, their result columns do not have statistics, and the optimizer must resort to guesses for predicates involving these columns. If a bad plan results because of these guesses, consider using an inline TVF, or a standard table or temporary table as a temporary holding place for the results of the TVF or as a replacement for the table variable. This will allow the optimizer to use better cardinality estimates. Note that this technique may also lead to an increased number of statement recompilations.
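
The temporary-table approach can be sketched as follows. The multi-statement TVF `dbo.GetRecentOrders` and the table `dbo.Customers` are hypothetical names used only for illustration.

```sql
-- Hold the TVF result in a temporary table, which can have statistics.
CREATE TABLE #RecentOrders (
    OrderID    int PRIMARY KEY,
    CustomerID int NOT NULL
);

INSERT INTO #RecentOrders (OrderID, CustomerID)
SELECT OrderID, CustomerID
FROM dbo.GetRecentOrders('20050101');   -- hypothetical multi-statement TVF

-- The join is now estimated against a table with known cardinality and statistics.
SELECT c.CustomerID, COUNT(*) AS order_count
FROM #RecentOrders AS r
JOIN dbo.Customers AS c
    ON c.CustomerID = r.CustomerID
GROUP BY c.CustomerID;

DROP TABLE #RecentOrders;
```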