SQL Optimizations – Manually updating statistics on small tables may have a big impact.



Problem:


When I analyzed a performance problem in an ISV data warehouse solution running on SQL Server 2000, I noticed that the query plan of a star join (a join between a huge fact table and a bunch of small dimension tables) didn’t look right. Further investigation showed that certain optimizer estimates didn’t correspond to the real values at runtime (e.g. the number of rows in the result of a join). So I wondered whether the statistics information was up to date. It looked fine for the big fact table and some of the bigger dimension tables, but it was missing on the very small dimension tables; the statistics information was just not there at all. The automatic statistics setting was turned on though, and the customer told me this had always been the case.
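

To see whether a table carries any statistics at all, something along these lines can serve as a rough sketch (the table name dbo.DimRegion and the index name PK_DimRegion are just placeholders for illustration, not names from the customer scenario):

    -- List all statistics (index and column statistics) on a small dimension table
    EXEC sp_helpstats 'dbo.DimRegion', 'ALL'

    -- If a statistics object shows up, inspect its histogram and density details;
    -- replace PK_DimRegion with a real index or statistics name from the output above
    DBCC SHOW_STATISTICS ('dbo.DimRegion', 'PK_DimRegion')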


Then I got the explanation from the Optimizer dev team that SQL Server 2000 ignores very small tables (normal tables, not temp tables) for automatic statistics creation. The threshold is about 500 rows. Unfortunately this can happen quite often in relational data warehouse solutions that use star schemas. The effect of joining a fact table with a few hundred million rows to some small dimensions the wrong way can be dramatic – in a negative sense.
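

A quick way to find the tables that might be affected is to look for user tables below the threshold in the SQL Server 2000 system tables (just a sketch; the 500-row limit is an approximation):

    -- List user tables with fewer than ~500 rows, the range in which
    -- automatic statistics creation may be skipped
    SELECT o.name AS table_name, i.rows AS row_count
    FROM sysobjects o
    JOIN sysindexes i ON i.id = o.id
    WHERE o.xtype = 'U'          -- user tables only
      AND i.indid IN (0, 1)      -- heap (0) or clustered index (1) carries the row count
      AND i.rows < 500
    ORDER BY i.rows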


Solution:


Run UPDATE STATISTICS manually on the small tables. Depending on how static these tables are (frequency of updates), it might be enough to do this only once so that the statistics are no longer missing entirely. Another option is to upgrade to SQL Server 2005, where this issue with the initial creation of statistics information is fixed.
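

A minimal sketch of what this could look like (the table, column and statistics names are hypothetical, and WITH FULLSCAN is cheap here because the tables are tiny):

    -- Refresh statistics on the small dimension tables
    UPDATE STATISTICS dbo.DimRegion WITH FULLSCAN
    UPDATE STATISTICS dbo.DimCurrency WITH FULLSCAN
    UPDATE STATISTICS dbo.DimScenario WITH FULLSCAN

    -- If a join column has no statistics object at all, one can be created explicitly
    CREATE STATISTICS st_DimRegion_RegionKey ON dbo.DimRegion (RegionKey) WITH FULLSCAN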


You will find additional useful information about automatic statistics in the following article:


http://support.microsoft.com/kb/195565/en-us


What neither SQL Server 2000 nor SQL Server 2005 solves yet is managing outdated or obsolete statistics on tiny tables (in terms of number of rows).


This is much easier to handle with huge tables. If you add one row to a 1-million-row table, it will very likely not matter for the data distribution. But if you add one row to a 1-row table, you double the data. The key question is: how often should the statistics be updated? If it happens too often (e.g. with every single insert, update or delete), it might thrash the whole system.
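

One pragmatic middle ground, shown here only as a sketch against the SQL Server 2000 system tables, is to refresh statistics on the tiny tables only when their modification counter indicates that the data has actually changed:

    -- Small tables whose rows have been modified since the last statistics update;
    -- these are candidates for a manual UPDATE STATISTICS run
    SELECT o.name AS table_name, i.rows, i.rowmodctr
    FROM sysobjects o
    JOIN sysindexes i ON i.id = o.id
    WHERE o.xtype = 'U'
      AND i.indid IN (0, 1)
      AND i.rows < 500
      AND i.rowmodctr > 0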


Results:


Depending on the query and the amount of data, the difference in response time can be anything from a few percent up to a factor of 100 or even more. In my specific customer case the query response time went down from 30 minutes to three minutes after manually running UPDATE STATISTICS on the small dimension tables.


 


Comments (7)

  1. Peter K says:

    I have a case in a star schema where the fact has many small dimensions around it and a handful of large dimensions.

    The small dimensions typically have between 10 and 100 rows in them.

    When the fact is joined out to all the dimensions, the optimizer halves its estimate of the size of the result set each time it joins to one of the small dimensions.

    If the initial result set is say 100k rows, the estimate drops to 50K after the join to the first small dimension, 25k after the second, 12k after the third… roughly speaking.

    Once the estimate of the result set gets to about 10k rows, the joins switch from hash joins to nested loop joins.

    Net result is lots of nested loop joins occurring on result sets that are 100k rows and bigger.

    The stats have been updated manually with: update statistics <tablename> with fullscan

    Once a bit of noise is added to the dimensions, the optimizer starts to get the estimates of the result set correct.

    About 1,000 rows seems to do the trick.

    Is there a simple way to get the optimizer to make correct estimates of the size of the result set on these joins to the small dimensions?
