Sampling can produce less accurate statistics if the data is not evenly distributed

 

Recently I worked with a very knowledgable customer who called in and wanting to know things about statistics.  This is because he noticed that his query would get inaccurate cardinality estimate due to ‘inaccurate histogram.    Specifically, he has questioned why AVG_RANGE_ROWS would be very high when he did 10% sampling.  But it became very low (almost distinct) when he used 100% sampling.

In order to illustrate the issue, let me create a fake table and populate data using this script:

create database dbStats
go
go
alter database dbstats set recovery simple
go
use dbStats
go
create table t(c1 uniqueidentifier)

go
set nocount on
begin tran
declare @i int = 0
declare @id1 uniqueidentifier = newid()
while @i< 8000000
begin

    declare @id uniqueidentifier
if @i % 100 = 0
set @id = NEWID()
insert into t values (NEWID())
insert into t values (@id)
if @i < 2000000
insert into t values (@id1)
set @i +=1
if (@i % 100000 = 0)
begin
commit tran

Technorati Tags: Performance

begin tran
end
end
commit tran
go

create index ix on t(c1)
go

 

If you update the statistics with 10% sampling, you will get histogram 1 (below).  But if you update statistics with 100% sampling, you will get histogram 2.  Note that one major difference is that AVG_RANGE_ROWs are much higher in histogram 1 than in histogram 2.

In fact, as you increase sampling rate from 10% to a larger number, the AVG_RANGE_ROWS  will gradually decrease.  First of all, AVG_RANGE_ROWS basically means for any value within a histogram step, how many duplicates are there.   If you have a value of 2 for AVG_RANGE_ROWS, it means for any given value within the histogram step,  it will have 2 duplicates.   SQL Server optimizer uses this to do cardinality estimate for the values falling within a histogram step.

In order to explain what’s going on, let’s take a look at data first.   The data is constructed in a way that is not evenly distributed.   The column has 8 million  distinct values that only appear once.  There is one value that appears  2 million times.  Then there are 80,000 values that appear 100 times within the table.

So overall, the data is very selective.  Out of 18 million rows, there are more than  distinct 8 million values.   The customer’s argument is that for any given value that falls within a histogram, SQL really should estimate less than 2 rows.  AVG_RANGE_ROWS should be less than 2 rows.

When you do 100% sampling, the AVG_RANGE_ROWS is 1.8 or 2.  So it’s accurate.  But with 10% sampling, most AVG_RANGE_ROWS is 18 (much higher).

The reason is that the data is not evenly distributed.   If the data is truly evenly distributed, 10% sampling and 100% sampling will produce similar results.   But if some values appear way more than other values, sampling produce less accurate results.  This is because the more frequent values will have higher chance of being selected to compute statistics. At final stage, the values are scaled up to produce the ‘inflated’ statistics.   This eventually ends up with a statistics histogram that tells SQL Server that data is less selective than really is.

What’s the solution?

There are a couple of things. If you can afford fullscan (100%) or increasing sampling, do that.   If you can’t, you may have to rely in index hints for some queries.

 

Histogram 1 (10% sampling)

RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
------------------------------------ ------------- ------------- -------------------- --------------
ADA7D301-B912-4927-BC84-0000730876EE 0 1 0 1
1D4D7A35-9E7B-46E7-BF87-01B486B2166B 113919.4 1005.627 6442 17.6839
2F22F6DE-A893-4826-A8B8-04E5A31C5819 187018.4 1005.627 11932 15.67406
2602AD30-5365-4DE6-BEFB-07E79CB221F0 193601.4 1005.627 11077 17.47708
ADC703C1-2CD6-4107-9177-09268149C7C6 73038.45 1005.627 4597 15.88914
968BE242-C378-4FDA-9A6F-0B883FCBAB52 169844.2 1005.627 9031 18.80737
87B47694-B775-4414-B638-103D45D87C84 303451.8 1005.627 17569 17.27227
7DE3FC0A-2A74-4C90-B0E3-12177F27B7F2 110587.5 1005.627 6711 16.47944
7B9DB839-7F96-46A6-821D-130BC4345A90 66637.23 1005.627 3668 18.16596
BFC1684B-EB83-4D20-B0DF-150805AC5A8E 109729.3 1005.627 7341 14.9474

Histogram 2 (100% sampling)

RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
------------------------------------ ------------- ------------- -------------------- --------------
7972667E-CC57-42F4-A414-00000190A98E 0 1 0 1
CFF7C764-41AA-4E89-B802-00AA81D1D512 39129 100 21012 1.862222
473C80D5-A9CC-4186-B4B3-093CCCEAD064 538744 100 270949 1.988359
61E3899D-74CD-4090-9E76-0A031A54EFD6 44973 100 24480 1.837132
E857ABD0-DA6D-4367-92C7-1D98186AF634 1224175 100 617701 1.981825
8863C50F-160D-4DD3-A1C4-1E24D9DE79D3 31413 100 17454 1.799759
3AF1D87E-805E-49B6-9870-301658472C85 1109208 100 565698 1.960778
20BEE350-A654-43F3-BB84-30A978964057 33231 100 17985 1.847706
443AC49E-FE4B-4E32-9FF9-47A667FF1E0F 1451375 100 725309 2.001044
5BAFE28B-6AF7-42D8-BB9E-48C3D1FB3F7B 67121 100 35342 1.899185
FE43EE54-4D31-4CCF-B231-4D6E02B3F852 295111 100 147502 2.000726
8EC2F5D1-50BC-4452-8F9A-4DFD046D3E5B 32713 100 17566 1.862291
C636084E-7EC5-48B6-A273-5451C9EE8269 393640 100 199600 1.972144
75DE7AB9-EEE9-4426-A907-54E4F00BA1C1 33538 100 18292 1.833479
5F5A4E68-3573-4EB3-8ABC-5ACA7ABE7C73 361499 100 185873 1.944871
48BC6F49-D712-4B7B-A91D-5BC822179B2F 65751 100 31497 2.087532
B22C40FE-1F99-47F8-872D-ACFB95246E52 5078911 100 2564212 1.980691

 

Jack Li | Senior Escalation Engineer | Microsoft SQL Server Support