Performance: Evaluate Data Skew

This topic applies to both Azure SQL Data Warehouse and Analytic Platform System

Data skew occurs when one distribution has more data than others.  When data is inserted into a distributed table, each row will be assigned and sent to a distribution for storage.  The distribution a row is sent to is decided by applying a hash algorithm to the value in the distribution column specified at table creation.  The same value will always go to the same distribution. 

Skew comes into play when the data has a small portion of values that have a large number of duplicates.  While you are able to utilize DBCC PDW_SHOWSPACEUSED ( " [ database_name . [ schema_name ] . ] | [ schema_name .] table_name  " )   to get the row count per distribution, it will not give you insight into the density of the data.  If you want to see the actual values the data is skewed on, you can run the following query against the instance:

 

SELECT Count([distribution_column]) AS [Row Count], [distribution_column] FROM [table_name] group by [distribution_column] ORDER BY [Row Count] DESC

 

If there is a large disparity in the row counts for a couple of values, the distributions those land on will have skew. You will have to decide if this skew is acceptable for your application. In general we say 20% is acceptable.