Ask Learn
Preview
Ask Learn is an AI assistant that can answer questions, clarify concepts, and define terms using trusted Microsoft documentation.
Please sign in to use Ask Learn.
Sign inThis browser is no longer supported.
Upgrade to Microsoft Edge to take advantage of the latest features, security updates, and technical support.
Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
This topic applies to both Azure SQL Data Warehouse and Analytic Platform System
Data skew occurs when one distribution has more data than others. When data is inserted into a distributed table, each row will be assigned and sent to a distribution for storage. The distribution a row is sent to is decided by applying a hash algorithm to the value in the distribution column specified at table creation. The same value will always go to the same distribution.
Skew comes into play when the data has a small portion of values that have a large number of duplicates. While you are able to utilize DBCC PDW_SHOWSPACEUSED ( " [ database_name . [ schema_name ] . ] | [ schema_name .] table_name " ) to get the row count per distribution, it will not give you insight into the density of the data. If you want to see the actual values the data is skewed on, you can run the following query against the instance:
SELECT Count([distribution_column]) AS [Row Count], [distribution_column] FROM [table_name] group by [distribution_column] ORDER BY [Row Count] DESC
If there is a large disparity in the row counts for a couple of values, the distributions those land on will have skew. You will have to decide if this skew is acceptable for your application. In general we say 20% is acceptable.
Ask Learn is an AI assistant that can answer questions, clarify concepts, and define terms using trusted Microsoft documentation.
Please sign in to use Ask Learn.
Sign in