TSQL - Solve it YOUR Way - Finding the percentage of NULL values for each column in a table

Introduction:

As part of the blog series TSQL - Solve it YOUR Way, today's topic will cover a question asked in the Transact-SQL MSDN Forums here, followed by three different solutions from three of the more helpful and creative contributors in the TSQL MSDN forums, Jens Suessmeyer, Naomi Nosonovsky, and Jingyang Li.

Topic: How can I find the percentage of NULL values for each column in my table?

For various reasons, people may need to determine the percentage of values in each column that are NULL.  Perhaps you are trying to calculate the percentage to determine if your table could save disk space by leveraging SPARSE columns.  In another real-world example from the forums, a user had an implementation where partial rows in a table were being updated at different times.  When all rows were updated with a value and were thus NOT NULL, the table was considered "complete" and could be processed.  For this reason, the customer was trying to determine when the percentage of NULL values hit 0%.  In yet another case, the user wanted to present a comparison chart throughout the entire database showing the ratio of NULL values to NOT NULL values.  For the final example, take a look at the explanation provided below by Jens where he solved this problem at a customer site. 

Solution #1 - Provided by Jens Suessmeyer

Code Snippet

  1. SET NOCOUNT ON

  2. DECLARE @Statement NVARCHAR(MAX) = ''

  3. DECLARE @Statement2 NVARCHAR(MAX) = ''

  4. DECLARE @FinalStatement NVARCHAR(MAX) = ''

  5. DECLARE @TABLE_SCHEMA SYSNAME = 'Production'

  6. DECLARE @TABLE_NAME SYSNAME = 'Product'

  7. SELECT

  8.         @Statement = @Statement + 'SUM(CASE WHEN ' + COLUMN_NAME + ' IS NULL THEN 1 ELSE 0 END) AS ' + COLUMN_NAME + ',' + CHAR(13) ,

  9.         @Statement2 = @Statement2 + COLUMN_NAME + '*100 / OverallCount AS ' + COLUMN_NAME + ',' + CHAR(13) FROM

  10. INFORMATION_SCHEMA.COLUMNS

  11. WHERE TABLE_NAME = @TABLE_NAME AND TABLE_SCHEMA = @TABLE_SCHEMA

  12. IF @@ROWCOUNT = 0

  13.         RAISERROR('TABLE OR VIEW with schema "%s" and name "%s" does not exists or you do not have appropriate permissions.',16,1, @TABLE_SCHEMA, @TABLE_NAME)

  14. ELSE

  15. BEGIN

  16.         SELECT @FinalStatement =

  17.                 'SELECT ' + LEFT(@Statement2, LEN(@Statement2) -2) + ' FROM (SELECT ' + LEFT(@Statement, LEN(@Statement) -2) +

  18.                 ', COUNT(*) AS OverallCount FROM ' + @TABLE_SCHEMA + '.' + @TABLE_NAME + ') SubQuery'

  19.         EXEC(@FinalStatement)

Explanation of Jens' solution:

To put a bit more context to my solution, I was at a customer site where they found out that their API was occasionally losing the information on the way from the front end to the database, therefore NULLing out specific fields which should not be NULL. While they checked through their database and wrote queries for their hundreds of tables, they asked me working for another work stream to write something they can use. The simple idea is to grab a Table Name (and of course the appropriate schema) and create a Dynamic SQL statement which will calculate the percentage of NULL values in a specific table.

From the query the following information will be returned:

In detail we see that the Color (on purpose or not) is not always entered showing a 49% of NULL values in that column. 58% for Size, 65% for SizeUnitMeasureCode, etc. The rest does have (at least NON NULL) values. If you would be also interested in the percentage of that are NULL OR Empty the query can be easily tweaked by changing the source code the following line from:

@Statement = @Statement + 'SUM(CASE WHEN ' + COLUMN_NAME + ' IS NULL THEN 1 ELSE 0 END) AS ' + COLUMN_NAME + ',' + CHAR(13) ,

to

@Statement = @Statement + 'SUM(CASE WHEN LEN(ISNULL(' + COLUMN_NAME + ',' + REPLICATE(CHAR(39),2) + ')) = 0 THEN 1 ELSE 0 END) AS ' + COLUMN_NAME + ',' + CHAR(13) ,

Leading to a change in the generated code of:

SUM(CASE WHEN LEN(ISNULL(CompanyName, '')) = 0 THEN 1 ELSE 0 END) AS CompanyName,

Feel free to include your own logic and forensic pattern to find all the black holes in your databases.

Solution #2 - Provided by Naomi Nosonovsky

Code Snippet

  1. DECLARE @TotalCount DECIMAL(10, 2)

  2.     ,@SQL NVARCHAR(MAX)

  3. SELECT @TotalCount = COUNT(*)

  4. FROM [AdventureWorks].[Production].[Product]

  5. SELECT @SQL = STUFF((

  6.             SELECT ', CAST(SUM(CASE WHEN ' + Quotename(C.COLUMN_NAME) + ' IS NULL THEN 1 ELSE 0 END) * 100.00

  7. /@TotalCount AS decimal(10,2)) AS [' + C.COLUMN_NAME + ' NULL %]

  8. '

  9.             FROM INFORMATION_SCHEMA.COLUMNS C

  10.             WHERE TABLE_NAME = 'Product'

  11.                 AND TABLE_SCHEMA = 'Production'

  12.             ORDER BY C.ORDINAL_POSITION

  13.             FOR XML PATH('')

  14.                 ,type

  15.             ).value('.', 'nvarchar(max)'), 1, 2, '')

  16. SET @SQL = 'SET @TotalCount = NULLIF(@TotalCount,0)

  17. SELECT ' + @SQL + '

  18. FROM [AdventureWorks].Production.Product'

  19. PRINT @SQL

  20. EXECUTE SP_EXECUTESQL @SQL

  21.     ,N'@TotalCount decimal(10,2)'

  22.     ,@TotalCount

Explanation of Naomi's solution:

  1. To get the percent of null values in a column we use this formula

    sum(case when ColumnName IS NULL then 1 else 0 end) * 100.00 / @TotalCount

    where @TotalCount is the number of rows in a table.

  2. We need to generate the above statement dynamically, so we use INFORMATION_SCHEMA.Columns view to do so.

  3. I use FOR XML PATH('') approach to concatenate information into one variable

  4. I use sp_executeSQL system stored procedure to run the dynamic SQL in order to supply a variable. Alternatively, of course, I could have declared and calculated @TotalCount inside the dynamic SQL.

Solution #3 - Provided by Jingyang Li

Code Snippet

  1. SET NOCOUNT ON;

  2.    DECLARE @TABLE_NAMENVARCHAR(50) = 'Product',

  3.         @TABLE_SCHEMA NVARCHAR(50) = 'Production',

  4.         @sql          NVARCHAR(4000),

  5.         @col          NVARCHAR(50)

  6. CREATE TABLE #t

  7.   (

  8.      id      INT IDENTITY(1, 1),

  9.      ColName VARCHAR(50),

  10.      [NULL%] DECIMAL(8, 2)

  11.   )

  12. DECLARE c_cursor CURSOR FOR

  13.   SELECT column_Name

  14.   FROM   [INFORMATION_SCHEMA].[COLUMNS]

  15.   WHERE  TABLE_NAME = @TABLE_NAME

  16.          AND TABLE_SCHEMA = @TABLE_SCHEMA

  17.          AND IS_Nullable = 'YES'

  18. OPEN c_cursor;

  19. FETCH NEXT FROM c_cursor INTO @col;

  20. WHILE ( @@FETCH_STATUS = 0 )

  21.   BEGIN

  22.       SET @sql = N' INSERT INTO #t (ColName, [NULL%])

  23.         SELECT TOP 1 ''' + @col

  24.                  + ''' , (COUNT(*) over() *1.0- COUNT(' + @col

  25.                  + ') OVER()*1.0)/COUNT(*) Over() *100.0 as [NULL%] FROM '

  26.                  + Quotename(@TABLE_SCHEMA) + '.'

  27.                  + Quotename( @TABLE_NAME)

  28.       -- print @sql

  29.       EXEC (@sql);

  30.       FETCH NEXT FROM c_cursor INTO @col;

  31.   END

  32. CLOSE c_cursor;

  33. DEALLOCATE c_cursor;

  34.    SELECT ColName,  [NULL%] FROM#t

  35. DROP TABLE #t

Explanation of Jingyang's solution:

Jingyang initially sent a solution very similar to that of Naomi, so in order to provide an additional technique, he also provided the CURSOR based solution above.  The cursor is built by leveraging the INFORMATION_SCHEMA table as used in all three solutions, but Jingyang uses a nice optimization technique of only querying for Is_Nullable = 'YES' to only obtain NULLable fields.  In the CURSOR loop, the solution relies on some creative use of the OVER() function to generate the percentage and results, which are displayed in a way that is very readable and concise.

Conclusion:

As you can see, all three of the above solutions provide the intended result we were looking for, but do so in creatively different styles, this time including our first CURSOR based solution.  I hope that you are able to learn a lot by trying out the problem yourself and then reading through the additional solutions.

Special thanks to Jens, Naomi, and Jingyang for their valuable forums contribution and for contributing to this series!

Hope that helps,
Sam Lester (MSFT)

Contributor Bios:

Jens Suessmeyer is an MCS Consultant in Germany and has been working with SQL Server since version 6.0.  He is very active in the forums and in his blog (https://blogs.msdn.com/b/jenss), providing outstanding real-world solutions and insight.

Naomi Nosonovsky, Senior Software Developer, has more than 15 years of enterprise experience in analysis, design, development and implementation of high-performance client/server data management solutions. She is a Personality of the Year for 2008, 2009, 2010 and 2011 at www.universalthread.com in .NET, SQL Server & VFP categories. She is also an All Star contributor/MCC Community Contributor at forums.asp.net and MSDN T-SQL forum.  She also actively blogs at https://blogs.lessthandot.com/index.php/All/?disp=authdir&author=218 and https://beyondrelational.com/members/naomi/modules/2/posts.aspx?Tab=16.

Jingyang Li has been working with SQL Server since he began his IT career as an ASP.NET developer in 2001. He enjoys working with T-SQL and recently took a full time job as a DBA.  He has been participating in the Microsoft forums under the alias Limno.