TSQL – Solve it YOUR Way – Finding the percentage of NULL values for each column in a table


Introduction:

As part of the blog series TSQL – Solve it YOUR Way, today’s topic will cover a question asked in the Transact-SQL MSDN Forums here, followed by three different solutions from three of the more helpful and creative contributors in the TSQL MSDN forums, Jens Suessmeyer, Naomi Nosonovsky, and Jingyang Li.

Topic:  How can I find the percentage of NULL values for each column in my table?

For various reasons, people may need to determine the percentage of values in each column that are NULL.  Perhaps you are trying to calculate the percentage to determine if your table could save disk space by leveraging SPARSE columns.  In another real-world example from the forums, a user had an implementation where partial rows in a table were being updated at different times.  When all rows were updated with a value and were thus NOT NULL, the table was considered “complete” and could be processed.  For this reason, the customer was trying to determine when the percentage of NULL values hit 0%.  In yet another case, the user wanted to present a comparison chart throughout the entire database showing the ratio of NULL values to NOT NULL values.  For the final example, take a look at the explanation provided below by Jens where he solved this problem at a customer site. 

Solution #1 – Provided by Jens Suessmeyer

Code Snippet
  1. SET NOCOUNT ON
  2. DECLARE @Statement NVARCHAR(MAX) =
  3. DECLARE @Statement2 NVARCHAR(MAX) =
  4. DECLARE @FinalStatement NVARCHAR(MAX) =
  5. DECLARE @TABLE_SCHEMA SYSNAME = ‘Production’
  6. DECLARE @TABLE_NAME SYSNAME = ‘Product’
  7. SELECT
  8.         @Statement = @Statement + ‘SUM(CASE WHEN ‘ + COLUMN_NAME + ‘ IS NULL THEN 1 ELSE 0 END) AS ‘ + COLUMN_NAME + ‘,’ + CHAR(13) ,
  9.         @Statement2 = @Statement2 + COLUMN_NAME + ‘*100 / OverallCount AS ‘ + COLUMN_NAME + ‘,’ + CHAR(13) FROM
  10. INFORMATION_SCHEMA.COLUMNS
  11. WHERE TABLE_NAME = @TABLE_NAME AND TABLE_SCHEMA = @TABLE_SCHEMA
  12. IF @@ROWCOUNT = 0
  13.         RAISERROR(‘TABLE OR VIEW with schema “%s” and name “%s” does not exists or you do not have appropriate permissions.’,16,1, @TABLE_SCHEMA, @TABLE_NAME)
  14. ELSE
  15. BEGIN
  16.         SELECT @FinalStatement =
  17.                 ‘SELECT ‘ + LEFT(@Statement2, LEN(@Statement2) 2) + ‘ FROM (SELECT ‘ + LEFT(@Statement, LEN(@Statement) 2) +
  18.                 ‘, COUNT(*) AS OverallCount FROM ‘ + @TABLE_SCHEMA + ‘.’ + @TABLE_NAME + ‘) SubQuery’
  19.         EXEC(@FinalStatement)

Explanation of Jens’ solution:

To put a bit more context to my solution, I was at a customer site where they found out that their API was occasionally losing the information on the way from the front end to the database, therefore NULLing out specific fields which should not be NULL. While they checked through their database and wrote queries for their hundreds of tables, they asked me working for another work stream to write something they can use. The simple idea is to grab a Table Name (and of course the appropriate schema) and create a Dynamic SQL statement which will calculate the percentage of NULL values in a specific table.

From the query the following information will be returned:

In detail we see that the Color (on purpose or not) is not always entered showing a 49% of NULL values in that column. 58% for Size, 65% for SizeUnitMeasureCode, etc. The rest does have (at least NON NULL) values. If you would be also interested in the percentage of that are NULL OR Empty the query can be easily tweaked by changing the source code the following line from:

@Statement = @Statement + ‘SUM(CASE WHEN ‘ + COLUMN_NAME + ‘ IS NULL THEN 1 ELSE 0 END) AS ‘ + COLUMN_NAME + ‘,’ + CHAR(13) ,

to

@Statement = @Statement + ‘SUM(CASE WHEN LEN(ISNULL(‘ + COLUMN_NAME + ‘,’ + REPLICATE(CHAR(39),2) + ‘)) = 0 THEN 1 ELSE 0 END) AS ‘ + COLUMN_NAME + ‘,’ + CHAR(13) ,

Leading to a change in the generated code of:

SUM(CASE WHEN LEN(ISNULL(CompanyName, ”)) = 0 THEN 1 ELSE 0 END) AS CompanyName,

Feel free to include your own logic and forensic pattern to find all the black holes in your databases.

Solution #2 – Provided by Naomi Nosonovsky

Code Snippet
  1. DECLARE @TotalCount DECIMAL(10, 2)
  2.     ,@SQL NVARCHAR(MAX)
  3. SELECT @TotalCount = COUNT(*)
  4. FROM [AdventureWorks].[Production].[Product]
  5. SELECT @SQL = STUFF((
  6.             SELECT ‘, CAST(SUM(CASE WHEN ‘ + Quotename(C.COLUMN_NAME) + ‘ IS NULL THEN 1 ELSE 0 END) * 100.00
  7. /@TotalCount AS decimal(10,2)) AS [‘ + C.COLUMN_NAME + ‘ NULL %]
  8.             FROM INFORMATION_SCHEMA.COLUMNS C
  9.             WHERE TABLE_NAME = ‘Product’
  10.                 AND TABLE_SCHEMA = ‘Production’
  11.             ORDER BY C.ORDINAL_POSITION
  12.             FOR XML PATH()
  13.                 ,type
  14.             ).value(‘.’, ‘nvarchar(max)’), 1, 2, )
  15. SET @SQL = ‘SET @TotalCount = NULLIF(@TotalCount,0)
  16. SELECT ‘ + @SQL +
  17. FROM [AdventureWorks].Production.Product’
  18. PRINT @SQL
  19. EXECUTE SP_EXECUTESQL @SQL
  20.     ,N’@TotalCount decimal(10,2)’
  21.     ,@TotalCount

Explanation of Naomi’s solution:

  1. To get the percent of null values in a column we use this formula

    sum(case when ColumnName IS NULL then 1 else 0 end) * 100.00 / @TotalCount

    where @TotalCount is the number of rows in a table.

  2. We need to generate the above statement dynamically, so we use INFORMATION_SCHEMA.Columns view to do so.
  3. I use FOR XML PATH(”) approach to concatenate information into one variable
  4. I use sp_executeSQL system stored procedure to run the dynamic SQL in order to supply a variable. Alternatively, of course, I could have declared and calculated @TotalCount inside the dynamic SQL.

Solution #3 – Provided by Jingyang Li

Code Snippet
  1. SET NOCOUNT ON;
  2.    DECLARE @TABLE_NAMENVARCHAR(50) = ‘Product’,
  3.         @TABLE_SCHEMA NVARCHAR(50) = ‘Production’,
  4.         @sql          NVARCHAR(4000),
  5.         @col          NVARCHAR(50)
  6. CREATE TABLE #t
  7.   (
  8.      id      INT IDENTITY(1, 1),
  9.      ColName VARCHAR(50),
  10.      [NULL%] DECIMAL(8, 2)
  11.   )
  12. DECLARE c_cursor CURSOR FOR
  13.   SELECT column_Name
  14.   FROM   [INFORMATION_SCHEMA].[COLUMNS]
  15.   WHERE  TABLE_NAME = @TABLE_NAME
  16.          AND TABLE_SCHEMA = @TABLE_SCHEMA
  17.          AND IS_Nullable = ‘YES’
  18. OPEN c_cursor;
  19. FETCH NEXT FROM c_cursor INTO @col;
  20. WHILE ( @@FETCH_STATUS = 0 )
  21.   BEGIN
  22.       SET @sql = N’ INSERT INTO #t (ColName, [NULL%])
  23.         SELECT TOP 1 ”’ + @col
  24.                  + ”’ , (COUNT(*) over() *1.0- COUNT(‘ + @col
  25.                  + ‘) OVER()*1.0)/COUNT(*) Over() *100.0 as [NULL%] FROM ‘
  26.                  + Quotename(@TABLE_SCHEMA) + ‘.’
  27.                  + Quotename( @TABLE_NAME)
  28.       — print @sql
  29.       EXEC (@sql);
  30.       FETCH NEXT FROM c_cursor INTO @col;
  31.   END
  32. CLOSE c_cursor;
  33. DEALLOCATE c_cursor;
  34.    SELECT ColName,  [NULL%] FROM#t
  35. DROP TABLE #t

Explanation of Jingyang’s solution:

Jingyang initially sent a solution very similar to that of Naomi, so in order to provide an additional technique, he also provided the CURSOR based solution above.  The cursor is built by leveraging the INFORMATION_SCHEMA table as used in all three solutions, but Jingyang uses a nice optimization technique of only querying for Is_Nullable = ‘YES’ to only obtain NULLable fields.  In the CURSOR loop, the solution relies on some creative use of the OVER() function to generate the percentage and results, which are displayed in a way that is very readable and concise.

Conclusion:

As you can see, all three of the above solutions provide the intended result we were looking for, but do so in creatively different styles, this time including our first CURSOR based solution.  I hope that you are able to learn a lot by trying out the problem yourself and then reading through the additional solutions.

Special thanks to Jens, Naomi, and Jingyang for their valuable forums contribution and for contributing to this series!

Hope that helps,
Sam Lester (MSFT)

Contributor Bios:

Jens Suessmeyer is an MCS Consultant in Germany and has been working with SQL Server since version 6.0.  He is very active in the forums and in his blog (http://blogs.msdn.com/b/jenss), providing outstanding real-world solutions and insight.

Naomi Nosonovsky, Senior Software Developer, has more than 15 years of enterprise experience in analysis, design, development and implementation of high-performance client/server data management solutions. She is a Personality of the Year for 2008, 2009, 2010 and 2011 at www.universalthread.com in .NET, SQL Server & VFP categories. She is also an All Star contributor/MCC Community Contributor at forums.asp.net and MSDN T-SQL forum.  She also actively blogs at http://blogs.lessthandot.com/index.php/All/?disp=authdir&author=218 and http://beyondrelational.com/members/naomi/modules/2/posts.aspx?Tab=16.

Jingyang Li has been working with SQL Server since he began his IT career as an ASP.NET developer in 2001. He enjoys working with T-SQL and recently took a full time job as a DBA.  He has been participating in the Microsoft forums under the alias Limno.

Comments (7)

  1. A CLR function could prove to be more flexible for such a task, allowing to address a wider range of Data Profiling problems. Here is an example based on ExecuteScalarToInt CLR function:

    SELECT s.name [schema name]

    , T.name [table name]

    , C.name [column name]

    , dbo.ExecuteScalarToInt('SELECT SUM(CASE WHEN ' + c.name + ' IS NULL THEN 1 ELSE 0 END) * 100.00/count(*) Percentage FROM ' + S.name + '.' + T.name) Percentage

    FROM sys.columns C

       JOIN sys.tables T

          ON C.object_id = T.object_id

            JOIN sys.schemas S

               ON T.schema_id = S.schema_id

    Function's definition can be found in Data Profiling Using CLR Functions blog post (sql-troubles.blogspot.de/…/data-profiling-part-ii-using-clr.html). The query can be easily modified to use INFORMATION_SCHEMA.COLUMNS view.

  2. Here is the query revisited, this time based on INFORMATION_SCHEMA.COLUMNS view:

    SELECT TABLE_SCHEMA

    , TABLE_NAME

    , COLUMN_NAME

    , dbo.ExecuteScalarToInt('SELECT IsNull(SUM(CASE WHEN ' + QUOTENAME(COLUMN_NAME) + ' IS NULL THEN 1 ELSE 0 END) * 100.00/count(*), 0) Percentage FROM ' + QUOTENAME(TABLE_SCHEMA) + '.' + QUOTENAME(TABLE_NAME)) Percentage

    FROM INFORMATION_SCHEMA.COLUMNS

    WHERE TABLE_SCHEMA = 'Production'

     AND TABLE_NAME = 'Product'

    If decimals are needed, the ExecuteScalarToDecimal CLR function can be used (its definition is available in the above mentioned post):

    , dbo.ExecuteScalarToDecimal('SELECT IsNull(SUM(CASE WHEN ' + QUOTENAME(COLUMN_NAME) + ' IS NULL THEN 1 ELSE 0 END) * 100.00/count(*), 0) Percentage FROM ' + QUOTENAME(TABLE_SCHEMA) + '.' + QUOTENAME(TABLE_NAME)) Percentage

  3. Naomi N says:

    Interesting, do you think using CLR will give better performance here? Can you run a quick performance test?

    Thanks.

  4. I run all the above queries, including the one based on CLR function and for the given table, and all need about a second. I wanted to see their performance for the whole database. At a first glimpse, the third solution seemed the easiest to modify for this purpose. On the first run, the solution based on CLR function solution took 51 seconds, while the third solution took 58 seconds. On a second run the CLR function took 25 seconds, while the third took 56 seconds. Similar results as in the second run after several other runs. The results might be different on other machines, though I think the performance of CLR function is comparable at least with one of the three above solutions.

    The performance of a solution is quite important, though often is needed to consider also its flexibility considered as a dimension of maintenance, reuse, generalization, readability, etc.

  5. Out of three approach,i would vote for "Jingyang 's approach" because the real catch was using IS_NULLABLE = "YES",when we are trying to get NULL percentage in a column.

  6. Kk says:

    In my query I am using cursor for getting the null counts.

    Count(*),
    Count(@ColumnNM)

    But I am getting the same count for both.
    The Count(@ColumnNm) is giving the count for entire column and not excluding the Nulls.

    Any help will be appreciated.

    1. Naomi N says:

      If you use count(@ColumnName) where @ColumnName is a variable, it is the same as COUNT(1) and it will be the same as count(*), e.g. it will count all rows in a table. You need to construct a dynamic SQL instead.