Pattern: Replacing NULL values with a Derived Column


This pattern has been out there for so long I had contemplated taking it out of my Performance Design Patterns talk, but it looks like it still managed to help some people at SQL Saturday #71 last weekend. This pattern is straight forward, and especially important since the data flow threading improvements were made in 2008.

Problem

Your source data is “unclean” – specifically, it has NULL or missing values. You’d like to replace these NULL values with hard coded defaults, and replace missing dimension values with an unknown member key.

Solution 1: Slow

This is the traditional approach (and perhaps the more natural way of writing it) is to have a series of Lookups checking for existing dimension values. If the value is not found, you map the “No Match” output to a derived column to substitute in your unknown member key. You then Union All the two streams, and continue down to the next dimension member Lookup.

image

This pattern was especially slow in the 2005 due to a bug in the BIDS designer that would cause columns to be unnecessarily compared to each other (increasing exponentially for ever pair of Lookup + Union All components).

Solution 2: Fast

The main problem with Solution 1 is that the Union All component is asynchronous – it will create a full copy of the data flow buffer. It’s also completely unneeded. Instead of using the “No Match” output, we can choose to Ignore missing values. The keys for any missing values will be set to NULL. After all Lookups are complete, we perform all of the NULL value substitutions with a single derived column transformation.

image

Comments (2)

  1. Valentino Vranken says:

    Definitely a classic one Matt, and now nicely documented!  I've been applying this technique in every Fact table ETL package since a couple of years now 🙂

    Best regards,

    Valentino.

  2. Tom Pataluch says:

    Thanks for the tip.  Keep them coming 🙂

    Sincerely,

    Tom

Skip to main content