Performing range lookups (i.e. to find a key for a given range) is a common ETL operation in data warehousing scenarios. It's especially for historical loads and late arriving fact situations, where you're using type 2 dimensions and you need to locate the key which represents the dimension value for a given point in time.
This blog post outlines three separate approaches for doing range lookups in SSIS:
- Using the Lookup Transform
- Merge Join + Conditional Split
- Script Component
All of our scenarios will use the AdventureWorksDW2008 sample database (DimProduct table) as the dimension, and take its fact data from AdventureWorks2008 (SalesOrderHeader and SalesOrderDetail tables). The "ProductNumber" column from the SalesOrderDetail table maps to the natural key of the DimProduct dimension (ProductAlternateKey column). In all cases we want to lookup the key (ProductKey) for the product which was valid (identified by StartDate and EndDate) for the given OrderDate.
One last thing to note is that the Merge Join and Script Component solutions assume that a valid range exists for each incoming value. The Lookup Transform approach is the only one that will identify rows that have no matches (although the Script Component solution could be modified to do so as well).
The Lookup Transform was designed to handle 1:1 key matching, but it can also be used in the range lookup scenario by using a partial cache mode, and tweaking the query on the Advanced Settings page. However, the Lookup doesn't cache the range itself, and will end up going to the database very often - it will only detect a match in its cache if all of the parameters are the same (i.e. same product purchased on the same date).
We can use the following query to have the lookup transform perform our range lookup:
select [ProductKey], [ProductAlternateKey], [StartDate], [EndDate] from [dbo].[DimProduct] where [ProductAlternateKey] = ? and [StartDate] <= ? and ( [EndDate] is null or [EndDate] > ? )
On the query parameters page, we map 0 -> ProductNumber, 1 and 2 -> OrderDate.
This approach is effective and easy to setup, but it is pretty slow when dealing with a large number of rows, as most lookups will be going to the database.
Merge Join and Conditional Split
This approach doesn't use the Lookup Transform. Instead we use a Merge Join Transform to do an inner join on our dimension table. This will give us more rows coming out than we had coming in (you'll get a row for every repeated ProductAlternateKey). We use the conditional split to do the actual range check, and take only the rows that fall into the right range.
For example, a row coming in from our source would contain an OrderDate and ProductNumber, like this:
From the DimProduct source, we take three additional columns - ProductKey (what we're after), StartDate and EndDate. The DimProduct dimension contains three entries for the "LJ-0192-L" product (as its information, like unit price, has changed over time). After going through the Merge Join, the single row becomes three rows.
We use the Conditional Split to do the range lookup, and take the single row we want. Here is our expression (remember, in our case an EndDate value of NULL indicates that it's the most current row):
StartDate <= OrderDate && (OrderDate < EndDate || ISNULL(EndDate))
This approach is a little more complicated, but performs a lot better than using the Lookup Transform.
The third approach uses a custom script component to perform the lookup. I wrote the script in two ways - one that simulates a "Full Cache" type lookup, and one that is similar to partial cache except it pulls back all values for a given natural key, instead of just the one for the given date range. The caching behavior is controlled by the PreCache boolean package variable.
I ran the three packages using the following environment (my laptop):
- Dual core Intel 1.8ghz
- 3gb of RAM
- AdventureWorks2008 and AdventureWorksDW2008
- ~120,000 order rows (SalesOrderDetail)
- ~600 reference rows (DimProduct)
Here are the results, in rows per second (larger being better):
At 120k+ rows per second, we can see that the custom script (or better yet, a custom transform) is the best alternative here. We can also see that even though the Lookup approach was by far the slowest (3639 rows / second), it is still a viable choice when you're processing a small number of rows.
There's a couple of reasons that the Lookup Transform performs poorly here. First, because it's not able to pre-cache any of the reference data, it has to go to the database often. Second, it matches only on actual parameter values - it doesn't have a concept of ranges. Since it will only find a cache hit if all parameters are the same, it ends up hitting the database for almost every row (120k times). By comparison, the script component will only query once per unique ProductNumber (~600 times max).
So there you have three different approaches for doing range lookups in SSIS. I'm hoping we'll be able to either enhance the Lookup component to support this functionality in the future, or perhaps provide a new transform to handle this case.
In the mean time, please feel free to post / email any alternative approaches you might have.
I've attached the packages used in this post incase you want to try out the different options for yourself.