Lookup Pattern: Range Lookups


Performing range lookups (i.e. to find a key for a given range) is a common ETL operation in data warehousing scenarios. It’s especially for historical loads and late arriving fact situations, where you’re using type 2 dimensions and you need to locate the key which represents the dimension value for a given point in time.

This blog post outlines three separate approaches for doing range lookups in SSIS:

  1. Using the Lookup Transform
  2. Merge Join + Conditional Split
  3. Script Component

All of our scenarios will use the AdventureWorksDW2008 sample database (DimProduct table) as the dimension, and take its fact data from AdventureWorks2008 (SalesOrderHeader and SalesOrderDetail tables). The "ProductNumber" column from the SalesOrderDetail table maps to the natural key of the DimProduct dimension (ProductAlternateKey column). In all cases we want to lookup the key (ProductKey) for the product which was valid (identified by StartDate and EndDate) for the given OrderDate.

One last thing to note is that the Merge Join and Script Component solutions assume that a valid range exists for each incoming value. The Lookup Transform approach is the only one that will identify rows that have no matches (although the Script Component solution could be modified to do so as well).

Lookup Transform

The Lookup Transform was designed to handle 1:1 key matching, but it can also be used in the range lookup scenario by using a partial cache mode, and tweaking the query on the Advanced Settings page. However, the Lookup doesn’t cache the range itself, and will end up going to the database very often – it will only detect a match in its cache if all of the parameters are the same (i.e. same product purchased on the same date).

We can use the following query to have the lookup transform perform our range lookup:

select [ProductKey], [ProductAlternateKey], 
     [StartDate], [EndDate]
from [dbo].[DimProduct]
where [ProductAlternateKey] = ?
and   [StartDate] <= ?
and (
    [EndDate] is null or 
    [EndDate] > ?
)

On the query parameters page, we map 0 -> ProductNumber, 1 and 2 -> OrderDate.

image

This approach is effective and easy to setup, but it is pretty slow when dealing with a large number of rows, as most lookups will be going to the database.

Merge Join and Conditional Split

This approach doesn’t use the Lookup Transform. Instead we use a Merge Join Transform to do an inner join on our dimension table. This will give us more rows coming out than we had coming in (you’ll get a row for every repeated ProductAlternateKey). We use the conditional split to do the actual range check, and take only the rows that fall into the right range.

image 

For example, a row coming in from our source would contain an OrderDate and ProductNumber, like this:

table1

From the DimProduct source, we take three additional columns – ProductKey (what we’re after), StartDate and EndDate. The DimProduct dimension contains three entries for the "LJ-0192-L" product (as its information, like unit price, has changed over time). After going through the Merge Join, the single row becomes three rows.

table2

We use the Conditional Split to do the range lookup, and take the single row we want. Here is our expression (remember, in our case an EndDate value of NULL indicates that it’s the most current row):

StartDate <= OrderDate && (OrderDate < EndDate || ISNULL(EndDate))

table3

This approach is a little more complicated, but performs a lot better than using the Lookup Transform.

Script Component

The third approach uses a custom script component to perform the lookup. I wrote the script in two ways – one that simulates a "Full Cache" type lookup, and one that is similar to partial cache except it pulls back all values for a given natural key, instead of just the one for the given date range. The caching behavior is controlled by the PreCache boolean package variable.

Conclusion

I ran the three packages using the following environment (my laptop):

  • Dual core Intel 1.8ghz
  • 3gb of RAM
  • AdventureWorks2008 and AdventureWorksDW2008
    • ~120,000 order rows (SalesOrderDetail)
    • ~600 reference rows (DimProduct)

Here are the results, in rows per second (larger being better):

image 

At 120k+ rows per second, we can see that the custom script (or better yet, a custom transform) is the best alternative here. We can also see that even though the Lookup approach was by far the slowest (3639 rows / second), it is still a viable choice when you’re processing a small number of rows.

There’s a couple of reasons that the Lookup Transform performs poorly here. First, because it’s not able to pre-cache any of the reference data, it has to go to the database often. Second, it matches only on actual parameter values – it doesn’t have a concept of ranges. Since it will only find a cache hit if all parameters are the same, it ends up hitting the database for almost every row (120k times). By comparison, the script component will only query once per unique ProductNumber (~600 times max).

So there you have three different approaches for doing range lookups in SSIS. I’m hoping we’ll be able to either enhance the Lookup component to support this functionality in the future, or perhaps provide a new transform to handle this case.

In the mean time, please feel free to post / email any alternative approaches you might have.

I’ve attached the packages used in this post incase you want to try out the different options for yourself.

 

Comments (32)

  1. From the Lookup presentation I put together for the MS BI conference in October, here is a series of

  2. This is just a place holder. Please see the post on different approaches of doing Range Lookups in SSIS

  3. The SQL Server Integration Services team added valuable new caching options (and scalability) to the

  4. ravisharvindMSBI says:

    Hi,

    Can anybody explain in brief the code used in the script component?

    I have a scenario to implement the same.

    Any help would be greatly appreciated. Thanks

  5. Which part do you need explained?

  6. ravisharvindMSBI says:

    Basically the part within the while loop.

    Why is StartDate not considered in the code?

    I am a starter, so please bear with me.

    Thanks for your help.

  7. An optimization. It assumes that everything up to the next end date belongs to the previous record.

    ex.

    Key 1, EndDate 2001/01/01

    Key 20, EndDate 2003/01/01

    Key 40, EndDate 2008/01/01

    Key 60, EndDate NULL

    If a record with a transaction date of 1999 comes in, we can determine it belongs to Key 1. If a record with a transaction date of 2004 comes in, we know it belongs to Key 40.

    Note, this will only work if there are no gaps in your history. It assumes that your data has been cleaned before hand.

  8. ravisharvindMSBI says:

    Hi,

    How do i redirect the lookup no match output in the script component if there is no dimension natural key available?

    Can you please help me out with the code?

    Thanks in advance for the help.

  9. ravisharvindMSBI says:

    How do i redirect the lookup no match output in the script component if there is no dimension natural key available?

    Even if redirection is not possible i would like to replace the corresponding key of the record like 1, which is the key for default record in my dimension?

    waiting for your response…

  10. You can redirect with the script by adding another output to the component.

    http://www.sqlservercentral.com/articles/Integration+Services+(SSIS)/66136/

    Alternatively, you can replace the key value with 1 instead of throwing an exception like the script does now.

  11. ravisharvindMSBI says:

    Hi Masson,

    Thanks for the reply.

    I tried replacing the key for the corresponding row to 1 instead of throwing the exception as you said. But it fails again. I don’t want the script component to fail at any cost, just replace the key with 1 and proceed.

    Below is the change I made to the code.

     if (ranges == null)

     {

               Row.ProductKey = 1;

    //commented

    //throw new NullReferenceException("Couldn’t find //value for product number " + productNumber);

     }

    Am I missing something here or should I do anything else.. Please have a look at it. It is very critical for me.

  12. CozyRoc says:

    There is solution based on the third-party commercial CozyRoc SSIS+ library. CozyRoc has implemented data flow destination script, which creates memory-efficient range dictionary object. The dictionary object can then be used in CozyRoc Lookup Plus component. For more information and demonstration how to use the script, check here:

    http://www.cozyroc.com/script/range-dictionary-destination

  13. pzajkowski says:

    Hi, I’d really like to try out the downloadable solutions, but it appears they are in SSIS 2008 format.

    Any chance a SSIS 2005 version can be made available too?

    Thanks in advance.

  14. Langston says:

    Is it possible to get a VB.Net version of the script component's code? Thanks in advance.

  15. JJBear11 says:

    I second that — any chance for a VB.Net version of this code?

  16. Andrew says:

    hi mmasson,

    what is the purpose of this code snippet

               if (index < 0)

               {

                   index = ~index;

               }

    Thanks

  17. Leo says:

    Did anyone ever come up with a VB.NET version of this?

  18. cgrigolini says:

    Solution  2 is easy and efective, but if we don't  use the  solution 1 (lookup component) there is a problem with the rows that don't match.  We lose them! without the posibility of ignoring the "no match" and let them pass anyway with a null or default value.

    Does anybody have an idea for a quick workaround?

  19. Mohan says:

    You can implement solution 2 with a left outer join. That way all the rows taht don't match will come through with the key on the second table as null. In the conditional split you can add a third condition by checking for second table key is null and use that output as your "no match" output.

  20. Thanks Mohan – using the Left Outer Join is a great suggestion.

    For the people asking for a VB version of the script – no, sorry, I have not coded one, but I'd be happy to post a version here if someone sends it to me.

  21. Andrew, if the BinarySearch method does not find the element, it returns a negative integer. Using the bitwise compliment operator (~) gives you the index of the first element that is larger than the search value.

    See: msdn.microsoft.com/…/w4e7fxsh.aspx

  22. Mohan says:

    Matt, My earlier suggestion of using a let outer join is incomplete. If the lookup fails on the ProductAlternateKey then the left outer join works. But to get a complete "no match" output which includes not matching date ranges the solution is a little more elaborate. For that multicast the "OrderDate and ProductNumber"  and multicast "In Range" output of the conditional split. A left outer join using ProductAlternateKey and StartDate of these multicast streams and then filtering for only rows in the original source stream and not in "InRange" will give a complete "no match" output.

  23. Chris says:

    For some reason when I got to map the parameters to an input field the list is empty.

  24. Frenk says:

    Any plans yet for adding the range lookup as an out-of-the-box component tot SSIS?

    It's really annoying to still having to rely on these creative solutions.

  25. How about joining the source to the dimension to create a cache. You can do this by having a seperate data flow that will create you cache.  In the sample code below I have a dimension called [dimension_table] and my fact SSIS source is called [fact_source].  

    SELECT DISTINCT S.transaction_date,D.Surrogate_Key,D.Business_Key

    ,D.dw_effective_start_date,D.dw_effective_end_date FROM dimension_table D JOIN fact_source S ON D.Business_Key = S.Business_Key

    AND D.[dw_effective_START_date] <= transaction_date

    AND D.[dw_effective_END_date] >= transaction_date

    Once you have your cache you can now do an equality lookup using the business key and the transaction date.

    depending on your fact, if you are loading a fact where the source has 2 million rows , then creating a cache could take upto 50 seconds but once you have your cache you will save a lot of time on the lookup task.

  26. Barry says:

    Hi

    Would it be a valid solution to perform the joining in the source adapter?  Left joining to each of your dimensions with multiple join operators?  Key, business date, end date?

    Barry

  27. Yes – if you can do the join in your source query, then that is typically the better approach. That applies for all Lookup cases, not just for doing Range Lookups. You'd want to use the Lookup Transform when you are joining data from different sources.

  28. Carmina says:

    Hi Matt,

    Can you send me the packages attached to this email, please? carmina.bernabeu@plusultra.es

    My organization has blocked some contents and I can't download it myself from here.

    Thanks in advance

  29. Peter Bontekoe says:

    I think your link to onedrive is broken. It says: the item could not be loaded at the moment.

  30. Himanshu says:

    Hi Matt,

    The link to the sample packages is broken. Can you please re enable the link.

  31. neil42w says:

    Hi Matt, I'm not sure if you'll still be notified re these comments but I'd love to ask you about this blog post, and a separate post reallifebi.blogspot.co.uk/…/ssis-scd2-script-lookup.html which is an altered version of your script.

    The link on this post is broken so I have no way of comparing your script to Miljan Radovic's version of the script, but having spent a few days of intense C# learning in order to understand what's going on I'm fairly sure (as sure as an absolute novice can be) that Miljan's script contains some flaws, potentially of the copy+paste variety. There's every chance I'm wrong and I mean no disrespect to Miljan, to the contrary I'm actually extremely grateful for the work done by the likes of you and Miljan.

    Thanks in advance for any help,

    Neil (neil42w at gmail dot com)

  32. sarah says:

    Any chance you can post the code for the script component?  Can't download from skydrive at work as third party download sites are blocked…

Skip to main content