Assigning surrogate keys to early arriving facts using Integration Services

05/13/2009

In data warehouses, it is quite common that fact records arrive with a source system key that has not yet been loaded in the dimension tables. This phenomena is known as “late arriving dimensions” or “early arriving facts” in Kimball terminology.

When you see a fact record that cannot be resolved to a dimension surrogate key, the typical solution is this:

Create a dummy member in the dimension table using the source system key
Assign a surrogate key to this dummy member
Use the newly create surrogate key and assign it to the fact record

If you use T-SQL to load the data warehouse, it means you have to pass over the input fact rows twice. First, you have to discover which keys are not present in the dimension (and create surrogates for them). Second, you will have to look at the input data again and use the newly generated surrogate keys to load the the fact table.

Using Integration Services, early arriving facts can be populated with just one pass over the source rows, which means less read I/O operations. Nice!

In project Project REAL, a script component is used to achieve this effect. If many of your dimension have early arriving facts, this creates a lot of copy/paste code. There is a cleaner solution that does not use script components.

There is a way handle early arriving facts without relying on script components. It is best illustrated with an example. Let us create these three tables:

/* The input table */ CREATE TABLE Stage_Fact
(
NK_A CHAR(10) NOT NULL /* The late arriving source system key */
)

/* The late arriving dimension table */ CREATE TABLE Dim_A
(
SK_A INT PRIMARY KEY IDENTITY(1,1) /* The surrogate key*/
, NK_A CHAR(10) NOT NULL /* The natural, source system key */
)

/* The final destination table */
CREATE TABLE Fact
(
SK_A INT NOT NULL /* Surrogate key from dimension */
)

Now, use this script to generate 16M rows in the input table and create a 9000 row dimension table:

/* Create some staging data */ INSERT Stage_Fact WITH (TABLOCK)
SELECT RIGHT(REPLICATE('0', 10) + CAST(K AS VARCHAR(10)), 10) AS NK_A
FROM (SELECT ABS(binary_checksum(*) % 10000) AS K
FROM sys.trace_event_bindings eb1
CROSS JOIN sys.trace_event_bindings eb2) AS stuff

/* Populate Dim_A with 90% of the keys from the fact table */ INSERT Dim_A WITH (TABLOCK) (NK_A)
SELECT DISTINCT NK_A FROM Stage_Fact
WHERE NK_A < '0000009000'

With this data, there will be 1000 late arriving dimension keys in Stage_Fact in around 1.8M rows. When we see a non-matched key in Stage_Fact, we want to generate a new surrogate key in Dim_A. But here is the catch: We only want to generate the surrogate once, and we do NOT want to do a roundtrip to the database the second time we see the same key.

Project Real uses a .NET hash table to track the generated keys and perform quick lookups the next time we see the key. But, we already have a fine hash table available without using script components: the lookup transformation. Let us see how we solve the early arriving fact problem with Integration Services, au natural:

The non-matched rows from Lookup SK_A go into the second lookup (New SK_A Cache). New SK_A Cache is where we want to handle the early arriving facts.

First, configure New SK_A Cache as a partial cache:

Now, we play a clever trick: Whenever a partial lookup cache first receives a non-matched row, it will call a SQL statement and fetch data to populate the lookup cache. The default is a SELECT statement, but it does not have to be a SELECT statement. We could replace it with a stored procedure that returns the same result as the SELECT. Actually, let us do exactly that:

Now, the FIRST time the partial lookup cache sees a early arriving fact, it will call Generate_SK_A. I have mapped the NK_A (the source system, natural key) column to the input parameter. To finish the trick, we just have to create a simple stored procedure that uses NK_A to lookup SK_A (the Surrogate Key), and if not found, create a new key:

CREATE PROCEDURE Generate_SK_A

@NK_A CHAR(10) /* The key to find a surrogate for */ AS
SET NOCOUNT ON

/* Prevent race conditions */ SET TRANSACTION ISOLATION LEVEL SERIALIZABLE

/* Check if we already have the key (procedure is idempotent) */ DECLARE @SK_A INT
SELECT @SK_A = SK_A
FROM Dim_A
WHERE NK_A = @NK_A

/* The natural key was not found, generate a new one */ IF @SK_A IS NULL BEGIN
INSERT Dim_A (NK_A) VALUES (@NK_A)
SET @SK_A = SCOPE_IDENTITY()
END

/* Return the result.
IMPORTANT: must return same format is the SELECT statement we replaced */
SELECT @SK_A AS SK_A, @NK_A AS NK_A

Simple isn’t it?... No need to use any .NET script components here. Have a look at the attached files to study the technique further and you will be handling early arriving facts elegantly in no time.

SK_Example.zip

Comments

Anonymous
May 13, 2009
To get the attached ZIP file click on the title of this post to bring it up in full screen. At the bottom of the page, there is a link to SK_Example.zip. Thanks Thomas Kejser
Anonymous
May 19, 2009
You can even save the first lookup and just use the second one, when define your partial cache big enough to hold all your dimension elements. Then you also don't need the union all element and it looks very smart.
Anonymous
May 19, 2009
Marcel: If you try to profile your solution, you will see that even the KNOWN keys result in roundtrip to the database. If you have a 10.000 row dimension (like this example), then you will get 10.000 roundtrips, even though only 1000 keys are early arriving. By having the first lookup catch most rows with a FULL cache, you save 9000 roundtrips. A performance gain worth taking I think :-)
Anonymous
May 19, 2009
You are right, because the partial cache starts empty and then you have all those round trips over the StoredProcedure. But anyway a nice solution. :)
Anonymous
May 19, 2009
Everybody who is familiar with Kimballs data warehouse design (Kimball Group) also knows about the early arriving facts problem. In SSIS 2008 the solution for achieving inferred members is quite a little bit easier. Thomas Kejser wrote a nice blog entry
Anonymous
February 03, 2010
How did you add a second output to your first Lookup? Is this a new feature in 2008? Will this work in 2005 using the Lookup Error output?
Anonymous
February 03, 2010
TabAlleman: The second output is indeed a new features in SQL Server 2008. In 2005 you have would have to use the error output and distinguish between lookup errors and truncation errors
Anonymous
February 09, 2010
The comment has been removed
Anonymous
July 14, 2010
TabAlleman - I have got a similar process working in 2005, but using an oledb command to call a stored procedure which does the lookup/insert/scope identity to return the new key. I have yet to test if this can be done with a lookup as outlined above.
Anonymous
July 14, 2010
TabAlleman - I have got a similar process working in 2005, but using an oledb command to call a stored procedure which does the lookup/insert/scope identity to return the new key. I have yet to test if this can be done with a lookup as outlined above.
Anonymous
August 09, 2010
@TabAlleman Hi! For me it is working within the 2005 enviroment.If you like I can send you an example in order to help you.
Anonymous
August 09, 2010
@TabAlleman Hi! For me it is working within the 2005 enviroment.If you like I can send you an example in order to help you.
Anonymous
June 28, 2016
Great example. Thanks.I found one use case that where I'm struggling to use this approach. Let's say I have one more column that's not part of business key but I still want it to be populated, Dim_A.TypeA. It seems that I can't pass it over to the stored procedure as it's not used in the join.

Share via

Comments

Ask Learn

Share via

Assigning surrogate keys to early arriving facts using Integration Services

Comments

Additional resources