Lookup cache modes

Over the past couple of months I've been putting together a presentation on the Lookup Transform. I presented most of it as a Chalk Talk at the MS BI Conference last week, and from the evaluation scores, it seems like it was pretty well received. I'll be splitting up some of its content into a series of blog posts over the next little while. If you're interested in seeing the whole talk, it will also be shown at the SSWUG Virtual Conference in November.


The most important setting of the Lookup Transform is the Cache Mode - it can greatly impact your data flow performance, and affects overall package design. Because of its importance, we made it the first thing you see in the new 2008 Lookup UI. I feel this is a great improvement over 2005, where the cache mode was abstracted away - see Michael Entin's post for more details.

2008 Lookup UI

This blog post describes the three cache modes, how they work, and best practices around using them. Note that these cache modes apply when you're using an OLE DB connection manager - using the new Cache connection manager is similar to using a Full Cache mode.

Full Cache


The default cache mode for the lookup is Full cache. In this mode, the database is queried once, during the pre-execute phase of the data flow, and the entire reference set is pulled into memory. This approach uses the most memory and adds startup time to your data flow, as all of the caching takes place before any rows are read from the data flow source(s). The trade-off is that lookup operations are very fast during execution. One thing to note is that the lookup will not swap memory out to disk, so your data flow will fail if you run out of memory.
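To make the behavior concrete, here's a minimal sketch of full-cache semantics in Python. This is illustrative only, not the SSIS internals or API - the `fetch_reference_rows` callable and the sample color table are hypothetical stand-ins for the reference query:

```python
# Sketch of Full cache mode behavior (illustrative, not the SSIS API).
# The reference query runs exactly once, before any source rows flow.

def build_full_cache(fetch_reference_rows):
    """Pre-execute phase: pull the entire reference set into memory."""
    return {key: value for key, value in fetch_reference_rows()}

def lookup(cache, key):
    """Execution phase: every lookup is an in-memory hash probe; the
    database is never queried again. A missing key raises KeyError,
    analogous to the transform's 'no match' behavior."""
    return cache[key]

# Hypothetical reference set standing in for the database query:
reference = lambda: iter([(1, "Red"), (2, "Blue"), (3, "Green")])
cache = build_full_cache(reference)
print(lookup(cache, 2))  # Blue
```

Note that the cache is built eagerly and in full, which is exactly why startup time and memory grow with the reference table, while per-row cost stays constant.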

When to use this cache mode

  • When you're accessing a large portion of your reference set
  • When you have a small reference table
  • When your database is remote or under heavy load, and you want to reduce the number of queries sent to the server

Keys to using this cache mode

  • Ensure that you have enough memory to fit your cache
  • Ensure that you don't need to pick up any changes made to the reference table
    • Since the lookup query is executed before the data flow begins, any changes made to the reference table during the data flow execution will not be reflected in the cache

Partial Cache


In this mode, the lookup cache starts off empty at the beginning of the data flow. When a new row comes in, the lookup transform checks its cache for the matching values. If no match is found, it queries the database. If a match is found in the database, the values are cached so they can be reused the next time a matching row comes in.

Since no caching is done during the pre-execute phase, the startup time using a partial cache mode is less than it would be for a full cache. However, your lookup operations will be slower, as you will most likely be hitting the database more often.

When running in partial cache mode, you can configure the maximum size of the cache. This setting can be found on the Advanced Options page of the lookup UI. There are actually two separate values - one for 32-bit execution and one for 64-bit. If the cache fills up, the lookup transform starts dropping the least-seen rows from the cache to make room for new ones.

In 2008 there is a new Miss Cache feature that allows you to allocate a certain percentage of your cache to remembering rows that had no match in the database. This is useful in a lot of situations, as it prevents the transform from querying the database multiple times for values that don't exist. However, there are cases where you don't want to remember the misses - for example, if your data flow is adding new rows to your reference table. The Miss Cache is disabled by default.
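The partial cache algorithm described above - start empty, fall through to the database on a miss, remember hits, evict the least-seen rows when full, and optionally remember misses - can be sketched as follows. This is an illustrative simulation, not the SSIS implementation; the `query_db` callable and sample reference table are hypothetical:

```python
from collections import OrderedDict

class PartialCacheLookup:
    """Illustrative sketch of partial cache mode (not the SSIS internals).
    The cache starts empty; misses fall through to the database, matches
    are remembered, and the least recently seen entries are evicted once
    the cache is full. A separate budget can remember known misses."""

    def __init__(self, query_db, max_entries=4, miss_entries=0):
        self.query_db = query_db       # callable: key -> value or None
        self.max_entries = max_entries
        self.hits = OrderedDict()      # key -> matched value
        self.miss_entries = miss_entries
        self.misses = OrderedDict()    # keys known to have no match
        self.db_queries = 0

    def lookup(self, key):
        if key in self.hits:
            self.hits.move_to_end(key)   # refresh "recently seen"
            return self.hits[key]
        if key in self.misses:
            return None                  # known miss: skip the database
        self.db_queries += 1
        value = self.query_db(key)
        if value is not None:
            self.hits[key] = value
            if len(self.hits) > self.max_entries:
                self.hits.popitem(last=False)   # evict least recently seen
        elif self.miss_entries:
            self.misses[key] = None
            if len(self.misses) > self.miss_entries:
                self.misses.popitem(last=False)
        return value

# Hypothetical reference table:
ref = {1: "Red", 2: "Blue"}
lk = PartialCacheLookup(ref.get, max_entries=4, miss_entries=2)
for key in [1, 2, 1, 99, 99, 1]:
    lk.lookup(key)
print(lk.db_queries)  # 3: keys 1 and 2 once each, plus one miss on 99
```

Without the miss cache (`miss_entries=0`), the second lookup of key 99 would have gone back to the database - which is exactly the repeated-query cost the Miss Cache feature avoids.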

When to use this cache mode

  • When you're processing a small number of rows and it's not worth the time to populate the full cache
  • When you have a large reference table
  • When your data flow is adding new rows to your reference table
  • When you want to limit the size of your reference table by modifying the lookup query with parameters from the data flow

Keys to using this cache mode

  • Ensure that your cache size setting is large enough
  • Use the Miss Cache appropriately
  • If the cache size isn't large enough for your rows, sort on lookup index columns if possible

No Cache


As the name implies, in this mode the lookup transform doesn't maintain a lookup cache (actually, that's not quite true - we keep the last match around, since the memory has already been allocated). In most situations, this means that you'll be hitting the database for every row.
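The "keep the last match around" detail can be sketched like this - again an illustrative simulation, not the SSIS internals, with a hypothetical `query_db` callable and sample data:

```python
def make_no_cache_lookup(query_db):
    """Sketch of no-cache mode (illustrative, not the SSIS internals):
    only the most recent match is retained, so consecutive identical
    keys avoid a round trip, but everything else hits the database."""
    last = {"key": object(), "value": None}  # sentinel: matches nothing

    def lookup(key):
        if key == last["key"]:
            return last["value"]         # reuse the one retained match
        value = query_db(key)            # every other row queries the DB
        last["key"], last["value"] = key, value
        return value

    return lookup

# Hypothetical reference table, with a counter to show round trips:
ref = {1: "Red", 2: "Blue"}
calls = {"n": 0}
def query_db(key):
    calls["n"] += 1
    return ref.get(key)

lookup = make_no_cache_lookup(query_db)
lookup(1); lookup(1)   # second call reuses the retained match
print(calls["n"])      # 1
```

This is why sorted (or at least clustered) input keys help even in no-cache mode: repeated keys arrive back-to-back and reuse the single retained match.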

When to use this cache mode

  • When you're processing a small number of rows
  • When you have non-repeating lookup indexes
  • When your reference table is changing (inserts, updates, deletes)
  • When you have severe memory limitations

Keys to using this cache mode

  • Ensure that the partial cache mode isn't the better choice


To find out more about how to implement the Lookup transform, please see these Books Online entries:

Comments (12)

  1. Good news – a couple of Information log events were added to the Lookup Transform in SQL 2008 to help

  2. Well, you saved a ton of money, but missed a lot of fascinating presentations. Each hour brought another

  3. As discussed before if you have a type 2 dimension and late arriving facts you are in a messy situation

  4. Using the Cache Connection Manager (CCM) is a new option for the Lookup transform in SSIS 2008. The CCM

  5. The cascading lookup pattern uses two lookup transforms with different cache modes . A common use of

  6. The Lookup Transform does case sensitive string comparisons. This means that you need to a little bit

  7. The SQL Server Integration Services team added valuable new caching options (and scalability) to the

  8. Simon says:

    What's the difference between 'reference set' and 'reference table'?

    Because you mentioned

    •When you're accessing a large portion of your reference set

    •When you have a small reference table

    Are they the same?

  9. Enrique says:

    What's the solution if the lookup table has millions of records (they don't fit in a full cache) and you don't have an OLE DB driver (but ADO.NET instead)?

  10. Rogerfrench says:

    When you say large reference table, does that mean a large number of rows, or a table that is wide in columns?

  11. Garry says:


    I am using three lookups (FULL CACHE MODE) in one data flow task (SSIS 2012). The first lookup is able to find the match but the other two are not. Is there a limit on using lookups in FULL CACHE MODE?

    I am 100 percent sure that there is data that should match, and if I use only one lookup in full cache mode then it works fine.

    I have tested this on many other packages in my solution, still the same result.

    Can anyone help me understand this?



  12. Monday Gandanzara says:

    I was testing this out as a means to optimize my package design in SQL 2012 environment (trying to figure out the best option: Merge Join which requires the use of sort component vs. Lookup)

    Here is my situation:

    I have 1million rows in reference table (APS )

    My pipeline is using a CDC component to get net changes (so smaller changes, about 2k – 10k rows per hour)

    My package will be executed hourly

    My Destination is in APS (distributed  with clustered columnstore index)

    I executed the packages in VS and below are the results:

    Using the Merge Join  — it processed 61k rows in 2:34 minutes

    Using the Lookup with Partial Cache — Over an hour before I finally stopped it (Lookup was processing about 100 rows per minute )

    Using Lookup with No Cache — Over an hour before I finally stopped it (Lookup was processing about 248 rows per minute)

    Using Lookup with Full Cache — it processed 61k rows in 2:20

    For this test, I loaded my reference table with 1 million rows as a sample; the actual table has over 40 million rows.

    Reading your post, I was expecting the Partial Cache mode to be the best option; however, based on these numbers, it doesn't seem that way.

    For my case, I am considering the Merge Join option because I can use parameters in my dataset to limit the source pull.
