Good news – a couple of Information log events where added to the Lookup Transform in SQL 2008 to help you better understand your lookup cache. When you’re running in Full Cache mode, the message will tell you the number of rows in the cache, its total size, and how long it took to create it. When running in partial cache mode you don’t get the cache size, but you do get the number of database hits vs. number of cache hits, which can be helpful in determining whether you should be using a full cache lookup instead.
Here are some equations you can use to estimate the amount of memory a cache will use (I say estimate because given memory boundaries, pages, etc, it will always vary). There are separate (but similar) equations for Full and Partial cache modes, as they handle things a little differently internally.
For each row, in bytes:
<Row size> + 20 + (4 * # of used columns)
<Row size> + 36 + (4 * # of columns in reference query)
Row size is the total data length of the index and values columns.
The 20/36 number is a constant representing the size of the hash used for comparisons.
The last part is 4 (technically, the size of an int) times the number of used columns for full cache, and total number of columns in the reference query for partial cache.
Our lookup query is:
select ProductKey, ProductName from [Products]
ProductKey is an int – 4 bytes
ProductName is an nvarchar(25) – which comes out to 52 bytes (two bytes per nchar, + 2 for null)
Plugging this into our Partial cache equation, we get:
56 + 36 + 8 = 100 bytes per row
If our reference table had 100,000 rows, we’d need ~10mb to hold the entire data set.
(100,000 * 100) / 1024 / 1024 = ~10mb
(divide by 1024 twice to go from bytes to kilobytes, kilobytes to megabytes)
In my example, I’ve edited my lookup query to contain only the columns I need – ProductKey, which is my index column (the column I’m matching on), and ProductName, which is my value column (the column I’m adding to my data flow). If I had done a select * (or picked a table/view name from the drop down in the UI, which results in a select *), I’d be increasing the number of columns in my reference query, and my partial cache row size would change. I’d end up with an additional 4 bytes for every column in my query, even though they aren’t being used directly in the lookup.