The Cache Transform isn’t currently able to update (i.e. append to) an existing persistent cache file. This pattern presents a way to incrementally build up your lookup cache if your data flow process is responsible for adding new rows to your reference table. For an alternative approach, please see the Other Resources section at the end of this post.
The process is split up across multiple packages. Because a Cache Connection Manager reads its cache file once, and keeps the cache in memory until package execution is complete, you can’t read and update a cache file in the same package.
This is the control package which uses a script task to check if the cache file already exists. If it doesn’t, it executes the “Create Cache” package. Once the cache is there, the main data flow package which performs your general ETL logic is executed. The “Data Flow” package will write out any new rows for the reference cache to a temporary file, and update a variable in the control package to indicate how many rows were written out. The parent checks if any rows were written, and if so, runs the “Update Cache” package to update the persistent cache file
The cache creation package reads the source reference data, and then writes it to two separate locations. First it uses the Cache Transform to create a persistent cache, and then it writes the same data out to a RAW file. This RAW file will be incrementally updated by the Data Flow package to store the latest data for our reference set.
In the main ETL package, we use a key generation pattern to generate and insert new values into our reference table. Instead of rejoining the main flow right away, we multicast it out to a Row Count, followed by a RAW File Destination.
The Row Count writes to a local variable. After this data flow, we’ll use a script task to copy this variable’s value over to a variable in the parent package. You can see Jamie Thomson’s post for more about how this works.
The RAW File Destination is able to append to an existing RAW File. Here we use it to add the new rows to the “incremental update” file we created in the “Create Cache” package.
If the main ETL package wrote any new rows to the incremental update RAW file, the control package will run the Update Cache package. This package simply reads the incremental update file (which will contain the entire cache, since it’s simply being appended to each time), and recreate the cache file.
The sample packages used in this post can be downloaded from my SkyDrive share.
- John Welch has an alternative approach to incremental updates