As part of our DQS CTP3 offering, we are releasing a new DQS SSIS component. This component incorporates the DQS cleansing functionality into an SSIS data flow.
When should I use the SSIS DQS Cleansing Component?
The DQS Cleansing component can add value when:
1. Cleansing should be performed as a batch process.
2. The cleansing functionality is used as part of a larger data integration scenario.
3. The cleansing process has to be automated, or run periodically.
Sounds great. I want to cleanse my data with the DQS cleansing component. What do I do now?
DQS is a knowledge driven data quality product. It means that first you have to create a knowledge base (KB) that is relevant to the data sources which you would like to cleanse. You build this KB by acquiring knowledge from samples of your data (a process we call “knowledge discovery”), by configuring the KB to use external knowledge from Windows Azure Data Market, or by manually adding knowledge to the KB. The knowledge is stored in the context of data entities that we call data domains. Some examples for domains include City, State, Email Address, etc. The knowledge in the domains consists of good values, bad values, relations between values, validation rules, etc.
Creating and managing KBs are done through the DQS client. Click here for additional information on how to create a KB.
So I Have built a good KB, what now?
Once you feel that the KB contains sufficient relevant knowledge for your cleansing tasks, you can create an SSIS data flow that contains the DQS Cleansing component, configure it to work with the prepared KB, and run the package.
Sometimes, it can help to run an interactive cleansing project through the client before running a batch project through SSIS. In this way, you can conveniently evaluate the results through the DQS client UI and decide whether you are ready to perform batch cleansing, or whether you still need to do some work to enhance and improve your KB.
How do I configure and use the DQS Cleansing component?
In general, the DQS Cleansing component is a standard SSIS transformation component, so familiarity with SSIS is required. To read about this component specific configuration and usage, please refer to this post in the SSIS team blog.
Note that while the DQS Cleansing component is installed as part of SSIS, it is a part of the DQS product and requires the DQS Server installation to function.
Please note that the actual DQ work is done in the DQS server, so from DQS point of view, the component is served as a client. You can install SSIS and DQS server on the same machine, but you can also operate the DQS Cleansing component with a remote DQS server.
If your domains are attached to external reference data services (RDS), your records will be further sent to the cloud. A post about how to use RDS will be published soon.
The following diagram shows how all the components and entities that were mentioned above relate to each other:
Best Practice: using DQS Cleansing in conjunction with Conditional Split for optimized handling of results
The DQS Cleansing component takes input records, sends them to a DQS server, and gets them back corrected. The component can output not only the corrected data, but also additional columns that may be useful for you. For example - the status columns. There is one status column for each mapped field, and another one that aggregated the status for the whole record. This record status column can be very useful in some scenarios, especially when records are further processed in different ways depending on their status. Is such cases, it is recommended to use a Conditional Split component below the DQS Cleansing component, and configure it to split the records to groups based on the record status (or based on other columns such as specific field status).
By using this best practice, “good” records can be immediately sent for further processing downstream, while “bad” records can be isolated or redirected for appropriate handling (automatic or manual).
Watch this video in order to understand how this best practice works in reality.
This is for now - feel free to contact us with comments and feedback.
The DQS Team