In the previous blog article, I described how to use the free SSIS DQS Domain Value Import from OH22 data to automate importing values into a DQS Knowledge Base domain. In this article, I will go over the optional configuration to import invalid/error value types as well as to set synonyms.
First, let's quickly recap the domain type and synonym.
In DQS, you can define a domain value with different types:
- Correct: This is a valid value that belongs to the domain. For example, in the color name domain, red is a valid color name.
- Error: This is a value that is incorrect. For example, 'tbd' is not a correct color name.
- Invalid: This is a value that may be correct but does not belong to the domain. For example: oliver is a correct person name but it is not a correct color name.
Note: The difference between Error and Invalid is often subtle. Whether you set a domain value as Error or Invalid, DQS cleansing gives the same result and marks records containing those values as Invalid. Many DQS implementation combine the Error and Invalid and only use Correct or Invalid type to simplify.
When setting up domain value type to Error or Invalid, you can specify Correct To value. DQS cleansing then automatically replace the value with the Correct To (i.e. synonym).
Importing domain values with different types
Consider that we would like to insert new values to the domain Name from a SQL table:
CREATE TABLE DQSCOLORS2 (
-- Note: that the column containing domain value must have NVARCHAR/NCHAR and Domain type must be set to INT data type.
INSERT INTO DQSCOLORS2
The domain type column must be set to one of the following integer value:
- 0 (correct)
- 1 (error)
- 2 (invalid)
In this example, 'white' is a correct color name, 'tbd' is an error, and 'oliver' is an invalid color name.
Configure the SSIS DQS Domain Value Import as follows:
- Specify connection to the DQS server
- Select the Data Quality Knowledge Base from the drop down (e.g. Colors).
- Select the Domain from the drop down (e.g. Name)
- Specify input column for the Leading Value (e.g. Name)
- Specify input column for the type (e.g. DomainType). Note: The drop down only shows column with INT data type. Check the data type if you do not see the column name from the drop down.
- Accept all other default setting (In part 3 of this series, I will explain how to use error handling options)
- Click OK to finish
Make sure the DQS Knowledge Base Colors is published before executing the SSIS package. After the SSIS package completed, open the Knowledge Base to review the result:
The domain values are added with the specified type.
A couple of things to consider when using the Domain Type option:
- Domain type source column must be set to INT and contained only 0,1,2 and does not have NULL -- any violation to this will result in error during package execution. See the next article for more information on error handling.
- You can't update an existing domain type (e.g. you want to change green from Correct to Invalid). This must be done through DQS Client.
Importing domain values with synonym
In this example, I set up a new table with the following:
CREATE TABLE DQSCOLORS3 (
INSERT INTO DQSCOLORS3 VALUES
('white', 'bright white')
,('DQS_NULL', 'not specified')
The first row sets synonym to an existing valid value (e.g. 'white') in the domain. The second row sets synonym to a special domain value called DQS_NULL (this value can be used to specify how to handle NULL during DQS cleansing). The last row set synonym where both the leading value and its synonym are not currently in the domain.
Configure the SSIS DQS Domain Value Import as follows:
Make sure the DQS Knowledge Base Colors is published, then execute the SSIS package and review the result:
The synonyms are successfully added to the DQS domain.
'bright white' and 'not specified' are set as Correct to 'white' and DQS_NULL accordingly. This means, for example, that when you are cleansing records containing 'not specified', it will be corrected to null. Lastly, both 'clear' and 'transparent' are added with 'clear' set as the leading value. Note that when importing leading value and synonyms using SSIS DQS Domain Value Import, the synonyms are always set to invalid type (e.g. 'transparent') while the leading value are always set to correct (e.g. 'clear').
A few things to consider when using the Synonym option:
- Synonyms are created as new value with domain type always set to Invalid
- You can set synonym to an existing Correct domain values (e.g. 'white" and 'DQS_NULL') -- If you try to set synonym to an Invalid/Error domain, you will get an error during package execution. See the next article for more information on error handling.
- You can set synonym to BOTH a new domain value and a new synonym pair (e.g. 'clear' and 'transparent' are new and not previously exists in the Color Name domain)
Importing domain values with both different types and synonym
In this last example, we will import data that include both type and synonym:
CREATE TABLE DQSCOLORS4 (
INSERT INTO DQSCOLORS4
('red',NULL ,'dark red')
,('blue',0 ,'dark blue')
,('blue',1 ,'light blue')
,('blue',3 ,'bright blue')
,('yellow',0 , NULL)
DQS Domain Value import is configured to map all three input columns:
The result of the SSIS package execution shows the following:
A few things to consider when using both Domain Type and Synonym option:
- Domain type is ignored when you have synonym. The new synonym always have Invalid type. (e.g. see 'dark red', 'dark blue', 'light blue' examples above)
- When the synonym is NULL, then the Domain Type must have non NULL valid of either 0,1,2. In another word, you can't have both domain type and synonym to be NULL. This will result in an error during package execution.
Let's proceed to the next article where I describe the options you can use to handle error during import.