SQL Server Data Quality Services (DQS) enables you to match records using your matching rules defined in DQS Knowledge Base. There are two common use cases where you would like to match related records. One is to identify duplicate in your records, such as same customers appear as separate records in your dataset. Another common case is to identify related customers. For example, you would like to identify two separate customer records that belong to the same parent company. In this blog article, I am going to show how to create a simple matching rules and use them to match duplicate and related records.
For this article, I am using the same data set and the knowledge base created in the previous blog article: How to cleanse Customer Data using SQL Server Data Quality Services.
The following are the high level steps:
A. Add matching policy to DQS Knowledge Base
B. Create and execute a DQS matching project
A. Add matching policy to DQS Knowledge Base
- Start Data Quality Client. For information about this step, refer to Run the Data Quality Client Application.
- In the Data Quality Client home screen, under Knowledge Base Management:
- Click Open Knowledge Base
- Note: this article builds on the an existing Knowledge Base created from the previous article to illustrate how domain definition for cleansing helps matching. However, it is not required that you create matching policy from an existing Knowledge Base. You can start by creating a new Knowledge Base.
- Select MyCustomerKB then select Matching Policy under Select Activity then click next
- Note: If you accidently click open without changing activity to matching policy, you will open the domain management screen. Unfortunately, there is no way to navigate to matching policy screen from domain management screen, so you have to first publish the Knowledge Base and start over step 2 to go to Matching Policy screen.
- In the Map screen:
- Connect to a data source to test your matching policy. Select the columns in your data set to the domains in your Knowledge Base by selecting from the drop down. Click Next to finish mapping.
- You only need to map the columns needed for matching purposes. In my example, I only select CustomerName, City and Province and I left out CustomerID (unique row identifier) and last update columns as I do not need those columns for matching
- If you start with a new Knowledge Base or your Knowledge Base does not have the domain needed for your matching, you can click the Create Domain icon () to create the domain directly from this screen.
- In the Matching Policy Screen:
- Create rule for matching by Customer Name
- Click icon to Create a matching rule () and give a name (e.g. Match By Customer Name). In this example, we would like to identify related customer by assuming that customer which has similar/same name belong to the same organization (This is a simplified example to illustrate the capability, in reality, you would use many more domains to achieve the goal-- more about using multiple domains for matching rules later).
- Click Add a new domain element icon to add domain for the matching policy. Specify weight = 100% (more about this later).
- Click Start button to test your rule.
- When performing matching, DQS automatically applies all domain definition in the Knowledge Base. As we are using an existing Knowledge Base which includes term definition for "co" as the same as "Company", DQS uses it to match "Consolidate Co Ltd" and "Consolidation Company Ltd". If you start with a new Knowledge Base or do not have the term based relation defined for CustomerName matching, then the 2 records will not be matched.
- Create rule for matching by City and State
- Click the icon to Create a matching rule again to create a second rule and give a name (e.g. Match by City State). In this example, we would like to identify duplicate customer recording assuming that a customer has only 1 store for a given city in the same state (It is not a realistic example, but you can imagine extending this to an address where if two businesses records have the same address, then the records could be a duplicate).
- Click Add a new domain element icon to add domain to the matching policy. Click down arrow next to the domain name and from the drop down, select City domain.
- Click Add a new domain element icon again to add the second domain, and select State from the drop down. When you have more than 1 domain in your rule, you will need to distribute weight across the domains. DQS will calculate matching score for each domain then use the weight to calculate overall matching score for the rule. For our example, you can distribute equal weight (50-50) across the 2 domains. Click Start button to test your rule.
- As the case of the CustomerName, DQS use the domain value definition in the city and state for matching purposes. As "LA" was defined as synonym for "Los Angeles", DQS match the 2 records as 100% match
- In the rule editor, you can also specify how the matching should be performed: similar (which we have been using in our example) or exact. The exact matching requires the value to be identical. This is typically used when you want to match things such as BirthDate or Identifier (e.g. SSN, CustomerID, etc.). If you have domains that can use exact matching, it is recommended that you mark it as prerequisite to improve performance. In order to match records from a dataset, DQS compare a record with all others, specifying prerequisite, effectively partition your dataset base on the value of the column. This may significantly improve performance.
- Click Start to test your rule.
- Click Next to continue.
- In the matching results screen:
- Click Start to test. While in the previous step, you test against specific matching rule, in this step, you can see how matching results from ALL of the rules. All the records that are matched together have the same Cluster value.
- The following is what you could interpret from the results:
- The first cluster shows the two related companies which have similar name (87% matching score). The name is similar enough that the difference in name may be due to human error during data entry.
- The second cluster shows the two companies that have the same address although they are entered as different names (one name in Chinese and another in English) -- the two customer records appears to be duplicate.
- The last cluster match records based on the 2 rules ("Match By Customer Name" and "Match By City State" -- the Rule column provides different color for different rule, you can click the Matching Rules tab at the bottom for the rule color legend). "Big Cheese, The" and "THE BIG CHEESE" appears to be related companies in a different city. The second one "To be filled later" appears to be a placeholder as the name may not available when first entered into the system. Later, a different person may enter the record as separate new record rather than updating an existing one and resulting in a duplicate.
- Click Finish to publish your Knowledge Base.
B. Create and execute a DQS matching project
- In the Data Quality Client home screen, under Data Quality Projects, click New Data Quality Project.
- Name your new Project (e.g. MyCustomer Matching Project), select Knowledge Base created in the previous step (e.g. MyCustomerKB) and select Matching Activity
- Note: Make sure to select the matching activity, once you click Next, you can't go back and change the activity, so you have to create a NEW project and start over.
- In the Map screen:
- Select data source, database, and table for your data (e.g. MyCustomers table and data generated from the script at the beginning of this article).
- Note: in this example, we are using the same data set when we create the knowledge base, but it is not required. In more realistic example, you build the rule with a representative sample of your data set, then use the full data set for the project. You can reuse the same Knowledge Base for different data set by creating a different project.
- Select columns to be matched and map to the domains in your knowledge base (you must map all domains used by all the rules to continue).
- You should have your data source columns mapped as below. Click Next to continue.
- In the Matching screen:
- Click start to begin the matching process
- Review the results and click next to continue.
- In the Export screen
- Select destination (e.g. SQL Server)
- Select database name
- Click the checkbox next to the Matching Results and specify table name
- Click Export to continue
- Optionally, you can also export survivorship results, for DQS to export only one record from the same cluster. You can specify how the record should be picked from the following option:
- Pivot record : the first record in the cluster
- Most complete and longest record
- Most complete
- Longest record
- Unfortunately, DQS does not support the option to select record based on value in a field (such as updated date or the originating system). For scenario such as picking the most recent record or record from a trusted system, you have to export the entire matching results then implement your own logic for survivorship.
- Review the results
- Open SQL Server Management Studio and query the export table
- Review the result. Use the CLUSTER_ID column to identify cluster of matched records.
In this blog article, I discussed how to create a Knowledge Base to define matching policy. You can then refer to the Knowledge Base to create Data Quality Projects to match your customer data. In this article, I use a simple example to walk through the most common functionalities. For more detailed explanation of DQS Matching, refer to the blog article wrote by my colleague, Gadi Peleg:
- Introduction to Data Quality Services Data Matching
- Matching Policy - A closer look into Data Quality Services Data Matching
It is also recommended that you run a separate cleansing project on your data set before you run matching project. Although DQS matching process uses the same domain definition for the cleansing, it does not make the correction nor does it flag invalid values. In addition, DQS matching only use the internal domain definition and does not automatically send your data to external third party providers. You must explicitly run a cleansing project to send the data externally to the Reference Data Service providers. However, the Service Provider may include additional information about the data you send. For example, Dun & Bradstreet includes DunsNumber in the output. This may be another useful information to use as part of your matching rule. Refer to my previous article on Cleansing your Customer data using Dun & Bradstreet DQS Service for more information on how to use DQS to cleanse data using Reference Data Service.
In the next blog post, I will describe how to automate the DQS matching project using SSIS.