Introduction to Data Quality Services Data Matching

It's all about your data...

Whether you are a small shop, a mid-size business, or a large organization, one of your most important assets is your data. Organizations often spend hundreds of thousands or even millions of dollars, a large portion of their information technology (IT) budgets, on implementing sophisticated applications, data synchronization systems and data warehouses.
Making large investments of money, time and resources to collect large volumes of data without ensuring the quality of the data is futile and will almost certainly lead to a substantial waste of valuable resources. However, if high-quality data can be achieved and maintained, the business value of these IT investments can truly be exploited, allowing, for example:

  • Better customer service/satisfaction
  • More effective sales campaigns and improved Web-based sales
  • More efficient business processes
  • Reduced time-to-market for new product introductions
  • Less expensive procurement
  • More effective business performance management
  • Better compliance reporting
  • Reduced business risk

One of the most damaging data problems is duplicate data records. Duplication occurs when the same entity (product, customer, etc.) is represented by multiple similar, though not identical, records in a company’s databases. These differences could be due to typos, differences in how different employees perform data entry, differences in terminology or naming standards used among departments, differences in language and so forth. Whatever the cause, the results are highly detrimental to the successful functioning of the company, as the following examples illustrate.

  • Billing – An invoice goes out to ‘Contoso Mfg. Corporation’, but the payment received is posted to ‘Contoso Manufacturing Corp.’ For the next six months, collection letters are automatically sent to ‘Contoso Mfg. Corporation’ despite the company’s repeated claims that payment was made. The customer is lost and never orders again.
  • Sales – A customer calls to order 100 Panasonic KX-TA824 telephones and is told that there are none in stock. The customer orders from a competitor and the revenue is lost. Meanwhile, the company actually has 200 of the requested units, but they were not found because they were listed in the inventory system as Panasonic A824.

What are the challenges of data matching?

In the business world, effective matching requires identifying not only identical records (or records containing unique identifiers such as a SKU or social security number), but also similar records that, in fact, refer to the same entity. The nature of the similarity varies tremendously between records and applications. Human beings are good at this task, but the volumes of data involved make manual matching time- and cost-prohibitive.

Developing software that can accurately ascertain similarity by means of “conceptual closeness” among unstructured data records is a great challenge, whether in the world of corporate data, web search engines, military intelligence or any other field in which matching similar informational entities is critical. For a human who is familiar with the context of a particular data set, identifying very similar records is not very difficult. For example, a person can easily recognize that the same individual is listed at two different addresses based on that person’s identifiers; however, developing computer software with the “intelligence” to compare pieces of unstructured data and to ascertain whether or not they are essentially the same item is quite challenging.

This challenge is made easier with Data Quality Services, where you can create and train your matching rules in an easy, interactive process, then use this knowledge, along with other knowledge stored in the knowledge base such as domain values, in a matching project to identify similar records.

So how do we make sure that our data sources are duplicate-free?

To successfully automate the identification of duplicate records, a sophisticated technological solution is necessary: it must correctly identify duplicate records (i.e., those which actually represent the same physical entity) and then “de-duplicate” all redundant records.

How can Microsoft’s Data Quality Services help?

Data Quality Services (DQS) performs data matching by comparing each row in the source data to every other row, using the matching policy defined in the knowledge base, and producing a probability that the rows are a match. Matching is one of the major steps in a data quality project. It is best performed after data cleansing, so that the data to be matched is free from errors.

DQS provides functionality to reduce data duplication and improve data accuracy in a data source. The DQS matching process has the following benefits:

  • Matching enables you to ensure that values that are equivalent, but were entered in a different format or style, are in fact rendered uniform.
  • Matching identifies exact and approximate matches, enabling you to remove duplicate data as you define it. You define the point at which an approximate match is in fact a match. You define which fields are assessed for matching, and which are not.
  • DQS enables you to create a matching policy using a computer-assisted process, modify it interactively based upon matching results, and add it to a knowledge base that is reusable.
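The pairwise, weighted scoring described above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not DQS's actual algorithm: the field names, weights, prerequisite field, threshold value, and the `difflib`-based similarity measure are all hypothetical stand-ins for what a real matching policy would define.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical matching policy: field weights sum to 100%.
WEIGHTS = {"first_name": 0.7, "last_name": 0.1, "phone": 0.2}
PREREQUISITE = "ssn"   # records must agree exactly on this field
THRESHOLD = 0.8        # minimum overall score to declare a match

def similarity(a, b):
    """String similarity in [0, 1]; difflib is a stand-in for DQS's internal measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(row1, row2):
    """Weighted score across policy fields, or 0.0 if the prerequisite differs."""
    if row1[PREREQUISITE] != row2[PREREQUISITE]:
        return 0.0
    return sum(w * similarity(row1[f], row2[f]) for f, w in WEIGHTS.items())

def find_matches(rows):
    """Compare every row to every other row, as DQS does, and keep likely matches."""
    matches = []
    for i, j in combinations(range(len(rows)), 2):
        score = match_score(rows[i], rows[j])
        if score >= THRESHOLD:
            matches.append((i, j, round(score, 3)))
    return matches

rows = [
    {"ssn": "123-45-6789", "first_name": "Jonathan", "last_name": "Smith", "phone": "555-0100"},
    {"ssn": "123-45-6789", "first_name": "Jonathn",  "last_name": "Smith", "phone": "555-0100"},
    {"ssn": "987-65-4321", "first_name": "Jonathan", "last_name": "Smith", "phone": "555-0100"},
]
# Rows 0 and 1 match despite the typo; row 2 fails the SSN prerequisite.
print(find_matches(rows))
```

Note that comparing each row to every other row is quadratic in the number of rows, which is one reason a production-grade engine like DQS handles this internally rather than leaving it to hand-written code.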

A matching project is a three-step task available in the Data Quality Client application. The following figure illustrates the matching steps and shows the results of a matching process.

Stay tuned for more posts about matching and start removing duplicates today 🙂


The DQS team 


Comments (5)

  1. Dylan Jones says:

    Interesting guys, nice to see you adding these kinds of developments.

    Do you have any video footage of an actual scenario?

    Our members would find that useful.


    Dylan Jones

    Editor – Data Quality

  2. Filipe Henriques says:


    I'm looking forward to finally having Data Quality Services in SSIS.

    Is it possible to use defined matching policies in an SSIS package?

    I saw that the DQS Cleansing Transformation performs cleansing activities.

    Can it also perform matching activities based on matching policies defined in the KB?

    Or will the Fuzzy Grouping component now connect to the DQS Server?


    Filipe Henriques

  3. Gadi Peleg says:

    Hello Dylan,

    Thank you for your post and interest in the DQS technology. You can find videos about creating a matching policy and matching project on Technet:…/denali_resource_center.aspx (under Videos -> Data Quality Services).

    It would be great to get your feedback.


    Gadi Peleg

  4. Gadi Peleg says:

    Hello Filipe,

    DQS in SSIS is indeed exciting; you will find many ways of utilizing a knowledge base with the DQS Cleansing data flow component in SSIS.

    We do not provide a Matching component for SSIS in this release, and the existing matching components (Fuzzy Lookup and Fuzzy Grouping) are not integrated with the DQS server, so they cannot leverage a DQS knowledge base. However, you can use the DQS Cleansing SSIS component to cleanse your data, then use its output as the input to a Fuzzy Lookup or Fuzzy Grouping component.


    Gadi Peleg

  5. Suggestion... says:

    One suggestion to the Microsoft DQS team: you show the score for matching records; it would be really beneficial to see the score even for non-matched records. For example, I have a simple matching policy where the SSN is a prerequisite, with 70% weight for a similar first name, 10% weight for a similar last name and 20% weight for a similar phone number. I have duplicate records with everything identical except the first name, which is similar except that the duplicate records have one less character. My matching policy does not pick up this duplicate and I am scratching my head as to why. Hence my request to the DQS team to show the score even for unmatched records.


    Jagannathan Santhanam
