It's all about your data...
Whether you are a small shop, a mid-size business, or a large organization, one of your most important assets is your data. Organizations often spend hundreds of thousands or even millions of dollars, a large portion of their information technology (IT) budgets, on implementing sophisticated applications, data synchronization systems, and data warehouses.
Making large investments of money, time and resources to collect large volumes of data without ensuring the quality of the data is futile and will almost certainly lead to a substantial waste of valuable resources. However, if high-quality data can be achieved and maintained, the business value of these IT investments can truly be exploited, allowing, for example:
- Better customer service/satisfaction
- More effective sales campaigns and improved Web-based sales
- More efficient business processes
- Reduced time-to-market for new product introductions
- Less expensive procurement
- More effective business performance management
- Better compliance reporting
- Reduced business risk
One of the most damaging data problems is duplicate data records. Duplication occurs when the same record entity (product, customer, etc.) is represented by multiple similar, though not identical, records in a company’s databases. These differences could be due to typos, differences in how employees perform data entry, differences in terminology or naming standards among departments, differences in language, and so forth. Whatever the cause, the results are highly detrimental to the successful functioning of the company, as the following examples illustrate.
- Billing – An invoice goes out to ‘Contoso Mfg. Corporation’, but the payment received is posted to ‘Contoso Manufacturing Corp.’ For the next six months, collection letters are automatically sent to ‘Contoso Mfg. Corporation’ despite the company’s repeated claims that payment was made. The customer is lost and never orders again.
- Sales – A customer calls to order 100 Panasonic KX-TA824 telephones and is told that there are none in stock. The customer orders from a competitor and the revenue is lost. Meanwhile, the company actually has 200 of the requested units, but they were not found because they were listed in the inventory system as Panasonic A824.
What are the challenges of data matching?
In the business world, effective matching requires not only identifying identical records (or records containing unique identifiers such as a SKU or social security number), but also identifying similar records that, in fact, refer to the same entity. The type of similarity varies tremendously between records and applications. Human beings are good at this task, but the volumes of data involved make manual matching time- and cost-prohibitive.
Developing software to accurately ascertain similarity by means of “conceptual closeness” among unstructured data records is a great challenge, whether in the world of corporate data, web search engines, military intelligence, or any other field in which matching between similar informational entities is critical. To a human who is familiar with the context of a particular data set, identifying very similar records is not difficult. For example, it is not hard for a person to recognize, based on a person’s identifiers, that two records with different addresses refer to the same person; however, developing computer software with the “intelligence” to compare pieces of unstructured data and ascertain whether or not they are essentially the same item is quite challenging.
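To make the idea of “conceptual closeness” concrete, here is a minimal sketch (not DQS’s actual matching algorithm) that scores string similarity using Python’s standard-library difflib; the company names are hypothetical examples:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Two renderings of the same (hypothetical) customer name score high,
# even though the strings are not identical:
print(similarity("Contoso Mfg. Corporation", "Contoso Manufacturing Corp."))

# An unrelated name scores much lower:
print(similarity("Contoso Mfg. Corporation", "Fabrikam Inc."))
```

A real matching engine combines many such signals (token order, abbreviations, domain knowledge) rather than relying on a single character-level ratio, but the core idea is the same: produce a graded score instead of a yes/no equality test.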
Data Quality Services makes this challenge easier: you can create and train your matching rules in an easy, interactive process, and then use this knowledge, along with other knowledge stored in the knowledge base (such as data domain values), in a matching project to identify similar records.
So how do we make sure that our data sources are free of duplicates?
To successfully automate the identification of duplicate records, a sophisticated technological solution is necessary. To achieve this purpose, it is necessary to correctly identify duplicated records (i.e., those which actually represent the same physical entity) and then “de-duplicate” all redundant records.
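One common building block for correctly identifying duplicates is normalizing records before comparing them, so that superficial differences in case, punctuation, and abbreviation do not hide a match. Below is a minimal sketch with a small, hypothetical abbreviation table (a real system would maintain a much larger, domain-specific one):

```python
import re

# Hypothetical abbreviation expansions for illustration only.
ABBREVIATIONS = {
    "mfg": "manufacturing",
    "corp": "corporation",
    "inc": "incorporated",
}

def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and expand known abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

# After normalization, the two renderings collapse to the same key:
print(normalize("Contoso Mfg. Corporation"))    # contoso manufacturing corporation
print(normalize("Contoso Manufacturing Corp.")) # contoso manufacturing corporation
```

Records that normalize to the same key are strong duplicate candidates; fuzzier comparisons can then handle the remaining near-matches.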
How can Microsoft’s Data Quality Services help?
Data Quality Services (DQS) performs data matching by comparing each row in the source data to every other row, using the matching policy defined in the knowledge base, and producing a probability that the rows are a match. Matching is one of the major steps in a data quality project; it is best performed after data cleansing, so that the data to be matched is free from error.
DQS provides functionality to reduce data duplication and improve data accuracy in a data source. The DQS matching process has the following benefits:
- Matching enables you to ensure that values that are equivalent, but were entered in a different format or style, are in fact rendered uniform.
- Matching identifies exact and approximate matches, enabling you to remove duplicate data as you define it. You define the point at which an approximate match is in fact a match. You define which fields are assessed for matching, and which are not.
- DQS enables you to create a matching policy using a computer-assisted process, modify it interactively based upon matching results, and add it to a knowledge base that is reusable.
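The compare-every-row-to-every-other-row idea described above can be sketched in a few lines. This is a simplified illustration, not the matching policy DQS actually applies; it uses a plain character-similarity score in place of a trained policy, a hypothetical threshold, and made-up product rows echoing the earlier telephone example:

```python
from difflib import SequenceMatcher
from itertools import combinations

def match_score(a: str, b: str) -> float:
    """Similarity score standing in for a matching-policy probability."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_duplicate_pairs(rows, threshold=0.7):
    """Compare every row to every other row; keep pairs at or above threshold.

    Returns (index_a, index_b, score) tuples for candidate duplicates.
    """
    return [
        (i, j, match_score(rows[i], rows[j]))
        for i, j in combinations(range(len(rows)), 2)
        if match_score(rows[i], rows[j]) >= threshold
    ]

rows = [
    "Panasonic KX-TA824",  # hypothetical inventory rows
    "Panasonic A824",
    "Fabrikam Model X",
]
# Prints the candidate duplicate pairs; rows 0 and 1 are flagged as a match.
print(find_duplicate_pairs(rows))
```

You decide the threshold at which an approximate match counts as a match, mirroring how a DQS matching policy lets you tune when similar records are treated as duplicates.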
A matching project is a three-step task performed in the Data Quality Client application. The following figure illustrates the matching steps and shows the results of a matching process.
Stay tuned for more posts about matching and start removing duplicates today 🙂
The DQS team