MDM and EAG and CDI Oh My! (Part 1)

One of the benefits often mentioned in SOA articles is data integration. When all your services communicate through messages with well-defined formats and schema, exchanging data between systems is easy, right? Unfortunately, this is a necessary but not sufficient condition for data integration. In addition to having all your systems agree on what a customer message looks like, they are going to have to agree on customer identifiers before the systems can exchange data in a meaningful way. For example, if I appear as a customer in three of your systems but one system uses my SSN as an ID, one uses my email address and another uses my phone number, chances are I am going to show up as three different customers, so it will be nearly impossible for your company to understand me as your customer. As new systems are built and more customer data is acquired through mergers and acquisitions, this problem becomes more difficult to solve.

The traditional approach to obtaining a unified view of your customers is to pull all the customer data together into a data warehouse. This involves agreeing on a common format for a customer record and using some kind of data cleansing software to figure out which customer records from different systems actually refer to the same customer. A quick scan of my junk mail indicates I show up as Roger Wolter, R Wolter, RL Wolter, Roger Wolters, Roger Walter and Walter Rogers in various customer databases. The new Fuzzy Lookup feature in SSIS (SQL Server Integration Services) can be used to decide whether these names represent the same person or different people. Some systems use other data like addresses, phone numbers, email addresses, etc. to help decide whether different customer records represent the same customer. Systems that help match customers are generally known as customer data integration (CDI) systems. Some of them have been around for a long time and use very sophisticated algorithms for matching customers.
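To make the name-matching idea concrete, here is a minimal sketch in Python. It is not the algorithm that Fuzzy Lookup or any CDI product uses. It simply scores two names with the standard library's `difflib.SequenceMatcher` and treats them as a probable match above a threshold. The function names and the 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def likely_same_customer(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two names as a probable match when their similarity meets the threshold.

    The threshold is an illustrative assumption; a real system would tune it
    and would also weigh addresses, phone numbers, email addresses, and so on.
    """
    return name_similarity(a, b) >= threshold


if __name__ == "__main__":
    reference = "Roger Wolter"
    variants = ["R Wolter", "RL Wolter", "Roger Wolters", "Roger Walter", "Walter Rogers"]
    for candidate in variants:
        score = name_similarity(reference, candidate)
        print(f"{candidate!r}: {score:.2f} match={likely_same_customer(reference, candidate)}")
```

A character-level score like this handles small misspellings such as "Wolters" or "Walter" reasonably well, but it has no notion of initials or swapped first and last names, which is one reason real matching systems bring in additional attributes.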

While the data warehouse approach is good at presenting a unified view of the history of your customer relationships, it doesn’t solve the problem of obtaining a unified view of current activity. This is what many people are looking to SOA to provide. Obviously, a current customer data system is going to require agreement among the systems on customer identities. Without this agreement, integration is actually harmful because it gives a false picture of the customer. When you look at the customer one system at a time, you expect a limited view. But when you think you are getting an integrated view and the systems don’t agree on customer identities, your new SOA system will hand you a fragmented picture that you believe is complete.

One obvious approach to resolving this is to create a customer master list and modify all your applications to use this list instead of their current list. The tools you use to create the original list would be the same tools that the data warehouse approach uses. While this is definitely the surest way to ensure that all your systems agree on the customer list, it’s almost never a practical approach because it requires significant changes to almost all your systems. It also means that your systems will all have to deal with a superset of all the customer data in your organization. This may mean that a system that was designed to manage a few hundred customers is suddenly exposed to several million customers from your web site with possibly disastrous consequences.

A variation of this approach would be to keep a master customer list that holds a subset of the attributes in your customer systems and a superset of all the customers. A basic set of customer-related queries could be satisfied by retrieving data directly from the master list, while more detailed queries would have to drill through to the data maintained by the aggregated applications. A mechanism would have to be developed to keep this master list synchronized with the application databases. If your SOA system is going to wrap these applications with service logic, that logic might be a good place to ensure that customers are added to both the master list and the application database. If customers are added directly to an application, some mechanism would have to ensure that the customer is also tied to the master list, especially if the customer already exists in the master. In some cases, it might be possible to do this in a database trigger, but it may also be necessary to change the application logic to search the master list for a customer before adding it. You would also have to come up with a mechanism for propagating changes made to the master back to the systems that share the updated record.
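The "search the master list before adding" step can be sketched as follows. This is an in-memory illustration only, and every name in it (`MasterRecord`, `CustomerMaster`, `register`) is hypothetical. It also assumes an exact match on a normalized email address, which is far simpler than the matching a real CDI system would do.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MasterRecord:
    """A master entry: a few shared attributes plus each system's local ID."""
    master_id: int
    name: str
    email: str
    local_ids: Dict[str, str] = field(default_factory=dict)  # system -> local ID


class CustomerMaster:
    """Hypothetical in-memory master list, used only to show the control flow."""

    def __init__(self) -> None:
        self._records: Dict[int, MasterRecord] = {}
        self._next_id = itertools.count(1)

    def find_by_email(self, email: str) -> Optional[MasterRecord]:
        # Assumption: a normalized email is a good enough match key for this sketch.
        key = email.strip().lower()
        for record in self._records.values():
            if record.email == key:
                return record
        return None

    def register(self, system: str, local_id: str, name: str, email: str) -> MasterRecord:
        """Tie a customer added in one system to the master list.

        An existing master record is reused when one matches; otherwise a new
        record is created. Either way the system's local ID is recorded.
        """
        record = self.find_by_email(email)
        if record is None:
            record = MasterRecord(next(self._next_id), name, email.strip().lower())
            self._records[record.master_id] = record
        record.local_ids[system] = local_id
        return record


if __name__ == "__main__":
    master = CustomerMaster()
    master.register("billing", "B-17", "Roger Wolter", "roger@example.com")
    same = master.register("web", "W-9", "R Wolter", "Roger@Example.com")
    print(same.master_id, same.local_ids)
```

In a service wrapper, a call like `register` would run alongside the insert into the application database. The same lookup could sit behind a database trigger for customers added directly to an application.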

Some systems just maintain a cross-reference list that maps the customer IDs in the different systems without actually maintaining any customer data in the master list. This makes the master list easier to manage, but it means that most queries are going to involve distributed joins across multiple (probably heterogeneous) databases. In other words, you maintain a smaller and simpler customer master at the expense of more complex logic to query it and keep the databases synchronized.
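A cross-reference list of this kind can be pictured as a pair of lookup tables. The sketch below is a hypothetical in-memory version. The class and function names are made up for illustration, and plain dictionaries stand in for the separate application databases. Note that `unified_view` has to visit every system to assemble a customer, which is the distributed join mentioned above in miniature.

```python
from collections import defaultdict
from typing import Any, Dict, Optional, Tuple


class CrossReference:
    """Maps each system's local customer ID to a shared master ID, and back."""

    def __init__(self) -> None:
        self._to_master: Dict[Tuple[str, str], int] = {}
        self._to_local: Dict[int, Dict[str, str]] = defaultdict(dict)

    def link(self, master_id: int, system: str, local_id: str) -> None:
        self._to_master[(system, local_id)] = master_id
        self._to_local[master_id][system] = local_id

    def local_ids(self, master_id: int) -> Dict[str, str]:
        """Return a copy of the {system: local ID} map for one customer."""
        return dict(self._to_local.get(master_id, {}))

    def translate(self, from_system: str, local_id: str, to_system: str) -> Optional[str]:
        """Translate one system's customer ID into another system's ID, if linked."""
        master_id = self._to_master.get((from_system, local_id))
        if master_id is None:
            return None
        return self.local_ids(master_id).get(to_system)


def unified_view(xref: CrossReference, master_id: int,
                 systems: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Collect one customer's record from every linked system.

    `systems` maps a system name to a {local ID: record} dictionary, a
    stand-in for querying each application's own database.
    """
    view = {}
    for system, local_id in xref.local_ids(master_id).items():
        record = systems.get(system, {}).get(local_id)
        if record is not None:
            view[system] = record
    return view


if __name__ == "__main__":
    xref = CrossReference()
    xref.link(1, "billing", "B-17")
    xref.link(1, "web", "roger@example.com")
    databases = {
        "billing": {"B-17": {"balance": 42.0}},
        "web": {"roger@example.com": {"last_login": "2006-01-15"}},
    }
    print(xref.translate("billing", "B-17", "web"))
    print(unified_view(xref, 1, databases))
```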

This is the customer problem at a very high level. I started with customers because that is generally the hardest problem to solve. In Part 2 of this series, I will talk about these same issues as they apply to other database data.