A Maturity Model for Data Integration

Integration is an odd thing.  For it to work, you have to get many people to agree on common things.  Look at what people have agreed upon, and you can see where you are in data integration, and where you still have to go.

I'd like to lay out a maturity model for data integration:

Level 0 - no elements in common

Languages, reference data, assumptions... all different.  Integration can only occur in the data warehouse and then only after each data source goes through a mapping and normalization process. 
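To make Level 0 concrete, here is a minimal sketch of the kind of per-source mapping and normalization the warehouse depends on. The source values and canonical codes are hypothetical:

```python
# Minimal sketch of Level 0 normalization: every feed spells the same
# fact differently, so each source needs its own mapping before the
# warehouse can use it. All values here are hypothetical.
SOURCE_A_COUNTRIES = {"USA": "US", "U.S.": "US", "United States": "US"}
SOURCE_B_COUNTRIES = {"840": "US"}  # numeric codes from another system

def normalize_country(raw, mapping):
    code = mapping.get(raw.strip())
    if code is None:
        raise ValueError(f"unmapped country value: {raw!r}")
    return code

print(normalize_country("U.S.", SOURCE_A_COUNTRIES))  # -> US
print(normalize_country("840", SOURCE_B_COUNTRIES))   # -> US
```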

Level 1 - simple reference data in common

Individual systems produce shared sets of reference data, and other systems agree to use them.  For example: a standard list of countries and their codes, or a standard list of products in the sales catalog.  Notice that I said 'simple reference data.'  I literally mean flat files, since this is the most common way to share simple reference data. 
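As a sketch of what Level 1 looks like in practice: one system exports a flat file of country codes, and every other system loads it rather than maintaining its own list. The file layout here is an assumption:

```python
import csv
import io

# Hypothetical flat file of shared reference data, as the producing
# system might publish it; consumers agree to load this file rather
# than keep their own country lists.
SHARED_COUNTRIES_CSV = """code,name
US,United States
CA,Canada
MX,Mexico
"""

def load_reference_countries(text):
    return {row["code"]: row["name"] for row in csv.DictReader(io.StringIO(text))}

countries = load_reference_countries(SHARED_COUNTRIES_CSV)
print(countries["CA"])  # -> Canada
```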

Level 2 - single instance shared database

One common form of integration is done using a shared database.  In this mechanism, multiple different applications produce updates and/or reports from a single database.  This is the clearest case of 'shared data model' integration.  Developers who work in this environment are required to learn a great deal about the 'data environment' created by that shared database in order to write valid applications.  Under this scenario, a single process model is nearly always evident, with well-understood state transitions.  That said, the developers are not required to treat the system as 'event driven' in any particular way.  They are aware of the shared data, including the state machine(s), and use that information in their work. 
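A minimal sketch of the discipline Level 2 imposes: any application writing to the shared database must honor the shared state machine. The table, states, and transitions here are hypothetical:

```python
import sqlite3

# Hypothetical shared state machine that every application touching
# the shared database is expected to respect.
VALID_TRANSITIONS = {
    "proposed": {"approved", "cancelled"},
    "approved": {"shipped", "cancelled"},
}

def advance_order(conn, order_id, new_status):
    row = conn.execute(
        "SELECT status FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()
    if row is None or new_status not in VALID_TRANSITIONS.get(row[0], set()):
        raise ValueError(f"illegal transition to {new_status!r}")
    conn.execute(
        "UPDATE orders SET status = ? WHERE order_id = ?",
        (new_status, order_id),
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO orders (status) VALUES ('proposed')")
advance_order(conn, 1, "approved")
```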

Level 3 - synchronized database

The next step in 'shared data model' integration is similar to the previous style, but leverages some form of data replication or synchronization under the covers.  The database not only has common tables; those tables are kept in sync through back-end processes. 

Synchronized database integration is an interesting bird.  There is often no formal mechanism for knowing "when" a data element will be copied from one instance of the database to another, so you have to be careful to maintain integrity.  When I've seen this kind of integration performed, the coders had to be very careful to ensure that every database write was wrapped in a reasonable transaction.  Otherwise, it was possible for the sync mechanism to pick up a partial transaction and attempt to ship it to the other database instance... and that data transfer, when it didn't fail outright, created a loss of data integrity.  Declarative Referential Integrity (DRI) helps a great deal in this situation.  If your database does not have DRI, then use this mechanism at your peril. 
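Here is a minimal sketch of that discipline, using SQLite and a hypothetical order schema: DRI is switched on, and every related write goes inside one transaction, so the sync process can never pick up an order header without its lines:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # DRI: enforce the reference below

conn.executescript("""
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    status   TEXT NOT NULL
);
CREATE TABLE order_lines (
    line_id  INTEGER PRIMARY KEY,
    order_id INTEGER NOT NULL REFERENCES orders(order_id),
    product  TEXT NOT NULL
);
""")

with conn:  # commits on success, rolls back on any exception
    cur = conn.execute("INSERT INTO orders (status) VALUES ('proposed')")
    conn.execute(
        "INSERT INTO order_lines (order_id, product) VALUES (?, ?)",
        (cur.lastrowid, "widget"),
    )
```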

Level 4 - synchronized referential events

In this model, two or more systems share a logical data model, and event-driven design begins to emerge.  Each system may introduce its own events into its own database.  The architect selects an event that is useful and decides 'on this event, we share data.'  He or she publishes event data at that time, and the consuming system may subscribe.   Note that the databases are not synchronized at the database level; the synchronization happens at the logical level.  This decouples the databases from one another.
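A toy sketch of the publish/subscribe arrangement. The publish() and subscribe() names are hypothetical stand-ins for whatever messaging layer (queue, topic, webhook) is actually in use:

```python
import json

# In-process stand-in for the real messaging layer; not a real
# library's API.
subscribers = {}

def subscribe(topic, handler):
    subscribers.setdefault(topic, []).append(handler)

def publish(topic, payload):
    for handler in subscribers.get(topic, []):
        handler(json.dumps(payload))

# The architect picked 'order.updated' as the event worth sharing.
# The consumer applies the payload to its own database; neither side
# ever sees the other's physical schema.
subscribe("order.updated", lambda msg: print("consumer received:", msg))
publish("order.updated", {"order_id": 42, "status": "executed"})
```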

This level is named 'synchronized referential events' because the events are usually tied to complete transactions, but they are not necessarily events that the business would recognize, and the vast majority of 'recognizable business events' are not used or made available for integration.  It is a one-off mechanism used for point-to-point integration. 

It is very useful, however, for helping systems scale.  In high-performance environments, this form of integration occurs frequently, because it makes sense.  Unfortunately, because the events are selected to be convenient to the business processes that the two systems agree upon, you are very likely to end up coupling the integration mechanism to the business process itself, often at a point in the process that the business itself wouldn't recognize. 

As such, it is very difficult to create a 'network' of databases that each use synchronized referential events to communicate with one another, because the data will appear to transfer at nearly random times. 

Level 5 - synchronized business events

In this model, we take our scalable event-driven model from the prior level of maturity and add a constraint: the events that we use to synchronize data between systems must derive from a standardized list of business events.

The business WILL recognize this list of events, because they helped to define it.  These are the events that occur across multiple business processes and indicate a key change in state for a key business object.  (For example, a sale may change state from 'proposed' to 'executed', which implies a great deal about how data is handled and what obligations have been transferred from one business party to another.  The business object is the sale record.  The change in state is the point at which an obligation is placed on the business.  This one cuts across business processes fairly firmly.)

A system that participates in a synchronized business event integration model must publish every one of the business events that it knows about, even if there are no known subscribers.  The list of business events is not huge, but it is not random, and that is the biggest value.  Everyone knows not only when data is available, but what it means when data is available, because there is a semantic understanding of what state the data is in when it arrives.
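As a sketch of the publish-everything rule, assuming a hypothetical standardized event list and a stand-in publish() function:

```python
import json
from enum import Enum

# Hypothetical standardized business-event list; in practice the
# business defines these names, not IT.
class BusinessEvent(Enum):
    SALE_PROPOSED = "sale.proposed"
    SALE_EXECUTED = "sale.executed"
    SHIPMENT_DISPATCHED = "shipment.dispatched"

def publish(topic, payload):
    # Stand-in for the real messaging layer.
    print(topic, json.dumps(payload))

def execute_sale(sale_id):
    # ... domain logic moving the sale from 'proposed' to 'executed' ...
    # Publish unconditionally, even if no subscriber exists yet;
    # future consumers depend on the event stream being complete.
    publish(BusinessEvent.SALE_EXECUTED.value, {"sale_id": sale_id})

execute_sale(42)
```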

This form of integration allows Complex Event Processing (CEP) to occur.  You can have a component that notices a series of events on a particular subject, events that may be detected and published by different systems.  Your component may make a determination that would not otherwise have been visible to the participating systems. 

For example, a customer who exhibits a specific pattern of behavior shortly after showing up in the system for the first time may be an early indicator of a customer with high future value.  Right now, they look like anyone else.  Without CEP, you may not see them.  With CEP, you can tag them for specific follow-up. 
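A minimal CEP-style sketch of that example, under assumed event names and a one-hour window: flag a new customer who places three orders shortly after signing up, even if the signup and the orders were published by different systems:

```python
from collections import defaultdict

# Event shape is hypothetical: (customer_id, event_name, timestamp_seconds).
history = defaultdict(list)

def on_event(customer_id, name, ts):
    history[customer_id].append((name, ts))
    events = history[customer_id]
    signup = next((t for n, t in events if n == "signed_up"), None)
    if signup is None:
        return
    early_orders = [t for n, t in events
                    if n == "order_placed" and t - signup <= 3600]
    if len(early_orders) >= 3:
        print(f"tag customer {customer_id} for follow-up")

on_event("c1", "signed_up", 0)
for ts in (600, 1200, 1800):
    on_event("c1", "order_placed", ts)
```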

The flipside is also true.  Some folks have used complex event processing to find fraudulent transactions or 'sham' customer orders placed by less-than-honest trading partners in order to 'game' the system.  Detecting these transactions can be very valuable to the business.

The most sophisticated integration level, Level 5, requires that IT folks do something they have not often done... ask the business about shared points in their myriad business processes.  Sometimes the finance team knows about a set of events that they helped drive.  Other times, the legal department may impose a common event for the sake of a regulatory requirement.  Many of the other common events, especially those in procurement, supply chain, and shipment, have been discussed in B2B settings.   There are many sources.  Go find them.

The most difficult part of building this list of events is a fundamental role reversal.  IT is no longer sitting on the sidelines waiting for the next 'order' to arrive ("do you want fries with that?").  We have to become a proactive, value-added part of the business.  We have to become the ones who reach out, educate, and create value.

And that can be a bit scary for some folks. 

My opinion: it's about time.