Getting the Enterprise Canonical Data Model right

What is the correct level of abstraction for the Enterprise Canonical Data Model (ECDM)?

As I blogged before, the ECDM is used to decide what data should be passed through the integration infrastructure in the notifications that occur on business events.  The canonical schemas that define "things" are all subsets of the ECDM (or extensions, as we will see).

In some organizations, there are fairly few variations in basic 'things' like order, product, and agreement.  In other organizations, including Microsoft, the need for independent variation is more apparent.  As we move more toward "Software as a service," the number and types of products will only grow.  And what exactly is an order if we are using click-stream billing for a service call?  This will be fun.  So we need lots of flexibility as the business grows and changes.  An ECDM that is too prescriptive or too large can end up constraining the business' ability to grow and change.

There are basically two types of messages that need to rely on the ECDM: event notifications and full data entities.  Both are transitory, in that they state a fact at a particular point in time, but the event notifications are more transitory because they are only sent once across the infrastructure.  We need to be able to replay them, but (with the exception of BAM), we don't often query them.

In general, I'd say the rule for event notifications should be:

Communicate sparingly, communicate clearly, allow for questions.

Communicate sparingly: Define your entities to the minimum level needed to share "concepts" and "relationships" across the enterprise.  If an order happens from "company ABC" for 10,000 licenses of "product XSP" under marketing program "VLR", then the canonical schema for that order needs to be pretty short, and the event notification even shorter, so that receiving systems can decide if they even care.  Remember that your event system will send a LOT of events.  Keep them small but provide enough information for the recipient to decide if they need to know more.  So, perhaps the "order placed" notification has things like order id, customer id, partner id, reseller id, program id (sales are made under marketing programs) and a list of product categories that the items in the order represent.  That's it.  The receiving system can decide if they need to know more.
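To make the "keep them small" idea concrete, here is a minimal sketch of what an "order placed" notification could look like, and how a subscriber might decide whether it cares.  The class name, field names, sample ids, and the category-based filter are all my own assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class OrderPlaced:
    """Hypothetical 'order placed' notification: ids and categories only."""
    order_id: str
    customer_id: str
    partner_id: str
    reseller_id: str
    program_id: str
    product_categories: Tuple[str, ...]


def cares_about(event: OrderPlaced, watched_categories: Set[str]) -> bool:
    """Return True if the event mentions any category this subscriber watches."""
    return any(c in watched_categories for c in event.product_categories)


# Illustrative values only; the partner and reseller ids are made up.
event = OrderPlaced(
    order_id="1234",
    customer_id="ABC",
    partner_id="P-1",
    reseller_id="R-1",
    program_id="VLR",
    product_categories=("licenses",),
)

if cares_about(event, {"licenses", "support"}):
    print(f"order {event.order_id} is interesting; ask for the full entity")
```

Note that the notification carries no prices, quantities, or product names.  A subscriber that wants those has to follow up with a question.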

Communicate Clearly: The ids must be generic and enterprise wide.  If a receiving system gets a notification or a canonical element (like the full order), it has to be able to interpret it consistently.  That means that the systems listening for the events have to know what the ids mean and how to get more information on an id if they don't already have it.

Allow for Questions: the infrastructure needs to provide a generic way to ask the question: I need to know more about order 1234 for customer ABC on program VLR.
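One way to picture a "generic way to ask the question" is a single lookup keyed by entity type and enterprise-wide id.  The sketch below is only an assumption about how that might look: `EntityLookup`, `register`, and `ask` are hypothetical names, and an in-memory dictionary stands in for whatever system actually owns the order.

```python
from typing import Callable, Dict, Optional

# A resolver takes an enterprise-wide id and returns the full entity, or None.
Resolver = Callable[[str], Optional[dict]]


class EntityLookup:
    """Hypothetical front door for 'tell me more about <type> <id>' questions."""

    def __init__(self) -> None:
        self._resolvers: Dict[str, Resolver] = {}

    def register(self, entity_type: str, resolver: Resolver) -> None:
        """Declare which source answers questions about an entity type."""
        self._resolvers[entity_type] = resolver

    def ask(self, entity_type: str, entity_id: str) -> Optional[dict]:
        """Return the full entity, or None if the source does not know the id."""
        resolver = self._resolvers.get(entity_type)
        if resolver is None:
            raise KeyError(f"no source registered for entity type {entity_type!r}")
        return resolver(entity_id)


# Stand-in for the order system of record (illustrative data only).
orders = {"1234": {"order_id": "1234", "customer_id": "ABC"}}

lookup = EntityLookup()
lookup.register("order", orders.get)

details = lookup.ask("order", "1234")
```

The point of the design is that the caller only needs the entity type and the id from the notification.  It does not need to know which system holds the answer.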

So if the needs of the event notification are for brevity and consistency, what are the needs for full data entities?

When a system gets an event notification, it will look at the event and decide if it cares.  Most of the time, it won't, and our use case ends.  Sometimes it will.  When it does, it needs to ask for full details of that data entity.  Perhaps it wants to store data.  Perhaps it wants to calculate something to append to the records for the customer, the partner, the reseller, the sales team that made the sale, or the product group that made the product.  Lots of reasons why the system getting the message will need more data.  The ability to 'ask questions' listed above applies to full data entities as well.

I'd say the rule for full data entities is:

Provide a complete document, at a point in time, allow for questions.

Provide a complete document - the full data entity contains all of the data that the source system can share about it, including denormalized details about related entities.  For example, if I get an order as stated above, for 10,000 licenses of product XSP, we would provide the full "legal name" for the product and some attributes for the product (like the fact that it is a license, what country it is sold in, languages, product family id, etc.).  On the other hand, we don't want to constrain the business, so allow for optional fields in the semantics of the canonical object.  Allow a system that doesn't have a data element (like a price or even a quantity) to send the order anyway.  Also allow the system that is sending data to append 'system specific' data elements.  That way, a team can use the canonical model to send data to another closely related system in the same business stream, where those 'system specific details' can be understood and used.
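Here is a rough sketch of a full order entity along those lines: denormalized product details, optional fields, and a place for system-specific data.  Every class and field name is an assumption of mine, as are the sample values (including the "billing-system" extension and the placeholder legal name).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ProductDetail:
    """Denormalized product attributes copied into the order document."""
    product_id: str
    legal_name: str
    is_license: Optional[bool] = None
    country: Optional[str] = None
    languages: List[str] = field(default_factory=list)
    product_family_id: Optional[str] = None


@dataclass
class OrderLine:
    product: ProductDetail
    quantity: Optional[int] = None   # optional: a sender may not have it
    price: Optional[float] = None    # optional: a sender may not have it


@dataclass
class OrderDocument:
    order_id: str
    customer_id: str
    program_id: Optional[str] = None
    lines: List[OrderLine] = field(default_factory=list)
    # System-specific data, keyed by the name of the system that added it.
    extensions: Dict[str, Dict[str, Any]] = field(default_factory=dict)


order = OrderDocument(
    order_id="1234",
    customer_id="ABC",
    program_id="VLR",
    lines=[
        OrderLine(
            product=ProductDetail("XSP", "Product XSP (legal name placeholder)", is_license=True),
            quantity=10000,  # this sender has no price, and sends the order anyway
        )
    ],
    extensions={"billing-system": {"invoice_batch": "B-77"}},
)
```

Keeping extensions in their own keyed section means a receiver that only understands the canonical fields can ignore them, while a closely related system can pick up its own block.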

At a point in time - Recognize that your documents are not static.  Provide dates and version numbers for each and every document and allow a document to be called back up on the basis of those dates and version numbers.  This is key to being able to recreate a data stream later in time, an operational necessity that is often overlooked.  So, yes, your order has a version number. 
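As a sketch of what "called back up on the basis of those dates and version numbers" might mean in practice, the following keeps every version of a document and answers both "give me version N" and "what did it look like on this date".  The `DocumentHistory` class, its method names, and the sample dates and quantities are hypothetical, and a real store would be persistent rather than in memory.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass(frozen=True)
class VersionedDocument:
    entity_id: str
    version: int
    as_of: datetime
    body: dict


class DocumentHistory:
    """Hypothetical append-only history of document versions."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[VersionedDocument]] = {}

    def save(self, entity_id: str, as_of: datetime, body: dict) -> VersionedDocument:
        """Store a new version; version numbers start at 1 and never get reused."""
        history = self._versions.setdefault(entity_id, [])
        doc = VersionedDocument(entity_id, len(history) + 1, as_of, dict(body))
        history.append(doc)
        return doc

    def by_version(self, entity_id: str, version: int) -> Optional[VersionedDocument]:
        for doc in self._versions.get(entity_id, []):
            if doc.version == version:
                return doc
        return None

    def as_of(self, entity_id: str, moment: datetime) -> Optional[VersionedDocument]:
        """Return the latest version dated at or before the given moment."""
        candidates = [d for d in self._versions.get(entity_id, []) if d.as_of <= moment]
        return max(candidates, key=lambda d: d.as_of) if candidates else None


history = DocumentHistory()
history.save("order-1234", datetime(2006, 1, 10), {"quantity": 10000})
history.save("order-1234", datetime(2006, 2, 1), {"quantity": 12000})

snapshot = history.as_of("order-1234", datetime(2006, 1, 20))
```

With both lookups available, replaying a data stream later just means asking for each document as it stood at the time of the original event.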

Allow for questions: as complete as your order document is, it will still need to have codes in it referring to other things.  For example, each product may have a product family.  By including the product family code, you are stating this: "At the time this order was placed, product 'SharePoint' was part of the 'Office Family' of products."  For some products, this may not change much, but for others, it could.  So you include the product family, but there is no need to include attributes of the product family.  The receiving system can ask the same infrastructure for product family details if it needs to follow up.

Hopefully, with these simple guidelines, we can build the ECDM at the right level of abstraction.