Clarifying the Concept of Metadata

Metadata is a difficult word to define, or so it would appear.  After all, why is it that the best that Wikipedia can do is:

Metadata (meta data, or sometimes metainformation) is "data about data", of any sort in any media. An item of metadata may describe an individual datum, or content item, or a collection of data including multiple content items and hierarchical levels, for example a database schema. (Source: Wikipedia)

Ewww. 

OK, to be fair, that's not the best definition I could find, but I did want to point out one problem with the way that metadata is perceived in the world of computing.  Metadata is usually defined in some lame way (data about data) and then described through examples, to help people understand the meaning of the term.  Those examples are often incomplete or even misleading, like the example above or this example from further down in the same Wikipedia article:

In the context of an information system, where the data is the content of the computer files, metadata about an individual data item would typically include the name of the field and its length. Metadata about a collection of data items, a computer file, might typically include the name of the file, the type of file and the name of the data administrator. (Source: Wikipedia)

What is so bad, you might say, about this example?

After all, it conveys the concept of "data about data" using terms that the reader may find familiar.  That is true, but there are much more powerful, complete, and correct examples available, ones that would provide a richer context for understanding metadata, and perhaps for understanding how important metadata can be.  In addition, by supplying a small set of metadata elements, the example tells such a small part of the story that it errs by omission.  It would be akin to describing a political leader as "human" and supplying their date of birth.  There is considerably more information that is both useful and simple to collect.

For example, let's say that someone sends you an e-mail, and in that e-mail is a document.  The name of the document is "Functional Specification for Vista." You could draw all kinds of conclusions from that bit of metadata (the title). 

Now, you open the document and you find that it is a short document (one page).  On that page is a list of the people, and their business functions, who proofread text submitted to a commercial printing company called Vista Printing.  Or even better, it turns out it is a PowerPoint deck, created by a salesman, that shows pictures of how Vista Printing functions!

After examining the metadata, what was missing from our understanding of this document?

We could (and I argue, should) know a great deal more about an artifact like this one, before we can say that we understand it.  We need to know, at least, who, what, when, where, why, and how. 

  • Who created the artifact?
  • Who is the intended recipient?
  • Who has accessed it?  (And when did they access it, and what process were they performing when they did?)
  • What business process called this artifact into existence?
  • What business outcome was it intended to support?
  • When was it created (date)?
  • When was it created (in what relation to the beginning of the process instance in which it was created)?
  • Where was it created (physical systems used to create it)?
  • Where is its 'official' storage location on a network (its address or URL)?
  • Why was it created (name of the process activity that required this artifact as input)?
  • Why was it created (description of the personal objective of the creator in creating it)?
  • How was it created (using what tools and techniques)?
  • How was it created (using what thinking / creative / collaborative process)?
  • How was it created (using what audit / change control / approval process)?
  • How was it paid for (which goes to the motivation of the person who desired its creation)?
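
To make the list above a little more concrete, here is a minimal sketch of how a handful of those questions might be captured as a record.  The class name, the field names, and the choice of which questions to include are all my own assumptions for illustration, not a standard; a real repository would pick the fields that its business processes actually need.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime
from typing import List, Optional


@dataclass
class ArtifactMetadata:
    """Hypothetical record answering who/what/when/where/why/how for one artifact."""
    # Who
    created_by: str
    intended_recipient: Optional[str] = None
    # What
    business_process: Optional[str] = None
    intended_outcome: Optional[str] = None
    # When
    created_at: Optional[datetime] = None
    # Where
    storage_url: Optional[str] = None
    # Why
    requiring_activity: Optional[str] = None
    # How
    tools_used: List[str] = field(default_factory=list)
    approval_process: Optional[str] = None

    def unanswered(self) -> List[str]:
        """Return the names of fields that have not been filled in."""
        return [name for name, value in asdict(self).items() if value in (None, [])]


# The slide deck from the example: we know very little beyond its author and tool.
spec = ArtifactMetadata(created_by="a salesman", tools_used=["PowerPoint"])
print(spec.unanswered())
```

Printing the unanswered fields makes the gap visible: a title and an author leave most of the questions open.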

While it is true that the creation date and file name are, technically, metadata, they are far from reasonable examples to help people understand the concept of metadata.

To this end, I'll suggest an alternate definition: one that I believe is simple, easy to read, and provides a better understanding than 'data about data.'  It goes like this:

Metadata is the surrounding contextual information required for a person or system to "understand" an element of information in the context for which it was intended.  

Metadata answers fundamental questions about a bit of information, such as who created it, who may access it, what it refers to, why it was created, and how it should be used. 

A sufficient amount of metadata is captured when a consumer of a data element is able to correctly place the information in context, even if that consumer is using the information for a purpose other than the one for which it was originally intended.  The list of fields considered 'sufficient' for one purpose may not be sufficient for another.
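
The idea that 'sufficient' depends on the purpose can be sketched as a simple check.  The purposes, the field names, and the required-field sets below are hypothetical; they only illustrate that the same metadata can be complete for one consumer and incomplete for another.

```python
# Hypothetical purposes and the metadata fields each one needs.
REQUIRED_FIELDS = {
    "retrieval": {"title", "created_by", "business_process"},
    "compliance_audit": {"created_by", "created_at", "approval_process"},
}


def missing_for(purpose: str, metadata: dict) -> set:
    """Return the fields required for `purpose` that are absent or empty."""
    required = REQUIRED_FIELDS[purpose]
    present = {key for key, value in metadata.items() if value not in (None, "")}
    return required - present


doc = {
    "title": "Functional Specification for Vista",
    "created_by": "a salesman",
    "business_process": "sales presentation",
}

# Sufficient for retrieval: nothing is missing.
print(missing_for("retrieval", doc))
# Not sufficient for an audit: two fields were never captured.
print(sorted(missing_for("compliance_audit", doc)))
```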

Of course, you may not want, or need, all of those questions answered.  It can be difficult to capture everything, and some of those difficult items may not be beneficial for any of the consumers of that information. 

On the other hand, it would be very simple, in many cases, to capture quite a bit of metadata, nearly always more information than we normally capture.  Capturing this information, and using it appropriately, can help in the automation of business processes, the correct categorization of information for retrieval (and use), demonstration of compliance with a standard or business rule, and the maintenance of appropriate information security.

The data that we frequently collect, like the user who created a record, or the date it was updated, is not interesting if there is no business process that ties to that data, either as a producer or consumer.  How many database tables have columns for 'last modified date'?  How many business processes use that information?
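
One rough way to ask that question of your own system is to keep an inventory of which processes touch each captured field.  The field names and process names here are invented for illustration, and a real inventory would have to come from your own process documentation.

```python
# Hypothetical inventory: each captured field mapped to the business
# processes that produce or consume it.
FIELD_PROCESSES = {
    "created_by": ["access review"],
    "last_modified_date": [],
    "approval_process": ["compliance audit"],
}


def unused_fields(inventory: dict) -> list:
    """Return, in alphabetical order, the fields that no process is tied to."""
    return sorted(name for name, processes in inventory.items() if not processes)


print(unused_fields(FIELD_PROCESSES))
```

A field that shows up in this list is a candidate either for removal or for a process that finally puts it to use.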

So, whether you are working on a repository of information, or just creating the schema for a database, consider carefully what metadata you want to capture and how you want to capture it.  Ask yourself who, what, when, where, why, and how, both for fields and tables.  Ask these questions for all artifacts, including the documents, source code, and test execution logs.  Then, consider carefully if that information would be useful to a business process somewhere else in your system. 

Capture useful metadata.  You'd be surprised how valuable it can be.