Custom Defined Schemas

I've talked a lot about the value of "Custom Schema" support in Office. Anytime I give talks on the file formats, I make sure to spend some time also talking about the support for custom schema. I don't think I've really given the basic intro though on the difference between reference schemas and custom defined schemas in my blog, so if you haven't seen one of my presentations you may not know what I'm talking about when I refer to custom defined schemas. Each Office application has different levels of support, and it's good to start investigating the functionality...

There are of course an infinite number of valuable uses for Office documents. Obviously the Office applications are more than just a better typewriter. If you look at Word for example, one of the big investments has really been around making better looking documents. We've done a ton of work this release for instance to allow you to add great looking content to your document easily.

Making a great looking document though is only one part of making a document valuable. I think it's best to use a really simple example. Let's take the following document:

Using the basic Word functionality, it's pretty easy to format this document so that any human can easily look at it and understand the information that's being conveyed. It's clear that this is a conference report that was made on July 17th. You can quickly find out what the summary was as well as who attended the conference. This information can all be saved out as XML thanks to the reference schemas. In Word, we defined an XML schema called WordprocessingML that fully represents all of that formatting and layout information as XML. The reference schemas are used for conveying all the application-specific information:

So the information that says "John Doe" is bold text and "Health Agency" is italic text can be represented as XML thanks to the reference schemas:

<w:p>
<w:r>
<w:rPr> <w:b /> </w:rPr>
<w:t>John Doe</w:t>
</w:r>
<w:r>
<w:rPr> <w:i /> </w:rPr>
<w:t>Health Agency</w:t>
</w:r>
</w:p>

The reference schemas are great for representing all of the application's information. Everything you do in Word, you want to be saved out and persisted, and the reference schemas allow for that (you don't lose anything when saving as XML). In a word processing document the reference schemas are used to convey all the display-oriented information like bold, italics, paragraphs, tables, styles, etc. Reference schemas enable long term archive-ability of the formats as well as interoperability with other applications and solutions. This is provided there is good documentation around the reference schemas, which we will have via the Ecma process.
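
To make that concrete, here's a minimal Python sketch (nothing that ships with Office) that reads the paragraph above and reports which runs are bold or italic. The namespace URI bound to the w: prefix and the describe_runs helper are assumptions for illustration, since the snippet above doesn't show its namespace declaration.

```python
import xml.etree.ElementTree as ET

# Assumption: the URI bound to the "w" prefix. The snippet in the post does
# not show its namespace declaration, so substitute the one your files use.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W_NS}

# The paragraph from the post, with the namespace declaration added so that
# it parses as a standalone XML document.
SAMPLE = f"""<w:p xmlns:w="{W_NS}">
  <w:r><w:rPr><w:b/></w:rPr><w:t>John Doe</w:t></w:r>
  <w:r><w:rPr><w:i/></w:rPr><w:t>Health Agency</w:t></w:r>
</w:p>"""


def describe_runs(paragraph_xml):
    """Return a (text, is_bold, is_italic) tuple for each run in a paragraph."""
    paragraph = ET.fromstring(paragraph_xml)
    runs = []
    for run in paragraph.findall("w:r", NS):
        text = run.findtext("w:t", default="", namespaces=NS)
        # A run is bold or italic if its run properties contain w:b or w:i.
        is_bold = run.find("w:rPr/w:b", NS) is not None
        is_italic = run.find("w:rPr/w:i", NS) is not None
        runs.append((text, is_bold, is_italic))
    return runs


if __name__ == "__main__":
    for text, is_bold, is_italic in describe_runs(SAMPLE):
        print(f"{text!r}: bold={is_bold}, italic={is_italic}")
```

All the program learns from this markup is how each run is formatted.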

The thing that you don't get with reference schemas though is the ability to easily structure content using your own semantics. In the above example, if you wanted to quickly search for all conference reports where "John Doe" had been an attendee, you'd be kind of stuck. Any type of business logic you wanted to run on these documents would be extremely difficult, because the reference schemas are there to allow humans to easily read the content, but not programs. Let's say you wanted to write a solution that took all the conference reports that "John Doe" had attended, and created a single document that was a list of all those conferences and the summary of each. If the application you are using doesn't support custom defined schemas, then you're stuck using features like style names, bookmarks, tables, or some other type of hack. Those approaches don't allow for any real hierarchy, and there isn't really a good way of specifying the style structure so that the right type of validation can be done. Up until the introduction of the custom defined schema support in Word 2003 though, those hacks were the only options people had. I've seen plenty of solutions people have built using all of those methods, some of which were extremely impressive given the constraints. Unfortunately though, they all fall short of the goal.

This is where the custom schema support comes in. If you really want to treat these documents as a source of data and integrate them with your business processes, you need the ability to structure them in your own schemas. You want to specify what the date was, who the attendees were, and even what department they worked for:

This is the advantage XML can give you. The combination of namespaces, XSDs, and even XPath allows you to add your own structures to the documents, validate those structures, and even navigate them so that they can integrate better with your solutions. With the custom defined schema support, you can get this kind of information out of the document:

<ConferenceReport>
<Date>3/24/2004</Date>
<Attendees>
<Attendee Name="John Doe">
<Department>
Health Agency
</Department>
<Potential>
<Sales>100</Sales>
<Growth>25%</Growth>
</Potential>
</Attendee>
</Attendees>
</ConferenceReport>

That's much more useful when you care more about the data than the presentation information. It represents the business information that's stored in the document rather than just the display information. This really helps to enable system integration.
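
As a rough illustration of that integration, here's a small Python sketch of the "find every conference report John Doe attended" scenario from earlier. It assumes the custom XML has already been pulled out of each document as a string. The file names, the second report, and the reports_attended_by helper are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: custom XML already extracted from two documents.
# The first entry mirrors the example in the post; the second is invented.
REPORTS = {
    "report-a.xml": """<ConferenceReport>
  <Date>3/24/2004</Date>
  <Attendees>
    <Attendee Name="John Doe">
      <Department>Health Agency</Department>
    </Attendee>
  </Attendees>
</ConferenceReport>""",
    "report-b.xml": """<ConferenceReport>
  <Date>4/2/2004</Date>
  <Attendees>
    <Attendee Name="Jane Roe">
      <Department>Finance</Department>
    </Attendee>
  </Attendees>
</ConferenceReport>""",
}


def reports_attended_by(reports, name):
    """Return (file name, date, department) for each report listing `name`."""
    matches = []
    for filename, xml_text in reports.items():
        root = ET.fromstring(xml_text)
        # ElementTree supports a subset of XPath, including attribute
        # predicates. This simple form assumes `name` has no apostrophe.
        attendee = root.find(f".//Attendee[@Name='{name}']")
        if attendee is not None:
            date = root.findtext("Date", default="")
            department = attendee.findtext("Department", default="").strip()
            matches.append((filename, date, department))
    return matches


if __name__ == "__main__":
    for filename, date, department in reports_attended_by(REPORTS, "John Doe"):
        print(f"{filename}: {date} ({department})")
```

The whole search comes down to one path expression against your own element names, with no guessing based on style names or bookmarks.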

It's really important to realize the potential of documents in your organization. Too often people think of documents as just being a bunch of formatted text, and don't see document collections as the valuable databases that they are. People spend a lot of time producing that valuable content, and you want to make sure that you can fully leverage that content. You'll see that in Office '12' we've done even more work to help you integrate with the data in your documents. In my post earlier this month I talked a bit about the new content controls in Word, and how you could bind those controls to your own XML data. There is a ton of momentum in this area that's been building up since Office 2000, and it just keeps getting better! I really love this stuff if you can't tell :-)

-Brian