Custom Defined Schemas


I’ve talked a lot about the value of “Custom Schema” support in Office. Anytime I give talks on the file formats, I make sure to spend some time also talking about the support for custom schema. I don’t think I’ve really given the basic intro though on the difference between reference schemas and custom defined schemas in my blog, so if you haven’t seen one of my presentations you may not know what I’m talking about when I refer to custom defined schemas. Each Office application has different levels of support, and it’s good to start investigating the functionality…

There are of course an endless number of valuable uses for Office documents. Obviously the Office applications are more than just a better typewriter. If you look at Word for example, one of the big investments has been around making better looking documents. We’ve done a ton of work this release, for instance, to allow you to add great looking content to your document easily.


Making a great looking document though is only one part of making a document valuable. I think it’s best to use a really simple example. Let’s take the following document:



Using the basic Word functionality, it’s pretty easy to format this document so that any human can easily look at it and understand the information that’s being conveyed. It’s clear that this is a conference report that was made on July 17th. You can quickly find out what the summary was as well as who attended the conference. This information can all be saved out as XML thanks to the reference schemas. In Word, we defined an XML schema called WordprocessingML that fully represents all of that formatting and layout information as XML. The reference schemas are used for conveying all the application specific information:



So the information that says “John Doe” is bold text and “Health Agency” is italic text can be represented as XML thanks to the reference schemas:


<w:p>
    <w:r>
        <w:rPr><w:b /></w:rPr>
        <w:t>John Doe</w:t>
    </w:r>
    <w:r>
        <w:rPr><w:i /></w:rPr>
        <w:t>Health Agency</w:t>
    </w:r>
</w:p>


The reference schemas are great for representing all of the application’s information. Everything you do in Word, you want to be saved out and persisted, and the reference schemas allow for that (you don’t lose anything when saving as XML). In a wordprocessing document the reference schemas are used to convey all the display-oriented information like bold; italics; paragraphs; tables; styles; etc. Reference schemas enable long term archive-ability of the formats as well as interoperability with other applications and solutions. This is provided there is good documentation around the reference schemas, which we will have via the Ecma process.


The thing that you don’t get with reference schemas though is the ability to easily structure content using your own semantics. In the above example, if you wanted to quickly search for all conference reports where “John Doe” had been an attendee, you’d be kind of stuck. Any type of business logic you wanted to run on these documents would be extremely difficult, because the reference schemas are there to let humans easily read the content, not programs. Let’s say you wanted to write a solution that took all the conference reports that “John Doe” had attended, and created a single document that was a list of all those conferences and the summary of each. If the application you are using doesn’t support custom defined schemas, then you’re stuck using features like style names, bookmarks, tables, or some other type of hack. Those approaches don’t allow for any real hierarchy, and there isn’t really a good way of specifying the structure so that the right type of validation can be done. Up until the introduction of the custom defined schema support in Word 2003 though, those hacks were the only options people had. I’ve seen plenty of solutions people have built using all of those methods, some of which were extremely impressive given the constraints. Unfortunately though, they all fall short of the goal.
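To make that limitation a bit more concrete, here is a rough Python sketch that reads the WordprocessingML fragment shown earlier. It is only an illustration: the fragment in this post doesn't declare the `w:` namespace, so the URI used below is an assumption, and the function name `runs_with_formatting` is made up for the example. The point is that all a program can recover is the text and its formatting. Nothing in the markup says that one run is an attendee and the other is a department.

```python
import xml.etree.ElementTree as ET

# Assumption: the post's fragment omits the namespace declaration, so one is
# supplied here to make the XML parseable.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

FRAGMENT = f"""<w:p xmlns:w="{W}">
  <w:r><w:rPr><w:b /></w:rPr><w:t>John Doe</w:t></w:r>
  <w:r><w:rPr><w:i /></w:rPr><w:t>Health Agency</w:t></w:r>
</w:p>"""


def runs_with_formatting(xml_text):
    """Return (text, [formatting flags]) for each run in a paragraph."""
    ns = {"w": W}
    paragraph = ET.fromstring(xml_text)
    result = []
    for run in paragraph.findall("w:r", ns):
        text = run.findtext("w:t", default="", namespaces=ns)
        props = run.find("w:rPr", ns)
        flags = []
        if props is not None:
            if props.find("w:b", ns) is not None:
                flags.append("bold")
            if props.find("w:i", ns) is not None:
                flags.append("italic")
        result.append((text, flags))
    return result


for text, flags in runs_with_formatting(FRAGMENT):
    # Only display information is available: no "attendee", no "department".
    print(text, flags)
```

A solution built on this has to guess at meaning from formatting ("bold probably means a name"), which is exactly the kind of hack described above.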


This is where the custom schema support comes in. If you really want to treat these documents as a source of data and integrate them with your business processes, you need the ability to structure them in your own schemas. You want to specify what the date was, who the attendees were, and even what department they worked for:



This is the advantage XML can give you. The combination of namespaces; XSDs; and even XPath allows you to add your own structures to the documents; validate those structures; and even navigate them so that they can integrate better with your solutions. With the custom defined schema support, you can get this kind of information out of the document:


<ConferenceReport>
    <Date>3/24/2004</Date>
    <Attendees>
        <Attendee Name="John Doe">
            <Department>
                Health Agency
            </Department>

            <Potential>
                <Sales>100</Sales>
                <Growth>25%</Growth>
                …
            </Potential>
        </Attendee>
    </Attendees>
</ConferenceReport>


That’s much more useful when you care more about the data than the presentation information. It represents the business information that’s stored in the document rather than just the display information. This really helps to enable system integration.
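As a rough illustration of the "navigate" part, here is a small Python sketch of the John Doe search described earlier, run against custom XML shaped like the example above. It is a sketch under assumptions rather than how Word exposes the data: the second report, the name "Jane Roe", the second date, and the function name `reports_attended_by` are all hypothetical, and the custom XML is assumed to have already been extracted from each document.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data. The first report mirrors the example in the post;
# the second is invented so that the search has something to exclude.
REPORTS = [
    """<ConferenceReport>
         <Date>3/24/2004</Date>
         <Attendees>
           <Attendee Name="John Doe">
             <Department>Health Agency</Department>
           </Attendee>
         </Attendees>
       </ConferenceReport>""",
    """<ConferenceReport>
         <Date>5/02/2004</Date>
         <Attendees>
           <Attendee Name="Jane Roe">
             <Department>Finance</Department>
           </Attendee>
         </Attendees>
       </ConferenceReport>""",
]


def reports_attended_by(name, reports):
    """Return (date, department) for each report listing `name` as an attendee."""
    matches = []
    for xml_text in reports:
        root = ET.fromstring(xml_text)
        # ElementTree supports a limited XPath subset, including
        # attribute predicates like [@Name='...'].
        for attendee in root.findall(f"./Attendees/Attendee[@Name='{name}']"):
            date = root.findtext("Date")
            department = attendee.findtext("Department", default="").strip()
            matches.append((date, department))
    return matches


print(reports_attended_by("John Doe", REPORTS))
```

Because the structure carries the meaning, the query asks directly for attendees by name instead of guessing from bold or italic text.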


It’s really important to realize the potential of documents in your organization. Too often people think of documents as just being a bunch of formatted text, and don’t see document collections as the valuable databases that they are. People spend a lot of time producing that valuable content, and you want to make sure that you can fully leverage that content. You’ll see that in Office ’12’ we’ve done even more work to help you integrate with the data in your documents. In my post earlier this month I talked a bit about the new content controls in Word, and how you could bind those controls to your own XML data. There is a ton of momentum in this area that’s been building up since Office 2000, and it just keeps getting better! I really love this stuff if you can’t tell 🙂


-Brian

Comments (16)

  1. Nas Hashmi says:

    Do you really think it is worth organizing a document like that? If there could be a way word automatically recognizes letters and applies labels such as address or signature or position, then it will be useful. I think it is a waste of time because as far as I can see, I would rarely use it.

    Simply searching thru my documents for John Doe is enough. I can look thru them for what I need. I will only get so many results that it will not be hard to search thru them.

  2. BrianJones says:

    Nas, in the above example, it’s true that it may not be worth adding the additional structure. I used the above example because it was a really simple document which made it easy to show the difference between Reference schemas and Custom schemas.

    The value of the additional structure often relates to the value of the document itself. This isn’t a scenario where you as an end user would want to do it so that you could search through your files easier. It’s a scenario where your company would want to set up templates so they could manage hundreds of thousands of documents. The approach people have taken up until now is to use meta-data; but that isn’t tied directly to the content, so you have to force the user to take an additional step of filling out the meta-data. Pushing data back into the document for document generation scenarios is also a lot easier with the custom XML support.

    Separately though, the smartTag technology that we’ve shipped since Office XP does the type of autorecognition you mention. There are smartTag recognizers that we ship out of the box, but it’s completely extensible so anyone could write a recognizer that automatically flags names; addresses; phone numbers; etc. The smartTags are marked with XML, so you get the same type of benefits.

    -Brian

  3. Keith Soltys says:

    Does this mean that Word will now support DocBook or DITA, two schemas that are widely used in technical writing? I had high hopes for WordML, but it turned out to be pretty much useless for technical documentation, though it’s useful for other things. But it would sure be nice to be able to use Word with a DITA or DocBook schema, and run XSLT on the XML.

  4. BrianJones says:

    Hi Keith, the key to making WordML useful in technical writing is to just understand the constructs of the format. If you want to work in a schema like DocBook, you should find the structures in DocBook that can be mapped to similar structures in WordML, and just transform back and forth. Then, for any additional structures that don’t map, leverage the custom defined schema support. I posted back in the summer about how you can get started working with custom schema in WordML: http://blogs.msdn.com/brian_jones/archive/2005/07/26/443572.aspx

    It’s important to note though that our goal with the custom schema support was not to turn Word into an XML editor like XMetal. Instead we wanted to bring structure to existing Word scenarios. We’ve seen a large number of customer solutions where they were using things like styles and bookmarks to imply semantics for certain portions of their documents, and we wanted to make that easier and more robust. You can work with any schema you want in Word, but you’ll find that the more complex the schema is, the harder it will be to work with (as you probably expected).

    -Brian

  5. Yawar says:

    This is an amazing coincidence for me because I’ve been working at a bank where they have a whole truckload of documents which they manually type in and format every time it’s needed — e.g. a monthly report on deposit accounts — and I’ve just recently grasped, reading Word 2003’s documentation, how Word is able to arrange data like this.

    The potential of this technology for businesses which need to produce many different types of documents, but all based on information pulled out of a database, is huge. Imagine servers which generate your documents for you in Word XML format. All you have to do is specify some variables (like a range of dates for which data is needed, a particular branch, or a particular client) and you’re served a tailor-made Word (or Excel or PowerPoint …) document. Simply amazing.

    I will be following this technology very closely for the foreseeable future.

  6. Anisotropo says:

    Maybe with an Optimus Keyboard it’ll be more comfortable and thus it’ll be worth it.

    Anyway I think a really good equation editor should be a top priority. And I mean an equation editor with a quality similar to LaTeX. Not pictures-equations, please.

  7. Randy Brown says:

    Brian, could you elaborate on how you’ve expanded Word 12 to allow programmatic infusion of data within a Web Application? I am currently doing this in a Web app with Word 2003 XML docs and custom XML schemas. I load the XML into a DOM and use a custom class that allows me to throw any business object or reader at the class and fill the Word XML document with the data from the class or reader.

    It really works well but the code is somewhat involved in that I have to load up the WordML and my custom schemas and call a ProcessNodes function that uses XPath queries to find nodes named similar to the class or reader field name.

    Will this get easier in Office 12?

    Also, would love to save this out to PDF programmatically, what are the chances of that? Will there be a class that I can use from .Net to do this?

    Sorry for the cross post…found these XML articles after the fact.

    Thanks

  8. BrianJones says:

    Hey Randy,

    It should get easier, but that will depend on your scenario. If you are able to leverage the content controls and map them to your data, then all you have to do is access your data directly (since it’s stored separately from the rest of the content). That makes it much easier than in Office 2003.

    As far as saving out as PDF programmatically, Office 12 supports PDF output, but it requires Office on the machine (which isn’t supported on the server).

    -Brian

  9. Links to blog posts that contain useful technical information for developers.  Open XML is a new standard, but there’s some good information already available if you know where to look.

  10. This is the third post by Zeyad Rajabi who owns the XHTML output from Word’s new blogging feature. In…

  11. If you’re heading out to TechEd this week like I am, you should definitely plan on attending Tristan…

  12. Thomas Olsson says:

    We have developed an application that creates charts with time series. The application has support for OLE and users frequently embed charts from our application in Word or PowerPoint.

    We have added support for "Document Summary Properties" in order to make indexing of our own documents available to Desktop Search. This works fine and is very popular.

    But users also want to search for charts that are embedded using OLE in Office documents. Is there any way we can make that possible? Can we implement IFilter on the OLE objects, supply Office with some searchable properties or perhaps we can make use of XML somehow?

  13. Oleg Krupnov says:

    Brian,

    Is there a way I could modify the custom schema dynamically, while it is attached to an open document?

    Suppose I have created a custom schema which defines a set of possible elements in a document. Then the user wants to extend that set and add his specific elements. In order for the document to keep validating against the schema, the schema needs to be modified. Can it be done? If not, what other options would you suggest for my task?

    What I tried so far was disconnecting and re-connecting a schema, but I lose all tags in that case, so it won’t do. Also, you can close Word and purge its cache (somewhere in Local Settings) so that the new schema takes effect. This is tedious too…

  15. I posted earlier this year on the support for custom defined schema in wordprocessingML via the new content

  16. I was just looking at Karel De Vriendt's ODEF (Open Document Exchange Formats) Workshop Conclusions