Word XML's Context Free Chunks: Building a document from multiple pieces of content

This is a somewhat obscure feature that I like to point out every now and then. It's great if you are interested in building up a Word document from multiple pieces of content, a common scenario I've talked with a number of folks about over the years. In the design of WordprocessingML it was clear that we should make it easier to bring document fragments with rich formatting into an existing Word XML document without a bunch of extra cleanup work around style name conflicts and the like. This is why we created the context free chunk element: it lets people insert a block of content whose style and list definitions are defined locally for that chunk rather than for the entire document.

Example Scenario:

The simplified version of the scenario is that you want to dynamically generate a document by bringing in content from other sources. The contents of the document could depend on a number of outside factors, such as who is authoring it, what they are writing about, what the conditions of the market are, etc. For example, imagine you work for an investment bank and you want a solution that automatically generates a report template based on the company, the industry, and the analyst who is going to write the report. To do this, you need to bring in content from all over the place and use it to create the document. There may be disclosure clauses that you want to insert, a chart that shows historical financial figures, a boilerplate description of the company, etc.

Basics

This is just one example of something I've talked to a number of people about supporting. Rather than dig into the scenarios more though, I want to talk about one bit of functionality we had in the WordprocessingML schema for Office 2003 that was designed to address this case. I was planning to talk about this a bit later as I wanted to talk about more intro-level stuff first, but thanks to a great post last week by John Durant, I figured I would briefly describe it now.

One of the difficult problems with almost any document format is deciding how to move new content in. This sounds like it should be easy, but there are a number of issues to deal with. In the WordprocessingML schema, all styles and list definitions are declared at the beginning of the file. Then in the content below, the various objects (tables, paragraphs, text runs) can reference those styles. This is a very common model (similar to CSS in HTML), which means there isn't a ton of repeated data. The problem comes when you add new content somewhere in the body. If you can't define everything local to that content, then you need to parse the style definitions of the source document, make sure that all the styles referenced in your new content are already declared in the target, and guarantee there are no collisions.

Target Document:

Let's say you had a file that looked like this:

Introduction

  1. First item in the list
  2. Second item in the list

The XML for that might look something like this (I'm just going to use shorthand so this isn't really a valid Word XML file):

<wordDocument>
  <styles>
    <style styleId="H1">
      <fontSize val="18"/>
    </style>
  </styles>
  <lists>
    <list id="1" type="1, 2, 3"/>
  </lists>
  <body>
    <p style="H1">Introduction</p>
    <p list="1">First item in the list</p>
    <p list="1">Second item in the list</p>
  </body>
</wordDocument>

Source Document:

Now let's say we want to have a solution that adds the following content:

Disclaimer

It is important to understand the following issues:

  1. Something confusing
  2. Something else confusing

The XML file for that would probably look something like this:

<wordDocument>
  <styles>
    <style styleId="H2">
      <fontSize val="16"/>
    </style>
  </styles>
  <lists>
    <list id="1" type="a, b, c"/>
  </lists>
  <body>
    <p style="H2">Disclaimer</p>
    <p>It is important to understand the following issues:</p>
    <p list="1">Something confusing</p>
    <p list="1">Something else confusing</p>
  </body>
</wordDocument>

Getting the result we want:

So, if our goal is to add the content from the source into the target to create a new document, we need to worry about a couple of things. The first is that our source document uses the "H2" style, which isn't defined in our target document. This means we'll need to update the style information for the target. The second problem is that our source document uses a list with id "1" that has "a, b, c" styled numbers. In the target document, there is also a list with id "1", but its number style is different. If we don't fix this up, then the two list items in the source would belong to the same list that's already in the target, and you would end up with this:

Introduction

    1.    First item in the list
    2.    Second item in the list

Disclaimer

It is important to understand the following issues:

    3.    Something confusing
    4.    Something else confusing

Obviously that isn't what we want. To correct this, we would need to modify the source document so that the list uses a different id, and then add the list definition to the top of the target document.
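To make that fix-up concrete, here is a rough Python sketch of the manual merge. It works on the shorthand XML used in this post, not on real WordprocessingML. The function name `merge_by_hand` and the way it picks a free list id are assumptions made up for illustration. Where a style id exists in both files, the sketch simply keeps the target's definition.

```python
import xml.etree.ElementTree as ET

# Shorthand documents from this post (not valid Word XML).
TARGET = """<wordDocument>
  <styles><style styleId="H1"><fontSize val="18"/></style></styles>
  <lists><list id="1" type="1, 2, 3"/></lists>
  <body>
    <p style="H1">Introduction</p>
    <p list="1">First item in the list</p>
    <p list="1">Second item in the list</p>
  </body>
</wordDocument>"""

SOURCE = """<wordDocument>
  <styles><style styleId="H2"><fontSize val="16"/></style></styles>
  <lists><list id="1" type="a, b, c"/></lists>
  <body>
    <p style="H2">Disclaimer</p>
    <p>It is important to understand the following issues:</p>
    <p list="1">Something confusing</p>
    <p list="1">Something else confusing</p>
  </body>
</wordDocument>"""


def merge_by_hand(target_xml, source_xml):
    """Hypothetical helper: append the source body to the target body,
    fixing up style and list definitions along the way."""
    target = ET.fromstring(target_xml)
    source = ET.fromstring(source_xml)

    # 1. Copy over style definitions the target does not declare yet.
    target_styles = target.find("styles")
    known = {s.get("styleId") for s in target_styles}
    for style in source.find("styles"):
        if style.get("styleId") not in known:
            target_styles.append(style)

    # 2. Give every source list an id that is unused in the target.
    target_lists = target.find("lists")
    used = {lst.get("id") for lst in target_lists}
    remap = {}
    next_id = 1
    for lst in source.find("lists"):
        while str(next_id) in used:
            next_id += 1
        new_id = str(next_id)
        remap[lst.get("id")] = new_id
        used.add(new_id)
        lst.set("id", new_id)
        target_lists.append(lst)

    # 3. Re-point the source paragraphs at the renamed lists, then append.
    target_body = target.find("body")
    for p in source.find("body"):
        if p.get("list") in remap:
            p.set("list", remap[p.get("list")])
        target_body.append(p)
    return target


merged = merge_by_hand(TARGET, SOURCE)
print(ET.tostring(merged, encoding="unicode"))
```

Even in this toy form there are three separate passes over the source, and a real Word XML file has far more kinds of definitions to reconcile than a style and a list.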

Easier way:

Of course there is an easier way to do all of this. Building up a document from multiple parts was an important scenario for us. Because of that, we created an element in our schema called a cfChunk. The cfChunk allows you to create a temporary "mini-document". You can place a cfChunk within an existing WordprocessingML file, and then within that cfChunk you can make new style and list definitions that apply locally to that chunk. When Word opens the file, we'll then merge that content with the rest of the file and take care of any conflicts. This is similar to what happens when you copy content from one document and paste it into another. If the style names match, then we'll inherit the definitions from the target document. If the style from the source doesn't yet exist, we'll create it. The cfChunk is one of those pieces of functionality that's rarely talked about, but it's extremely useful. I think the main reason it isn't talked about is that for someone to see the benefits of it, they need to already understand the basics of WordprocessingML.
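The style half of that merge rule can be pictured with a toy model in Python. This is only an illustration of the behavior described above, not how Word implements it. The `resolve_styles` name, the dict representation of a style, and the conflicting "H1" size of 24 are all made up for the example.

```python
def resolve_styles(target_styles, chunk_styles):
    """Toy model mapping style id -> definition.

    Matching name: the target document's definition is kept.
    New name: the style is created from the chunk's definition.
    """
    merged = dict(target_styles)
    for style_id, definition in chunk_styles.items():
        merged.setdefault(style_id, definition)
    return merged


target = {"H1": {"fontSize": 18}}
# Hypothetical chunk that redefines H1 and introduces H2.
chunk = {"H1": {"fontSize": 24}, "H2": {"fontSize": 16}}

result = resolve_styles(target, chunk)
print(result)
```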

So, in order to get the file looking like we want:

Introduction

  1. First item in the list
  2. Second item in the list

Disclaimer

It is important to understand the following issues:

  1. Something confusing
  2. Something else confusing

We just do this:

<wordDocument>
  <styles>
    <style styleId="H1">
      <fontSize val="18"/>
    </style>
  </styles>
  <lists>
    <list id="1" type="1, 2, 3"/>
  </lists>
  <body>
    <p style="H1">Introduction</p>
    <p list="1">First item in the list</p>
    <p list="1">Second item in the list</p>
    <cfChunk>
      <styles>
        <style styleId="H2">
          <fontSize val="16"/>
        </style>
      </styles>
      <lists>
        <list id="1" type="a, b, c"/>
      </lists>
      <body>
        <p style="H2">Disclaimer</p>
        <p>It is important to understand the following issues:</p>
        <p list="1">Something confusing</p>
        <p list="1">Something else confusing</p>
      </body>
    </cfChunk>
  </body>
</wordDocument>

Go ahead and try it out for yourself. You can imagine taking a template with a bunch of placeholder XML tags and posting it up on the server. Then your solution could just grab all the pieces of content you need, wrap them in a cfChunk tag, and swap them out with the placeholder XML tags in your template. I'm really pushing to extend this functionality for the new schemas in Word 12, so let me know if you find it useful or if there is some other kind of behavior you'd like to see added.
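Here is a small Python sketch of that placeholder swap, again using the shorthand XML from this post rather than a valid Word XML file. The `<placeholder name="..."/>` tag and the helper names `to_chunk` and `fill_template` are assumptions for illustration; a real solution would use whatever placeholder tags its template defines.

```python
import xml.etree.ElementTree as ET

TEMPLATE = """<wordDocument>
  <styles><style styleId="H1"><fontSize val="18"/></style></styles>
  <lists><list id="1" type="1, 2, 3"/></lists>
  <body>
    <p style="H1">Introduction</p>
    <placeholder name="disclaimer"/>
  </body>
</wordDocument>"""

DISCLAIMER = """<wordDocument>
  <styles><style styleId="H2"><fontSize val="16"/></style></styles>
  <lists><list id="1" type="a, b, c"/></lists>
  <body>
    <p style="H2">Disclaimer</p>
    <p list="1">Something confusing</p>
    <p list="1">Something else confusing</p>
  </body>
</wordDocument>"""


def to_chunk(source_xml):
    """Wrap a source document's styles, lists, and body in a cfChunk."""
    source = ET.fromstring(source_xml)
    chunk = ET.Element("cfChunk")
    chunk.extend(list(source))  # definitions stay local to the chunk
    return chunk


def fill_template(template_xml, fragments):
    """Swap each <placeholder name="..."/> in the body for a cfChunk.

    `fragments` maps a placeholder name to the XML of a source document.
    """
    template = ET.fromstring(template_xml)
    body = template.find("body")
    for index, child in enumerate(list(body)):
        if child.tag == "placeholder" and child.get("name") in fragments:
            chunk = to_chunk(fragments[child.get("name")])
            chunk.tail = child.tail  # keep the surrounding whitespace
            body[index] = chunk
    return template


filled = fill_template(TEMPLATE, {"disclaimer": DISCLAIMER})
print(ET.tostring(filled, encoding="unicode"))
```

Notice that the sketch never looks inside the fragment: there is no renaming of list ids and no copying of styles, because resolving those conflicts is left to Word when it opens the file.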

-Brian