Creating a rich Wordprocessing Document from a database

I've been planning to start pulling together some real-world examples of how people are using the Office Open XML formats to make document generation much easier. There are a lot of examples from Office 2003, and we're starting to see folks build even more powerful things with Office 2007. It's even something we're using to generate the Ecma spec.

Ever wonder how it's possible to generate and maintain a 4000 page spec that documents almost 10,000 elements, attributes, simple types, and enumerations? Well for us, I don't think it would have been possible without the Open XML formats themselves. It's kind of funny, but the method we are using to generate the Ecma TC45 spec for the Office Open XML formats is probably the ultimate bootstrap of those formats. There is a SQL database that stores every element, attribute, simple type, complex type, and enumeration for all the different schemas. That results in about 10,000 rows of data. The documentation for each of those pieces is stored in the database as a separate chunk of WordprocessingML markup. So when we want to produce a specific portion of the spec for review, we just compile the relevant rows to generate a rich .docx file. The entire group can then collaborate on that file (as a .docx, .doc, or .pdf), and when we have finished reviewing it, we just shred it back into the database.
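To make the "compile rows into a .docx" step concrete, here is a minimal sketch in Python. It is not the actual spec tooling: the `spec_items` table, its `name` and `markup` columns, the placeholder row contents, and the function names are all assumptions made for illustration, and the namespaces are the ones from the published standard. It only shows that a .docx is a ZIP package whose main part can be assembled by concatenating stored WordprocessingML fragments.

```python
import io
import sqlite3
import zipfile

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Minimal package plumbing: content types plus the root relationship.
CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)
RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships '
    'xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    '</Relationships>'
)


def demo_db():
    """Build a tiny in-memory stand-in for the spec database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE spec_items (name TEXT PRIMARY KEY, markup TEXT)")
    conn.executemany(
        "INSERT INTO spec_items VALUES (?, ?)",
        [
            ("spPr", "<w:p><w:r><w:t>Documentation for spPr</w:t></w:r></w:p>"),
            ("txBody", "<w:p><w:r><w:t>Documentation for txBody</w:t></w:r></w:p>"),
        ],
    )
    return conn


def compile_docx(conn, names):
    """Concatenate the stored markup for the requested rows into .docx bytes."""
    placeholders = ",".join("?" for _ in names)
    rows = conn.execute(
        f"SELECT markup FROM spec_items WHERE name IN ({placeholders}) ORDER BY name",
        list(names),
    ).fetchall()
    body = "".join(markup for (markup,) in rows)
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as pkg:
        pkg.writestr("[Content_Types].xml", CONTENT_TYPES)
        pkg.writestr("_rels/.rels", RELS)
        pkg.writestr("word/document.xml", document)
    return buf.getvalue()
```

Writing the result of `compile_docx(demo_db(), ["spPr"])` to a file named, say, `review.docx` should give a package containing only the requested row. A real spec would also need styles, numbering, and other parts that this sketch leaves out.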

This approach allows us to generate just pieces of the spec, or the entire 4000 pages, depending on what we want to review. It also gives the committee a lot of flexibility in terms of moving pieces of the spec around, tweaking element or attribute names, and updating documentation (we just change some properties in the database). This is all possible because of the Office Open XML formats and the new content controls support in Office 2007. The content controls let us structure the spec so that each element's name and description is easy to specify and modify, and so that we can still easily shred the document back into the database after the documentation is updated.
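The shredding step can be sketched the same way. In WordprocessingML a content control is a `w:sdt` element, with its properties in `w:sdtPr` (including an optional `w:tag`) and its content in `w:sdtContent`. The sketch below assumes, purely for illustration, that each control's tag holds the database key of the row it came from; the `spec_items` table, the `shred` function, and the sample document are hypothetical and not the actual spec tooling.

```python
import sqlite3
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"
ET.register_namespace("w", W_NS)  # keep the "w:" prefix when serializing

# A reviewed document in which one content control is tagged with its row key.
SAMPLE = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:sdt><w:sdtPr><w:tag w:val="spPr"/></w:sdtPr>'
    '<w:sdtContent><w:p><w:r><w:t>Revised description</w:t></w:r></w:p>'
    '</w:sdtContent></w:sdt>'
    '</w:body></w:document>'
)


def shred(document_xml, conn):
    """Write the content of each tagged content control back to its row."""
    root = ET.fromstring(document_xml)
    updated = []
    for sdt in root.iter(f"{W}sdt"):
        tag = sdt.find(f"{W}sdtPr/{W}tag")
        content = sdt.find(f"{W}sdtContent")
        if tag is None or content is None:
            continue  # untagged controls carry no key, so skip them
        key = tag.get(f"{W}val")
        # Each child is serialized on its own, so it carries its own
        # xmlns:w declaration in the stored fragment.
        markup = "".join(ET.tostring(child, encoding="unicode") for child in content)
        conn.execute("UPDATE spec_items SET markup = ? WHERE name = ?", (markup, key))
        updated.append(key)
    conn.commit()
    return updated
```

Keying on the tag rather than on position in the document is what would let reviewers move sections around freely: wherever a control ends up, its content still maps back to the same row.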

I'll definitely provide a lot more information on this process over the coming months, since it's a pretty good case study and I think there are a lot of reusable components. In the meantime, though, there are a couple of great articles by Erika Ehrli that describe the basics of how you can leverage content controls in Word 2007 in combination with the Open XML formats to create rich documents based on content from a database:

  1. Part 1 - One of the most common requirements for applications that work with data is "data-driven document generation." No matter what the data source is (an Access database, SQL database, Web service, SharePoint list, Excel spreadsheet, XML file, Word document, or multiple sources), the typical "export data and create Word documents" feature is a common need.
  2. Part 2 - You can build a server-side application using Visual Studio to generate data-rich documents using the Office XML File Formats and the .NET Framework 3.0 (aka Microsoft WinFX).

For those of you with access to the Beta 2 build of Office 2007, I strongly suggest you take a look at the content controls. They are a really powerful set of tools that, when combined with the Open XML formats, give you a ton of control in building solutions.


Comments (3)

  1. A says:

    I was actually wondering about that, after all I was getting lost just reading the thing 🙂

    Are there any plans to remove duplications from the documentation?  For example, the "spPr" element in the PresentationML is used by a large number of entities, but instead of having it documented only once, it gets duplicated for almost every entity that uses it.  At the very least, removing duplicates would cut down on the number of pages.

    Keep up the good work!  The documentation has proven to be invaluable so far and a huge step up from working with the binaries.

  2. BrianJones says:


    Actually, we did talk about that, and in the latest version of the spec we only write out an element once (if it has a matching name and complex/simple type). It has definitely cut back on a lot of the duplication.

