Tour of the Office Open XML draft spec

I recently mentioned on this blog that the Ecma TC45 committee had released Working Draft 1.3 of the Office Open XML Document Interchange Specification. Today, I'd like to give a little more information about what's covered and how it's organized. It's an intimidating document — over 4,000 pages — but it's much more polished and accessible than the version I wrote about on OpenXmlDeveloper.org last month.

Brian Jones's Blog

Brian Jones has posted some great information about the draft spec recently, so if you haven't read these posts you should check them out:

5/18: Draft 1.3 of the Ecma Office Open XML formats standard
5/24: 4000 pages of documentation

Note in particular the comments on that second post. Brian has been putting a lot of time into careful explanations of the details, and there's a great deal of information in that thread on all sides of the ODF/Open XML debate. The dialog you see there is a great example of how quickly interest in Office Open XML has grown since the release of the draft spec and Office 2007 Beta 2.

Getting Your Copy

Anybody can download a copy of the draft spec from this page. It's available in both PDF and Office Open XML format (of course). The DOCX version is much smaller than the PDF version, so download that one if you have Word 2007 installed.

How big is this document? The best way to summarize that is to observe that the Table of Contents itself is 97 pages long. That's a big document! In the remainder of this post, I'll cover a few highlights of specific areas you might want to focus on, for those of us who aren't likely to actually read it all the way through. In other words, everyone except Brian. :-)

Page Numbers

In a document this size, you'll want to jump to specific page numbers as indicated in the Table of Contents, rather than scrolling through the document. Here's a tip, if you're opening the draft spec in Word: wait until the entire document has been paginated (you'll know it's done when the page count in the lower left corner reaches 4,081). Then you can use Ctrl-F followed by Alt-G (corresponding to Find and Go To), enter a page number, and press Enter. Once the entire document is paginated, you can use the page numbers from the TOC as they are, rather than adding 98 to them to allow for the TOC itself.

Friday 5/26, 8:40AM: added this section and corrected the page numbers referenced below.

Tour of the Draft Spec

The subject matter of the draft spec really starts on page 10, with the Overview (section 8). If you're entirely new to Office Open XML, you'll want to read the overview in its entirety. Don't worry, it's only 5 pages, and it covers all the basics of how an Office Open XML document is structured in a ZIP archive full of parts, all tied together by relationships.
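To make the "ZIP archive full of parts, tied together by relationships" idea concrete, here is a minimal Python sketch. It builds a tiny stand-in package in memory, lists its parts, and reads the package-level relationships. A few assumptions to flag: the relationships namespace URI is the one used in the later published standard and may differ in Draft 1.3, the relationship Type is a made-up example.com placeholder, and the sample is not a valid Word document (among other things, it has no [Content_Types].xml part).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of relationship parts in the published standard.
# Assumption: Draft 1.3 may use a different URI.
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def build_sample_package() -> bytes:
    """Build a tiny, hypothetical package in memory (not a valid document)."""
    rels = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<Relationships xmlns="{REL_NS}">'
        '<Relationship Id="rId1" '
        'Type="http://example.com/relationships/officeDocument" '  # placeholder type
        'Target="word/document.xml"/>'
        '</Relationships>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("_rels/.rels", rels)            # package-level relationships
        zf.writestr("word/document.xml", "<document/>")  # stand-in main part
    return buf.getvalue()


def list_parts(data: bytes) -> list[str]:
    """Return the names of all entries (parts) in the ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()


def package_relationships(data: bytes) -> list[tuple[str, str, str]]:
    """Return (Id, Type, Target) for each package-level relationship."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("_rels/.rels"))
    return [
        (rel.get("Id"), rel.get("Type"), rel.get("Target"))
        for rel in root.findall(f"{{{REL_NS}}}Relationship")
    ]


if __name__ == "__main__":
    package = build_sample_package()
    for name in list_parts(package):
        print("part:", name)
    for rel_id, rel_type, target in package_relationships(package):
        print("relationship:", rel_id, "->", target)
```

To try this on a real file, read the bytes of a .docx, .xlsx, or .pptx saved from Office 2007 and pass them to list_parts and package_relationships instead of the sample.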

After the overview, you'll find three sections on the three main markup languages (MLs) in Office Open XML: WordprocessingML (as used by Word), SpreadsheetML (as used by Excel), and PresentationML (as used by PowerPoint). These sections are clear, straightforward discussions of the key concepts, without getting bogged down in the thousands of details. For example, the WordprocessingML section covers basic concepts such as stories, paragraphs, runs, and sections. The explanations are thorough, with lots of cool diagrams to make things crystal-clear. Here's an example, from the SpreadsheetML section.

Next comes section 12, a discussion of the supporting MLs. There's some information about DrawingML, but most of the other topics in this section are marked "Yet to be supplied."

Section 13, Packages, is a great 10-page overview of the structure of a package, and required reading for developers who will be working with Office Open XML documents. Note that this section (as opposed to the next three) is only concerned with the aspects of the packaging convention that apply to all three types of Office Open XML documents.

Sections 14/15/16 cover the specific relationship items and parts that are used by WordprocessingML documents, SpreadsheetML documents, and PresentationML documents, respectively. So this information, when combined with section 13 (Packages), completes the description of the architecture of each type of Office Open XML document.
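As a rough illustration of how parts and their relationship items fit together, the Python sketch below shows two small helpers. The first finds the relationships part that belongs to a given part, and the second resolves a relationship target relative to its source part. The "_rels" folder and ".rels" suffix follow the packaging convention as I understand it from the published standard. The helper names are my own, and the part names in the usage lines (word/document.xml, styles.xml, and so on) are illustrative rather than quoted from the draft.

```python
import posixpath


def rels_part_for(part_name: str) -> str:
    """Return the name of the relationships part for part_name.

    Pass an empty string to get the package-level relationships part.
    """
    folder, base = posixpath.split(part_name)
    return posixpath.join(folder, "_rels", base + ".rels")


def resolve_target(source_part: str, target: str) -> str:
    """Resolve a relationship Target against the part that owns it."""
    if target.startswith("/"):
        # Absolute targets are relative to the package root.
        return target.lstrip("/")
    base_dir = posixpath.dirname(source_part)
    # Relative targets are resolved against the source part's folder.
    return posixpath.normpath(posixpath.join(base_dir, target))


if __name__ == "__main__":
    print(rels_part_for("word/document.xml"))
    print(rels_part_for(""))
    print(resolve_target("word/document.xml", "styles.xml"))
    print(resolve_target("word/document.xml", "../docProps/app.xml"))
```

Package entries always use forward slashes, which is why the sketch uses posixpath rather than os.path.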

Section 17 covers how DrawingML is incorporated into the parts/relationships architecture of the package. As with the three sections that preceded it, this is all about implementation of the packaging convention and doesn't cover the details of the ML itself.

Section 18 covers some of the shared package-level concepts that apply to all types of Office Open XML documents: how objects are embedded, how hyperlinks are handled, and so on.

Section 19 is where the actual documentation of the MLs begins: every tag, every attribute, every little detail. The amount of information here is overwhelming, so you'll want to search for the specific terms that you're most interested in. The WordprocessingML section is the most complete, with over 1,300 pages of documentation. SpreadsheetML (section 20) is over 600 pages, and PresentationML adds another 250+ pages. I'd imagine these last two sections will grow quite a bit before the spec is complete.

Note that the documentation of the MLs follows the same structure we've seen above: WordprocessingML, then SpreadsheetML, PresentationML, supporting MLs (such as DrawingML), and other related topics. This order is used consistently throughout the draft spec.

The documentation in these sections is mostly text, with hundreds of tables, and some embedded images where appropriate. For example, page 1618 shows a few of the images that can be used to create an art border. Looks like there was a proud parent involved in the design of some of those options. :-)

Continuing with the detailed documentation of the MLs, section 22 offers over 800 pages on DrawingML, followed by 200 pages on VML in section 23. Section 24, "Package Files Reference Material," covers things like document properties and how package relationships are specified. In other words, the low-level details of the package itself, as opposed to the various MLs that are contained within the package parts.

Section 25, "Shared MLs Reference Material," appears to be incomplete, but there's a bit of information starting on page 3965 that may be of special interest to developers: how to validate custom schemas used in an Office Open XML document. Custom schema support is an area where Office Open XML is especially flexible.

The remaining 100+ pages include sections covering other schemas, interoperability issues, undefined behaviors, bibliography, and index. These sections are clearly incomplete in this version of the draft.

Summary

It's a lot of information to digest, but I hope this overview helps give you a feel for where to look for the topics that you're most interested in. Like most people, I'm new to Office Open XML, so I figured I'd share my learning experiences here and maybe that will save somebody else a little time. It took several hours of looking around in the draft spec just to get a feel for what's there, frankly.

So dive in and start learning, and when you run into something you don't understand, post a question on OpenXmlDeveloper.org.