Where is the documentation for Office's docx/xlsx/pptx formats? Part 1: Office 2007

Since we started using XML-based file formats, we’ve seen a huge uptick in the number of customers who want to read and write those files from applications that aren’t Office. That’s great, because that was one of the things we had hoped for when we first started using them.

The first question customers ask when starting to read and write Office files is: Where is the documentation for the Office 2007 and 2010 file formats? It’s on MSDN, with the rest of your file format documentation, right?

Well… wrong. Let me start with Office 2007 first, and I’ll cover Office 2010 in the next post. I do this to build my audience into a frenzy of excitement. It’s like Eastenders.

Office 2007 and ECMA-376

Office 2007’s default file format is not something that belongs to Microsoft – it’s actually an international standard, ECMA-376 (also called “Ecma Office Open XML”). Microsoft were very active in the creation of this standard, and we’re still the biggest implementers of it, but the documentation of this standard is owned and controlled by the standards organisation Ecma International, not Microsoft, and so it doesn’t appear with our file format documentation on MSDN.

The good news is that ECMA-376 is freely available and almost everything Office 2007 writes is covered by that standard. The standard is split into five parts:

| Part | Title | Contents |
| --- | --- | --- |
| 1 | Fundamentals | Each Office file contains many other segments (“Parts”). These contain things like the text of a document; the picture on your presentation and the PivotTable on your spreadsheet. This section of the standard describes each of these at a high level, and what they may contain. The way in which they get into the file in the first place is covered in Part 2. |
| 2 | Open Packaging Conventions | How Office files are composed – how to break a file into its constituent parts, and how those parts can relate to one another. |
| 3 | Primer | A plain-English summary (or, erm, “primer”) on how the various pieces fit together. My advice is to read this first – it has some great worked examples of what files contain and how to understand them and is easier bedtime reading than the reference-style material in the other parts. |
| 4 | Markup Language Reference | The nitty-gritty on how everything works at a low level. This is by far the largest part of the standard, and will tell you exactly what types of data can go in a spreadsheet cell; what data constitutes an embedded font; what order your styles have to go in; et cetera. If this was a programming language, think of Part 4 as the function reference. |
| 5 | Markup Compatibility and Extensibility | ECMA-376 contains an inbuilt mechanism for adding arbitrary extensions to files. Office 2007 doesn’t use this mechanism, so you should only read this part if you’re wanting to read Office 2010 files, create your own arbitrary extensions or read someone else’s extensions. More about this in the next post. |

The standard contains the documentation for all three formats: spreadsheet, wordprocessing and presentation. In reality they are very similar. The packaging and extension mechanisms are almost identical, so it makes sense to have them in the one standard. Only once you get into the details in Part 4 are they broken up into separate sections.
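To make the shared packaging idea a little more concrete, here is a minimal sketch in Python. It assumes only what Part 2 describes in outline: a package is a ZIP archive holding named parts, plus a root relationships part that points at the main document. The helper names (`build_toy_package`, `list_parts`, `main_document_target`) are made up for this illustration, and the toy package it builds is deliberately tiny and is not a complete document that Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the packaging layer.
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_toy_package() -> bytes:
    """Build a tiny illustrative package in memory (hypothetical helper)."""
    content_types = (
        f'<Types xmlns="{CT_NS}">'
        '<Default Extension="rels" '
        'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
        '<Override PartName="/word/document.xml" '
        'ContentType="application/vnd.openxmlformats-officedocument.'
        'wordprocessingml.document.main+xml"/>'
        "</Types>"
    )
    root_rels = (
        f'<Relationships xmlns="{REL_NS}">'
        '<Relationship Id="rId1" '
        'Type="http://schemas.openxmlformats.org/officeDocument/2006/'
        'relationships/officeDocument" '
        'Target="word/document.xml"/>'
        "</Relationships>"
    )
    document = (
        f'<w:document xmlns:w="{W_NS}"><w:body><w:p><w:r>'
        "<w:t>Hello</w:t></w:r></w:p></w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", content_types)
        zf.writestr("_rels/.rels", root_rels)
        zf.writestr("word/document.xml", document)
    return buf.getvalue()


def list_parts(package_bytes: bytes) -> list:
    """Return the names of every entry in the package, sorted."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        return sorted(zf.namelist())


def main_document_target(package_bytes: bytes):
    """Follow the root relationship of type officeDocument to its target."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        rels = ET.fromstring(zf.read("_rels/.rels"))
    for rel in rels.findall(f"{{{REL_NS}}}Relationship"):
        if rel.get("Type", "").endswith("/officeDocument"):
            return rel.get("Target")
    return None


if __name__ == "__main__":
    pkg = build_toy_package()
    for name in list_parts(pkg):
        print(name)
    print("main part:", main_document_target(pkg))
```

The same two functions should work on the bytes of a real docx, xlsx or pptx file, since all three use the same packaging layer; only the target of the main relationship differs.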

If you’re starting with nothing, you might have to read all of the separate parts of the standard. You can implement code to read and write files from scratch, of course, but as always I’d encourage you to stand on the shoulders of giants and use some of the existing tools to read and write files. Eric White has written a number of blog posts about such tools, so that's a good place to start. If you do build upon stuff created by others, the chances are Parts 3 and 4 will be where you spend the most time, as someone else will have done the grunt work on the rest of it.

Implementer Notes

With our other file formats (XLS, DOC et cetera), there’s just one document to read. This is because Microsoft is the keeper of both the product and the documentation – if we find a discrepancy, we’ll change the documentation to fix the error or add a footnote pointing out the product behaviour. Because ECMA-376 is an international standard, we aren’t able to simply edit it, and so we maintain a second document explaining where Microsoft’s implementation differs from that standard. These are our “implementer notes”, and they’re freely available here. Developing these types of notes is a common courtesy for implementers of standards - for an example in a completely different world, here are British Telecom's "Supplier Information Notes" about aspects of their xDSL implementation.

If you’re generating ECMA-376 documents for consumption by others, we’d encourage you to create some similar notes for your implementation, so that your customers can interoperate with you easily too.

ECMA-376 fairly closely represents what Office 2007 writes, and I have to admit that it’s been quite a while since I last had to look up an implementer note when delving into files created by Office 2007. However, if interoperating with MS Office is your top priority, you should take a look at the implementer notes before relying heavily on a feature. One example that I’ve mentioned in the past is grouping and outlining on spreadsheets.

ECMA-376 allows up to 256 levels of these, while Office 2007 will only use seven or fewer. This behaviour is documented on page 276 of the implementer notes:

The standard defines the outlineLevel attribute without specifying its maximum value.
Office specifies the outlineLevel attributes maximum value is 7.

If interoperability with Office 2007 is a priority for you, it would be wise to read the implementer notes for each ECMA-376 element you’re thinking of using before doing a lot of development work.
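If you generate spreadsheets yourself, a check like the following could catch the outlining case before Office ever sees the file. This is only a sketch: it assumes the worksheet XML is already in hand as a string, it looks only at the outlineLevel attribute on row and col elements, and the function name `find_deep_outlines` and the sample XML are invented for the illustration. The limit of 7 is the value quoted from the implementer notes above.

```python
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace.
SML_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"

# Maximum outlineLevel that Office accepts, per the implementer note quoted above.
OFFICE_MAX_OUTLINE_LEVEL = 7


def find_deep_outlines(sheet_xml: str, limit: int = OFFICE_MAX_OUTLINE_LEVEL):
    """Return (element, identifier, level) for rows and columns above the limit.

    Rows are identified by their r attribute, columns by their min attribute.
    """
    root = ET.fromstring(sheet_xml)
    problems = []
    for tag in ("row", "col"):
        for el in root.iter(f"{{{SML_NS}}}{tag}"):
            # A missing attribute means no outlining, i.e. level 0.
            level = int(el.get("outlineLevel", "0"))
            if level > limit:
                ident = el.get("r") or el.get("min")
                problems.append((tag, ident, level))
    return problems


# Hypothetical worksheet fragment, for illustration only.
SAMPLE_SHEET = (
    f'<worksheet xmlns="{SML_NS}">'
    '<cols><col min="1" max="1" outlineLevel="8"/></cols>'
    "<sheetData>"
    '<row r="1" outlineLevel="3"/>'
    '<row r="2" outlineLevel="9"/>'
    "</sheetData>"
    "</worksheet>"
)

if __name__ == "__main__":
    for tag, ident, level in find_deep_outlines(SAMPLE_SHEET):
        print(f"{tag} {ident}: outlineLevel {level} exceeds {OFFICE_MAX_OUTLINE_LEVEL}")
```

A file that fails this check may still be perfectly valid ECMA-376; the point is only that Office 2007 is documented as not going beyond level 7.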

The standard, ECMA-376 1st Edition, is now frozen in time and will not change. If Microsoft find any other ways that our implementation differs from that standard, we’ll publish new implementer notes, but we can never change the version of the standard that we targeted to be compliant with. In some ways this is great news for you as it means you’ll never have to update your implementation in order to remain conformant to the standard. However, to interoperate with Office, you’ll need to keep an eye on the implementer notes in case there are any changes. In reality we think we’ve found most of the notes we’ll need for Office 2007, but every so often we do find a bug that needs a new one.

But what if you want to interoperate with Office 2010 features? Well, you’ll have to wait for the next post. How am I doing with this cliff-hanger thing?