Alex Brown's research, AbiWord enhancements

In a recent blog post, Alex Brown looks at how well Office 2007 supports Open XML. As he explains,

I was excited to receive from Murata Makoto a set of the RELAX NG schemas for the (post-BRM) revision of OOXML, and thought it would be interesting to validate some real-world content against them, to get a rough idea of how non-conformant the standardisation of 29500 had made MS Office 2007.

It's an interesting question. Office 2007 supported the ECMA-376 standard, but many changes were made during the evolution from ECMA-376 to IS29500. How many of those changes affect the content in a typical large document?

Strict vs. Transitional

One of the changes made at the BRM a few weeks ago was to delineate two types of conformance for Open XML documents: strict and transitional. As stated in the first paragraph of Canada's proposal at the BRM on conformance classes,

Requests have been made to implement a more formal separation of “deprecated” features and to avoid the term “deprecated”. Canada proposes meeting these requirements by introducing strict and transitional conformance classes. Strict and transitional conformance classes determine verifiably different types of documents. The primary difference is that features from the proposed Annex A, “Selected Transitional Migration Features,” are prohibited from strict documents, but are allowed in transitional documents.

In other words, strict conformance requires that a document not use the features that exist for backward compatibility and are not recommended for new documents. Transitional conformance, by contrast, allows the use of anything and everything defined in the spec.

So Alex decided to take the main body of Part 4 of the ECMA-376 specification, which is available for download in DOCX form on the Ecma web site, and test it for strict and transitional conformance against the IS29500 RELAX NG schemas.

Test Results

The results were predictable: the document was not conformant to either class. Changes made at the BRM are not yet reflected in any existing implementations, and in this case the Ecma spec was created over a year before those changes were made. Here are the totals:

  • Validation against the strict schemas: 122,000 errors
  • Validation against the transitional schemas: 84 errors

Office 2007 was designed to be highly compatible with existing documents, so it uses features of Open XML that provide backward compatibility, including many of the elements and attributes that were moved to "transitional status" as a result of the BRM. So the test of strict conformance, although interesting, is a bit abstract: it's testing whether a document conforms to a subset of the spec that was defined after the document was created.

The second number is the more meaningful one. Those are places in the test document where something is done in a way that doesn't match the final IS29500 spec. Alex provides one specific example, to show the types of changes caught by that test: an attribute with a value of "on" that should say "true" instead, due to "one of the many tidying-up exercises performed at the BRM."
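To make the "on" versus "true" example concrete, here is a minimal Python sketch that scans an XML fragment for attributes whose value is "on" and suggests "true" instead. It is not how Alex ran his test (he validated against the RELAX NG schemas), and it is much cruder than schema validation, since it flags any attribute with that value regardless of its declared type. The element and attribute names in the sample are made up for illustration and are not taken from the Open XML spec.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the element and attribute names are invented
# for illustration and do not come from the Open XML schemas.
SAMPLE = '<doc><item flag="on"/><item flag="true"/><item flag="1"/></doc>'

# The only mapping mentioned in the post: "on" should now be "true".
LEGACY_VALUES = {"on": "true"}


def find_legacy_values(xml_text):
    """Return (tag, attribute, old_value, suggested_value) tuples
    for every attribute whose value appears in LEGACY_VALUES."""
    findings = []
    for element in ET.fromstring(xml_text).iter():
        for name, value in element.attrib.items():
            if value in LEGACY_VALUES:
                findings.append((element.tag, name, value, LEGACY_VALUES[value]))
    return findings


for tag, name, old, new in find_legacy_values(SAMPLE):
    print(f'<{tag}> attribute {name}="{old}" should be "{new}"')
```

A schema validator catches this kind of mismatch because the attribute's type no longer permits the old value; the scan above only imitates that for a single known case.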

To put that second number in perspective, there were 84 total errors in a document of 60,299,969 characters, which works out to roughly one error per 700,000 characters.
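For anyone who wants to check the arithmetic, here is a quick Python sketch using the two figures quoted above. The function name is just an illustrative helper.

```python
def chars_per_error(total_chars, error_count):
    """Average number of characters per validation error."""
    return total_chars / error_count


# Figures from the transitional validation run described above.
ratio = chars_per_error(60_299_969, 84)
print(f"about one error per {ratio:,.0f} characters")
```

The exact quotient is a little under 718,000, so "roughly 700,000" is a fair round figure.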

Alex's research is an interesting first step in understanding conformance for IS29500. Another step may eventually appear in the form of a test suite, as suggested by Italy and other countries. Such a test suite would be useful as more implementations become available.

Alex's post ends with a note that he intends to "repeat the exercise with ISO/IEC 26300:2006 (ODF 1.0) and a popular implementation of OpenDocument." He also asks "Will anybody be brave enough to predict what kind of result that exercise will have?" So far, no takers. Stay tuned.

Speaking of Implementations

Google recently unveiled the winning entries in its Summer of Code 2008, a program that offers student developers stipends to write code for various open source projects. Two of this year's winners are enhancements to the Open XML implementation in AbiWord.