Implementing document-format specifications


A few folks have pointed out that implementing every detail of the Office Open XML specification would be very difficult. And that’s certainly true — implementing 100% of a document-format specification is a daunting task.


A good example of the complexity of this task can be found in the Intel-sponsored ODF test suite developed by the University of Central Florida. In the Summary section, you’ll find links to over 300 specific issues regarding partial or missing implementation of ODF in OpenOffice and KOffice, with screenshots and descriptions of the issues.


In most situations, of course, a developer isn’t trying to implement 100% of a spec. For example, Mindjet’s integration of MindManager and Word 2007 through the use of Office Open XML only uses a tiny portion of the Office Open XML spec and went from concept to completion in just a few weeks.


Last night I saw another great example: a simple Open XML spreadsheet editor, developed by a college student here in Delhi. It allows the user to open an Open XML spreadsheet, edit values in a grid control or add new rows, and save the result as a valid Open XML spreadsheet. And although it’s written in C#, it doesn’t use the .NET 3.0 System.IO.Packaging API, instead opening the document as a simple ZIP archive. (I’ll write up that application in more detail later when I have a little time, and we’ll be covering it on the OpenXmlDeveloper site as well.)
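
To give a rough feel for the ZIP-only approach, here is a minimal sketch in Python. It is not the student's code (which is C# and which I haven't shown here), just an illustration under stated assumptions: the worksheet part is assumed to live at xl/worksheets/sheet1.xml, only plain numeric cells are handled (shared strings, formulas and styles are ignored), and the helper names read_numeric_cells, set_numeric_cell and make_toy_package are made up for this example. The toy package built for the demo contains only the worksheet part, so it is a stand-in for testing the read/edit/save loop, not a complete package that Excel would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace, and an ASSUMED location for the worksheet part.
NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
SHEET_PART = "xl/worksheets/sheet1.xml"

# Serialize the SpreadsheetML namespace as the default namespace on output.
ET.register_namespace("", NS)


def read_numeric_cells(package_bytes):
    """Return {cell reference: float} for untyped (numeric) cells in the sheet."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        root = ET.fromstring(zf.read(SHEET_PART))
    cells = {}
    for c in root.iter(f"{{{NS}}}c"):
        v = c.find(f"{{{NS}}}v")
        # Cells with a t="..." attribute (e.g. shared strings) are skipped.
        if v is not None and c.get("t") is None:
            cells[c.get("r")] = float(v.text)
    return cells


def set_numeric_cell(package_bytes, ref, value):
    """Copy every part into a new ZIP, rewriting one cell value on the way."""
    out = io.BytesIO()
    src = zipfile.ZipFile(io.BytesIO(package_bytes))
    with src, zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == SHEET_PART:
                root = ET.fromstring(data)
                for c in root.iter(f"{{{NS}}}c"):
                    v = c.find(f"{{{NS}}}v")
                    if c.get("r") == ref and v is not None:
                        v.text = str(value)
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            dst.writestr(item.filename, data)
    return out.getvalue()


def make_toy_package():
    """Build an in-memory ZIP holding only a tiny worksheet part (demo only)."""
    sheet = (
        f'<worksheet xmlns="{NS}"><sheetData>'
        '<row r="1"><c r="A1"><v>1</v></c><c r="B1"><v>2.5</v></c></row>'
        "</sheetData></worksheet>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(SHEET_PART, sheet)
    return buf.getvalue()


if __name__ == "__main__":
    package = make_toy_package()
    edited = set_numeric_cell(package, "A1", 7)
    print(read_numeric_cells(edited))
```

The point of the sketch is only that a ZIP library and an XML parser are enough to read a value, change it, and write the package back out. A real editor would also have to deal with the content-types and relationship parts, shared strings, and adding rows.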


The thoroughness of the Office Open XML specification gives developers all of the information they need to get the job done, and that’s a good thing. And there is functionality in the Open XML spec that no other document format provides, such as compatibility with billions of existing Office documents and a variety of ways to support custom-schema interoperability in documents. All of that functionality adds complexity, but most of the details are optional, so implementers don’t need to read or understand them. As the creator of the spreadsheet editor mentioned above told me, “I haven’t read 6000 pages in my entire life!” Kids these days. :-)


For those who criticize the size of the spec, an interesting rhetorical question (one I’ve not seen addressed anywhere) is “precisely which sections of the spec would you recommend be omitted?” That would probably lead to an interesting discussion of document-format priorities in general. To state the obvious, a spec can’t offer functionality that isn’t specified.


4/28/2008: updated link to ODF test suite.


Comments (7)

  1. Sean.McLellan says:

    Actually, could Microsoft start an effort to create an open or shared source implementation of Office Open XML?

    There’s John Tunnicliffe’s great effort over on CodePlex on SpreadsheetML, but it would be interesting if there were a funded effort going on with clear deliverable goals, similar to the P&P teams.

  2. Stephane Rodriguez says:

    Doug,

    Perhaps you can realize (or it’s already the case since you are a smart person), that the situation of some non-Microsoft people out there able to implement scenarios would be exactly the same should the "specs" not be available at all.

    In fact, the only difference between the new and old file formats is that they are ZIP based (except if they are password-protected). That alone allows quick read/write either by hand, or with code.

    At this point, I should add that the "innovation" came from the ODF guys first. Who themselves borrowed it from elsewhere. Microsoft simply levelled up the playing field by adopting a good thing : ZIP containers.

    The availability of the specs has nothing to do with those scenarios you are so proud to list.

    Make no mistake. Thanks.

  3. dmahugh says:

    Hi Stephane,

    I understand your point about developers being able to implement the formats without the spec.  And, frankly, many of the developers who have already done work around the formats have done it by reverse-engineering the details or copying and modifying code samples on OpenXmlDeveloper and other sources.  The spec covers some of the details that would be hard to figure out on your own, but most of those details aren’t relevant or necessary for most applications.

    The XML-in-ZIP packaging is rapidly becoming a common approach for a variety of formats, and there are ISVs who have been using that approach for years in addition to ODF (which implemented it first as you mention) and Open XML.  It’s a flexible approach, since every mainstream dev environment can handle ZIP packages and read/write XML.  The more the merrier!

  4. Stephane Rodriguez says:

    Doug,

    You ignored my point. I said that if the specs were not available at all, you would have the same developers doing the exact same thing they are doing now.

    The specs, which are bad and lame, have a different goal.

    Never mind there is no developer story in Office 2007. In fact, openxmldeveloper.org is the developer story. You know why. It’s because if there were some proprietary APIs that you guys had shipped, then in no way would have it been possible to submit the so-called specs for submission to an international standard org. That would fly back in your face since you can’t call a standard something that would require the use of a runtime made by Microsoft.

    We know now that ECMA simply rubber-stamped this thing, and that to this date there is no independent implementation of OOXML out there.

    This is a clear evolution from past Microsoft strategies. Microsoft tried to play with proprietary APIs and this worked quite well to kill Java (never mind the fact that .NET 1.0 is actually a rebranded version of the Windows-optimized Java runtime + BCL). Only because in the face of DOJ that did not play too well, Microsoft has been convicted, you guys have moved to markup language that secretly hides the API requirements behind it.

    Therefore the WYSIWYG HTML in Office 2000 which came with the IE5 proprietary extension, drumroll, known as VML.

    Never mind that Office 2007 has many more VML parts than older versions of Office ever had. Never mind that if you simply rely on the specs, the total chaos in current implementations of Word/Excel/PowerPoint is not obvious: some stuff uses VML, some DrawingML, some a combination, and nothing makes sense at all. Never mind that the rendering of VML is exclusive and proprietary to Microsoft. And you even had the balls to add VML parts where there was no such thing before: for instance, comments and OLE objects in spreadsheets.

    All this under the disguise of XML markup. There we can fast forward to Rick Jelliffe and other activists from the main XML portals out there.

    That’s where we are today. XML Markup getting abused with hidden semantics that are left for one to reverse engineer. Not only that will take years upon years of work for one to come up with something that replicates Office’s actual run-time, i.e. rendering purposes, if it succeeds, it will de facto be the evidence of a replication of some proprietary semantics therefore the basis for a lawsuit (the covenant not to sue does not apply to whatever is not described in said specs).

  5. dmahugh says:

    Stephane,

    I don’t feel like I ignored your point.  I pointed out that developers are doing what they’re doing regardless of the spec.  I’m really not sure how I could have echoed your point any more clearly, frankly.  I not only didn’t ignore you, I very explicitly agreed with you.

    And I’m not sure what to make of the litany of complaints you’ve mentioned.  These comments have nothing to do with the subject of my post, which was the general difficulty of implementing 100% of a document-format specification, regardless of whether that specification is Open XML, ODF, or anything else.

    If you want to start a new thread on a new topic, you need to do that somewhere else.  I just don’t have the time this week.  I’m packing my bags in my hotel room right now, heading for the airport in an hour, and will be on the 24-hour routine to get from Delhi back to Seattle after that.  Then I’ll catch a cab straight to the office from the airport, for a long day of work related to the conference we have going on this week, then home for a few hours sleep before heading over to the Washington State Convention Center to give presentations on Wednesday.  I just don’t have time for the emotional debate you apparently want to have right now, sorry.

    – Doug

  6. Stephane Rodriguez says:

    Doug,

    If that feels emotional, it should not.

    Don’t forget the title of your post: "Implementing document-format specifications". It is, I think, the first time that you have made a statement like this, this causal assertion between the existence of specifications and subsequent implementations.

    It certainly is a subject of debate. I see no evidence of that to be true. In my case, I had to develop diffopc just to make significant progress in my Excel 2007 generation component.

    I then took the example of VML, part of the specs, to explain what happens in practice.

    Anyway, happy journey.

  7. dmahugh says:

    Sorry for the tone there, Stephane, perhaps the long hours the last few days are wearing me down.  It may have been me that was getting emotional.  :-)

    Your point is well taken, and you’re right it is a subject of debate.

    Peace.