"Standardizing on XML" is far from useless!

The war of words over Massachusetts' proposal to standardize on the OASIS Open Document Format continues to rage, stimulating people on all sides to some quite remarkable feats of rhetoric, analysis, speculation, paranoia, and a bit of good ol' FUD.  I have made some small contributions to the debate, not addressing the substance of the dispute so much as picking up on the implications for what XML standardization means.  The theme running throughout my posts is that:

  • XML itself provides most of what Massachusetts is looking for in the way of openness and standardization, liberating information from the applications that created it, and making that information accessible far into the future.
  • A specification is a "standard" when people really can use products that implement it to send information around and safely assume that the recipients can receive and consume it.  Some standards come from vendors (PDF and the MS Word binary formats come to mind), some from standards organizations (IETF TCP/IP and W3C HTML come to mind), and some from nobody in particular or a community in general (RSS -- which is described by about 10 separate and sometimes competing specs -- and SAX come to mind).  Standards organizations, on the other hand, frequently produce purported standards that aren't the basis for real interoperability, e.g. XLink 1.0 and XML 1.1. In short, industry standards come from industries, not standards committees; calling something like OASIS ODF an "industry standard" is premature at best.

I was somewhat surprised to see Tim Bray, one of the editors of the XML specification itself, disagreeing with the first point.

...just standardizing on “XML” is laughably inadequate. XML just labels parts of files; it doesn’t tell you what they mean; by itself, it doesn’t do semantics at all. But interoperability and business value are all about shared semantics; for example, once everyone publishing documents onto the Internet decided to agree to use HTML, the Web revolution was born.

So “standardizing on XML” is useless; the business benefits are in standardizing on an actual individual set of tags and what they mean. For example, in Massachusetts’ case, OpenDocument 1.0.

To be sure, XML tags are just labels without any intrinsic meaning.  In the case at hand, however, the tags mostly describe how word processors display text or how spreadsheets process data; almost all modern software uses the same semantics for headings, comments, formulae, rows/columns, etc. I don't think there is any dispute that the OpenOffice operational semantics were carefully designed to be compatible with MS Office semantics as much as possible.  So, whatever the differences in XML markup between the various formats that MS Office supports and the OASIS ODF, they mostly reflect the same semantics, and the true but abstract point about XML markup not having semantics is irrelevant to this particular situation.

More importantly, the real semantics of a document are those conveyed between the author and the reader, and the software and markup know almost nothing about them. This is why I believe that Massachusetts is being overly restrictive in standardizing on a single XML document specification. Standardizing on the fundamental XML technologies (including XSLT and perhaps CSS) would allow any XML document with an associated stylesheet to be read with real semantic interoperability by anyone with a modern browser or XML-capable editor, irrespective of vendors, platforms, applications, or time.

Jon Udell, as is often the case, makes a similar point far better than I can:

Whether or not Microsoft decides to officially support the OpenDocument format, it's hard for me to imagine that transformations between it and Excel's SpreadsheetML -- for all the basic functionality, at least -- won't be trivial and ubiquitous.

Granted, I've internalized the idea of XML transformation to the point where I tend to regard two formats related by a transformation as, effectively, the same thing. But that's precisely the point. It's just data. Exposing it as XML matters more than how exactly we do that.

Read the rest of Udell's post to see a nice example of how someone saved data from Excel in XML format and processed it with an array of open source tools to solve a real information management problem.  That's exactly what Massachusetts wants to have the freedom to do; but XML makes it possible today, and Microsoft's commitment to XML in products such as Office  (and Sun's commitment in OpenOffice) makes it easy for our customers to reap and share these benefits, with or without universal document format standards such as ODF purports to be.
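To make the "two formats related by a transformation are effectively the same thing" idea concrete, here is a minimal sketch in Python using only the standard library.  Both markup vocabularies below are invented for illustration; they are not SpreadsheetML, ODF, or anything from Udell's post, and the function names are hypothetical.  The point is only that once the data is exposed as XML, relabelling it and then computing on it takes a few lines.

```python
import xml.etree.ElementTree as ET

# Hypothetical "format A": rows of cells, each cell carrying a type attribute.
FORMAT_A = """
<sheet name="budget">
  <row><cell type="text">rent</cell><cell type="number">1200</cell></row>
  <row><cell type="text">food</cell><cell type="number">450.5</cell></row>
</sheet>
"""

def a_to_b(xml_text):
    """Rewrite the invented format A as the equally invented format B."""
    src = ET.fromstring(xml_text)
    table = ET.Element("table", {"title": src.get("name", "")})
    for row in src.findall("row"):
        record = ET.SubElement(table, "record")
        for cell in row.findall("cell"):
            # Same information, different labels: the type moves
            # from an attribute value to the element name.
            tag = "num" if cell.get("type") == "number" else "str"
            ET.SubElement(record, tag).text = cell.text
    return table

def total(table):
    """Sum every numeric value found in a format-B table."""
    return sum(float(n.text) for n in table.iter("num"))

table_b = a_to_b(FORMAT_A)
print(ET.tostring(table_b, encoding="unicode"))
print(total(table_b))
```

A production conversion between real office formats would more likely be written in XSLT and would have far more cases to handle, but the shape of the work is the same.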

Tim Bray appears to miss another important point in his post when he writes (responding to an anti-ODF talking points memo that says "the proposed standard for documents is quite narrow, preferential, and may not enable optimal use of the data-centric standards."):

...this, I suppose, is code for “lets you use InfoPath ”. They might have a point here, except for I’ve seen outfits like Propylon build the same kind of application using OpenOffice data, and there are all sorts of standard approaches like XForms and so on just coming over the horizon. Abandoning the benefits of standardization to enable the use of a frankly experimental and unproven product from one vendor strikes me as a lousy trade-off.

I believe that both sides are referencing the fact that ODF is defined in terms of RELAX NG, a schema language that many believe is superior for human-readable documents but that does not even attempt to define programming-language datatypes or mappings to programming objects. MS Office defines its own formats, and accepts custom formats, in the W3C XML Schema language, which explicitly supports datatypes and for which many tools provide object mappings.  I think Tim is quite mistaken that this is really about InfoPath: MS Word, Excel, InfoPath [1], and for that matter the core tools that our group builds all support real-world business documents that are both human-readable and machine-generated / machine-processable.  The MS Office people can point to multiple success stories of how W3C Schema, which is often maligned by XML specialists, is crucial to solving real customer problems involving mapping XML documents to the databases or applications where their information lives.  There's no doubt that some people [2] can make ODF, RELAX NG, and real-world databases and applications work together, but most of the XML industry has focused on making this happen much more automatically by leveraging the mainstream XML and XML Schema specs.
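As a rough illustration of why declared datatypes matter for mapping documents to applications, here is a small Python sketch.  It does not use a real XML Schema processor (the standard library has none); the FIELD_TYPES table is a hand-written stand-in for what a schema would declare with built-in types such as xs:string, xs:decimal, and xs:date, and the Invoice class and element names are invented for the example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

# Stand-in for schema datatype declarations. A real XML Schema tool
# would read this information from an .xsd file instead.
FIELD_TYPES = {
    "customer": str,               # in the spirit of xs:string
    "amount": Decimal,             # in the spirit of xs:decimal
    "due": date.fromisoformat,     # in the spirit of xs:date
}

@dataclass
class Invoice:
    customer: str
    amount: Decimal
    due: date

def load_invoice(xml_text):
    """Map a hypothetical <invoice> document onto a typed object."""
    root = ET.fromstring(xml_text)
    values = {
        name: convert(root.findtext(name))
        for name, convert in FIELD_TYPES.items()
    }
    return Invoice(**values)

doc = (
    "<invoice><customer>Acme</customer>"
    "<amount>99.95</amount><due>2005-10-01</due></invoice>"
)
invoice = load_invoice(doc)
print(invoice)
```

With the types declared up front, the mapping code is generic; without them, every application has to decide for itself whether "99.95" is a string, a float, or a decimal.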

Microsoft has joined many companies, including our current opponents on the OASIS ODF issue, in a campaign to transform the computing landscape that started when XML was developed at W3C in the late 1990s.  The first phase of that campaign is almost over: the distinction between data and document is fast disappearing. Data is now born as XML, exchanged as XML, then queried, transformed, and persisted in XML-capable databases. XML now really is the basis of much of the interoperable data on the Web and within enterprise firewalls.

Microsoft is about to announce the details of our vision for the next phase. We've talked about some of it and dropped hints about others: User interfaces and workflow objects are moving to XML, which will let us weave workflow and layout with the data.  With new tools and techniques, we will give millions of mainstream developers (not just XML geeks!) the means to build amazing new applications that take advantage of the pervasive XML infrastructure we have built together.

"Standardizing on XML" is far from useless.  It is "just" labelled data, but that simple idea is the basis for lots of real-world utility today.  The world hasn't needed to standardize on the One True Office Document format, or settle the RSS wars, or resolve the REST vs Web Services controversy in order to build on XML's foundation and get its benefits. We all innovate as best we can on our separate platforms, using and improving our favorite programming tools, using different business models to pay the bills, but tying it all together with XML so that the customers don't have to care whose vision wins or loses.

[1] BTW, the "frankly experimental and unproven" InfoPath went into beta testing in late 2002, if I remember correctly, and was released in 2003.  The version of OpenOffice that supports OASIS ODF  is still in beta AFAIK.  

[2] Tim mentions the Irish company Propylon, which as near as I can tell is completely staffed by geniuses.

Mike Champion (who liberally borrowed ideas and text from Soumitra Sengupta)