"Standardizing on XML" is far from useless!

The war of words over Massachusetts’ proposal to standardize on the OASIS Open Document Format continues to rage, stimulating people on all sides to some quite remarkable feats of rhetoric, analysis, speculation, paranoia, and a bit of good ol’ FUD. I have made some small contributions to the debate, not addressing the substance of the dispute so much as picking up on the implications for what XML standardization means. The theme running throughout my posts is that:

  • XML itself provides most of what Massachusetts is looking for in the way of openness and standardization, liberating information from the applications that created it, and making that information accessible far into the future.

  • A specification is a “standard” when people really can use products that implement it to send information around and safely assume that the recipients can receive and consume it. Some standards come from vendors (PDF and the MS Word binary formats come to mind), some from standards organizations (IETF TCP/IP and W3C HTML come to mind), and some from nobody in particular or a community in general (RSS, which is described by about 10 separate and sometimes competing specs, and SAX come to mind). Standards organizations, on the other hand, frequently produce purported standards that aren’t the basis for real interoperability, e.g. XLink 1.0 and XML 1.1. In short, industry standards come from industries, not standards committees; calling something like OASIS ODF an “industry standard” is premature at best.

I was somewhat surprised to see Tim Bray, one of the editors of the XML specification itself, disagreeing with the first point.

…just standardizing on “XML” is laughably inadequate. XML just labels parts of files; it doesn’t tell you what they mean; by itself, it doesn’t do semantics at all. But interoperability and business value are all about shared semantics; for example, once everyone publishing documents onto the Internet decided to agree to use HTML, the Web revolution was born.

So “standardizing on XML” is useless; the business benefits are in standardizing on an actual individual set of tags and what they mean. For example, in Massachusetts’ case, OpenDocument 1.0.

To be sure, XML tags are just labels without any intrinsic meaning. In the case at hand, however, the tags mostly describe how word processors display text or spreadsheets process data; almost all modern software uses the same semantics for headings, comments, formulae, rows/columns, etc. I don’t think there is any dispute that the OpenOffice operational semantics were carefully designed to be compatible with MS Office semantics as much as possible. So, whatever the differences in XML markup between the various formats that MS Office supports and the OASIS ODF, they mostly reflect the same semantics, and the true but abstract point about XML markup not having semantics is irrelevant to this particular situation.
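A rough sketch of that point in Python may help. The two vocabularies below are made up for illustration (neither is real ODF or MS Office markup), as are the mapping tables and the `to_model` helper. The idea is only that when two formats label the same underlying notions, a small per-vocabulary table is enough to read both into one shared structure.

```python
import xml.etree.ElementTree as ET

# Two hypothetical vocabularies for the same content.
# Neither is real ODF or MS Office markup.
DOC_A = "<doc><h level='1'>Budget</h><p>Totals are below.</p></doc>"
DOC_B = "<text><heading outline='1'>Budget</heading><para>Totals are below.</para></text>"

# Per-vocabulary mapping: tag -> (shared role, attribute holding the heading level or None)
VOCAB_A = {"h": ("heading", "level"), "p": ("paragraph", None)}
VOCAB_B = {"heading": ("heading", "outline"), "para": ("paragraph", None)}


def to_model(xml_text, vocab):
    """Read a document into (role, level, text) tuples using a tag mapping."""
    model = []
    for el in ET.fromstring(xml_text):
        role, level_attr = vocab[el.tag]
        level = int(el.get(level_attr)) if level_attr else None
        model.append((role, level, el.text))
    return model


model_a = to_model(DOC_A, VOCAB_A)
model_b = to_model(DOC_B, VOCAB_B)
print(model_a == model_b)
```

The tags differ, but the parsed models are identical. The hard part in practice is the cases where the underlying semantics really do differ, which no mapping table can paper over.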

More importantly, the real semantics of a document are those conveyed between the author and the reader, and the software and markup knows almost nothing about them. This is why I believe that Massachusetts is being overly restrictive in standardizing on a single XML document specification — standardizing on the fundamental XML technologies (including XSLT and perhaps CSS) would allow any XML document with an associated stylesheet to be read with real semantic interoperability by anyone with a modern browser or XML-capable editor, irrespective of vendors, platforms, applications, or time.

Jon Udell, as is often the case, makes a similar point far better than I can:

Whether or not Microsoft decides to officially support the OpenDocument format, it’s hard for me to imagine that transformations between it and Excel’s SpreadsheetML — for all the basic functionality, at least — won’t be trivial and ubiquitous.

Granted, I’ve internalized the idea of XML transformation to the point where I tend to regard two formats related by a transformation as, effectively, the same thing. But that’s precisely the point. It’s just data. Exposing it as XML matters more than how exactly we do that.

Read the rest of Udell’s post to see a nice example of how someone saved data from Excel in XML format and processed it with an array of open source tools to solve a real information management problem.  That’s exactly what Massachusetts wants to have the freedom to do; but XML makes it possible today, and Microsoft’s commitment to XML in products such as Office  (and Sun’s commitment in OpenOffice) makes it easy for our customers to reap and share these benefits, with or without universal document format standards such as ODF purports to be.
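For a flavor of what that kind of processing looks like, here is a minimal sketch using only the Python standard library. The sample data is invented, and the markup is a trimmed-down stand-in that follows the general shape of Excel's XML spreadsheet output; treat the element and attribute names as assumptions to check against a real export. The `read_rows` helper is likewise just a name chosen for this example.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a spreadsheet saved as XML. The namespace and element
# names follow the general shape of Excel's XML spreadsheet format but are
# trimmed for illustration; verify them against a real file before relying on this.
SS = "urn:schemas-microsoft-com:office:spreadsheet"
SHEET = f"""
<Workbook xmlns="{SS}" xmlns:ss="{SS}">
  <Worksheet ss:Name="Expenses">
    <Table>
      <Row>
        <Cell><Data ss:Type="String">Travel</Data></Cell>
        <Cell><Data ss:Type="Number">120.50</Data></Cell>
      </Row>
      <Row>
        <Cell><Data ss:Type="String">Hosting</Data></Cell>
        <Cell><Data ss:Type="Number">79.50</Data></Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>
""".strip()


def read_rows(xml_text):
    """Return each row as a list of cell values, converting numeric cells to float."""
    root = ET.fromstring(xml_text)
    rows = []
    for row in root.iter(f"{{{SS}}}Row"):
        values = []
        for data in row.iter(f"{{{SS}}}Data"):
            is_number = data.get(f"{{{SS}}}Type") == "Number"
            values.append(float(data.text) if is_number else data.text)
        rows.append(values)
    return rows


rows = read_rows(SHEET)
total = sum(amount for _label, amount in rows)
print(rows, total)
```

Nothing here depends on the application that wrote the file; any XML parser on any platform can do the same.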

Tim Bray appears to miss another important point in his post when he writes (responding to an anti-ODF talking points memo that says “the proposed standard for documents is quite narrow, preferential, and may not enable optimal use of the data-centric standards”):

…this, I suppose, is code for “lets you use InfoPath”. They might have a point here, except for I’ve seen outfits like Propylon build the same kind of application using OpenOffice data, and there are all sorts of standard approaches like XForms and so on just coming over the horizon. Abandoning the benefits of standardization to enable the use of a frankly experimental and unproven product from one vendor strikes me as a lousy trade-off.

I believe that both sides are referencing the fact that ODF is defined in terms of RELAX NG, a schema language that many believe is superior for human-readable documents but does not even attempt to define programming-language datatypes or mappings to programming objects. MS Office defines its own formats, and accepts custom formats, in the W3C XML Schema language, which explicitly supports datatypes and for which many tools provide object mappings. I think Tim is quite mistaken that this is really about InfoPath: MS Word, Excel, InfoPath [1], and for that matter the core tools that our group builds all support real-world business documents that are both human-readable and machine-generated / machine-processable. The MS Office people can point to multiple success stories of how W3C Schema, which is often maligned by XML specialists, is crucial to solving real customer problems involving mapping XML documents to the databases or applications where their information lives. There’s no doubt that some people [2] can make ODF, RELAX NG, and real-world databases and applications work together, but most of the XML industry has focused on making this happen much more automatically by leveraging the mainstream XML and XML Schema specs.
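To illustrate why datatype information matters for this kind of mapping, here is a small sketch. The invoice vocabulary, the `FIELD_TYPES` table, and the `bind` helper are all invented for the example; the table is a hand-written stand-in for what a schema with datatypes would declare (roughly an integer, a date, and a decimal), and a real binding tool would derive it from the schema rather than have someone type it in.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

# Hypothetical invoice vocabulary, invented for this example.
INVOICE_XML = (
    "<invoice><number>1042</number>"
    "<issued>2005-09-30</issued><total>199.95</total></invoice>"
)

# Stand-in for the datatypes a schema would declare for each element.
# A real binding tool would read these from the schema itself.
FIELD_TYPES = {"number": int, "issued": date.fromisoformat, "total": Decimal}


@dataclass
class Invoice:
    number: int
    issued: date
    total: Decimal


def bind(xml_text):
    """Build a typed Invoice from XML using the declared field types."""
    root = ET.fromstring(xml_text)
    fields = {child.tag: FIELD_TYPES[child.tag](child.text) for child in root}
    return Invoice(**fields)


invoice = bind(INVOICE_XML)

# Without datatype information, every value is just a string.
untyped = {child.tag: child.text for child in ET.fromstring(INVOICE_XML)}
print(invoice, untyped)
```

With the types declared up front, the conversion to program objects can be generated mechanically; without them, someone has to write that table by hand for every document type.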

Microsoft has joined many companies, including our current opponents on the OASIS ODF issue, in a campaign to transform the computing landscape that started when XML was developed at W3C in the late 1990s. The first phase of that campaign is almost over: the distinction between data and document is fast disappearing; data is now born as XML, exchanged as XML, then queried, transformed, and persisted in XML-capable databases. XML now really is the basis of much of the interoperable data on the Web and within enterprise firewalls.

Microsoft is about to announce the details of our vision for the next phase. We’ve talked about some of it and dropped hints about others: user interfaces and workflow objects are moving to XML, which will let us weave workflow and layout with the data. With new tools and techniques we will give millions of mainstream developers (not just XML geeks!) the tools to build amazing new applications that take advantage of the pervasive XML infrastructure we have built together.

“Standardizing on XML” is far from useless.  It is “just” labelled data, but that simple idea is the basis for lots of real-world utility today.  The world hasn’t needed to standardize on the One True Office Document format, or settle the RSS wars, or resolve the REST vs Web Services controversy to build on XML’s foundation to get the benefits of XML. We  all innovate as best we can on our separate platforms, using and improving our favorite programming tools, using different business models to pay the bills, but tying it all together with XML so that the customers don’t have to care whose vision wins or loses.   

[1] BTW, the “frankly experimental and unproven” InfoPath went into beta testing in late 2002, if I remember correctly, and was released in 2003.  The version of OpenOffice that supports OASIS ODF  is still in beta AFAIK.  

 [2] Tim mentions the Irish company Propylon, which as near as I can tell is completely staffed by geniuses.

Mike Champion (who liberally borrowed ideas and text from Soumitra Sengupta)

Comments (9)

  1. Stephane Rodriguez says:

    "To be sure, XML tags are just labels without any intrinsic meaning. In the case at hand, however, the tags mostly describe how wordprocessors display text or spreadsheets processes data; almost all modern software uses the same semantics for headings, comments, formulae, rows/columns, etc."

    To take an analogy, it seems obvious from what you say that you have never written an html renderer. The semantics is everything. Describing the layout of objects or relations between objects always leaves its share of degrees of freedom. Just take very simple html, and render it in several different major web browsers : you’ll notice that the rendering is not the same (and that’s not just lazy coders who ignored the specs).

    Xml is not everything, it’s just the way that Tim Bray and others have found to fix the encoding and carriage return problems. So by all means, I would recommend storing content as Xml rather than plain text for that reason. But to extend the reasoning to descriptions of rich objects (including objects that have to be base64-encoded, obviously) like those found in word processors is quite a stance.

  2. Stephane Rodriguez says:

    Also on Excel. Excel’s XML is very limited, does not support charts, shapes and other objects. Only about data you said? I don’t think so. It’s more about proper round-trip, period.

    I happen to be writing Excel generators for a while already, and that’s not simple. For instance, I have been writing an extensive generator for almost two years, and still am not out of that yet. Whether I had decided instead to write my own spreadsheet format, it would have taken time a couple weeks to define whatever xml describing an "Excel" spreadsheet alternative. So what I am saying here is that you seem to rather ignore the fact that new formats like Xml back from Office 2000 and up, just like the new Xml in Office 12 has been the opportunity for the Office team to work on new codebase. Very sexy since in comparison a lot of people are simply maintaining code.

    I am pretty sure that Office 12 is just one step and expect Office 13 or 14 to overtake the existing native run-times with Avalonized and XPSes run-times. So perhaps Massachusetts understands that the semantics of the said formats are not going to settle anytime soon.

  3. Simon Phipps says:

    Is there a way to trackback here? I cited you in http://blogs.sun.com/roller/page/webmink?entry=a_study_in_framing

    Your arguments would all be fine if the users of office productivity suites were developers, but they are not. Applications need a single vocabulary – transforms are the domain of programmers, not document creators. Insisting that a developer technology gives end-users freedom from lock-in is a profound (if common) error.

  4. Alex says:

    What a load of Hogwash. Microsoft just can’t admit that they want to lock customers in!

  5. The complaint I keep hearing from the OSS advocates is that the Microsoft XML document formats are patented or something and this restriction would preclude an open source application from properly implementing the Microsoft format due to incompatibilities between the two licenses.

    If that’s true it seems like they have a legitimate complaint, if it’s not and OpenOffice or any application covered by the GPL could fully implement the Microsoft XML document format it seems they don’t.

  6. Michael Rys says in a comment on the previous post "I personally think

    that XQuery is not bad for having…

  7. Eduardo says:

    Mike, suppose that a significant portion of Office users decided that they would like Microsoft to add the optional ability to import and export OpenDocument files.

    Do you think Microsoft should do this? Do you think they would?

  8. taotao says:

    In regards to Tim Bray’s comment "…interoperability and business value are all about shared semantics…", I think that he is referring to "…the real semantics of a document…conveyed between the author and the reader". That’s what "shared semantics" intrinsically means. I think that any one with common sense would understand that.

    Shared semantics also means that we don’t need more XSLT interfaces to translate between data types in what has become a sea of incompatible data types that exists today.