People are reflecting on XML after 10 years

Although XML wasn't officially released as a W3C Recommendation until February 1998, the effort that led to it began about 10 years ago. One recent anniversary is that of the first public working draft, released on November 14, 1996. Another is its first public presentation at the SGML 1996 conference; the XML 2006 conference is being held at the same location and will include a retrospective by Jon Bosak, chair of the original XML WG. The IBM Systems Journal even has a special issue, Celebrating 10 Years of XML. Yet as the Slashdot thread that dissects it makes clear, not everyone is celebrating. (The most memorable bit is Tim Bray's annual denial that he invented XML: "There were 150 people in the debating society and 11 people in the voting cabal and 3 co-editors of the spec. Of the core group, I (a) was the loudest mouth, (b) was independent so I didn't have to get PR clearance to talk, and (c) don't mind marketing work." I personally suspect (b) was the most important consideration!)

A more balanced assessment of the special issue comes from Uche Ogbuji. It offers quite a nice summary of the very different points of view about what XML is good for and how it can be used. He reminds us that today's blog debates about simplicity/complexity, tight/loose coupling, static/dynamic typing, etc. reflect debates that go back to the very beginning. I particularly like his pushback on one article's assertion that XML leverages the value of "information hiding" in OO design:

XML is perhaps better viewed as the inverse of the classic interface technology. It emphasizes opening up data rather than hiding it. Rather than representing an extension of interface methodologies that define process while hiding data, XML is suited for data-driven interchange where the content being exchanged is the basis of the contract, and each side is free to apply application semantics in their own way.

For me, the most interesting result of reading all these articles and posts over the last few days was the link back to the Design Principles for XML document that guided the original XML working group.  How have these principles stood up to the test of 10 years of real-world experience?  A few [disclaimer: personal, not team!] thoughts:

1. XML shall be straightforwardly usable over the Internet. - The working group clearly achieved this objective, and there is no doubt that XML and the Web have been good for one another. Still, arbitrary XML on the Web is usually considered to have failed; non-wellformed (X)HTML still is the norm for web pages.  XML per se gets the most use on the Web as a syndication format (mostly RSS variants), but a significant percentage of that purportedly XML data is not yet well formed. Most notably, Tim Berners-Lee recently announced that the classic HTML standard would move ahead independently of XML-based XHTML, once groomed to be its replacement.  So, this principle has been a success, if not a knockout success, in practice.

2. XML shall support a wide variety of applications.   XML has been an unqualified knockout success with respect to this principle.

3. XML shall be compatible with SGML. I'm tempted to say, "yes, but who cares anymore?" That probably understates the importance of the principle, however: XML didn't get to be a success completely on its own merits, but largely pulled itself up by SGML's bootstraps. Still, I think (and I believe I express the XML Team consensus) that SGML's DTD heritage has been an anchor on XML's potential for a lot of reasons that probably deserve a post of their own. The most painful for us is the stark fact that DTDs are insecure - there's no way for software that fully conforms to the XML spec to securely accept data from untrusted sources. So, a Pyrrhic victory for this principle.

4. It shall be easy to write programs which process XML documents... Note: For this purpose, "easy" means that the holder of a CS bachelor's degree should be able to construct basic processing (parsing, if not validating) machinery in less than a week, and that the major difficulty in the application should be the application-specific functions; XML should not add to the inherent difficulties of writing such applications. - I quote this one at length because it's fairly clear that this principle has not been realized in practice. In about 2000, I actually worked with a holder of a CS bachelor's degree who was trying to develop an XML parser from scratch. Let's just say that it took a lot longer than a week; there's a lot of "folklore" one needs to understand to actually work with the XML spec. Furthermore, my experience in several companies is that non-specialists who implement XML parsers/serializers almost inevitably get some little details wrong. Even once the basic tools are implemented, there is a rather steep learning curve for real-world developers to become facile with XML technologies. Indeed, the whole tenor of the Slashdot snarkfest linked to above is that XML does add greatly to the inherent difficulties of writing applications (although not, IMHO, nearly as much difficulty as the lack of a data interoperability standard would add!).

5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.   The XML 1.0 spec appears to achieve this principle, since there is only one optional feature: DTD validation.  Unfortunately, however, there are a few legal permutations of what a non-validating parser can do with respect to entity expansion, external entity resolution, etc.  In practice, this has not had a serious impact on XML's success, although it is an annoyance for implementers.  What has had a detrimental effect on XML interoperability are the numerous permutations of the specs layered on top of XML 1.0 - the namespaces spec (which is incompatible with DTDs!); the different data models defined by XPath, DOM, the InfoSet, etc.; XML 1.1; and especially the various schema languages. 

6. XML documents should be human-legible and reasonably clear. Yup, "reasonably" being the operative word.

7. The XML design should be prepared quickly.   Quickly enough, and much more quickly than the working groups that came along afterwards!

8. The design of XML shall be formal and concise.   Probably formal and concise enough, but see the comment above on point 4.

9. XML documents shall be easy to create.  Hmmm.  Easy for SGML geeks to create by hand, perhaps.  Still,  it's not that hard for developers to write products that create correct XML (by leveraging the XML serialization libraries available for every platform), and we may be at an inflection point where most desktop software at least can create XML versions of their formats and many do so by default.  Once again, we haven't seen a clear cut knockout success for this principle, but XML is easy enough to create.

10. Terseness is of minimal importance. This may be the principle that has been the least useful of the 10. XML's verbosity assists when hand-authoring XML, but the overhead makes XML significantly less efficient to store and transmit than raw text ... and relatively little XML is hand authored, so relatively little of it can take advantage of the verbosity. Compression with a ZIP-like algorithm reduces this problem in many environments (e.g. office documents -- both OpenXML and ODF use zipped XML to minimize document size), but XML's lack of "terseness" is a real problem in other domains and has led to the development of many binary XML variants and a standardization challenge.
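
To make the terseness point a bit more concrete, here is a rough sketch in Python (standard library only) that serializes the same made-up records once as XML and once as comma-separated text, then runs both through zlib, a ZIP-like compressor. The element names, the record contents, and the record count are all invented for illustration; the actual numbers will vary a great deal with real data.

```python
import zlib
import xml.etree.ElementTree as ET


def build_samples(n=500):
    """Return (xml_bytes, text_bytes) holding the same n hypothetical records."""
    root = ET.Element("orders")
    rows = []
    for i in range(n):
        customer_id = str(1000 + i)
        quantity = str(i % 7 + 1)
        # XML form: every value is wrapped in a named start and end tag.
        order = ET.SubElement(root, "order")
        ET.SubElement(order, "customerId").text = customer_id
        ET.SubElement(order, "quantity").text = quantity
        # Raw text form: the same two values, separated by a comma.
        rows.append(customer_id + "," + quantity)
    xml_bytes = ET.tostring(root, encoding="utf-8")
    text_bytes = "\n".join(rows).encode("utf-8")
    return xml_bytes, text_bytes


xml_bytes, text_bytes = build_samples()
xml_zipped = zlib.compress(xml_bytes)
text_zipped = zlib.compress(text_bytes)

print("XML:      ", len(xml_bytes), "bytes raw,", len(xml_zipped), "bytes compressed")
print("Raw text: ", len(text_bytes), "bytes raw,", len(text_zipped), "bytes compressed")
```

On repetitive data like this the markup overhead mostly compresses away, which is why zipped office formats work reasonably well. It does nothing for the cost of parsing the uncompressed markup, though, and that is part of what the binary XML efforts are after.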

Overall, it's clear that the XML working group 10 years ago laid down principles that for the most part have been fundamental reasons for XML's success.  Some of the principles that achieved less success, e.g. usability by authors and programmers, are being addressed today with better APIs, tools, and applications.  Looking forward another 10 years, which of these principles might need reconsideration to guide a possible refinement / refactoring of the XML specs to solidify XML's long term success?  A few thoughts [again my personal opinions, not the team consensus!]:

  • The goal of having one standard for all and no optional features was laudable, and probably helped keep XML from forking too much in its critical early years.  (It has forked to some extent, e.g. SOAP defines a subset without doctypes, DTDs, or processing instructions).  Going forward, however, XML needs to have a manageable number of options / profiles to suit the needs of different audiences; better IMHO to be profiled in a manageable way than to have the XML world fragment completely, perhaps into JSON for the Web, a strongly typed binary XML for enterprise databases and services, and classic XML (perhaps with RELAX NG rather than DTD as the schema language) for loosely typed documents.
  • DTDs are clearly something that needs to be taken out of the core and put into a "dochead profile." The SGML legacy is not relevant to the overwhelming majority of actual XML users today.  DTDs are insecure, and somehow have to be deprecated from the essential core of XML.  That doesn't mean that they shouldn't be part of some XML standard, just that it should be possible to write a fully conformant "XML 2.0" processor that somehow ignores them, errors on them, or whatever.
  • The original XML design principles didn't say anything about a mechanism for linking XML trees into graphs. The original idea that an XLink standard would be written on top of XML didn't really pan out; in practice there are a number of mechanisms, including id/idref, simple HTML links, XLink / XPointer, RDF, and topic maps, that are used in different communities. Coming up with a simple, truly standard, reasonably powerful way of representing non-hierarchical models in XML seems like a priority for any future rework of the XML standards.
  • Efficiency also needs to be addressed. I remain skeptical about current efficient XML standardization efforts, but it is clear that different usage domains can come up with significantly "better" (in size, processing performance, application data fidelity, whatever their primary concern is) serialization formats than XML 1.0. I don't think we'll come up with a single efficient XML standard that suits all use cases, but defining some way to handle multiple binary XML formats, much as XML currently handles different character encodings via the encoding declaration, would help extend XML to domains where it is not getting traction because of efficiency concerns, and would do so with minimal cost to interoperability.
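
As an aside on the DTD bullet above: the "errors on them" behavior is something applications can already approximate today, outside of strict conformance. Below is a minimal sketch using Python's standard xml.parsers.expat module. The function name parse_without_dtd and the exception class DoctypeForbidden are my own inventions for illustration, and this is just one way to do it, not a recommendation from the team.

```python
import xml.parsers.expat


class DoctypeForbidden(ValueError):
    """Raised when a document contains a DOCTYPE declaration (hypothetical name)."""


def parse_without_dtd(data):
    """Parse XML bytes and return element names in document order.

    Any DOCTYPE declaration aborts the parse, so entity definitions in an
    internal or external DTD subset are never processed.
    """
    names = []
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Refuse the document as soon as the DOCTYPE is seen.
        raise DoctypeForbidden("DOCTYPE declaration not allowed: " + name)

    def on_start_element(name, attrs):
        names.append(name)

    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start_element
    parser.Parse(data, True)
    return names


print(parse_without_dtd(b"<order><item/></order>"))

try:
    parse_without_dtd(b'<!DOCTYPE order [<!ENTITY x "y">]><order>&x;</order>')
except DoctypeForbidden as err:
    print("rejected:", err)
```

A processor that behaves this way is deliberately stricter than XML 1.0 requires, which is exactly the tension described above: it gives up full conformance in exchange for safely accepting data from untrusted sources.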

XML is our group's bread and butter, of course, but we try to take an objective look at its strengths, weaknesses, and alternatives. We keep an open mind about the evolution of XML and its relationship to partially complementary / partially competitive technologies such as JSON, microformats, RDF/OWL, the entity data model, etc. We'd love to hear from you about whether XML is living up to your needs and expectations, what more we could do to make it usable and efficient for your data programmability needs, and whether you'd like to see it expanded, stripped down, refactored, or replaced over the next 10 years.

Mike Champion