I can just hear the sound of the long-time XML users’ mice clicking to get this horrible topic off the screen ASAP … but maybe people who are newer to XML will get something out of revisiting one of the oldest controversies: what should be done with text that purports to be XML but doesn’t meet the well-formedness criterion?
A bit of history. XML is descended from SGML, which requires every document to have a DTD and to be valid according to that DTD. SGML, however, is quite a bit more flexible and doesn’t have some of XML’s constraints, e.g. requiring a closing tag for every open tag. A VERY clever SGML developer can define a DTD that makes almost any particular clearly structured text format valid SGML, whether or not it has what we have come to know as tags. That is, even the angle bracket format that defines XML is merely the “reference concrete syntax” in SGML. XML per se is essentially a subset of the reference concrete syntax that proved useful in practice, which allowed it to drop the requirement that every document have a DTD. Instead, it introduced a new concept of well-formedness, which is a necessary property for all documents claiming to be XML even though validity is not.
When the details of XML were being debated in 1996-1997, one of the sharpest disputes was whether the spec should be tolerant of well-formedness errors or be “draconian” and immediately stop processing once a well-formedness violation is detected. The draconians won. Unfortunately, most of the arguments noted in favor of the draconian approach in the message cited above …
– well-formedness is so easy that it isn’t a significant burden on anyone,
– well-formedness is so much cheaper than compensating for its lack that compensation can never be a good trade-off,
– 15 minutes after the draconian browsers ship, everyone will have forgotten gratefully about the bad old days,
… turned out to be wrong in hindsight: Well-formedness is a significant burden for non-experts to get right, “compensation” is widely used in practice today, and the debate continued for years. Even today, pure draconian processors only work well in well-managed environments where producers of bad XML can be made to feel the pain in their wallets. Some sort of fixup or compensation scheme is almost always needed when processing XML targeted at humans.
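For readers newer to XML, here is a minimal sketch of what “draconian” looks like in practice. It uses Python’s standard-library `xml.etree.ElementTree` purely as an illustration of a conformant parser; the helper name `parse_draconian` and the sample documents are made up for this example and don’t come from any particular product.

```python
import xml.etree.ElementTree as ET


def parse_draconian(text):
    """Parse text as XML.

    Returns (root, None) on success, or (None, message) as soon as the
    parser reports a well-formedness error. No recovery is attempted.
    """
    try:
        return ET.fromstring(text), None
    except ET.ParseError as err:
        return None, str(err)


# Hypothetical sample inputs for illustration only.
well_formed = "<order><item>widget</item></order>"
ill_formed = "<order><item>widget</order>"  # <item> is never closed

root, error = parse_draconian(well_formed)
print("well-formed:", root.tag, error)

root, error = parse_draconian(ill_formed)
print("ill-formed:", root, error)
```

The second call yields no tree at all, only an error message. A single missing end tag is enough to make everything else in the document unavailable to the application, which is exactly the behavior the compensation camp objects to.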
So, developers of XML tools are in a bit of a bind: If they implement the spec as written they get complaints from those who want it to be easier to recover from and fix up XML errors. If they add the features the customers are asking for, test suites break and they would be unambiguously non-conformant to an important industry standard.
I don’t have a solution to propose. The “right thing” is for everyone to understand the details of how to produce well-formed XML and just get us into the nice clean world that the draconians predicted. Unfortunately, however, the Right Thing approach usually loses out to “worse is better.” Compensation / fixup schemes (e.g. the famous HTML Tidy program) work OK for human-readable documents but have scary implications for automatically processed data. I shudder to think of the implications if Microsoft supported technologies that made it easy for the unwary to fall into some horrific scenarios where the silent fixup of ill-formed pseudo-XML dramatically changed its intended meaning. But maybe we should make it easier for experts who know what they are doing (and take responsibility for it in their application scenario) to process XML-like text? That worries me too, but we hear the request from people who presumably are going to do some sort of fixup anyway … perhaps we should help them do it as well and as efficiently as possible?
I’d like to hear what you think: Is the problem of dirty XML data significant for you? Alternatively, does the draconian error processing rule mean that you can’t use conformant XML tools in the way you thought you could with “semi-structured” data? After this little history lesson, what do you think we could realistically do better to meet your needs for safe and easy-to-use data processing tools?