XML and Draconian Error Handling

I can just hear the sound of long-time XML users' mice clicking to get this horrible topic off the screen ASAP ... but maybe people who are newer to XML will get something out of revisiting one of the oldest controversies: what should be done with text that purports to be XML but doesn't meet the well-formedness criterion?

A bit of history. XML is descended from SGML, which requires that every document have a DTD and be valid according to that DTD. SGML, however, is quite a bit more flexible and doesn't have some of XML's constraints, e.g. requiring a closing tag for every open tag. A VERY clever SGML developer can define a DTD that makes almost any clearly structured text format valid SGML, whether or not it has what we have come to know as tags. That is, even the angle-bracket format that defines XML is merely the "concrete reference syntax" in SGML. XML per se is essentially the subset of the concrete reference syntax that proved useful in practice, which allowed it to drop the requirement that every document have a DTD. Instead, it introduced the new concept of well-formedness, which is a necessary property of all documents claiming to be XML even though validity is not.

When the details of XML were being debated in 1996-1997, one of the sharpest disputes was whether the spec should be tolerant of well-formedness errors or be "draconian" and immediately stop processing once a well-formedness violation is detected. The draconians won. Unfortunately, most of the arguments noted in favor of the draconian approach in the message cited above ...

- well-formedness is so easy that it isn't a significant burden on anyone,
- well-formedness is so much cheaper than compensating for its lack that compensation can never be a good trade-off,
- 15 minutes after the draconian browsers ship, everyone will have forgotten gratefully about the bad old days,

... turned out to be wrong in hindsight:  Well-formedness is a significant burden for non-experts to get right, "compensation" is widely used in practice today,  and the debate continued for years.  Even today, pure draconian processors only work well in well-managed environments where producers of bad XML can be made to feel the pain in their wallets.  Some sort of fixup or compensation scheme is almost always needed when processing XML targeted at humans.
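For readers newer to the debate, this is what the draconian rule looks like in practice. The sketch below uses Python's standard-library parser purely for brevity (any conformant processor, MSXML or System.Xml included, behaves the same way): the first well-formedness violation is a fatal error, and the application gets no partial result to work with.

```python
import xml.etree.ElementTree as ET

well_formed = "<invoice><part>A-113</part></invoice>"
ill_formed = "<invoice><part>A-113</invoice>"  # <part> is never closed

# A well-formed document parses into a tree as expected.
print(ET.fromstring(well_formed).find("part").text)  # A-113

# An ill-formed document is rejected outright: the parser raises an
# error and hands back nothing -- no "best guess" tree, no partial data.
try:
    ET.fromstring(ill_formed)
except ET.ParseError as e:
    print("fatal:", e)
```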

So, developers of XML tools are in a bit of a bind: if they implement the spec as written, they get complaints from those who want it to be easier to recover from and fix up XML errors. If they add the features those customers are asking for, test suites break and they become unambiguously non-conformant to an important industry standard.

I don't have a solution to propose. The "right thing" is for everyone to understand the details of how to produce well-formed XML and just get us into the nice clean world that the draconians predicted. Unfortunately, however, the Right Thing approach usually loses out to "worse is better." Compensation/fixup schemes (e.g. the famous HTML Tidy program) work OK for human-readable documents but have scary implications for automatically processed data. I shudder to think of the implications if Microsoft supported technologies that made it easy for the unwary to fall into some horrific scenario where the silent fixup of ill-formed pseudo-XML dramatically changed its intended meaning. But maybe we should make it easier for experts who know what they are doing (and take responsibility for it in their application scenario) to process XML-like text? That worries me too, but we hear the request from people who presumably are going to do some sort of fixup anyway ... perhaps we should help them do it as well and as efficiently as possible?
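To make the "silent fixup changes the meaning" risk concrete, here is a toy repair scheme: a naive tag-stack heuristic invented for this illustration (real tools like HTML Tidy are far more sophisticated). It appends missing end tags so the text parses cleanly, but the repaired structure may not be what the author intended.

```python
import re
import xml.etree.ElementTree as ET

def naive_fixup(text):
    """Append missing end tags in LIFO order (a toy heuristic)."""
    stack = []
    for m in re.finditer(r"<(/?)(\w+)>", text):
        closing, name = m.group(1), m.group(2)
        if closing:
            if stack and stack[-1] == name:
                stack.pop()
        else:
            stack.append(name)
    return text + "".join(f"</{name}>" for name in reversed(stack))

# The author almost certainly meant two sibling <item> elements ...
broken = "<order><item>widget<item>gadget"
fixed = naive_fixup(broken)
print(fixed)  # <order><item>widget<item>gadget</item></item></order>

# ... but the "repaired" document nests the second item inside the
# first: it now parses without complaint, yet the data means something
# different from what was intended.
tree = ET.fromstring(fixed)
print(len(tree.findall("item")))     # 1 direct child of <order>
print(len(tree.findall(".//item")))  # 2 items total, one nested
```

For a document a human is about to read, this kind of repair is usually harmless; for an order-processing pipeline, it silently corrupts the data, which is exactly the scary part.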

I'd like to hear what you think: Is the problem of dirty XML data significant for you? Alternatively, does the draconian error processing rule mean that you can't use conformant XML tools in the way you thought you could with "semi-structured" data? After this little history lesson, what do you think we could realistically do to better meet your needs for safe and easy-to-use data processing tools?

Mike Champion

Comments (11)

  1. Draconian error processing rulez!

    That’s what makes XML so good WRT interop and security. I never had any problems with it, never.

    I think "the right thing" is to get vendors to develop easy-to-use, conformant XML APIs (you know, XmlTextReader/XmlTextWriter in .NET 1.X aren’t so conformant) and developers to use these tools instead of that nasty

    "<part>" + invoice.PartNo + "</part>".
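    To see why the concatenation habit is nasty: the moment the part number contains a markup character, the output stops being well-formed. A quick sketch in Python rather than .NET (the part number is made up for illustration; XmlTextWriter performs the equivalent escaping):

```python
import xml.etree.ElementTree as ET

part_no = "AT&T-100"  # hypothetical part number containing '&'

# Naive string concatenation produces ill-formed text: the bare '&'
# is a fatal error for any conformant parser downstream.
bad = "<part>" + part_no + "</part>"
try:
    ET.fromstring(bad)
except ET.ParseError:
    print("not well-formed:", bad)

# A conformant writer API escapes the data automatically.
elem = ET.Element("part")
elem.text = part_no
print(ET.tostring(elem, encoding="unicode"))  # <part>AT&amp;T-100</part>
```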

  2. Bill Crosbie says:

    Mike, in order for it to be XML then, for me, it *has* to be well formed. End of story. It needs to validate. I need to be able to rely on the parser to do *its* job so that my code can focus on business logic.

    Don’t break this.



  3. Nick says:

    I agree with the idea that creating well-formed XML is so easy that it’s not a burden to anyone. (If you can’t play by the rules, then please leave the game.) If you enforce the rules CONSISTENTLY (and there is emphasis on the word consistently), then people, even normal people who don’t "know" that you need an end tag to create well-formed XML, will come to understand that in order to get the desired results you need to add an end tag for every start tag.

    Then we programmers can focus on the processing of XML and making sure that it is VALID and not worrying about every possible well-formed exception to trap and look for.

  4. Asbjørn Clemmensen says:

    The way I see it, draconian error handling is *necessary* when XML parsing takes place. Without it, and when "semi-XML" is allowed to be parsed, very important ideas behind XML start falling apart.

    Tolerating non-wellformedness is the worst thing that could happen to the XML community. If MSXML started supporting not-quite-XML it’d be fatal, since so many major products use MSXML. This would in turn affect other parsers, and thereby make non-XML acceptable anyway.

    If someone needs to process almost-XML, then they’d have to use more cumbersome methods – even allowing a "gentle"-mode in an XML parser is a bad idea, since it’ll sacrifice a necessary property of XML.

    I respect that you consider the issue, and pay attention to users and their requests, but this is an important stand on the whole wellformedness issue, and as you state it’s an old debate and it’s still going on. The reason is, of course, that it’s an important debate.

  5. MCChampion says:

    Thanks for the useful feedback. There is no question – we would NEVER accept non-wellformed text as "XML", and I don’t think anyone asks for that. There are some people asking for tools to make it easier to work with semi-XML so that developers can make it well-formed. I’m not hearing any demand for that from this audience!

    By the way, there is some speculation as to what the Longhorn people will do with ill-formed RSS. See the comments thread in


    That’s a bit more of a dilemma because almost all existing RSS aggregators are not "draconian".

  6. Jon Gladstone-Gelman says:

    If by ‘well-formed’ you mean close your tags and don’t leave a stray ‘&’ or ‘>’ flying in the wind, I think you’re making too much of it: I have non-technical people pick this stuff up quickly.

    Encoding is a bigger issue, however, as I use the MSXML 4.0 DOM to generate most of our XML, and regardless of the source or the XML declaration, it always seems to save in ANSI format instead of UTF-8. For some odd reason, when we use the DOM through Word VBA we get UTF-8, but the same code in any other VBA (or VB6 or VBScript) produces ANSI. Can’t find anything in the KB about this, either.

  7. Tomer Gabel says:

    I actually tend to agree with what the proponents of the ‘draconian’ approach have been saying. I’ve been working intensely with XML for years, and were it not for its well-formedness I would’ve had a lot more trouble integrating and interoperating with existing infrastructure. If well-formedness were not a prerequisite for an XML document we would have another HTML-like mess on our hands: a standard that _mostly_ works, but for a lot of systems ‘mostly’ just isn’t good enough.

    Besides, modern development environments (.NET in particular) are so tightly coupled with XML that non-wellformed documents are extremely unlikely, not to mention will probably wreak havoc in a program. I constantly use XmlSerializer, XML renditions of DataSets and other XML-based systems for a huge range of applications, and I can only imagine how much work I would have to do if I couldn’t use XSLT or SOAP because the documents were not well-formed.
