Hindsight is 20/20: Three Things XML Got Wrong


Derek Denny-Brown, the dev lead for both MSXML & System.Xml, who’s been involved with XML before it even had a name has finally started a blog. Derek’s first XML-related post is Where XML goes astray… which points out three features of XML that turn out to have caused significant problems for users and implementers of XML technologies. He writes

First, some background: XML was originally designed as an evolution of SGML, a simplification that mostly matched a lot of then existing common usage patterns. Most of its creators saw XML and evolving and expanding the role of SGML, namely text markup. XML was primarily intended to support taking a stream of text intended to be interpreted as a human readable document, and delineate portions according to some role. This sequence of characters is a paragraph. That sequence should be displayed with a link to some other information. Et cetera, et cetera. Much of the process in defining XML based on the assumption that the text in an XML document would eventually be exposed for human consumption. You can see this in the rules for what characters are allowed in XML content, what are valid characters in Names, and even in “</tagname>” being required rather than just “</>”.

Allowed Characters
The logic went something like this: XML is all about marking up text documents, so the characters in an XML document should conform to what Unicode says are reasonable for a text document. That rules out most control characters, and means that surrogate pairs should be checked. All sounds good until you see some of the consequences. For example, most databases allow any character in a text column. What happens when you publish your database as XML? What do you do about values that include characters which are control characters that the XML specification disallowed? XML did not provide any escaping mechanism, and if you ask many XML experts they will tell you to base64 encode your data if it may include invalid characters. It gets worse.

The characters allowed in an XML name are far more limited. Basically, when designing XML, they allowed everything that Unicode (as defined then) considered a ‘letter’ or a ‘number’. Only 2 problems with that: (1) It turns out many characters common in Asian texts were left out of that category by the then-current Unicode specification. (2) The list of characters is sparse and random, making implementation slow and error prone.

Whitespace
When we were first coding up MSXML, whitespace was one of our perpetual nightmares. In hand-authored XML documents (the most common form of documents back then), there tended to be a great deal of whitespace. Humans have a hard time reading XML if everything is jammed on one line. We like a tag per line and indenting. All those extra characters, just there so that our feeble minds could make sense of this awkward jumble of characters, ended up contributing significantly to our memory footprint, and caused many problems to our users. Consider this example:
 <customer>  
           <name>Joe Schmoe</name>  
           <addr>123 Seattle Ave</addr> 
  </customer>
A customer coming to XML from a database back ground would normally expect that the first child of the <customer> element would be the <name> element. I can’t explain how many times I had to explain that it was actually a text node with the value newline+tab.

XML Namespaces
Namespaces is still, years after its release, a source of problems and disagreement. The XML Namespaces specification is simple and gets the job done with minimum fuss. The problem? It pushes an immense burden of complexity onto the APIs and XML reader/writer implementations. Supporting XML Namespaces introduces significant complexity in the parsers, because it forces parsers to parse the entire start-tag before returning any text information. It complicates XML stores, such as DOM implementations, because the XML Namespace specification only discusses parsing XML, and introduces a number of serious complications to edit scenarios. It complicates XML writers, because it introduces new constraints and ambiguities.

Then there is the issue of the ‘default namespace’. I still see regular emails from people confused about why their XPath doesn’t work because of namespace issues. Namespaces is possibly the single largest obstacle for people new to XML.

My experiences as the program manager for the majority of the XML programming model in the .NET Framework agree with this list. The above list hits the 3 most common areas people seem to have problems with working with XML in the .NET Framework. His blog post makes a nice companion piece to my The XML Litmus Test: Understanding When and Why to Use XML article on MSDN.


Comments (3)

  1. Drew Marsh says:

    Argh, I never agree with the namespaces one. I just don’t get why people have such a hard time with them. I think part of it is that people love just having to work with well-formed documents or one’s that only have a single schema. Unfortunately in all but the simplest real world scenarios, that approach just doesn’t work.

  2. Yes: "I can’t explain how many times I had to explain that it was actually a text node with the value newline+tab." Sadly, the author just explained this again—to me!

    There is the implication here that XML is more friendly toward human-reading but, coming from a background in typography, I have to say that I had a problem with translating my use of HTML 4.x entities (e.g. &eacute;) into XHTML.

    I am actually storing huge blocks of HTML 4.x prose in text-type fields in SQL Server. In order to "upgrade" to XHTML I have to translate all of this markup into well formed XML and change named entities back into glyphs.

    Most web sites don’t show concern for displaying “smart quotes” and em dashes but mine do! So just for the sake of XML I have get rid of all of these entities and hope that the web browsers out there can translate glyphs like Æ, Ê, or ¡ correctly.