XML heresies

10/26/2006

Right before I went on vacation, I wrote a little code to do syntax coloring for XML, which reminded me of my mixed feelings about XML. Don't get me wrong, there's enough momentum behind it that these days it's the right answer for almost any text format. But at the same time, it has some significant design flaws that drive me nuts. So at the risk of getting flamed back to the Stone Age by the XML experts, here goes.

My biggest gripe is that its data model just doesn't map very well to programming languages. The biggest issue by far is the attribute/element distinction. If I have a class with a "foo" property, in XML should I make foo into an attribute, or an element? (Solving this dilemma is probably the single biggest reason for XAML's existence.) One of the consequences of this is that a lot of information ends up getting crammed into attribute values, and those attribute values have a lot of structure to them -- structure that's not captured in XML. To compound matters, even simple data types like floating-point numbers are left out of the XML spec, so every application is left to its own devices to parse these. (Quick, what's the grammar for a floating-point number? Did you remember exponents and signs?)
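To make the dilemma concrete, here is a minimal Python sketch. The `widget` element, the `read_foo` and `parse_float` helpers, and the float grammar itself are all my own assumptions for illustration; nothing in the XML spec tells you which encoding or which grammar is the right one, which is rather the point.

```python
import re
import xml.etree.ElementTree as ET

# Two equally plausible encodings of a hypothetical object with a "foo" property.
AS_ATTRIBUTE = '<widget foo="1.5e-3"/>'
AS_ELEMENT = "<widget><foo>1.5e-3</foo></widget>"

# One possible float grammar (an assumption, not from any spec): optional sign,
# digits with an optional fraction or a bare fraction, then an optional exponent.
FLOAT_RE = re.compile(r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?")


def parse_float(text: str) -> float:
    """Parse a float, rejecting anything outside the grammar above."""
    stripped = text.strip()
    if FLOAT_RE.fullmatch(stripped) is None:
        raise ValueError(f"not a floating-point literal: {text!r}")
    return float(stripped)


def read_foo(xml_text: str) -> float:
    """Look for foo as an attribute first, then as a child element."""
    root = ET.fromstring(xml_text)
    raw = root.get("foo")
    if raw is None:
        raw = root.findtext("foo")
    if raw is None:
        raise ValueError("document has no foo attribute or element")
    return parse_float(raw)
```

Both encodings yield the same number, but only because the reader was written to check both places and to bring its own number grammar.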

But there are other representational difficulties as well -- what if your data structure is not a pure tree, but rather has sharing or cycles? Sure, you can give the shared element an ID and refer to it by name, but now you're inventing more semantics above and beyond XML. Arguably even XPointer doesn't completely solve the problem, since it supplies only named references, rather than anonymous "everyone's reference is created equal and no reference owns the object more than anyone else"-style references which programming languages use.
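Here is a rough sketch of the "give it an ID and refer to it by name" workaround, using a two-node cycle. The `graph`, `node`, and `edge` vocabulary and the `id`/`ref` attributes are hypothetical; the reference-resolution pass is exactly the extra semantics that XML itself doesn't supply.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: each <node> has an id, each <edge> names its target.
DOC = """
<graph>
  <node id="a"><edge ref="b"/></node>
  <node id="b"><edge ref="a"/></node>
</graph>
"""


class Node:
    def __init__(self, name):
        self.name = name
        self.edges = []  # direct object references, not names


def load_graph(xml_text):
    """Rebuild a possibly cyclic object graph from an XML tree."""
    root = ET.fromstring(xml_text)
    # Pass 1: create one object per id so that forward references can resolve.
    nodes = {el.get("id"): Node(el.get("id")) for el in root.iter("node")}
    # Pass 2: turn each named reference into a real object reference.
    for el in root.iter("node"):
        source = nodes[el.get("id")]
        for edge in el.iter("edge"):
            source.edges.append(nodes[edge.get("ref")])
    return nodes
```

After loading, `a` and `b` point at each other as ordinary objects, but the file had to pick one place to "own" each node and name it.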

One of the other odd things is that there's very little you can do with a generic piece of XML. Suppose you have a file that you know conforms to XML, but not necessarily any of the other standards (namespaces, schemas, etc.) -- what can you do with that? Remarkably little. You can't understand the true data structure of the document (see above). You don't know when two element names refer to the same element because you don't understand namespaces. You can't provide a structure editor (think IntelliSense) because you don't know what's valid and what's not (unless the file specifies a DTD rather than XSD or something else, but then DTDs bring their own problems). You can't move elements around, or do a consistent rename of namespaces. You really shouldn't even insert constructs the XML specification says are safe, such as comments and processing instructions, because the XML DOM makes it so easy for programs to accidentally care about some syntactic detail they shouldn't. Nor can you write a tool that adds indentation to the XML, because whitespace may or may not be significant (and setting xml:space="default" doesn't help -- XML 1.0 doesn't specify what this actually means). About the only things I can think of that you could do with generic XML are syntax highlighting and a viewer that lets you expand and collapse elements -- not exactly earth-shattering.
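A small sketch of how a "harmless" change can break a consumer, using Python's `xml.dom.minidom`. The `list`/`item` documents and the deliberately naive `first_child_name` helper are made up for illustration.

```python
from xml.dom import minidom

# Three documents that most people would consider equivalent.
COMPACT = "<list><item>a</item><item>b</item></list>"
INDENTED = "<list>\n  <item>a</item>\n  <item>b</item>\n</list>"
COMMENTED = "<list><!-- note --><item>a</item><item>b</item></list>"


def first_child_name(xml_text):
    """Naive DOM access: assumes the first child is the first <item>."""
    root = minidom.parseString(xml_text).documentElement
    return root.firstChild.nodeName
```

With the compact document the helper reports `item`, with the indented one it reports `#text` (the whitespace node), and with the commented one it reports `#comment`. A pretty-printer or an annotating tool would silently change what this code sees.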

Speaking of syntax highlighting, the XML lexical grammar (as opposed to a higher-level, syntactic grammar) is a bit of an oddball. Okay, XML is defined using only a single-level (context-free) grammar rather than separate lexical and syntactic specifications, but you can't use that for syntax highlighting... If you want a lexical grammar that is useful for syntax highlighting, what you find is that lexing is context-sensitive, because foo="bar" will be colored differently depending on whether it's part of an element's attribute list or appears as text content.
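The context sensitivity can be sketched as a lexer with two modes, one for text content and one for the inside of a tag. This is a toy under stated assumptions: the token names are my own, and it ignores comments, CDATA sections, processing instructions, and entity references, all of which a real highlighter would need.

```python
import re

# Tokens that are only recognized while inside a tag (hypothetical names).
TAG_TOKEN = re.compile(
    r"""(?P<tag_close>/?>)"""
    r"""|(?P<attr_value>"[^"]*"|'[^']*')"""
    r"""|(?P<equals>=)"""
    r"""|(?P<name>[^\s=/>"']+)"""
    r"""|(?P<space>\s+)"""
)


def lex(source):
    """Yield (kind, text) pairs, switching modes at '<' and '>'."""
    pos, in_tag = 0, False
    while pos < len(source):
        if not in_tag:
            # Content mode: everything up to the next '<' is plain text.
            end = source.find("<", pos)
            if end == -1:
                end = len(source)
            if end > pos:
                yield ("text", source[pos:end])
            if end < len(source):
                width = 2 if source.startswith("</", end) else 1
                yield ("tag_open", source[end:end + width])
                in_tag = True
                end += width
            pos = end
        else:
            # Tag mode: names, '=', and quoted values are separate tokens.
            match = TAG_TOKEN.match(source, pos)
            if match is None:
                raise ValueError(f"unexpected character at offset {pos}")
            if match.lastgroup != "space":
                yield (match.lastgroup, match.group())
            if match.lastgroup == "tag_close":
                in_tag = False
            pos = match.end()
```

Feeding it `<a foo="bar">foo="bar"</a>` produces a `name`, `equals`, `attr_value` sequence for the first `foo="bar"` and a single `text` token for the second, even though the characters are identical.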

XML modularization is also a bit funny. On the one hand, XML puts core functionality such as namespaces and schemas into separate specifications. On the other hand, the core spec includes some functionality that rarely gets used today. DTDs aren't used because they aren't very expressive, don't work with namespaces, and allow nested entity definitions whose expansion can blow up exponentially, creating a potential denial-of-service attack. Processing instructions (e.g., <?SomePI?>) have no well-defined meaning, and quite a few XML processors out there choke on them. Custom entities (beyond the standard &lt;, &gt;, &amp;, &apos;, and &quot;) solve who knows what problem. And XML supports a wide variety of encodings that I've never heard of anyone using.

And while I love using other people's XML parsers so I don't need to write my own, I've always wondered who the DOM's designers were targeting. For most people, the DOM gives out way too much information -- I don't want to know about comments and processing instructions, can't you strip that information out so my program doesn't accidentally trip on them? I'd much rather have something like the Infoset (although even that doesn't abstract away quite as much as I might like).
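For what it's worth, some parser APIs already lean this way. As a small sketch, Python's `xml.etree.ElementTree` drops comments and processing instructions by default when building its tree; the `list`/`item` document and the `item_texts` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A document cluttered with a comment and a processing instruction.
NOISY = "<list><!-- note --><?hint x?><item>a</item><item>b</item></list>"


def item_texts(xml_text):
    """Return the text of each child element of the root."""
    root = ET.fromstring(xml_text)
    # Iterating an Element yields only child elements; the default parser
    # never puts the comment or the processing instruction in the tree.
    return [child.text for child in root]
```

Here `item_texts(NOISY)` sees just the two items. Whitespace between elements still leaks through as `text` and `tail` strings, so this is closer to what I want rather than all the way there.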

So what do I like about XML? Well, I like that it's a standardized and widely implemented format, with parsing libraries readily available. And despite a few quirks, XML namespaces are fairly well done, and provide the foundation for versioning. And XSD may not be perfect, but it's good enough to represent most of the information you want in IntelliSense most of the time. Still, I can't help but wonder, if someone turned in the XML design as their college homework, what grade would they get?
