XML heresies

Right before I went on vacation, I wrote a little code to do syntax coloring for XML, which reminded me of my mixed feelings about XML.  Don’t get me wrong, there’s enough momentum behind it that these days it’s the right answer for almost any text format.  But at the same time, it has some significant design flaws that drive me nuts.  So at risk of getting flamed back to the Stone Age by the XML experts, here goes.

My biggest gripe is that its data model just doesn’t map very well to programming languages.  The worst offender by far is the attribute/element distinction.  If I have a class with a “foo” property, in XML should I make foo into an attribute, or an element?  (Solving this dilemma is probably the single biggest reason for XAML’s existence.)  One of the consequences of this is that a lot of information ends up getting crammed into attribute values, and those attribute values have a lot of structure to them, structure that’s not captured in XML.  To compound matters, even simple data types like floating-point numbers are left out of the XML spec, so every application is left to its own devices to parse these.  (Quick, what’s the grammar for a floating-point number?  Did you remember exponents and signs?)
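
To make the floating-point question concrete, here is a minimal sketch of what each application ends up writing for itself.  The grammar is my own assumption of a reasonable decimal syntax (optional sign, optional fraction, optional exponent), not something the XML spec defines, and parse_attr_float is a hypothetical helper name.

```python
import re

# Assumed grammar: optional sign, digits with an optional fraction
# (or a bare fraction such as ".5"), then an optional signed exponent.
FLOAT_RE = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")


def parse_attr_float(text: str) -> float:
    """Parse an attribute value as a float, rejecting anything off-grammar."""
    s = text.strip()
    if not FLOAT_RE.fullmatch(s):
        raise ValueError(f"not a float: {text!r}")
    return float(s)
```

Note that this sketch deliberately rejects things like "inf" and hex literals, which Python's own float() or another application's parser might accept.  That is exactly the kind of disagreement you get when the format leaves the decision to everyone.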

But there are other representational difficulties as well — what if your data structure is not a pure tree, but rather has sharing or cycles?  Sure, you can give the shared element an ID and refer to it by name, but now you’re inventing more semantics above and beyond XML.  Arguably even XPointer doesn’t completely solve the problem, since it supplies only named references, rather than anonymous “everyone’s reference is created equal and no reference owns the object more than anyone else”-style references which programming languages use.
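
Here is a rough sketch of the "give it an ID and refer to it by name" approach, using Python's xml.etree.ElementTree.  The id and ref attribute names are a convention I made up for this example; nothing in XML itself gives them meaning, which is the point.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one address shared by two referring elements.
DOC = """
<order>
  <address id="a1"><city>Seattle</city></address>
  <billTo ref="a1"/>
  <shipTo ref="a1"/>
</order>
"""


def resolve_refs(root):
    """Map the tag of each element carrying 'ref' to the element it names."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    return {
        el.tag: by_id[el.get("ref")]
        for el in root.iter()
        if el.get("ref") is not None
    }


root = ET.fromstring(DOC)
targets = resolve_refs(root)
```

Both billTo and shipTo now resolve to the same address object, but only because this code knows the convention.  A generic XML tool sees two unrelated attributes, and the address is still "owned" by its position under order.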

One of the other odd things is that there’s very little you can do with a generic piece of XML.  Suppose you have a file that you know conforms to XML, but not necessarily any of the other standards (namespaces, schemas, etc.).  What can you do with that?  Remarkably little.  You can’t understand the true data structure of the document (see above).  You don’t know when two element names refer to the same element because you don’t understand namespaces.  You can’t provide a structure editor (think IntelliSense) because you don’t know what’s valid and what’s not (unless the file specifies a DTD rather than XSD or something else, but then DTDs bring their own problems).  You can’t move elements around, or do a consistent rename of namespaces.  You really shouldn’t even insert constructs the XML specification says are safe, such as comments and processing instructions, because the XML DOM makes it so easy for programs to accidentally care about some syntactic detail they shouldn’t.  Nor can you write a tool that adds indentation to the XML, because whitespace may or may not be significant (and setting xml:space="default" doesn’t help, since XML 1.0 doesn’t specify what this actually means).  About the only things I can think of that you could do with generic XML are syntax highlighting and a viewer that lets you expand and collapse elements: not exactly earth-shattering.
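
The indentation problem is easy to demonstrate.  This is a small sketch using ET.indent from Python's standard library (available in Python 3.9 and later); the line/word document is a made-up example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document where adjacent elements form one run of text.
root = ET.fromstring("<line><word>foo</word><word>bar</word></line>")
before = "".join(root.itertext())

# Pretty-print in place; this inserts whitespace text between elements.
ET.indent(root)
after = "".join(root.itertext())
pretty = ET.tostring(root, encoding="unicode")
```

The text content of the document is "foobar" before indenting and contains newlines and spaces afterwards.  Whether that matters depends entirely on the application reading the file, which a generic tool has no way of knowing.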

Speaking of syntax highlighting, the XML lexical grammar (as opposed to a higher-level, syntactic grammar) is a bit of an oddball.  Okay, XML is defined using only a single-level (context-free) grammar rather than separate lexical and syntactic specifications, but you can’t use that for syntax highlighting…  If you want a lexical grammar that is useful for syntax highlighting, what you find is that lexing is context-sensitive, because foo="bar" will be colored differently depending on whether it’s part of an element’s attribute list or appears as text content.
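
To illustrate the context sensitivity, here is a toy two-mode lexer.  It is a sketch under heavy assumptions, not the code I wrote: it ignores comments, CDATA sections, processing instructions, and attribute values that contain ">", and the token kind names are invented for the example.

```python
import re

# Only meaningful in tag mode: name, equals sign, quoted value.
IN_TAG = re.compile(
    r"""(?P<attr>[\w:.-]+)(?P<eq>\s*=\s*)(?P<value>"[^"]*"|'[^']*')"""
)


def highlight(xml):
    """Return (kind, text) tokens using two lexer modes: tag and content."""
    tokens = []
    pos = 0
    while pos < len(xml):
        if xml[pos] == "<":
            # Tag mode: runs through the next '>'.
            end = xml.find(">", pos)
            end = len(xml) if end == -1 else end + 1
            body = xml[pos:end]
            last = 0
            for m in IN_TAG.finditer(body):
                if m.start() > last:
                    tokens.append(("markup", body[last:m.start()]))
                tokens.append(("attr-name", m.group("attr")))
                tokens.append(("markup", m.group("eq")))
                tokens.append(("attr-value", m.group("value")))
                last = m.end()
            if last < len(body):
                tokens.append(("markup", body[last:]))
        else:
            # Content mode: everything up to the next '<' is plain text.
            end = xml.find("<", pos)
            end = len(xml) if end == -1 else end
            tokens.append(("text", xml[pos:end]))
        pos = end
    return tokens


tokens = highlight('<a foo="bar">foo="bar"</a>')
```

The same nine characters come out as attr-name, markup, and attr-value tokens inside the start tag, and as a single text token in the element content.  A lexer with no notion of mode cannot make that distinction.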

XML modularization is also a bit funny.  On the one hand, XML puts core functionality such as namespaces and schemas into separate specifications.  On the other, XML itself includes some functionality that rarely gets used today.  DTDs aren’t used because they aren’t very expressive, don’t work with namespaces, and allow nested entity definitions whose expansion can blow up exponentially, creating a potential denial-of-service attack.  Processing instructions (e.g., <?SomePI ?>) have no well-defined meaning, and quite a few XML processors out there choke on them.  Custom entities (beyond the standard &lt; &gt; &amp; &apos; and &quot;) solve who knows what problem.  And XML supports a wide variety of encodings that I’ve never heard of anyone using.
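
For anyone who hasn’t run into custom entities, this is roughly what they look like.  The sketch uses Python's xml.etree.ElementTree; the entity name and its replacement text are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical document declaring its own entity in an internal DTD subset.
DOC = """<!DOCTYPE note [
  <!ENTITY company "Contoso Ltd.">
]>
<note>Copyright &company;</note>"""

root = ET.fromstring(DOC)
text = root.text  # the parser has already substituted the entity
```

The substitution happens inside the parser, so every conforming processor has to carry this machinery even if the documents it sees never declare an entity.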

And while I love using other people’s XML parsers so I don’t need to write my own, I’ve always wondered who the DOM’s designers were targeting.  For most people, the DOM gives out way too much information — I don’t want to know about comments and processing instructions, can’t you strip that information out so my program doesn’t accidentally trip on them?  I’d much rather have something like the Infoset (although even that doesn’t abstract away quite as much as I might like).
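
Here is a small sketch of the kind of accident I mean, using Python's xml.dom.minidom.  The config document is invented and child_elements is a hypothetical helper, the sort of filter everyone ends up writing.

```python
from xml.dom import minidom

# Hypothetical config file in which someone has added a harmless comment.
DOC = "<config><!-- default --><timeout>30</timeout></config>"

root = minidom.parseString(DOC).documentElement

# Naive code assumes the first child is the <timeout> element.
first = root.firstChild


def child_elements(node):
    """Return only the element children, skipping comments, PIs and text."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]


timeout = child_elements(root)[0]
value = timeout.firstChild.data
```

With the comment in place, firstChild is a comment node rather than the timeout element, so code that worked yesterday breaks today even though the document means the same thing.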

So what do I like about XML?  Well, I like that it’s a standardized and widely implemented format, with parsing libraries readily available.  And despite a few quirks, XML namespaces are fairly well done, and provide the foundation for versioning.  And XSD may not be perfect, but it’s good enough to represent most of the information you want in IntelliSense most of the time.  Still, I can’t help but wonder, if someone turned in the XML design as their college homework, what grade would they get?

Comments (5)

  1. BobStrogg says:

    Have to disagree with you :o)  To me the strength of XML is its simplicity, and the fact you can extend it in all kinds of cute & crazy ways.  XML Schema goes a long way; it’s just a shame it can’t quite represent everything needed for XAML.  Is that XML’s fault, or is XAML just frustrated it can’t quite fit in the box?

    I like what’s been done with XAML, even though it can’t be validated cleanly by schema alone (no doubt there was hair-pulling & headaches while that was being worked out).  I’m pretty impressed with the IntelliSense & validation Orcas is managing so far; it’s not *too* far off.

    Maybe XML Schema needs a few proprietary extensions ;o)

  2. Chris Nahr says:

    There appears to be a fundamental misunderstanding about what XML is supposed to be. XML is just a format for structured text containers, like its predecessor SGML. You want data types? That’s what XML Schema is for. Complaining that XML itself doesn’t provide the information of an XSD file is not very useful criticism.

    By the way, custom character entities allow users to input characters that they cannot input via their keyboard, to handle characters that their OS cannot display, and to transmit characters that would be stripped by a 7-bit line. Sounds pretty useful to me…

  3. Regarding entities: character references, such as &#1234;, are absolutely useful.  But other kinds of entities are more dubious.  You can define your own entity, then refer to it using &myentity; syntax.  And you can define the meaning of myentity in that XML document, in a separate document, or who knows where.  So it’s a cute shorthand, but do we really need this at such a low level where every XML processor needs to implement it?

  4. Graham says:

    Although I love XML I think a lot of your points are valid. I know that it always frustrates me how much code needs to be written to consume XML without snaring yourself on comments and directives that you don’t care about. But this may be more of a failing of APIs for XML processors than anything else.

  5. Carter says:

    You mentioned that the only things you can do with XML without a schema are syntax highlighting and displaying a collapsible tree.  That may be true for humans, but the real power of XML, in my opinion, is the fact that the structured model is standard, which makes things like XPath possible.  I like to think of XML as just a way to break up a string into structured chunks.  But just having "strings with angle brackets" goes a long way.  Entire query languages can be written over that model, without any knowledge of the schema.  Of course, that’s where it leaves off, but you at least don’t need a lexer anymore to do simple, ubiquitous things.