Readability vs. Performance of XML formats

This isn't really news for most folks, but I wanted to explain it better for those who are new to XML. As I’ve mentioned before, the new formats are going to be fairly similar to WordprocessingML from Office 2003 in that they will not be pretty printed and will use fairly short tag names. Both decisions were made for performance reasons. The side effect is that if you open an Office XML file in Notepad or some other text editor, it will look overwhelming at first. This is just cosmetic though, and I'll explain why it's that way.

Shorter tag names

We can generate and read these formats a lot faster if the tag names are shorter. When opening or generating a file, we spend a good amount of the time just processing the XML text. The longer the tag names, the longer it takes to parse the file. When parsing the file, we use a trie to look up element and attribute names, and the cost of that lookup is roughly proportional to the length of the name. Double the length, and you roughly double the lookup.
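To make the trie point concrete, here is a minimal sketch in Python. It is not the Office parser; the class names (Node, TagTrie) and the tag IDs are made up for illustration. It only shows that a trie lookup walks one node per character, so a four-character name takes four steps where a one-character name takes one.

```python
class Node:
    """One trie node: child nodes keyed by character, plus an optional tag ID."""

    def __init__(self):
        self.children = {}
        self.tag_id = None


class TagTrie:
    """Toy trie mapping tag names to integer IDs (hypothetical, for illustration)."""

    def __init__(self):
        self.root = Node()

    def add(self, name, tag_id):
        node = self.root
        for ch in name:
            node = node.children.setdefault(ch, Node())
        node.tag_id = tag_id

    def lookup(self, name):
        """Return (tag_id, steps), where steps counts the characters examined."""
        node = self.root
        steps = 0
        for ch in name:
            node = node.children.get(ch)
            steps += 1
            if node is None:
                return None, steps  # no known tag starts this way
        return node.tag_id, steps


trie = TagTrie()
trie.add("c", 1)      # short name
trie.add("cell", 2)   # long name for the same idea

print(trie.lookup("c"))     # (1, 1)
print(trie.lookup("cell"))  # (2, 4)
```

The step count is only a stand-in for real parsing cost, but it shows why the work per tag grows with the length of the name.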

So, for any element that is repeated often throughout the file, we use a short tag name to cut back on this time (for example, Excel uses “<c>” instead of “<cell>”). That may not seem like a big deal, but if you have a file with ten thousand cells (which is not too uncommon in Excel), the three characters saved on each opening tag mean 30K fewer bytes to parse for that one element type, and the closing tags save as much again. As you can imagine, that adds up pretty quickly when you think about all the tags we need to generate.
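The arithmetic is easy to check. The helper below is a hypothetical one written just for this post, and it assumes one byte per character (true for these ASCII tag names in UTF-8).

```python
def bytes_saved(long_name, short_name, count, closing_tags=True):
    """Bytes saved by renaming an element that appears `count` times.

    Assumes one byte per character and, when closing_tags is True,
    that every element has both an opening and a closing tag.
    """
    per_tag = len(long_name) - len(short_name)
    tags_per_element = 2 if closing_tags else 1
    return per_tag * tags_per_element * count


# Opening tags only: <cell> vs. <c> across ten thousand cells.
print(bytes_saved("cell", "c", 10_000, closing_tags=False))  # 30000

# Counting the matching </cell> vs. </c> as well.
print(bytes_saved("cell", "c", 10_000))  # 60000
```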

Pretty Printing

For those of you who aren’t familiar with XML, “pretty printing” is a way of writing out XML so that it’s easier for people to read with a text editor. Imagine the following XML:

<wordDocument>
    <body>
        <p>
            <r>
                <t>Hello World</t>
            </r>
        </p>
    </body>
</wordDocument>

That’s pretty easy to read, because there are line breaks and indentations for each element. The problem with this approach is that the application writing out the file has to generate all of that. In Office, our files will have thousands of tags in them. If we were to pretty print each file, it would mean that saving a file would take more time. Since the file save times affect everyone, and the pretty printing only affects those people who want to look at the XML in plain text, we decided to optimize for performance (most other applications that have XML as their default format do the same thing). Here is what that same XML would look like when saved out from Office:

<wordDocument><body><p><r><t>Hello World</t></r></p></body></wordDocument>
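To get a feel for how much extra text pretty printing produces, the sketch below compares the two forms of this example. It assumes four spaces of indentation per level and a single newline character per line break; a real writer might use different settings.

```python
compact = "<wordDocument><body><p><r><t>Hello World</t></r></p></body></wordDocument>"

# The same document, pretty printed with four spaces per nesting level
# (an assumed style, chosen only for this comparison).
lines = [
    "<wordDocument>",
    "    <body>",
    "        <p>",
    "            <r>",
    "                <t>Hello World</t>",
    "            </r>",
    "        </p>",
    "    </body>",
    "</wordDocument>",
]
pretty = "\n".join(lines)

# Everything beyond the compact form is whitespace the writer must emit
# and the reader must skip.
overhead = len(pretty) - len(compact)
print(len(compact), len(pretty), overhead)  # 74 146 72
```

In this tiny example the whitespace nearly doubles the size of the text. Real files will have a different ratio, but the extra characters are pure overhead for the application.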

Most XML editors out there today (like Visual Studio and even FrontPage) have functionality built in that allows you to apply pretty printing to a file. This means that if you want to look at the XML, you can just load it up in one of those applications and apply the pretty printing to it. You can also load the file in IE, which automatically applies an XSLT that gives you a pretty decent view.
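If you would rather script it than open an editor, most XML libraries can do the same thing. Here is a minimal sketch using Python's standard xml.dom.minidom module; the four-space indent is just my choice, and the exact whitespace in the output can vary a little between Python versions.

```python
from xml.dom import minidom

compact = "<wordDocument><body><p><r><t>Hello World</t></r></p></body></wordDocument>"

# Parse the compact text, then write it back out with line breaks and indentation.
pretty = minidom.parseString(compact).toprettyxml(indent="    ")
print(pretty)
```

The file on disk stays compact; the pretty printed text is only a view for the person reading it.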

-Brian