Readability vs. Performance of XML formats


This isn’t really news for most folks, but I wanted to explain it better for those who are new to XML. As I’ve mentioned before, the new formats are going to be fairly similar to WordprocessingML from Office 2003 in that they will not be pretty printed and will use fairly short tag names. The reason for both of these decisions is performance. The side effect is that if you open an Office XML file in Notepad or some other text editor, it will look overwhelming at first. This is just cosmetic though, and I’ll explain why it’s that way.


Shorter tag names


We can generate and read these formats a lot faster if the tag names are shorter. When opening or generating a file, we spend a good amount of the time just parsing the XML text. The longer the tag name, the longer it takes to parse the file. When parsing the file, we use a trie to look up element/attribute names, and the cost of that lookup is directly proportional to the length of the tag name. Double the length, double the lookup time.
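To make the trie point concrete, here is a minimal Python sketch. It is not the actual Office parser; the class and function names (TrieNode, build_trie, lookup) and the sample tag names are assumptions made up for illustration. It only shows that a trie lookup examines one node per character, so a four-character name costs four steps where a one-character name costs one.

```python
class TrieNode:
    """One node per character; token is set only where a known name ends."""

    def __init__(self):
        self.children = {}
        self.token = None


def build_trie(names):
    """Insert each name, storing its list index as the token."""
    root = TrieNode()
    for token, name in enumerate(names):
        node = root
        for ch in name:
            node = node.children.setdefault(ch, TrieNode())
        node.token = token
    return root


def lookup(root, name):
    """Return (token, steps), where steps counts the characters examined."""
    node = root
    steps = 0
    for ch in name:
        steps += 1
        node = node.children.get(ch)
        if node is None:
            return None, steps  # unknown name, stop early
    return node.token, steps


# Hypothetical vocabulary containing both short and long spellings.
trie = build_trie(["c", "cell", "r", "row"])
print(lookup(trie, "c"))     # one character examined
print(lookup(trie, "cell"))  # four characters examined
```

The step count grows with the name length and not with the number of names in the vocabulary, which is why shortening a frequently repeated tag pays off on every occurrence.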


So, for any element that is repeated often throughout the file, we use a short tag name to cut back on this time (for example, Excel uses “<c>” instead of “<cell>”). That may not seem like a big deal, but if you have a file with ten thousand cells (which is not too uncommon in Excel), that means you have 30K fewer bytes to parse just for that particular element type, counting only the opening tags. As you can imagine, that adds up pretty quickly when you think about all the tags we need to generate.
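The 30K figure can be checked with quick arithmetic. The sketch below uses a hypothetical helper, bytes_saved, and assumes each cell is written with one opening tag (and optionally one matching closing tag) in a single-byte encoding such as UTF-8 for these ASCII names.

```python
def bytes_saved(count, long_name, short_name, include_closing_tags=False):
    """Bytes saved by writing short_name instead of long_name count times."""
    per_tag = len(long_name.encode("utf-8")) - len(short_name.encode("utf-8"))
    tags_per_element = 2 if include_closing_tags else 1
    return count * per_tag * tags_per_element


opening_only = bytes_saved(10_000, "cell", "c")
with_closing = bytes_saved(10_000, "cell", "c", include_closing_tags=True)
print(opening_only)  # 10,000 cells x 3 bytes = 30000
print(with_closing)  # twice that if each cell also has a closing tag
```

Counting closing tags as well doubles the saving, so the 30K estimate is on the conservative side.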


Pretty Printing


For those of you who aren’t familiar with XML, “pretty printing” is a way of writing out XML so that it’s easier for people to read with a text editor. Imagine the following XML:



<wordDocument>
   <body>
      <p>
         <r>
            <t>Hello World</t>
         </r>
      </p>
   </body>
</wordDocument>


That’s pretty easy to read, because there are line breaks and indentations for each element. The problem with this approach is that the application writing out the file has to generate all of that. In Office, our files will have thousands of tags in them. If we were to pretty print each file, it would mean that saving a file would take more time. Since the file save times affect everyone, and the pretty printing only affects those people who want to look at the XML in plain text, we decided to optimize for performance (most other applications that have XML as their default format do the same thing). Here is what that same XML would look like when saved out from Office:



<wordDocument><body><p><r><t>Hello World</t></r></p></body></wordDocument>
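For a rough feel of how much extra output pretty printing produces, the following Python sketch builds the same sample document and serializes it both ways. It uses the standard library's xml.etree.ElementTree (ET.indent requires Python 3.9 or later) and is only an illustration: byte counts are a proxy for the extra work, not a measurement of Office's save times, and the three-space indent is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

# Build the sample document from the post.
root = ET.Element("wordDocument")
body = ET.SubElement(root, "body")
p = ET.SubElement(body, "p")
r = ET.SubElement(p, "r")
t = ET.SubElement(r, "t")
t.text = "Hello World"

# Default serialization adds no whitespace between elements.
compact = ET.tostring(root, encoding="unicode")

# ET.indent modifies the tree in place, inserting line breaks and indentation.
ET.indent(root, space="   ")
pretty = ET.tostring(root, encoding="unicode")

print(len(compact), "characters without pretty printing")
print(len(pretty), "characters with pretty printing")
```

Every extra character is whitespace the writer has to produce and the reader has to skip, and the overhead grows with both the number of tags and the nesting depth.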


Most XML editors out there today (like Visual Studio and even FrontPage) have functionality built in that allows you to apply pretty printing to a file. This means that if you want to look at the XML, you can just load it up in one of those applications and apply the pretty printing to it. You can also load the file in IE, which automatically applies an XSLT that gives you a pretty decent view.
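If you don't have one of those editors handy, a few lines of script will do the same job. Here is a minimal Python sketch using the standard library's xml.dom.minidom; the pretty_print helper name and the three-space indent are assumptions for illustration. Treat the result as a view for reading only, since the added whitespace becomes part of the text.

```python
from xml.dom import minidom

# The compact sample from the post, as it would be saved out.
compact = "<wordDocument><body><p><r><t>Hello World</t></r></p></body></wordDocument>"


def pretty_print(xml_text, indent="   "):
    """Return an indented, line-broken copy of xml_text for reading."""
    # toprettyxml also prepends an XML declaration line.
    return minidom.parseString(xml_text).toprettyxml(indent=indent)


pretty = pretty_print(compact)
print(pretty)
```

Dropping the declaration line and stripping the whitespace from each remaining line gives back the compact form, so nothing but layout has changed.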


-Brian

Comments (12)

  1. tzagotta says:

    Hi Brian, Internally, will O12 use MSXML to parse and generate these file formats, or will you develop dedicated code for this?

  2. Marco Antonio Sánchez says:

    The decision you made on the readability vs. performance trade-off appears to be the appropriate one. Creating verbose XML with pretty printing would be the right choice if the major consumer of the documents were a human. However, almost all of these documents will be consumed by a parser in a typical scenario.

    During development you can format the documents using an XML editor or tool. Another approach to the pretty printing issue is to activate it programmatically. Since this feature is only useful for developers, including an option in the Save As dialog does not seem to make much sense.

    After all, we all are familiar with the <p>, <c>, <li>… HTML tags.

    In my opinion, the key point is to create an XML vocabulary that is as homogeneous as possible. These criteria include capitalization, use of attributes vs. elements, verbose vs. efficient names… Having a consistent set of rules across all of the XML vocabularies of the MS Office System can make the difference.

  3. Bob ?:-) says:

    Hi Brian,

    I believe you had mentioned previously that there might be a tool/utility/transform that could be made available to format into ‘pretty print’ (human readable) form, if you made the decision to go with the fast but ugly format (starting to look about as friendly as binary? <g>).

    Is that still the plan? That way, folks who need to troubleshoot a problem, or who just want to become more familiar with how things work so they can get their job done more efficiently or in a more timely manner, would be able to look at a file that they can read. For example, if the help desk says ‘do a search/find on your file and look for xxxx/yyyy’, that would obviously be easier to read back and understand in context when the file is somewhat formatted.

    It would also be a useful ‘must have’ tool to be able to do the lookup and replace, so that someone could lay out the file and replace all occurrences of <c:> with <cell:> if they chose to do so. Is that still something you’re working on? The average end user isn’t going to want to hear ‘go buy this additional package if you want to have the files so *you* can read them’ and the average corporate user isn’t going to have the option of buying or installing / having deployed yet another piece of software. It would seem, at first look, that having that ability shipped with Office Wave 12 and with the back-version update packs would be a useful thing, even at just a ‘comfort creating’ level for customers.

    We already get too many folks who look at the Office 2000 through 2003 ‘html’ (save as Web page) files, where you are formatting the output, and just shake their heads trying to understand it. If that becomes the user reaction, you may have a larger number of customers that switch to the ‘old’ .doc, .xls, etc. formats, because they are less of an ‘unknown’ or ‘fear’ factor for them.

    Bob Buckland ?:-)

  4. dd says:

    WordprocessingML and SpreadsheetML are unnecessarily long, so I think shortening the words inside each tag is one thing. But how about outputting simplified markup in the first place?

    think one <cell> tag and too many <c></c><c></c><c></c>….

    It is obvious that one <cell> tag is much easier for both humans and machines. So I would like a feature in the Office suite to save as simplified XML output, for repurposing in other simple tasks.

    thank you.

  5. BrianJones says:

    Tzagotta – I just decided to address this in a separate post: https://blogs.msdn.com/brian_jones/archive/2005/06/24/432408.aspx

    Marco – I completely agree with you. It’s difficult to do this given the large scale of the project, and the fact that the scenarios for each application are slightly different. We have tried to maintain a level of consistency where possible though. As we start to provide the schemas, I’ll talk more about this and explain why we made the tradeoff that we did.

    Bob – Anyone programming against these files would just want to use the formats we output. If you want to temporarily put them into a more human readable state though, then there are a number of options which are fairly simple (such as pretty printing, or transforming into more verbose tag names). We also will provide a lot more documentation around these formats, as well as example files that make it clear how to create and read the files.

    dd – Not sure what your point is. Could you provide some more information? I was just talking about using <c> instead of <cell>. There would still be the same number of tags output in either case though. What type of simplified XML do you want? Which features do you want us to strip out when we save the file? This is something you can do today with XSLT if you want.

    -Brian

  6. This post is for those of you interested in learning the basics behind WordprocessingML. That’s the schema…

  7. I’ve really dropped the ball here over the past several months. I’d been meaning to post some example…

  8. It’s been awhile since I’ve talked in detail about the SpreadsheetML schema and I apologize. I had a…

  9. It’s been awhile since I’ve talked in more detail about the SpreadsheetML schema and I apologize. I had…
