Does tag size matter?


Surprisingly, I haven’t seen much information out there discussing the performance impact of XML tag name lengths (i.e., using “<c>” instead of “<table-cell>”). My last post about some of the design goals behind SpreadsheetML raised some questions from folks about where the time is actually spent when loading an XML file. There are a ton of things we do to improve performance when opening and saving Open XML files. The move to using these formats as the default for Office 2007 meant we had to get really serious about how the formats were constructed so they could open efficiently. I’d be really interested to hear from other people who’ve worked on XML formats whether they’ve had similar experiences.


For a lot of people who have worked with XML, the parsing of tags isn’t more than a percent or two of the overall load and save times. With Office document formats, that’s not always the case. Just to give you an idea of the scale these documents can reach, check out George Ou’s article about the performance of spreadsheet files: http://blogs.techrepublic.com.com/Ou/?p=120


In that article, the spreadsheet he uses is pretty big, but we have definitely seen much larger and more complex spreadsheets, so don’t assume it’s a fringe case. If you save the file from the article using the new Open XML format, you get the following:




  • File size compressed – 16 megabytes (131 megs uncompressed)


  • Number of XML elements – 7,042,830


  • Number of attributes – 9,639,043

So, as you can see, that’s a lot of XML to parse through. As we looked at files like this, we saw that we absolutely needed to find ways to optimize the formats to make them faster. Using shorter tag names was one of the first obvious ones.
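As a rough illustration of how you might gather counts like these yourself (this is a Python sketch, not the internal tool we used, and the `.xml` part-name filter is an assumption about the package layout):

```python
import zipfile
import xml.etree.ElementTree as ET

def count_xml_nodes(package):
    """Count XML elements and attributes across every .xml part in a
    ZIP-based package (an .xlsx file is just such a package)."""
    elements = attributes = 0
    with zipfile.ZipFile(package) as pkg:
        for name in pkg.namelist():
            if not name.endswith(".xml"):
                continue
            with pkg.open(name) as part:
                # iterparse streams the part, so a 131 MB workbook never
                # has to be held in memory as one giant tree
                for _, elem in ET.iterparse(part, events=("end",)):
                    elements += 1
                    attributes += len(elem.attrib)
                    elem.clear()  # release subtree we've already counted
    return elements, attributes
```

Pointing something like this at the workbook from George Ou’s article should report numbers in the ballpark above.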



Impact of long tag names vs. short tag names



In the profiles that we’ve looked at over the years, we’ve seen that simply using shorter tag names can significantly improve the performance depending on the type of file. For those of you really interested, you should do your own profiles and let me know what you find out. Remember that for an application like Excel, we’re talking about the potential for millions of XML tags to represent a rich spreadsheet. Let’s look at a couple issues now:




  1. Compression – Since we use compression (ZIP), there isn’t much of a difference in file size: a long tag name and a short tag name will compress to pretty much the same size. This also means that the time spent hitting the hard drive or transmitting over the wire will be about equal. When you do the actual compression though, longer tag names mean there are just a lot more bits you need to read through. Those bits may be in memory or they may be on disk, but either way you need to deal with them at some point when compressing. The same holds for decompression: you will end up generating a lot more bits if the tag names were longer, even if the compressed bits are significantly smaller.


  2. Parsing – We of course use a SAX parser to parse our XML, and in most cases we also use a Trie lookup, which is super fast (in other cases we use a hash). When using the hash, we still have to store the full tag name for a final comparison, because we don’t have a known, bounded set of element values coming in. Not only do we allow for full extensibility, but we also have to allow for the fact that people might make mistakes when generating the files, and we need to be able to catch those errors. For those familiar with hashing, you’ll know that unless you are guaranteed a perfect hash, you also need a second straight string compare to ensure it was a proper match. So for both memory and processing time, tag length has a direct impact. The time taken for a Trie lookup is directly proportional to the tag length. For a hash, it really depends on how you compute the hash and how you do your verification.



    • One drawback to the Trie is that it’s more memory intensive. In most cases we make that tradeoff though, because it’s super fast. You can really see how tag names have an impact all over the place: memory use, parsing times, compression times, etc.


  3. Streamed decompression and parsing – As the XML part is being decompressed, we stream it to the parser. SAX is connected directly to the part IStream which then decompresses on demand. On the compression side, it’s probably interesting to point out that we don’t always compress each XML part as a whole. Instead we keep a single deflate stream and flush the compressed data when the “current” part being written out changes. For most parts we write them out as a whole, but there are some parts where that isn’t the case.
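The compression point above is easy to sanity-check. Here’s a toy sketch using Python’s zlib (not the package’s actual deflate settings, and the long-tag vocabulary is invented for illustration): the two spellings compress to nearly the same size even though the raw streams differ by several times.

```python
import zlib

def size_report(xml_bytes):
    """Return (uncompressed, compressed) byte counts for an XML fragment."""
    return len(xml_bytes), len(zlib.compress(xml_bytes))

# 1,000 rows of the short-tag spelling vs. a longer-tag spelling
# of the same data (the long names here are made up for this example)
short = b"<row><c><v>1</v></c><c><v>2</v></c><c><v>3</v></c></row>" * 1000
long_ = (b"<table-row><table-cell><value>1</value></table-cell>"
         b"<table-cell><value>2</value></table-cell>"
         b"<table-cell><value>3</value></table-cell></table-row>") * 1000

raw_s, packed_s = size_report(short)
raw_l, packed_l = size_report(long_)
print(raw_s, packed_s)  # raw is 56,000 bytes; compressed is a few hundred
print(raw_l, packed_l)  # raw is ~2.6x bigger; compressed stays in the same ballpark
```

The compressed outputs land close together, but the compressor (and later the decompressor) still has to chew through every byte of the raw stream, which is where the longer names cost you.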

I know that for a lot of people who’ve played around with XML, parsing isn’t really something you would think of as a major part of file load times. That’s not the case with Office document formats, and especially spreadsheet documents.


With the latest SpreadsheetML design, we’ve seen that the XML parsing alone (not including our parsing of numbers, refs, and formulas) can often range from 10–40% of the entire file load. That’s just the time it takes to read each tag and each attribute. This shouldn’t be too surprising though, as the internal memory structures of a spreadsheet application should be fairly similar to the shapes used in the format design. A big piece of the work is just reading the XML in and interpreting what the tags are.
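To make the Trie point concrete, here’s a minimal sketch in Python (nothing like the actual implementation, just the idea): every character of the tag name costs one more step in the walk, which is exactly why a two-character name beats a ten-character one on every single lookup.

```python
class TagTrie:
    """A toy trie mapping tag names to handler IDs. Matching cost is
    O(len(tag)), so shorter names pay off on every element parsed."""

    def __init__(self):
        self.root = {}

    def insert(self, tag, handler_id):
        node = self.root
        for ch in tag:
            node = node.setdefault(ch, {})
        node["$"] = handler_id  # sentinel key marks the end of a valid tag

    def lookup(self, tag):
        node = self.root
        for ch in tag:
            node = node.get(ch)
            if node is None:
                return None  # unknown tag -- fall through to error handling
        return node.get("$")
```

An unknown or misspelled tag simply walks off the trie and returns `None`, which mirrors why the format still has to handle arbitrary input gracefully.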



Example



SpreadsheetML was designed so that any tag or attribute that appears frequently gets a super short name. We also established naming conventions for the abbreviations shared across all three formats (so that they become easier to interpret as you work with them). Elements that may only appear once in a file often have longer tag names, since their size doesn’t have nearly the same impact. Right now, most of our frequently used tag names are no more than a couple of characters long. Let’s imagine instead that we decided to use longer, more descriptive names so each tag was around five times larger (you can use the older SpreadsheetML or the OpenDocument format for examples of longer tag names):


Short tag example:


<row><c><v>1</v></c><c><v>2</v></c><c><v>3</v></c></row>
<row><c><v>4</v></c><c><v>5</v></c><c><v>6</v></c></row>


Long tag example:


<table:table-row table:style-name="ro1"><table:table-cell office:value-type="float" office:value="1"><text:p>1</text:p></table:table-cell><table:table-cell office:value-type="float" office:value="2"><text:p>2</text:p></table:table-cell><table:table-cell office:value-type="float" office:value="3"><text:p>3</text:p></table:table-cell></table:table-row>
<table:table-row table:style-name="ro1"><table:table-cell office:value-type="float" office:value="4"><text:p>4</text:p></table:table-cell><table:table-cell office:value-type="float" office:value="5"><text:p>5</text:p></table:table-cell><table:table-cell office:value-type="float" office:value="6"><text:p>6</text:p></table:table-cell></table:table-row>


For that example, the top one is using SpreadsheetML from the Ecma Office Open XML format. The second example is using the OpenDocument format. There is another optimization that SpreadsheetML does where you can optionally write out the column and row information on cells, but I removed that since it’s actually another performance optimization that I’d like to discuss in a separate future post (and as I said it’s optional).



Tag length impact



Let’s imagine we have that file I mentioned earlier, with 7 million elements and 10 million attributes. If on average each element and attribute name is about 2 characters long, then you have around 34 megabytes of data to parse (which is a ton) just in element and attribute names. If instead the average name length were more like 10 characters, you’re talking about 170 megabytes. That is a very significant difference.
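The arithmetic, using the exact counts from earlier (the 34 MB and 170 MB figures are just these results rounded):

```python
elements = 7_042_830
attributes = 9_639_043
names = elements + attributes  # every element and attribute carries a name

short_bytes = names * 2    # ~2-character average names
long_bytes = names * 10    # ~10-character average names

print(short_bytes)  # 33,363,746 -- roughly 34 MB of name text
print(long_bytes)   # 166,818,730 -- roughly 170 MB
```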


This isn’t rocket science of course. Most folks I’ve talked to agree that it’s important to keep tag names short, especially in structures that are highly repetitive. In SpreadsheetML, you’ll see that a lot of the element names actually are pretty long and descriptive, but only if they appear in a few places, where they won’t be much of a burden. Any element that can occur with high frequency is definitely kept to a minimum length.



Optimize based on your design goals



Remember, we’re not talking about creating a format for hobbyists. This format is supposed to be used by everyone, and most of those folks aren’t going to be happy with feature loss and performance degradation just so they can save out as XML (the average user doesn’t care about XML). The original SpreadsheetML from Office XP was actually more like a hobbyist format; as a result it was really easy to develop against, but it was bloated and slow. I wish we didn’t have to worry so much about performance, but if you really expect these formats to be used by everyone, then you have to take the training wheels off. That’s why the standardization in Ecma is so important: it ensures that everything is fully documented and all the information is there to allow you to develop against the formats.


I’ll talk more about the other parts of the design that are optimized around performance in future posts. This was one that people had some questions on though so I just wanted to clarify and make sure there wasn’t any more confusion. If you’ve looked at similar issues in your file format designs and found other interesting things like this, I’d love to hear about them! 


-Brian

Comments (35)

  1. slueppken says:

    Hi Brian,

    thanks for the great post, you gave me quite a lot of ideas on how to improve my own XML file loading and saving. Your posts are always enjoyable and very interesting! Thank you and keep up the good work! 🙂

    Sven

  2. C++ guy says:

    Brian, you’re a good guy, and you’re doing an excellent job of explaining these technical details.

    But I think everyone has a serious case of the "Emperor’s new clothes" or "group think" or whatever.

    Imagine many years ago, before XML was invented, if someone told you that the data

    1 2 3

    would be encoded as:

    <row><c><v>1</v></c><c><v>2</v></c><c><v>3</v></c></row>

    would you laugh or puke?

  3. BrianJones says:

    Sven, thanks for the good words! I hope we can all gain a lot through these discussions.

    C++ guy, I hear you man. 🙂

    It really is a fine line you need to walk when dealing with XML file formats. There are a number of drawbacks, but at the same time, there are huge advantages for developers with the ease of interoperability. That’s why we need to find the sweetspot between keeping the format useful for developers, without seriously affecting the end user. For the majority of documents, I think we can really nail this.

    -Brian

  4. Robert says:

    I remember when I first read about XML ten years ago and one of the tenets was

    "XML documents should be human-legible and reasonably clear. If you don’t have an XML browser and you’ve received a hunk of XML from somewhere, you ought to be able to look at it in your favorite text editor and actually figure out what the content means. "

    http://www.xml.com/pub/a/98/10/guide0.html?page=2#AEN84

    Sigh….

    <row><c><v>1</v></c><c><v>2</v></c><c><v>3</v></c></row>

  5. BrianJones says:

    Hey Robert, this is a pretty common approach. Heck, even HTML took this approach:
    <tr><td><p>1</p></td><td><p>2</p></td><td><p>3</p></td></tr>

    I think you can reasonably figure out what’s going on, and besides, that’s the whole point of documentation. Would you rather the documentation live in every file via long tag names? Or have the documentation stored separately for better efficiency? We’re not talking about a basic data format that’s representing a few fields of data. We’re talking about an XML format that represents three extremely powerful applications and is comprised of about 10,000 different elements, attributes, complex types and enumerations. 🙂

    -Brian

  6. If someone is interested in measuring the metadata/data-ratio or whitespace/data-ratio of some XML documents. The BioSchemas project offers a small .NET command line tool called XmlStatistic that does exactly that:

    http://sourceforge.net/projects/bioschemas/


  8. Hi Brian,

    another great post 🙂

    Your posts have been very helpful. Especially when PPT B1TR decided to break an important file of mine and the only remedy was to dig into the XML…

    For a 2007 add-in of mine, I am reading a custom file that is a zip package with two XML files in there. Is there any way your ZIP/XML parser is exposed in the object model so that I could just plug into it and use it?

    Thanks,

    Patrick Schmid

  9. BrianJones says:

    Hey Martin, thanks for that pointer. Do you know if you can point it at a ZIP file and have it go over the XML files within the ZIP? That would be pretty nice.

    In order to get those numbers I posted, I created a little .NET application that used System.IO.Packaging to open the spreadsheet and look through all the parts, counting up the statistics. I then created a separate SpreadsheetML file that had the counts for each part within the original package. Next step is to play around with formatting it in a cool way so that I can easily do different calculations, charting, etc. on top of it. I love using these new formats! 🙂

    Hey Patrick, I’m glad it was useful. I wasn’t sure if folks would find this data interesting or not. Almost all the posts you read lately on ODF and Open XML are more around politics, etc. so I wasn’t sure if that’s what people were more interested in or not.

    Sorry, but the open packaging conventions stuff isn’t exposed in the office object model directly. If you have WinFX, you could plug into that using system.io.packaging support, but I don’t think it would work against a ZIP file that didn’t follow the basic packaging conventions.

    -Brian

  10. Dileepa P says:

    People are against OpenXML just because Microsoft designed it! I just can’t understand that… especially when Microsoft is trying to be so open about the formats.

    I think you guys should release Office 97, for free to compete with OpenOffice/Star Office. No offence to those guys, but Office 97 still rocks compared to OO.

  11. Hi Brian,

    the mix of your blog posts is pretty good. I like reading the politics stuff, but I also like reading about the file formats themselves. If you are looking for more technical stuff to blog:

    A file size comparison of 97-2003 binary vs 2007 XML would be interesting, especially if media files come into play (I remember someone wondering why her PPT wouldn’t become any smaller when she saved it as PPTX. It was basically all pictures). It would be interesting to get an estimate how much smaller the typical Word, Excel, PPT files will be.

    I would also be interested in finding out what one has to do to make their own file format based on the open packaging convention? Where do I turn for resources on the topic? Where do I find tools? Where do I find examples?

    As the public beta is getting closer, a comprehensive summary blog post giving an overview of the new file formats would be very helpful. I had to hunt through your blog and lots of older posts on it to figure out the basics of how xxxxML works in general. One thing that baffled me for the longest time ever was "Open Package convention". I couldn’t find an explanation what that meant practically until I looked at your PDC presentation. It would be great if you could have two different posts: one for users/admins (what do the new file formats mean for me? e.g. with one implication being that users might want to stop zipping email attachments) and the other one being the technical one.

    Error recovery is also a topic of interest to me. If there is an error in the file, how do the Office programs recover from it? If that fails, how do I go about recovering my file manually? For example, do I try to cut out stuff in my existing file, or do I try to copy existing stuff over into a new one (the latter worked for me)? If you are interesting in my recovery story, there should be a post in the beta newsgroup. Alternatively, send me an email (stick anything you want in front of my domain name) and I’ll point you to the post and bug reports.

    Too bad that the parser is not accessible via the OM. Luckily I found a way out of my dilemma: #ziplib is a GPL licensed managed DLL for dealing with zip files (http://www.icsharpcode.net/OpenSource/SharpZipLib/). They grant one a nice exemption to the GPL that basically allows one to package the DLL even with closed-source commerical software. To parse the XML, I used System.Xml.XmlTextReader. That was a lot simpler to use than SAX and still a lot faster than DOM.

    Thanks,

    Patrick

  12. BrianJones says:

    Hi Dileepa, I think it just means that we have a bit more work ahead to gain folks trust and to help them understand more about the formats technically.

    Patrick, those are some great suggestions for blogs. I had been thinking that it would probably be good to write a couple overview posts again, just to help tie everything together.

    I’m also glad to hear you found a good zip library. That was one of the main reasons we chose ZIP. It was already so widely in use, that we figured it wouldn’t be too hard for people to get tools for working with them. Of course we also want to provide more tools to make it even easier, but we didn’t want our tools to be a requirement for working on the files.

    -Brian

  13. Brandon Bloom says:

    @ Dileepa P

    Microsoft would never release Office 97 for free to compete with OpenOffice. It just doesn’t make sense.

    Currently, the biggest competitor to Office 2003 is NOT OpenOffice. Not by a long shot.

    The biggest competitor to Office 2003 is OLDER VERSIONS OF OFFICE!

  14. XmlTeam says:

    > How about binary XML ?

    > http://www.w3.org/XML/Binary/

    That’s an interesting effort, but it has not resulted in even a draft of a W3C Recommendation yet, so can’t be considered any sort of "standard". The W3C effort appears to be driven mostly by folks in the wireless industry who need to optimize bandwidth utilization more than speed.  Both the ODF and OpenXML formats use ordinary "zip" compression to minimize file size.  That doesn’t work for the wireless industry because their messages are too short for ordinary compression algorithms to work well, but office documents generally are large enough to get good compression.

    It all goes back to what Brian said – "Optimize based on your design goals ."   There are a lot of binary XML formats out there, including several within Microsoft.  We’ve looked into trying to at least cut down on the apparent duplication, but upon closer examination we find that what look like small differences are actually optimizations for different design goals.  XML text is, and probably will remain, the most useful approach when an important design goal is data interoperability.

  15. Alex says:

    I’m not sure your SpreadsheetML vs. OpenDocument comparison is fair – the OpenDocument markup is verbose because it has a _lot_ more information contained in it, much of which is optional.

    For example, you’ve stored both the value of each cell and a printable representation of that value, whereas your SpreadsheetML doesn’t have that information. It doesn’t have the type information, so you don’t know what each cell contains. Etc.

    The point about tag lengths is a fair one, but your comparison isn’t. E.g., you’re using the prefix table: on each table tag; that could as easily be t: if you care about length.

    Also, OpenDocument is a mixed format, so on average you’re going to use many less tags than the equivalent OpenXML.

    I would be interested to see if you had any figures regarding XML parser performance. Looking at XML parsers here, the parsing speed is such that even the addition of a few hundred megabytes is only a few seconds of parsing speed. Some actual figures would be useful.

  16. BrianJones says:

    Alex, I think it’s a pretty fair comparison. I loaded both applications and typed in "1 2 3" in the first row and "4 5 6" in the second row, and then saved. That’s what was saved out.

    The default datatype on the cell for Excel is float, so it doesn’t bother writing that out. If the type were a string, it would say t="s" rather than value-type="string". I mentioned that there are a couple extra things that Excel writes out (the row and column numbers) which are another optimization that improves load performance, and maybe I should have left those in. They are of minimal length though, so I figured it would be easier to address them later.

    The idea of whether or not having a mixed content model helps should really be a separate discussion. In terms of just this discussion though, I’m not sure I see how a mixed content model would help in any way for spreadsheets, since each text node is ultimately going to be in a leaf node ("<c>"). The exception would be if the cells had a mixture of formatting.

    I’ll try to dig up some profiles from load times. The percentage of time varies based on the document, but in earlier testing with Excel we were seeing parsing as taking between 10-40% of the load time.

    Now, it’s important to note that the length of tag names is hardly the only optimization we made around performance. There are many more that have a significant impact. Some pieces of the design actually are more forward looking in terms of taking advantage of multi-proc machines, etc. (when I say forward looking, I mean that the format allows for things that we aren’t currently taking advantage of, but most likely will in the future). Anyone could come along and leverage that. I’ll continue to provide information around the SpreadsheetML design, and why we made the different initial design decisions that we did…

    -Brian

  17. As others have commented, fantastic post!

    It’s unfortunate that the world has forced XML from its original design as a human readable > data exchange format <, into the actual > data format < itself.  It seems that in the name of "freedom" everyone on the planet has become an expert in how to properly go about the development of document formats.

    It’s funny for me to hear me say this, as I’ve never been much of a VB fan, but in some ways I am actually longing for the days when VBA provided all of the extensibility anybody every wanted or needed.

    Of course, this was back in the days when software developers were the ones designing the software instead of politicians, and even further, software developers were the ones extending applications…  instead of politicians.

    Maybe its just me, but I just don’t see "Mom" cracking open her favorite XML editor to hack out some primo extension she’s been thinking about all morning, and just has to see if it works as well as she thinks it will.

    Oh well… Viva le’ COM.. err…. "MOM"!

  18. For those of you with interest in the tech side of the XML-based file format discussion, I’d recommend…

  19. Alex says:

    Brian,

    It’s not a fair comparison because you’re using two different applications. While OpenDocument is obviously a strong standard, that doesn’t mean that any given data can only be represented in one way. So, you’re comparing both the formats _and_ the applications which created them – two different variables. If you want to say OpenOffice.org’s performance is not up to Excel, that’s fine and probably true – but it’s another thing to then make statements about OpenDocument based on OpenOffice.org, which is but one app amongst many which implements the format.

    It may be that OpenDocument spreadsheets aren’t as performant in general, and it would be interesting to see figures. I think it’s unfair to declare it a "hobbyist format" on the basis above though; different design decisions have been made.

    As an example, your spreadsheet above has only numbers in it. If it was just strings, you could remove the type tags from the OpenDocument, but the SpreadsheetML would need them. Swings and roundabouts.

  20. BrianJones says:

    Alex, why don’t you post something you think is more representative if I’ve misstated things. That would probably be really helpful.

    OpenOffice is the only application I’ve been able to use that supports ODF so that’s how I created the example.

    -Brian

  21. Wesley Parish says:

    Just a couple of comments:

    Dileepa P and Brandon Bloom, I’ve often thought that the compelling reason why Microsoft should release the likes of MS Office 97, is precisely that Microsoft is currently competing against this massive installed base.  Release MS Office 97 under the Microsoft Community License, of course, so that Microsoft could still have some say in what happens with it.

    You see, this massive installed base is either a hindrance or a help.  If it is seen as a hindrance to the uptake of future MS Office updates, then Microsoft will in the end be at cross purposes with, and opposed to, the same customers who have loyally purchased Microsoft products and then decided that they have no further need for upgrading or any such thing.

    Or if Microsoft sees this massive installed base as a help, and decides to work with it, they can quite safely release the source tree of MS Office 97 etc, under a reciprocal license such as the MS Community License and use the flow of bug fixes and suchlike to ensure that the quality of their newer MS Office releases remains high.  All it takes is trusting your customers and the hobbyists, but Microsoft has seen this as a character flaw since BillG’s (in)famous Open Letter to Hobbyists.

    That’s what I think on that subject, anyway.

    Secondly, Brian Jones, I’m perfectly happy for SpreadsheetML to be faster and more efficient than ODF.  What I am more than a little worried about in the example, is the comment you let drop so casually: "The default datatype on the cell for Excel is float, so it doesn’t bother writing that out."  Now assuming I am writing a SpreadsheetML plugin for some other spreadsheet; am I constrained by SpreadsheetML to cast everything by default to floating point?  It seems to me that there could well be a hornets’ nest in that assumption.  And in moving a SpreadsheetML file to KSpread, say, from MS Office, or to GNumeric, is this (unstated) assumption likely to cause errors in the places where the spreadsheets make (even slightly) different assumptions?

  22. BrianJones says:

    Wesley, I’m not sure what you are concerned about. The schema defines the different datatypes (it’s an attribute called "t" that shows up on the cell tag "<c>"). The schema has an enumeration of values for the type attribute, and the schema also specifies that if the attribute is not present, then the default value is "n" which is Number.

    So there really isn’t any reason why another application will have problems, since the behavior is fully documented.

    Also, you were asking about the Gnumeric site… if you go there and check out the "News" section, you’ll see:

    "May 2006: We have a 1.7.0 prerelease build for Win32, including a first look at our MS Office 12 import filter. Get it from here (http://www.gnome.org/projects/gnumeric/downloads.shtml)!"

    -Brian

  23. Wesley Parish says:

    Thanks, Brian.  If it’s defined in the schema, and the schema’s public, then it answers my question aka problem with it.

    And it’s good that Gnumeric’s getting an MS Office 12 import filter.  It should mean that OpenXML will be supported on it as well.  While the Open Document Foundation gets ODF up and running on MS Office.

    Now the ball’s in the ODF court as far as performance goes.

  24. I wanted to follow up on the thread I started a couple weeks ago discussing the design goals behind spreadsheetML….

  25. As we move forward with the standardization of the Office Open XML formats, it’s interesting to look…

  26. There were a lot of great comments from last week’s announcement about the creation of an open source…

  27. I’ve blogged in the past about some of the things we did with the SpreadsheetML format to help improve

  28. I wanted to follow up on the thread I started a couple weeks ago discussing the design goals behind spreadsheetML.