Performance of an XML file format for spreadsheets

I've blogged in the past about some of the things we did with the SpreadsheetML format to help improve the performance of opening and saving those files. While a file format most likely won't affect the performance of an application once the file is loaded, it can significantly impact the performance of the actual load or save. The majority of a wordprocessing file is usually the actual text content and anything you can do to avoid making the storage of that content complex the better. The majority of a presentation is usually the rich media and styling information and it's important to store those in an efficient manner. The biggest challenge though hands down in terms of file format performance though is a spreadsheet.

It's very easy to get Spreadsheets comprised of millions of cells. In older versions of Excel, it used to be that the maximum number of columns allowed in a worksheet was 256 and the maximum number of rows was 64,000. That meant that a single worksheet in a workbook could consist of over 16 million cells. An of course there can me many worksheets in a workbook.

There were a number of customers who were hitting this limit and wanted us to increase the number of rows and columns allowed in a worksheet (this was actually one of Excel's top requests). This had been a request for awhile, but it was going to be extremely difficult to increase this limit in the old binary formats so we didn't do it. The move to the new Open XML formats though allowed us to increase the limits, so now in Excel 2007, you can have up to 16,000 columns and 1 million rows. That means you could have a worksheet with 16 billion cells. I can't imaging anyone hitting that limit, but to be honest, I couldn't believe folks were hitting the older limits.

So, while that's a look at the extreme end of things, it's not too uncommon to have very large spreadsheets, and a workbook with a million cells of data is definitely something we see a lot of.

Now, while in a Wordprocessing document you may potentially have a million words (which would be about 2000 pages), it's not as likely. More importantly though, the structure of the XML doesn't have as big of an impact. In wordprocessingML, you don't have every word contained in it's own XML element. So you while may have a million words, your XML element count would probably only be in the tens of thousands.

In a spreadsheet though, you have each cell represented by at least one, if not more XML elements which means that a spreadsheet of any reasonable size can easily have millions of XML elements and attributes. This can have significant implications on the load and save performance of the files.

A couple examples:

There are a number of things we looked into doing to help improve the performance of these files both using current technology as well as thinking about future approaches. Some of the approaches we took that are easiest to understand are:

  1. Reduce the need to fully parse repeating data - In order to cut back on parsing large strings repeated multiple times, or formulas that are repeated down an entire column, spreadsheetML does a lot of sharing. I already talked about the huge benefits we can get from the shared formulas work in a post last spring. The shared string table can give similar results.

  2. Tag Size - Contrary to this post from IBM's Rob Wier, the size of XML elements actually does directly affect performance times. The more text you have to parse, the longer it will take. Rob I hope you didn't misunderstand some of my earlier points on this though. I don't mean in any way to give the impression that tag size is the biggest factor in terms of file format performance because it's not. There are a number of other architectural issues that play a much bigger role. That said, the size of the tags does have an impact (as I've blogged about before).

    The simplest way to see the impact for yourself is to just take an application like OpenOffice, and get a decent sized spreadsheet (I took a file with just one worksheet, that was 15 columns by 15,000 rows). Save it in ODF and make 2 copies. In one of the copies open content.xml and change some of the namespace prefixes to make them much longer. I aimed to make it so that the resulting content.xml file was about 3 times larger than the original since this helps replicate the size difference you would see if we changed SpreadsheetML so that it's tag length was longer and more descriptive like ODF (it would probably be more than that, but I found that making content.xml 3 times larger was more than enough to get a noticable slowdown in load times). I used namespace prefix approach because it's the easiest way to replicate the effect of a larger element name without changing the actual element names (since that would prevent the file from being readable by OpenOffice). In my test, by essentially tripling the size of the element names, I found that the file went from loading in about 9 seconds to taking about 11 seconds.

  3. Splitting the XML into multiple parts - In SpreadsheetML we break the XML content out into multiple parts in the ZIP (each worksheet, the string table, pivot table data, etc.). This can help a ton with on demand loading, or even loading multiple pieces at once. For example, it means that if you had a multi-proc machine, you could load each worksheet on a separate thread. It also means that you could decide to only load on sheet and you wouldn't have to parse through all the XML from the other sheets.

  4. Relationships stored outside of content - By storing all the relationships from one part to another in separate (and much smaller) files, it makes it really easy to see what other parts you should load when you are loading a particular part of the file. If it weren't for the relationships, you'd actually have to parse through the content of the XML to determine what other resources that part used. For example, if you wanted to just load one slide in a presentation, you can also see what images are used on that slide before you start parsing the XML for the slide.

There are clearly a large number of factors that play into the performance of a particular format. As you can see, it primarily comes into affect when loading or saving a file. If you do any type of delay loading though or incremental saves, it can also start to have an impact on your editing performance though. The key issues I've focused on primarily affect the general load and save times. The examples above are just a handful of issues we looked at when designing the formats.

-Brian