Design Goals Behind SpreadsheetML


It’s been a while since I’ve talked in detail about the SpreadsheetML schema, and I apologize for that. I had a number of posts back in the summer that walked through the SpreadsheetML format we built for Office XP about six years ago, but obviously a lot has changed since then.


The new SpreadsheetML that is part of the Open XML formats coming with Office 2007 had to undergo serious work to be ready to serve as the default format. As you all know, the majority of folks don’t really care what kind of format they are using; they just want it to work (remember that most end users have never even heard of XML). We wanted our formats to play a more vital role in business processes though, which is why we’ve slowly been progressing toward these new default XML formats. We want people to be able to easily build solutions on top of the formats, but at the same time we don’t want the average end user to notice much difference with the change (at least no negative differences).


That leads me to why we had to restructure SpreadsheetML from the original design. The two issues with the SpreadsheetML format from six years ago were that it wasn’t full fidelity, and it wasn’t optimized for performance or file size. “Full fidelity” just means that everything in your file can be saved into the format without fear of it being modified or lost. The old SpreadsheetML format didn’t support a number of features like images, charts, objects, etc., so we had to add all of those to the new format.


The second part (performance) was a really important and challenging one. We wanted to move to an open format so that people could build solutions around our formats. Like many other applications out there, we chose a combination of ZIP and XML to achieve this. We had to structure the XML in such a way that it could be parsed extremely efficiently, so that the file open and save experience wouldn’t get significantly slower. There have been a number of articles on this issue, where people have complained about performance in other applications that use XML as their format. Of course we had to keep this in mind with our design, and for those of you who have played around with it, I’m sure you’ve noticed the difference.
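
To make the “ZIP of XML parts” structure concrete, here’s a minimal sketch (not how Excel itself does it) that treats a workbook as an ordinary ZIP archive and lists the parts inside it. The file name Book1.xlsx is just a made-up example:

import zipfile

# An Open XML spreadsheet is simply a ZIP package containing XML parts.
# "Book1.xlsx" is a hypothetical example file in the current directory.
with zipfile.ZipFile("Book1.xlsx") as package:
    for name in package.namelist():
        info = package.getinfo(name)
        print(f"{name}: {info.file_size} bytes uncompressed, {info.compress_size} bytes in the ZIP")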


While I’m not going to go into a full description of the SpreadsheetML format, I’d at least like to give you a brief introduction. A SpreadsheetML package is made up of a few different pieces. Let’s look at a basic diagram of the pieces of a spreadsheet:



The main parts I wanted to call out for today are:




  1. “sheet1” – This is the data for the worksheet. Each worksheet is stored as its own XML file within the ZIP package, which means you can easily get at your data within a particular sheet without having to parse all the other sheets.


  2. “sharedStrings” – Any string (not number, just string) used in the sheet is actually stored in a separate location. There is a part called the “shared string table” that stores all the strings used in the file. So, if you have a column called “states”, and “Washington” appears 100 times in the spreadsheet, it will only need to be saved into the file once, and then just referenced.

I think an example might be best to help show what I’m talking about. Let’s take a spreadsheet that looks like this:

ID   Num   Resource
1    543   F068BP106B.DWG
2    248   F068BP106B.DWG

In the Open XML file, there would be an XML part containing the strings used, which would look like this:



Shared String Table



<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/5/main">
  <si>
    <t>ID</t>
  </si>
  <si>
    <t>Num</t>
  </si>
  <si>
    <t>Resource</t>
  </si>
  <si>
    <t>F068BP106B.DWG</t>
  </si>
</sst>


Then, in the main sheet, there would be cell values, and pointers into the string table wherever a string occurs:



Sheet1



<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/5/main">
  <sheetData>
    <row>
      <c t="s">
        <v>0</v>
      </c>
      <c t="s">
        <v>1</v>
      </c>
      <c t="s">
        <v>2</v>
      </c>
    </row>
    <row>
      <c>
        <v>1</v>
      </c>
      <c>
        <v>543</v>
      </c>
      <c t="s">
        <v>3</v>
      </c>
    </row>
    <row>
      <c>
        <v>2</v>
      </c>
      <c>
        <v>248</v>
      </c>
      <c t="s">
        <v>3</v>
      </c>
    </row>
  </sheetData>
</worksheet>


Notice that in the first row, each cell has the attribute t="s", which means it’s a string. The value is then interpreted as an index into the string table, rather than an actual number value. In the second and third rows, the first two cells don’t have the t="s" attribute, so they are interpreted as numbers and the values are the actual values.
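
To make the relationship between the two parts concrete, here’s a minimal consumer sketch (mine, not Excel’s code) using Python’s standard library. It assumes the shared string markup and the sheet markup shown above have already been pulled out of the package into the hypothetical variables sst_xml and sheet_xml, and it uses the draft namespace from this post:

import xml.etree.ElementTree as ET

NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/5/main}"

def load_shared_strings(sst_xml):
    # The table is 0-based; a cell with t="s" stores an index into this list.
    root = ET.fromstring(sst_xml)
    return [si.findtext(f"{NS}t") for si in root.findall(f"{NS}si")]

def read_rows(sheet_xml, shared):
    root = ET.fromstring(sheet_xml)
    for row in root.iter(f"{NS}row"):
        cells = []
        for c in row.findall(f"{NS}c"):
            v = c.findtext(f"{NS}v")
            if c.get("t") == "s":   # shared string: <v> holds an index
                cells.append(shared[int(v)])
            else:                   # no type attribute: <v> holds the number itself
                cells.append(float(v))
        yield cells

# Usage (sst_xml and sheet_xml hold the two XML snippets above):
# shared = load_shared_strings(sst_xml)
# for row in read_rows(sheet_xml, shared):
#     print(row)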


This may seem a bit complex, but remember that while this format was designed for developers to be able to use, we couldn’t afford the performance hit that would have come with making it completely intuitive. Believe me, as a developer, I would have loved to make the formats more verbose and straightforward, but that would have meant that everyone else opening the files would have to suffer for it. If the example above were a more complex set of data with a number of separate worksheets, each with a few thousand rows, you can imagine how quickly the savings from the string table and terse tag names would add up. I had a couple of posts back in the summer talking about some other basic things we do to make sure that the formats are quick and efficient.
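
As a rough, back-of-the-envelope illustration of that point (this is just a sketch I’m adding, not a benchmark of Excel), the snippet below builds the same repetitive column two ways, once with every string written inline and once with a shared string table plus index references, and compares the amount of text a parser would actually have to walk, along with the zlib-compressed sizes:

import zlib

ROWS = 10_000
states = ["Washington", "Oregon", "Idaho", "North Dakota"]

# Variant 1: every cell carries its full string inline.
inline = "".join(
    f'<row><c t="inlineStr"><is><t>{states[i % len(states)]}</t></is></c></row>'
    for i in range(ROWS)
)

# Variant 2: a small shared string table, with cells that hold only an index.
table = "".join(f"<si><t>{s}</t></si>" for s in states)
refs = "".join(f'<row><c t="s"><v>{i % len(states)}</v></c></row>' for i in range(ROWS))
shared = table + refs

for label, doc in (("inline strings", inline), ("shared string table", shared)):
    raw = doc.encode("utf-8")
    print(f"{label}: {len(raw)} bytes to parse, {len(zlib.compress(raw))} bytes compressed")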


This tradeoff of who you design around, and how you weigh ease of use against efficiency, is something folks have to look at every day when they design products. Whether it’s an API, a user interface, or a file format, you need to decide which target user you are going to give more weight to when you make your design decisions. We had to give more weight to the end user, and instead require a bit more knowledge from the developer. That’s why the Ecma documentation is so important. We need to make sure that the format is documented 100% and that there are no barriers to interoperability. The great group of people we have on TC45 are really helping a lot here. As I said last week, the Novell guys have already built some working code that allows Gnumeric to open and save spreadsheet files in the Open XML format. I’m sure we’ll see more and more implementations as we provide better documentation and get closer to a complete standard. It’s really exciting! That’s one of the great things we’ll see more and more of up on the openxmldeveloper.org site.

-Brian

Comments (21)

  1. itsadok says:

    Is the key for the Shared String Table really determined by the position of the string in the table? What happens when a string is deleted?

    I would have thought it would be something like

    <si key="0">
      <t>ID</t>
    </si>

  2. BrianJones says:

    Yes, it’s done based on the position. That makes it faster to read the string table into memory when loading the file.

    It’s pretty easy to query for a particular node based on its index rather than an attribute value, so really the only complexity this adds is in situations where you want to delete a string (as you mention).

    Since a string only appears once in the string table though, even if you’ve removed all references to that string there isn’t much benefit to deleting it (it’s not taking up much space). You can just leave it around, or in the worst case (if you are removing it for some kind of privacy reason or just want to simplify) you can just zero it out.

    At some point later (like when the entire file is loaded and resaved) the string table and all references can be cleaned up. Excel (for example) will recreate the string table whenever the file is saved. Other applications can do the same if they want.

    -Brian

  3. Adam says:

    Gadzooks!

    I understand your reasoning behind it, but it just doesn’t make sense to me.

    Using short strings instead of long strings in the on-disk format shouldn’t make any difference at all to the final file size, because the on-disk format is compressed. Any string that occurs a lot (such as "cell", "type", "string" or "value") in the file will end up with a small size in the zip no matter what it actually is, because that’s how compression works.

    Similarly, using indexes and a string table in the XML format seems an awkward way of doing things. If instead you updated the XML parser so that as it loads and parses XML, it hashes each string it finds (be that tag names, or CDATA) and uses a hash map to store the strings and eliminate duplicates at run-time, while still presenting the normal XML DOM API to clients this would:

    a) Provide most of the benefits that the Office XML way provides. Certainly you should get similar results in terms of memory savings, as in both cases you’re ending up with a string table. The performance should not be significantly (order of magnitude) worse either, as hash lookups and array accesses are both effectively constant time, even though the constant may be a little higher for the hash lookup.

    b) Provide these benefits for *all* applications that use the MS XML parser. Which is – oh – almost every XML-aware application written for Windows!

    c) Make things *much* easier for everyone else who has to work with the files. That is, after all, the whole point of using an XML format.

  4. BrianJones says:

    Adam,

    Don’t forget the file size and parse times are significantly different. You don’t parse the zipped XML, you instead unzip and then parse. The longer the tag names, the longer it takes you to parse the file.

    -Brian

  5. Andrea says:

    Adam brought up an interesting point.  Is MS Office using DOM or SAX when parsing the XML files?  As I understand it, it is using SAX.

    Assuming it is using SAX, how does Excel resolve references to other cells, particularly to sheets not yet loaded?  In the binary format there used to be a way to jump around rows in groups of 32 using indexes/dbcells, so you could get to a cell on another sheet pretty quickly directly from the file.  Is there any corresponding shortcut in the XML format, or does one have to parse the entire sheet to get at the desired source of the reference?  Given a worst-case scenario, you may end up having to read all sheets just to load the first one.

    This final question probably needs to be directed at the Excel people, but it’s still format related.  I noticed that diagonal borders still cannot be formatted independently of each other.  What is so difficult about allowing the up diagonal to be blue and the down diagonal red?  This seems like a most ridiculous limitation 🙂  I assume this is because in the binary format there was no way to store it, but with the new format this should be really easy.

  6. Alex says:

    So, does that mean it’s actually impossible to save strings on the sheet itself, or just that by default Excel will store strings in a table?

    If you can’t put strings on the sheet, that would mean that in general you’re going to need at least two XSLT transforms if you’re converting some other XML, since you’re having to output two XML trees. :/

  7. oli says:

    Just curious, what are the .DWG resources you mention here? Just made up? Autodesk DWG files? Or something particular to Excel or OpenXML/SpreadsheetML?

    thanks

  8. BrianJones says:

    Andrea,

    You are correct, we use SAX for parsing the files. The references to other cells are not fully resolved until the entire book is loaded. I’ll try to provide some more information for you around what you would do if you only wanted to read one sheet, but the short of it is that yes, depending on what level you want to get at, you may end up having to parse all sheets in order to properly display just one (there is a last cached value on most things though, so as I said it depends on what level you want to go to).

    Alex,

    No, you don’t have to use the string table. It’s used by Excel (and I would assume other Spreadsheet applications that care about performance optimizations), but it is not required. You can instead just do the following if you want the string inline:

    <c t="inlineStr"><is><t>my string</t></is></c>

    -Brian

  9. BrianJones says:

    Oli, those are just made up. I grabbed an example file and just trimmed it down to a couple of rows. Not sure what they were resources for. The original file was from this article: http://blogs.zdnet.com/Ou/?p=120

    -Brian

  10. Adam says:

    Brian, by "unzip and then parse", does that mean you write an uncompressed copy of the zip file to a temp location and then read that copy with the parser?

  11. BrianJones says:

    Hey Adam, sorry if I wasn’t clear. I’m not implying that you need to write to a temp file (we don’t). I mean that you need to uncompress the XML before you can start to parse it. This could all be done in memory, but either way at the end of the day you are parsing XML, not the binary blob that results from the compression. I’m not talking about the hit it takes to read from disk, I’m talking about the hit it takes to actually parse over the text as you interpret the tags, attributes, and values. The longer the tag names, the more characters you have to look at to determine the tag name.

    -Brian

  12. Adam says:

    That’s cool. It’s what I originally figured you’d be doing.

    But I can’t see the merit in your point about saving time actually parsing the tags, attribs and values. If the time it takes to iterate over a bunch of bytes looking for a quote, apostrophe, ampersand, equals sign or less-than character is even 1% of the time it takes to actually read the data from disk (very slow!) and decompress it (quite slow!), I’d have thought you’d probably be doing something seriously wrong with the parser.[0]

    Secondly, you don’t have to uncompress the whole XML stream before you pass it to a SAX parser, that’s part of the point of using SAX.

    If you only uncompress, say 20K, of compressed data at a time before feeding that to SAX, then in the case of short tags the uncompressed data might take ~100k, but for long tags it might take ~400K (rough guesses here).

    Now, that sounds bad, but that buffer is temporary and only exists while you’re loading the data. Most importantly, it doesn’t need to get bigger as the size of the file does; you just iterate over the read-from-zip/decompress/send-to-sax loop more times. So yes, you do lose out a bit there.

    OTOH, if your SAX parser keeps tag names/attribute names/etc… in a hash table as well as the actual CDATA, then it shouldn’t matter how long those names are. You can use descriptive names pretty much without worrying about it. And all developers benefit – especially new ones. You’d lower the learning curve/barrier for entry for hacking around with the XML formats.

    Also, I’d have thought that keeping tag names/attrib names in a hash table would probably save a fair amount of memory all by itself. If you’ve got an object representing a SAX tag or a DOM node, then storing even a one-letter name string takes a fair amount of memory. String objects typically contain a pointer to the actual data, a chunk of memory to hold that data (2 bytes for a one-character unicode string), and an int containing the length of that data. Pointer + int + data = 10 bytes, but the data will probably be padded to 4 bytes *minimum*, and you’ve got heap manager overhead for the data, so you’re looking at 16 bytes minimum for each 1 character tag name. OTOH, a 32-bit int hash of the string, plus a 32-bit int index into that particular hash bucket, gives you only 8 bytes per tag for as long a tag name as you want, stored directly in the SAX/DOM structure. (Assuming tag names are repeated a lot and the size of the hash table is dwarfed by the size of the SAX/DOM structures, which isn’t unreasonable)

    So you’d halve your minimum per-tag-name-memory-costs while making long tag names "free". Wouldn’t you?

    Or where have I gone wrong here?

    [0] OK, I’ve not seen any profile data on that, so I may be way off there. It doesn’t feel right though. I/O takes *years* in computer time.

  13. BrianJones says:

    Hey Adam, those are some great questions. Some of this is actually at a much deeper level than I deal with, but I’ve been talking with some folks on my team about it. Each application has slightly different approaches, and rather than write it up in comments I think what I’ll do is pull together more of the details and write a new post about it. Sound good?

    -Brian

  14. Adam says:

    Looking forward to it.

  15. Alex says:

    I would agree with Adam somewhat.

    This just sounds like another layer of compression designed to get the data set size down a bit to get the parser through it quicker.

    The extra indirection means that a string for any given cell could be in two places, which sounds like reading OpenXML with simple tools (e.g., XSL) is going to be a pain 🙁 I guess at least you can ignore the string table for output, which reduces the pain somewhat.

  16. Surprisingly, I haven’t seen much information out there discussing the performance impacts of XML tag…

  17. BrianJones says:

    Adam, sorry that I didn’t get to this right away. I had a couple things I wanted to verify. Here is some more information: http://blogs.msdn.com/brian_jones/archive/2006/05/16/599096.aspx

    Alex, many spreadsheet applications actually use string tables even for their internal memory structures. It’s much faster, especially if you are talking about a large spreadsheet. Imagine a table of census data where you have a column called "state", and "North Dakota" appears 50,000 times. Do you want to read in the string "North Dakota" that many times? Or would you rather have a string table of the 50 states, and have each cell just hold the value of which row it is in the string table (i.e. "15")? You’re right that compression can solve the same issue, but that doesn’t help you when it comes down to actually parsing the XML as it’s being decompressed.

    -Brian

  18. I wanted to follow up on the thread I started a couple weeks ago discussing the design goals behind spreadsheetML….

  19. As we move forward with the standardization of the Office Open XML formats, it’s interesting to look…

  20. There were a lot of great comments from last week’s announcement about the creation of an open source…

  21. I wanted to follow up on the thread I started a couple weeks ago discussing the design goals behind spreadsheetML.