Spreadsheet performance – Shared Formulas


I wanted to follow up on the thread I started a couple weeks ago discussing the design goals behind spreadsheetML. There’s a whole host of things we’ve done to make sure that the move to XML formats is a huge benefit to developers, without it actually having a significant negative impact on our end users. Moving our formats into an open XML format significantly increases the relevance and value of Office files because they can have more interactive roles with business processes and systems. If the files aren’t open, then they can’t fit into as many scenarios and we lessen their value. That’s was the whole reason we started the move to XML formats so many years ago, because we wanted to significantly increase the value of Office documents.


The big key here though is to make sure that people will actually use the formats. If our users decide to stick with the old binary formats, then we lose out on that opportunity. So it’s really important to make sure that for the average end user (who doesn’t really care or put any thought into file formats) doesn’t see any significant losses from moving into the new XML formats.


A big area for Spreadsheets is performance. It was really scary to move from the Excel binary formats that were so damned fast into an XML format that we knew couldn’t match up. There was a lot of work though analyzing every aspect of file load times to make sure that we could keep the performance drop to an absolute minimum, and we actually have really done a great job. We are absolutely trying to design for the developer, but the priority is of course given to the end user experience when it comes down to difficult design decisions.


I already talked about a few things we’ve done like keeping tag and attribute lengths small. I also mentioned the shared string table and I’ll go into more detail on the benefits of that later. Today I want to discuss some optimizations we’ve done around formulas.



Shared Formulas



Most folks who’ve done any type of spreadsheet work are familiar with formulas. More relevant to this post though, you are probably also familiar with using a common formula in a column to make a similar calculation for each row. For example, image the following table:

























Product ID 


Price 


Quantity 


Total 



5.45 



49.05 



3.99 


15 


59.85 


Most likely, that fourth column is a formula so that if you could look at the formulas, the table would look like this:

























Product ID 


Price 


Quantity 


Total 



5.45 



=B2*C2 



3.99 


15 


=B3*C3 


Now, for this case, it probably doesn’t look like there are a lot of optimizations we can do around the formula storage. The XML for this table would look something like this (in shorthand):


<table>
  <
row>
    <
c><v>Product ID</v></c>
    <
c><v>Price</v></c>
    <
c><v>Quantity</v></c>
    <
c><v>Total</v></c>
  </
row>
  <
row>
    <
c><v>1</v></c>
    <
c><v>5.45</v></c>
    <
c><v>9</v></c>
    <
c>
      <
f>=B2*C2</f>
      <
v>49.05</v>
    </
c>
  </row>
  <
row>
    <
c><v>2</v></c>
    <
c><v>3.99</v></c>
    <
c><v>15</v></c>
    <
c>
      <
f>=B3*C3</f>
      <
v>59.85</v>
    </
c>
  </
row>
</
table>


The above XML representation would be pretty good if the table were this simple. Imagine though if this were actually more of a real world spreadsheet. Let’s say this is tracking a large number of orders, so maybe you have more like 10,000 rows instead of just 2. Well that would mean that you have to parse that formula and figure it out 10,000 times. This would be a huge performance hit, especially if the formula was more complex.


You’ve probably noticed in Excel that if you type a formula into the first row and then copy it down through all the other rows, Excel can automatically adjust that formula in each row so that it’s making the right calculation (ie you put =B2*C2 in row 2 and when you copy that into row 3 it says =B3*C3). Well, why shouldn’t we do the same things in the file format? No reason to write out the formula 10,000 times and have to parse it 10,000 times. A complex formula can take awhile to parse, so we need to optimize around as little parsing as possible. Here is what the XML would actually look like using this approach:


<table>
  <
row
>
    <
c><v>Product ID</v></c
>
    <
c><v>Price</v></c
>
    <
c><v>Quantity</v></c
>
    <
c><v>Total</v></c
>
  </
row
>
  <
row
>
    <
c><v>1</v></c
>
    <
c><v>5.45</v></c
>
    <
c><v>9</v></c
>
    <
c
>
      <
f t=shared si=0 ref=B2:B3>=B2*C2</f
>
      <
v>49.05</v
>
    </
c
>
  </row
>
  <
row
>
    <
c><v>2</v></c
>
    <
c><v>3.99</v></c
>
    <
c><v>15</v></c
>
    <
c
>
      <f t=shared si=0/>
      <v>59.85</v>
    </
c
>
  </
row
>
</
table>


Notice that now the formulas say they are “shared” and they have an id of “0” so that we know they match up. The first instance of the formula stores the actual formula value, as well as specifies the range that it applies to so that as the rest of the spreadsheet is parsed, you know that you can ignore those other cells and just apply the shared formula.


I admit that it makes development a bit more difficult, but as I said before, we kind of had to take the training wheels off and actually think more about the end users as well. There are hundreds of millions of users, and we need to keep them in mind. We definitely want these formats to be accessible to developers (otherwise we should have just stuck with the old binaries), and that’s why we are taking the Ecma standardization so seriously. We need to make sure that we fully document these formats so that anyone can build solutions around them. We are also looking to folks in the openxmldeveloper.org community to help out here by building tools that work on top of the formats. This way we get the best of both worlds. We get a very fast format that’s also easy to develop against!


-Brian

Comments (25)

  1. John says:

    Brian,

    You might want to fix your examples. In the tables row 2 and 3 are different,(Product 1 and 2). In the code examples you repeate the data from row 2 in row 3.  

  2. BrianJones says:

    Great catch John, thanks. I’ve updated the examples to match the tables…

  3. >> I admit that it makes development a bit more difficult, but as I said before, we kind of had to take the training wheels off and actually think more about the end users as well. <<

    Beautiful!  Take ’em off :)  I’m ready to ride without them 😀

    Computer Science is supposed to be difficult…  Not that I would want to encourage the notion of making it as difficult as possible such that the only one who can possibly make sense of it all is Knuth, but you have to draw the line somewhere.  

    Its also nice to see you folks are thinking about the fact that just because we have the processing power doesn’t mean we need to be overclocking our CPU’s and keeping them at 90% capacity 24 hours a day.  Its nice to know the power is there if we need it.  Its also nice to know that when we need it, its there (if thats make any sense :)

  4. Mike says:

    Looks a lot like BIFF with angle brackets so far…

    Case in point, what about business solutions that have to update formula values as part of the file update? Since this blog is about the so-called power of Xml and associated API, how do you achieve this without a running Excel instance?

  5. Biff says:

    "Since this blog is about the so-called power of Xml and associated API, how do you achieve this without a running Excel instance?" — Mike

    Perhaps invest in better coders? This appears to be very sensible and straightforward optimization, not rocket science by any stretch of imagination. C’mon, do you believe every user out there must suffer from performance drops because your coders cannot get their act together? Bleh.

  6. Mike says:

    Biff,

    I think you missed my point. How can you possibly say you can perform calculations without relying on a running instance of Excel? You realize that in order to do that you end up rewriting Excel, right?

  7. Biff says:

    "How can you possibly say you can perform calculations without relying on a running instance of Excel? You realize that in order to do that you end up rewriting Excel, right?" — Mike

    Wrong. You’re free to use any or all of Quattro Pro, OOo Calc, Gnumeric, KSpread, and more pure-bred calculation engines like KDCalc and SpreadsheetGear – plug them in and update your data. You might get different results, but we’re talking file formats here, right?

  8. Alex says:

    I guess if you’re doing string sharing etc. it’s probably a nice idea to extend that to different parts, like formulas.

    Brian, is there a way of turning these optimisations off individually or not? I’m just wondering if it’s possible to try and actually get some numbers on what difference these changes make.

    I have the same worry as Mike – the more complex it is to interpret a spreadsheet, the less useful the OXML format becomes IMHO. But, since OXML doesn’t store result values (as far as I know?) you have to process the formulas anyway, so I guess this isn’t much extra work on top of that.

    Biff – you probably could use alternative engines for some spreadsheets. Gnumeric in particular has a highly compatible implementation, which contains all the functions Excel does and gives more accurate results. But the others are not as compatible.

    Since OXML is standardising the functions available, maybe other spreadsheets will build in a compatibility module. It would be better if the formula stuff could be standardised on its own, though.

  9. >> I think you missed my point. How can you possibly say you can perform calculations without relying on a running instance of Excel? You realize that in order to do that you end up rewriting Excel, right? <<

    I feel like I’m in the twilight zone:

    Group:We HATE Microsoft, err… Open Standards Advocates:

    "We DEMAND Open Standards to encourage competition and to not lock out other vendors from developing their own versions of office software that can openly read and write to this format"

    Group:Same as above:

    "How can you possibly say you can perform calculations without relying on a running instance of Excel? You realize that in order to do that you end up rewriting Excel, right?"

    Me: "Du doo du doo… Du doo du doo… Do not adjust your set… Do not… "

    To quote Biff:

    Bleh!

    Okay, sarcasm aside…  The point of an open standard is to ensure that anybody, regardless of who they are, has the ability to openly use this standard for the primary purpose of interop.  Or in other words, no matter what tool you might choose to use, you can expect that this same document format will be accessible to open, edit, save, repeat, as long as that vendor of choice has implemented support for this open standard.

    "You realize that in order to do that you end up rewriting Excel, right?"

    Yep!  And thats the whole point.

  10. @Alex,

    Hey, I’m diggin’ your comments… VERY constructive…  It’s fantastic to see that you have taken to this position as you are obviously someone who has a lot of value to add to the conversation.

    Learning things from folks is what I hope to do everyday of my life.  Today, you help me learn something I didn’t know yesterday.

    Thanks! 😀

  11. Brian Jones: Open XML Formats : Spreadsheet performance – Shared Formulas via a comment to the above linked, Biff (unfortunately I don’t have a link (Biff, if you happen to read this and have a preferred link, let me know…

  12. Mike says:

    Biff said "Quattro Pro, OOo Calc, Gnumeric, KSpread, and more pure-bred calculation engines like KDCalc and SpreadsheetGear"

    I didn’t know those apps had native support for importing/exporting .xlsx files in order to perform calculations outside of Excel. Thanks for the info.

    😉

  13. Biff says:

    "I didn’t know those apps had native support for importing/exporting .xlsx files in order to perform calculations outside of Excel." — Mike

    So your attention span is under 25 words, is it what you wanted to say? <g>

  14. Mike says:

    That was sarcasm.

    Btw, are you a Microsoft employee? I’m willing to know if someone of your "caliber" is inside or outside the fence.

  15. Biff says:

    "That was sarcasm. " — Mike

    Doubt it.

    "Btw, are you a Microsoft employee?" — Mike

    No, I’m but a lowly Microsoft shill, can’t you tell?

  16. Gabe says:

    Alex: The spreadsheet has to contain the values shown for each formula. Doing a recalc may be very expensive or may even change the values (in case of RAND or iterative computations).

    It appears you didn’t notice that the formula cells have both an "f" tag (for the formula) and a "v" tag (for the value.

  17. Hi Brian,

    I just came across an interesting Word topic that would be great to hear on from you: Master Documents

    Is there still a need for master documents in 2007 (meaning can it better handle documents with >500 pages, more than 32 MB of text, etc?) Do master documents work better (hence don’t corrupt easily) when used with new XML file formats?

    Thanks,

    Patrick

  18. As we move forward with the standardization of the Office Open XML formats, it’s interesting to look…

  19. By using a reference for relative identical formulars it makes it more difficult to read and understand an OOXML document on the fly.

    You also have the problem when updating a document writing a "new" formular because you have to check all the existing formulars if one of them is relative identical – or is it only when the formulars are in the same or extended area like original in B2:B5 but now B1:B5.

    You could also implement it in a way so you both have the reference and the formular so the application only have to read one of the two but have to save them both. Quickly to read but slower to write.

  20. BrianJones says:

    Claus, it is true that it’s a bit more difficult if your goal is to go to a specific cell and analyze it without looking anywhere else. Otherwise though, it’s fairly straightforward.

    Your suggestion of duplicating the data isn’t really useful unless we actually enforced that they matched. Otherwise a consumer wouldn’t know which one to trust. We try to avoid data duplication unless it gives a perf gain, and even in that case we try to make sure we enforce that the duplicated data is valid (otherwise it’s much harder for consumers to know which value to pick).

    If you think about it, the shared formulas not only make things significantly faster, but they also make them easier. If you have 50,000 rows of data, and one of the columns is a calculation of some of the other columns, then it’s much easier to update that formula in this case (you only need to edit one cell). Without the shared formulas, you’d have to edit all 50,000 instances of the formula.

    -Brian

  21. There were a lot of great comments from last week’s announcement about the creation of an open source…

  22. I’ve blogged in the past about some of the things we did with the SpreadsheetML format to help improve