Spreadsheet performance - Shared Formulas

I wanted to follow up on the thread I started a couple weeks ago discussing the design goals behind spreadsheetML. There's a whole host of things we've done to make sure that the move to XML formats is a huge benefit to developers, without it actually having a significant negative impact on our end users. Moving our formats into an open XML format significantly increases the relevance and value of Office files because they can have more interactive roles with business processes and systems. If the files aren't open, then they can't fit into as many scenarios and we lessen their value. That's was the whole reason we started the move to XML formats so many years ago, because we wanted to significantly increase the value of Office documents.

The big key here though is to make sure that people will actually use the formats. If our users decide to stick with the old binary formats, then we lose out on that opportunity. So it's really important to make sure that for the average end user (who doesn't really care or put any thought into file formats) doesn't see any significant losses from moving into the new XML formats.

A big area for Spreadsheets is performance. It was really scary to move from the Excel binary formats that were so damned fast into an XML format that we knew couldn't match up. There was a lot of work though analyzing every aspect of file load times to make sure that we could keep the performance drop to an absolute minimum, and we actually have really done a great job. We are absolutely trying to design for the developer, but the priority is of course given to the end user experience when it comes down to difficult design decisions.

I already talked about a few things we've done like keeping tag and attribute lengths small. I also mentioned the shared string table and I'll go into more detail on the benefits of that later. Today I want to discuss some optimizations we've done around formulas.

Shared Formulas

Most folks who've done any type of spreadsheet work are familiar with formulas. More relevant to this post though, you are probably also familiar with using a common formula in a column to make a similar calculation for each row. For example, image the following table:

Product ID 

Price 

Quantity 

Total 

5.45 

49.05 

3.99 

15 

59.85 

Most likely, that fourth column is a formula so that if you could look at the formulas, the table would look like this:

Product ID 

Price 

Quantity 

Total 

5.45 

=B2*C2 

3.99 

15 

=B3*C3 

Now, for this case, it probably doesn't look like there are a lot of optimizations we can do around the formula storage. The XML for this table would look something like this (in shorthand):

<

table>
<row>
<c><v>Product ID</v></c>
<c><v>Price</v></c>
<c><v>Quantity</v></c>
<c><v>Total</v></c>
</row>
<row>
<c><v>1</v></c>
<c><v>5.45</v></c>
<c><v>9</v></c>
<c>
<f>=B2*C2</f>
<v>49.05</v>
</c>
</row>
<row>
<c><v>2</v></c>
<c><v>3.99</v></c>
<c><v>15</v></c>
<c>
<f>=B3*C3</f>
<v>59.85</v>
</c>
</row>
</table>

The above XML representation would be pretty good if the table were this simple. Imagine though if this were actually more of a real world spreadsheet. Let's say this is tracking a large number of orders, so maybe you have more like 10,000 rows instead of just 2. Well that would mean that you have to parse that formula and figure it out 10,000 times. This would be a huge performance hit, especially if the formula was more complex.

You've probably noticed in Excel that if you type a formula into the first row and then copy it down through all the other rows, Excel can automatically adjust that formula in each row so that it's making the right calculation (ie you put =B2*C2 in row 2 and when you copy that into row 3 it says =B3*C3). Well, why shouldn't we do the same things in the file format? No reason to write out the formula 10,000 times and have to parse it 10,000 times. A complex formula can take awhile to parse, so we need to optimize around as little parsing as possible. Here is what the XML would actually look like using this approach:

<table>
<row>
<c><v>Product ID</v></c>
<c><v>Price</v></c>
<c><v>Quantity</v></c>
<c><v>Total</v></c>
</row>
<row>
<c><v>1</v></c>
<c><v>5.45</v></c>
<c><v>9</v></c>
<c>
<f t="shared" si="0" ref="B2:B3">=B2*C2</f>
<v>49.05</v>
</c>
</row>
<row>
<c><v>2</v></c>
<c><v>3.99</v></c>
<c><v>15</v></c>
<c>
<f t="shared" si="0"/>
      <v>59.85</v>
</c>
</row>
</table>

Notice that now the formulas say they are "shared" and they have an id of "0" so that we know they match up. The first instance of the formula stores the actual formula value, as well as specifies the range that it applies to so that as the rest of the spreadsheet is parsed, you know that you can ignore those other cells and just apply the shared formula.

I admit that it makes development a bit more difficult, but as I said before, we kind of had to take the training wheels off and actually think more about the end users as well. There are hundreds of millions of users, and we need to keep them in mind. We definitely want these formats to be accessible to developers (otherwise we should have just stuck with the old binaries), and that's why we are taking the Ecma standardization so seriously. We need to make sure that we fully document these formats so that anyone can build solutions around them. We are also looking to folks in the openxmldeveloper.org community to help out here by building tools that work on top of the formats. This way we get the best of both worlds. We get a very fast format that's also easy to develop against!

-Brian