ODF Spreadsheet Interoperability

Rob Weir posted on his blog a couple of days ago an Update on ODF Spreadsheet Interoperability.  I think it’s great that he has brought up spreadsheet interoperability, and specifically the issue of formulas, which seems to be the main thrust of his post.  I mentioned on the day of our SP2 release last week that “I’ll be doing some blog posts that get down into more of the technical details, to help explain some of the engineering decisions that we made in our implementation,” and Rob’s post is a good starting point for that conversation.

For those who only want to read the first two paragraphs of this very long post, here’s a summary:

  • ODF’s lack of spreadsheet formula syntax creates some interoperability challenges;   Because ODF 1.0 and 1.1 do not support formulas, all ODF spreadsheet implementations are application-dependent
  • We’ve worked hard to overcome these challenges in ways that provide accurate results and predictable interoperability
  • We have been fully transparent about the decisions we’ve made in our ODF implementation, both in terms of the guiding principles we’ve followed and also the specific details published in our implementer notes 
  • The Open Formula specification is not yet a standard, so we do not support it in its unfinished state, but we will look closely at Open Formula when it becomes a standard and make a decision then about how to best proceed

Before I get into the details, I think it’s worth noting that there’s nothing new here.  The challenges of ODF’s lack of formula specification have been around for a long time, and many people saw the current situation coming.

Nearly three years ago, Stephen McGibbon had a good post covering the situation, which is worthwhile reading for some background on how we got to where we are today.  As you can see from the quotes in Stephen’s blog post, many people in the ODF community – and the broader standards community – were dismayed at the decision to not include formula syntax in ODF.  Others outside of Microsoft also pointed out the problem years ago.  Rob’s blog post, as well as this post, are excellent examples of the sorts of interoperability challenges that those people saw coming as a result of the decision to not include formula syntax in the ODF standard.

Testing Methodology

The first thing I did upon reading Rob’s post was to try to follow his steps for myself, so that I could understand the context of his findings.  Here’s the methodology he describes in his post:

The test scenario I used was a simple wedding planner for a fictional user, Maya, who is getting married on August 15th. She wants to track how many days are left until her wedding, as well as track a simple ledger of wedding-related expenses. Nothing complicated here. I created this spreadsheet from scratch in each of the editors, by performing the following steps:

  • Enter the title in A1 "May's Wedding Planner" and increased font size to 14 point.
  • Enter formula = TODAY() in B3 and set US style MM/DD/YY date format/
  • Enter the date of the wedding as a constant in cell B4, also setting date format.
  • Added simple calculations on cells B6-B8, to calculate days, weeks and months until the wedding.
  • A11 through E16 is a simple ledger of the kind that is done thousands of times a day by spreadsheet users everywhere. Once you have the formula set up in column E (Balance = previous balance + credits - debits) then you can simply copy down the formula to the new row for each new entry

Sounds simple enough.  So I fired up Open Office 3.0.1, and followed those steps.  The resulting spreadsheet looked like this:

image

Then I saved the document, by clicking File/Save and then typing in a filename:

image

So far so good.  Next step, I tried opening the document in IBM Lotus Symphony version 1.2.0.  Here’s what I saw:

image

And then I opened the same document in Excel 2007 SP2, and here’s what I saw:

image

This is a great example of a common ODF spreadsheet interoperability challenge, and two different ways of dealing with it.  The challenge is caused by the fact that Open Office writes formulas in a syntax that is unknown to Symphony and Excel.  Open Office, for reasons I don’t understand, has decided to use as their default formula syntax the unfinished Open Formula specification, which is neither approved nor published by OASIS – not even out for public review yet.

So what do Symphony and Excel do about this challenge?  The answer is that Symphony preserves the (unrecognized)  formula markup, and Excel preserves the cached values.  (A quick aside for those who don’t know: spreadsheets typically store both the formula and the value resulting from the most recent recalculation.)

Getting back to Rob’s initial premise of this being a typical wedding-planning exercise, if Maya were to send this spreadsheet to a person using Microsoft Excel SP2, that person would see the values as shown above.  They’d know at a glance what day was ‘today’ when Maya made the file, and that the ledger balance that day was 5500.

But if Maya were to send this spreadsheet to a person running IBM Lotus Symphony, they’d see only the formulas.  Perhaps an ODF markup expert like Rob would be able to massage that spreadsheet into something usable, but most people would find it a bit hard to conceive of what it means for a wedding to be “of:=B4-B3” days away, or for there to be a ledger balance of “of:=E15+C16-D16” dollars.

So what does Rob’s test matrix show for these two scenarios?  Oddly, it labels Open Office to IBM Lotus Symphony interoperability for this scenario as “OK” and it labels Open Office to Microsoft Office SP2 interoperability as “Fail” (with a red background for added emphasis).  Now, I know Rob works for IBM and probably wants to portray Symphony in the best possible light, but is that a reasonable assessment of the interoperability we’ve just seen above?

After further investigation, though, I think I see what Rob may have actually done to get the result in his table.  He seems to have included some steps that aren’t documented in his blog post:

  • In Open Office Calc, he went into Tools, Options, Load/Save, General.
  • For “ODF format version” he changed the setting from “1.2 (recommended")” to “1.0/1.1 (OpenOffice.org 2.x)”
  • The dialog then warned him that “Not using ODF 1.2 may cause information to be lost.”
  • He clicked OK to save the change.

Here’s how it looks, for those who don’t have Open Office handy:

image

After following these steps, Rob was then able to create a spreadsheet that stores formulas in the undocumented non-standardized syntax that was used by old versions of Open Office.  Symphony, being simply a fork of an older version of the Open Office code base, is able to understand those formulas, so it can load both the values and the formulas themselves.

It’s worth noting what the OpenOffice.org developers have to say about this option that Rob has apparently used for his interoperability testing:

image

So it sounds like there isn’t any single version of ODF that will provide compatibility across all versions of Open Office and Symphony.   You can use the “may cause information to be lost” option if you want to do a demonstration of formula interoperability with ODF, but if you want to demonstrate text-bullet interoperability, you may need to use another option.

And what does Excel 2007 SP2 do with the document saved in this alternative format?  Exactly the same thing it does with Open Office’s current default format: it displays the data, so that the document user can see the results of the last recalculation of the spreadsheet, and it ignores the formulas that are written in a non-standardized syntax that Excel doesn’t support.  I think that’s a pretty reasonable approach, when a spreadsheet application comes across non-standardized formula syntax: show the last recalculated result, thus preserving the data, and don’t try to guess at the semantics of undocumented formula markup.

Why Doesn’t Office 2007 SP2 support Open Office formula syntax?

That’s a logical question to ask, in regard to how SP2 handles formulas.  To answer it accurately and completely, we should distinguish between the two formula syntaxes that Open Office uses.

The first is the syntax they use in their non-recommended “1.0/1.1 (OpenOffice.org 2.x)” setting.  This is an undocumented, deprecated syntax, and therefore not a reliable mechanism for formula interoperability.  Despite what you may read on some blogs, it is not the same syntax as used by Excel 97/2000/2003.  Open Office copied quite a bit of the feature set from Excel, and there are definitely similarities in the formula syntax, but there are also differences with regard to referencing, operators, data types, and function arguments.

The other formula syntax that Open Office supports is the Open Formula syntax, which will eventually appear in ODF 1.2.  This syntax has not yet been approved by a standards body, nor has it undergone the 60-day public review period that OASIS requires prior to approval and publication.  It may go to public review soon, but it won’t be a standard until later this year at the soonest, and the details may change as a result of the remaining TC work and the public review process.  (According to recent discussions in the ODF TC, we may send the other parts of ODF 1.2 out for public review first, to allow more time to finish up Open Formula.)

In Office’s implementation, we haven’t chosen to support the draft Open Formula spec (as Open Office currently does), because we have certain obligations when we ship software that don’t apply to open-source projects like Open Office.  We need to test and verify behavior to a degree that’s not possible without final, fixed documentation that is believed to be 100% complete and accurate.  When Open Formula is completed, standardized, and published, we'll be looking at that as the future path for enabling formula interoperability in ODF spreadsheets.  But we’re not there yet; ODF 1.2 is not done, and not even ready for public review.

It’s interesting to note that we have discussed this very issue at a DII workshop.  Last July, we had a workshop in Redmond, with attendees including other ODF implementers, members of the ODF TC, standards professionals, and others.  In the roundtable discussions, I brought up our approach to formula support as outlined above, and asked for feedback.  Although it was not 100% unanimous, there was clear consensus among most of the participants in the discussion that they did not want us to implement a non-standard formula syntax in anticipation of it becoming a standard.  “Putting the cart before the horse” in that manner was seen by many as a possible source of future interoperability problems, rather than a solution to them, and we took that feedback into consideration.

As I’ve covered before, we feel that thorough documentation of implementation details is a cornerstone of document format interoperability.  We’ve published detailed implementer notes for our ODF implementation, and on the matter of formulas (which are stored in the table:formula element), here’s what our implementer notes have to say:

The standard defines the attribute table:formula, contained within the element <table:table-cell>, contained within the parent element <office:spreadsheet \ table:table-row>

This attribute is supported in core Excel 2007.

1. When saving the table:formula attribute, Excel 2007 precedes its formula syntax with the "msoxl" namespace.

2. When loading the attribute table:formula, Excel 2007 first looks at the namespace. If the namespace is “msoxl”, Excel 2007 will load the value of table:formula as a formula in Excel.

3. When loading the table:formula attribute, if the namespace is missing or unknown, the table:formula attribute is not loaded, and the value “office:value” is used instead. If the result of the formula is an error, Excel 2007 loads the <text:p> element and maps the element to an Error data type. Error data types that Excel 2007 does not support are mapped to #VALUE!

And, as both Rob’s tests and mine show, that is exactly what Excel does.  It would be great if there were a place implementers could go to see these sorts of details for all major ODF implementations.

Summary

I’m glad to see this sort of public scrutiny of the details of ODF interoperability and how the underlying challenges are handled by various implementations.  As you can see, spreadsheet interoperability is a complicated topic, and in the specific case of ODF spreadsheets, there is even more complexity created by the lack of a defined formula syntax in any published version of ODF.

The good news, when it comes to formulas, is that the Open Formula specification will address this area soon.  My colleague Eric Patterson represents the Excel team in the Open Formula SC, and the very capable David Wheeler leads that group.  Much good work has been done already, and we look forward to seeing the final Open Formula spec go out for public review and then approval by OASIS.  The nearly 400 pages of formula syntax documentation in ISO/IEC  29500 (Part 1, section 18.17) enables reliable formula interoperability in the Open XML community, and soon the ODF community will have a similar level of formula interoperability.

But formulas are not the only ODF interoperability challenge.  As members of the ODF TC and also the OIC (ODF Interoperability and Conformance) TC, both Rob and I – and many others – will need to work together to enable better interoperability in areas including tracked changes, mail merge, application settings, and others.  Will ODF 1.2 be the most interoperable version of ODF yet?  I hope so, and there are signs that it will be.  But our work is not nearly done.