1 + 2 = 1?

Does 1 plus 2 equal 3?   After last week’s sometimes acrimonious discussion about formulas in ODF, you may be glad to hear that IBM and Microsoft appear to agree on that answer to this simple question.  But OpenOffice.org is not so certain – maybe the answer is just 1 sometimes – and the question itself turns out not to be so simple after all.  Let me explain.

The State of ODF Formula Interoperability Today

What is the current reality of ODF formula interoperability?  Understanding the status of the ODF ecosystem will help clarify the set of issues and options that we faced when making the tough decisions we had to make about how to best support formulas in ODF spreadsheets.

For this example, I’ll use the latest released versions of two well-known ODF implementations: IBM Lotus Symphony (version 1.2, download here) and OpenOffice.org (version 3.1, download here).  I want to talk about current reality, so I’m not using any outdated versions of software (the OO build I’m using, for example, was released in the last week).  I also stayed away from unreleased or private beta versions that might become available sometime in the future, and I used the default settings for each application.

First, I fired up Symphony 1.2, and followed these steps:

  • Enter a numeric value of 1 in cell A1.
  • Format cell A2 as text, right-justified, then enter a 2 in that cell.
  • In cell A3, enter the formula =A1+A2.

In Symphony 1.2, here’s what I see:

image

After saving this spreadsheet as an ODS file, I open it in OpenOffice.org 3.1 and see this:

image

Clearly this is a problem.  The exact same data, in the exact same spreadsheet, when operated on with the exact same formula, provides different results.

Some might be tempted to say that formatting a cell as text and then using it in a calculation is dumb.  And I’d agree that there are few people who ever do such a thing intentionally.  But in a large complex spreadsheet, with thousands of cells involved in complex calculations, it’s easy to make mistakes like this.  In fact, if you’ve spent any amount of time at all creating complex spreadsheets, I’ll bet that on more than one occasion you’ve wasted a bunch of time trying to debug a problem that turned out to be caused by such mistakes; I know I sure have.

Similar issues arise with boolean values – what does it mean to “sum” a column of cells that includes both numeric values and boolean values?  Not all spreadsheet implementations agree on the answer to that question, either. This can create interactions between formatting and calculating – change the format of some cells, and the totals change in your spreadsheet.  Most users find such behavior very confusing, to say the least.

One of the most interesting things I found in my testing of these two implementations was that although they write different markup for formulas, the exact same interoperability problem occurs regardless of which application is used to create the spreadsheet.

If you create the spreadsheet in Symphony 1.2, as I did, the table:table-cell element has a table:formula attribute with a value of "=[.A1]+[.A2]" .  And this formula will yield a result of 3 in Symphony and 1 in OpenOffice.org, as described above.

If instead you create the same spreadsheet in OpenOffice.org 3.1, when you open it in Symphony 1.2 you'll see of:=A1+A2 in cell A3.  But after you manually correct the formula, this spreadsheet, too, will yield a result of 3 in Symphony and 1 in OpenOffice.org.

So these two ODF implementations do not have predictable formula interoperability, regardless of where you start.  And these are not obscure implementations – they are the latest released versions of the implementations from IBM and Sun, the two companies that together chair the ODF TC.  Even if both companies released fixes tomorrow, there will still be many copies of the current non-interoperable versions of these applications in use for a long time to come.  This is the state of formula interoperability among ODF spreadsheets today.

Fixing the Problem

This difference in behavior is a well-known issue among those who work with spreadsheet formulas.  As Rob Weir said three years ago “Automatic string conversions considered dangerous. They are the GOTO statements of spreadsheets.”  (One of the ODF TC members even has that line in his email auto-signature.)

How to manage string conversions is far from the only problem with spreadsheet interoperability across vendors (and even across versions of the same product in some cases). The current draft OpenFormula specification contains 254 notes (by my count) about other issues similar to this one.

The OpenFormula sub-committee of the ODF TC has worked hard to address this.  Here is an excerpt from the draft OpenFormula specification (emphasis added):

6.2.4 Conversion to Number

If the expected type is Number, then if value is of type:

  • Number, return it.
  • Logical, return 0 if FALSE, 1 if TRUE.
  • Text: The specific conversion is implementation-defined; an application may return 0, an error value, or the results of its attempt to convert the text value to a number (and fall back to 0 or error if it fails to do so). Applications may apply VALUE() or some other function to do this conversion, should they choose to do so. Conversion depends on the actual locale the application runs in, especially if group or decimal separators are involved. Note that portable spreadsheet files cannot depend on any particular conversion, and shall avoid implicit conversions from text to number.

After OpenFormula is approved and published, this approach, with its explicitly defined concept of “portable spreadsheet files,” will allow more predictable and consistent interoperability for ODF spreadsheet users.

But in the current environment, with no standardization of formula markup across major ODF implementations, users who want to avoid interoperability problems need to stick to a very conservative strategy.  As Burton Group analyst Guy Creese said last week:

“… this in-between time (between the OpenOffice.org de facto standard and the wait for the officially approved 1.2 standard) means there isn't one way to handle this problem. The vendors would like you to believe that there is (their way), but in reality there isn't. Ultimately, this will resolve itself over time. ODF 1.2 will be approved, and there will finally be an approved standard that everyone--IBM, Microsoft, Sun (Sun/Oracle)--can follow.

Until then, if an enterprise does want to use ODF, the best strategy is to stick with one productivity suite as a way to avoid these interoperability problems. That way, even if formula support is idiosyncratic, it at least will be consistent within the enterprise.”

 How Excel 2007 SP2 Handles ODF Formulas

The question of how to handle formulas in SP2’s ODF implementation was one of the tough decisions we faced in our ODF implementation.  We had made conformance to the ODF 1.1 specification a top priority, and yet the spec doesn’t specify a formula language. 

It seemed clear to us that we couldn’t simply omit the  namespace, as the current version of Symphony does.  That would be in violation of Section 8.1.3 of the ODF specification, where it says “Every formula should begin with a namespace prefix specifying the syntax and semantics used within the formula.”

What about using the same of: namespace that OpenOffice.org 3.1 uses?  We saw a couple of pretty serious problems with that approach as well:

  • It would not be interoperable with some existing implementations, such as the widely  used current version of IBM Lotus Symphony.
  • It is based on a draft specification that has not been finalized or approved as a standard, and therefore could still change.

What about using the oooc: namespace that OpenOffice.org 3.1 writes when you choose its ODF 1.1 compatibilty mode? That syntax is on its way out for everyone, and we saw no point creating yet another new implementation of something that is clearly going to be deprecated soon.  And it doesn’t solve the problem: OpenOffice.org 3.1 writes the oooc: namespace prefix in its ODF 1.0/1.1 compatibility mode, and those spreadsheets still can yield different results in OpenOffice.org and Symphony.

After a robust internal debate on the topic, it became clear what we needed to do to apply the first two of our five prioritized guiding principles for Office’s ODF implementation:

  • Adhere to the ODF 1.1 standard
  • Be Predictable
  • Preserve User Intent
  • Preserve Editability
  • Preserve Visual Fidelity

As we discussed in several DII workshops starting back in July of 2008 (with multiple ODF implementers and multiple ODF TC members in attendance), these guiding principles are in priority order. When we could not achieve them all, we choose the top ones first.

To adhere to the ODF 1.1 standard, we begin formulas with “a namespace prefix specifying the syntax and semantics used within the formula.”  Excel 2007 SP2 uses an msoxl prefix and write the formula attribute like this:

table:formula="msoxl:=A1+A2"

That fulfills our goal of adhering to the standard since ISO/IEC 29500 defines both the syntax and semantics of this namespace.  Then, to provide a predictable user experience across all spreadsheets, we elected to support this namespace, and only this namespace.

If I move my spreadsheet from one application to another, and then discover I can’t recalculate it any longer, that is certainly disappointing.  But the behavior is predictable: nothing recalculates, and no erroneous results are created.

But what if I move my spreadsheet and everything looks fine at first, and I can recalculate my totals, but only much later do I discover that the results are completely different than the results I got in the first application?

That will most definitely not be a predictable experience.  And in actual fact, the unpredictable consequences of that sort of variation in spreadsheet behavior can be very consequential for some users.  Our customers expect and require accurate, predictable results, and so do we. That’s why we put so much time, money and effort into working through these difficult issues.

What Does Excel 2007 SP2 Do With the Example Above?

The answer is that we agree with IBM: 1 + 2 = 3.

Excel does the same thing Symphony 1.2 does, converting the text “2” to a numeric 2 and using that value in the calculation, so that the total is 3.  Excel does this because this type of automatic conversion – which has been a popular Excel feature for a very long time – is allowed by the semantics of the formula markup language Excel uses.

The formula markup that Excel uses is based on the formula language defined in ECMA-376 and ISO/IEC 29500, and here’s what it says about type conversion in Section 18.17.2.6 (Types and Values) of Part 1 of IS29500:

An implementation is permitted to provide an implicit conversion from string-constant to number. However, the rules by which such conversions take place are implementation-defined. [Example: An implementation might choose to accept "123"+10 by converting the string "123" to the number 123. Such conversions might be locale-specific in that a string-constant such as "10,56" might be converted to 10.56 in some locales, but not in others, depending on the radix point character. end example]

Excel’s approach to formulas in ODF, as well as our approach to other difficult issues, is completely public and fully documented in the implementer notes for SP2.  As the note for this issue explains:

The standard defines the attribute table:formula, contained within the element <able:table-cell>, contained within the parent element <office:spreadsheet table:table-row>

This attribute is supported in core Excel 2007. This attribute is supported in core Excel 1. When saving the Table:Formula attribute, Excel precedes its formula syntax with the "msoxl" namespace. 2. When loading the attribute Table:formula, Excel first looks at the namespace. If the namespace is "msoxl", Excel will load the value of Table:formula as a formula in Excel. 3. When loading the Table:formula attribute, if the namespace is missing or unknown, the Table:formula attribute is not loaded, and the value "Office:value" is used instead. If the result of the formula is an error, the element <text:p> will be loaded and mapped to an Error data type in Excel. Error types not supported by Excel are mapped to #VALUE!

The Question of Syntax

I’d like to also address the issue of cell reference syntax in the ODF 1.1 specification, since that was also a topic of much discussion on several blogs last week.  I’ll start with some quick background for those who don’t wallow in standards documents for a living.

The English language is inherently an ambiguous thing,  and great literature sometimes uses the ambiguity to good effect.  Words can have more than one meaning, and verb phrases might be intended to go with one noun or with another, as in famously ambiguous job references like “You will be very fortunate to get this person to work for you."

Writers of technical standards like to use rules and procedures that are designed to avoid this sort of problem.  These rules, which place requirements on the  use of words like should, shall, must and may,  tend to result in a stilted writing style which gets tedious fast, but reduces the need to agree on what is “obvious” or “implied” when interpreting the meaning of the text later.

A standards document is said to contain both normative language and informative language.   The things you must do to comply with a standard are supposed to be in the normative part,  and things like examples and introductions are informative.   So that everyone can be  sure about which parts are which,  the normative parts use specific phrases like “shall” and “shall not” to clearly label the things the standard actually requires you to do.

So the debate about Excel 2007 SP2’s cell reference syntax comes down to whether the few sentences in the ODF 1.1 spec which cover this were meant to be informative or normative.  The section of ODF 1.1  in question does not use the words shall or must.   It introduces the topic with the phrases “typically” and “can include”.   In our reading of it,  this language makes that part of the specification informative, stating no requirements for implementers.

The ODF 1.1 spec is casual about applying the rules of normative language, and as a result ODF 1.1 has more than its share of ambiguity.  The ODF 1.2 draft, however, is already much improved in this regard, mainly through the great work of ODF editor Patrick Durusau.  The OpenForumla draft specification is extremely careful in its use of normative language, and that will help implementers a great deal when they sit down to write their software.

When Will Office Support OpenFormula?

This question has come up on some blogs, so I’d like to address it here as well.

The real question is “when will Office support ODF 1.2,” since OpenFormula is simply a part of the ODF 1.2 specification.  And the answer is that we don’t know yet, because nobody knows yet when ODF 1.2 will be published as an OASIS or ISO standard.  As I said in the previous post, “we will look closely at Open Formula when it becomes a standard and make a decision then about how to best proceed.”  (It looks like IBM has committed to supporting ODF 1.2 and OpenFormula in late 2010.)

In the meantime, if you want to use Excel 2007 SP2 to edit documents that contain formulas from OpenOffice.org or Symphony, and preserve those formulas through editing sessions, and you understand the risk that the results might not be the same, you have a couple of free options.

The Open XML / ODF Translator Add-Ins for Office can be used with Office 2007 SP2, and as covered on the translator team blog, supports a variety of formula namespaces.

The Sun ODF Plugin provides yet another option, and apparently works with SP2.