Table models in file formats


In my post last week about the lack of table support in ODF, some folks were curious as to why the Ecma Open XML formats have three different table models. I explained that when you are designing a file format, you need to examine closely the target user scenarios of the applications that will use those formats.


Obviously the use cases for a table in a spreadsheet are different from those around tables in presentations or wordprocessing documents. For instance, it’s not much of a stretch to imagine a table of data in a spreadsheet with 50,000 rows and 200 columns. That would never happen in a presentation though. A table in a presentation is much more heavily focused on the layout and formatting of the data (same case with a wordprocessing document).


I think a great example of why you often need different table models would be the ODF spec itself. It was too difficult to map their existing table model to the presentation format so instead of working through that issue they instead left it out of the spec. Otherwise the spec would have just stated that the table model applied to all three formats. As it stands right now, the only way to get a table in a presentation is to embed a spreadsheet. The plan is that in V 1.2 (which is still over a year out) they will have support for spreadsheet formulas and presentation tables. One could argue over whether it would have been better to actually finish the spec before submitting it to ISO and creating organizations like the ODF Alliance who purpose is to push for policies that mandate ODF, but maybe that’s just me ;-).


Looking at the table models, I do think the ODF guys made a big mistake in the design of their spreadsheet format. They chose to make the table model for wordprocessing documents and spreadsheet documents the exact same (but it looks like it’s still different from HTML or CALS). Now I do understand that this level of commonality is the nirvana for most folks and I also had the goal of making the Open XML formats as close to this ideal as possible (it’s something we actually looked really hard at doing when we first started working on the Open XML formats). The problem is that for the same reason you often need different user interfaces in the different applications, you also need a different file format at times. Sure there are plenty of concepts that can be reused (such as basic formatting), but a spreadsheet grid is different from a presentation table. Otherwise you’re stuck with a format that sells everyone a bit short as it’s the greatest common factor of all the applications, and isn’t optimized for the unique customer scenarios.


If you’ve looked at the Ecma spec, you can see that we had to diverge in the table design of for the Office Open XML formats. The use of tables in wordprocessing documents and presentations is very similar, and as a result the table models in those formats are very similar. In a spreadsheet though, you have to account for much larger sets of data, and at that point the efficiency with which you write out that information can have a much more significant impact in the amount of time it takes to actually parse the files.


So, is the spreadsheetML format super easy? Well that depends on who you ask. For people that have developed against the old binary formats, things will be unbelievably easier and more reliable. But for folks who’ve primarily used table models like HTML, there will be a bit of a learning curve. That’s why the file format documentation that we’re doing in Ecma is so important. It will empower anyone to program against these files. We could have gone with a more verbose simple table model, but that would have been at the detriment of every user out there. Most people don’t care about their file format, they just want things to work. As I said in an earlier post, we had to take the training wheels off, but we’re going to be there with you as you learn to ride on your own.


-Brian

Comments (24)

  1. Mike says:

    "Looking at the table models, I do think the ODF guys made a big mistake in the design of their spreadsheet format but that’s because they had different design goals. They chose to make the table model for wordprocessing documents and spreadsheet documents the exact same"

    You speak as if there were any table model in BIFF. There isn’t. You’ve added a "list"model in Office 2003 which is now a full fledge "Table" in 2007, but it’s disingenuous to say what you said the way you said it, since billions of BIFF users have obviously managed their way through without a table model so far.

  2. Biff says:

    Mike, of course there is a table model in BIFF, it simply omits all the junk that XML formats, ODF in particular, insist on dragging around in every single file. There is no reason to describe the structure of the spreadsheet’s grid – it is fixed (or more accurately was fixed up until Office 2007).

  3. Gabe says:

    As far as I can tell, the table model in BIFF is the worksheet. It sounds like the ODF guys just said "a spreadsheet is a table, so every time somebody needs a table just make it a spreadsheet" or maybe "everytime somebody needs a spreadsheet just make it a table".

    Brian makes a good point that ODF obviously does not have a very good format for representing spreadsheets. It does a perfectly good job of representing tables that can be used as spreadsheets, but large spreadsheets used by financial analysts and the like are just not well-supported.

    If the format is not optimal and requires long (i.e. minutes) save and load times, people won’t use it unless they are forced to. If people won’t use the format, why create it? MS is trying to create formats that people will want to use.

  4. Alex says:

    I love how these mini-ODF dramas are turned into crises.

    Your argument is "ZOMG they embed spreadsheets in presentations instead of creating new markup"? I really don’t get this.

    What difference does the user see? None. Does the user care whether or not the file format has a special language for these special presentation tables? I doubt it.

    This is just nuts. You say that users will care about the format because "Otherwise you’re stuck with a format that sells everyone a bit short as it’s the greatest common factor of all the applications, and isn’t optimized for the unique customer scenarios". However slow ODF spreadsheets could be, I very much doubt they require "optimization" for inclusion into Presentations. Your logical leap from "spreadsheets aren’t optimized enough for use as tables" to "ODF’s spreadsheet format is based on tables, and therefore not optimized enough for use as a spreadsheet" is, frankly, hilarious.

    If you want to have another shot at ODF’s spreadsheet format; sure, go ahead. Dressing up an area where ODF is improving as some missing feature crisis is spectacularly funny, but hardly realistic.

  5. Rick says:

    I would argue that a table and a worksheet are not the same beast, so having to deal with two models is of little concern.  Having two types of tables bothers me, but I will let this go to those who know better why it had to be.

    I am bothered by what I read, however.  I had thought that we were moving toward application units (maybe read services?) that did what they did very well and didn’t try to do all things.  By putting these units together, you could build what you needed, not what someone decided to allow.  If I need a spreadsheet in the middle of a document, then I get it.  I am not hampered by a word processing table model which isn’t complete enough.  I get a full spreadsheet.

    If this apporach is taken, there is no lowest common denominator for someone with all of the units installed.  But what if they do not have the units they need, or the units come from different vendors.  This is the case for completeness and fidelity in the document format.  The format can have everything in it that the application unit can describe.  If another unit can express more, then it does, but still within the published format.  This is the beauty of XML – it’s extensible.

    If my unit doesn’t know how to do everything that has been captured in the document format, it ignores what it can’t do and does the best with what it can.  If it has to compromise what is there, it warns the user and gives them the choice of continuing.  And if this happens too often, they will use a better application unit for that function.

    Why do we still view things so monolithically?  I like the idea of competition for the service provided.  Go for it.  But I want the underlying document format to allow me to change vendors, upgrade to better/higher function, pick what’s important to me.

  6. ck says:

    I have one off topic question I would like to know your answer on. Our company is starting to get .docx documents more and more from outside, and we are wondering if MS is recommending the deployment of the beta compatibility pack for office2003. I don’t see an easy way to roll it out in any case. I think MS should have waited on the .docx default till after the beta, when the compatibility packs were also out of beta and ready for originations to roll out.

  7. Biff says:

    Alex, Rick, thank you for inventing OLE 🙂 As you might have noticed (if you were not too busy inventing) embedding does not work very well. Go on and insert a table in Impress and check if you got full-fidelity editing experience from this "application unit" model. Heck, go on and insert equation in Word and pray tell you enjoyed the editing experience there as well.

    Sorry to rain on your parade, but the industry learned that “use application units everywhere” and “screw optimization for the sake of generality” do not work – the users just can not care less when their convenience and productivity are at stake.

  8. Oscar says:

    Brian,

    I woud argue that if anyone is using Excel to analyse spreadsheets with 50,000 rows and 200 columns they need to be sent for re training. There is plenty of documentation about spreadsheet errors that have resulted in mis-statement of financial results. There are tools more suited to the task than a spreadsheet.

  9. Todd Knarr says:

    For the argument that different models for word-processing tables and spreadsheets are a good thing, I’d point to Microsoft’s own creations: OLE and COM. Apparently they’ve found sufficient need for spreadsheets in word-processing documents to create ways of putting full-fledged spreadsheets inside word-processing documents. IMHO it’s much easier to do that when the word processing document has native support for spreadsheets in it, and if spreadsheets are simply an extension of tables it becomes a lot easier since all the support work’s there already. It also makes the format degrade gracefully: opening such a document with an embedded spreadsheet in a program that doesn’t support full-blown spreadsheets simply causes the spreadsheet to be displayed as an ordinary table with it’s last computed values in the cells. This completely avoids the problem with embedding of how to display embedded objects when the program components to handle them aren’t installed.

    As for ODF spreadsheet load performance, after playing with the samples over the weekend I’ve concluded it’s not a problem with the ODF format, or the parts of the program that work with the internal structure. OO.o Calc handles even large spreadsheets quickly once they’re loaded. It’s only loading (and to a lesser extent saving) that bogs down. It looks to me like OO.o’s using a DOM-based XML parsing strategy: read the entire file in and parse into an XML tree, then walk the XML tree to create the internal data structures. Writing involves the reverse operation. The code for this is simple, robust and able to handle non-understood XML in the document (a desirable trait). It does, however, break down on large documents, and requires extra memory to hold the original XML tree in addition to the internal representation. The samples used to demonstrate the performance difference, for example, are 16 worksheets, each worksheet being 16,200+ rows long. When I strip the example down to a single worksheet, or 16 worksheets each about 1000 rows, load speed improves to nearly that of Excel. I can think of a SAX-based (callback-based) parsing strategy that’d build the internal structure as the XML is read in and parsed, eliminating the memory used by the XML tree and the time to build and walk that tree. I suspect Excel uses that kind of strategy, taking advantage of the fact that non-understood XML would be either discardable or an error. I’d need to put a lot of thought into how to handle non-understood XML to preserve not only the content but it’s placement in the document, though, and the code involved would be non-trivial.

    In short, the ODF load/save performance issues appear more related to the method the program uses to read and write XML than to the ODF format itself. It doesn’t need any particular changes to the ODF XML structure, merely a change in the way the XML is read/created to be better-behaved when dealing with extremely large documents.

  10. Patrick Schmid says:

    Todd: What happens to performance if you increase spreadsheet size drastically? For example, take it to the 65536×256 grid size that was the upper limit for pre-2007 Excel versions? And then, take it further till you reach the 1048576×16384 grid limit of Excel 2007.

    I would be highly interested in seeing a graph of performance vs. spreadsheet size for that.

  11. BrianJones says:

    Also note that I’m specifically talking about load and save time, not the performance of the application once the file is loaded.

    OpenOffice is obviously slow in those cases as well, but in most cases that isn’t tied to the file format itself. If you look at the design of the two formats, you’ll see that SpreadsheetML can be loaded and saved much faster than ODF.

    Also, while I’ve been refering to the "table model" of spreadsheets, I really do mean the entire spreadsheet. If you look at an ODF spreadsheet, the entire sheet is wrapped in a <table> tag.

    -Brian

  12. Molly C says:

    First, embedded OLE objects have their place, but an embedded spreadsheet isn’t the same as native table functionality.  Embedded spreadsheets are useful if you have a need to activate the spreadsheet object in order to work with the spreadsheet functionality of the spreadsheet app.  But embedded spreadsheets are not necessarily the best way to present a table of data.

    Second, advocating that word processors include native spreadsheet functionality (as someone seemed to above) is whacky.  Do you really want word processors to have to support pivot tables and the like?

    Lastly, tables are not the same as spreadsheets.  Spreadsheets obviously have more calculation features than tables.  But more than that, tables (at least in Microsoft Office) don’t necessarily have the same shape as spreadsheets.  In Word, you can use the free form table drawing tool to create a table that looks nothing like a spreadsheet grid (with diagonal grid lines, gridlines that don’t go all the way across or down the table, nested tables, etc).

    Forcing the same format for tables and spreadsheets is a misguided attempt at elegance.

  13. Why are you talking? says:

    After seeing OpenOffice 2.0.3’s implementation of the ODF format versus Microsoft Office (beta 2)’s plugin’s implementation of the ODF format here: http://www.robweir.com/blog/, I’m left wondering why anyone associated with Microsoft can carry any authority about how to implement the ODF format–particularly with regard to tables, which in the examples I’ve seen, are pretty thoroughly munged in Microsoft Office (beta 2).

  14. Alex says:

    OLE is irrelevent in this discussion.

    Biff would like us all to think that because the table data is specified in the same way, that must mean that ODF-compliant apps have to embed the spreadsheet into the presentation app in order to make it work.

    And, while discussion of the Ecma spec. is all well and good, remember that ODF is the only spec. here that’s actually been, y’know, standardised. Given ODF is already out, and being implemented, it really would be a sorry tale for the Ecma guys if they couldn’t improve on it. Well, a sorrier tale than people not contributing to the multi-vendor TC at OASIS, anyway.

  15. OK, forgive the random Sneaker Pimps reference and I promise we will move off this topic of ODF politics…

  16. BrianJones says:

    Alex, table data isn’t defined *at all* in terms of a presentation right now. The only way to get a table in a presentation for ODF is to embed a spreadsheet. So Biff’s statments are very accurate. The ODF newsgroups say that they are going to work to get tables in for version 1.2 which is over a  year away.

    Your second point really hits on my big concern at this point. While ODF has gone through standardization, it was done so prematurely. It’s missing a number of significant pieces of functionality, and it’s clear if you look at the discussions that they pushed it through for political reasons.

    -Brian

  17. Brutus says:

    Yes, ODF standardization was prematurely pushed thru for political reasons, and it’s sad that ISO allowed itself to be used like that.

  18. yreih says:

    I love how these mini-ODF dramas are turned into crises.

    <a href="http://www.b086.com/blog2/1251/&quot; title="竹地板">竹地板</a>

  19. yreih says:

    I love how these mini-ODF dramas are turned into crises.

    <a href="http://www.b086.com/blog2/1251/&quot; title="竹地板">竹地板</a>

  20. Mike says:

    I find it intriguing Mr Jones that while you said a while back you have had "only covered 1% of what you had to cover" you still fancy yourself with more spin, obviously spending a lot of time reading, analyzing, commenting.

    I just don’t get what the point of this blog is. Can you clarify what is the actual audience? Are you writing all of this just in hope that some government people read it?

    Also, I’d like to reflect on one of your broken statements. You’ve said "locked binary file formats". I don’t know how this can be true. Let’s take Excel. If the binary file format is locked down, how did the Excel team managed to add features to it over time? BIFF is extensible, the OLE container mechanism is extensible. So I don’t know what statements like binary file formats being "locked down", obviously wrong, buy you.

    Also, would you mind telling what the .XLSB binary workbook introduced in Excel 2007 buys you in your XML world?

    I am interested on your thoughts between a true XML programming model, and just binary surrounded by angle brackets (which is what the XML parts of the new file formats are, obvious to any serious reader). Don’t you think you are missing a big opportunity here?

    While you talk a lot about accessing the internal of the file formats using some library, I don’t think it’s the true goal. It’s too hard, and frankly insane to manipulate all the internal indexes and references. Isn’t this all just a big smoking wall to push something else, like servers instead?