Friday thoughts

I just had a couple random things to mention this afternoon:

  1. ARCast – Office 2007 Open XML Format (Part 1 of 2) – Doug Mahugh and I did a live webcast last month with Ron Jacobs. Ron just recently posted an audio recording of the first half of the discussion. Not sure when he’ll post the second half…
  2. Convert SpreadsheetML into generic XML – There’s a new article up that shows how you can convert SpreadsheetML into generic XML using XSLT, and then bind that XML data source to an ASP.NET data grid (which you can then display in the browser).
  3. Performance of XML file formats – I saw these two posts from IBM’s Rob Weir discussing different issues around XML file format performance (Celerity of verbosity and Why is OOXML slow?). I love the Fox News style of that second post’s title BTW (“Democrats want to destroy your family?”)… just kidding Rob <g/>. I have to admit that the first post and the second seem a bit contradictory. The first post says that the size of the XML file doesn’t affect parse times, and then the second one says that WordprocessingML is slower to parse than ODF because of the larger file size. I admit I haven’t had a chance to drill deeper, so I’m sure there is more to it than that (Rob’s a performance architect, so I doubt he would miss that).
    I had a post a number of months ago around tag size being an issue in XML parsing times, and I still hold to that. I’ll actually try to pull together some numbers to back that up as it sounds like Rob disagrees. There are of course a number of other factors that play a much more important role in the structure of the file format besides tag size (I even mentioned in my original post that tag size itself was a small factor, but still significant enough that we made the decision to use terse tag names on any structure that is likely to repeat often throughout the file). The other issue is that in Rob’s experiments he is focusing on WordprocessingML rather than SpreadsheetML. Spreadsheets really are the bigger item in this discussion, as they can have hundreds of millions of XML tags in a single file. In a large wordprocessing document, it’s really the text content itself that makes up a lot of the file, and there aren’t nearly as many XML tags.
  4. History of RichEdit – Murray Sargent has a couple of great posts discussing RichEdit and his history with the team (he’s been working on it since 1994). The first post discusses the different versions of RichEdit. The second post goes into more of the history behind the project and talks about how third parties can leverage it.
  5. Standard at Microsoft – Jason Matusow had a couple of interesting posts this week. The first post gives more information on the OSP, which is a new approach we recently took towards making various formats freely available to developers. The second discusses some of his thinking around interoperability based on his latest trip out to Brussels. There are some really important points here around what interoperability really means, and what the most effective ways of building interoperable applications are. Custom-defined schema support, for example, is extremely important for allowing Office documents to easily interact with backend data.
  6. Calorie-burning drink from Coca-Cola – Not to sound completely random, but I just saw this and it instantly brought back memories of that old Jim Carrey SNL bit about a hardcore diet approach. “Ride the Snake”
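
A quick way to poke at the tag-size question from item 3 is a micro-benchmark along the lines of the one below. It's only a minimal sketch in Python, not a measurement of any real Office or ODF file: it builds two spreadsheet-like documents with the same structure and the same cell values, one with terse tag names and one with longer ones, and times a full parse of each with the standard library's ElementTree. The tag names (`r`, `c`, `table-row`, `table-cell`), the `sheet` root element, and the row and column counts are all made up for illustration, and results will vary by parser and machine.

```python
import time
import xml.etree.ElementTree as ET


def build_sheet(row_tag, cell_tag, rows=2000, cols=20):
    """Build a spreadsheet-like XML string: `rows` rows of `cols` cells each."""
    parts = ["<sheet>"]
    for r in range(rows):
        parts.append(f"<{row_tag}>")
        for c in range(cols):
            # Same cell values in both documents; only the tag names differ.
            parts.append(f"<{cell_tag}>{r * cols + c}</{cell_tag}>")
        parts.append(f"</{row_tag}>")
    parts.append("</sheet>")
    return "".join(parts)


def count_elements(xml_text):
    """Parse the document and return how many elements it contains."""
    return sum(1 for _ in ET.fromstring(xml_text).iter())


def time_parse(xml_text, repeats=5):
    """Return the best wall-clock time (seconds) over `repeats` full parses."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        ET.fromstring(xml_text)
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical tag names: identical structure, different tag lengths.
terse = build_sheet("r", "c")
verbose = build_sheet("table-row", "table-cell")

if __name__ == "__main__":
    for label, doc in (("terse", terse), ("verbose", verbose)):
        print(f"{label:8s} {len(doc):>9d} bytes  "
              f"{count_elements(doc):>6d} elements  "
              f"{time_parse(doc) * 1000:.2f} ms")
```

Because both documents have exactly the same number of elements and the same text content, any timing gap comes from the tag names alone. Raising `rows` and `cols` moves the test toward the large-spreadsheet case, where tags dominate the file.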

Have a great weekend everyone.


Comments (6)

  1. hAl says:

    The objection I have against Rob Weir’s performance test is that he starts by converting .doc files into ODF and OOXML via OOo.

    That means the files no longer contain the same data. In general, a file loses functionality in such a conversion. It is a weird way to start testing, as it immediately screws up any following test result. Jeff Reliffe did something similar in a test on his blog last month, but the other way around.

    It surprises me that you say Rob is a performance architect as then he should know better.

  2. says:

    Hi hAl, the first thing I’ll pull together some simple numbers on will just be the effect of tag size on parsing a file. It’s really surprising to me that anyone would claim otherwise, so I’ll grab some data to back it up. As you’ve noticed, Rob’s "proof" that tag size doesn’t matter was with a rather basic wordprocessing document, rather than something that’s much more tag intensive like a large spreadsheet.

    I was just looking at Rob’s bio on his blog and it said that he’s a performance architect for IBM. So either I’m missing something in his posts (which could definitely be the case), or he should know better…


  3. The gist of those two blog posts about XML performance is that

    1. For the common case, tag name length does not matter.

    2. For the common case, the number of tags as well as the number of XML files that have to be parsed matter. The structure of the XML might be a factor too, but that is harder to analyze.

    – Stephan

  4. hAl says:


    That gist is completely lost if you start by converting .doc files to ODF and OOXML. The conversion to ODF can lose more information and make the document severely simpler, both in the total number of tags and in the number of different tags.

    It is very likely, for instance, that some revision information or special formatting is lost in that conversion. The conversion to OOXML should hardly lose any information, as the format was created to be compatible with legacy features.

    I hope someone will do a more competent and, most of all, independent review of office format performance, as the figures provided by either IBM or Microsoft are not likely to provide much proof for anything but a preference for their point of view on ODF or OOXML.

  5. Stephane Rodriguez says:

    I once wrote a small utility to introspect arbitrary XML structures and measure what they’re made of. It’s available here:

    Here is an example XML document, and the corresponding report produced by the tool.

    Let me know if that helps,

  6. @hAl:

    That of course might be true, even though the speed differences seem too huge for that (suggesting some more fundamental problem). But it is easy enough to test: measure it again after converting the ODF back to OOXML. That OOXML document can have at most as much information as the ODF. If ODF still wins, we can eliminate differing breadth of information inside the documents as the cause.

    – Stephan