Open XML links for 10-11-2007

Andy Updegrove’s “Meanwhile, Back in Minnesota: Your Chance to Help” provides information about how to provide feedback on document formats legislation to the state of Minnesota. The deadline for feedback is next Monday, so if you’d like to participate in the process now’s the time to do it. You might want to share your views on whether choice is a good thing or not, whether governments should mandate specific formats rather than general guidelines, and related topics. The Massachusetts ETRM is a good example of one state’s approach in this area. The state of Texas also did a study entitled “Estimated Two-year Net Impact to General Revenue Related Funds” that sheds some light on the costs involved in mandating document formats.

Wouter van Vugt’s “Extracting data from a xml-mapped document” includes a handy XSLT for extracting the custom XML data from WordprocessingML. I’ve covered before how custom markup works in WordprocessingML, and although it has some unique benefits, one downside relative to custom XML parts is that the business data is interspersed with the Open XML markup. Wouter’s sample transformation helps to simplify that messy detail.
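To make the "interspersed" point concrete, here is a minimal Python sketch of the same idea. It is not Wouter's XSLT, only an illustration under stated assumptions: the custom markup is stored as w:customXml wrappers whose w:element attribute names the business element, and the sample document, the invoice vocabulary, and the helper names are all hypothetical. The w:uri namespace is ignored to keep it short.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical sample: business data (invoice/customer/total) wrapped
# around ordinary WordprocessingML paragraphs and runs.
SAMPLE = f"""
<w:document xmlns:w="{W}">
  <w:body>
    <w:customXml w:uri="urn:example:invoice" w:element="invoice">
      <w:p>
        <w:customXml w:element="customer">
          <w:r><w:t>Contoso</w:t></w:r>
        </w:customXml>
      </w:p>
      <w:p>
        <w:customXml w:element="total">
          <w:r><w:t>42</w:t></w:r><w:r><w:t>.50</w:t></w:r>
        </w:customXml>
      </w:p>
    </w:customXml>
  </w:body>
</w:document>
"""

def _append_text(elem, text):
    """Add text to elem, after its last child if it already has children."""
    if len(elem):
        elem[-1].tail = (elem[-1].tail or "") + text
    else:
        elem.text = (elem.text or "") + text

def extract_custom_xml(node, parent=None):
    """Return the top-level business elements found under node.

    Each w:customXml wrapper becomes a plain element named by its
    w:element attribute; w:t text inside a wrapper is kept, and all
    other WordprocessingML markup is dropped.
    """
    roots = []
    for child in node:
        if child.tag == f"{{{W}}}customXml":
            name = child.get(f"{{{W}}}element")
            if parent is None:
                elem = ET.Element(name)
                roots.append(elem)
            else:
                elem = ET.SubElement(parent, name)
            extract_custom_xml(child, elem)
        elif child.tag == f"{{{W}}}t" and parent is not None:
            _append_text(parent, child.text or "")
        else:
            roots.extend(extract_custom_xml(child, parent))
    return roots

roots = extract_custom_xml(ET.fromstring(SAMPLE))
print(ET.tostring(roots[0], encoding="unicode"))
```

On this sample the paragraphs and runs fall away and only the invoice tree with its two child elements remains, which is the simplification the transformation is after.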

By the way, if you’re interested in creating documents with custom schemas attached using Microsoft Word (no programming involved), MSDN has a how-to article entitled “Create an XML document based on a custom Schema” that takes you through the steps involved.

Guy Creese has done some informal “IBM Lotus Symphony Performance Tests” to assess the performance of IBM’s recently announced open-source suite …

I put up a post last week about IBM’s new Lotus Notes Symphony office software suite, saying that based on an article in PC World, it seemed to be sloow in loading and a significant consumer of system resources. In short, the free software had some hidden costs. Shazaam, I got a ping from IBM Analyst Relations along the lines of, mmm, a few facts are not correct and how about a briefing on the product?

Fair enough, I thought. I’m still waiting for that to occur. But in the meantime, I figured I’d download the software and try it out myself, so I could ask some intelligent questions during the briefing. At a summary level, here’s what I found, when running the software on a Pentium 4 with 2 GB of memory:

  • On average, an IBM Lotus Notes Symphony app (Beta 1) takes three to four times as long to load as the comparable Microsoft Office 2003 product (with some significant outliers: e.g., 15 and 33 times as long).

  • An IBM Lotus Notes Symphony app (Beta 1) consumes more CPU at load time than the comparable Microsoft Office 2003 product.

  • An IBM Lotus Notes Symphony app (Beta 1) consumes three to five times more memory than the comparable Microsoft Office 2003 product.

It will be interesting to hear his thoughts after the analyst briefing.

And finally, speaking of performance, Zeth posted a comparison on Command Line Warriors this week about file sizes for a few document formats. He created a table showing the size of a document that only contains “Hello World” in a few different formats:

Format  Application               File Size (bytes)
.txt    Emacs 21.4.1                             11
.abw    Abiword 2.4.6                          2517
.odt    OpenOffice Writer 2.20                 6674
.doc    Microsoft Word 2003 SP2               24064

Just to extend this research a bit, here are the results with the two editors I use most often:

Format  Application               File Size (bytes)
.docx   Microsoft Word 2007                    9870
.txt    Notepad 6.0                              11
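If you're curious where the bytes in a .docx go, the file is a ZIP package, so its parts can be listed with any ZIP library. Here is a minimal sketch in Python; the list_parts helper is a hypothetical name, and the package it inspects is a stand-in built in memory with dummy contents, not a real Word document.

```python
import io
import zipfile

def list_parts(docx):
    """Return (part name, compressed bytes, uncompressed bytes), largest first."""
    with zipfile.ZipFile(docx) as package:
        parts = [(info.filename, info.compress_size, info.file_size)
                 for info in package.infolist()]
    return sorted(parts, key=lambda part: part[2], reverse=True)

# Stand-in package so the sketch runs without a real file. The part
# names mimic a Word package, but the contents are dummy data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
    package.writestr("word/document.xml", "<doc>Hello World</doc>")
    package.writestr("word/styles.xml", "<style/>" * 500)
buffer.seek(0)

for name, packed, unpacked in list_parts(buffer):
    print(f"{name:20} {packed:>6} {unpacked:>6}")
```

Passing a path such as list_parts("hello.docx") (a hypothetical file name) would do the same for a document saved from Word.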

Comments (12)

  1. Anna Niemous says:

    Doug, now please post file size results for documents that have some actual content, say, the text of the U.S. Constitution or something :)))

  2. dmahugh says:

    Well, Anna, I was just following the rules of that comparison page, to get an apples-to-apples comparison.  But since you asked …

    I found a copy of the text of the constitution at

    So pasting that into a Word 2007 document and saving it, I get these file sizes …

    .TXT = 45,992 bytes

    .DOC = 85,504 bytes

    .DOCX = 46,618 bytes

  3. Anna Niemous says:

    Thanks, Doug! These numbers make much more practical sense than the "Hello, World!" example :)))

  4. Dave S. says:

    The zipped version of the text format U.S. Constitution is only 14,182 bytes.

    This is 30% the size of the zipped .docx file.

    What does the other 70% of the .docx file do?

  5. dmahugh says:

    Good question, Dave.  There are several optional components that Word writes to the document, including document metadata, various application settings, section settings (page size, margins, etc.), revision IDs, and style/theme information.  The style/theme info is the biggest single portion, representing most of the size increase relative to text-only.

    All of this information was also written in the legacy binary formats, so by writing it out on save Word is assuring a consistent user experience for those who expect this information to be saved in their Office documents.

  6. Dave S. says:

    It still seems rather pudgy for a straight text file converted to a Word document. In the Constitution example, there should be only one section, maybe 2K of metadata**, one style/theme.

    How big is your zipped version of the .doc format for the document? I get 19,309 – less than half the zipped .docx format.

    Is every Word 2007 setting possible included in the 30K of overhead, rather than just the ones used?

    ** 14k was used to describe the operation of an entire country, 2k is pretty generous to put those words on paper.

  7. dmahugh says:

    If a text file includes everything you need, then I’d agree that a DOCX is overkill.  Text is great if file size is a top priority, and DOCX is great if compatibility with existing Office documents is a priority.

    As for the contents of the DOCX, it’s easy to rename it to a ZIP and open it, and everything’s XML-based and defined in the spec, so you can look up the individual elements if you’d like to understand what’s there in more detail.

    Most users experience a significant reduction in disk storage requirements when moving from DOC to DOCX.

  8. Dave S. says:

    Your apples to apples comparison is misleading – suggesting the docx file format is as compact as the original textfile. However, when accessed, the docx file is much larger.

    I’m sure storage requirements go down with the new format since it uses zip compression – most pudgy files do. I wonder when the casual user will realize they can no longer save disk space by zipping the new formats. The storage might even get larger.

    I’m not against non-text document files, just against pudgy ones, particularly those where the pudgy part is not pertinent to the content of the document.

  9. dmahugh says:

    Well, the two rows I added to that table show that the DOCX (9870 bytes) is about 900 times larger than the text file (11 bytes).  It’s not clear to me how that is "suggesting that a docx file is as compact as the original text file."

  10. Dave S. says:

    I believe I wrote – apples to apples. There was a reason I did not say table.  

    "# dmahugh said on October 12, 2007 5:00 PM:

    Well, Anna, I was just following the rules of that comparison page, to get an apples-to-apples comparison.  But since you asked …

    I found a copy of the text of the constitution at

    So pasting that into a Word 2007 document and saving it, I get these file sizes …

    .TXT = 45,992 bytes

    .DOC = 85,504 bytes

    .DOCX = 46,618 bytes"

    45k txt -> 46k docx is misleading to anyone who has to ask, such as the anonymous Anna Niemous.

  11. Dave S. says:

    " DOCX is great if compatibility with existing Office documents is a priority"-Doug Mahugh

    Files, documents, and formats are always compatible with each other – that is, nothing in one file, document, or format causes the contents of another to fail.

    The real question is whether the new format is compatible with the old applications and the old format compatible with the new apps.

    For Microsoft’s own applications a converter had to be written to patch the old applications. This indicates that the new format is not compatible with the old applications, only that a new application can bridge the gap.

    Any older application that was written to handle the old format cannot read the new format, therefore the new format is not really compatible.