Binary Documentation (.doc, .xls, .ppt) and Translator Project Site are now live


As promised last month, the binary documentation (.doc, .xls, .ppt) is now live. In addition, the project to create an open source translator (binary -> Open XML) has now been set up on SourceForge, and the development roadmap has been published. Read my earlier post for more background on this: http://blogs.msdn.com/brian_jones/archive/2008/01/16/mapping-documents-in-the-binary-format-doc-xls-ppt-to-the-open-xml-format.aspx


Here’s an overview of what’s now available:


Office Binary (doc, xls, ppt) Translator to Open XML


The “Office Binary (doc, xls, ppt) Translator to Open XML” project is now live on SourceForge: http://b2xtranslator.sourceforge.net/


As you may remember, this was a request from a number of national bodies, and while Ecma TC45 believed it was outside the scope of DIS 29500, they did talk with Microsoft and came to this agreement:


Nonetheless, Ecma International discussed this subject with Microsoft Corporation, the author of the Binary Formats.  To make it even easier for third party conversion of Binary Format-to-DIS 29500, Microsoft agreed to:



  • Initiate a Binary Format-to-ISO/IEC JTC 1 DIS 29500 Translator Project on the open source software development web site SourceForge (http://sourceforge.net/ ) in collaboration with independent software vendors.  The Translator Project will create software tools, plus guidance, showing how a document written using the Binary Formats can be translated to DIS 29500.  The Translator will be available under the open source Berkeley Software Distribution (BSD) license, and anyone can use the mapping, submit bugs and feedback, or contribute to the Project.  The Translator Project will start on February 15, 2008. 

  • Make it even easier to get access to the  Binary Formats documentation by posting it and making it available for a direct download on the Microsoft web site no later than February 15, 2008.  The Binary Formats have been under a covenant not to sue and Microsoft will also make them available under its Open Specification Promise (see www.microsoft.com/interop/osp) by the time they are posted.

We will modify DIS 29500 to include an informative reference to the SourceForge project.


While the project is still in its infancy, you can see what the planned project roadmap is, as well as an early draft of a mapping table between the Word binary format (.doc) and the Open XML format (.docx).


Microsoft Office Binary (doc, xls, ppt) File Formats


The binary documentation itself is available here: http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx



  • Word 97-2007 Binary File Format (.doc) Specification PDF | XPS

  • PowerPoint 97-2007 Binary File Format (.ppt) Specification PDF | XPS

  • Excel 97-2007 Binary File Format (.xls) Specification PDF | XPS

  • Office Drawing 97-2007 Binary Format Specification PDF | XPS

It’s all covered under the Open Specification Promise.


Another Surprise


Another great surprise in all of this is that we’ve also made the documentation for a few other supporting technologies available, as it may be of use to folks implementing the binary formats: http://www.microsoft.com/interop/docs/supportingtechnologies.mspx


The technologies included are:



  • Windows Compound Binary File Format Specification PDF | XPS

  • Windows Metafile Format (.wmf) Specification PDF | XPS

  • Ink Serialized Format (ISF) Specification PDF | XPS

These technologies are also all available under the Open Specification Promise.


Have a great weekend everyone!


-Brian

Comments (33)

  1. JasonG says:

    Wow! Who would have thought…

    But you just know the spin will be a spinnin’ soon.  

  2. Stephane Rodriguez says:

    I think Microsoft should be commended for making this information available.

    With that said, the specs are still incomplete.

    I first gave a cursory look at BIFF.

    1) Missing records : examples are 0x00EF and 0x01BA, just off the top of my head.

    2) No specification: for example, the OBJ record for a Forms Combobox contains an ftCblsData, but this is undocumented:

    ftCblsData (12h)

    Offset  Name        Size  Contents
    0       ft          2     =ftCblsData (12h)
    2       cb          2     Length of ftCblsData
    4       (Reserved)  var   Reserved

    See this "reserved" area above?

    Then I gave a cursory look at the Office Drawing specs (i.e. MSO). And again, just a cursory look at it showed unspecified records.

    Here is one example :

    msofbtClientData (F011)

    host-defined

    host-specific data

    Those "host-defined" elements are used by Excel to store very relevant bits. But this just remains undocumented.

    My 2 cents.

  3. Brian,

    The .doc specification document title is:

    MICROSOFT OFFICE WORD 97-2007 BINARY FILE FORMAT SPECIFICATION

    However, there is nothing there that describes Word 2007 features. The documentation barely goes up to Word 2003.

    At a glance, there are no FIB or DOP records related to Word 2007.

  4. André says:

    Brian,

    The "Office Binary (doc, xls, ppt) Translator to Open XML" project is now live on sourceforge

    Yes, but as of now (16/2/2008) there is not a single line of code nor any binary package for download:

    https://sourceforge.net/project/showfiles.php?group_id=216787&package_id=200089

    Not that I question the ability of Dialogika to provide it, these guys rock, but you cannot make an announcement when the only thing achieved is a "website". The promise was the release of software, no?

    I think the release of the binary spec is a step in the right direction. I would recommend that Microsoft set up a wiki so that hardcore BIFF experts such as Stephane can post information requests and your internal experts can retrieve and contribute the information. This way we get high-class documentation.

  5. Yesterday I noted that the Office Binary <-> OpenXML translator project had started on SourceForge

  6. hAl says:

    @Andre,

    I think the 15th was the expected start date of the project.

  7. It will be interesting to see how the project progresses. On the SourceForge page: "Milestone 2 June 30th, 2008: Final Word translator" is, I think, way too optimistic. I can’t go by without predicting that it will be far from final by that date.

    Mappings between some DOC and DOCX structures are not straightforward. For example revisions and table formatting are written very differently and it will take time to get them right even with help from Microsoft.

    It is more likely to take a year than 5 months to get to the first final release.

  8. Ian Easson says:

    Betanews had a confused and incorrect story about this yesterday.  (They reported that Microsoft had just put OOXML under OSP — something that goes in fact back to 2006).  I commented, and pointed out their error.

    They then corrected it.

    Their "correction" amounted to:

    –  a statement that Microsoft has now "clarified" that OSP is the "one" license that applies to OOXML

    – a statement that Microsoft is also just today announcing that it will be releasing the binary formats under OSP.

    I then wrote another comment pointing out that these corrections were themselves incorrect.  

    My comment was on their site briefly, and then it "disappeared".  (Perhaps it was because I was getting so frustrated at the inaccuracies that I said they should just withdraw the story, and contemplate firing the reporter who apparently knew nothing about OOXML and related matters.)

    I have now written a comment giving the facts.  Let’s see if it stays up on their web site, or if they will remove it as well.  (This time, I left out the nasty comment about the ignorant reporter!)

  9. Tux says:

    Why should anybody use a M$ format? There is an open and free standard with good documentation: ODF. The only reason for implementing the specs of DOC, XLS, PPT is to get better import filters in OpenOffice.org.

  10. orcad says:

    Standard Bodies should request the mapping between DOC and DOCX, otherwise delete those functions from the proposed standard.

    Now they are in good position to do so, since you have some published documentation about the DOC format.

    But the documentation about the DOCX format is still missing…

  11. As promised, Brian Jones has announced the posting of the Microsoft Office Binary Format specs in this

  12. Brian Jones has announced the posting of the Microsoft Office Binary Format specs on this blog . Along

  13. Doug Mahugh says:

    Binary documentation and translator project. On Friday Brian Jones covered the availability of the Office

  14. Ian Easson says:

    Orcad,

    I’m not sure what to make of your confused comments.

    You say:

    "Standard Bodies should request the mapping between DOC and DOCX"

    If you had read Brian’s blog, you would realize that they have already considered and rejected this.

    You say:

    "…otherwise delete those functions from the proposed standard".

    Do you mean delete the information about the mapping from the OOXML specification?  It’s not there in the OOXML standard to be deleted!  (and it is a standard — ECMA — by the way)

    You say:

    "But the documentation about the DOCX format is still missing…"

    From where?  It’s documented in the OOXML standard.

  15. Binary documentation and translator project. On Friday Brian Jones covered the availability of the Office

  16. Brian Jones, Senior Program Manager just broke the news in his post today. Quoting from him: "As

  17. W^L+ says:

    Brian, I commend you and Microsoft for releasing this information. Hopefully, groups like OpenOffice.org and KOffice can get favorable rulings from their legal eagles that will allow them to improve binary compatibility.

    While I still disagree about the necessity for OOXML, I have always believed in giving thanks where it is due.

  18. News says:

    Brian Jones has announced the posting of the Microsoft Office Binary Format specs on this blog . Along

  19. Ian Easson says:

    Here is a piece of OOXML news for you, Brian.

    As of 16 Feb, the arXiv online library of scientific preprints "now accepts submissions in DOCX/OOXML format (from Word 2007 and other OOXML compliant applications)."  It is located at http://xxx.lanl.gov/

    Since its inception, the library has focused on accepting submissions in TeX or LaTeX, with a few other formats allowed on an exception basis (.doc, PDF).

  20. Stephane Rodriguez says:

    Just a quick note: before those specifications were made freely available, you had to email Microsoft. In fact, the steps are described here:

    http://support.microsoft.com/kb/840817

    "Microsoft Office Binary File Formats

    Microsoft makes its .doc, .xls, .xlsb, and .ppt binary file format specifications available under a royalty-free covenant not to sue to anyone who wishes to implement all or part of these specifications in their products. Implementation includes the ability to use the specification documentation for analysis and forensic reference purposes. Microsoft Office Drawing File Format for 2007 and Visual Basic for Applications (VBA) File Format for 2007 are also available under this program."

    See, there is one binary file format mentioned that is not being made freely available: VBA.

    What’s the logic? VBA is certainly part of the binary formats (not just in Office 97-2003, but in Office 2007 as well: see .bin parts).

    Also, a welcome specification would be the new encryption specification (Encryption stream, etc.). This is not documented anywhere either, to the best of my knowledge.

    Then, we have all the application-level bits. But I’ll leave that for later. The boundary between document-level interoperability and application-level interoperability is unclear due to the amount of application-level bits that are stored in files (for instance, all those bits that are used to pre-check options).

    Just a quick comment on the specifications.

    1) It would be handy to have a TOC-browsable version of it.

    As in MSDN Library, screenshot : http://www.arstdesign.com/BBS/picsupload/Office97doc.gif

    If you guys have a .doc/.docx version of the files, some people could convert them to .html, then .chm (compressed html help).

    As for the specifications, it is obvious that there remain substantial unspecified or missing records, but I believe these are the true Microsoft internal specs. The reason I think so is that when you have the source code, those specs are handy references. The problem in this scenario is, obviously, when you don’t have the source code…

    2) I believe those specs were really incrementally updated as each major Office version shipped. In fact, the portions covering the Office 97 formats did not change at all (including typos and errors), and sections were added to account for releases 2000/XP/2003 and 2007 (compatibility mode).

  21. JasonG says:

    In case anyone missed it, there are some interesting comments from Joel Spolsky:

    http://www.joelonsoftware.com/items/2008/02/19.html

  22. Brian Jones Open XML Formats Binary Documentation (.doc, .xls, .ppt) and Translator Project Site 혹시 소식을

  24. For folks excited by the availability of the binary file format specs last week , but concerned about

  26. an obersver says:

    This information is just promises that come too late, solely for the purpose of getting the bad OOXML format accepted as an ISO standard. And the translation project on SourceForge also consists only of promises that may well never be fully or conveniently implemented. The deadline, which is really short, is after the ISO vote! In French we would say "ceci ne sert qu’à noyer le poisson" (literally, "this only serves to drown the fish"). And we do not want to be these fish.

  27. Olivier says:

    It seems that the OOXML format was designed with backward compatibility with the binary formats in mind. Now I wonder what that means. I take it as: easy and faithful conversion to the new formats. Pro: upgrading existing implementations of the binary formats to support OOXML is easier. Con: if OOXML makes it as a standard, then all future implementations of OOXML will have to support parts of the old binary formats!

    Such a choice therefore seems profitable in the short term for the existing implementations of the binary formats. Frankly, the only one I know of is Microsoft’s. Load any .doc or .xls file in OpenOffice and you’ll see what "messed-up layout" really means. Stephane above and other people on this blog comment on the shortcomings of the specs of the binary formats. They could be the cause of the difficulty of recreating the exact layout of MS Office documents.

    I am worried about the long-term implications of this choice. Any government committing to the choice of a standard does so in the hope of having its archives readable for centuries. If that means having to fill the gaps in the specs and trying to emulate a decade of quirks, this is clearly not going to benefit anyone in the long term.

    If Microsoft is truly concerned about the long term interest of its customers, it should design its standard as clearly as possible, in a way that guarantees the possibility of writing a *fully compliant* implementation from scratch in 200 years. And I think that means either dropping the "backward compatibility" with binary formats or making the latter (and any associated quirks) become an international standard too. The reasonable choice would obviously be a clean break with the binary formats.

    Why won’t Microsoft do it? The answer is simple. The long term interest of its customers is to be able to easily develop fully compliant implementations from scratch in the future. In the short term, that would allow competing products to become fully compatible with MS Office. Which is precisely what Microsoft has been fighting against for years.

  28. Stephane Rodriguez says:

    @Olivier

    Joel Spolsky said it very well the other day on his blog:

    "(…) you have to redo all that work that some intern did at Microsoft 15 years ago. The bottom line is that there are thousands of developer years of work that went into the current versions of Word and Excel, and if you really want to clone those applications completely."

    That is very true. And it’s also true for the special angle-bracket edition of the file formats. I have spent months round-tripping BIFF-based charts. And I have spent months round-tripping angle-bracket-based charts (with all the knowledge of how it worked before, so that would actually be more for someone starting now).

    It’s hard to say how it’s going to evolve in the future (Office 2009 beta 1 will be a hint of that). The platform is at risk, that’s for sure. I’m sure some people think the document-level stuff is going to change a lot in the future. I don’t think so; it’s Microsoft’s cash machine. I see a lot coming at the application level (VSTA .NET on steroids among others). With that said, easy collaboration on the cloud is becoming key, and you don’t need an Office license for that anymore. (Expect a Microsoft acquisition in this area).

  29. Microsoft has now released documentation for the Office binary formats (.doc, .xls, .ppt) in addition to kicking off the project for an open source binary to Open XML converter (.doc to .docx). They threw in WMF for good measure. The…

  30. Carol says:

    This is a great blog about Microsoft formats.  Thanks for the information.  I have a few questions….

    The Office binary format documentation for Excel 97-2003 covers mostly just the Excel workbook file format. For workspace files, it says "Excel creates several other files, some of which are documented in this material. The workspace file (.XLW extension in Microsoft Windows) and the toolbar file (.XLB extension in Microsoft Windows) are not covered in this document. The files are used to configure Excel’s UI and do not contain user data." This information is too coarse for people who have to archive and process files generated by Excel. Is there any documentation about the Excel Workspace (.xlw) and Excel Template (.xlt) file formats?

    In addition, what is the difference between the XML spreadsheet format (.xml) generated by Excel 2003 and the current OOXML format for spreadsheets (SpreadsheetML)? Do they use different XML schemas? How compatible are they?

  31. Derrick Bowen says:

    Now if Microsoft would just release the visio format documentation all could be happy in the world.

    well, maybe not quite that far, but it’d be a step.

  32. Stephane Rodriguez says:

    "The workspace file (.XLW extension in Microsoft Windows) and the toolbar file (.XLB extension in Microsoft Windows) are not covered in this document."

    Microsoft probably considers those file formats to be application-level, as opposed to document-level. This can be argued, obviously.

    But it should indeed be remembered that anything application-level remains undocumented. For instance, VBA remains undocumented.

  33. I’m catching up with a bunch of Open XML blogging from ages ago, so apologies if some of these are old