Leveraging content in other formats


There is a really cool feature in the WordprocessingML format that lets you hand a file off to a consumer with content in other formats embedded within the WordprocessingML, as long as you know that the consumer supports that alternate format. We had a lot of customers asking for this type of functionality, so we added alternate content anchors (altChunks) to the WordprocessingML format to help with document assembly scenarios.


Scenario


I’m building a document generation tool that lets users fill out a form on a web site and automatically generates a rich word processing document from the values they entered (think of a tool like a contract generator). Some of what I get back from the web form is rich content, so it’s formatted as XHTML. Rather than translating the XHTML into WordprocessingML, I can just include the XHTML in the file as is, as long as I know that the user is going to open the file in an application that knows how to consume XHTML. If I don’t know this, then of course I’ll need to transform it into WordprocessingML.


This was a scenario we saw a lot of folks hitting with the earlier version of WordprocessingML from Office 2003. People had content repositories with HTML, and they didn’t want to have to build an HTML to WordprocessingML translator just to get those chunks into the Word document.
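To make the scenario concrete, here is a minimal sketch in Python of what such a generator might write out. It is an illustration under stated assumptions, not a reference implementation: the function name, the part name afchunk.xhtml and the relationship ID rIdXhtml1 are made up for this example, and the namespaces, the aFChunk relationship type and the application/xhtml+xml content type reflect my reading of the published spec and should be checked against it. The result is only useful if the consuming application actually imports XHTML.

```python
import io
import zipfile

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
XML_DECL = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'


def build_docx_with_xhtml(xhtml_body: str) -> bytes:
    """Package one native paragraph plus one XHTML alternate chunk."""
    chunk_id = "rIdXhtml1"        # hypothetical relationship ID
    chunk_name = "afchunk.xhtml"  # hypothetical part name, relative to /word

    parts = {
        # Declares the content type of every part in the package.
        "[Content_Types].xml": (
            f'{XML_DECL}<Types xmlns="{CT_NS}">'
            '<Default Extension="rels" ContentType='
            '"application/vnd.openxmlformats-package.relationships+xml"/>'
            '<Default Extension="xhtml" ContentType="application/xhtml+xml"/>'
            '<Override PartName="/word/document.xml" ContentType='
            '"application/vnd.openxmlformats-officedocument'
            '.wordprocessingml.document.main+xml"/>'
            '</Types>'
        ),
        # Points from the package root to the main document part.
        "_rels/.rels": (
            f'{XML_DECL}<Relationships xmlns="{PKG_REL_NS}">'
            f'<Relationship Id="rId1" Type="{R_NS}/officeDocument" '
            'Target="word/document.xml"/>'
            '</Relationships>'
        ),
        # The anchor: w:altChunk marks where the XHTML belongs in the body.
        "word/document.xml": (
            f'{XML_DECL}<w:document xmlns:w="{W_NS}" xmlns:r="{R_NS}">'
            '<w:body>'
            '<w:p><w:r><w:t>Generated contract</w:t></w:r></w:p>'
            f'<w:altChunk r:id="{chunk_id}"/>'
            '</w:body></w:document>'
        ),
        # Resolves the anchor's r:id to the embedded XHTML part.
        "word/_rels/document.xml.rels": (
            f'{XML_DECL}<Relationships xmlns="{PKG_REL_NS}">'
            f'<Relationship Id="{chunk_id}" Type="{R_NS}/aFChunk" '
            f'Target="{chunk_name}"/>'
            '</Relationships>'
        ),
        # The user's rich content, wrapped in a minimal XHTML document.
        "word/" + chunk_name: (
            '<html xmlns="http://www.w3.org/1999/xhtml">'
            '<head><title>chunk</title></head>'
            f'<body>{xhtml_body}</body></html>'
        ),
    }

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, xml in parts.items():
            zf.writestr(name, xml)
    return buf.getvalue()
```

Calling build_docx_with_xhtml("<p>Payment is due in <b>30 days</b>.</p>") and writing the returned bytes to a .docx file gives a package in which the XHTML travels untouched next to the native WordprocessingML, so the generator never has to translate it.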


Types of content allowed


This is completely up to the consuming application. The Ecma spec just defines where you put the alternate content, and how you identify it. There are no limitations on what kind of content you can place, and there are no rules on what type of content you must support. If you look at the definition in the spec, it says the type of content allowed is:


Any content, support for which is application-defined.


[Note: Some examples of formats which might be supported include:



  • Text = application/txt

  • RTF = application/rtf

  • HTML = application/html

  • XML = application/xml

end note]


So you can see that there are a few examples of the types of content you might want to support, but there are no limits or requirements.


Creating alternate content


To make it clear that there is no additional burden of supporting the various types of content that may occur, we said in the spec that a conformant producer is not allowed to create alternate content chunks. This way, when you write a consuming application, you know that you aren’t required to support these alternate chunks; it’s something you can optionally decide to support. A producer should only create alternate chunks if it knows what the consumer understands. There is no guarantee that anyone else will support XHTML within the files, for example.
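As a rough illustration of the consumer side, the sketch below shows one way an application might decide, chunk by chunk, whether to import alternate content or ignore it and keep going. The function names and the supported set are hypothetical, the code only handles targets stored inside the package, and the actual translation of a supported chunk into WordprocessingML is left out.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
R = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"
REL = "{http://schemas.openxmlformats.org/package/2006/relationships}"
CT = "{http://schemas.openxmlformats.org/package/2006/content-types}"


def content_type_of(part_name, types_root):
    """Look up a part's content type: Override entries first, then Default."""
    for override in types_root.findall(CT + "Override"):
        if override.get("PartName") == "/" + part_name:
            return override.get("ContentType")
    extension = part_name.rsplit(".", 1)[-1].lower()
    for default in types_root.findall(CT + "Default"):
        if default.get("Extension", "").lower() == extension:
            return default.get("ContentType")
    return None


def classify_alt_chunks(docx_bytes, supported):
    """Split altChunk parts into (imported, skipped) lists.

    Each entry is a (part_name, content_type) pair. A chunk whose type is
    not in `supported` is skipped, not treated as an error.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        types_root = ET.fromstring(zf.read("[Content_Types].xml"))
        document = ET.fromstring(zf.read("word/document.xml"))
        rels = ET.fromstring(zf.read("word/_rels/document.xml.rels"))

    targets = {rel.get("Id"): rel.get("Target")
               for rel in rels.findall(REL + "Relationship")}

    imported, skipped = [], []
    for chunk in document.iter(W + "altChunk"):
        target = targets.get(chunk.get(R + "id"))
        part, ctype = None, None
        if target:
            # Targets are resolved relative to the /word folder.
            part = posixpath.normpath(posixpath.join("word", target)).lstrip("/")
            ctype = content_type_of(part, types_root)
        if ctype in supported:
            imported.append((part, ctype))  # a real consumer would translate here
        else:
            skipped.append((part, ctype))   # ignore it, but remember to tell the user
    return imported, skipped
```

A consumer with no altChunk support at all would simply pass an empty set as supported: every chunk lands in the skipped list and the rest of the document is processed as usual.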


Guidelines in the spec


The IBM folks are clearly spending some resources scouring the Open XML spec looking for ways in which they can try to block the ISO approval (we’ve already discussed the huge financial bet they’ve made in ODF being the only standard). It looks like Rob Weir of IBM found a rather poorly worded description of alternate chunks in Part 1 of the spec. I think we did a poor job of explaining that there is no requirement to consume alternate chunks. The spec was trying to call out that any content type can be used if you want to, but that a document containing such a chunk is not a conformant document, because consumers don’t have to support that content. We also wanted to be clear that if a consumer does understand the alternate content, it should translate it into WordprocessingML to match the rest of the file, so that on save, it’s all the same format.


This was a good catch by Rob. I agree that it could be a bit clearer about what is required of both consumers and producers. I really wish IBM had spent more energy trying to improve the spec earlier on (they are members of Ecma and could have joined the Open XML TC). This is something we easily could have cleared up the wording on. As part of the ISO fast-track process though, we have a chance to gather comments from the various national bodies and make any fixes required before finalizing. This is definitely something we can look into clearing up.


-Brian

Comments (14)

  1. Adam says:

    "This is something we easily could have cleared up the wording on."

    While reviewing/editing/approving at a fairly forced 18.3 pages/day[0] in order to make the Dec 2006 Ecma vote deadline?[1] Are you sure you had time?

    [0] http://www.robweir.com/blog/2006/12/notable-achievement.html

    [1] http://www.sutor.com/newsite/blog-open/?p=1281

  2. Adam says:

    "consumers don’t have to support [alternate chunk] content."

    Hmmm…..looks to me like the standard says it does. I think Rob’s interpretation of that paragraph ("A WordprocessingML consumer shall treat the contents of such legacy text files as if they were formatted using equivalent WordprocessingML […]") is spot on.

    Yes, the "support for which is application-defined" does somewhat imply that different applications may support some formats better or worse than others, but the "shall" in the paragraph in question does read to me like the consumer has to do *something* "valid" with it.

    The only other part that supports your point is the "assuming that the content type of Demo.html is supported by the application" sentence in the example below. However, if the MOOXML spec is like most others, examples are informative, and if they conflict with normative text it is the example that is considered to be in error.

    What’s the process going to be for MOOXML defect reports? Will they be publicly listed/discussed somewhere? Is there a rough guess anywhere as to when the first TC might be issued?

  3. Sinleeh says:

    Thanks Brian for clearing it up. Weir’s post was a bit confusing when I first read it.

    As I understand it, the "alternate content" part in WordprocessingML allows any arbitrary content, including binary blobs. I understand the design reasons behind this, but I wonder whether there is anything in OOXML that will stop vendors from sneaking proprietary "extensions" into these binary blobs to achieve vendor lock-in?

    I mean, if I were devilAdvocateVendor1, I could write a program that only my devilApplication can read, and that substitutes all occurrences of "shall not" with "shall" and vice versa in the "10 commandments". I could then store the "10 commandments" file in such a way that an unsuspecting person using an alternative application would read exactly the opposite of what the "10 commandments" are about.

  4. jones206@hotmail.com says:

    Sinleeh, you are correct that people could put their own proprietary binary information into the file, but there is nothing in the spec that says others need to understand it. In addition to that, notice that a truly conforming producer isn’t allowed to create these things. The only folks who would use them are those who aren’t trying to create a conforming, interoperable document, and who instead know more about the application that will be consuming it.

    ODF and Open XML are both fully extensible specifications, which means that they can be improved over time, but it also means 3rd parties can add their own proprietary markup. It’s kind of hard to put a restriction on this, and it’s actually undesirable. An application may decide that they want to add some stuff to OpenXML, but it’s not worth submitting to the Ecma TC because it’s too specialized.

    Open Office today has a ton of proprietary extensions that they’ve added to ODF. The way they store spreadsheet formulas, view settings, print settings, and even some layout settings (as I discussed last week) are proprietary extensions. It’s not always a bad thing, as you may have things that you don’t think need to be included in the standard (especially if it doesn’t affect the interoperability of the document).

    -Brian

  5. Doug Mahugh says:

    Here are a few links to recent news of interest to Open XML developers … Package Explorer Update. The

  6. Adam says:

    "there is nothing in the spec that says others need to understand it."

    Apart from the words "A WordprocessingML consumer shall treat the contents of such legacy text files as if they were formatted using equivalent WordprocessingML […]" you mean.

    "a truly conforming producer isn’t allowed to create these things."

    No, but that doesn’t mean that a WordprocessingML consumer won’t be presented with them. In which case, the standard requires that it "*shall* treat the contents of such legacy text files as if they were formatted using equivalent WordprocessingML […]"

    "I think we did a poor job of explaining that there is no requirement to consume alternate chunks."

    I think that’s an understatement – I think you *did* add a requirement (possibly unintentionally) to consume alternate chunks. This is a standard now. It’s been accepted by Ecma. I realise that MS’s position on standards is to treat them as guidelines instead of rules, but not everyone else does it that way. While MS’s implementors may be fine with a verbal "oh, it’s not really meant that way, don’t worry about it", that’s not good enough for other people.

    You (or Ecma) need to *fix* the spec. Or at least take the first step and issue a Defect Report.

    Given that MOOXML was ratified at 20x the speed of other specs, I’d expect a roughly equivalent rise in the defect density compared with other specs. This is not meant to disparage the people doing the work; I’m sure if they were given an equivalent amount of time to do the work that they’d have got on other specs, they’d have caught more problems. Given that they were skimming it though…

    Of course, given also that MOOXML is more than 6x as long as most other specs, I’d again expect at least an equivalent rise in the total number of defects in this spec than most others. (This may be conservative; a lot of defects in other specs are internal inconsistencies between different sections. As the number of pages goes up, the possible combinations of sections will rise more along the lines of the length squared)

    Taken together, my guess is that this spec, due to its nature alone (length and speed of writing) will be more than 100x as buggy as most other specs. Don’t you think it would be wise to let this one mature for at least a couple more years (e.g. at least until the first Technical Corrigendum, and possibly until a beta of the next version of Word after that has implemented the TC) before it goes further down the standards path (e.g. to ISO)?

  7. hAl says:

    It seems IBM has also made another donation to Groklaw or something, as it is now also trying to aid the IBM effort to move the discussion on ISO certification.

  8. jones206@hotmail.com says:

    hAl,

    Thanks for pointing that out. IBM has taken a very odd approach here. It’s basically saying “hey we don’t like this and want to block it, help us find ways to make that happen.”

    That’s a very competitive antagonistic approach.

    Adam,

    Check out Part 4 of the spec which is a much more detailed reference. You’ll see in section 2.17.3.1 that there is a very detailed description of altChunks. In there, it clearly states that:

    “If an application cannot process external content of the content type specified by the targeted part, then it should ignore the specified alternate content but continue to process the file. If possible, it should also provide some indication that unknown content was not imported.”

    -Brian

  9. Brian Thomas says:

    Shame on IBM for being so antagonistically competitive.  What did you ever do to them, anyway?

    There is a characteristic Microsoft way of looking at things that is strikingly similar to that of the abusive spouse or substance abuser – one that takes reality as most people see it and completely inverts it.

    Just as when Massachusetts insists on a file format that for the first time in decades offers hope that other office software vendors than Microsoft can get the state’s business, Alan Yates cries foul, saying that Microsoft is being "shut out", now when Rob Weir and Bob Sutor point out egregious violations of both the letter and the spirit of the ISO/IEC standards rules, you turn on the innuendo machine, crying that big bad IBM is trying to stop you.

    You latched on uncritically to hAl’s speculation that IBM were somehow behind the Groklaw effort (which he didn’t even explain, let alone substantiate) and ran with it, as though it were fact, using it – in a wonderfully classic abuser’s twist of reality – to paint yourself as the victim of "competitive antagonistic" tactics, as though the reader could not Get the Facts(tm) such as are currently being reported out of the Comes v. Microsoft trial.

    I’ll give you a hint.  Dan Bricklin doesn’t work for IBM.  Andy Updegrove doesn’t work for IBM.  Bruce Schneier doesn’t work for IBM.  Neither do Peter Gutmann, Pamela Jones (whatever you and Darl McBride may say), Marbux or I.  And we all want very much to stop this farcical "standard" that is unquestionably technically inferior and whose transparent purpose is to perpetuate and shore up the illegal monopoly which you have been tried and found guilty of creating and maintaining in courts throughout the world.

    Realize that the world is now able to see that the emperor is indeed naked, and that all your talk just emphasizes your unrelenting intent to continue to deceive and manipulate us into quiescing to your hegemonious lust.  You are like Saruman unmasked; like the Great and Terrible OZ after Toto pulled back the curtain on his control booth.

    In your position, I’d just shut up, and look for another job while I could.

  10. hAl says:

    @Brian Thomas

    The whole Groklaw site is mostly dedicated to supporting IBM in its articles. It is hardly surprising people might consider it an IBM front.

    The suggestions of Groklaw being a very one-sided IBM supporter do not originate from me but are to be found in other places on the internet as well.

    The articles about the office formats regularly show direct citations from IBM bloggers and are always negative on OOXML and never negative on ODF, even though that format, and the way it has proceeded to ISO standardization without being really complete, should warrant criticisms similar to those leveled at OOXML.

    That is of course OK for a blog, but Groklaw is clearly making itself a target for criticism as a very biased blog if it mostly writes very one-sided articles.

    (oh and btw, you should maybe tell Marbux that insulting me on blogs is not the way to discuss issues)

  11. Lurker says:

    Groklaw has absolutely no connection with IBM.

    http://floatingpoint.wordpress.com/2006/10/22/groklaws-non-connection-to-ibm/

    I’m surprised PJ is siding with IBM on this issue. She rarely does that.

  12. Karl G says:

    Lurker, I’ve never run across a groklaw article that didn’t support IBM, but I only read when it gets linked from /. or other news aggregator. I don’t think PJ is being paid by IBM or is in IBM’s palm, but it wouldn’t surprise me if she was.

    As for the main point of this blog entry, I really don’t understand why it’s cool. If your destination processor can parse and translate foo format, which it seems it MUST be able to do in order for this to take place, why not just send it in foo format? Why wrap it in WordprocessingML?

    The Rick Jelliffe story got me reading on this, and I’ve since read your, Weir’s, and Dare’s entries on the subject. I’m inclined to agree with Dare: MS finds ODF lacking in features (reasonable) and OOXML is sprawling.

    You’ve mentioned that both formats are extendable. I’m curious as to why you chose to invent a new standard rather than extend ODF — since I think it was pretty stable before the OOXML standards effort was started — with a couple namespaces (spreadsheet functions, legacy format support, etc), document those extensions, and work with the ODF folks to get your additions merged into ODF.

    You’d still get linux fanbois accusing you of embracing and extending, but that will always happen. It’d be more work for MS, but from my perspective it looks like OOXML is an exercise in doing what’s expedient for Microsoft instead of something that would make the least work for their customers and the rest of the software world. I don’t begrudge MS this, nor do I see a plot to undermine IBM, but it wouldn’t surprise me if there was one.

  13. One of the most common requests we hear related to word processing documents is the ability to merge

  14. Resolution ================ Step 1: Open a new Microsoft Word 2007 document and type A B C Save the document