Slashdot article on Apple outputting Word XML format


Slashdot: MS Office XML Format Now in TextEdit


I saw this the other day on Slashdot. I have to admit that this is the first time I’ve heard about it, and I’m not really familiar with exactly what TextEdit outputs. The Slashdot post links to an example file, and if that’s really what’s being output, I’m surprised. Does anyone know for sure whether that is what TextEdit produces? The file posted matches the XML format we had in the Beta 2 version of Word 2003. You can tell by the namespace: “http://schemas.microsoft.com/office/word/2003/2/wordml”. The final namespace that we shipped with was “http://schemas.microsoft.com/office/word/2003/wordml”. The file also has the processing instruction (PI) that tells the shell to launch the file in Word instead of IE. I talked about that behavior in this post last month.


If that really is what they output and you want to try opening that file in Word 2003, there are a couple of things you need to change. The first is the namespace, as I mentioned above: just remove the “/2” and it will be in the right namespace. The same goes for the hint namespace, which should be “http://schemas.microsoft.com/office/word/2003/auxHint” and not “http://schemas.microsoft.com/office/word/2003/2/auxHint”.
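If you'd rather not edit the file by hand, the namespace fix can be scripted. Here is a minimal sketch in Python that does a plain text substitution of the two Beta 2 namespaces mentioned above. The function name `fix_beta_namespaces` and the sample string are made up for illustration, and the sketch assumes these two URIs are the only Beta 2 leftovers in the file, which I haven't verified against real TextEdit output.

```python
# Map each Beta 2 namespace URI to the one that shipped with Word 2003.
# Only the two namespaces named in the post are covered (an assumption).
BETA_TO_FINAL = {
    "http://schemas.microsoft.com/office/word/2003/2/wordml":
        "http://schemas.microsoft.com/office/word/2003/wordml",
    "http://schemas.microsoft.com/office/word/2003/2/auxHint":
        "http://schemas.microsoft.com/office/word/2003/auxHint",
}


def fix_beta_namespaces(xml_text: str) -> str:
    """Return xml_text with the Beta 2 namespace URIs replaced (hypothetical helper)."""
    for beta, final in BETA_TO_FINAL.items():
        xml_text = xml_text.replace(beta, final)
    return xml_text


if __name__ == "__main__":
    # Illustrative root element only, not a real TextEdit file.
    sample = (
        '<w:wordDocument '
        'xmlns:w="http://schemas.microsoft.com/office/word/2003/2/wordml" '
        'xmlns:wx="http://schemas.microsoft.com/office/word/2003/2/auxHint"/>'
    )
    print(fix_beta_namespaces(sample))
```

A text replace is used instead of an XML parser so that the rest of the file, including the PI at the top, is left byte-for-byte untouched.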


There are some interesting comments on that post as well. It looks like some people thought this was a glimpse at the new XML formats for Office 12, but I think most people saw that this was the XML format from the last version. One comment that really caught my eye said that the XML was overly complicated. As I’ve mentioned before, we could have gone with a really simple schema, but then it would not have been full fidelity. There are over a billion Word documents out there today that we need to be able to represent in XML. We have to be able to represent every piece of Office functionality in XML, and that results in a pretty large schema. Word, PowerPoint, and Excel are rich in functionality. The files don’t have to be that complicated, though, if you don’t care about all the functionality. Just read this post to see how simple a file can be. As you start representing more functionality, it will become more complex.
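To give a feel for how small a file can be, here is a rough sketch in Python that writes out a one-paragraph Word 2003 XML document using the final namespace. The helper name `minimal_wordml` is made up for illustration, and I'm assuming the PI referred to earlier is `<?mso-application progid="Word.Document"?>`. Treat it as a starting point, not a complete writer.

```python
import xml.etree.ElementTree as ET

# Final Word 2003 namespace, as quoted earlier in the post.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def minimal_wordml(text: str) -> str:
    """Build a document with one paragraph holding one run of text (hypothetical helper)."""
    ET.register_namespace("w", W_NS)  # serialize with the usual "w:" prefix
    doc = ET.Element(f"{{{W_NS}}}wordDocument")
    body = ET.SubElement(doc, f"{{{W_NS}}}body")
    para = ET.SubElement(body, f"{{{W_NS}}}p")   # paragraph
    run = ET.SubElement(para, f"{{{W_NS}}}r")    # run of text
    t = ET.SubElement(run, f"{{{W_NS}}}t")       # the text itself
    t.text = text
    xml_body = ET.tostring(doc, encoding="unicode")
    # The PI below is assumed to be the one that tells the shell to open Word.
    return (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
        '<?mso-application progid="Word.Document"?>\n'
        + xml_body
    )


if __name__ == "__main__":
    print(minimal_wordml("Hello World"))
```

The result has no run properties, styles, or document properties at all. Each of those adds markup as you start using it.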


-Brian

Comments (7)

  1. Ignace says:

    The comments about the complexity of the XML output are to be expected, I think. It’s about the first impression. People who are trying to figure out the XML of a Word document all get this "shock" when they open one in a simple editor for the first time. Using Internet Explorer to view it and collapsing some parts makes a world of difference. Most people don’t know about this functionality of IE, I think.

    The Word XML also doesn’t allow you to study it by making some simple examples in Word yourself and looking at the code afterwards. In this sense the code is too complex, yes. You really have to download the documentation and start from scratch in order to understand it. For some people this may be a little too much effort. Blame them.

    Those using Word XML professionally will find it quite useful, I’m sure. Probably those complaining will never need to use it.

    Just my thoughts…

  2. Mark Baird says:

    The problem with WordML is that it is only as good as the binary, as is evident from the XML snippet below.

    You will notice that there are two runs of text. The first run has the run properties and the second run does not. You get the same results if you output this same file to HTML.

    It appears that instead of fixing the DOM Microsoft just added logic to Word that reads the binary that goes like this, "if the run of text does not contain run properties then just use the properties from the previous run".

    Because of these hidden gems, it is difficult for programmers to use the output from Word, in this case WordML, without any surprises. In my case this little gem wasn’t found until after we had posted a client’s financials on the web, in HTML, for all to see, only to have the client upset that we couldn’t foresee this little hidden Microsoft gem that has been around since Word 2K.

    <w:r>

    <w:rPr>

    <w:sz-cs w:val="20"/>

    </w:rPr>

    <w:t> GDS or pricing that differs between long-haul or short-haul trips. In some cases we may approach airlines with pricing options from any combination of our business units. We plan to offer airlines a choice of multiple pricing schedules, and we expect that each new airline participation agreement will differ in many ways, including by price. We believe that airlines will see the advantages that may be inherent in moving quickly to enter into these new, more customizable relationships with us.</w:t>

    </w:r>

    <w:r>

    <w:t>It is difficult to predict with certainty, in a recently deregulated environment, the impact of new pricing models on our revenues. It is our goal to maintain, over a several year period, a neutral impact to the average unit revenue (including merchandizing revenue) in the Sabre Travel Network business. If certain pricing models were to gain further traction, we could see a reduction in our average unit revenue which could be partially offset by reduced expenses. Our goal is to have new agreements in place with many airlines before the expiration of our DCA 3-Year Option agreements in 2006. Our DCA 3-Year Option agreement with US Airways, the only such agreement with a major U.S. airline that had a 2005 expiration date, was extended for one year beginning in October of 2005. Regardless of the outcome of these pricing models, it is our intent to reduce costs in the Sabre Travel Network business. </w:t>

    </w:r>

  3. BrianJones says:

    Ignace, you are absolutely correct. Folks that have wanted to (or actually tried to) work with the binary formats will find these new XML formats a huge step forward. The Office applications are complex so it will at times be a bit of work, but it is hard to stress how much easier it will be than it was in the past.

    Mark, I’m not following what you’re having trouble with. The encoding of your example was a bit off, but I was still able to see what you were trying to do, and it works fine for me. If a text run doesn’t have any properties specified, it inherits the properties of the paragraph, not the previous run. Your example only specified the font size for complex script text, so there isn’t a noticeable difference, but try this example:

    <w:p>

    <w:r>

    <w:rPr>

    <w:sz w:val="20" />

    </w:rPr>

    <w:t>GDS or pricing that…</w:t>

    </w:r>

    <w:r>

    <w:t>It is difficult to predict…</w:t>

    </w:r>

    </w:p>

    You should see that the two runs have different formatting when you open them in Word.

    Can you give a bit more detail on what you’re having trouble with? You mentioned HTML from Word 2K. Is that the problem or are you having the same trouble with the XML formats?

    -Brian
