Word XSLT: Data Only Transform


If you’ve played around with Word 2003’s XML support, you’re probably aware that you can load your own schemas into Word and mark up the document with your XML. When you save the file, you get both the WordML and your XML mixed together. This allows you to search the files for your XML while still maintaining the presentation information. You also have the option to save as Data Only, so the result is just your pure XML.


Often, it’s best to store the files with both your XML and the WordML, so that all the presentation information is preserved. You can always transform the file later to remove the WordML if at some point you want to work with just your data. Here’s a simple transform you can run on a WordML file that will give you the equivalent of a data only save:


http://jonesxml.com/resources/basicDataOnly.zip


It’s a really simple transform. There are some additional pieces of functionality people have requested from our Data Only save such as line breaks for paragraphs, and pretty printing. I’ll look at getting some of that added to the transform at some point and send out another update.
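For readers who just want the general idea without downloading the zip, here is a minimal sketch of how a data-only transform can work. It is not the contents of basicDataOnly.zip; it is an illustration that assumes any element outside the listed Word namespaces is your custom markup, and that the text you care about lives in w:t elements. Because it uses xsl:copy, the copied elements may carry over namespace declarations from the source document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only; not the stylesheet from the download above. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
    xmlns:aml="http://schemas.microsoft.com/aml/2001/core"
    xmlns:v="urn:schemas-microsoft-com:vml"
    xmlns:w10="urn:schemas-microsoft-com:office:word"
    xmlns:o="urn:schemas-microsoft-com:office:office">

  <xsl:output method="xml" encoding="UTF-8"/>

  <!-- Only the body holds marked-up content; skip properties, fonts, styles -->
  <xsl:template match="/">
    <xsl:apply-templates select="w:wordDocument/w:body"/>
  </xsl:template>

  <!-- Word's own elements: drop the tag, keep looking inside for custom XML -->
  <xsl:template match="w:* | wx:* | aml:* | v:* | w10:* | o:*">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Anything else is assumed to be custom markup and is copied through -->
  <xsl:template match="*">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

  <!-- Text is kept only when it is the content of a w:t element -->
  <xsl:template match="w:t">
    <xsl:value-of select="."/>
  </xsl:template>

  <!-- All other text (binary data, whitespace between tags) is dropped -->
  <xsl:template match="text()"/>

</xsl:stylesheet>
```

If the document has text outside your root element, or more than one top-level custom element, the result of this sketch won't be a well-formed XML document, so treat it as a starting point.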


-Brian

Comments (29)

  1. Evan Lenz says:

    Just discovered your blog. Keep the great content coming!

    I wrote a similar stylesheet for "Office 2003 XML". It’s included in the book example files: http://examples.oreilly.com/officexml/

    It’s the saveDataOnly.xsl file (in the chapter 4 examples), and it includes some other features like a configurable option to ignore mixed content, and the reconstruction of processing instructions stored in custom document properties.

    It had a dual purpose of demystifying the "Save Data Only" behavior (at least for people familiar with XSLT) and of being used in the "Apply Transform" option when saving the main example of the chapter. The biggest advantage as I recall was in giving the developer the choice as to what mixed content to keep and what to throw away, rather than having to make an all-or-nothing decision.

    Anyway, I thought you’d find it interesting to compare with.

    Evan

  2. BrianJones says:

    That’s great Evan. I’ll have to look through those other examples too. Thanks for the post!

  3. I would really appreciate a no-holds-barred response to the design goals behind the article “XHTML Schemas in Word 2003 Documents” (http://songhaysystem.com/document.php?cmd=getDoc&get=24). Your post implies that formatting is only preserved by WordML and that any user-defined schemas loaded into a Word document are for data only. Are we confounding the designers, the application architects, of Word 2003 when we decide to load a formatting schema like XHTML into a Word document? Would you openly discourage such a move? Do you, at the very least, find it redundant and therefore useless?

    Please do not be kind to our Mort and be frank in your reply.

  4. Don Box tried to write an XSL transform from WordML to XHTML. There was a Beta XSLT out there from someone in the Office Team as well. Both of these projects are now unavailable at Microsoft.com. I look forward to your XSL template demonstrating a mastery over the WordML run element and producing at least paragraph-level formatting in XHTML (like Bold, Italic, Hyperlinks, Superscript and Subscript). I fail to understand why all of my Google searches produce no one who has completed at least this small subset of XHTML.

    Your approach to this problem is refreshing, as all other contenders tried an all-or-nothing deal. All I am looking for are subsets of WordML translated into XHTML. I am not looking for a replacement for WordML. I am looking for a “lossy” formatting option.

  5. Thanks, Evan. I’ve read some of the sample chapter from Office 2003 XML. This is a start:

    <?xml version="1.0" encoding="UTF-8" ?>
    <xsl:stylesheet version="1.0"
        xmlns="http://www.w3.org/1999/xhtml"
        xmlns:o="urn:schemas-microsoft-com:office:office"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">

      <xsl:template match="w:wordDocument">
        <html>
          <head>
            <title><xsl:value-of select="o:DocumentProperties/o:Title" /></title>
          </head>
          <body>
            <xsl:apply-templates select="w:body" />
          </body>
        </html>
      </xsl:template>

      <xsl:template match="w:body">
        <xsl:apply-templates select="w:p" />
      </xsl:template>

      <xsl:template match="w:p">
        <p><xsl:apply-templates select="w:r | w:hlink" /></p>
      </xsl:template>

      <xsl:template match="w:r">
        <xsl:choose>
          <xsl:when test="w:rPr/w:i"><em><xsl:value-of select="w:t" /></em></xsl:when>
          <xsl:otherwise><xsl:value-of select="w:t" /></xsl:otherwise>
        </xsl:choose>
      </xsl:template>

      <xsl:template match="w:hlink">
        <a href="{@w:dest}"><xsl:apply-templates select="w:r" /></a>
      </xsl:template>

    </xsl:stylesheet>
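The stylesheet above handles italics and hyperlinks. One way to stack more than one kind of run formatting, sketched here for bold plus italic, is to pass each w:r through a chain of modes so the wrappers nest. This is a sketch under assumptions, not a tested converter: it assumes bold is marked by w:rPr/w:b, and that a w:val of "off" switches a property off. The mode names are made up for the example. Superscript and subscript could follow the same pattern with one more link in the chain.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch: templates intended to replace the single w:r template above. -->
<xsl:stylesheet version="1.0"
    xmlns="http://www.w3.org/1999/xhtml"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    exclude-result-prefixes="w">

  <!-- Entry point for a run: start the chain of formatting checks -->
  <xsl:template match="w:r">
    <xsl:apply-templates select="." mode="bold"/>
  </xsl:template>

  <!-- Wrap in <strong> when bold is present and not switched off -->
  <xsl:template match="w:r" mode="bold">
    <xsl:choose>
      <xsl:when test="w:rPr/w:b[not(@w:val = 'off')]">
        <strong><xsl:apply-templates select="." mode="italic"/></strong>
      </xsl:when>
      <xsl:otherwise>
        <xsl:apply-templates select="." mode="italic"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Last link in the chain: wrap in <em> if italic, then emit the text -->
  <xsl:template match="w:r" mode="italic">
    <xsl:choose>
      <xsl:when test="w:rPr/w:i[not(@w:val = 'off')]">
        <em><xsl:apply-templates select="w:t"/></em>
      </xsl:when>
      <xsl:otherwise>
        <xsl:apply-templates select="w:t"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- The run's text content -->
  <xsl:template match="w:t">
    <xsl:value-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```

Formatting that comes from a paragraph or character style rather than from the run's own properties would not be picked up by this approach.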

  6. BrianJones says:

    That’s a great start Bryan. I was going to post something similar, but hadn’t got around to it yet.

    In answer to your earlier question, I definitely do not recommend loading the XHTML schema into Word and marking up the document; that’s duplicating too much information.

    We designed the XML support so that you could leverage both WordML and your XML together. If there are features such as formatting, lists, and tables that Word already supports, then you don’t need to mark that up. Instead you can just take the subset of your schema that isn’t already represented by Word functionality, and only mark up with that.

    Then you can just transform on the way out into your schema. At one point I had an example of doing this for DocBook, but I can’t seem to find it anywhere. I’ll post it if I ever dig it up.

    -Brian

  7. fred says:

    I think the XSLT does not work on complicated files. I have tried using a form for car mileage claims and applying the data only XSLT given, but there is no output after I ticked the transform and data only options when saving the document. Can someone look at my doc file to see if there is any problem with it? Please email me at red131@gmail.com

  8. Peter Sefton says:

    I am skeptical about the value of Word’s schema support, particularly as presented here as a kind of ghetto-ized feature that was never really intended to be used on real-life schemas like XHTML or DocBook.

    I have posted on my blog about this: http://ptsefton.com/blog/2005/08/13

  9. BrianJones says:

    Hey Peter, I just read through your blog entry. There are some good points you raise, and I think there are a few things that need to be addressed.

    The first point is that our main scenarios weren’t about turning Word into an XML editor. As you can imagine, we have a fairly large user base, and investing the amount of resources we did into our XML support just to target the XML editor market wouldn’t have made a lot of sense. The XML support is really for a much broader set of scenarios.

    There is a huge market that exists today for custom Office solutions. People customize the Office applications in all kinds of ways to try to get more out of their documents. By adding the support for custom defined schemas, we made it much easier to build semi-structured solutions on top of Word. Rather than rely on hacks with styles or bookmarks, folks could create a simple schema and add some XML tags into their existing document solutions.

    We provide a fairly rich object model on top of the XML functionality, as well as the ability to save an entire Word document as XML (using the WordprocessingML schema). These tools make it much easier to build document generation and consumption solutions, as well as more reliable add-ins that act on the document while it’s being authored.

    I think the points you raise are great, and there are a couple of things I’ll try to follow up on. I’ll try to see if we have any good XSLTs for mapping our lists into XHTML lists. I’ll also try to get a more complete description of the goals behind our XML support in Office. As you can imagine, each application has different uses for XML since each application targets different scenarios. Excel is all about crunching on data and that’s why we did the work to import lists of data as XML, and then map out the resulting calculations. InfoPath is all about data collection, so there we are much more structured and form-like. Word is about editing rich documents, so Word is more loose, and the XML is good for adding some additional structure for richer semantics.

    -Brian

  10. Peter Sefton says:

    Thanks Brian, your response is promising. The Word / XML story is becoming a bit clearer and I look forward to hearing more on the list issue, and more in general about Office and XML.

    I have another post on this at my site:

    http://ptsefton.com/blog/2005/08/25/word_xml_clarified_a_bit

  11. Barry S. says:

    So if I want to author XML in Word 2003 using my schema and visual styles, it appears that the software doesn’t find it easy to apply styles to the appearance of my XML display in the edit window.

    Can you tell me how that can be done? The default display of XML in W2003 is pretty ugly, tags and all.

    The software also seems not to be capable of imposing my schema object model on the edit window, allowing me to put things pretty much anywhere I want and not providing support for auto-insertion of required elements, etc. Am I missing something, or is W2003 not aimed at the professional, high-performance author?

    My background is with software like Epic editor and XMetal, both fully based on the user-defined schema.

  12. BrianJones says:

    Hey Barry, the Word 2003 XML support was not targeting the same scenarios that XML editors like Epic and XMetal are going after.

    The key scenarios were really around taking existing Word based solutions and leveraging the XML to add additional structure. With that added structure you can more easily pull out information from the file as well as program against your XML structures rather than just the Word structures.

    We’ve done a lot of work in Word "12" to make it even easier to mark up a document with your data, but still keep your data separate from the presentation. The main scenario for the 12 work is to create mappings from the surface of the document into specific XML nodes in your data structures.

    I’ll try to provide more information on this, as we’ve already talked a bit about it at PDC last week. I’ll also try to provide more info around the XML support in 2003 and see if I can help you better understand the functionality that is there, and that isn’t.

    The main way we’ve recommended you enable the end user to insert the XML structures into the files is by creating boilerplate document chunks that are pre-structured and that your solution allows them to easily insert in the correct locations.

    -Brian

  13. Ben says:

    If you’re planning on using the XML saved by Word in any other application, you’re in for a wild (probably read frustrating) ride.

    I’m looking at the issue of producing XHTML from Word’s XML output – there have been some really strange decisions made, in terms of how the document’s structure is represented.

    Take tables for example – a cell that spans rows (two or more vertical cells, merged) is marked by a self-closing element. At some point, later in the document, another self-closing element appears to end the merge.

    Compare this to the rowspan attribute in XHTML – one attribute tells you in advance how many rows/cells will be affected.

    As a programmer I can’t imagine what possible advantage these obscure decisions can have. Bizarre…

  14. Marcel Gnoth says:

    I have a WordML doc with a custom schema attached and would like to extract the data from this custom schema on a server. The bad thing is that the nodes of my custom schema are mixed with the nodes from the WordML schema, so extraction is neither easy nor reliable.

    I would like to do this with .NET code on a BizTalk or SharePoint server. Do you have a good idea how to achieve this? Or some links about this topic?

    Bye Marcel

    <ns1:Ship_Name>
      <w:proofErr w:type="spellStart" />
      <w:p>
        <w:pPr>
          <w:ind w:left="720" />
        </w:pPr>
        <w:r>
          <w:t>Alfreds</w:t>
        </w:r>
        <w:proofErr w:type="spellEnd" />
        <w:r>
          <w:t></w:t>
        </w:r>
        <w:proofErr w:type="spellStart" />
        <w:r>
          <w:t>Futterkiste</w:t>
        </w:r>
        <w:proofErr w:type="spellEnd" />
        <w:r>
          <w:t>. </w:t>
        </w:r>
      </w:p>
    </ns1:Ship_Name>

    I would like to get as a return just

    <ns1:Ship_Name>Alfreds Futterkiste</ns1:Ship_Name>

    ——————————-

    http://www.gnoth.net

  15. BrianJones says:

    Marcel, did you try applying the transform that I linked to? You should be able to apply the transform and it would give you what you’re looking for…

    -Brian

  16. Dinko Fabricni says:

    This transform is good for extracting data from WordML document with custom XML schema.

    But how can I generate Word XML document dynamically from a custom application to retain all Word formatting and to conform to my own XML custom schema?

  17. BrianJones says:

    Dinko,

    Could you describe your problem a bit more? What is the custom application? Is it consuming a Word document or just generating one? If it is only generating one, then what do you mean by retaining all Word formatting?

    I’m sorry, but I’m having a hard time understanding your question.

    -Brian

  18. dinko.fabricni@perpetuum.hr says:

    Hello Brian,

    We have a Captaris Workflow and a .NET written model for the workflow that needs to generate Word documents. Documents have a lot of fixed formatted text and about a dozen fields that are to be filled automatically during the workflow process.

    So far I have managed to create an XSD schema that contains my custom properties definitions and placed the schema fields in the right places in the document.

    Using the WML2XSLT.EXE tool, I have created an XSLT transform.

    When I save the document with the XML data only option, I get not only my custom fields in the document but also all the fixed text that is contained inside the main XML element.

    When I apply the XSLT to this XML, I get the document formatted fine, but the problem is that I want only the data in the custom fields to be contained in the XML data only document.

    This way the end user can change the formatting if he likes and generate a new XSLT, as long as he doesn’t tamper with any XML tags in the document.

    I read about this procedure in the Word 2003 SDK and went through the Memo Styles Sample, but the sample is too simple 🙁

    (http://www.microsoft.com/downloads/details.aspx?FamilyId=4267E2FF-58C0-49DD-BB2A-02C729C68DD0&displaylang=en)

    I hope I made it clear this time 🙂

    Dinko

  19. BrianJones says:

    Hey Dinko, I think it’s a bit more clear. Could you maybe provide a quick example of what your data looks like, and what the Word document looks like? Is it something like this:

    <dinko>
     <field1>Complete</field1>
     <field2>Submitted</field2>
    </dinko>

    and the resulting WordprocessingML looks like this (in shorthand):

    <w:body>
     <w:p>
       <w:t>Status: </w:t>
       <field1><w:t>Complete</w:t></field1>
     </w:p>
     <w:p>
       <w:t>Approval: </w:t>
       <field2><w:t>Submitted</w:t></field2>
     </w:p>
    </w:body>

    What I’m trying to understand is where the rich formatting is coming into play. Are the users formatting the values of one of your nodes, or is it somewhere else in the file that you don’t care about? When you save, do you want to throw away just the formatting? Or do you want to throw away the data that they’ve edited too?

    -Brian

  20. dinko.fabricni@perpetuum.hr says:

    Hi,

    The problem is that I have something like this when I save XML data only:

    <dinko>
     Document title
     some formatted text, document body that never changes
     <field1>Complete</field1>
     Text text text text text text etc   <field2>Submitted</field2>
    </dinko>

    I want to generate XML document like this:

    <dinko>
     <field1>Complete</field1>
     <field2>Submitted</field2>
    </dinko>

    and apply an XSLT that contains all the formatting and text. When the end user wants to change something in the template, he would have to edit it in Word, save it as XML, and use WML2XSLT.EXE to generate the XSLT, and the workflow application would again just create the same XML:

    <dinko>
     <field1>Complete</field1>
     <field2>Submitted</field2>
    </dinko>

    Or in simpler words, I want my end users to be able to change the fonts, look and appearance of a document as long as they leave all the necessary XML tags inside the document.

    My problem is that the main element <dinko> wraps all the text in the document, and I don’t want it to do that.

  21. BrianJones says:

    Have you tried to use the option "ignore mixed content"?

    It’s one of the XML options you can set, and it will basically treat all mixed content as presentation text, and only preserve the content that is in leaf nodes.

    Let me know if that is the type of functionality you are looking for.

    -Brian

  22. dinko.fabricni@perpetuum.hr says:

    This was too easy 🙂

    It works just the way I need it to.

    Thank you for your time!

  23. Kris says:

    Hello Brian,

    My problem is very similar to what Dinko explained in the previous posts, except that I want to know how I can programmatically load the data only XML (whose values change) and apply the same XSLT (generated by WML2XSLT) to get the Word 2003 document.

    Thanks,

    Kris

  24. Denise Grayson says:

    Using the data only xslt given above, my elements are filled with the namespaces i.e.

    <DISA_Header xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:sl="http://schemas.microsoft.com/schemaLibrary/2003/core" xmlns:aml="http://schemas.microsoft.com/aml/2001/core" xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:wsp="http://schemas.microsoft.com/office/word/2003/wordml/sp2" xmlns:ns6="urn:schema4">DEFENSE INFORMATION SYSTEMS AGENCY PACIFIC (DISA-PAC)THEATER NETOPS CENTER (TNC) NETDEFENSE (ND) PACIFIC</DISA_Header>

    I was wanting to remove the namespaces from the element tag…how could I do that?

    Thanks,

    Denise

  25. crg says:

    How do I get around the following limitation of Word ML converting each line feed/carriage return into a space?  

    Do I have to write a program/script to replace the line feeds with <w:br/>???

    Is there an easier way to handle this within WordML??

    I have the following

    <w:t>
    line 1
    line 2
    </w:t>

  26. BrianJones says:

    Hi Denise, there is a tag (or attribute) in XSLT that allows you to specify namespaces that you’d like to have omitted from the output. You could probably add that on and it should work the way you want.
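The attribute in question is most likely exclude-result-prefixes on xsl:stylesheet. One caveat: in XSLT 1.0 it only applies to literal result elements, so if a transform copies your elements with xsl:copy, the namespace declarations in scope in the source document still come along. Assuming that is what is happening here (not verified against the posted transform), one workaround is a small second pass over the data-only output that rebuilds each element by name, so only the namespaces actually used get declared:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch: second pass that drops unused namespace declarations. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" encoding="UTF-8"/>

  <!-- Rebuild each element from its name and namespace URI instead of
       copying it, so the source's extra namespace nodes are left behind -->
  <xsl:template match="*">
    <xsl:element name="{name()}" namespace="{namespace-uri()}">
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>

  <!-- Text, comments and processing instructions pass through unchanged -->
  <xsl:template match="text() | comment() | processing-instruction()">
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>
```

The same xsl:element template could instead replace the copying template inside the data-only transform itself, which would avoid the extra pass.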

    CRG, if you want to enforce those line feeds, then you’ll need to replace them with the <w:br/> element.
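That replacement doesn't have to be a separate program or script; it can be done in XSLT too. Here is a minimal sketch, assuming the line feeds sit inside w:t elements as in CRG's sample: an identity transform copies everything as-is, and a recursive named template (split-lines, a name made up for this example) breaks any w:t containing a line feed into separate w:t elements with a w:br between them.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch: turn line feeds inside w:t into w:br elements. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">

  <!-- Identity transform: copy everything that is not handled below -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Only text elements that actually contain a line feed are rewritten -->
  <xsl:template match="w:t[contains(., '&#10;')]">
    <xsl:call-template name="split-lines">
      <xsl:with-param name="text" select="string(.)"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Emit the text before the first line feed, a break, then recurse
       on the remainder until no line feeds are left -->
  <xsl:template name="split-lines">
    <xsl:param name="text"/>
    <xsl:choose>
      <xsl:when test="contains($text, '&#10;')">
        <w:t><xsl:value-of select="substring-before($text, '&#10;')"/></w:t>
        <w:br/>
        <xsl:call-template name="split-lines">
          <xsl:with-param name="text" select="substring-after($text, '&#10;')"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <w:t><xsl:value-of select="$text"/></w:t>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

A line feed at the very start or end of the text produces an empty w:t next to a break, so you may want to trim the text first if your source looks like the sample above.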