Converting WordprocessingML into HTML (for easy viewing)


Many people have asked me if there is an easy way to go from Word XML into XHTML.


I’ve already mentioned that we have a tool available that transforms from Word 2003 XML into HTML. You can download it here: http://www.microsoft.com/downloads/details.aspx?FamilyID=19676b18-1bcd-4852-93ba-0b5a203ea731&displaylang=en


This is actually a pretty cool tool, and there are a number of ways you can extend it. By default it adds a behavior to IE so that any XML file carrying the Word PI (processing instruction) automatically has an XSLT applied that converts it into HTML the browser can render. It also saves the embedded images to a temp location so that the resulting HTML can reference them.
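To make those two steps concrete, here is a rough Python sketch of checking for the Word PI and writing the embedded images out to a folder. It is not the viewer's code, just an illustration. It assumes the PI is `<?mso-application progid="Word.Document"?>` and that images sit in `w:binData` elements as base64 text with a `w:name` such as `wordml://01000001.png`. The function names and the sample document are made up for the example.

```python
import base64
import tempfile
from pathlib import Path
from xml.dom import minidom
from xml.dom.minidom import Node

# Namespace used by Word 2003 XML (WordprocessingML).
WORDML_NS = "http://schemas.microsoft.com/office/word/2003/wordml"

# Tiny hypothetical document; the "image" payload is just the bytes b"hello".
SAMPLE = (
    '<?xml version="1.0"?>'
    '<?mso-application progid="Word.Document"?>'
    '<w:wordDocument xmlns:w="' + WORDML_NS + '">'
    '<w:body><w:p><w:r><w:pict>'
    '<w:binData w:name="wordml://01000001.png">aGVsbG8=</w:binData>'
    '</w:pict></w:r></w:p></w:body></w:wordDocument>'
)


def is_word_xml(doc):
    """Return True if the prolog carries the Word processing instruction."""
    for node in doc.childNodes:
        if (node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
                and node.target == "mso-application"
                and "Word.Document" in node.data):
            return True
    return False


def extract_images(doc, out_dir):
    """Decode every w:binData element into a file under out_dir."""
    saved = []
    for el in doc.getElementsByTagNameNS(WORDML_NS, "binData"):
        name = el.getAttributeNS(WORDML_NS, "name")
        # Keep only the last path segment; fall back to a generated name.
        filename = name.rsplit("/", 1)[-1] or f"image{len(saved)}.bin"
        payload = "".join(
            child.data for child in el.childNodes
            if child.nodeType == Node.TEXT_NODE
        )
        path = Path(out_dir) / filename
        path.write_bytes(base64.b64decode(payload))
        saved.append(path)
    return saved


if __name__ == "__main__":
    document = minidom.parseString(SAMPLE)
    if is_word_xml(document):
        target = tempfile.mkdtemp(prefix="wordml-images-")
        for image_path in extract_images(document, target):
            print("saved", image_path)
```

The generated HTML would then point its image references at the saved files rather than at the `wordml://` names.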


You can also write your own XSLTs and register them for the viewer to use. Then when you open a Word XML file, you will have the choice of XSLTs to apply. This is just handled with the schema library (the same way you can register XSLTs for Word to apply when it opens your XML).
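If you are wondering what a custom transform has to deal with, the sketch below walks a Word XML document in plain Python and emits very basic HTML. To be clear, this swaps XSLT for a Python tree walk purely for illustration, and it is nowhere near what the viewer's XSLT handles. It assumes the Word 2003 element names `w:p`, `w:r`, `w:rPr`, `w:b`, `w:i` and `w:t`. The function name `wordml_to_html` is invented for the example.

```python
import html
import xml.etree.ElementTree as ET

# ElementTree spells namespaced tags as "{namespace}localname".
W = "{http://schemas.microsoft.com/office/word/2003/wordml}"


def wordml_to_html(xml_text):
    """Map paragraphs and runs to <p>, <b> and <i>; ignore everything else."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for p in root.iter(W + "p"):
        runs = []
        for r in p.iter(W + "r"):
            # A run's text lives in one or more w:t children.
            text = "".join(t.text or "" for t in r.iter(W + "t"))
            text = html.escape(text)
            props = r.find(W + "rPr")
            # Simplification: presence of w:b / w:i is treated as "on",
            # without checking for an explicit w:val="off".
            if props is not None and props.find(W + "b") is not None:
                text = f"<b>{text}</b>"
            if props is not None and props.find(W + "i") is not None:
                text = f"<i>{text}</i>"
            runs.append(text)
        paragraphs.append("<p>" + "".join(runs) + "</p>")
    return "<html><body>" + "".join(paragraphs) + "</body></html>"


if __name__ == "__main__":
    sample = (
        '<w:wordDocument xmlns:w='
        '"http://schemas.microsoft.com/office/word/2003/wordml">'
        '<w:body><w:p>'
        '<w:r><w:rPr><w:b/></w:rPr><w:t>Hello</w:t></w:r>'
        '<w:r><w:t> world</w:t></w:r>'
        '</w:p></w:body></w:wordDocument>'
    )
    print(wordml_to_html(sample))
```

Styles, tables, lists and images are all left out here, which is a good hint at why a real transform is a lot longer.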


Additionally, if you want your own XML rendered in the browser with views you define, you can put a PI in those files that identifies them as yours, and then register with the tool so that it renders them with your XSLT instead of the default one that IE applies.


There are other folks out there who are also building tools on top of Word XML. Here’s a blog I was just pointed at this morning where Oleg has worked on modifying an earlier version of the XSLT that we had released: http://www.tkachenko.com/blog/archives/000195.html


If anyone else has a tool they’ve built on top of WordprocessingML or SpreadsheetML, please send me the links. I’d love to take a look.


-Brian

Comments (13)

  1. Sorry for sidetracking the thread. I found this ^^^ article ^^^ linked from SlashDot. What do you think?

  2. BrianJones says:

    No problem, although I would like to keep this thread more focused on questions around the XSLTs…

    I’ve seen a number of discussions over the past year or two around the need for formula support in OpenDocument. I’ve stayed away from commenting directly on the OpenDocument schema. I think the use of XML for a document format is great, and I don’t want to take anything away from what they’ve done.

    I did recently start questioning the licenses a bit, but that was because I was curious to compare their license with ours. I’ve had some people comment on the two being nearly identical and others have said they are dramatically different; so I just wanted to take a look.

    From reading the article, it sounds like the thought was that they would standardize around the presentation aspect of the formats only. It’s a bit unfortunate since the result of a formula does affect the ultimate display. In fact, formula results are often the most important part of the spreadsheet.

    Did the original StarOffice format have formulas defined in its schema? Did they decide to push only some of the schema through OASIS?

    If this is an area folks are interested in, let me know. I can post some examples of Excel’s schema for formulas…

    -Brian

  3. Ian Morrish says:

    Using the SharePoint XML web part to render WordML

    http://www.wssdemo.com/Pages/WordContent.aspx?menu=Web%20Parts%20-%20WSS

    Using the Data View Web Part to render WordML and other information (document library version info etc) similar to a wiki

    http://www.wssdemo.com/wiki

  4. BrianJones says:

    SlashDotJunkie – Same comments apply. I was able to get to the article because you had mapped the URL to your username.

    Ian – Thanks for the links! I’m going to start to pull together a collection of links. I used to have a bunch of them but can’t find them now, so I have to start over. :-)

  5. Craig Ringer says:

    The MS Word XML format looks really interesting. I’ve found your writing about it here fascinating – especially on things like merging it with custom in-house schema and transforming it with XSLT.

    I’m responsible for much of the information management at work, and the possibilities there are exciting. Currently, I’m forced to convert Word docs to plain text at quite a few points in our workflow. In many cases it’d be strongly preferable to do so with more control (such as I could get with a custom XSLT filter), and in other places I’d prefer not to do so at all, instead storing the Word doc with some additional markup. Being able to insert my own markup would be incredibly valuable, and I’m already thinking of how I can tweak the story archive database to use it…

    I’m curious about how Word internally handles "foreign" markup. I’ve always feared the potentially large memory use, complexity, and performance cost of using an XML DOM to keep the document structure in memory. On the other hand, I don’t really see how else to preserve foreign markup, yet I also don’t see how to avoid potentially huge memory costs and the fiddliness of working on a DOM tree. It’d be fascinating to know how Word achieves this, and what you found challenging along the way.

    Many people who implement XML formats seem to use XML as a substitute for the basic binary structure, but still think the same way – the app still reads the file into memory then writes out an entirely new one (SAX-style), it fails with unrecognised markup or ignores it, rather than preserving it, etc. I’m pleasantly surprised that Office will be different, and I think many developers, including those who want to build on top of Office or build products that work with it, could learn a lot from how you did it.

    I’m also very excited by the possibility of writing a simple tool to extract all images from a Word doc. This has driven me nuts for years, as clients at work often send Word docs (sometimes containing nothing but an image) when asked to "please send each image separately" for publishing. Extracting them to date has consisted of ugly workarounds.

    See, when I say I’m an MS customer, that’s not just some "so you should listen to me" thing. As if – the revenue from me and my employer is tiny. The point is that I care about what you’re doing and find it really interesting and exciting. Sufficiently so in this case that for the first time ever I’m seeing features that make an upgrade from Word 2000 on our oldest Windows clients look attractive.

  6. G. Tarazi says:

    When will we be able to convert InfoPath XHTML to WordML properly, and on the server side?

    Here is an example:

    Create a form template in InfoPath, with a rich text field.

    Fill it out a couple of times, and save the XML files.

    Write an XSLT to view the files in Word, as a report for example.

    Result: the XHTML cannot be viewed or converted, and if you use client-side scripts to perform a copy and paste in the background (via Word automation), the results are horrible.

    Having WordML go to HTML is nice, but having InfoPath’s XHTML go to WordML for reporting is critical! And it is still missing, even 2 years after the products’ release.

    Hey people, Word is nice software, and it can become a great reporting engine, but as long as XHTML cannot be viewed in Word, it is damn unattractive!

  7. Sadly, my developer moves are not agile enough to be timely with this post: I am still working on my utility that converts WordProcessingML into XHTML using VSTO 1.x. Please see "Dr. Peter Sefton of The University of Southern Queensland calls Brian Jones of Microsoft “Glib”" here:

    http://www.kintespace.com/rasxlog/?p=198

    Please excuse the title of the Blog post. I have been told that, instead of humour, insult comes out…

  8. Giuseppe gRILLO says:

    Excuse me, I am an Italian student in the Faculty of Engineering at the University of Calabria. I read your message at: http://blogs.msdn.com/brian_jones/archive/2005/09/30/475794.aspx

    I would like to know whether an XSLT exists that translates a WordML document into HTML and, if so, whether you could supply it to me.

    Thanks  giugrillo@libero.it

  9. BrianJones says:

    Giuseppe, if you download this viewer http://www.microsoft.com/downloads/details.aspx?FamilyID=19676b18-1bcd-4852-93ba-0b5a203ea731&displaylang=en

    you’ll see that an XSLT also comes along that goes from WordML to HTML.

  10. meet says:

    Hi, I need code with which I can extract data from a doc to an XML file. Can anyone give me a solution?

    Thanks

    Please mail me :)

    meetesh.mishra@gmail.com

  11. The SharePoint team recently posted an article on OpenXMLdeveloper.org on how they allow you to convert