Transforming Open XML Word-Processing Documents to XHtml

Over the next couple of weeks, I'm going to write some LINQ to XML code to transform Open XML word-processing documents to XHtml.  Just for fun, I'm going to post my progress as I go: posting the code, talking about the issues I come across, and in general being transparent about the development process.  I welcome your thoughts and opinions.  And shortly, we'll have a useful chunk of code that we can use in a variety of cool scenarios.

This is one in a series of posts on transforming Open XML WordprocessingML to XHtml.  You can find the complete list of posts here.

This code will be part of the PowerTools for Open XML project and will be released under the Ms-PL license.  I'll be posting zip files there.

A few example scenarios:

  • Convert word docs to Html, populate a SharePoint wiki.
  • Find some text in a word doc, grab the three paragraphs before and after, change the formatting of the Open XML text that we found, and convert the chunk to Html, then display the results, with the found text highlighted in some fashion.
  • Build a specialized Html converter for my blog, which puts a 'Copy Code' button above each code snippet.
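The second scenario (find some text, then grab the three paragraphs before and after) can be sketched roughly as follows. This is an illustrative Python sketch using the standard library's ElementTree, not the LINQ to XML code this series will develop; the function names `paragraph_text` and `context_window` are hypothetical, and it assumes the paragraphs of interest are direct children of `w:body`.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(name):
    """Qualified WordprocessingML tag name in ElementTree's {ns}name form."""
    return "{%s}%s" % (W, name)


def paragraph_text(p):
    """Concatenate the text of every w:t in the paragraph, across runs."""
    return "".join(t.text or "" for t in p.iter(w("t")))


def context_window(body, needle, radius=3):
    """Return the first paragraph containing needle plus up to `radius`
    paragraphs on either side; an empty list if the text is not found."""
    paragraphs = body.findall(w("p"))
    for i, p in enumerate(paragraphs):
        if needle in paragraph_text(p):
            return paragraphs[max(0, i - radius): i + radius + 1]
    return []
```

Joining the text of all runs before searching matters because Word often splits a phrase across several runs, so a per-run search would miss it.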

So, as I start, here are some thoughts and ideas I have about this project:

I'll try to transform to XHTML that is validated against the schema, unless I run into blocking issues.

There already is code plus an XSLT style sheet that can convert Open XML word-processing docs to HTML.  This is the CodePlex/OpenXmlViewer project.  I have different goals from that project – that project aims for high fidelity (the resulting HTML looks as close as possible to the original word-processing document), and is (I think) primarily used as a browser plug-in.  My goals – less effort spent on full fidelity, and more on making this easy and convenient for developers to modify and enhance for specific development efforts.

Also, I want to be able to convert a small selected chunk of a word-processing document, whereas the OpenXmlViewer project converts entire documents.

Finally, I want a chunk of code that is super-easy for a C# developer to customize and incorporate into another application.

I'm going to write this code as a pure functional transform that uses recursion.  After a fairly long selection process, I've settled on this approach for a variety of reasons.  I don't want to use XSLT, as I'm going to add extension points where developers can interject their own custom transformations for specific pieces.  For example, a developer can provide a lambda expression (delegate) for images – the lambda gets an image as an argument – you can do what you want with the image – post it on a server, or whatever, and then return the link to the transform.  This will give wide latitude in how you deal with images.
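To make the shape of this concrete, here is a minimal sketch of a recursive transform with an image extension point. It is written in Python with the standard library's ElementTree purely for illustration; it is not the LINQ to XML code this series will produce. The names `transform`, `build`, and `image_handler` are hypothetical, the handler here receives only the image's relationship id as a stand-in for the image itself, and the output uses plain tag names without the XHTML namespace.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def w(name):
    """Qualified WordprocessingML tag name in ElementTree's {ns}name form."""
    return "{%s}%s" % (W, name)


def build(tag, items):
    """Create a new element from a mixed list of strings and elements."""
    el = ET.Element(tag)
    for item in items:
        if isinstance(item, str):
            # ElementTree stores text on the parent or on the previous sibling's tail.
            if len(el):
                el[-1].tail = (el[-1].tail or "") + item
            else:
                el.text = (el.text or "") + item
        else:
            el.append(item)
    return el


def transform(node, image_handler):
    """Recursively map a WordprocessingML node to new HTML content.

    Returns a list of strings and elements. The input tree is never modified.
    """
    def children():
        return [out for child in node for out in transform(child, image_handler)]

    if node.tag == w("body"):
        return [build("div", children())]
    if node.tag == w("p"):
        return [build("p", children())]
    if node.tag == w("t"):
        return [node.text or ""]
    if node.tag == w("drawing"):
        blip = node.find(".//{%s}blip" % A)
        rid = None if blip is None else blip.get("{%s}embed" % R)
        if rid is None:
            return []
        # Extension point: the caller decides where the image lives.
        return [ET.Element("img", {"src": image_handler(rid)})]
    # Anything else (runs, run properties, ...) contributes only its children.
    return children()
```

A caller might pass something like `lambda rid: "https://example.invalid/images/%s.png" % rid`, or a function that uploads the image to a server and returns the resulting link.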

I've discarded the approach of using annotations for doing document-centric transforms, as it has performance issues when used for large documents.  (Actually, I'm not completely sure about this, but I've had the sense of this as I've written various transforms on larger documents.)  In contrast, pure functional recursive transforms are blindingly fast.  The code that I wrote in the recursive style to accept revisions does no fewer than seven successive transformations, each producing an entirely new tree, and the code is fast.  On a single-core 2 GHz Dell D600, it processes extremely large documents (800 pages) in less than a second.

The disadvantage of using recursive LINQ transforms is that not too many developers are comfortable with this style of development.  However, if I do this properly, developers won't need to plumb the depths of the transform, and instead can use it as a 'black box'.  Besides, if I make this process transparent, maybe more developers will understand the power of this approach.

One key aspect of the approach I'll take: I'll accept all tracked revisions before doing the conversion.  This will make my code much simpler to write.  The resulting code will be more robust.  I've been postponing writing the HTML converter until I had a revision accepter that I am satisfied with.
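As a rough illustration of why accepting revisions first simplifies everything downstream, the sketch below handles only the two simplest cases: content inside `w:del` is dropped, and content inside `w:ins` is unwrapped and kept. A real revision accepter has to deal with many more kinds of tracked change, so treat this as a toy. It is Python with ElementTree rather than LINQ to XML, and the function name `accept` is hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(name):
    """Qualified WordprocessingML tag name in ElementTree's {ns}name form."""
    return "{%s}%s" % (W, name)


def accept(node):
    """Return a list of new nodes with simple insertions and deletions accepted.

    The input tree is left untouched; every node in the result is a fresh copy.
    """
    if node.tag == w("del"):
        # Accepting a deletion: the deleted content disappears.
        return []
    if node.tag == w("ins"):
        # Accepting an insertion: keep the content, drop the wrapper.
        return [out for child in node for out in accept(child)]
    copy = ET.Element(node.tag, dict(node.attrib))
    copy.text, copy.tail = node.text, node.tail
    for child in node:
        copy.extend(accept(child))
    return [copy]
```

After a pass like this, the converter never sees `w:ins` or `w:del`, so it needs no special cases for them.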

I'm also considering doing an initial transformation to the simplest word-processing markup possible.  For example, if there are two adjacent runs that have the same formatting, I can combine them into a single run.  I'll also discard superfluous markup, such as proofing errors.  If I simplify the markup, then there are more possibilities for straight one-to-one conversions between the Open XML markup and HTML.
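A simplification pass of this kind might look roughly like the following. This is a sketch under stated assumptions, in Python with ElementTree rather than LINQ to XML: it only merges runs that contain a single `w:t` (plus optional `w:rPr`), it compares formatting by looking at the direct children of `w:rPr` and their attributes, and the names `simplify_paragraph`, `format_key`, and `is_text_only` are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def w(name):
    """Qualified WordprocessingML tag name in ElementTree's {ns}name form."""
    return "{%s}%s" % (W, name)


def format_key(run):
    """A comparable summary of a run's formatting (direct rPr children only)."""
    rpr = run.find(w("rPr"))
    if rpr is None:
        return ()
    return tuple(sorted((c.tag, tuple(sorted(c.attrib.items()))) for c in rpr))


def is_text_only(run):
    """True if the run holds exactly one w:t besides its optional w:rPr."""
    return [c.tag for c in run if c.tag != w("rPr")] == [w("t")]


def simplify_paragraph(p):
    """Return a new paragraph with proofing marks dropped and adjacent,
    identically formatted text runs merged. The input is not modified."""
    out = ET.Element(p.tag, dict(p.attrib))
    for child in p:
        if child.tag == w("proofErr"):
            continue  # superfluous markup
        prev = out[-1] if len(out) else None
        if (child.tag == w("r") and prev is not None and prev.tag == w("r")
                and is_text_only(child) and is_text_only(prev)
                and format_key(child) == format_key(prev)):
            target = prev.find(w("t"))
            target.text = (target.text or "") + (child.find(w("t")).text or "")
            # Keep leading/trailing spaces significant in the merged text.
            target.set(XML_SPACE, "preserve")
        else:
            out.append(copy.deepcopy(child))
    return out
```

Removing `w:proofErr` first is what makes the merge effective, since spelling and grammar markers often sit between two runs that would otherwise be adjacent.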

I think that it would be useful to preserve bookmarks and internal links, and construct the corresponding markup in HTML.
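One possible mapping, shown here only as a sketch: a `w:bookmarkStart` becomes an empty `a` element carrying an `id`, and a `w:hyperlink` with a `w:anchor` attribute becomes an `a` element whose `href` points at that id. The code is Python with ElementTree rather than LINQ to XML, it flattens the link's runs to plain text, and the function names `paragraph_to_html` and `append_text` are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(name):
    """Qualified WordprocessingML tag name in ElementTree's {ns}name form."""
    return "{%s}%s" % (W, name)


def append_text(parent, text):
    """Append text after the last child of parent, or as its leading text."""
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def paragraph_to_html(p):
    """Build a new <p> element, preserving bookmarks and internal links."""
    html_p = ET.Element("p")
    for child in p:
        if child.tag == w("bookmarkStart"):
            name = child.get(w("name"))
            if name:
                html_p.append(ET.Element("a", {"id": name}))
        elif child.tag == w("hyperlink") and child.get(w("anchor")):
            link = ET.Element("a", {"href": "#" + child.get(w("anchor"))})
            link.text = "".join(t.text or "" for t in child.iter(w("t")))
            html_p.append(link)
        elif child.tag == w("r"):
            append_text(html_p, "".join(t.text or "" for t in child.iter(w("t"))))
        # w:bookmarkEnd and anything else is ignored in this sketch.
    return html_p
```

Hyperlinks to external targets use a relationship id instead of `w:anchor` and would need to be resolved through the document's relationships, which this sketch does not attempt.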

For this initial version, I'm going to discard comments.  It could be interesting to build a conversion that surfaces comments, but this isn't one of the main scenarios.  In most cases, we don't want comments placed into the resulting HTML, I think.

I probably will also discard footnotes and endnotes in the resulting HTML.  These are interesting, but probably only to a small subset of developers.  If there is a lot of demand for this, then I can enhance the code later to incorporate these conversions.  But I'd have to decide how I would want them to be rendered in HTML, and this is a more complicated decision.

I want this code to have the highest fidelity that I can accomplish without jumping through too many hoops.  Key goals – preservation of textual content – if there's text in the source document, the text shows up in the same place, with the same font, in the resulting HTML.  Images should convert and show up in the same place, and if there is an easy way to make the text flow in the same fashion as the source document, the HTML will do so.

Because I'm writing this in the pure functional recursive style, you can almost prove that a) the code can't fail, because I'll make every effort to reduce 'points of possible failure', and b) the code can only produce valid XHtml.  This adds robustness and reliability to applications that use this code.

Finally, I want to write this transform in the smallest amount of code possible.  My off-the-cuff estimate is that the conversion should be 1000 lines of code or less.  But we'll see how large the code becomes as I progress.

Anyway, on to more research and coding.  I'll post the next update in about a week.  This is going to be fun!