Transforming Open XML Word-Processing Documents to XHtml (Post #2)

Last week, I blogged about a small project that I'm embarking on: to make a reasonably accurate transform from Open XML word-processing markup to XHTML.  I wrote about the approach that I'll be taking, and my initial thoughts about how to proceed.  I've done a bit of research, and this week, I'll lay out more details about the approach that I'll take.

This is one in a series of posts on transforming Open XML WordprocessingML to XHtml.  You can find the complete list of posts here.

One small note about this series of blog posts: these are going to be much more ad hoc than my usual posts.  If I go down the wrong path, then you'll see this :-).  Also, I'm not going to spend too much time writing and re-writing the posts.

One of the key aspects of the approach that I'll take is to use the power of CSS:

  • I'll generate a CSS style for every block-level style that is used in the word-processing document.  I'll try, as far as possible, to generate the appropriate CSS style that will render the word-processing document accurately.  These styles will be applied to p and div elements.
  • I'll generate a CSS style for in-line styles.  These styles will be applied to span elements.
  • I'll prefix the generated style names with Ptoxml (PowerTools for Open XML), e.g. PtoxmlNormal, PtoxmlH1, etc.  If this generated markup is embedded in other HTML, this will prevent class name collisions.
  • The classes for block-level and in-line styles will be generated in an internal style sheet.
  • If there is direct formatting applied to a paragraph or to a run within a paragraph, I'll generate the appropriate CSS as an in-line style.  My goal here is not to generate a document where the content is perfectly separated from the presentation.  Instead, my goal is to provide a conversion of a small chunk of word-processing markup to usable XHTML that can be used programmatically in a variety of contexts.  Open XML word-processing markup has 'cascading' semantics: a paragraph can be of a specific style, and the user can override aspects of that style for an individual paragraph.  This is a direct parallel to the semantics of CSS: a paragraph can be of a specific class, and properties of that class can be overridden for an individual paragraph with an in-line style.
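
To make the bullets above a little more concrete, here is a minimal sketch of the naming and style-sheet idea.  Everything in it is an assumption for illustration: the class CssSketch, its method names, and the CSS declarations passed in are hypothetical, and the real converter may well end up structured differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class CssSketch
{
    static readonly XNamespace xhtml = "http://www.w3.org/1999/xhtml";

    // Prefix a word-processing style id so that the class name cannot collide
    // with classes in a host page, e.g. "Normal" -> "PtoxmlNormal".
    public static string ClassName(string styleId)
    {
        return "Ptoxml" + styleId;
    }

    // Build the internal style sheet: one rule per style that is used.
    // The dictionary maps a style id to its CSS declarations (hypothetical input).
    public static XElement InternalStyleSheet(IDictionary<string, string> cssByStyleId)
    {
        string rules = string.Join(Environment.NewLine,
            cssByStyleId
                .Select(kv => "." + ClassName(kv.Key) + " { " + kv.Value + " }")
                .ToArray());
        return new XElement(xhtml + "style",
            new XAttribute("type", "text/css"),
            rules);
    }

    // A paragraph takes its class from the style.  Direct formatting, if any,
    // is written as an in-line style, which overrides the class.
    public static XElement Paragraph(string styleId, string directCss, string text)
    {
        return new XElement(xhtml + "p",
            new XAttribute("class", ClassName(styleId)),
            directCss != null ? new XAttribute("style", directCss) : null,
            text);
    }
}
```

The in-line style attribute is only emitted when there is direct formatting; a null content item is simply ignored by the XElement constructor, so paragraphs without overrides carry nothing but the class.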

One key aspect of the approach that I'm going to take: I am not going to translate numbered/bulleted items from word-processing markup to li elements in the XHTML.  Instead, I'm going to generate paragraphs of a particular class, and format that class using CSS as appropriate, so that numbered items and bulleted lists are rendered properly.  While numbered items that are formatted in a simple way translate to li elements in the XHTML markup, the capabilities of numbered items in word-processing markup are rich (RICH!), and as soon as the markup uses more than the most rudimentary capabilities, the translation breaks down.  This has been one of the biggest complaints about other projects that convert Open XML to HTML: numbered items aren't translated properly.  I could go down the road of translating rudimentary numbered items to li elements, and then translate the richer variations into paragraphs, but this is messy.  Instead, I believe that I'm going to avoid li elements altogether.
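
As a rough picture of what "paragraphs of a particular class" could look like, here is a small sketch.  The names ListItemSketch and NumberedParagraph are hypothetical, and the sketch assumes that the text of the number or bullet has already been worked out elsewhere from the numbering definitions, which is the hard part and is not shown.

```csharp
using System.Xml.Linq;

public static class ListItemSketch
{
    static readonly XNamespace xhtml = "http://www.w3.org/1999/xhtml";

    // Render one numbered or bulleted item as an ordinary paragraph.
    // 'label' is the already-formatted number or bullet ("1.", "2.1.3", "(a)");
    // computing it from the numbering part is assumed to happen elsewhere.
    public static XElement NumberedParagraph(string cssClass, string label, string text)
    {
        return new XElement(xhtml + "p",
            new XAttribute("class", cssClass),
            new XElement(xhtml + "span", label),
            " ",
            text);
    }
}
```

Because the label is just text in a span, any numbering format can be carried over verbatim, and the class on the paragraph is free to supply indentation through CSS.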

As I've researched how I'll implement this, I've decided on a few limitations:

  • Multi-column layout will be converted to single column layout.  We're more concerned about accurately surfacing the content than exact representation.
  • Themes will be converted to straight CSS styles, both at the class level, and where overridden, at the in-line level.  The abstraction of themes won't be carried over to the XHTML.
  • In cases where there is no direct correspondence between CSS styles and the specific representation in word-processing markup, I'll simplify the representation to whatever can be represented in CSS.  For instance, there are a lot of varieties of underline styles in word-processing markup.  All underline styles will be transformed to a simple underline in the generated CSS and XHTML.
  • Open XML word-processing markup has an abstraction of tabs: you can display the ruler above your document, below the ribbon, set tab stops on it, and then insert tabs in your document.  There is no corresponding abstraction in CSS/XHTML.  This could be approximated using spaces, but at best, it will be inaccurate, because text won't line up in columns.  My personal experience is that these days, people prefer tables for laying out text instead of tabs, and tables do translate properly.  I think that for phase 1, I'll not attempt any sort of hacked conversion, but I've not yet decided on how to convert these.  It could be neat to convert tabbed text to tables with invisible grids with merged cells, but I'm not sure how this would work practically.  One problem is that numbered/bulleted items in word-processing markup make use of physical tabs, so if I don't have a way to render tabs, there will be a small loss in fidelity of placement of text of numbered items.
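
The underline simplification described in the list above could be sketched roughly as follows.  The names UnderlineSketch and UnderlineCss are hypothetical.  The sketch assumes that underlining is given by a w:u element in the run properties, and it treats a missing w:u or a w:val of "none" as "no underline"; every other underline variety collapses to a plain CSS underline.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class UnderlineSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the CSS declaration for a run's underline, or null if the run
    // is not underlined.  All underline varieties (double, wave, dotted, ...)
    // are deliberately simplified to a single plain underline.
    public static string UnderlineCss(XElement rPr)
    {
        if (rPr == null)
            return null;
        string val = (string)rPr
            .Elements(w + "u")
            .Attributes(w + "val")
            .FirstOrDefault();
        if (val == null || val == "none")
            return null;
        return "text-decoration: underline";
    }
}
```

The same pattern (look up the property, then map every value that CSS cannot express onto the nearest value that it can) would presumably apply to the other places where the markup is richer than CSS.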

I'm sure that I'll discover other places where I will want to place limits on the transform.

The last thing I'll present in this post is the skeleton for the conversion.  The following code will do a simplistic transform of simple Open XML documents to simple XHTML.  I can then build and extend this code, handling more and more sophisticated varieties of markup.  For a detailed explanation of how this type of transform works, see the post, Recursive Pure Functional Transforms of XML.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

namespace HtmlConverter
{
    public static class Extensions
    {
        // Load the XML of a part into an XDocument, caching it as an annotation
        // on the part so that later calls return the same XDocument.
        public static XDocument GetXDocument(this OpenXmlPart part)
        {
            XDocument partXDocument = part.Annotation<XDocument>();
            if (partXDocument != null)
                return partXDocument;
            using (Stream partStream = part.GetStream())
            using (XmlReader partXmlReader = XmlReader.Create(partStream))
                partXDocument = XDocument.Load(partXmlReader);
            part.AddAnnotation(partXDocument);
            return partXDocument;
        }

        // Concatenate a sequence of strings into a single string.
        public static string StringConcatenate(this IEnumerable<string> source)
        {
            StringBuilder sb = new StringBuilder();
            foreach (string s in source)
                sb.Append(s);
            return sb.ToString();
        }
    }

    public static class W
    {
        // The WordprocessingML namespace name uses http, not https.
        public static XNamespace w =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
        public static XName body = w + "body";
        public static XName document = w + "document";
        public static XName p = w + "p";
        public static XName pPr = w + "pPr";
        public static XName r = w + "r";
        public static XName rPr = w + "rPr";
        public static XName t = w + "t";
        public static XName tbl = w + "tbl";
        public static XName tc = w + "tc";
        public static XName tr = w + "tr";
        public static XName txbxContent = w + "txbxContent";
        public static XName val = w + "val";
        public static XName pStyle = w + "pStyle";
        public static XName b = w + "b";
    }

    public static class Xhtml
    {
        // The XHTML namespace name uses http, not https.
        public static XNamespace xhtml = "http://www.w3.org/1999/xhtml";
        public static XName html = xhtml + "html";
        public static XName head = xhtml + "head";
        public static XName title = xhtml + "title";
        public static XName body = xhtml + "body";
        public static XName p = xhtml + "p";
        public static XName h1 = xhtml + "h1";
        public static XName h2 = xhtml + "h2";
        // XHTML element names are case-sensitive and lower case.
        public static XName A = xhtml + "a";
        public static XName href = "href";
        public static XName b = xhtml + "b";
        public static XName table = xhtml + "table";
        public static XName border = "border";
        public static XName tr = xhtml + "tr";
        public static XName td = xhtml + "td";
    }

    public static class HtmlConverter
    {
        public static object ConvertToHtmlTransform(WordprocessingDocument wordDoc,
            XNode node)
        {
            XElement element = node as XElement;
            if (element != null)
            {
                // transform the w:document element to the XHTML h:html element,
                // adding a head with a placeholder title
                if (element.Name == W.document)
                    return new XElement(Xhtml.html,
                        new XElement(Xhtml.head,
                            new XElement(Xhtml.title, "Test.docx")
                        ),
                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e))
                    );

                // transform the w:body element to the XHTML h:body element
                if (element.Name == W.body)
                    return new XElement(Xhtml.body,
                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e)));

                // transform every Heading1 styled paragraph to the XHTML h:h1 element
                if (element.Name == W.p && (string)element
                        .Elements(W.pPr)
                        .Elements(W.pStyle)
                        .Attributes(W.val)
                        .FirstOrDefault() == "Heading1")
                    return new XElement(Xhtml.h1,
                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e)));

                // transform every Heading2 styled paragraph to the XHTML h:h2 element
                if (element.Name == W.p && (string)element
                        .Elements(W.pPr)
                        .Elements(W.pStyle)
                        .Attributes(W.val)
                        .FirstOrDefault() == "Heading2")
                    return new XElement(Xhtml.h2,
                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e)));

                // transform w:p to h:p
                if (element.Name == W.p)
                    return new XElement(Xhtml.p,
                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e)));

                // transform every text run that is styled as bold to the XHTML h:b element
                if (element.Name == W.r &&
                        element.Elements(W.rPr).Elements(W.b).Any())
                    return new XElement(Xhtml.b,
                        element.Elements(W.t).Select(e => (string)e).StringConcatenate());

                // transform every text run that is not styled as bold to a text node that
                // contains the text of the run.
                if (element.Name == W.r &&
                        !element.Elements(W.rPr).Elements(W.b).Any())
                    return new XText(element.Elements(W.t)
                        .Select(e => (string)e).StringConcatenate());

                // transform w:tbl to h:table
                if (element.Name == W.tbl)
                    return new XElement(Xhtml.table,
                        new XAttribute(Xhtml.border, 1),
                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e)));

                // transform w:tr to h:tr
                if (element.Name == W.tr)
                    return new XElement(Xhtml.tr,
                        element.Elements().Select(e => ConvertToHtmlTransform(wordDoc, e)));

                // transform w:tc to h:td