Enabling Better Transformations by Simplifying Open XML WordprocessingML Markup

02/08/2010

When transforming Open XML markup to another XML vocabulary (such as XHtml), you can sometimes simplify the transform by first transforming the original document to a new, valid WordprocessingML document that contains much simpler markup, and therefore is easier to process. WordprocessingML markup has many capabilities, such as revision tracking, content controls, and comments. You may not be interested in those capabilities for the specific transform that you are writing, and you can make your transform simpler and more robust by first removing markup that is irrelevant to your transform. This blog post describes a utility class, MarkupSimplifier, which is part of the PowerTools for Open XML project. You can find MarkupSimplifier in the HtmlConverter.zip download, under the downloads tab at PowerTools for Open XML.

This is one in a series of posts on transforming Open XML WordprocessingML to XHtml. You can find the complete list of posts here.

I'm not interested in writing the simplifications back to the source document. In Simplifying Open XML WordprocessingML Queries by First Accepting Revisions I show how to open an Open XML document that we can modify, but modifications will not be written back to the source document. By working on an in-memory document, I can make any number of transformations to the document without touching the source document on disk.

Using MarkupSimplifier is straightforward. You create and initialize a SimplifyMarkupSettings object, and pass an open word-processing document and the settings to the MarkupSimplifier.SimplifyMarkup method. When the method returns, the open document is simplified.

// Choose which kinds of markup to strip from the document.
SimplifyMarkupSettings settings = new SimplifyMarkupSettings
{
    RemoveComments = true,
    RemoveContentControls = true,
    RemoveEndAndFootNotes = true,
    RemoveFieldCodes = false,
    RemoveLastRenderedPageBreak = true,
    RemovePermissions = true,
    RemoveProof = true,
    RemoveRsidInfo = true,
    RemoveSmartTags = true,
    RemoveSoftHyphens = true,
    ReplaceTabsWithSpaces = true,
};
// wordDoc is the open word-processing document; it is simplified in place.
MarkupSimplifier.SimplifyMarkup(wordDoc, settings);

Perhaps the most useful simplification that this performs is to merge adjacent runs with identical formatting. There is no option for this; MarkupSimplifier always merges adjacent runs with identical formatting. Open XML applications, including Word, can arbitrarily split runs as necessary. If you, for instance, add a comment to a document, runs will be split at the location of the start and end of the comment. After MarkupSimplifier removes comments, it can merge runs, resulting in simpler markup. Tracked revisions also cause runs to be split. After runs are split, even if the reason that they were initially split goes away, they generally stay split.

The following is markup for a paragraph that contains a tracked revision and a comment:

<w:p w:rsidR="00DD5B8D"
     w:rsidRDefault="00A71735">
  <w:r>
    <w:t xml:space="preserve">This </w:t>
  </w:r>
  <w:ins w:id="0"
         w:author="Eric White (OFFICE)"
         w:date="2010-01-30T06:00:00Z">
    <w:r>
      <w:t xml:space="preserve">is </w:t>
    </w:r>
  </w:ins>
  <w:r>
    <w:t xml:space="preserve">a </w:t>
  </w:r>
  <w:commentRangeStart w:id="1"/>
  <w:r>
    <w:t>tes</w:t>
  </w:r>
  <w:commentRangeEnd w:id="1"/>
  <w:r>
    <w:rPr>
      <w:rStyle w:val="CommentReference"/>
    </w:rPr>
    <w:commentReference w:id="1"/>
  </w:r>
  <w:r>
    <w:t>t.</w:t>
  </w:r>
</w:p>

After accepting revisions and simplification, the markup looks like this:

<w:p>
  <w:r>
    <w:t>This is a test.</w:t>
  </w:r>
</w:p>

This is easier to transform to another XML vocabulary such as XHtml.
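To make the run-merging idea concrete, here is a minimal LINQ to XML sketch. It is not the MarkupSimplifier implementation: the class and method names are hypothetical, and it assumes that revisions have already been accepted and comment markup removed, so that the paragraph's children are plain runs.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical sketch; not the PowerTools MarkupSimplifier implementation.
public static class RunMergeSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Only runs made up of an optional w:rPr plus w:t elements are merged;
    // runs holding tabs, breaks, drawings, and so on are left alone.
    static bool IsPlainTextRun(XElement e) =>
        e.Name == W + "r" &&
        e.Elements().All(c => c.Name == W + "rPr" || c.Name == W + "t");

    // Returns a copy of the paragraph in which adjacent plain-text runs
    // with identical run properties are collapsed into one run.
    public static XElement MergeAdjacentRuns(XElement paragraph)
    {
        var result = new XElement(paragraph.Name, paragraph.Attributes());
        foreach (var child in paragraph.Elements())
        {
            if (result.LastNode is XElement previous
                && IsPlainTextRun(previous) && IsPlainTextRun(child)
                && XNode.DeepEquals(previous.Element(W + "rPr"), child.Element(W + "rPr")))
            {
                // Append this run's text to the last w:t of the previous run.
                var target = previous.Elements(W + "t").LastOrDefault();
                if (target == null)
                {
                    target = new XElement(W + "t");
                    previous.Add(target);
                }
                target.Value += string.Concat(child.Elements(W + "t").Select(t => t.Value));

                // xml:space="preserve" is kept only for leading or trailing whitespace.
                string text = target.Value;
                bool edgeSpace = text.Length > 0 &&
                    (char.IsWhiteSpace(text[0]) || char.IsWhiteSpace(text[text.Length - 1]));
                target.SetAttributeValue(XNamespace.Xml + "space", edgeSpace ? "preserve" : null);
            }
            else
            {
                result.Add(new XElement(child));
            }
        }
        return result;
    }
}
```

Comparing the w:rPr elements with XNode.DeepEquals treats two runs without run properties as identically formatted, which is the case for the five text runs in the example paragraph above.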

Varieties of Simplifying Transformations

In the next sections, I'll detail the varieties of transformations that you can make to simplify markup.

Note: MarkupSimplifier relies on tracked revisions first being accepted. I've detailed the semantics of revision tracking markup in Accepting Tracked Revisions in Open XML WordprocessingML Documents. That article also explains how to get C# code to accept revisions.

Remove Content Control and Smart Tags Markup

Content controls and smart tags are extremely powerful features of Open XML. In some scenarios, you may be interested in those pieces of markup, but in the case of my WordprocessingML => XHtml transform, I am not. I can imagine scenarios where I want to enable advanced web applications, driving them from content controls, but this would be a different transform from the one that I'm writing. These features increase the hierarchical depth of the content that they contain, and therefore complicate a transform of the content. The following example shows how the content of a content control is at a different hierarchical level from the paragraph immediately preceding it, even though it is logically part of the same collection of block-level content and needs to be processed as such.

<w:p>
  <w:r>
    <w:t>Paragraph one.</w:t>
  </w:r>
</w:p>
<w:sdt>
  <w:sdtPr>
    <w:id w:val="367332987"/>
    <w:placeholder>
      <w:docPart w:val="DefaultPlaceholder_22675703"/>
    </w:placeholder>
  </w:sdtPr>
  <w:sdtContent>
    <w:p>
      <w:r>
        <w:t>Paragraph two.</w:t>
      </w:r>
    </w:p>
  </w:sdtContent>
</w:sdt>
<w:p>
  <w:r>
    <w:t>Paragraph three.</w:t>
  </w:r>
</w:p>

After removing the content control, the resulting XML is easier to process:

<w:p>
  <w:r>
    <w:t>Paragraph one.</w:t>
  </w:r>
</w:p>
<w:p>
  <w:r>
    <w:t>Paragraph two.</w:t>
  </w:r>
</w:p>
<w:p>
  <w:r>
    <w:t>Paragraph three.</w:t>
  </w:r>
</w:p>
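One way to picture this unwrapping is a recursive copy that replaces each w:sdt element with the children of its w:sdtContent. The following is only a sketch under that assumption, with hypothetical names; it is not the MarkupSimplifier code, and it leaves smart tags alone.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical sketch; not the PowerTools MarkupSimplifier implementation.
public static class ContentControlSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Copies a node. A w:sdt element is replaced by the (transformed)
    // children of its w:sdtContent, so the wrapped content moves up to
    // the same level as its former siblings. w:sdtPr is dropped.
    static object Transform(XNode node)
    {
        if (node is XElement element)
        {
            if (element.Name == W + "sdt")
                return element.Elements(W + "sdtContent")
                              .Nodes()
                              .Select(n => Transform(n))
                              .ToList();

            return new XElement(element.Name,
                element.Attributes(),
                element.Nodes().Select(n => Transform(n)));
        }
        return node; // text, comments, and so on are cloned when added
    }

    // root is assumed to be a container such as w:body, not a w:sdt itself.
    public static XElement RemoveContentControls(XElement root) =>
        (XElement)Transform(root);
}
```

Because the transform recurses into the unwrapped content, nested content controls are flattened as well.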

Remove Rsid Information

Rsid elements and attributes enable a specific scenario: merging two documents that were forked from the same document and edited by different users. In the transform to XHtml, I am not interested in that information, so I can simplify the document by removing Rsid elements and attributes. While technically not required for the XHtml transform to work properly, removing those elements and attributes makes the simplified source document easier to examine when determining whether the transform worked properly.

Some time ago, in the blog post Removing Rsid Elements and Attributes before Comparing Open XML Documents, I discussed the reason for those elements, and the benefits of removing these elements and attributes before comparing documents.
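As a rough illustration, stripping Rsid information from a part's XML could look like the sketch below. It assumes that the attributes of interest are those in the WordprocessingML namespace whose local name starts with "rsid" (such as w:rsidR and w:rsidRDefault in the earlier example), and that the w:rsids element is the container for Rsid elements; the class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical sketch; not the PowerTools MarkupSimplifier implementation.
public static class RsidSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Removes Rsid attributes and the w:rsids container, modifying root in place.
    public static void RemoveRsidInfo(XElement root)
    {
        root.DescendantsAndSelf()
            .Attributes()
            .Where(a => a.Name.Namespace == W &&
                        a.Name.LocalName.StartsWith("rsid", StringComparison.Ordinal))
            .Remove();

        root.Descendants(W + "rsids").Remove();
    }
}
```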

Remove Comments

In the transform that I'm currently writing, I don't process comments in any form. I'm primarily interested in a reasonable-fidelity transform of just the contents of a document. I can simplify my transform by removing comments before the transform.
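In terms of the paragraph markup shown earlier, removing comments from the main document part means dropping the w:commentRangeStart and w:commentRangeEnd markers and the run that carries the w:commentReference. The sketch below does only that, under the assumption that a comment reference sits in a run of its own (as it does in the example). The names are hypothetical, and the separate comments part is not touched.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical sketch; not the PowerTools MarkupSimplifier implementation.
public static class CommentSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Removes comment markup from the given element, modifying it in place.
    public static void RemoveComments(XElement root)
    {
        // The range markers are empty elements between runs.
        root.Descendants()
            .Where(e => e.Name == W + "commentRangeStart" ||
                        e.Name == W + "commentRangeEnd")
            .Remove();

        // Assumption: a run that contains w:commentReference holds nothing
        // else of interest, so the whole run is removed.
        root.Descendants(W + "r")
            .Where(r => r.Elements(W + "commentReference").Any())
            .Remove();
    }
}
```

Once these elements are gone, the runs on either side of the former comment become adjacent and are candidates for merging.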

Remove End and Foot Notes

End and foot notes could be important for a high-fidelity transform, but in my case, I am not going to produce any HTML markup for them, so removing them makes my life simpler.

Replace Tabs with Spaces

WordprocessingML lets a document specify hard tabs at specific locations; if a w:tab element is in the paragraph, the next run is aligned with the corresponding hard tab. However, HTML has no notion of hard tabs. Replacing tabs with spaces is one of those tough choices. You can attempt to simulate hard tabs by inserting non-breaking spaces that have a specific font size, but this is at best an approximation. The resulting document may well render with ragged tabs, because text is positioned slightly differently depending on the text preceding the hard tab. My XHtml transform is focused on extracting the important content, and disregarding some aspects of formatting, so I elected to replace hard tabs with spaces. I'll revisit this issue in the future when I write a transform that contains a more accurate representation of formatting.
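A sketch of the replacement itself, with hypothetical names: each w:tab that is a direct child of a run becomes a w:t holding spaces. The number of spaces per tab is an assumption made for illustration (one by default here). Elements named w:tab that define tab stops inside w:tabs are deliberately left alone.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical sketch; not the PowerTools MarkupSimplifier implementation.
public static class TabSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Replaces tab characters in runs with spaces, modifying root in place.
    // spacesPerTab is an illustrative parameter, not a MarkupSimplifier setting.
    public static void ReplaceTabsWithSpaces(XElement root, int spacesPerTab = 1)
    {
        // Materialize the list first, because the tree is modified in the loop.
        var tabs = root.Descendants(W + "tab")
            .Where(t => t.Parent != null && t.Parent.Name == W + "r")
            .ToList();

        foreach (var tab in tabs)
        {
            tab.ReplaceWith(
                new XElement(W + "t",
                    new XAttribute(XNamespace.Xml + "space", "preserve"),
                    new string(' ', spacesPerTab)));
        }
    }
}
```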

Remove Field Codes

WordprocessingML markup contains a wide variety of field codes that have powerful functionality. In my XHtml transform, I need to process hyperlink field codes, so I don't remove field codes in the simplification transform. But I can imagine other transforms where I don't want to process field codes, so it is an option of the MarkupSimplifier class. In the future, I'd like to modify this so that field codes can be selectively removed based on the field type.
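To illustrate what selecting by field type might involve, the sketch below handles only simple fields (w:fldSimple, whose instruction is in the w:instr attribute) and treats the first token of the instruction as the field type. Complex fields built from w:fldChar and w:instrText are not handled. Everything here, including the names and the predicate parameter, is hypothetical and is not part of MarkupSimplifier.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical sketch; not the PowerTools MarkupSimplifier implementation.
public static class FieldCodeSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // The field type is taken to be the first token of the instruction,
    // for example "HYPERLINK" in: HYPERLINK "http://example.com"
    public static string GetFieldType(string instruction)
    {
        var tokens = instruction.Split(
            new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        return tokens.Length == 0 ? string.Empty : tokens[0].ToUpperInvariant();
    }

    // Replaces each matching w:fldSimple with its child nodes, which keeps
    // the displayed field result but drops the field code. Modifies root in place.
    public static void UnwrapSimpleFields(XElement root, Func<string, bool> shouldRemove)
    {
        var fields = root.Descendants(W + "fldSimple")
            .Where(f => shouldRemove(
                GetFieldType((string)f.Attribute(W + "instr") ?? string.Empty)))
            .Reverse() // innermost first, so nested fields are unwrapped in place
            .ToList();

        foreach (var field in fields)
            field.ReplaceWith(field.Nodes().ToList());
    }
}
```

A caller that wants to keep hyperlinks, as in the XHtml transform described here, could pass a predicate such as type => type != "HYPERLINK".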

I'm sure that I'll be enhancing this class over time. I can do this without breaking existing code by adding members to the SimplifyMarkupSettings class, in such a way that the default behavior remains the same.