Mastering Text in Open XML Word-Processing Documents

Processing text in Open XML word-processing documents seems deceptively simple at first: you have the body of the document, paragraphs and tables in the body, and rows and cells in tables, just like HTML, right? Then it seems deceptively hard: you see the markup for revision tracking, numbered and bulleted lists, and content controls, along with markup that doesn't affect text, such as bookmarks and comments. Styles might seem like they don't affect text, but in the case of numbered and bulleted lists, they do. The truth is somewhere in the middle. There is a lot to keep track of, but each of these features, taken by itself, is not very complicated.

This is one in a series of posts on transforming Open XML WordprocessingML to XHTML. You can find the complete list of posts here.

A few basic ideas and abstractions can simplify how you think about word-processing markup. These abstractions are relevant whether you work with the markup using the Open XML SDK 2.0 strongly-typed object model, the Open XML SDK with LINQ to XML, or some other platform, such as Java or PHP. We can write some code that helps us deal with these abstractions. The code will 'surface' just those elements that you are interested in, and surface them in an organized, predictable manner. In the MSDN article, Mastering Text in Open XML WordprocessingML Documents, I present C# code written both with LINQ to XML and with the Open XML SDK 2.0 strongly-typed object model. It is not a lot of code. Because the semantics of a few useful methods are defined carefully, they are easy to implement in whatever language and platform you are using.
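To give a feel for what 'surfacing' can look like on another platform, here is a minimal sketch in Python using only the standard library. It is not the code from the MSDN article. The function names paragraph_text and paragraph_texts and the SAMPLE markup are made up for this illustration, and the sketch assumes that the text you want is whatever sits in w:t elements, so deleted text (w:delText) is left out while inserted text (inside w:ins) is kept. It ignores tabs, breaks, text boxes, and list numbering.

```python
import xml.etree.ElementTree as ET

# The WordprocessingML main namespace, in ElementTree's "{uri}name" form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def paragraph_text(paragraph):
    """Join the text of every w:t element under one w:p element.

    Assumption for this sketch: deleted text lives in w:delText, not w:t,
    so tracked deletions are skipped simply by not looking for them.
    """
    return "".join(t.text or "" for t in paragraph.iter(W + "t"))

def paragraph_texts(body):
    """Yield the text of each paragraph under w:body in document order,
    including paragraphs inside table cells."""
    for paragraph in body.iter(W + "p"):
        yield paragraph_text(paragraph)

# Hypothetical markup: a bookmark, a tracked deletion, a tracked insertion,
# and a one-cell table.
SAMPLE = """\
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:bookmarkStart w:id="0" w:name="intro"/>
      <w:r><w:t xml:space="preserve">Hello </w:t></w:r>
      <w:del w:id="1" w:author="a"><w:r><w:delText>old</w:delText></w:r></w:del>
      <w:ins w:id="2" w:author="a"><w:r><w:t>new</w:t></w:r></w:ins>
      <w:bookmarkEnd w:id="0"/>
    </w:p>
    <w:tbl><w:tr><w:tc><w:p><w:r><w:t>cell</w:t></w:r></w:p></w:tc></w:tr></w:tbl>
  </w:body>
</w:document>"""

body = ET.fromstring(SAMPLE).find(W + "body")
texts = list(paragraph_texts(body))  # ['Hello new', 'cell']
```

The bookmark and the deleted run contribute nothing, the inserted run does, and the paragraph inside the table cell is surfaced alongside the top-level one. A real document would be read from the word/document.xml part of the .docx package rather than from a string.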