Additional Notes on Parsing WordML with XLinq

08/04/2006

[Blog Map] This blog is inactive. New blog: EricWhite.com/blog

Note added on 8/1/2008: I just want to acknowledge that the approach taken in this and the previous blog post is the wrong approach. :-)

I first posted this on August 1, 2006, before I had the necessary functional programming epiphanies. To see the correct approach, go through this tutorial.

----------------------------------------------------------------------

Here are a few more notes on parsing WordML with XLinq.

First, in the previous blog entry, I didn't call out that the Word doc CodeInDoc.xml is attached to that entry. It is.

Second, Steve Eichert has some good additional points to make about code that reads WordML.

Third, when you look at the WordML, those pesky namespace prefixes make it somewhat difficult to see the structure of the XML. The WordML fragment that I showed in the previous blog entry is:

<w:p>
<w:pPr>
<w:pStyle w:val="Code" />
</w:pPr>
<w:r>
<w:t>using</w:t>
</w:r>
<w:proofErr w:type="gramEnd" />
<w:r>
<w:t> System;</w:t>
</w:r>
<aml:annotation aml:id="0" w:type="Word.Comment.End" />
<w:r>
<w:rPr>
<w:rStyle w:val="CommentReference" />
<w:rFonts w:ascii="Times New Roman" w:h-ansi="Times New Roman" />
<wx:font wx:val="Times New Roman" />
</w:rPr>
<aml:annotation aml:id="0" aml:author="Eric White" aml:createdate="2006-08-01T11:50:00Z" w:type="Word.Comment" w:initials="EW">
<aml:content>
<w:p>
<w:pPr>
<w:pStyle w:val="CommentText" />
</w:pPr>
<w:r>
<w:rPr>
<w:rStyle w:val="CommentReference" />
</w:rPr>
<w:annotationRef />
</w:r>
<w:r>
<w:t>&lt;Test </w:t>
</w:r>
<w:proofErr w:type="spellStart" />
<w:r>
<w:t>SnipId</w:t>
</w:r>
<w:proofErr w:type="spellEnd" />
<w:r>
<w:t>="000101" TestId="0001"/&gt;</w:t>
</w:r>
</w:p>
</aml:content>
</aml:annotation>
</w:r>
</w:p>
<w:proofErr w:type="gramStart" />
<w:p>
<w:pPr>
<w:pStyle w:val="Code" />
</w:pPr>
<w:r>
<w:t>using</w:t>
</w:r>
<w:proofErr w:type="gramEnd" />
<w:r>
<w:t> </w:t>
</w:r>
<w:proofErr w:type="spellStart" />
<w:r>
<w:t>System.Collections.Generic</w:t>
</w:r>
<w:proofErr w:type="spellEnd" />
<w:r>
<w:t>;</w:t>
</w:r>
</w:p>
<w:proofErr w:type="gramStart" />
<w:p>
<w:pPr>
<w:pStyle w:val="Code" />
</w:pPr>
<w:r>
<w:t>using</w:t>
</w:r>
<w:proofErr w:type="gramEnd" />
<w:r>
<w:t> </w:t>
</w:r>
<w:proofErr w:type="spellStart" />
<w:r>
<w:t>System.Text</w:t>
</w:r>
<w:proofErr w:type="spellEnd" />
<w:r>
<w:t>;</w:t>
</w:r>
</w:p>

Pull the file into your favorite editor, and substitute w: with nothing:

<p>
<pPr>
<pStyle val="Code" />
</pPr>
<r>
<t>using</t>
</r>
<proofErr type="gramEnd" />
<r>
<t> System;</t>
</r>
<aml:annotation aml:id="0" type="Word.Comment.End" />
<r>
<rPr>
<rStyle val="CommentReference" />
<rFonts ascii="Times New Roman" h-ansi="Times New Roman" />
<wx:font wx:val="Times New Roman" />
</rPr>
<aml:annotation aml:id="0" aml:author="Eric White" aml:createdate="2006-08-01T11:50:00Z" type="Word.Comment" initials="EW">
<aml:content>
<p>
<pPr>
<pStyle val="CommentText" />
</pPr>
<r>
<rPr>
<rStyle val="CommentReference" />
</rPr>
<annotationRef />
</r>
<r>
<t>&lt;Test </t>
</r>
<proofErr type="spellStart" />
<r>
<t>SnipId</t>
</r>
<proofErr type="spellEnd" />
<r>
<t>="000101" TestId="0001"/&gt;</t>
</r>
</p>
</aml:content>
</aml:annotation>
</r>
</p>
<proofErr type="gramStart" />
<p>
<pPr>
<pStyle val="Code" />
</pPr>
<r>
<t>using</t>
</r>
<proofErr type="gramEnd" />
<r>
<t> </t>
</r>
<proofErr type="spellStart" />
<r>
<t>System.Collections.Generic</t>
</r>
<proofErr type="spellEnd" />
<r>
<t>;</t>
</r>
</p>
<proofErr type="gramStart" />
<p>
<pPr>
<pStyle val="Code" />
</pPr>
<r>
<t>using</t>
</r>
<proofErr type="gramEnd" />
<r>
<t> </t>
</r>
<proofErr type="spellStart" />
<r>
<t>System.Text</t>
</r>
<proofErr type="spellEnd" />
<r>
<t>;</t>
</r>
</p>

It is now significantly easier to see the structure of the XML. Of particular interest are the <p> and <t> elements.
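If you would rather not edit the file by hand, the same substitution can be approximated in code. The following is only a sketch, not what I did for this post: it assumes the released System.Xml.Linq API, and it assumes that the w: prefix is bound to the Word 2003 WordML namespace shown in the constant. The class and method names (StripWordMLPrefix, StripW) are made up for illustration. It also assumes no element carries both a w: attribute and an unqualified attribute with the same local name, since those would collide after stripping.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class StripWordMLPrefix
{
    // Assumption: the namespace that the w: prefix is bound to in Word 2003 WordML.
    public static readonly XNamespace W =
        "http://schemas.microsoft.com/office/word/2003/wordml";

    // Moves every element and attribute in the w: namespace into no namespace.
    // Names in other namespaces (aml:, wx:) are left untouched.
    public static void StripW(XElement root)
    {
        foreach (XElement e in root.DescendantsAndSelf())
        {
            if (e.Name.Namespace == W)
                e.Name = e.Name.LocalName;

            // XAttribute.Name is read-only, so rebuild the attribute list,
            // dropping the xmlns:w declaration because it is no longer used.
            e.ReplaceAttributes(
                e.Attributes()
                 .Where(a => !(a.IsNamespaceDeclaration && a.Value == W.NamespaceName))
                 .Select(a => a.Name.Namespace == W
                     ? new XAttribute(a.Name.LocalName, a.Value)
                     : a)
                 .ToList());
        }
    }

    static void Main()
    {
        XElement p = XElement.Parse(
            "<w:p xmlns:w='http://schemas.microsoft.com/office/word/2003/wordml'>" +
            "<w:pPr><w:pStyle w:val='Code' /></w:pPr>" +
            "<w:r><w:t>using</w:t></w:r>" +
            "</w:p>");

        StripW(p);

        // Prints the paragraph with the w: prefix gone from elements and attributes.
        Console.WriteLine(p);
    }
}
```

Unlike the editor substitution, this leaves the text inside the <t> elements alone even if it happens to contain the characters w:.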

One thing that would clarify the WordML even more is a couple of lines of XLinq code that suck the proofErr elements out. After reading the elements into an XML tree, you can remove them with:

// Matches proofErr in no namespace, as in the listing with w: stripped.
foreach (XElement z in wml.Descendants("proofErr").ToList())
    z.Remove();

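You have to copy the results of the iteration into a list because of the Halloween problem: Descendants is evaluated lazily, so removing elements inside the loop would modify the tree that the query is still walking. This will be addressed in the docs.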
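To make the point about the list copy concrete, here is a small self-contained sketch that runs against WordML with the w: prefix still in place. It is written against the released System.Xml.Linq API, which may differ from the preview bits this post was written with, and the namespace URI is my assumption for what w: is bound to in Word 2003 WordML. The class and method names are hypothetical. The second variant uses the Remove extension method on IEnumerable<XNode>, which takes its own snapshot of the sequence before removing anything.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class RemoveProofErr
{
    // Assumption: the namespace that the w: prefix is bound to in Word 2003 WordML.
    public static readonly XNamespace W =
        "http://schemas.microsoft.com/office/word/2003/wordml";

    // Variant 1: the loop from this post, with a namespace-qualified name.
    // ToList() copies the matches before any node is removed from the tree.
    public static void RemoveWithLoop(XElement wml)
    {
        foreach (XElement z in wml.Descendants(W + "proofErr").ToList())
            z.Remove();
    }

    // Variant 2: Extensions.Remove copies the sequence internally,
    // so no explicit ToList() is needed.
    public static void RemoveWithExtension(XElement wml)
    {
        wml.Descendants(W + "proofErr").Remove();
    }

    static void Main()
    {
        XElement wml = XElement.Parse(
            "<w:p xmlns:w='http://schemas.microsoft.com/office/word/2003/wordml'>" +
            "<w:r><w:t>using</w:t></w:r>" +
            "<w:proofErr w:type='gramEnd' />" +
            "<w:proofErr w:type='spellStart' />" +
            "<w:r><w:t>System.Text</w:t></w:r>" +
            "</w:p>");

        RemoveWithLoop(wml);

        // Prints 0: no proofErr elements are left in the tree.
        Console.WriteLine(wml.Descendants(W + "proofErr").Count());
    }
}
```

Either variant leaves the <w:r> and <w:t> elements in place; only the proofErr elements are dropped.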
