Parsing WordML using XLinq

(This is a note added on 4/17/2008 - I just want to acknowledge that the approach taken in this blog post is the wrong approach!!  :-)

I first posted this on August 1, 2006, before I had the necessary functional programming epiphanies.  To see the correct approach, go through this tutorial.

---------------------------------------------------------------------- 

Recently, I had a problem where there wasn't a code testing harness that would do exactly what I wanted. I wanted to grab my code snippet directly from my Word document, compile it, run it, and validate the output. 

In more technical terms, I want to parse some WordML to grab text formatted with a given style. Further, I want to put a comment on the first line of the formatted text, and be able to grab the comment. The comment will contain the metadata that tells how to compile and run the code.

My Word docs are stored in WordML (which is XML). My experiment was to see how easy it would be to pick apart the WordML using XLinq. This is the result.

First, I needed to see what the WordML looked like. If you open a WordML file, it is saved without any indenting, making it difficult to see the element tags, and the structure of the document. So I used the following program to indent the file:

using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;

namespace Indent
{
    class Program
    {
        static void Main(string[] args)
        {
            foreach (string s in args)
            {
                XmlDocument doc = new XmlDocument();
                doc.Load(s);

                // Assumes a four-character extension such as ".xml".
                string newName = s.Substring(0, s.Length - 4) + "_Indented.xml";

                // Dispose the writer so the output file is flushed and closed.
                using (XmlTextWriter writer = new XmlTextWriter(newName, null))
                {
                    writer.Formatting = Formatting.Indented;
                    doc.Save(writer);
                }
            }
        }
    }
}

The Word doc that we're using for this sample is attached to this blog entry.

After building this little app, running it, and looking at my re-formatted WordML file, I see:

<w:p>
  <w:pPr>
    <w:pStyle w:val="Code" />
  </w:pPr>
  <w:r>
    <w:t>using</w:t>
  </w:r>
  <w:proofErr w:type="gramEnd" />
  <w:r>
    <w:t> System;</w:t>
  </w:r>
  <aml:annotation aml:id="0" w:type="Word.Comment.End" />
  <w:r>
    <w:rPr>
      <w:rStyle w:val="CommentReference" />
      <w:rFonts w:ascii="Times New Roman" w:h-ansi="Times New Roman" />
      <wx:font wx:val="Times New Roman" />
    </w:rPr>
    <aml:annotation aml:id="0" aml:author="Eric White" aml:createdate="2006-08-01T11:50:00Z" w:type="Word.Comment" w:initials="EW">
      <aml:content>
        <w:p>
          <w:pPr>
            <w:pStyle w:val="CommentText" />
          </w:pPr>
          <w:r>
            <w:rPr>
              <w:rStyle w:val="CommentReference" />
            </w:rPr>
            <w:annotationRef />
          </w:r>
          <w:r>
            <w:t>&lt;Test </w:t>
          </w:r>
          <w:proofErr w:type="spellStart" />
          <w:r>
            <w:t>SnipId</w:t>
          </w:r>
          <w:proofErr w:type="spellEnd" />
          <w:r>
            <w:t>="000101" TestId="0001"/&gt;</w:t>
          </w:r>
        </w:p>
      </aml:content>
    </aml:annotation>
  </w:r>
</w:p>
<w:proofErr w:type="gramStart" />
<w:p>
  <w:pPr>
    <w:pStyle w:val="Code" />
  </w:pPr>
  <w:r>
    <w:t>using</w:t>
  </w:r>
  <w:proofErr w:type="gramEnd" />
  <w:r>
    <w:t> </w:t>
  </w:r>
  <w:proofErr w:type="spellStart" />
  <w:r>
    <w:t>System.Collections.Generic</w:t>
  </w:r>
  <w:proofErr w:type="spellEnd" />
  <w:r>
    <w:t>;</w:t>
  </w:r>
</w:p>
<w:proofErr w:type="gramStart" />
<w:p>
  <w:pPr>
    <w:pStyle w:val="Code" />
  </w:pPr>
  <w:r>
    <w:t>using</w:t>
  </w:r>
  <w:proofErr w:type="gramEnd" />
  <w:r>
    <w:t> </w:t>
  </w:r>
  <w:proofErr w:type="spellStart" />
  <w:r>
    <w:t>System.Text</w:t>
  </w:r>
  <w:proofErr w:type="spellEnd" />
  <w:r>
    <w:t>;</w:t>
  </w:r>
</w:p>

I can see where the Word comment is. It is stored in an annotation element:

<aml:annotation aml:id="0" aml:author="Eric White" aml:createdate="2006-08-01T11:50:00Z" w:type="Word.Comment" w:initials="EW">

So in XLinq, I can issue a query to select all comment annotations:

var commentNodes =
    from annos in wordDoc.Descendants(aml + "annotation")
    where (string)annos.Attribute(w + "type") == "Word.Comment"
    select annos;
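To try the query in isolation, here is a minimal sketch. It assumes a current .NET SDK, where XLinq ships as LINQ to XML in the System.Xml.Linq namespace and C# top-level statements are available. The inline XML is a hypothetical, heavily trimmed fragment, not the attached document.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// WordML 2003 namespace names; they must match the document's declarations exactly.
XNamespace aml = "http://schemas.microsoft.com/aml/2001/core";
XNamespace w = "http://schemas.microsoft.com/office/word/2003/wordml";

// Hypothetical fragment standing in for a loaded WordML document.
XElement wordDoc = XElement.Parse(
    @"<w:body xmlns:w='http://schemas.microsoft.com/office/word/2003/wordml'
              xmlns:aml='http://schemas.microsoft.com/aml/2001/core'>
        <w:p>
          <aml:annotation aml:id='0' w:type='Word.Comment.End' />
          <w:r>
            <aml:annotation aml:id='0' w:type='Word.Comment' />
          </w:r>
        </w:p>
      </w:body>");

// Same query as above; ToList() forces evaluation so the result can be counted.
var commentNodes =
    (from annos in wordDoc.Descendants(aml + "annotation")
     where (string)annos.Attribute(w + "type") == "Word.Comment"
     select annos).ToList();

Console.WriteLine(commentNodes.Count); // prints 1
```

The Word.Comment.End marker is also an annotation element, which is why the filter on w:type is needed.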

Word breaks up text, but it is easy to re-assemble: Paragraphs are contained in 'p' elements. Text is contained in 't' elements. The following XLinq code assembles text:

StringBuilder comment = new StringBuilder();

foreach (var p in commentNode.Descendants(w + "p"))
{
    foreach (var t in p.Descendants(w + "t"))
        comment.Append(t.Value);
    comment.Append("\n");
}
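The same reassembly can be written as a single expression. This is only a sketch under the same assumptions as before (System.Xml.Linq and top-level statements); ParagraphText is a hypothetical helper name, and the sample XML imitates the comment content shown earlier.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

XNamespace w = "http://schemas.microsoft.com/office/word/2003/wordml";

// Hypothetical helper: one output line per w:p, built from its w:t pieces.
string ParagraphText(XElement container) =>
    string.Join("\n",
        container.Descendants(w + "p")
                 .Select(p => string.Concat(p.Descendants(w + "t").Select(t => t.Value))));

// Hypothetical fragment imitating the content of a comment annotation.
XElement content = XElement.Parse(
    @"<aml:content xmlns:w='http://schemas.microsoft.com/office/word/2003/wordml'
                   xmlns:aml='http://schemas.microsoft.com/aml/2001/core'>
        <w:p>
          <w:r><w:t>&lt;Test </w:t></w:r>
          <w:r><w:t>SnipId</w:t></w:r>
          <w:r><w:t>=""000101"" TestId=""0001""/&gt;</w:t></w:r>
        </w:p>
      </aml:content>");

Console.WriteLine(ParagraphText(content)); // prints <Test SnipId="000101" TestId="0001"/>
```

Unlike the loop above, string.Join puts the newline between paragraphs only, so there is no trailing newline.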

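Once we have found and extracted the relevant comment, we then need to jump up two ancestors, from the annotation to the run (w:r) that contains it, and from the run to the code paragraph (w:p):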

var codePara = commentNode.Parent.Parent;
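Parent.Parent depends on the comment annotation sitting exactly one run below the paragraph. A possible alternative, sketched here under the same assumptions (System.Xml.Linq, top-level statements, a hypothetical trimmed fragment), is to ask for the nearest enclosing w:p by name. Ancestors returns the closest ancestor first, so First() picks the containing paragraph.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

XNamespace aml = "http://schemas.microsoft.com/aml/2001/core";
XNamespace w = "http://schemas.microsoft.com/office/word/2003/wordml";

// Hypothetical fragment: a code paragraph whose last run carries the comment.
XElement body = XElement.Parse(
    @"<w:body xmlns:w='http://schemas.microsoft.com/office/word/2003/wordml'
              xmlns:aml='http://schemas.microsoft.com/aml/2001/core'>
        <w:p>
          <w:r><w:t>using System;</w:t></w:r>
          <w:r>
            <aml:annotation aml:id='0' w:type='Word.Comment'>
              <aml:content><w:p><w:r><w:t>metadata</w:t></w:r></w:p></aml:content>
            </aml:annotation>
          </w:r>
        </w:p>
      </w:body>");

XElement commentNode = body.Descendants(aml + "annotation")
    .First(a => (string)a.Attribute(w + "type") == "Word.Comment");

XElement byParent = commentNode.Parent.Parent;                // annotation -> w:r -> w:p
XElement byAncestor = commentNode.Ancestors(w + "p").First(); // nearest enclosing w:p

Console.WriteLine(ReferenceEquals(byParent, byAncestor)); // prints True
```

The w:p inside aml:content is a descendant of the annotation, not an ancestor, so it does not interfere with the lookup.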

Now we have the node of the first paragraph of the code in the Word doc. The logic next consists of:

  • If we are still on a paragraph styled Code
  • Get all the text in the paragraph
  • Get rid of all annotations that are not Word.Insertion
  • Assemble the text
  • Move on to the next paragraph

This is the code to do this:

while (codePara != null)
{
    XElement c1, c2;

    // skip the proofing markers that Word places between paragraphs
    if (codePara.Name.LocalName == "proofErr")
    {
        codePara = codePara.NextNode as XElement;
        continue;
    }

    // if there is a pPr that has a pStyle with val="Code"
    if (
        ((c1 = codePara.Element(w + "pPr")) != null) &&
        ((c2 = c1.Element(w + "pStyle")) != null) &&
        ((string)c2.Attribute(w + "val") == "Code")
    )
    {
        // select all of the nodes that have content
        var interestingPieces =
            from s in codePara.Elements()
            where (s.Name == w + "r") ||
                ((s.Name == aml + "annotation") &&
                ((string)s.Attribute(w + "type") == "Word.Insertion"))
            select s;

        // get rid of all annotations that are just comments
        List<XElement> le = new List<XElement>();
        foreach (var i in interestingPieces)
        {
            var e = i.Element(aml + "annotation");
            if (e != null)
            {
                if ((string)e.Attribute(w + "type") == "Word.Comment")
                    continue;
                else
                    le.Add(i);
            }
            else
                le.Add(i);
        }

        foreach (var t in le.Descendants(w + "t"))
            code.Append(t.Value);
        code.Append("\n");

        // "as" yields null at the end of the siblings or on a non-element
        // node, which ends the loop instead of throwing on the cast
        codePara = codePara.NextNode as XElement;
    }
    else
        break;
}
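For comparison, the same walk can be sketched declaratively. This is an illustration under stated assumptions, not necessarily the approach the tutorial mentioned at the top takes: it uses the released System.Xml.Linq API and top-level statements, the helper names (IsCodeParagraph, ParagraphCode, CodeLines) are hypothetical, and the XML is a trimmed stand-in for the attached document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

XNamespace aml = "http://schemas.microsoft.com/aml/2001/core";
XNamespace w = "http://schemas.microsoft.com/office/word/2003/wordml";

// True for a w:p whose pPr/pStyle has val="Code".
bool IsCodeParagraph(XElement e) =>
    e.Name == w + "p" &&
    (string)e.Elements(w + "pPr").Elements(w + "pStyle")
             .Attributes(w + "val").FirstOrDefault() == "Code";

// Concatenates the text of runs and insertions, skipping runs that carry a comment.
string ParagraphCode(XElement p) =>
    string.Concat(
        p.Elements()
         .Where(e => e.Name == w + "r" ||
                     (e.Name == aml + "annotation" &&
                      (string)e.Attribute(w + "type") == "Word.Insertion"))
         .Where(e => !e.Elements(aml + "annotation")
                       .Any(a => (string)a.Attribute(w + "type") == "Word.Comment"))
         .Descendants(w + "t")
         .Select(t => t.Value));

// Starts at the first code paragraph, ignores proofErr markers,
// and stops at the first sibling that is not styled Code.
IEnumerable<string> CodeLines(XElement firstPara) =>
    new[] { firstPara }
        .Concat(firstPara.ElementsAfterSelf())
        .Where(e => e.Name != w + "proofErr")
        .TakeWhile(e => IsCodeParagraph(e))
        .Select(p => ParagraphCode(p));

XElement body = XElement.Parse(
    @"<w:body xmlns:w='http://schemas.microsoft.com/office/word/2003/wordml'
              xmlns:aml='http://schemas.microsoft.com/aml/2001/core'>
        <w:p>
          <w:pPr><w:pStyle w:val='Code' /></w:pPr>
          <w:r><w:t>using</w:t></w:r>
          <w:r><w:t> System;</w:t></w:r>
          <w:r>
            <aml:annotation aml:id='0' w:type='Word.Comment'>
              <aml:content><w:p><w:r><w:t>metadata</w:t></w:r></w:p></aml:content>
            </aml:annotation>
          </w:r>
        </w:p>
        <w:proofErr w:type='gramStart' />
        <w:p>
          <w:pPr><w:pStyle w:val='Code' /></w:pPr>
          <w:r><w:t>class C { }</w:t></w:r>
        </w:p>
        <w:p>
          <w:pPr><w:pStyle w:val='Normal' /></w:pPr>
          <w:r><w:t>Not code.</w:t></w:r>
        </w:p>
      </w:body>");

string code = string.Join("\n", CodeLines(body.Elements(w + "p").First()));
Console.WriteLine(code);
```

Two behaviors differ slightly from the loop above: ElementsAfterSelf skips non-element sibling nodes rather than stopping at them, and a run is dropped if any of its child annotations is a comment, not only the first.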

The above code works even when change tracking has been turned on, and there are changes in the text. The entire program follows:

using System;
using System.Collections.Generic;
using System.Text;
using System.Linq;      // named System.Query in the 2006 preview
using System.Xml.Linq;  // named System.Xml.XLinq in the 2006 preview

namespace WordMLReader
{
    class Program
    {
        static void WordMLReader(string fn)
        {
            XElement wordDoc = null;
            try
            {
                wordDoc = XElement.Load(fn);
            }
            catch (System.Xml.XmlException e)
            {
                Console.WriteLine(e.ToString());
                return;
            }

            // namespace names must match the document exactly (http, not https)
            XNamespace aml = "http://schemas.microsoft.com/aml/2001/core";
            XNamespace w = "http://schemas.microsoft.com/office/word/2003/wordml";

            var commentNodes =
                from annos in wordDoc.Descendants(aml + "annotation")
                where (string)annos.Attribute(w + "type") == "Word.Comment"
                select annos;

            foreach (var commentNode in commentNodes)
            {
                StringBuilder comment = new StringBuilder();
                StringBuilder code = new StringBuilder();

                foreach (var p in commentNode.Descendants(w + "p"))
                {
                    foreach (var t in p.Descendants(w + "t"))
                        comment.Append(t.Value);
                    comment.Append("\n");
                }

                var codePara = commentNode.Parent.Parent;

                while (codePara != null)
                {
                    XElement c1, c2;

                    // skip the proofing markers that Word places between paragraphs
                    if (codePara.Name.LocalName == "proofErr")
                    {
                        codePara = codePara.NextNode as XElement;
                        continue;
                    }

                    // if there is a pPr that has a pStyle with val="Code"
                    if (
                        ((c1 = codePara.Element(w + "pPr")) != null) &&
                        ((c2 = c1.Element(w + "pStyle")) != null) &&
                        ((string)c2.Attribute(w + "val") == "Code")
                    )
                    {
                        // select all of the nodes that have content
                        var interestingPieces =
                            from s in codePara.Elements()
                            where (s.Name == w + "r") ||
                                ((s.Name == aml + "annotation") &&
                                ((string)s.Attribute(w + "type") == "Word.Insertion"))
                            select s;

                        // get rid of all annotations that are just comments
                        List<XElement> le = new List<XElement>();
                        foreach (var i in interestingPieces)
                        {
                            var e = i.Element(aml + "annotation");
                            if (e != null)
                            {
                                if ((string)e.Attribute(w + "type") == "Word.Comment")
                                    continue;
                                else
                                    le.Add(i);
                            }
                            else
                                le.Add(i);
                        }

                        foreach (var t in le.Descendants(w + "t"))
                            code.Append(t.Value);
                        code.Append("\n");

                        // "as" yields null at the end of the siblings or on a
                        // non-element node, which ends the loop
                        codePara = codePara.NextNode as XElement;
                    }
                    else
                        break;
                }

                Console.WriteLine("============= This is the code =============");
                Console.WriteLine(code);
                Console.WriteLine("============================================");
                Console.WriteLine("");
                Console.WriteLine("============= This is the comment =============");
                Console.WriteLine(comment);
                Console.WriteLine("===============================================");
            }
        }

        static void Main(string[] args)
        {
            WordMLReader("CodeInDoc.xml");
        }
    }
}

When you have the attached Word doc and you run the code, you see:

============= This is the code =============
using System;
using System.Collections.Generic;
using System.Text;
using System.Query;
using System.Xml.XLinq;
using System.Data.DLinq;

namespace WordMLReader
{
class Program
{
static void (string[] args)
{
Console.WriteLine("Hello");
}
}
}

============================================

============= This is the comment =============
<Test SnipId="000101" TestId="0001"/>

===============================================

CodeInDoc.xml