Finding Paragraphs by Style Name or Content in an Open XML Word Processing Document

About a week ago, I posted a very interesting guest post by Bob McClellan, where he discussed some code that allows you to more easily move/insert/delete paragraphs in Open XML documents.  He is in the process of putting together a PowerShell cmdlet that demonstrates this functionality within the PowerTools for Open XML open source project.

Having PowerShell cmdlets that enable us to slice and dice Open XML documents is super.  However, it is only half the story, in the context of PowerShell.  The other half is finding the paragraph numbers of relevant paragraphs in a document.  Once we have this functionality, we can write short PowerShell scripts to accomplish scenarios such as:

· Split a document on every paragraph styled ‘Heading1’ or ‘Heading2’.

· Find a range of paragraphs, and programmatically remove it from a document.

· Find a paragraph in one document, and insert it in a desired position in another document.

My task in the PowerTools for Open XML project is to write the LINQ queries to find paragraphs, and then Bob McClellan will encapsulate those queries into a new cmdlet that finds paragraphs.

I’ve thought for a while that it would be useful to show the thought process I follow when developing such a query in the functional style.  I’ve been looking for a task where I don’t immediately know all the details of the resulting query, so that I can expose that process as I work through it.  Of course, LINQ and FP experts already know this process.  I’m targeting this post at those who are just getting started writing complex queries in the functional style, so it is primarily an educational tool.

As I list each successive iteration of the query as I develop it, I’ll highlight the changed parts, so you can see what I’ve changed.

(Update Feb 20, 2009 - After review, we continued to alter this query, so I documented that process as well.  You can find the continuation of this post here).

Definition of the Task

An overview of the task:  Find specific paragraphs (technically, child elements of the w:body element) based on certain criteria: a style name, or specific text contained in the paragraph.  The return value of this method will be an array of zero-based indexes of the child elements of w:body that match the specified selection criteria.  We’re not just looking for paragraphs; we’re really looking for child elements of the w:body element, which might include tables and content controls, among other things.

For the upcoming PowerTools release, I’m not going to ‘gold-plate’ this search functionality.  It’s enough to get something useful up and going.  After all, PowerTools for Open XML is an open source project whose primary goal is to provide examples and guidance, and we can change cmdlets in a much more agile way than in a traditional commercial project.

A developer may well want to search on paragraph style, or on paragraph contents, or both.  We’ll allow the caller of this method to pass one or more search strings for paragraph style, and the function will return the list of all paragraphs found with any of the specified search strings.  Ditto for search strings for paragraph content.  And if search criteria are specified for both styles and paragraph content, the method will return the intersection of the results.

The method returns an array of integers, which are zero-based indexes into the collection of child elements of the w:body element.
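
The selection rule described above can be sketched in a few lines before any XML is involved.  This is only an illustration under my own assumptions: the class name SelectionRuleSketch, the method name Combine, and the sample index arrays are hypothetical, and a null array stands for "this criterion was not supplied".

```csharp
using System;
using System.Linq;

static class SelectionRuleSketch
{
    // Combines two optional lists of matching indexes.
    // A null argument means the caller did not supply that criterion.
    public static int[] Combine(int[] styleMatches, int[] contentMatches)
    {
        // Both criteria supplied: keep only indexes present in both lists.
        if (styleMatches != null && contentMatches != null)
            return styleMatches.Intersect(contentMatches).ToArray();

        // Otherwise return whichever list was supplied (or null if neither).
        return contentMatches ?? styleMatches;
    }

    static void Main()
    {
        int[] byStyle = { 0, 2, 5 };    // hypothetical indexes matched by style
        int[] byContent = { 2, 3, 5 };  // hypothetical indexes matched by content

        Console.WriteLine(string.Join(",", Combine(byStyle, byContent))); // 2,5
        Console.WriteLine(string.Join(",", Combine(byStyle, null)));      // 0,2,5
    }
}
```

Within a single criterion the search strings are alternatives (any match counts); only the two criteria are combined by intersection.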

This code uses the Open XML SDK, so you will need to download it and add a reference to it.

Here is the prototype of the method:

static int[] SearchInDocument(WordprocessingDocument doc,
IEnumerable<string> styleSearchString, IEnumerable<string> contentSearchString)

In addition to that method, for convenience in development, I’ll add two more overloads where I can pass in the filename:

static int[] SearchInDocument(string filename,
IEnumerable<string> styleSearchString, IEnumerable<string> contentSearchString)

static int[] SearchInDocument(string filename, string styleSearchString,
string contentSearchString)

This is going to be quite a long post.  If you’re just interested in the final results, you can skip to the end.

Step 1

The shell of a program in which we'll write the query looks like this:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

class Program
{
static int[] SearchInDocument(WordprocessingDocument doc,
IEnumerable<string> styleSearchString, IEnumerable<string> contentSearchString)
{
return new[] { 0 };
}

static int[] SearchInDocument(string filename,
IEnumerable<string> styleSearchString, IEnumerable<string> contentSearchString)
{
using (WordprocessingDocument doc =
WordprocessingDocument.Open(filename, false))
return SearchInDocument(doc, styleSearchString, contentSearchString);
}

static int[] SearchInDocument(string filename, string styleSearchString,
string contentSearchString)
{
return SearchInDocument(filename,
styleSearchString != null ? new List<string>() { styleSearchString } : null,
contentSearchString != null ? new List<string>() { contentSearchString } : null);
}

static void Main(string[] args)
{
SearchInDocument("Test.docx", "Normal", "Hello");
}
}

Step 2

We’re now ready to start writing the query.  First, we’ll project a new collection of an anonymous type that includes the XElement node of the child of w:body, and the index of the node, using the overload of the Select method that includes the index.  We’ll also include the GetXDocument extension method:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

public static class LocalExtensions
{
public static XDocument GetXDocument(this OpenXmlPart part)
{
XDocument xdoc = part.Annotation<XDocument>();
if (xdoc != null)
return xdoc;
using (StreamReader sr = new StreamReader(part.GetStream()))
using (XmlReader xr = XmlReader.Create(sr))
xdoc = XDocument.Load(xr);
part.AddAnnotation(xdoc);
return xdoc;
}
}

class Program
{
static int[] SearchInDocument(WordprocessingDocument doc,
IEnumerable<string> styleSearchString, IEnumerable<string> contentSearchString)
{
XNamespace w =
"http://schemas.openxmlformats.org/wordprocessingml/2006/main";

var q1 = doc
.MainDocumentPart
.GetXDocument()
.Root
.Element(w + "body")
.Elements()
.Select((p, i) =>
new
{
Element = p,
Index = i
}
);

foreach (var item in q1)
{
Console.WriteLine(item.Index);
}
Environment.Exit(0);

return new[] { 0 };
}

static int[] SearchInDocument(string filename,
IEnumerable<string> styleSearchString, IEnumerable<string> contentSearchString)
{
using (WordprocessingDocument doc =
WordprocessingDocument.Open(filename, false))
return SearchInDocument(doc, styleSearchString, contentSearchString);
}

static int[] SearchInDocument(string filename, string styleSearchString,
string contentSearchString)
{
return SearchInDocument(filename,
styleSearchString != null ? new List<string>() { styleSearchString } : null,
contentSearchString != null ? new List<string>() { contentSearchString } : null);
}

static void Main(string[] args)
{
SearchInDocument("Test.docx", "Normal", "Hello");
}
}

This outputs a list of indexes:

0
1
2
3
4
5
6
7
8
Press any key to continue . . .
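
If you want to try the indexed overload of Select without opening a real document, the following minimal sketch runs the same projection over a small in-memory tree.  The class name IndexedSelectSketch, the method name DescribeBodyChildren, and the simplified markup are my own assumptions for illustration; a real document part contains far more than this.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class IndexedSelectSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Pairs each child of w:body with its zero-based position.
    public static string[] DescribeBodyChildren(XElement document)
    {
        return document
            .Element(w + "body")
            .Elements()
            .Select((e, i) => i + ":" + e.Name.LocalName)
            .ToArray();
    }

    static void Main()
    {
        // Hypothetical, heavily simplified document markup.
        XElement document = new XElement(w + "document",
            new XElement(w + "body",
                new XElement(w + "p"),
                new XElement(w + "tbl"),
                new XElement(w + "p"),
                new XElement(w + "sectPr")));

        // Prints 0:p, 1:tbl, 2:p, 3:sectPr on separate lines.
        foreach (string line in DescribeBodyChildren(document))
            Console.WriteLine(line);
    }
}
```

Note that the table and the section properties element each consume an index, which is why the returned numbers refer to children of w:body rather than to paragraphs only.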

Step 3

I’m also going to need the default style name, so we’ll add a new query to determine it.  Then we'll add the style name for each paragraph to our projection of the anonymous type.  This code uses the approach described in another post for writing queries that work reliably whether or not particular nodes exist in the XML tree being queried.  Notice that I converted the lambda expression in the call to Select to a statement lambda expression:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

public static class LocalExtensions
{
public static XDocument GetXDocument(this OpenXmlPart part)
{
XDocument xdoc = part.Annotation<XDocument>();
if (xdoc != null)
return xdoc;
using (StreamReader sr = new StreamReader(part.GetStream()))
using (XmlReader xr = XmlReader.Create(sr))
xdoc = XDocument.Load(xr);
part.AddAnnotation(xdoc);
return xdoc;
}
}

class Program
{
static int[] SearchInDocument(WordprocessingDocument doc,
IEnumerable<string> styleSearchString, IEnumerable<string> contentSearchString)
{
XNamespace w =
"http://schemas.openxmlformats.org/wordprocessingml/2006/main";

var defaultStyleName = (string)doc
.MainDocumentPart
.StyleDefinitionsPart
.GetXDocument()
.Root
.Elements(w + "style")
.Where(style =>
(string)style.Attribute(w + "type") == "paragraph" &&
(string)style.Attribute(w + "default") == "1")
.First()
.Attribute(w + "styleId");

var q1 = doc
.MainDocumentPart
.GetXDocument()
.Root
.Element(w + "body")
.Elements()
.Select((p, i) =>
{
var styleNode = p.Elements(w + "pPr").Elements(w + "pStyle").FirstOrDefault();
var styleName = styleNode != null ?
(string)styleNode.Attribute(w + "val") :
defaultStyleName;
return new
{
Element = p,
Index = i,
StyleName = styleName
};
}
);

foreach (var item in q1)
{
Console.WriteLine("{0}:{1}", item.Index, item.StyleName);
}
Environment.Exit(0);

return new[] { 0 };
}

static int[] SearchInDocument(string filename,
IEnumerable<string> styleSearchString, IEnumerable<string> contentSearchString)
{
using (WordprocessingDocument doc =
WordprocessingDocument.Open(filename, false))
return SearchInDocument(doc, styleSearchString, contentSearchString);
}

static int[] SearchInDocument(string filename, string styleSearchString,
string contentSearchString)
{
return SearchInDocument(filename,
styleSearchString != null ? new List<string>() { styleSearchString } : null,
contentSearchString != null ? new List<string>() { contentSearchString } : null);
}

static void Main(string[] args)
{
SearchInDocument("Test.docx", "Normal", "Hello");
}
}
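
The fallback to the default style can be exercised on its own.  Here is a minimal sketch, assuming a hypothetical helper named GetStyleName and hand-built paragraphs; it relies on the fact that the Elements and Attributes extension methods return empty sequences rather than throwing when a node is missing.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class StyleNameSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the value of w:pPr/w:pStyle/@w:val, or the default when absent.
    public static string GetStyleName(XElement bodyChild, string defaultStyleName)
    {
        // Each step yields an empty sequence if the previous node is missing,
        // so the only null to handle is the final FirstOrDefault result.
        return (string)bodyChild
            .Elements(w + "pPr")
            .Elements(w + "pStyle")
            .Attributes(w + "val")
            .FirstOrDefault() ?? defaultStyleName;
    }

    static void Main()
    {
        // Hypothetical paragraph that names its style explicitly.
        XElement styled = new XElement(w + "p",
            new XElement(w + "pPr",
                new XElement(w + "pStyle",
                    new XAttribute(w + "val", "Heading1"))));

        // Hypothetical paragraph with no paragraph properties at all.
        XElement plain = new XElement(w + "p");

        Console.WriteLine(GetStyleName(styled, "Normal")); // Heading1
        Console.WriteLine(GetStyleName(plain, "Normal"));  // Normal
    }
}
```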

Step 4

Next, I need to get searchable text from each child node.  This code uses a couple of overloads of the StringConcatenate extension method, also defined in the LocalExtensions class:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

public static class LocalExtensions
{
public static string StringConcatenate<T>(this IEnumerable<T> source,
Func<T, string> func)
{
StringBuilder sb = new StringBuilder();
foreach (T item in source)
sb.Append(func(item));
return sb.ToString();
}

public static string StringConcatenate<T>(this IEnumerable<T> source,
Func<T, string> func, string separator)
{
StringBuilder sb = new StringBuilder();
foreach (T item in source)
sb.Append(func(item)).Append(separator);
return sb.ToString().Trim(separator.ToCharArray());
}

public static XDocument GetXDocument(this OpenXmlPart part)
{
XDocument xdoc = part.Annotation<XDocument>();
if (xdoc != null)
return xdoc;
using (StreamReader sr = new StreamReader(part.GetStream()))
using (XmlReader xr = XmlReader.Create(sr))
xdoc = XDocument.Load(xr);
part.AddAnnotation(xdoc);
return xdoc;
}
}

class Program
{
static int[] SearchInDocument(WordprocessingDocument doc,
IEnumerable<string> styleSearchString, IEnumerable<string> contentSearchString)
{
XNamespace w =
"http://schemas.openxmlformats.org/wordprocessingml/2006/main";
XName r = w + "r";
XName ins = w + "ins";

var defaultStyleName = (string)doc
.MainDocumentPart
.StyleDefinitionsPart
.GetXDocument()
.Root
.Elements(w + "style")
.Where(style =>
(string)style.Attribute(w + "type") == "paragraph" &&
(string)style.Attribute(w + "default") == "1")
.First()
.Attribute(w + "styleId");

var q1 = doc
.MainDocumentPart
.GetXDocument()
.Root
.Element(w + "body")
.Elements()
.Select((p, i) =>
{
var styleNode = p.Elements(w + "pPr").Elements(w + "pStyle").FirstOrDefault();
var styleName = styleNode != null ?
(string)styleNode.Attribute(w + "val") :
defaultStyleName;
return new
{
Element = p,
Index = i,
StyleName = styleName
};
}
);

var q2 = q1
.Select(i =>
{
string text = null;
if (i.Element.Name == w + "p")
text = i.Element.Elements()
.Where(z => z.Name == r || z.Name == ins)
.Descendants(w + "t")
.StringConcatenate(element => (string)element);
else
text = i.Element
.Descendants(w + "p")
.StringConcatenate(p => p
.Elements()
.Where(z => z.Name == r || z.Name == ins)
.Descendants(w + "t")
.StringConcatenate(element => (string)element), Environment.NewLine
);

return new
{
Element = i.Element,
StyleName = i.StyleName,
Index = i.Index,
Text = text
};
}
);

foreach (var item in q2)
{
Console.WriteLine("{0}:{1}", item.Index, item.Text);
}
Environment.Exit(0);

return new[] { 0 };
}

static int[] SearchInDocument(string filename,
IEnumerable<string> styleSearchString, IEnumerable<string> contentSearchString)
{
using (WordprocessingDocument doc =
WordprocessingDocument.Open(filename, false))
return SearchInDocument(doc, styleSearchString, contentSearchString);
}

static int[] SearchInDocument(string filename, string styleSearchString,
string contentSearchString)
{
return SearchInDocument(filename,
styleSearchString != null ? new List<string>() { styleSearchString } : null,
contentSearchString != null ? new List<string>() { contentSearchString } : null);
}

static void Main(string[] args)
{
SearchInDocument("Test.docx", "Normal", "Hello");
}
}
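
To see what the text query picks up, here is a small sketch that runs the same filter (direct w:r and w:ins children, then their w:t descendants) against a hand-built paragraph.  The class name ParagraphTextSketch, the method name GetParagraphText, and the sample markup are assumptions for illustration, and I use string.Concat here in place of the StringConcatenate helper.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class ParagraphTextSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Concatenates the w:t text found under direct w:r and w:ins children.
    public static string GetParagraphText(XElement paragraph)
    {
        return string.Concat(paragraph
            .Elements()
            .Where(e => e.Name == w + "r" || e.Name == w + "ins")
            .Descendants(w + "t")
            .Select(t => (string)t));
    }

    static void Main()
    {
        // Hypothetical paragraph: a run, a tracked insertion,
        // a tracked deletion, and a final run.
        XElement paragraph = new XElement(w + "p",
            new XElement(w + "r", new XElement(w + "t", "Hello, ")),
            new XElement(w + "ins",
                new XElement(w + "r", new XElement(w + "t", "new "))),
            new XElement(w + "del",
                new XElement(w + "r", new XElement(w + "delText", "old "))),
            new XElement(w + "r", new XElement(w + "t", "world")));

        Console.WriteLine(GetParagraphText(paragraph)); // Hello, new world
    }
}
```

The deleted text is skipped both because w:del is not one of the selected children and because deleted runs store their text in w:delText rather than w:t.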

Step 5

Before we can start searching for styles or content, we also need to handle one more aspect.  Styles can be based on other styles, so if we’re searching for a style named “Code”, we also want to find paragraphs that have a style that inherits from “Code”.  We can write a small method to concatenate all inherited styles, separating them with the tab character.  The tab character isn’t a valid part of a style name, so we’re safe to use the tab character to delimit styles:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

public static class LocalExtensions
{
public static string StringConcatenate<T>(this IEnumerable<T> source,
Func<T, string> func)
{
StringBuilder sb = new StringBuilder();
foreach (T item in source)
sb.Append(func(item));
return sb.ToString();
}

public static string StringConcatenate<T>(this IEnumerable<T> source,
Func<T, string> func, string separator)
{
StringBuilder sb = new StringBuilder();
foreach (T item in source)
sb.Append(func(item)).Append(separator);
return sb.ToString().Trim(separator.ToCharArray());
}

public static XDocument GetXDocument(this OpenXmlPart part)
{
XDocument xdoc = part.Annotation<XDocument>();
if (xdoc != null)
return xdoc;
using (StreamReader sr = new StreamReader(part.GetStream()))
using (XmlReader xr = XmlReader.Create(sr))
xdoc = XDocument.Load(xr);
part.AddAnnotation(xdoc);
return xdoc;
}
}

class Program
{
static IEnumerable<string> GetInheritedStyles(WordprocessingDocument doc, string styleName)
{
string localStyleName = styleName;
XNamespace w =
"http://schemas.openxmlformats.org/wordprocessingml/2006/main";

yield return styleName;
while (true)
{
XElement style = doc
.MainDocumentPart
.StyleDefinitionsPart
.GetXDocument()
.Root
.Elements(w + "style")
.Where(e => (string)e.Attribute(w + "type") == "paragraph" &&
(string)e.Attribute(w + "styleId") == localStyleName)
.FirstOrDefault();

if (style == null)
yield break;

var basedOn = (string)style
.Elements(w + "basedOn")
.Attributes(w + "val")
.FirstOrDefault();

if (basedOn == null)
yield break;

yield return basedOn;
localStyleName = basedOn;
}
}

static int[] SearchInDocument(WordprocessingDocument doc,
IEnumerable<string> styleSearchString, IEnumerable<string> contentSearchString)
{
XNamespace w =
"http://schemas.openxmlformats.org/wordprocessingml/2006/main";
XName r = w + "r";
XName ins = w + "ins";

var defaultStyleName = (string)doc
.MainDocumentPart
.StyleDefinitionsPart
.GetXDocument()
.Root
.Elements(w + "style")
.Where(style =>
(string)style.Attribute(w + "type") == "paragraph" &&
(string)style.Attribute(w + "default") == "1")
.First()
.Attribute(w + "styleId");

var q1 = doc
.MainDocumentPart
.GetXDocument()
.Root
.Element(w + "body")
.Elements()
.Select((p, i) =>
{
var styleNode = p.Elements(w + "pPr").Elements(w + "pStyle").FirstOrDefault();
var styleName = styleNode != null ?
(string)styleNode.Attribute(w + "val") :
defaultStyleName;
return new
{
Element = p,
Index = i,
StyleName = styleName
};
}
);

var q2 = q1
.Select(i =>
{
string text = null;
if (i.Element.Name == w + "p")
text = i.Element.Elements()
.Where(z => z.Name == r || z.Name == ins)
.Descendants(w + "t")
.StringConcatenate(element => (string)element);
else
text = i.Element
.Descendants(w + "p")
.StringConcatenate(p => p
.Elements()
.Where(z => z.Name == r || z.Name == ins)
.Descendants(w + "t")
.StringConcatenate(element => (string)element), Environment.NewLine
);

return new
{
Element = i.Element,
StyleName = i.StyleName,
Index = i.Index,
Text = text
};
}
);

var q3 = q2
.Select(i =>
new
{
Element = i.Element,
StyleName = i.StyleName,
Index = i.Index,
Text = i.Text,
InheritedStyles = GetInheritedStyles(doc, i.StyleName)
.StringConcatenate(s => s, "\t")
}
);

foreach (var item in q3)
{
Console.WriteLine("{0}:{1}", item.Index, item.InheritedStyles);
}
Environment.Exit(0);

return new[] { 0 };
}

static int[] SearchInDocument(string filename,
IEnumerable<string> styleSearchString, IEnumerable<string> contentSearchString)
{
using (WordprocessingDocument doc =
WordprocessingDocument.Open(filename, false))
return SearchInDocument(doc, styleSearchString, contentSearchString);
}

static int[] SearchInDocument(string filename, string styleSearchString,
string contentSearchString)
{
return SearchInDocument(filename,
styleSearchString != null ? new List<string>() { styleSearchString } : null,
contentSearchString != null ? new List<string>() { contentSearchString } : null);
}

static void Main(string[] args)
{
SearchInDocument("Test.docx", "Normal", "Hello");
}
}
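
The walk up the style chain is easier to follow against a tiny styles tree.  The sketch below is a simplified variant under my own assumptions: the class name StyleChainSketch, the method name GetStyleChain, and the three sample styles are hypothetical.  It looks styles up by their w:styleId attribute, since that is what w:pStyle and w:basedOn refer to, and it adds a guard against circular w:basedOn references that the code above does not have.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class StyleChainSketch
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Yields styleId, then each ancestor reached through w:basedOn.
    public static IEnumerable<string> GetStyleChain(XElement styles, string styleId)
    {
        var seen = new HashSet<string>();   // guards against cycles
        string current = styleId;
        while (current != null && seen.Add(current))
        {
            yield return current;
            // Null when the style is unknown or has no w:basedOn child.
            current = (string)styles
                .Elements(w + "style")
                .Where(s => (string)s.Attribute(w + "styleId") == current)
                .Elements(w + "basedOn")
                .Attributes(w + "val")
                .FirstOrDefault();
        }
    }

    // Builds a hypothetical style definition for the sample below.
    static XElement Style(string id, string basedOn)
    {
        return new XElement(w + "style",
            new XAttribute(w + "styleId", id),
            basedOn == null ? null :
                new XElement(w + "basedOn", new XAttribute(w + "val", basedOn)));
    }

    static void Main()
    {
        XElement styles = new XElement(w + "styles",
            Style("Normal", null),
            Style("Code", "Normal"),
            Style("CodeIndent", "Code"));

        // Prints CodeIndent, Code and Normal separated by tabs.
        Console.WriteLine(string.Join("\t", GetStyleChain(styles, "CodeIndent")));
    }
}
```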

Step 6

We’re finally ready to search for content.  As you can see in the code below, I’ve written a function, ContainsAny, that returns true if the string being searched contains any of a collection of search strings.  There’s also a bit of code to implement the appropriate behavior if the caller specifies both a style and content to search for:  in that case, the method returns only paragraphs that have one of the specified styles and also contain the specified text.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;

public static class LocalExtensions
{
public static string StringConcatenate<T>(this IEnumerable<T> source,
Func<T, string> func)
{
StringBuilder sb = new StringBuilder();
foreach (T item in source)
sb.Append(func(item));
return sb.ToString();
}

public static string StringConcatenate<T>(this IEnumerable<T> source,
Func<T, string> func, string separator)
{
StringBuilder sb = new StringBuilder();
foreach (T item in source)
sb.Append(func(item)).Append(separator);
return sb.ToString().Trim(separator.ToCharArray());
}

public static XDocument GetXDocument(this OpenXmlPart part)
{
XDocument xdoc = part.Annotation<XDocument>();
if (xdoc != null)
return xdoc;
using (StreamReader sr = new StreamReader(part.GetStream()))
using (XmlReader xr = XmlReader.Create(sr))
xdoc = XDocument.Load(xr);
part.AddAnnotation(xdoc);
return xdoc;
}
}

class Program
{
static bool ContainsAny(string stringToSearch, IEnumerable<string> searchStrings)
{
foreach (var s in searchStrings)
if (stringToSearch.Contains(s))
return true;
return false;
}

static IEnumerable<string> GetInheritedStyles(WordprocessingDocument doc, string styleName)
{
string localStyleName = styleName;
XNamespace w =
"http://schemas.openxmlformats.org/wordprocessingml/2006/main";

yield return styleName;
while (true)
{
XElement style = doc
.MainDocumentPart
.StyleDefinitionsPart
.GetXDocument()
.Root
.Elements(w + "style")
.Where(e => (string)e.Attribute(w + "type") == "paragraph" &&
(string)e.Attribute(w + "styleId") == localStyleName)
.FirstOrDefault();

if (style == null)
yield break;

var basedOn = (string)style
.Elements(w + "basedOn")
.Attributes(w + "val")
.FirstOrDefault();

if (basedOn == null)
yield break;

yield return basedOn;
localStyleName = basedOn;
}
}

static int[] SearchInDocument(WordprocessingDocument doc,
IEnumerable<string> styleSearchString, IEnumerable<string> contentSearchString)
{
XNamespace w =
"http://schemas.openxmlformats.org/wordprocessingml/2006/main";
XName r = w + "r";
XName ins = w + "ins";

var defaultStyleName = (string)doc
.MainDocumentPart
.StyleDefinitionsPart
.GetXDocument()
.Root
.Elements(w + "style")
.Where(style =>
(string)style.Attribute(w + "type") == "paragraph" &&
(string)style.Attribute(w + "default") == "1")
.First()
.Attribute(w + "styleId");

var q1 = doc
.MainDocumentPart
.GetXDocument()
.Root
.Element(w + "body")
.Elements()
.Select((p, i) =>
{
var styleNode = p.Elements(w + "pPr").Elements(w + "pStyle").FirstOrDefault();
var styleName = styleNode != null ?
(string)styleNode.Attribute(w + "val") :
defaultStyleName;
return new
{
Element = p,
Index = i,
StyleName = styleName
};
}
);

var q2 = q1
.Select(i =>
{
string text = null;
if (i.Element.Name == w + "p")
text = i.Element.Elements()
.Where(z => z.Name == r || z.Name == ins)
.Descendants(w + "t")
.StringConcatenate(element => (string)element);
else
text = i.Element
.Descendants(w + "p")
.StringConcatenate(p => p
.Elements()
.Where(z => z.Name == r || z.Name == ins)
.Descendants(w + "t")
.StringConcatenate(element => (string)element), Environment.NewLine
);

return new
{
Element = i.Element,
StyleName = i.StyleName,
Index = i.Index,
Text = text
};
}
);

var q3 = q2
.Select(i =>
new {
Element = i.Element,
StyleName = i.StyleName,
Index = i.Index,
Text = i.Text,
InheritedStyles = GetInheritedStyles(doc, i.StyleName).StringConcatenate(s => s, "\t")
}
);

int[] q4 = null;
if (styleSearchString != null)
q4 = q3
.Where(i => ContainsAny(i.InheritedStyles, styleSearchString))
.Select(i => i.Index)
.ToArray();

int[] q5 = null;
if (contentSearchString != null)
q5 = q3
.Where(i => ContainsAny(i.Text, contentSearchString))
.Select(i => i.Index)
.ToArray();

int[] q6 = null;
if (q4 != null && q5 != null)
q6 = q4.Intersect(q5).ToArray();
else
q6 = q5 != null ? q5 : q4;

return q6;
}

static int[] SearchInDocument(string filename,
IEnumerable<string> styleSearchString, IEnumerable<string> contentSearchString)
{
using (WordprocessingDocument doc =
WordprocessingDocument.Open(filename, false))
return SearchInDocument(doc, styleSearchString, contentSearchString);
}

static int[] SearchInDocument(string filename, string styleSearchString,
string contentSearchString)
{
return SearchInDocument(filename,
styleSearchString != null ? new List<string>() { styleSearchString } : null,
contentSearchString != null ? new List<string>() { contentSearchString } : null);
}

static void Main(string[] args)
{
int[] results = SearchInDocument("Test.docx", "Normal", "Hello");
foreach (var i in results)
Console.WriteLine(i);
}
}

We've finished.  This is code written in the functional style that searches for paragraphs of a particular style, or that contain specified content.  It finds all paragraphs with any specified style.  It also finds all paragraphs with any specified content.  It then returns the intersection of the results of the two searches. 
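
One behavior worth being aware of:  because ContainsAny is a substring test, a style search runs against the whole tab-delimited string, so a search for one style can also match a longer style whose name begins the same way.  The sketch below contrasts that with a stricter comparison of whole entries.  MatchesAnyStyle and the sample strings are hypothetical, offered as one possible variation rather than as part of the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class StyleMatchSketch
{
    // Same substring test as ContainsAny in the post.
    public static bool ContainsAny(string stringToSearch,
        IEnumerable<string> searchStrings)
    {
        return searchStrings.Any(s => stringToSearch.Contains(s));
    }

    // Hypothetical stricter variant: compares whole tab-delimited entries.
    public static bool MatchesAnyStyle(string inheritedStyles,
        IEnumerable<string> styleNames)
    {
        return inheritedStyles.Split('\t').Intersect(styleNames).Any();
    }

    static void Main()
    {
        string inherited = "Heading10\tNormal";   // hypothetical style chain
        string[] search = { "Heading1" };

        Console.WriteLine(ContainsAny(inherited, search));     // True
        Console.WriteLine(MatchesAnyStyle(inherited, search)); // False
    }
}
```

For content searches the substring behavior is what you want; the distinction only matters for style names.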

The final code is attached.

(Update: Feb 20, 2009 - subsequent to finishing this post, I maintained this code, modifying and improving it.  As with this post, I documented the steps I took.  You can find the continuation of this post here.)

Program.cs