One of the more common scenarios for a Wordprocessing document is the need to sanitize it by removing personally identifiable information. What do I mean by personally identifiable information? I am talking about, among other things, comments, revisions, hidden text, and personal details such as the author name. This type of content may need to be stripped out of a document before it is sent outside a corporation.
This scenario is so important to Office that we added a Document Inspector feature in Office 2007, which can find and remove this kind of personally identifiable information. You can find this feature by clicking the Office button | Prepare | Inspect Document. Here is what the feature looks like:
How do I perform the same actions programmatically, let's say on the server? Well, here is where the Open XML SDK can help. Today I am going to show you how to remove comments within a Wordprocessing document. This post is similar to Eric's post on using LINQ to remove comments from a document, except I will show you a solution that builds on top of version 2 of the Open XML SDK.
Imagine I have a document that has multiple comments, where some of the comments may even contain images. If you crack open the package you will notice that a Wordprocessing document that contains comments will have the following content:
- The document will contain a Comments part, which contains the content of every comment
- If applicable, the Comments part will reference other parts associated with a given comment. For example, if a comment contains an image, the Comments part will reference an image part
- The main document part will contain references to comments via comment reference elements
- The main document part will demarcate regions that are associated with a comment via comment range start and comment range end elements
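If you would rather see these pieces from code than by cracking open the package, a minimal sketch along the following lines could summarize them. It assumes the Open XML SDK assembly (DocumentFormat.OpenXml) is referenced; the class name CommentInspector and the method name DescribeComments are made up for illustration and are not part of the SDK.

```csharp
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper for illustration only; not part of the Open XML SDK
static class CommentInspector
{
    // Returns a short text summary of the comment-related content in a document
    public static string DescribeComments(string filename)
    {
        // Open read-only (second argument is false); nothing is modified here
        using (WordprocessingDocument doc = WordprocessingDocument.Open(filename, false))
        {
            MainDocumentPart mainPart = doc.MainDocumentPart;
            WordprocessingCommentsPart commentsPart = mainPart.WordprocessingCommentsPart;
            if (commentsPart == null)
                return "No Comments part";

            // The content of every comment lives in the Comments part
            int comments = commentsPart.Comments.Elements<Comment>().Count();
            // Parts that the Comments part references, such as image parts
            int referencedParts = commentsPart.Parts.Count();
            // Markers in the main document part that tie regions of text to comments
            int starts = mainPart.Document.Descendants<CommentRangeStart>().Count();
            int ends = mainPart.Document.Descendants<CommentRangeEnd>().Count();
            int references = mainPart.Document.Descendants<CommentReference>().Count();

            return string.Format(
                "{0} comment(s), {1} referenced part(s), {2} range start(s), {3} range end(s), {4} reference(s)",
                comments, referencedParts, starts, ends, references);
        }
    }
}
```

Calling CommentInspector.DescribeComments on a document before and after sanitizing it is one way to see which of the four kinds of content are present.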
Here is a screenshot of an example document with comments:
To remove comments from a Wordprocessing document we need to take the following actions:
- Open up the Wordprocessing document via the Open XML SDK
- Access the main document part, which will give us access to all other parts within the package
- Delete the Comments part and all parts referenced by the Comments part
- Find all elements within the main document part associated with comments
- Delete all those found elements
- Save changes made to the document
My post will talk about using version 2 of the SDK.
If you just want to jump straight into the code, feel free to download this solution here.
The following code snippet accomplishes all six tasks discussed in the Solution section above. This code snippet builds upon some of the topics discussed in the Traversing in the Open XML SDK DOM and Open XML SDK... The Basics posts. In particular, the Descendants() method is used to find specific elements associated with comments and the generic OpenXmlElement class is used for manipulation. Another thing to note is that deleting a part via the Open XML SDK deletes not only the part itself but also all parts referenced by that part.
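using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

static void RemoveComments(string filename)
{
    //Open up the document for editing
    using (WordprocessingDocument myDoc = WordprocessingDocument.Open(filename, true))
    {
        //Access main document part
        MainDocumentPart mainPart = myDoc.MainDocumentPart;
        //Delete the comment part (if there is one), plus any other part referenced, like image parts
        if (mainPart.WordprocessingCommentsPart != null)
            mainPart.DeletePart(mainPart.WordprocessingCommentsPart);
        //Find all elements that are associated with comments; ToList() takes a snapshot before anything is removed
        var elementList = mainPart.Document.Descendants()
            .Where(el => el is CommentRangeStart || el is CommentRangeEnd || el is CommentReference)
            .ToList();
        //Delete every found element
        foreach (OpenXmlElement e in elementList)
            e.Remove();
        //Save changes made to the main document part
        mainPart.Document.Save();
    }
}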
Putting everything together and running my code, I will end up with a document that is completely devoid of comments. Sweet!
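If you want to confirm the result from code rather than by opening the file in Word, a small check like the one below could be run against the sanitized document. This is only a sketch: the class name CommentCheck and the method name IsCommentFree are hypothetical, and it looks only for the content discussed in this post (the Comments part and the three comment-related elements in the main document part).

```csharp
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper for illustration only; not part of the Open XML SDK
static class CommentCheck
{
    // Returns true when neither a Comments part nor any comment markers remain
    public static bool IsCommentFree(string filename)
    {
        // Open read-only; this check never changes the document
        using (WordprocessingDocument doc = WordprocessingDocument.Open(filename, false))
        {
            MainDocumentPart mainPart = doc.MainDocumentPart;

            // The Comments part should have been deleted
            bool hasCommentsPart = mainPart.WordprocessingCommentsPart != null;

            // No comment range or comment reference elements should be left behind
            bool hasMarkers = mainPart.Document.Descendants()
                .Any(el => el is CommentRangeStart ||
                           el is CommentRangeEnd ||
                           el is CommentReference);

            return !hasCommentsPart && !hasMarkers;
        }
    }
}
```

On a server, a check like this could run right after the removal step and before the document is sent out.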
Here is a screenshot of the final document: