Using the Open XML SDK and LINQ to XML to Remove Personal Information from an Open XML Wordprocessing Document

This post presents some code to remove personal information from an Open XML word processing document.

Note: this post contains interesting information for LINQ to XML developers even if you are not interested in removing personal information from a document.  It demonstrates some easy ways to write robust LINQ to XML code that behaves properly even if we don't know the original state of the document.  For instance, it shows how to write very short code that removes an element and doesn't fail if the element doesn't exist.

One thing to note: the ECMA-376 Open XML standard states that what constitutes personal information is application-defined.  To determine what I needed to do for the code in this post, I created a document normally, then created a copy and removed personal information from it, and then compared the two using OpenXmlDiff.

After creating a document with personal information, and using the code attached to this page to remove it, Word 2007 reports that no personal information exists in the document.

Here is what the code attached to this page does:

  • In the extended file properties part (/docProps/app.xml), set the company name to the empty string.
  • In the extended file properties part, set the TotalTime element to "0".
  • In the core file properties part (/docProps/core.xml), set dc:creator and cp:lastModifiedBy to the empty string.
  • In the core file properties part, set cp:revision to "1".
  • In the settings part (/word/settings.xml), add two empty elements, w:removePersonalInformation and w:removeDateAndTime.  These elements indicate to an application that personal information has been removed and should not be re-added.  They should be added at the correct spot in the sequence of child elements, so that XSD validation will work properly.
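
The snippets in this post refer to XNamespace variables named x, cp, dc, and w without showing their declarations.  A minimal sketch of what those declarations presumably look like, using the standard Open XML and Dublin Core namespace URIs (the holder class Ns and its Main method are hypothetical, added only so the example compiles on its own):

```csharp
using System;
using System.Xml.Linq;

// Hypothetical holder class; the field names match the variables used in this post.
public static class Ns
{
    // Extended file properties part (/docProps/app.xml)
    public static readonly XNamespace x =
        "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";

    // Core file properties part (/docProps/core.xml)
    public static readonly XNamespace cp =
        "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";

    // Dublin Core elements such as dc:creator
    public static readonly XNamespace dc = "http://purl.org/dc/elements/1.1/";

    // WordprocessingML elements in the settings part (/word/settings.xml)
    public static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static void Main()
    {
        // XNamespace + string yields an XName, which is what the axis methods expect.
        XName company = x + "Company";
        Console.WriteLine(company.LocalName);
    }
}
```

With these in place, an expression such as x + "Company" produces the fully qualified name that Elements matches against.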

This code uses some interesting but somewhat obscure LINQ to XML tricks to do this work in a robust fashion.  For instance, the following code uses the Elements and Remove extension methods in such a way that if the Company element doesn't exist, it works just fine, and doesn't throw an exception:

    extendedFilePropertiesXDoc
        .Elements(x + "Properties")
        .Elements(x + "Company")
        .Remove();

Here is how this works.  The first call, Elements(x + "Properties"), returns an IEnumerable<XElement> that contains either 0 or 1 elements.  The next call uses the LINQ to XML Elements extension method, which returns a collection containing all "Company" elements that are children of any element in the source collection.  Because the source collection contains either 0 or 1 elements, in a valid document the result also contains either 0 or 1 "Company" elements.  Then we "dot" into the Remove extension method, which removes every element in the collection from its parent.  The Remove extension method uses "snapshot semantics": it first copies the source collection to a list, and then removes the items in that list, so the tree is never modified while the query is still being enumerated.  It is fine to call the Remove extension method on an empty collection or on a collection containing a single element.

So, by using this idiom, we can write code that removes the element if it exists, and know that the code will not fail if the element to be removed doesn't exist.
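
To see the idiom behave the same way regardless of the starting state, here is a small self-contained sketch.  The class name, the CountCompany helper, and the sample documents are hypothetical; they stand in for the extended file properties part and are built in memory rather than read from a package.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class RemoveIdiomDemo
{
    static readonly XNamespace x =
        "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";

    // Removes Properties/Company if it is present; does nothing otherwise.
    public static void RemoveCompany(XDocument doc)
    {
        doc.Elements(x + "Properties")
           .Elements(x + "Company")
           .Remove();
    }

    // Hypothetical helper: counts the Company elements under the Properties root.
    public static int CountCompany(XDocument doc)
    {
        return doc.Elements(x + "Properties").Elements(x + "Company").Count();
    }

    public static void Main()
    {
        var withCompany = new XDocument(
            new XElement(x + "Properties",
                new XElement(x + "Company", "Contoso"),
                new XElement(x + "TotalTime", "12")));
        var withoutCompany = new XDocument(
            new XElement(x + "Properties",
                new XElement(x + "TotalTime", "12")));
        var empty = new XDocument(); // no root element at all

        // None of these calls throws, whatever the document contains.
        RemoveCompany(withCompany);
        RemoveCompany(withoutCompany);
        RemoveCompany(empty);

        Console.WriteLine(CountCompany(withCompany));
    }
}
```

The same method handles a document with the element, a document without it, and even a document with no root, with no null checks in sight.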

The following code uses a similar approach to get the TotalTime element and set its value to "0".  If the element doesn't exist, the totalTime variable is set to null and nothing is changed.

    XElement totalTime = extendedFilePropertiesXDoc
        .Elements(x + "Properties")
        .Elements(x + "TotalTime")
        .FirstOrDefault();
    if (totalTime != null)
        totalTime.Value = "0";

The way that this works is that the following expression returns a collection that contains either 0 or 1 elements:

    extendedFilePropertiesXDoc
        .Elements(x + "Properties")
        .Elements(x + "TotalTime")

Then we "dot" into the FirstOrDefault extension method.  If there is an element in the collection, FirstOrDefault returns it (thereby converting the collection into a single element).  If the collection is empty, FirstOrDefault returns the default value for the item type of the collection.  XElement is a reference type, and the default value of a reference type is null, so this idiom returns either the element of interest, or null if it doesn't exist.
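
The FirstOrDefault idiom can be wrapped in a small method and exercised against both cases.  This is only a sketch: the class name, the ResetTotalTime method, and its bool return value are hypothetical additions for illustration, and the documents are built in memory.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class FirstOrDefaultDemo
{
    static readonly XNamespace x =
        "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";

    // Sets Properties/TotalTime to "0" when it exists.
    // Returns true if the element was found, false otherwise.
    public static bool ResetTotalTime(XDocument doc)
    {
        var totalTime = doc
            .Elements(x + "Properties")
            .Elements(x + "TotalTime")
            .FirstOrDefault();   // null when the collection is empty
        if (totalTime == null)
            return false;
        totalTime.Value = "0";
        return true;
    }

    public static void Main()
    {
        var withTotalTime = new XDocument(
            new XElement(x + "Properties",
                new XElement(x + "TotalTime", "12")));
        var withoutTotalTime = new XDocument(
            new XElement(x + "Properties"));

        Console.WriteLine(ResetTotalTime(withTotalTime));
        Console.WriteLine(ResetTotalTime(withoutTotalTime));
        Console.WriteLine(withTotalTime.Root.Element(x + "TotalTime").Value);
    }
}
```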

The following code uses extension methods in a similar way.  The Nodes call returns all child nodes of the dc:creator element, and the OfType<T> extension method then selects just the XText nodes.  The result is either a collection of text nodes, which we set to the empty string, or an empty collection, in which case the loop body never runs and no exception is thrown.

    foreach (var textNode in coreFilePropertiesXDoc.Elements(cp + "coreProperties")
                                                   .Elements(dc + "creator")
                                                   .Nodes()
                                                   .OfType<XText>())
        textNode.Value = "";
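
Here is the same loop as a runnable sketch.  The class name, the ClearCreator method, and the sample document are hypothetical; the document stands in for the core file properties part.  Note that the dc:creator element itself stays in the tree, only its text is emptied.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class ClearTextDemo
{
    static readonly XNamespace cp =
        "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
    static readonly XNamespace dc = "http://purl.org/dc/elements/1.1/";

    // Empties the text of coreProperties/creator; a no-op if the element is absent.
    public static void ClearCreator(XDocument doc)
    {
        foreach (var textNode in doc.Elements(cp + "coreProperties")
                                    .Elements(dc + "creator")
                                    .Nodes()
                                    .OfType<XText>())
            textNode.Value = "";
    }

    public static void Main()
    {
        var core = new XDocument(
            new XElement(cp + "coreProperties",
                new XElement(dc + "creator", "Some Author"),
                new XElement(cp + "revision", "7")));
        var noCreator = new XDocument(
            new XElement(cp + "coreProperties"));

        ClearCreator(core);
        ClearCreator(noCreator); // nothing to clear, nothing thrown

        Console.WriteLine(core.Root.Element(dc + "creator").Value.Length);
    }
}
```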

Finally, the following code shows how to add elements at a specific position in a sequence, where the preceding elements may or may not exist:

    // Add the new elements in the right position: after the following three elements
    // (which may or may not exist in the XML document).
    XElement settings = documentSettingsXDoc.Root;
    XElement lastOfTop3 = settings.Elements()
        .Where(e => e.Name == w + "writeProtection" ||
                    e.Name == w + "view" ||
                    e.Name == w + "zoom")
        .InDocumentOrder()
        .LastOrDefault();
    if (lastOfTop3 == null)
    {
        // None of those three exist, so add as first children of the root element.
        settings.AddFirst(
            settings.Elements(w + "removePersonalInformation").Any() ?
                null :
                new XElement(w + "removePersonalInformation"),
            settings.Elements(w + "removeDateAndTime").Any() ?
                null :
                new XElement(w + "removeDateAndTime")
        );
    }
    else
    {
        // At least one of those three exists, so add after the last one.
        lastOfTop3.AddAfterSelf(
            settings.Elements(w + "removePersonalInformation").Any() ?
                null :
                new XElement(w + "removePersonalInformation"),
            settings.Elements(w + "removeDateAndTime").Any() ?
                null :
                new XElement(w + "removeDateAndTime")
        );
    }

We need to add our new elements after any of the specified three elements, so we select for the three elements, sort them in document order, and then get the last one.  If none of the three exist, then lastOfTop3 will be set to null:

    XElement lastOfTop3 = settings.Elements()
        .Where(e => e.Name == w + "writeProtection" ||
                    e.Name == w + "view" ||
                    e.Name == w + "zoom")
        .InDocumentOrder()
        .LastOrDefault();

And then, of course, we want to add our new elements only if they don't already exist, so we use the Elements axis method, and then use the Any extension method to tell us if the element exists or not:

    lastOfTop3.AddAfterSelf(
        settings.Elements(w + "removePersonalInformation").Any() ?
            null :
            new XElement(w + "removePersonalInformation"),
        settings.Elements(w + "removeDateAndTime").Any() ?
            null :
            new XElement(w + "removeDateAndTime")
    );

Using this approach, we add the elements in the right place, and don't add them if they already exist.
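
One way to fold the two branches into a reusable helper is sketched below.  The class name and the EnsureAfter and MarkPersonalInformationRemoved methods are hypothetical, not part of the attached code.  The sketch differs from the listing above in two ways.  It omits InDocumentOrder, because Elements on a single parent already yields children in document order.  It also treats w:removePersonalInformation as a predecessor of w:removeDateAndTime, on the assumption that the schema sequence places the former before the latter, so the order stays correct even when only one of the two elements is already present.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class InsertInOrderDemo
{
    static readonly XNamespace w =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: adds an empty child called `name` unless one exists,
    // placing it after the last existing child whose name is in `predecessors`,
    // or as the first child if none of the predecessors exist.
    public static void EnsureAfter(XElement parent, XName name, params XName[] predecessors)
    {
        if (parent.Elements(name).Any())
            return;
        var anchor = parent.Elements()
            .Where(e => predecessors.Contains(e.Name))
            .LastOrDefault();
        if (anchor == null)
            parent.AddFirst(new XElement(name));
        else
            anchor.AddAfterSelf(new XElement(name));
    }

    public static void MarkPersonalInformationRemoved(XElement settings)
    {
        EnsureAfter(settings, w + "removePersonalInformation",
            w + "writeProtection", w + "view", w + "zoom");
        EnsureAfter(settings, w + "removeDateAndTime",
            w + "writeProtection", w + "view", w + "zoom",
            w + "removePersonalInformation");
    }

    public static void Main()
    {
        var settings = new XElement(w + "settings",
            new XElement(w + "view"),
            new XElement(w + "zoom"),
            new XElement(w + "defaultTabStop"));

        MarkPersonalInformationRemoved(settings);
        MarkPersonalInformationRemoved(settings); // second call adds nothing

        Console.WriteLine(string.Join(", ",
            settings.Elements().Select(e => e.Name.LocalName)));
        // prints: view, zoom, removePersonalInformation, removeDateAndTime, defaultTabStop
    }
}
```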

Some people don't realize that calling the Any extension method is a very cheap operation.  Any doesn't iterate through its entire source.  It iterates only until it finds one item in the source collection, in which case it returns true, or until it sees that the source collection is empty, in which case it returns false.
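
A quick way to convince yourself of this is to count how many items a lazy source actually produces before Any returns.  The sketch below uses a hypothetical iterator, CountingSource, that increments a counter each time it yields an item; nothing here is specific to LINQ to XML.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AnyDemo
{
    // Number of items the source has produced so far.
    public static int Pulled;

    // Hypothetical lazy source that records every item it yields.
    public static IEnumerable<int> CountingSource(int count)
    {
        for (int i = 0; i < count; i++)
        {
            Pulled++;
            yield return i;
        }
    }

    // Returns how many items Any() pulled from a source of the given size.
    public static int PulledByAny(int count)
    {
        Pulled = 0;
        CountingSource(count).Any();
        return Pulled;
    }

    public static void Main()
    {
        Console.WriteLine(PulledByAny(1000000)); // prints 1
        Console.WriteLine(PulledByAny(0));       // prints 0
    }
}
```

Even with a million items available, Any stops after the first one.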

Once you get your head around these idioms, you can write short and robust code to modify XML trees.

The code attached to this page also contains a bool function that tells you whether a document contains personal information.

Code is attached.

RemovePersonalInformation.cs