We can potentially make extensive modifications to Open XML word processing documents. Many scenarios would benefit from the ability to move paragraphs around (both within a document and between documents), insert paragraphs, and delete paragraphs. In practice, this can be daunting because paragraphs often contain markup that refers to something outside of the paragraph. For example, a paragraph might contain markup indicating that a comment begins inside that paragraph. If the comment ends in a later paragraph, that later paragraph contains the markup indicating the end of the comment. The comment markup also contains a reference, with a zero-based index, to another Open XML part (the Comments part, of course). There are a fair number of cases where markup in a paragraph is related to markup outside the paragraph, including comments, bookmarks, and images. When you insert, delete, and move paragraphs, you must keep this markup in sync. This isn’t easy.
(Update March 27, 2009 - updated the Headers and Footers section, and added the Charts section.)
This blog post details much of what you must do to keep such inter-related markup in sync. It also presents some sample code (from the PowerTools for Open XML open source project) that shows one approach to solving this problem. Example code like this is useful: it can serve to educate about the Open XML specification, and perhaps inspire other innovative approaches to solving this problem.
Note from Eric White: This is a guest post written by Bob McClellan, who is one of the open source developers on the PowerTools for Open XML project on CodePlex. This code is the basis of some new cmdlets that enable composability of documents: take any number of source documents, specify some subset of each source document, and compose a new, valid Open XML document. The code takes care of a myriad of details to make sure that the new document is valid.
This is one of the most important Open XML posts that I’ve put on my blog. The details handled by the code presented in this post enable true document composability.
The gist of the approach presented here is that we don’t actually modify any existing documents. Instead, we start with any number of source documents, extract an arbitrary number of paragraphs (from any location in each document), and assemble a new, valid Open XML document. If a paragraph contains a reference to an image, the image is moved to the new package, and the appropriate fix-ups are applied to the markup that refers to the image. Ditto for comments, and many other cases. We can use this basic function to do other interesting things – to delete a range of paragraphs in the middle of the document, we can specify the source document twice, first specifying a range up to the section that we want to delete, and then specifying a range of paragraphs after the deleted range. After assembling the new document, we have a valid Open XML word processing document with the unwanted paragraphs removed.
In the next section, I will explain how to use the example code. The final section explains generally how the code in the example works.
To get this example, download it from the CodePlex web site for the “PowerTools for Open XML” project. (The PowerTools contain a number of PowerShell cmdlets that manipulate Open XML documents.) The link is www.CodePlex.com/PowerTools (then click on the “Releases” tab, and look for “DocumentBuilder.zip” under “Downloads and Files”). The download is a compressed file containing the entire folder structure, including the source code, sample documents, and the Visual Studio solution and project. You can open the solution file in any version of Visual Studio 2008 that supports C#, including Visual C# 2008 Express.
This code also uses the Open XML SDK, either Version 1 or Version 2. To use the code, you need to download and install one of those versions of the SDK, and you may need to update the assembly reference in the project.
Running the Example
The example program creates a few new documents (named Test1.docx, Test2.docx, and Test3.docx) from existing documents (named Source1.docx and Source2.docx). After running the example, Test1.docx contains a few paragraphs extracted from an existing document. Test2.docx contains a set of paragraphs from the beginning and from the end of an existing document (omitting some paragraphs in the middle). Test3.docx concatenates two existing documents into a new document. This is a more generalized and effective approach for assembling documents than using altChunk.
Note: the source documents are located in the bin/debug directory under the project. The created documents are placed in the same directory. This allows you to simply open the project, fix up the reference to the Open XML SDK (if necessary), and run the program to see the newly created documents.
Each test calls the DocumentBuilder.BuildDocument method to create the new document. The code for Test2 looks like this:
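using (WordprocessingDocument part1 = WordprocessingDocument.Open(...))
{
    List<Source> sources = new List<Source>();
    sources.Add(new Source(part1, 0, 12, true));  // 12 body children, starting at index 0
    sources.Add(new Source(part1, 49, true));     // everything from index 49 onward
    DocumentBuilder.BuildDocument(sources, "Test2.docx");
}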
The first line calls the Open XML SDK to open the source document. Then, the example creates a list of objects of type OpenXml.PowerTools.Source. This collection defines which groups of paragraphs, or entire documents, will be used to create a new document. In this case, two groups of paragraphs from the same document are used. The first Source object is created with a “start” value of 0 and a “count” of 12. That means that twelve paragraphs will be extracted from the existing document, starting with the first paragraph. The starting paragraph is numbered 0, so if you are used to thinking of the first paragraph as paragraph 1, just subtract one when specifying the starting paragraph value. The second Source object specifies only the “start” value, so it extracts all the remaining paragraphs starting at that value. The call to BuildDocument creates the new document, named “Test2.docx” in this case.
NOTE: Although I have been describing the source parameters as referring to paragraphs, that is not strictly true. They actually refer to elements that are children of the “body” element. Although paragraphs are the most common, child elements of the body element may include tables, various range elements (start and end ranges often appear as siblings of paragraphs), content controls and others (e.g. Math paragraphs). For more complex documents, you will probably need to view the XML directly in order to determine which values to use for start and count. Of course, these values can also be determined programmatically by scanning the XML.
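To make the “start” and “count” semantics concrete, here is a minimal sketch that uses only LINQ to XML on the main document part's XML. It is not the DocumentBuilder code: the class name BodySlices and its methods are hypothetical, and the sketch assumes you already have the main part loaded as an XDocument.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of PowerTools.
public static class BodySlices
{
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns 'count' children of w:body, starting at the zero-based index 'start'.
    public static IEnumerable<XElement> Slice(XDocument mainPart, int start, int count)
    {
        return mainPart.Root.Element(W + "body").Elements().Skip(start).Take(count);
    }

    // Returns every child of w:body from the zero-based index 'start' onward.
    public static IEnumerable<XElement> Slice(XDocument mainPart, int start)
    {
        return mainPart.Root.Element(W + "body").Elements().Skip(start);
    }

    // Lists each body child with its index, which helps when choosing
    // start and count values for documents that contain more than paragraphs.
    public static IEnumerable<string> Describe(XDocument mainPart)
    {
        return mainPart.Root.Element(W + "body").Elements()
            .Select((e, i) => i + ": " + e.Name.LocalName);
    }
}
```

Deleting a range from the middle of a document then amounts to concatenating two slices of the same source, for example Slice(doc, 0, 12).Concat(Slice(doc, 49)), which mirrors the two Source objects used for Test2. Note that the open-ended slice also picks up a body-level sectPr element if one is present, since it is a child of the body like any other.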
How it Works
The source code for this example consists of three files. Program.cs contains the Main function, which makes calls to BuildDocument. DocumentExtensions.cs contains extensions to the WordprocessingDocument class that handle reading and writing the document parts (using the GetXDocument and FlushParts extension methods). The code in DocumentExtensions.cs will be covered in detail in a future blog post. DocumentBuilder.cs contains the BuildDocument method, two versions of BuildOpenDocument, and supporting methods. It also defines the Source class used to declare the desired sources for BuildDocument. Each of these code sections is described below in general terms. If you have more specific questions about the code, you are invited to ask them in the comments on this post.
The first part of DocumentBuilder.cs defines the Source class. The various constructors for the class all result in the same internal structure. The member variables are a source document, the desired contents of that document (expressed as the starting paragraph number, and count of paragraphs in the constructor), and a Boolean value indicating if the final section divider should be retained. Each constructor uses a LINQ-to-XML query to define the desired contents. Note that these contents are not actually stored in the variable, but that the query itself is stored. The elements of the query will only be retrieved as needed during processing.
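The idea of storing the query rather than its results can be illustrated with a small stand-in class. This is only a sketch of the deferred-execution behavior: LazySource is a hypothetical name, and it takes a body element directly instead of a WordprocessingDocument, which the real Source class does not.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical stand-in for the Source class; not the PowerTools implementation.
public class LazySource
{
    public IEnumerable<XElement> Contents { get; private set; }
    public bool KeepSection { get; private set; }

    public LazySource(XElement body, int start, int count, bool keepSection)
    {
        // Only the query is stored here; no elements are enumerated yet.
        Contents = body.Elements().Skip(start).Take(count);
        KeepSection = keepSection;
    }

    public LazySource(XElement body, int start, bool keepSection)
    {
        // Open-ended variant: everything from 'start' to the end of the body.
        Contents = body.Elements().Skip(start);
        KeepSection = keepSection;
    }
}
```

One consequence of this design is that the query reflects the state of the XML tree at the time it is enumerated, not at the time the object was constructed. If a snapshot is needed instead, calling ToList() on Contents produces one.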
The next part defines the XML namespace strings used in Wordprocessing documents. These are used throughout the code to refer to various elements and attributes. Next are the public functions, BuildDocument and BuildOpenDocument. BuildDocument is used in the example and explained above. There are two overloads of BuildOpenDocument. The one that creates a file-based document is used by some of the PowerTools cmdlets. The other shows how to use a MemoryStream to create a document (which could be useful when using a SharePoint document library instead of the file system for document storage). All of the public functions call a common internal function, DoBuildDocument. Although the code may look fairly complex, it can be broken down into a few general steps.
1. Create the new document. A main document part must be created with “document” and “body” XML elements.
2. Copy parts from the first source document. A number of parts are copied from the first source document (e.g. styles) so that the new document will generally look like the first document in the list of the source documents - the first source document is essentially the “master” of the new document. The parts that are copied are core, extended and custom file properties, settings, web settings, styles, font tables, and any theme. The settings part often contains references to footnotes and endnotes that must be copied (e.g. separators). The private functions that handle these operations are CopyStartingParts, CopyFootnotes and CopyEndnotes.
3. Fix ranges. There are a number of elements that come in pairs and “mark” everything within the two elements as part of that range. Since these pairs can be “broken” by extracting only part of the range, this step fixes any broken ranges. See below for more details.
4. Copy references. Many elements refer to other document parts (e.g. images) or elements within other parts (e.g. comments). Each of these must be properly copied to the new document. Many must be translated during that copy. For example, a comment that is copied to a new document may not be able to use the same ID that it had in the original since it could be the same as the ID of another comment that was copied to the new document. See below for more details.
5. Append the source content. The corrected source contents can now be appended to the new document. The only special case here is the handling of the final section property element (sectPr). The sectPr element can appear either as a child of the body element or as a child of a paragraph (p) element. The first case should only occur once as the last child element of the body. It is essentially the default section for the document. If there are any other section breaks in the document, those will appear as sectPr elements in the last paragraph before the section break. When the code is putting together paragraphs from separate documents, it still has to be sure that these rules are followed. There are two ways to handle multiple sections – either keep them or not. If the sectPr at the end of a document is going to be kept and there are more paragraphs being concatenated after it, then the sectPr that used to end the original document must be moved into its last paragraph instead. If it is not going to be kept, then it must be left out of the new document.
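The sectPr handling in step 5 can be sketched as follows. This is a simplified illustration rather than the DocumentBuilder code: the class and method names are hypothetical, the input is assumed to be the full list of body children from one source, and the sketch appends sectPr as the last child of pPr, which is only correct when the paragraph properties contain no pPrChange element.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of PowerTools.
public static class SectionFixup
{
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns clones of 'bodyChildren' that are safe to append before more content.
    public static List<XElement> PrepareForAppend(
        IEnumerable<XElement> bodyChildren, bool keepSection)
    {
        // Clone so that the source tree is never modified.
        List<XElement> content = bodyChildren.Select(e => new XElement(e)).ToList();
        XElement last = content.LastOrDefault();
        if (last == null || last.Name != W + "sectPr")
            return content;                 // no body-level sectPr to deal with

        content.RemoveAt(content.Count - 1);
        if (!keepSection)
            return content;                 // drop the final section divider

        // A mid-document section break lives in the pPr of the section's last
        // paragraph. Add an empty paragraph if there is no usable one.
        XElement lastPara = content.LastOrDefault();
        if (lastPara == null || lastPara.Name != W + "p" ||
            lastPara.Elements(W + "pPr").Elements(W + "sectPr").Any())
        {
            lastPara = new XElement(W + "p");
            content.Add(lastPara);
        }
        XElement pPr = lastPara.Element(W + "pPr");
        if (pPr == null)
        {
            pPr = new XElement(W + "pPr");
            lastPara.AddFirst(pPr);         // pPr must be the first child of w:p
        }
        pPr.Add(last);                      // assumes no w:pPrChange follows
        return content;
    }
}
```

The final source in the list would skip this step, so that its sectPr (or the one taken from the first document) remains the single body-level sectPr at the end of the new document.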
The rest of the DocumentBuilder.cs file contains the code to support these steps. Here is a detailed explanation of what is happening in each of the “fix” and “copy” operations. The name or names in parentheses are the names of the functions that handle each operation.
Fix Ranges (FixRanges, FixRange, DeleteUnmatchedRange)
Ranges can appear either inside or outside of paragraphs. The function FixRanges makes all the calls necessary to fix ranges in the document. The FixRange function makes sure that there is a matching start and end, based on an identifying attribute. The names of the start and end elements are also passed to the function. The function iterates through all the start elements and uses a LINQ to XML query to try to find the matching end element. If it is not found, then the matching end element is copied from the old document into the last paragraph of those being extracted. The same process is then performed for each end element to see if it matches a start element. There is one special case for comment ranges. A comment range must also contain a reference element to that same comment. If the reference element was not included in the paragraphs being copied, then one is created.
The DeleteUnmatchedRange function is used to delete “move from” ranges that don’t have a matching “move to” range and vice versa. These ranges should come in pairs with the same “name” attribute. If there isn’t a complete pair, then this function deletes the range. These types of ranges are not necessary for a document, but they give additional information about how the document has been changed. Leaving in an unmatched range will not make the document invalid, but it can be confusing when viewed in Microsoft Word.
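The start-to-end matching that FixRange performs can be sketched like this. It is a simplified illustration, not the actual function: the signature is an assumption, only the pass from start elements to end elements is shown (the reverse pass and the comment reference special case are omitted), and the extracted content is assumed to be a list of cloned body children.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not the PowerTools implementation.
public static class RangeFixup
{
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // For every start element in 'extracted' without a matching end element,
    // copies the end element from 'sourceBody' into the last extracted paragraph.
    public static void FixRange(XElement sourceBody, List<XElement> extracted,
        XName startName, XName endName, XName idAttribute)
    {
        foreach (XElement start in extracted.DescendantsAndSelf(startName).ToList())
        {
            string id = (string)start.Attribute(idAttribute);
            bool hasEnd = extracted.DescendantsAndSelf(endName)
                .Any(e => (string)e.Attribute(idAttribute) == id);
            if (hasEnd)
                continue;

            // Look up the missing end element in the original document.
            XElement end = sourceBody.Descendants(endName)
                .FirstOrDefault(e => (string)e.Attribute(idAttribute) == id);
            XElement lastPara = extracted.LastOrDefault(e => e.Name == W + "p");
            if (end != null && lastPara != null)
                lastPara.Add(new XElement(end));   // add a copy, leave the source intact
        }
    }
}
```

For bookmarks, a call might look like FixRange(body, extracted, W + "bookmarkStart", W + "bookmarkEnd", W + "id"). Because the check is repeated on every call, running the function twice does not add a second end element.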
Styles (CopyReferences, MergeStyles)
It is very complicated to determine what styles are referenced in a document, especially since styles can reference other styles. It is also generally desirable to include all the styles from a document, even if they are not referenced. However, it is not valid for two styles with the same name to appear in a document. The MergeStyles function copies all styles that do not have a matching name in the new document. That means that the first appearance of a style name in the source documents will be the one used in the new document.
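The “first one wins” merge can be sketched in a few lines of LINQ to XML. This is an illustration only, with two labeled assumptions: a style's name is taken to be the w:val attribute of its w:name child element, and the arguments are the root w:styles elements of the source and new styles parts. Collisions between style IDs are not handled here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not the PowerTools implementation.
public static class StyleMerge
{
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Assumption: the style "name" is the w:val of the w:name child element.
    private static string StyleName(XElement style)
    {
        return (string)style.Elements(W + "name").Attributes(W + "val").FirstOrDefault();
    }

    // Copies each style from 'fromStyles' whose name is not yet in 'toStyles'.
    public static void MergeStyles(XElement fromStyles, XElement toStyles)
    {
        var existing = new HashSet<string>(
            toStyles.Elements(W + "style")
                .Select(s => StyleName(s))
                .Where(n => n != null));

        foreach (XElement style in fromStyles.Elements(W + "style"))
        {
            string name = StyleName(style);
            // HashSet.Add returns false when the name is already present.
            if (name != null && existing.Add(name))
                toStyles.Add(new XElement(style));
        }
    }
}
```

Calling MergeStyles once per source document, in list order, gives the behavior described above: the first document to define a given style name determines what that style looks like in the new document.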
Font Tables (CopyReferences, MergeFontTables)
Font tables are handled in the same way as styles. The first appearance of a font with a particular name is the one that is kept in the new document.
Footnotes and Endnotes (CopyFootnotes, CopyEndnotes)
There are, in general, two kinds of references: references to elements within a particular part, and references to other parts. Footnotes and endnotes are the first type. CopyFootnotes and CopyEndnotes iterate through all footnote or endnote references and then copy each referenced footnote or endnote element from the source document to the new document. The ID of the new element may need to be changed so that IDs from the different source documents do not overlap. Rather than changing numbers only when there is a conflict, the code renumbers all the elements in the new document, starting at zero. The references to those elements in the main document must be changed to match.
Comments work very much like footnotes and endnotes: all comments appear in a single part. The one difference is that comments can be referenced from range elements as well as from a “commentReference” element. Since a “commentReference” element must always appear in the document, those elements are used to determine which comments to copy; any range elements that appear in the new document are then changed to the same ID number.
Images are the second type of reference: each image reference in the main document refers to its own separate part in the package. Image parts are binary data rather than XML documents, so they must be copied using block read and write operations. The Open XML SDK automatically generates a unique ID for the copied part, and the reference within the new document must be updated to use that new ID.
Numbering is the most complex copy because the numbering part contains both numbering and abstract numbering definitions. The properties defining the numbering are in the abstract element, but some can be overridden in the numbering definition. The main document only refers to numbering elements. The example code tries to reuse abstract numbering definitions that are the same (based on the “nsid” attribute), and it also tries to reuse numbering definitions that refer to the same abstract numbering definitions, as long as they don’t have any override elements. (This approach is very similar to Microsoft Word’s method for copying numbering elements when you copy and paste numbered paragraphs from one document to another.)
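The renumbering of footnotes can be sketched as follows. This is a simplified illustration, not CopyFootnotes itself: the signature is hypothetical, the separator footnotes referenced from the settings part are ignored, and the new footnotes part is assumed to be populated only by this routine, so that its existing IDs already run from zero upward.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not the PowerTools implementation.
public static class FootnoteCopy
{
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Copies each footnote referenced from 'content' into 'newFootnotes',
    // assigning sequential IDs and rewriting the references to match.
    public static void CopyFootnotes(IEnumerable<XElement> content,
        XElement sourceFootnotes, XElement newFootnotes)
    {
        // Assumption: existing IDs in the new part are 0..n-1.
        int nextId = newFootnotes.Elements(W + "footnote").Count();

        foreach (XElement reference in
            content.DescendantsAndSelf(W + "footnoteReference").ToList())
        {
            string oldId = (string)reference.Attribute(W + "id");
            XElement footnote = sourceFootnotes.Elements(W + "footnote")
                .FirstOrDefault(f => (string)f.Attribute(W + "id") == oldId);
            if (footnote == null)
                continue;                       // dangling reference; leave it alone

            XElement copy = new XElement(footnote);
            copy.SetAttributeValue(W + "id", nextId);
            newFootnotes.Add(copy);
            reference.SetAttributeValue(W + "id", nextId);   // keep the reference in sync
            nextId++;
        }
    }
}
```

Because every copied footnote receives a fresh ID, two source documents that both use the same footnote ID do not collide in the new document. The same pattern would apply to endnotes with the corresponding element names.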
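The reuse rules for numbering can be sketched as a function that maps a numId from a source numbering part to an equivalent numId in the new numbering part. This is an illustration under simplifying assumptions, not the code from the example: the names are hypothetical, the arguments are the root w:numbering elements, and the sketch ignores numPicBullet and numIdMacAtCleanup elements, which constrain where new elements may be placed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not the PowerTools implementation.
public static class NumberingCopy
{
    public static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    private static string Nsid(XElement abstractNum)
    {
        return (string)abstractNum.Elements(W + "nsid").Attributes(W + "val").FirstOrDefault();
    }

    // Returns the numId in 'target' that is equivalent to 'sourceNumId' in 'source'.
    public static int MapNumId(XElement source, XElement target, int sourceNumId)
    {
        XElement srcNum = source.Elements(W + "num")
            .First(n => (int)n.Attribute(W + "numId") == sourceNumId);
        int srcAbsId = (int)srcNum.Element(W + "abstractNumId").Attribute(W + "val");
        XElement srcAbs = source.Elements(W + "abstractNum")
            .First(a => (int)a.Attribute(W + "abstractNumId") == srcAbsId);
        string nsid = Nsid(srcAbs);

        // Reuse an abstract definition with the same nsid, else copy it under a new ID.
        XElement tgtAbs = nsid == null ? null :
            target.Elements(W + "abstractNum").FirstOrDefault(a => Nsid(a) == nsid);
        if (tgtAbs == null)
        {
            int newAbsId = target.Elements(W + "abstractNum")
                .Select(a => (int)a.Attribute(W + "abstractNumId"))
                .DefaultIfEmpty(-1).Max() + 1;
            tgtAbs = new XElement(srcAbs);
            tgtAbs.SetAttributeValue(W + "abstractNumId", newAbsId);
            // abstractNum elements must come before num elements.
            XElement lastAbs = target.Elements(W + "abstractNum").LastOrDefault();
            if (lastAbs != null) lastAbs.AddAfterSelf(tgtAbs); else target.AddFirst(tgtAbs);
        }
        int tgtAbsId = (int)tgtAbs.Attribute(W + "abstractNumId");

        // Reuse a num that points at the same abstract definition, unless overrides exist.
        if (!srcNum.Elements(W + "lvlOverride").Any())
        {
            XElement reuse = target.Elements(W + "num").FirstOrDefault(n =>
                (int)n.Element(W + "abstractNumId").Attribute(W + "val") == tgtAbsId &&
                !n.Elements(W + "lvlOverride").Any());
            if (reuse != null)
                return (int)reuse.Attribute(W + "numId");
        }
        int newNumId = target.Elements(W + "num")
            .Select(n => (int)n.Attribute(W + "numId")).DefaultIfEmpty(0).Max() + 1;
        XElement tgtNum = new XElement(srcNum);
        tgtNum.SetAttributeValue(W + "numId", newNumId);
        tgtNum.Element(W + "abstractNumId").SetAttributeValue(W + "val", tgtAbsId);
        target.Add(tgtNum);
        return newNumId;
    }
}
```

The returned value would then replace the numId referenced from each copied paragraph. Mapping the same source numId twice returns the same target numId, so paragraphs that belonged to one list in the source remain in one list in the new document.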
Headers and Footers (CopyHeaders, CopyFooters, CopyHeaderShapes, CopyHeaderEmbeddedObjects, CopyHeaderImages, CopyFooterShapes, CopyFooterEmbeddedObjects, CopyFooterImages)
Like images, each header or footer is its own part. To add to the complexity, these parts may also contain references to images, shapes and embedded objects. Those additional references are copied just like the ones in the main document, except that the parts are created within the context of the header or footer. In other words, the header and footer parts are essentially small documents of their own. Since the header and footer parts are XML, they can be copied as XDocuments.
Diagram elements contain four attributes that refer to other parts in the document. The “dm” attribute refers to a DiagramDataPart, the “lo” attribute refers to a DiagramLayoutDefinitionPart, the “qs” attribute refers to a DiagramStylePart and the “cs” attribute refers to a DiagramColorsPart. All four of these parts are in XML and can be copied as XDocuments.
Shapes are just another case of references to separate parts. Those parts are copied as XDocuments.
Embedded Objects (CopyEmbeddedObjects)
Embedded objects are referenced very much like images. The referenced parts are binary data and must be copied using block read and write operations.
Custom Control Data (CopyCustomXml)
Custom controls can refer to XML data that appears in a separate “data store” part. Multiple controls may refer to the same data store part, but there can also be more than one data store part. Each data store part is associated with a GUID that is referenced from the custom control in the main document. The CopyCustomXml function creates a list of all the GUIDs that are referenced and then copies each of the corresponding data store parts from the source to the new document. The data store parts are made up of a CustomXmlPart and a CustomXmlPropertiesPart, so both must be copied along with the appropriate changes to IDs.
Hyperlinks actually refer to an “external relationship,” rather than referring to any part within the document. These external relationships are copied from the source document for each hyperlink that is being copied to the new document. Hyperlinks can be referenced from “hyperlink” or “imagedata” elements, so both of those cases are handled in the CopyHyperlinks function.
Like headers and footers, each chart is its own part, which can be copied as an XDocument. In addition, each chart usually contains a reference to an embedded object with the data for the chart. These embedded objects are copied as binary data.
There are some documents that the example cannot handle correctly. These cases are rare and difficult to handle, so they were left out to simplify the example. Here is a list of those special cases:
1. Custom XML can be embedded in the body of a document, but this completely changes the hierarchy of elements in the resulting XML. That type of source document would require a different method for defining the paragraphs to extract. Custom XML probably wouldn’t typically be used in this type of document assembly solution, so I decided it did not need to be handled for this example.
2. The Glossary part is not copied. It is possible to create a header, for example, that refers to an entry in the glossary. If you use the code to copy from that type of document, the header text could be blank or partially blank as a result. However, it will not cause an error when opening the document in Microsoft Word.
3. Copying of themes is not fully implemented. If the theme contains references to images, then the resulting document will be invalid. This problem can be avoided by removing the theme or choosing a document without images in the theme as the first document. I plan to solve this problem in a future update.
4. This example was tested with Microsoft Word 2007. I haven’t done any testing with Office 2003.
I hope this example shows how to handle some of the complexities of the Open XML standard for word processing documents. I also hope that having the detailed code will help you perform more of the document manipulations that you need in order to take advantage of the Open XML standard.