Inserting / Deleting / Moving Paragraphs in Open XML Wordprocessing Documents

One of the most common scenarios for Open XML is programmatically adding, deleting, and moving paragraphs in a word processing document.  A variation on this is moving or copying paragraphs from one document to another.  This programming task is complicated by the need to keep other parts of the document in sync with the data stored in paragraphs.  For example, a paragraph can contain a reference to a comment in the comments part, and if there is a problem with this reference, the document is invalid.  You must take care when moving / inserting / deleting paragraphs to maintain ‘referential integrity’ within the document.  If you are making a tool to manipulate paragraphs, then this post lists some of the constraints that you must pay attention to.

(Update Feb 6, 2009 - The code to move/insert/delete paragraphs has been completed.  This post introduces the code, and tells where to download the code from the PowerTools for Open XML project.)

(Update March 24, 2009 - This post was updated with details of more cases of interrelated markup.)

As an example, if the comment ID has a duplicate elsewhere in the document, or if the comment ID is greater than the number of comments in the comments part, the document is invalid.  If the paragraph refers to a style that isn’t in the styles part, the document will not render as expected.

There are two types of markup that we need to pay attention to when moving paragraphs: those where markup spans multiple paragraphs, such as bookmarks or hyperlinks, and those where the paragraph contains a reference to something outside of the paragraph, such as a footnote or an image.  In some cases, such as comments, we must deal with both types of markup: comment markup can span paragraphs, and comments have an external reference to the comments part.

I have a goal of augmenting the PowerTools for Open XML to enable more sophisticated document modification tasks, such as using a template document as a source of ‘boilerplate’ information, and moving paragraphs from that template document to other documents as required.  In addition, I want to make it easier to add or delete paragraphs using PowerShell.  To implement this, I need to have a strategy for maintaining the integrity of documents.  The information presented in this post is the first step in putting this together.

(Update March 23, 2009 - PowerTools for Open XML v1.1 has been released.  This version of the PowerTools contains two new cmdlets, Merge-OpenXmlDocument and Select-OpenXmlString, which enable composition of a new document from existing documents, while addressing the issues of interrelated markup as detailed in this post.)

The list presented in this post probably isn’t complete – I’ll update this list with new items as necessary.

Comments

A paragraph that contains a comment must reference a valid, existing comment.  The comment w:id attributes must be unique.  In the following example, there must not be another comment that has id == “0”.

<w:p>
  <w:r>
    <w:t xml:space="preserve">On the Insert tab, the </w:t>
  </w:r>
  <w:commentRangeStart w:id="0"/>
  <w:r>
    <w:t xml:space="preserve">galleries </w:t>
  </w:r>
  <w:commentRangeEnd w:id="0"/>
  <w:r>
    <w:rPr>
      <w:rStyle w:val="CommentReference"/>
    </w:rPr>
    <w:commentReference w:id="0"/>
  </w:r>
  <w:r>
    <w:t>include items that are designed to coordinate with the overall look of your document.</w:t>
  </w:r>
</w:p>

In addition, as shown in the following example, the commentRangeStart element may be in a different paragraph from the commentRangeEnd element.  If, for example, you delete the paragraph that contains the commentRangeStart, but the commentReference and commentRangeEnd elements still exist, then the document isn’t valid.  This doesn’t prevent Word 2007 from opening the document, but we should fix up these elements if deleting or moving paragraphs.

<w:p>
  <w:r>
    <w:t xml:space="preserve">On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your </w:t>
  </w:r>
  <w:commentRangeStart w:id="0"/>
  <w:r>
    <w:t>document.</w:t>
  </w:r>
</w:p>
<w:p>
  <w:r>
    <w:t xml:space="preserve">You </w:t>
  </w:r>
  <w:commentRangeEnd w:id="0"/>
  <w:r>
    <w:rPr>
      <w:rStyle w:val="CommentReference"/>
    </w:rPr>
    <w:commentReference w:id="0"/>
  </w:r>
  <w:r>
    <w:t>can use these galleries to insert tables, headers, footers, lists, cover pages, and other document building blocks.</w:t>
  </w:r>
</w:p>
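
The constraints on comment markup can be checked mechanically before and after a paragraph is moved or deleted.  What follows is only a minimal sketch in Python using the standard library's ElementTree, not the PowerTools implementation.  The function names (ids_of, find_comment_problems) and the known_comment_ids input, which stands in for the ids read from the comments part, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def ids_of(root, local_name):
    """Count the w:id values of every w:<local_name> element under root."""
    tag = f"{{{W}}}{local_name}"
    ids = (element.get(f"{{{W}}}id") for element in root.iter(tag))
    return Counter(i for i in ids if i is not None)


def find_comment_problems(body, known_comment_ids):
    """Return human-readable problems with the comment markup in body.

    body is a parsed w:body element; known_comment_ids is the set of
    w:id values present in the comments part (an assumed input).
    """
    starts = ids_of(body, "commentRangeStart")
    ends = ids_of(body, "commentRangeEnd")
    refs = ids_of(body, "commentReference")
    problems = []
    for cid in sorted(set(starts) | set(ends) | set(refs)):
        if cid not in known_comment_ids:
            problems.append(f"comment {cid}: no such comment in the comments part")
        if starts[cid] != ends[cid]:
            problems.append(f"comment {cid}: range start and end are not paired")
        if max(starts[cid], ends[cid], refs[cid]) > 1:
            problems.append(f"comment {cid}: id is used more than once")
        if refs[cid] == 0:
            problems.append(f"comment {cid}: commentReference is missing")
    return problems


# The second paragraph of the example above, as it would look after the
# paragraph holding commentRangeStart was deleted.
ORPHANED = f"""<w:body xmlns:w="{W}">
  <w:p>
    <w:r><w:t xml:space="preserve">You </w:t></w:r>
    <w:commentRangeEnd w:id="0"/>
    <w:r><w:commentReference w:id="0"/></w:r>
  </w:p>
</w:body>"""

if __name__ == "__main__":
    for problem in find_comment_problems(ET.fromstring(ORPHANED), {"0"}):
        print(problem)
```

A tool could run a check like this on the destination document after each move, and then either remove the orphaned elements or re-create the missing half of the pair.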

Styles

A paragraph must refer to a valid style that exists in the styles part.  If you copy a paragraph from one document to another, and the new document doesn’t contain a style with the specified ID (“Heading1” in the following example), then the document will not render as you expect.

In my experiments, if you have a paragraph that refers to a non-existent style, Word 2007 still opens the document, and the paragraph reverts to the default style.  However, this isn’t the behavior that we want.  Typically, when moving a paragraph from one document to another, you would want the paragraph to take on the formatting of either the source or the destination document.  Alternatively, you could create a new style with a different name, so that moved paragraphs retain their styling.

<w:p>
  <w:pPr>
    <w:pStyle w:val="Heading1"/>
  </w:pPr>
  <w:r>
    <w:t>Overview</w:t>
  </w:r>
</w:p>
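
Before copying a paragraph into another document, it helps to know which styles the destination is missing.  Here is a rough sketch of that check in Python, assuming both the paragraph markup and the destination styles part are already parsed.  The name missing_style_ids is hypothetical, and the sketch only looks at w:pStyle and w:rStyle, so table styles and other references are left out.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def missing_style_ids(body, styles):
    """Return the style ids that body references but styles does not define."""
    defined = {
        style.get(f"{{{W}}}styleId") for style in styles.iter(f"{{{W}}}style")
    }
    referenced = set()
    # w:pStyle (paragraph) and w:rStyle (run) both name a style through w:val.
    for local_name in ("pStyle", "rStyle"):
        for reference in body.iter(f"{{{W}}}{local_name}"):
            referenced.add(reference.get(f"{{{W}}}val"))
    referenced.discard(None)
    return sorted(referenced - defined)


# The paragraph from the example above, wrapped in a body element.
MOVED = f"""<w:body xmlns:w="{W}">
  <w:p>
    <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
    <w:r><w:t>Overview</w:t></w:r>
  </w:p>
</w:body>"""

# A made-up destination styles part that defines only the Normal style.
TARGET_STYLES = f"""<w:styles xmlns:w="{W}">
  <w:style w:type="paragraph" w:styleId="Normal">
    <w:name w:val="Normal"/>
  </w:style>
</w:styles>"""

if __name__ == "__main__":
    missing = missing_style_ids(ET.fromstring(MOVED), ET.fromstring(TARGET_STYLES))
    print(missing)
```

Each id that comes back is a style that would have to be copied into the destination, renamed, or mapped to an existing style.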

Font Tables

The entries in the font table part are used for font substitution if the named font is not available.  Every font used in the document should appear in this table, but this is not a requirement for a valid document.  Fonts are most commonly referenced from styles, but could also be referenced from paragraphs or text runs.

<w:fonts xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:font w:name="Times New Roman">
    <w:panose1 w:val="02020603050405020304"/>
    <w:charset w:val="00"/>
    <w:family w:val="roman"/>
    <w:pitch w:val="variable"/>
    <w:sig w:usb0="20002A87" w:usb1="80000000" w:usb2="00000008" w:usb3="00000000" w:csb0="000001FF" w:csb1="00000000"/>
  </w:font>
  <w:font w:name="Courier New">
    <w:panose1 w:val="02070309020205020404"/>
    <w:charset w:val="00"/>
    <w:family w:val="modern"/>
    <w:pitch w:val="fixed"/>
    <w:sig w:usb0="20002A87" w:usb1="80000000" w:usb2="00000008" w:usb3="00000000" w:csb0="000001FF" w:csb1="00000000"/>
  </w:font>
</w:fonts>
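
Because a missing font table entry does not invalidate the document, this check is optional, but it is easy to sketch.  The Python below assumes fonts are named through the w:ascii, w:hAnsi, w:eastAsia, and w:cs attributes of w:rFonts, ignores theme fonts, and uses a hypothetical function name (fonts_missing_from_table).  The font table in the demo is trimmed to names only.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def fonts_missing_from_table(part, font_table):
    """Return font names used by w:rFonts in part but absent from font_table."""
    listed = {
        font.get(f"{{{W}}}name") for font in font_table.iter(f"{{{W}}}font")
    }
    used = set()
    for r_fonts in part.iter(f"{{{W}}}rFonts"):
        # Only the explicit font-name attributes are read; theme fonts are skipped.
        for attribute in ("ascii", "hAnsi", "eastAsia", "cs"):
            name = r_fonts.get(f"{{{W}}}{attribute}")
            if name:
                used.add(name)
    return sorted(used - listed)


# A font table trimmed to the two font names from the listing above.
FONT_TABLE = f"""<w:fonts xmlns:w="{W}">
  <w:font w:name="Times New Roman"/>
  <w:font w:name="Courier New"/>
</w:fonts>"""

# A made-up run that names one listed font and one unlisted font.
BODY = f"""<w:body xmlns:w="{W}">
  <w:p>
    <w:r>
      <w:rPr><w:rFonts w:ascii="Consolas" w:hAnsi="Courier New"/></w:rPr>
      <w:t>Overview</w:t>
    </w:r>
  </w:p>
</w:body>"""

if __name__ == "__main__":
    print(fonts_missing_from_table(ET.fromstring(BODY), ET.fromstring(FONT_TABLE)))
```

The same function can be pointed at the styles part instead of the body, since that is where most font references live.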

Bookmarks

Bookmarks can span paragraphs.  We should maintain the pairing of bookmarkStart and bookmarkEnd elements.  Neglecting to do so will not make the document invalid, but the bookmark will be lost.

<w:p>
  <w:r>
    <w:t>Check the doc</w:t>
  </w:r>
  <w:bookmarkStart w:id="0" w:name="Book1"/>
  <w:r>
    <w:t>ument.</w:t>
  </w:r>
</w:p>
<w:p>
  <w:r>
    <w:t>You</w:t>
  </w:r>
  <w:bookmarkEnd w:id="0"/>
  <w:r>
    <w:t>should check.</w:t>
  </w:r>
</w:p>
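
To see how a bookmark gets lost, the sketch below parses the two paragraphs above, deletes the second one, and reports the bookmark ids that no longer have both a start and an end.  It is written in Python with ElementTree as an illustration only, and unpaired_bookmark_ids is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def unpaired_bookmark_ids(body):
    """Return bookmark ids that have a start without an end, or the reverse."""
    def ids(local_name):
        found = (e.get(f"{{{W}}}id") for e in body.iter(f"{{{W}}}{local_name}"))
        return {i for i in found if i is not None}

    # Symmetric difference: ids present on one side of the pair only.
    return sorted(ids("bookmarkStart") ^ ids("bookmarkEnd"))


# The two paragraphs from the example above, wrapped in a body element.
EXAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p>
    <w:r><w:t>Check the doc</w:t></w:r>
    <w:bookmarkStart w:id="0" w:name="Book1"/>
    <w:r><w:t>ument.</w:t></w:r>
  </w:p>
  <w:p>
    <w:r><w:t>You</w:t></w:r>
    <w:bookmarkEnd w:id="0"/>
    <w:r><w:t>should check.</w:t></w:r>
  </w:p>
</w:body>"""

if __name__ == "__main__":
    body = ET.fromstring(EXAMPLE)
    print(unpaired_bookmark_ids(body))  # both halves present
    # Simulate deleting the paragraph that holds bookmarkEnd.
    body.remove(body.findall(f"{{{W}}}p")[1])
    print(unpaired_bookmark_ids(body))  # the start is now orphaned
```

When a check like this reports an id, a tool can either drop the orphaned element or move the surviving half so that the bookmark still covers the remaining text.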

Hyperlinks

Hyperlinks can span paragraphs.  If you move a paragraph without fixing up the markup for hyperlinks, then you will have hyperlinks that don’t have the correct appearance or behavior.

The following shows the markup for a hyperlink:

<w:body>
  <w:p>
    <w:pPr>
      <w:rPr>
        <w:rStyle w:val="Hyperlink"/>
      </w:rPr>
    </w:pPr>
    <w:r>
      <w:t xml:space="preserve">On the Insert tab, </w:t>
    </w:r>
    <w:r>
      <w:fldChar w:fldCharType="begin"/>
    </w:r>
    <w:r>
      <w:instrText xml:space="preserve"> HYPERLINK "https://blogs.msdn.com/ericwhite" </w:instrText>
    </w:r>
    <w:r>
      <w:fldChar w:fldCharType="separate"/>
    </w:r>
    <w:r>
      <w:rPr>
        <w:rStyle