Inserting / Deleting / Moving Paragraphs in Open XML Wordprocessing Documents


One of the most common scenarios for Open XML is programmatically adding, deleting, and moving paragraphs in a word processing document.  A variation on this is moving or copying paragraphs from one document to another.  This programming task is complicated by the need to keep other parts of the document in sync with the data stored in paragraphs.  For example, a paragraph can contain a reference to a comment in the comments part, and if there is a problem with this reference, the document is invalid.  You must take care when moving / inserting / deleting paragraphs to maintain ‘referential integrity’ within the document.  If you are making a tool to manipulate paragraphs, then this post lists some of the constraints that you must pay attention to.

(Update Feb 6, 2009 – The code to move/insert/delete paragraphs has been completed.  This post introduces the code, and tells where to download the code from the PowerTools for Open XML project.)

(Update March 24, 2009 – This post was updated with details of more cases of interrelated markup.)

As an example, if the comment ID has a duplicate elsewhere in the document, or if the comment ID is greater than the number of comments in the comments part, the document is invalid.  If the paragraph refers to a style that isn’t in the styles part, the document will not render as expected.

There are two types of markup that we need to pay attention to when moving paragraphs – those where markup spans multiple paragraphs, such as bookmarks or hyperlinks, and those where the paragraph contains a reference to something outside of the paragraph, such as a footnote or an image.  In some cases, such as comments, we must deal with both types of markup – comment markup can span paragraphs, and comments have an external reference to the comments part.

I have a goal of augmenting the PowerTools for Open XML to enable more sophisticated document modification tasks, such as using a template document as a source of ‘boilerplate’ information, moving paragraphs from the template document to other documents as required.  In addition, I want to make it easier to add or delete paragraphs using PowerShell.  To implement this, I need to have a strategy for maintaining the integrity of documents.  The information presented in this post is the first step in putting this together.

(Update March 23, 2009 – PowerTools for Open XML v1.1 has been released.  This version of the PowerTools contains two new cmdlets, Merge-OpenXmlDocument and Select-OpenXmlString, which enable composition of a new document from existing documents, while addressing the issues of interrelated markup as detailed in this post.)

The list presented in this post probably isn’t complete – I’ll update this list with new items as necessary.

Comments

A paragraph that contains a comment must reference a valid, existing comment.  The comment w:id attributes must be unique.  In the following example, there must not be another comment that has id == “0”.

<w:p>
  <w:r>
    <w:t xml:space="preserve">On the Insert tab, the </w:t>
  </w:r>
  <w:commentRangeStart w:id="0"/>
  <w:r>
    <w:t xml:space="preserve">galleries </w:t>
  </w:r>
  <w:commentRangeEnd w:id="0"/>
  <w:r>
    <w:rPr>
      <w:rStyle w:val="CommentReference"/>
    </w:rPr>
    <w:commentReference w:id="0"/>
  </w:r>
  <w:r>
    <w:t>include items that are designed to coordinate with the overall look of your document.</w:t>
  </w:r>
</w:p>
 

In addition, as shown in the following example, the commentRangeStart element may be in a different paragraph from the commentRangeEnd element.  If, for example, you delete the paragraph that contains the commentRangeStart, but the commentReference and commentRangeEnd elements still exist, then the document isn’t valid.  This doesn’t prevent Word 2007 from opening the document, but we should fix up these elements if deleting or moving paragraphs.

<w:p>
  <w:r>
    <w:t xml:space="preserve">On the Insert tab, the galleries include items that are designed to coordinate with the overall look of your </w:t>
  </w:r>
  <w:commentRangeStart w:id="0"/>
  <w:r>
    <w:t>document.</w:t>
  </w:r>
</w:p>
<w:p>
  <w:r>
    <w:t xml:space="preserve">You </w:t>
  </w:r>
  <w:commentRangeEnd w:id="0"/>
  <w:r>
    <w:rPr>
      <w:rStyle w:val="CommentReference"/>
    </w:rPr>
    <w:commentReference w:id="0"/>
  </w:r>
  <w:r>
    <w:t>can use these galleries to insert tables, headers, footers, lists, cover pages, and other document building blocks.</w:t>
  </w:r>
</w:p>
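
The constraints above can be checked mechanically before and after moving paragraphs.  The following is a minimal Python sketch, not the PowerTools implementation.  The function name check_comment_markup and its comment_ids parameter are hypothetical, and the sketch assumes that every comment uses exactly one commentRangeStart, one commentRangeEnd, and one commentReference, which is what the markup in this post shows.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(name):
    """Return the namespace-qualified name of a WordprocessingML element or attribute."""
    return "{%s}%s" % (W, name)


def check_comment_markup(body, comment_ids):
    """List (id, problem) pairs for comment markup found under `body`.

    `comment_ids` is the set of w:id values of the w:comment elements in the
    comments part.  Hypothetical helper: it assumes one range start, one range
    end, and one reference per comment.
    """
    starts = [e.get(w("id"), "") for e in body.iter(w("commentRangeStart"))]
    ends = [e.get(w("id"), "") for e in body.iter(w("commentRangeEnd"))]
    refs = [e.get(w("id"), "") for e in body.iter(w("commentReference"))]
    problems = []
    for cid in sorted(set(starts) | set(ends) | set(refs)):
        if not (starts.count(cid) == ends.count(cid) == refs.count(cid) == 1):
            problems.append((cid, "unpaired or duplicated range/reference"))
        if cid not in comment_ids:
            problems.append((cid, "no such comment in the comments part"))
    return problems


# The second paragraph of the example, after the paragraph that held
# commentRangeStart was deleted.
SAMPLE = """<w:body xmlns:w="%s">
  <w:p>
    <w:r><w:t xml:space="preserve">You </w:t></w:r>
    <w:commentRangeEnd w:id="0"/>
    <w:r><w:commentReference w:id="0"/></w:r>
  </w:p>
</w:body>""" % W

problems = check_comment_markup(ET.fromstring(SAMPLE), comment_ids={"0"})
print(problems)
```

A tool that deletes or moves paragraphs could run a check like this and then either remove the orphaned elements or relocate the missing range marker into a surviving paragraph.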
 

Styles

A paragraph that references a style must refer to a valid style that exists in the styles part.  If you copy a paragraph from one document to another, and the new document doesn’t contain a style with the specified name (“Heading1” in the following example), then the document will not render as you expect.

In my experiments, if you have a paragraph that refers to a non-existent style, Word 2007 still opens the document, and the style reverts to the default style.  However, this isn’t the behavior that we want.  Typically, when moving a paragraph from one document to another, you would want the paragraph to take on the formatting of one document or the other.  Alternatively, you could create a new style with a different name, so that moved paragraphs retain their styling.

<w:p>
  <w:pPr>
    <w:pStyle w:val="Heading1"/>
  </w:pPr>
  <w:r>
    <w:t>Overview</w:t>
  </w:r>
</w:p>
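
Before copying paragraphs into another document, it can help to list the style references that the destination's styles part does not define.  Here is a minimal Python sketch of that check.  The function name missing_styles is hypothetical.  The sketch relies on the fact that w:pStyle, w:rStyle, and w:tblStyle values refer to the w:styleId attribute of a w:style element, and it ignores other kinds of style reference.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(name):
    """Return the namespace-qualified name of a WordprocessingML element or attribute."""
    return "{%s}%s" % (W, name)


def missing_styles(body, styles_root):
    """Return style ids referenced under `body` that `styles_root` does not define."""
    defined = {s.get(w("styleId")) for s in styles_root.iter(w("style"))}
    referenced = set()
    for tag in ("pStyle", "rStyle", "tblStyle"):
        for e in body.iter(w(tag)):
            val = e.get(w("val"))
            if val:
                referenced.add(val)
    return sorted(referenced - defined)


# A paragraph moved from another document, and a destination styles part
# that only defines "Normal".
BODY = ET.fromstring(
    '<w:body xmlns:w="%s"><w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
    '<w:r><w:t>Overview</w:t></w:r></w:p></w:body>' % W)
STYLES = ET.fromstring(
    '<w:styles xmlns:w="%s">'
    '<w:style w:type="paragraph" w:styleId="Normal"/></w:styles>' % W)

missing = missing_styles(BODY, STYLES)
print(missing)
```

For each id reported, a tool would then have to decide whether to copy the style definition from the source document, map it to an existing style, or create a renamed copy.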
 

Font Tables

The entries in the font table part are used for font substitution if the named font does not exist on the system.  Every font used in the document should appear in this table, but this is not a requirement for a valid document.  Fonts are most commonly referenced from styles, but can also be referenced from paragraphs or text runs.

<w:fonts xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:font w:name="Times New Roman">
    <w:panose1 w:val="02020603050405020304"/>
    <w:charset w:val="00"/>
    <w:family w:val="roman"/>
    <w:pitch w:val="variable"/>
    <w:sig w:usb0="20002A87" w:usb1="80000000" w:usb2="00000008" w:usb3="00000000" w:csb0="000001FF" w:csb1="00000000"/>
  </w:font>
  <w:font w:name="Courier New">
    <w:panose1 w:val="02070309020205020404"/>
    <w:charset w:val="00"/>
    <w:family w:val="modern"/>
    <w:pitch w:val="fixed"/>
    <w:sig w:usb0="20002A87" w:usb1="80000000" w:usb2="00000008" w:usb3="00000000" w:csb0="000001FF" w:csb1="00000000"/>
  </w:font>
</w:fonts>
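
As a rough illustration, the following Python sketch lists fonts that runs or styles name explicitly but that are absent from the font table.  The function name fonts_not_in_table is hypothetical.  The sketch only looks at the w:ascii, w:hAnsi, w:eastAsia, and w:cs attributes of w:rFonts, so theme fonts and any other font references are out of scope.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(name):
    """Return the namespace-qualified name of a WordprocessingML element or attribute."""
    return "{%s}%s" % (W, name)


def fonts_not_in_table(root, font_table_root):
    """Return font names used in w:rFonts under `root` but not listed in the font table."""
    listed = {f.get(w("name")) for f in font_table_root.iter(w("font"))}
    used = set()
    for rfonts in root.iter(w("rFonts")):
        for attr in ("ascii", "hAnsi", "eastAsia", "cs"):
            name = rfonts.get(w(attr))
            if name:
                used.add(name)
    return sorted(used - listed)


FONT_TABLE = ET.fromstring(
    '<w:fonts xmlns:w="%s"><w:font w:name="Times New Roman"/>'
    '<w:font w:name="Courier New"/></w:fonts>' % W)
BODY = ET.fromstring(
    '<w:body xmlns:w="%s"><w:p><w:r><w:rPr>'
    '<w:rFonts w:ascii="Calibri" w:hAnsi="Calibri" w:cs="Courier New"/>'
    '</w:rPr><w:t>Overview</w:t></w:r></w:p></w:body>' % W)

unlisted = fonts_not_in_table(BODY, FONT_TABLE)
print(unlisted)
```

Because a missing entry does not invalidate the document, a report like this is advisory: the tool can copy the corresponding w:font element from the source document's font table when one is available.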

Bookmarks

Bookmarks can span paragraphs.  We should maintain the pairing of bookmarkStart and bookmarkEnd elements.  Neglecting to do so will not make the document invalid, but the bookmark will be lost.

<w:p>
  <w:r>
    <w:t>Check the doc</w:t>
  </w:r>
  <w:bookmarkStart w:id="0"
                   w:name="Book1"/>
  <w:r>
    <w:t>ument.</w:t>
  </w:r>
</w:p>
<w:p>
  <w:r>
    <w:t>You</w:t>
  </w:r>
  <w:bookmarkEnd w:id="0"/>
  <w:r>
    <w:t>should check.</w:t>
  </w:r>
</w:p>
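
A simple way to detect a broken bookmark pair is to compare the ids of the bookmarkStart and bookmarkEnd elements.  This is a minimal Python sketch under that assumption.  The function name unpaired_bookmarks and the keys of the dictionary it returns are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(name):
    """Return the namespace-qualified name of a WordprocessingML element or attribute."""
    return "{%s}%s" % (W, name)


def unpaired_bookmarks(body):
    """Return bookmark ids under `body` that have a start without an end, or the reverse."""
    starts = {e.get(w("id")) for e in body.iter(w("bookmarkStart"))}
    ends = {e.get(w("id")) for e in body.iter(w("bookmarkEnd"))}
    return {
        "missing_end": sorted(starts - ends),
        "missing_start": sorted(ends - starts),
    }


# Only the second paragraph of the example survives, so the bookmarkEnd
# with id 0 has lost its bookmarkStart.
SAMPLE = """<w:body xmlns:w="%s">
  <w:p>
    <w:r><w:t>You</w:t></w:r>
    <w:bookmarkEnd w:id="0"/>
    <w:r><w:t>should check.</w:t></w:r>
  </w:p>
</w:body>""" % W

report = unpaired_bookmarks(ET.fromstring(SAMPLE))
print(report)
```

To preserve the bookmark, a tool could move the missing bookmarkStart or bookmarkEnd into a neighboring paragraph that remains in the document.  Otherwise it could drop the orphaned element.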
 

Hyperlinks

Hyperlinks can span paragraphs.  If you move a paragraph without fixing up the markup for hyperlinks, then you will have hyperlinks that don’t have the correct appearance or behavior.

The following shows the markup for a hyperlink:

<w:body>
  <w:p>
    <w:pPr>
      <w:rPr>
        <w:rStyle w:val="Hyperlink"/>
      </w:rPr>
    </w:pPr>
    <w:r>
      <w:t xml:space="preserve">On the Insert tab, </w:t>
    </w:r>
    <w:r>
      <w:fldChar w:fldCharType="begin"/>
    </w:r>
    <w:r>
      <w:instrText xml:space="preserve"> HYPERLINK "https://blogs.msdn.com/ericwhite" </w:instrText>
    </w:r>
    <w:r>
      <w:fldChar w:fldCharType="separate"/>
    </w:r>
    <w:r>
      <w:rPr>
        <w:rStyle

Comments (17)

  1. Following PDC 2008 and the Open XML workshop given by Microsoft in Redmond (Doug, a thousand apologies once again

  2. Sten says:

    Also… headers, footers, footnotes, document variables, images, lists

  3. EricWhite says:

    Thanks, Sten – I’ll update this.

    -Eric

  4. EricWhite says:

    Sten,

    I’ve updated the post with the information on footnotes, images, and lists.  However, please excuse my ignorance :)  I’m not clear on what you are referring to with headers, footers, and document variables.  How are these features represented in markup that has a reference to something outside of the context of a single paragraph?

    Thanks, Eric

  5. Sten says:

    Document variables (accessible in Word 2003 through File|Properties); if copied to another document the text would come over, but the variable reference would be invalid or lost.

    The "w:instr" attribute contains a reference to settings.xml:

    <w:fldSimple w:instr=" DOCVARIABLE UvarTypeReport * Upper * MERGEFORMAT ">
     <w:r w:rsidR="00AD0F9A">
       <w:rPr>
         <w:sz w:val="24"/>
       </w:rPr>
       <w:t>COMPLETE</w:t>
     </w:r>
    </w:fldSimple>

    Headers and Footers are outside the document.xml, each is a part of its own and are referenced using w:headerReference or w:footerReference; Not only they are referenced but there is the w:type attribute which affects what is shown when the document is loaded in Word. In the sample below the header rId7 is not rendered in Word and rId8 is:

    <w:p w:rsidR="00AD0F9A" w:rsidRDefault="00AD0F9A">
     <w:pPr>
       <w:sectPr w:rsidR="00AD0F9A">
         <w:headerReference w:type="default" r:id="rId7"/>
         <w:headerReference w:type="first" r:id="rId8"/>
         <w:footerReference w:type="first" r:id="rId9"/>
         <w:pgSz w:w="12240" w:h="15840" w:code="1"/>
         <w:pgMar w:top="4896" w:right="1800" w:bottom="720" w:left="864" w:header="720" w:footer="576" w:gutter="0"/>
         <w:paperSrc w:first="15" w:other="15"/>
         <w:pgNumType w:start="1"/>
         <w:cols w:space="720"/>
         <w:titlePg/>
       </w:sectPr>
     </w:pPr>
    </w:p>

  6. Doug Mahugh says:

    Zeyad Rajabi has started a series of very useful hands-on posts over on Brian Jones’s blog about working

  7. Sten says:

    Style references are present in runs as well:

    w:r/w:rPr/w:rStyle[@val=’StyleName’]

    I haven’t verified it yet, but since there are table styles they are probably referenced from within tables

  8. DocumentBuilder is an example class that’s part of the PowerTools for Open XML project that enables you

  9. Mohamed Ali Khan says:

    Hi Eric,

    I have a open xml paragraph Markup like this.

    <w:p w:rsidR="00A60A58" w:rsidRPr="00810B68" w:rsidRDefault="00A60A58" w:rsidP="00017828">
       <w:pPr>
           <w:pStyle w:val="equation" /></w:pPr>
       <w:r w:rsidRPr="00810B68">
           <w:tab/></w:r>
       <w:r w:rsidRPr="00810B68">
           <w:rPr>
               <w:position w:val="-24" /></w:rPr>
           <w:object w:dxaOrig="2380" w:dyaOrig="560">
               <v:shape id="_x0000_i1073" type="#_x0000_t75" style="width:118.4pt;height:27.45pt" o:ole="">
                   <v:imagedata r:id="rId103" o:title="" /></v:shape>
               <o:OLEObject Type="Embed" ProgID="Equation.DSMT4" ShapeID="_x0000_i1073" DrawAspect="Content" ObjectID="_1439405570" r:id="rId104" />
           </w:object>
       </w:r>
       <w:r w:rsidRPr="00810B68">
           <w:tab/></w:r>
       <w:r w:rsidR="006F3ADF">
           <w:fldChar w:fldCharType="begin" /></w:r>
       <w:r w:rsidR="00782DAD">
           <w:instrText xml:space="preserve">MACROBUTTON MTPlaceRef * MERGEFORMAT</w:instrText>
       </w:r>
       <w:r w:rsidR="006F3ADF">
           <w:fldChar w:fldCharType="begin" /></w:r>
       <w:r w:rsidR="009E6C23">
           <w:instrText xml:space="preserve">SEQ MTEqn h * MERGEFORMAT</w:instrText>
       </w:r>
       <w:r w:rsidR="006F3ADF">
           <w:fldChar w:fldCharType="end" /></w:r>
       <w:bookmarkStart w:id="5" w:name="ZEqnNum679168" />
       <w:r w:rsidR="00782DAD">
           <w:instrText>(3.</w:instrText>
       </w:r>
       <w:fldSimple w:instr=" SEQ MTEqn c * Arabic * MERGEFORMAT ">
           <w:r w:rsidR="00410805">
               <w:rPr>
                   <w:noProof/></w:rPr>
               <w:instrText>3</w:instrText>
           </w:r>
       </w:fldSimple>
       <w:r w:rsidR="00782DAD">
           <w:instrText>)</w:instrText>
       </w:r>
       <w:bookmarkEnd w:id="5" />
       <w:r w:rsidR="006F3ADF">
           <w:fldChar w:fldCharType="end" /></w:r>
    </w:p>

    and the innertext is

    "Note that the disagreement error  is not available to node i unless it is pinned to the leader node (that is,), whereas the local neighborhood tracking error  GOTOBUTTON ZEqnNum679168  * MERGEFORMAT  REF ZEqnNum679168 * Charformat ! * MERGEFORMAT (3.3) is known to each node i."

    But i need only this text from the innerText,

    Note that the disagreement error  is not available to node i unless it is pinned to the leader node (that is,), whereas the local neighborhood tracking error (3.3) is known to each node i.

    which is similar to the text displayed in Microsoft Word.

    Please help me in this.

    Thank you.

  10. EricWhite says:

    Hi Mohamed,

    Looking at your markup I think maybe you got a different paragraph than the one with the inner text you show – the paragraph markup you show has an ActiveX object with a math equation in it.  But in any case, the answer to your question is that innerText is not helpful to you when you want to retrieve the actual text of a paragraph.  Instead, what you want to do is to find all descendant w:t elements, and concatenate the text value of them.  This will give you the actual text of the paragraph.  One exception to this is that if your paragraph contains tracked revisions, you will first want to accept tracked revisions, and then do the concatenation of the w:t elements.  You can use the RevisionAccepter class of PowerTools for Open XML to do this.  See http://powertools.codeplex.com.

    Cheers, Eric

  11. Mohamed Ali Khan says:

    Hi Eric,

    Sorry for the wrong one.

    This is the relevant Markup.

    <w:p w:rsidR="00FA19E4" w:rsidRDefault="00FA19E4" w:rsidP="00CD12D6">
       <w:r>
           <w:t xml:space="preserve">Note that the disagreement error</w:t>
       </w:r>
       <w:r w:rsidRPr="00810B68">
           <w:rPr>
               <w:position w:val="-10" /></w:rPr>
           <w:object w:dxaOrig="859" w:dyaOrig="279">
               <v:shape id="_x0000_i1101" type="#_x0000_t75" style="width:41.7pt;height:15.15pt" o:ole="">
                   <v:imagedata r:id="rId151" o:title="" /></v:shape>
               <o:OLEObject Type="Embed" ProgID="Equation.DSMT4" ShapeID="_x0000_i1101" DrawAspect="Content" ObjectID="_1439405598" r:id="rId159" />
           </w:object>
       </w:r>
       <w:r>
           <w:t xml:space="preserve">is not available to node</w:t>
       </w:r>
       <w:r w:rsidRPr="00FA19E4">
           <w:rPr>
               <w:i/></w:rPr>
           <w:t>i</w:t>
       </w:r>
       <w:r>
           <w:t xml:space="preserve">unless it is pinned to the leader node</w:t>
       </w:r>
       <w:r w:rsidR="00C50F07">
           <w:t xml:space="preserve">(that is,</w:t>
       </w:r>
       <w:r w:rsidR="00C50F07" w:rsidRPr="00C50F07">
           <w:rPr>
               <w:position w:val="-12" /></w:rPr>
           <w:object w:dxaOrig="639" w:dyaOrig="360">
               <v:shape id="_x0000_i1102" type="#_x0000_t75" style="width:27.45pt;height:15.15pt" o:ole="">
                   <v:imagedata r:id="rId160" o:title="" /></v:shape>
               <o:OLEObject Type="Embed" ProgID="Equation.DSMT4" ShapeID="_x0000_i1102" DrawAspect="Content" ObjectID="_1439405599" r:id="rId161" />
           </w:object>
       </w:r>
       <w:r w:rsidR="00C50F07">
           <w:t>)</w:t>
       </w:r>
       <w:r>
           <w:t>, whereas the local neighborhood trac</w:t>
       </w:r>
       <w:r>
           <w:t>k</w:t>
       </w:r>
       <w:r>
           <w:t xml:space="preserve">ing error</w:t>
       </w:r>
       <w:r w:rsidR="006F3ADF">
           <w:fldChar w:fldCharType="begin" /></w:r>
       <w:r>
           <w:instrText xml:space="preserve">GOTOBUTTON ZEqnNum679168 * MERGEFORMAT</w:instrText>
       </w:r>
       <w:r w:rsidR="006F3ADF">
           <w:fldChar w:fldCharType="begin" /></w:r>
       <w:r w:rsidR="00954790">
           <w:instrText xml:space="preserve">REF ZEqnNum679168 * Charformat ! * MERGEFORMAT</w:instrText>
       </w:r>
       <w:r w:rsidR="006F3ADF">
           <w:fldChar w:fldCharType="separate" /></w:r>
       <w:r w:rsidR="00410805">
           <w:instrText>(3.3)</w:instrText>
       </w:r>
       <w:r w:rsidR="006F3ADF">
           <w:fldChar w:fldCharType="end" /></w:r>
       <w:r w:rsidR="006F3ADF">
           <w:fldChar w:fldCharType="end" /></w:r>
       <w:r>
           <w:t xml:space="preserve">is known to each node</w:t>
       </w:r>
       <w:r w:rsidRPr="00FA19E4">
           <w:rPr>
               <w:i/></w:rPr>
           <w:t>i</w:t>
       </w:r>
       <w:r>
           <w:t>.</w:t>
       </w:r>
    </w:p>

  12. Mohamed Ali Khan says:

    (Splitted the Post into Two)

    and the innertext is

    "Note that the disagreement error  is not available to node i unless it is pinned to the leader node (that is,), whereas the local neighborhood tracking error  GOTOBUTTON ZEqnNum679168  * MERGEFORMAT  REF ZEqnNum679168 * Charformat ! * MERGEFORMAT (3.3) is known to each node i."

    But i need only this text from the innerText,

    Note that the disagreement error  is not available to node i unless it is pinned to the leader node (that is,), whereas the local neighborhood tracking error (3.3) is known to each node i.

    Note: Here the text from "<w:instrText>(3.3)</w:instrText>" also required.

    How can i avoid these Field Codes & Text which is not required. And i tried using Regex also but with no success.

    which is similar to the text displayed in Microsoft Word Application.

    Please help me in this.

    Thank you.

  13. Mohamed Ali Khan says:

    (Splitted the Post into two coz of Length)

    and the innertext is

    "Note that the disagreement error  is not available to node i unless it is pinned to the leader node (that is,), whereas the local neighborhood tracking error  GOTOBUTTON ZEqnNum679168  * MERGEFORMAT  REF ZEqnNum679168 * Charformat ! * MERGEFORMAT (3.3) is known to each node i."

    But i need only this text from the innerText,

    Note that the disagreement error  is not available to node i unless it is pinned to the leader node (that is,), whereas the local neighborhood tracking error (3.3) is known to each node i.

    Note: Here the text from "<w:instrText>(3.3)</w:instrText>" also required.

    How can i avoid these Field Codes & Text which is not required. And i tried using Regex also but with no success.

    which is similar to the text displayed in Microsoft Word Application.

    Please help me in this.

    Thank you.

  14. EricWhite says:

    Hi Mohamed,

    What language are you using?  C#?  VB.NET?  XSLT?  If using C# / VB, which version, and which version of the .NET framework?

    I'll need that information in order to help…

    Cheers, Eric

  15. Mohamed Ali Khan says:

    Hi Eric,

    I'm using C# with .NET Framework 4 and Open XML SDK 2.0.

    Thanks.

  16. EricWhite says:

    Hi Mohamed,

    I recommend reading the following MSDN article:

    msdn.microsoft.com/…/ff686712(v=office.14).aspx

    If you can assimilate everything in that article, you will have a good grasp on how to query the text of paragraphs.

    Cheers, Eric

  17. Mohamed Ali Khan says:

    Hi Eric,

    Thanks for the reply.The Post was helpful and i'll try to use it. But, currently, i'm excluding instrText contents with some conditions in the final text result. Its working for now. But i'll try to understand it completely and refine my logic.

    Thanks,

    Mohamed Ali Khan