Move/Insert/Delete Paragraphs in Word Processing Documents using the Open XML SDK


We can potentially make extensive modifications to Open XML word processing documents.  Many scenarios would benefit from the ability to move paragraphs around (both inter and intra-document), insert paragraphs, and delete paragraphs.  In practice, the process can be quite daunting because paragraphs often contain markup that refers to something outside of the paragraph.  For example, a paragraph might contain markup indicating that a comment begins inside that paragraph.  But because the comment ends in a later paragraph, another paragraph contains the markup indicating the end of the comment.  And the comment markup contains a reference to another one of the Open XML parts (the Comments part, of course), with a zero-based index.  There are a fair number of cases of markup in a paragraph that have relations to markup outside the paragraph, including comments, bookmarks, images, etc.  When you insert, delete, and move paragraphs, you must keep this markup in sync.  This isn’t easy.

(Update March 27, 2009 – updated the Headers and Footers section, and added the Charts section.) 

This blog post details much of what you must do to keep such inter-related markup in sync.  It also presents some sample code (from the PowerTools for Open XML open source project), that shows one approach to solving this problem.  It is useful to have example code like this.  It can serve to educate about the Open XML specification, and perhaps inspire other innovative approaches to solving this problem.

Note from Eric White:  This is a guest post written by Bob McClellan, who is one of the open source developers on the PowerTools for Open XML project on CodePlex.  This code is the basis of some new cmdlets that enable composability of documents – take any number of source documents, specify some subset of each source document, and compose a new, valid Open XML document.  The code takes care of a myriad of details to make sure that the new document is valid.

This is one of the most important Open XML posts that I’ve put on my blog.  The details handled by the code presented in this post enable true document composability.

The gist of the approach presented here is that we don’t actually modify any existing documents.  Instead, we start with any number of source documents, extract an arbitrary number of paragraphs (from any location in each document), and assemble a new, valid Open XML document.  If a paragraph contains a reference to an image, the image is moved to the new package, and the appropriate fix-ups are applied to the markup that refers to the image.  Ditto for comments, and many other cases.  We can use this basic function to do other interesting things – to delete a range of paragraphs in the middle of the document, we can specify the source document twice, first specifying a range up to the section that we want to delete, and then specifying a range of paragraphs after the deleted range.  After assembling the new document, we have a valid Open XML word processing document with the unwanted paragraphs removed.

In the next section, I will explain how to use the example code.  The final section explains generally how the code in the example works.

To get this example, you will need to download it from the CodePlex web site for the “PowerTools for OpenXML” project. (The PowerTools contain a number of PowerShell cmdlets that manipulate Open XML documents.) The link is www.CodePlex.com/PowerTools (then click on the “Releases” tab, and look for “DocumentBuilder.zip” under “Downloads and Files”).  The file that you will download is a compressed file containing the entire folder structure including the source code, sample documents and Visual Studio solution and project. You can open the solution file in any version of Visual Studio 2008 that supports C#, including the Visual C# 2008 Express.

This code also uses the Open XML SDK, either Version 1 or Version 2.  To use the code, you need to download and install one of those versions of the SDK.  You may also need to update the reference to the SDK assembly in the project.

Running the Example

The example program creates a few new documents (named Test1.docx, Test2.docx, and Test3.docx) from existing documents (named Source1.docx and Source2.docx).  After running the example, Test1.docx contains a few paragraphs extracted from an existing document.  Test2.docx contains a set of paragraphs from the beginning and from the end of an existing document (omitting some paragraphs in the middle).  Test3.docx concatenates two existing documents into a new document.  This is a more generalized and effective approach for assembling documents than using altChunk.

Note: the source documents are located in the bin/debug directory under the project.  The created documents are placed in the same directory.  This allows you to simply open the project, fix up the reference to the Open XML SDK (if necessary), and run the program to see the newly created documents.

Each test calls the DocumentBuilder.BuildDocument method to create the new document.  The code for Test2 looks like this:

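using (WordprocessingDocument part1 =
    WordprocessingDocument.Open("Source1.docx", false))
{
    List<Source> sources = new List<Source>();
    // Keep the first 12 body elements (indexes 0 through 11)...
    sources.Add(new Source(part1, 0, 12, true));
    // ...then everything from index 49 onward, which omits indexes 12 through 48.
    sources.Add(new Source(part1, 49, true));
    DocumentBuilder.BuildDocument(sources, "Test2.docx");
}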


The first line is calling the Open XML SDK to open the source document.  Then, the example creates a list of objects of type OpenXml.PowerTools.Source.  This collection is used to define which groups of paragraphs, or entire documents, will be used to create a new document.  In this case, two groups of paragraphs from the same document will be used to create the new document.  The first Source object is created with a “start” value of 0 and “count” of 12.  That means that twelve paragraphs will be extracted from the existing document starting with the first paragraph.  The starting paragraph is numbered 0, so if you are used to thinking of the first paragraph as paragraph 1, just subtract one when specifying the starting paragraph value.  The second Source object specifies only the “start” value, so it extracts all the remaining paragraphs starting at that value.  The call to BuildDocument creates the new document, named “Test2.docx” in this case.

NOTE: Although I have been describing the source parameters as referring to paragraphs, that is not strictly true. They actually refer to elements that are children of the “body” element. Although paragraphs are the most common, child elements of the body element may include tables, various range elements (start and end ranges often appear as siblings of paragraphs), content controls and others (e.g. Math paragraphs).  For more complex documents, you will probably need to view the XML directly in order to determine which values to use for start and count.  Of course, these values can be determined programmatically by scanning the XML.
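As a rough illustration of scanning the XML for start and count values, here is a minimal sketch that lists each child of the body element with its zero-based index. It uses only LINQ to XML; the class and method names (BodyIndexSketch, ListBodyChildren) are hypothetical and are not part of DocumentBuilder.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class BodyIndexSketch
{
    // WordprocessingML main namespace.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: returns one entry per child of w:body,
    // formatted as "<zero-based index>: <element local name>".
    public static List<string> ListBodyChildren(XDocument mainDocument)
    {
        return mainDocument.Root
            .Element(W + "body")
            .Elements()
            .Select((e, i) => i + ": " + e.Name.LocalName)
            .ToList();
    }
}
```

The indexes in the returned list are the values you would pass as “start”, so tables, range markers and a trailing sectPr all count toward the numbering.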

How it Works

The source code for this example consists of three files.  Program.cs contains the Main function which makes calls to BuildDocument.  DocumentExtensions.cs contains extensions to the WordprocessingDocument class that handle reading and writing the document parts (using the GetXDocument and FlushParts extension methods).  The code in DocumentExtensions.cs will be covered in detail in a future blog post.  DocumentBuilder.cs contains the BuildDocument method, two versions of BuildOpenDocument, and supporting methods.  It also defines the Source class used to declare the desired sources for BuildDocument.  Each of these code sections is described below in general terms.  If you have more specific questions about the code, you are invited to ask questions in the comments on this post.

The first part of DocumentBuilder.cs defines the Source class.  The various constructors for the class all result in the same internal structure.  The member variables are a source document, the desired contents of that document (expressed as the starting paragraph number, and count of paragraphs in the constructor), and a Boolean value indicating if the final section divider should be retained.  Each constructor uses a LINQ-to-XML query to define the desired contents.  Note that these contents are not actually stored in the variable, but that the query itself is stored.  The elements of the query will only be retrieved as needed during processing.
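To make the point about stored queries concrete, here is a minimal sketch of a deferred LINQ to XML query over the body children. It assumes the selection is expressed with Skip and Take, which may differ from what the Source constructors actually do; DeferredSourceSketch and SelectContents are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class DeferredSourceSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Builds, but does not run, a query for up to `count` children of w:body
    // starting at the zero-based index `start`. Nothing is enumerated until
    // the caller iterates the result.
    public static IEnumerable<XElement> SelectContents(
        XDocument mainDocument, int start, int count)
    {
        return mainDocument.Root
            .Element(W + "body")
            .Elements()
            .Skip(start)
            .Take(count);
    }
}
```

Because the query is re-evaluated each time it is enumerated, elements added to or removed from the body between enumerations change what the query returns. That is worth keeping in mind if you modify the source tree while holding on to such a query.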

The next part defines the XML namespace strings used in Wordprocessing documents.  These will be used throughout the code to refer to various elements and attributes.  Next are the public functions, BuildDocument and BuildOpenDocument.  BuildDocument is used in the example and explained above.  There are two overloads of BuildOpenDocument.  The one that creates a file-based document is used by some of the PowerTools cmdlets. The other shows how to use a MemoryStream to create a document (which could be useful when using a SharePoint document library instead of the file system for document storage).  All of the public functions call a common internal function, DoBuildDocument.  Although the code may look fairly complex, it can be broken down into a few general steps.

1.       Create the new document.  A main document part must be created with “document” and “body” XML elements.

2.       Copy parts from the first source document.  A number of parts are copied from the first source document (e.g. styles) so that the new document will generally look like the first document in the list of the source documents – the first source document is essentially the “master” of the new document.  The parts that are copied are core, extended and custom file properties, settings, web settings, styles, font tables, and any theme.  The settings part often contains references to footnotes and endnotes that must be copied (e.g. separators).  The private functions that handle these operations are CopyStartingParts, CopyFootnotes and CopyEndnotes.

3.       Fix ranges.  There are a number of elements that come in pairs and “mark” everything within the two elements as part of that range.  Since these pairs can be “broken” by extracting only part of the range, this step fixes any broken ranges.  See below for more details.

4.       Copy references.  Many elements refer to other document parts (e.g. images) or elements within other parts (e.g. comments).  Each of these must be properly copied to the new document.  Many must be translated during that copy.  For example, a comment that is copied to a new document may not be able to use the same ID that it had in the original since it could be the same as the ID of another comment that was copied to the new document.  See below for more details.

5.       Append the source content.  The corrected source contents can now be appended to the new document.  The only special case here is the handling of the final section property element (sectPr). The sectPr element can appear either as a child of the body element or as a child of a paragraph (p) element.  The first case should only occur once as the last child element of the body.  It is essentially the default section for the document.  If there are any other section breaks in the document, those will appear as sectPr elements in the last paragraph before the section break.  When the code is putting together paragraphs from separate documents, it still has to be sure that these rules are followed.  There are two ways to handle multiple sections – either keep them or not.  If the sectPr at the end of a document is going to be kept and there are more paragraphs being concatenated after it, then the sectPr that used to end the original document must be moved into its last paragraph instead. If it is not going to be kept, then it must be left out of the new document.
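The section handling in step 5 can be sketched as follows. This is a simplified illustration, not the DocumentBuilder code: it assumes the caller passes the body element of the content being appended, and SectionSketch and DemoteFinalSection are hypothetical names.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class SectionSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // If the body ends with a body-level sectPr, either move it into the
    // paragraph properties of the last paragraph (keepSection == true)
    // or drop it (keepSection == false).
    public static void DemoteFinalSection(XElement body, bool keepSection)
    {
        XElement sectPr = body.Elements().LastOrDefault();
        if (sectPr == null || sectPr.Name != W + "sectPr")
            return;                       // no body-level section to handle

        sectPr.Remove();
        if (!keepSection)
            return;

        // The section break must live in a paragraph. If the content does
        // not end with one (for example, it ends with a table), append an
        // empty paragraph to carry the sectPr.
        XElement last = body.Elements().LastOrDefault();
        if (last == null || last.Name != W + "p")
        {
            last = new XElement(W + "p");
            body.Add(last);
        }

        XElement pPr = last.Element(W + "pPr");
        if (pPr == null)
        {
            pPr = new XElement(W + "pPr");
            last.AddFirst(pPr);
        }

        // Simplification: sectPr is appended as the last child of pPr. A
        // pPr that already contains pPrChange would need sectPr placed
        // before that element to stay schema-valid.
        pPr.Add(sectPr);
    }
}
```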

The rest of the DocumentBuilder.cs file contains the code to support these steps.  Here is a detailed explanation of what is happening in each of the “fix” and “copy” operations.  The name or names in parentheses are the names of the functions that handle each operation.

Fix Ranges (FixRanges, FixRange, DeleteUnmatchedRange)

Ranges can appear either inside or outside of paragraphs.  The function FixRanges makes all the calls necessary to fix ranges in the document.  The FixRange function is used to make sure that there is a matching start and end, based on an identifying attribute.  The names of the start and end elements are also passed to the function, of course.  The function iterates through all the start elements and uses a LINQ to XML query to try to find the matching end element.  If it is not found, then the matching end element is copied from the old document into the last paragraph of those being extracted.  The same process is then performed for each end element to see if it matches to a start element.  There is one special case for comment ranges.  A comment range must also contain a reference element to that same comment.  If the reference element was not included in the paragraphs being copied, then one is created.

The DeleteUnmatchedRange function is used to delete “move from” ranges that don’t have a matching “move to” range and vice versa.  These ranges should come in pairs with the same “name” attribute.  If there isn’t a complete pair, then this function deletes the range.  These types of ranges are not necessary for a document, but they give additional information about how the document has been changed.  Leaving in an unmatched range will not make the document invalid, but it can be confusing when viewed in Microsoft Word.
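Here is a minimal sketch of the pairing check, assuming the range markers are the moveFromRangeStart/moveFromRangeEnd and moveToRangeStart/moveToRangeEnd elements, with starts paired by their “name” attribute and each end tied to its start by “id”. It removes only the range markers, which may be less than the real function does; MoveRangeSketch and DeleteUnmatched are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class MoveRangeSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Removes every start/end marker pair of one kind (for example
    // moveFromRangeStart/moveFromRangeEnd) whose w:name has no partner
    // among the start markers of the other kind (moveToRangeStart).
    public static void DeleteUnmatched(
        XElement content, string startLocal, string endLocal, string partnerStartLocal)
    {
        var partnerNames = new HashSet<string>(
            content.Descendants(W + partnerStartLocal)
                   .Select(e => (string)e.Attribute(W + "name")));

        // Materialize the list first so that removing elements does not
        // disturb the enumeration.
        List<XElement> unmatched = content.Descendants(W + startLocal)
            .Where(s => !partnerNames.Contains((string)s.Attribute(W + "name")))
            .ToList();

        foreach (XElement start in unmatched)
        {
            string id = (string)start.Attribute(W + "id");
            content.Descendants(W + endLocal)
                   .Where(e => (string)e.Attribute(W + "id") == id)
                   .ToList()
                   .ForEach(e => e.Remove());
            start.Remove();
        }
    }
}
```

To cover both directions you would call it twice: once with ("moveFromRangeStart", "moveFromRangeEnd", "moveToRangeStart") and once with ("moveToRangeStart", "moveToRangeEnd", "moveFromRangeStart").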

Styles (CopyReferences, MergeStyles)

It is very complicated to determine what styles are referenced in a document, especially since styles can reference other styles.  It is also generally desirable to include all the styles from a document, even if they are not referenced.  However, it is not valid for two styles with the same name to appear in a document.  The MergeStyles function copies all styles that do not have a matching name in the new document.  That means that the first appearance of a style name in the source documents will be the one used in the new document.
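The “first name wins” rule can be sketched like this. The sketch assumes that “name” means the value of each style’s w:name child element and that both arguments are the root elements of the styles parts; StyleMergeSketch and its members are hypothetical names, not the MergeStyles implementation.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class StyleMergeSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Copies each w:style from sourceStyles into targetStyles unless the
    // target already holds a style with the same name. Returns the number
    // of styles copied.
    public static int MergeStyles(XElement sourceStyles, XElement targetStyles)
    {
        var existing = new HashSet<string>(
            targetStyles.Elements(W + "style").Select(s => StyleName(s)));

        int copied = 0;
        foreach (XElement style in sourceStyles.Elements(W + "style"))
        {
            // HashSet.Add returns false when the name is already present,
            // so an earlier definition is never overwritten.
            if (existing.Add(StyleName(style)))
            {
                targetStyles.Add(new XElement(style));
                copied++;
            }
        }
        return copied;
    }

    static string StyleName(XElement style)
    {
        XElement name = style.Element(W + "name");
        return name == null ? null : (string)name.Attribute(W + "val");
    }
}
```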

Font Tables (CopyReferences, MergeFontTables)

Font tables are handled in the same way as styles.  The first appearance of a font with a particular name is the one that is kept in the new document.

Footnotes and Endnotes (CopyFootnotes, CopyEndnotes)

There are, in general, two kinds of references – references to elements within a particular part or references to other parts.  Footnotes and Endnotes are the first type.  CopyFootnotes and CopyEndnotes iterate through all footnote or endnote references and then copy that footnote or endnote element from the source document to the new document.  Of course, the ID of the new element may need to be changed so that there is no overlap from the different source documents.  Rather than trying to change numbers only when there is a conflict, the new document renumbers all the elements starting at zero.  Of course, the reference to the elements in the main document must be changed to match.
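A minimal sketch of the copy-and-renumber idea for footnotes follows. It assumes the caller supplies the root elements of the source and target footnotes parts, and it assigns the next unused ID in the target (which is zero when the target is empty) rather than reproducing the exact numbering scheme of CopyFootnotes. The names FootnoteRenumberSketch and CopyFootnotes here are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FootnoteRenumberSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // For each w:footnoteReference in the content, copies the referenced
    // footnote into targetFootnotes under a new ID and rewrites the
    // reference to match.
    public static void CopyFootnotes(
        IEnumerable<XElement> content, XElement sourceFootnotes, XElement targetFootnotes)
    {
        int nextId = targetFootnotes.Elements(W + "footnote")
            .Select(f => (int)f.Attribute(W + "id"))
            .DefaultIfEmpty(-1)
            .Max() + 1;

        foreach (XElement reference in
                 content.DescendantsAndSelf(W + "footnoteReference").ToList())
        {
            string oldId = (string)reference.Attribute(W + "id");
            XElement footnote = sourceFootnotes.Elements(W + "footnote")
                .FirstOrDefault(f => (string)f.Attribute(W + "id") == oldId);
            if (footnote == null)
                continue;   // dangling reference: left untouched in this sketch

            XElement copy = new XElement(footnote);
            copy.SetAttributeValue(W + "id", nextId);
            targetFootnotes.Add(copy);
            reference.SetAttributeValue(W + "id", nextId);
            nextId++;
        }
    }
}
```

Endnotes would follow the same pattern with the endnote and endnoteReference element names.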

Comments (CopyComments)

Comments work very much like footnotes and endnotes – all comments appear in a single part.  The one difference is that comments can be referenced from range elements as well as from a “commentReference” element.  Since a “commentReference” element must always appear in the document, those elements are used to determine which comments to copy.  Any range elements that appear in the new document are then changed to the same ID number.

Images (CopyImages)

Images are the second type of reference – each image reference in the main document refers to its own separate part in the package.  In the case of images, these parts are binary data, rather than XML documents, so they must be copied using block read and write operations.  The OpenXML SDK will automatically generate a unique ID for the copied part.  The reference within the new document must then be changed to that new ID.
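The two halves of that operation can be sketched without the SDK: a block copy between two streams, and a fix-up of the relationship ID in the markup. The sketch assumes the image is referenced from a DrawingML blip element through its r:embed attribute, and that the caller already knows the old and new relationship IDs. BinaryCopySketch, CopyStream and RemapImageId are hypothetical names.

```csharp
using System.IO;
using System.Linq;
using System.Xml.Linq;

public static class BinaryCopySketch
{
    static readonly XNamespace A =
        "http://schemas.openxmlformats.org/drawingml/2006/main";
    static readonly XNamespace R =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    // Copies all remaining bytes from source to target in fixed-size blocks.
    public static void CopyStream(Stream source, Stream target)
    {
        byte[] buffer = new byte[8192];
        int read;
        while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            target.Write(buffer, 0, read);
    }

    // Points every a:blip whose r:embed equals oldId at newId instead.
    public static void RemapImageId(XElement content, string oldId, string newId)
    {
        foreach (XElement blip in content.DescendantsAndSelf(A + "blip")
                     .Where(b => (string)b.Attribute(R + "embed") == oldId)
                     .ToList())
        {
            blip.SetAttributeValue(R + "embed", newId);
        }
    }
}
```

In a real copy, the two streams would come from the source and target image parts, and newId would be whatever ID the SDK assigned to the new part.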

Numbering (CopyNumbering)

This is the most complex copy because the numbering part contains both numbering and abstract numbering definitions.  The properties defining the numbering are in the abstract element, but some can be overridden in the numbering definition.  The main document will only refer to numbering elements.  The example code tries to reuse abstract numbering definitions that are the same (based on the “nsid” attribute) and it also tries to reuse numbering definitions that refer to the same abstract numbering definitions, as long as they don’t have any override elements.  (This approach is very similar to Microsoft Word’s method for copying numbering elements when you copy and paste numbered paragraphs from one document to another.)
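The reuse of abstract numbering definitions can be sketched as below. In the markup, the nsid value is carried by a w:nsid child element of each abstractNum, and the sketch compares that value. It covers only the abstract definitions, not the num elements or their overrides, and NumberingReuseSketch and FindOrCopyAbstractNum are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class NumberingReuseSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the abstractNumId to use in the target numbering part:
    // the ID of an existing definition with the same nsid, or the ID of
    // a freshly copied definition if there is no match.
    public static int FindOrCopyAbstractNum(
        XElement sourceAbstractNum, XElement targetNumbering)
    {
        string nsid = Nsid(sourceAbstractNum);
        List<XElement> existing =
            targetNumbering.Elements(W + "abstractNum").ToList();

        XElement match = nsid == null
            ? null
            : existing.FirstOrDefault(a => Nsid(a) == nsid);
        if (match != null)
            return (int)match.Attribute(W + "abstractNumId");

        int newId = existing
            .Select(a => (int)a.Attribute(W + "abstractNumId"))
            .DefaultIfEmpty(-1)
            .Max() + 1;

        XElement copy = new XElement(sourceAbstractNum);
        copy.SetAttributeValue(W + "abstractNumId", newId);

        // abstractNum elements must come before num elements, so the copy
        // goes after the last existing definition. Simplification: with no
        // existing definitions it is placed first, which ignores any
        // numPicBullet elements.
        if (existing.Count > 0)
            existing[existing.Count - 1].AddAfterSelf(copy);
        else
            targetNumbering.AddFirst(copy);
        return newId;
    }

    static string Nsid(XElement abstractNum)
    {
        XElement nsid = abstractNum.Element(W + "nsid");
        return nsid == null ? null : (string)nsid.Attribute(W + "val");
    }
}
```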

Headers and Footers (CopyHeaders, CopyFooters, CopyHeaderShapes, CopyHeaderEmbeddedObjects, CopyHeaderImages, CopyFooterShapes, CopyFooterEmbeddedObjects, CopyFooterImages)

Like images, each header or footer is its own part.  To add to the complexity, these parts may also contain references to images, shapes and embedded objects.  The copy of those additional references is done just like those of the main document, except that the parts are created within the context of the header or footer.  In other words, the header and footer parts are essentially small documents of their own.  Since the header and footer parts are XML, they can be copied as XDocuments.

Diagrams (CopyDiagrams)

Diagram elements contain four attributes that refer to other parts in the document.  The “dm” attribute refers to a DiagramDataPart, the “lo” attribute refers to a DiagramLayoutDefinitionPart, the “qs” attribute refers to a DiagramStylePart and the “cs” attribute refers to a DiagramColorsPart.  All four of these parts are in XML and can be copied as XDocuments.

Shapes (CopyShapes)

Shapes are just another case of references to separate parts.  Those parts are copied as XDocuments.

Embedded Objects (CopyEmbeddedObjects)

Embedded objects are referenced very much like Images.  The referenced parts are binary data and must be copied using block read and write operations.

Custom Control Data (CopyCustomXml)

Custom controls have the ability to refer to XML data that appears in a separate “data store” part.  Multiple controls may refer to the same data store part, but there can also be more than one data store part.  Each data store part is associated with a GUID that is referenced from the custom control in the main document.  The CopyCustomXml function creates a list of all the GUIDs that are referenced and then copies each of them from the source to the new document.  The data store parts are made up of a CustomXmlPart and a CustomXmlPropertiesPart, so both must be copied along with the appropriate changes to IDs.
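The first step, collecting the referenced GUIDs, might look like the following sketch. It assumes the reference is the storeItemID attribute of the dataBinding element inside a content control’s properties; CustomXmlSketch and ReferencedStoreItemIds are hypothetical names, and the copying of the parts themselves is not shown.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class CustomXmlSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the distinct data store GUIDs referenced by w:dataBinding
    // elements anywhere in the given content, in order of first appearance.
    public static List<string> ReferencedStoreItemIds(IEnumerable<XElement> content)
    {
        return content
            .DescendantsAndSelf(W + "dataBinding")
            .Select(b => (string)b.Attribute(W + "storeItemID"))
            .Where(id => id != null)
            .Distinct()
            .ToList();
    }
}
```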

Hyperlinks (CopyHyperlinks)

Hyperlinks actually refer to an “external relationship,” rather than referring to any part within the document.  These external relationships are copied from the source document for each hyperlink that is being copied to the new document.  Hyperlinks can be referenced from “hyperlink” or “imagedata” elements, so both of those cases are handled in the CopyHyperlinks function.

Charts (CopyCharts, CopyChartObjects)

Like headers and footers, each chart is its own part, which can be copied as an XDocument. In addition, each chart usually contains a reference to an embedded object with the data for the chart. These embedded objects are copied as binary data.

Caveats

There are some documents that the example cannot handle correctly.  These cases are rare and difficult to handle, so they were left out to simplify the example.  Here is a list of those special cases:

1.       Custom XML can be embedded in the body of a document, but this completely changes the hierarchy of elements in the resulting XML. That type of source document would require a different method for defining the paragraphs to extract.  Custom XML probably wouldn’t typically be used in this type of document assembly solution, so I decided it did not need to be handled for this example.

2.       The Glossary part is not copied. It is possible to create a header, for example, that refers to an entry in the glossary. If you use the code to copy from that type of document, the header text could be blank or partially blank as a result. However, it will not cause an error when opening the document in Microsoft Word.

3.       Copying of themes is not fully implemented.  If the theme contains references to images, then the resulting document will be invalid.  This problem can be avoided by removing the theme or choosing a document without images in the theme as the first document.  I plan to solve this problem in a future update.

4.       This example was tested with Microsoft Word 2007. I haven’t done any testing with Office 2003.

Conclusion

I hope this example shows how to handle some of the complexities of the Open XML standard for Word Processing documents.  I also hope that sharing the detailed code helps you perform more of the document manipulations that you need in order to take advantage of the Open XML standard.

Comments (39)

  1. Kerry says:

    Wow! Just what we were looking for..

  2. Hi Kerry,

    When you’re ready, kindly give feedback on this code / guidance!  I’m very interested to hear if it works for you.

    -Eric

  3. Chris says:

    Hi Eric,

    I’ve been using the SDK for a little bit, attempting to do some manipulation of the word documents read in, I’ve done something pretty simple to the Source instances created..

     foreach (var content in source.Contents)

        content.Value = "hello";

    and then I use these new ‘Sources’ in the ‘BuildDocument’ method. The problem I have is that I get a blank document when I open it in Word. Opening the docx and inspecting the document.xml file shows all the data is there, just it doesn’t display.

    Have I done some wierd thing with the linking by changing the value of the XElement?

    Cheers for this and your post on the OpenXML stuff!

    Chris

  4. Hi Chris,

    Changing the value of the XElements in the source.Contents using the Value property won’t work.  Those elements need to be well-formed, valid paragraphs:

       <w:p>
         <w:r>
           <w:t>Here is a detailed explanation of what is happening in each of the “fix” and “copy” operations. The name or names in parenthesis are the names of the functions that handle each operation.</w:t>
         </w:r>
       </w:p>

    Of course, all bets are off if this paragraph refers to something that isn’t in the source document, such as an image or something.  But if you want to tweak those XElement objects, keeping them valid, it should work.  I haven’t tried this, though.

    -Eric

  5. teltest says:

    This looks and works great except if one or more of the documents contains a chart – then the output file gets corrupt.  

    I tried to add some code that copied over the chartparts – this gave the chartparts in the new document new reference IDs but I cannot figure out how to change the references in the original Document body

    private static void CopyChartObjects(WordprocessingDocument oldDoc, WordprocessingDocument newDoc)
    {
        foreach (ChartPart cpItem in oldDoc.MainDocumentPart.ChartParts)
        {
            //Get original ID
            string relId = oldDoc.MainDocumentPart.GetIdOfPart(cpItem).ToString();
            ChartPart newPart = newDoc.MainDocumentPart.AddPart(cpItem);
            string newrelId = newDoc.MainDocumentPart.GetIdOfPart(newPart).ToString();
            //Substitute the newref for the old ref in the new document …
            //Kinda stuck here
        }
    }

  6. @teltest, thanks!  We’ll fix this, and post the revised code.

    -Eric

    (Update: March 19, 2009 – the fixed code has been posted to CodePlex: http://www.codeplex.com/powertools.)

  7. Niall Little says:

    Hi Eric,

    Fantastic post and code. I ran into a couple of object ref errors with some of my sample docs in CopyNumbering that were fixed by changing to check for existence of “num” elements before setting the number ~ ln 766:

    newNumbering = newDoc.MainDocumentPart.NumberingDefinitionsPart.GetXDocument();
    elements = newNumbering.Root.Elements(ns + "num");
    if (elements.Count() > 0)
    {
        number = elements.Max(f => ((int)f.Attribute(ns + "numId"))) + 1;
    }
    elements = newNumbering.Root.Elements(ns + "abstractNum");
    if (elements.Count() > 0)
    {
        abstractNumber = elements.Max(f => ((int)f.Attribute(ns + "abstractNumId"))) + 1;
    }

    I guess my first doc had a number element, but no actual numbers so my 2nd doc bunked up. Anyway simple fix.

    An actual merging issue I ran into was with different header styles. My first document had even and odd pages, whereas my second doc only had default. The second doc was then using even/odd pages, so the header “disappeared”. The user story I’m using this in doesn’t actually need to handle this situation, so it’s not an issue that I need to resolve right now, but it would be nice to understand (believe me, I tried!)

    Thanks for the great work. It not only functions but also allowed me to understand a lot more of the inner workings of Word’s Open XML structure by tracing through the code.

  8. bmcclellan62 says:

    Niall,

    Thanks for the feedback. I will post a fix for the numbering issue soon.

    There is an option to "keep sections" when merging documents that should change the behavior you saw with headers and footers. Headers and footers are stored in the section properties, so a new section break must be created in order to store the headers/footers from documents other than the first.

    -Bob

  9. RW says:

    I have multiple documents each with a header and each of those headers might have a different image. DocProc will merge the documents and their headers, but the images have not been merged into the new document. This is wonderful open source project. Thank you so much for posting.

    -RW

  10. Hi RW,

    You might have a slightly older version of DocProc.  There was a bug in the first version that didn’t merge the images in headers/footers into the new document.  Would you please try the most recent version available on codeplex.com/PowerTools, and let me know if it works for you?

    We are putting the final touches on a new version that gives finer control over headers/footers.  In addition, the new one will use the same image if included more than once in the merged document.

    I’m really happy that you like the project!  It’s been fun!

    -Eric

  11. Ernest Bariq says:

    Hi Eric.

    thank you for your code!

    I always get an exception in:

    public static XDocument GetXDocument(this OpenXmlPart part)
    {
        XDocument xdoc = part.Annotation<XDocument>();
        if (xdoc != null)
            return xdoc;
        try
        {
            using (StreamReader sr = new StreamReader(part.GetStream()))
            using (XmlReader xr = XmlReader.Create(sr))
            {
                xdoc = XDocument.Load(xr);

    This throws: root element is missing!

    Furthermore, I have a complex matrix of nested documents that have to be appended at certain places (replacing tokens with other files, which could be WordArt, images, RTF or DOCX), so I have to combine your code and altChunks. It could be

    What about making a public method “InsertAt(source)”?

    Ernest

  12. Hi Ernest,

    I suspect you are trying to open an invalid document.  Have you tried with a known good doc?

    -Eric

  13. Jan Selke says:

    Hi Eric.

    What a brilliant post. Unfortunately (at least for me) I read it too late.

    I had a lot to do with merging documents, copying headers/footers/styles/numberings… the list seems endless. It opens up a whole new approach for me, and I am eager to try it out real soon.

    Keep posting such great work.

    -Jan

  14. axefan says:

    Hi Eric, great work!

    I’m a bit confused regarding bookmark ids, and I could use your help.

    I’m assembling documents from sources that contain bookmarks, and I noticed using the reflector that the bookmark ids in the output document are not unique.

    For example, if I assemble a document from 3 sources that contain 1 bookmark each, the output document contains 3 bookmarks with id="0".

    Word 2007 does not seem to have any problems with these output files.  The compatibility checker did not find any problems, and Word 2003 with the compatibility pack was able to read and inspect the bookmarks correctly.

    I could add code to the class that re-sequences all bookmark ids, but the documentation on MSDN does not specify that they need to be unique, just that there must be a bookmarkEnd that is ‘subsequent to this [bookmarkStart] element in document order with a matching id attribute value’.

    So, should I worry about this?  My first instinct is to assume that duplicate ids are ok as long as the bookmark contents do not overlap.

     – Bill

  15. Hi Bill,

    I’ve taken a look at the spec, and the description for an id is "Specifies a unique identifier for an annotation within a WordprocessingML document."

    So I believe that they should be unique, and that this is a bug in DocumentBuilder.

    It will be a bit before I can get to this due to some upcoming deadlines.  Up to you whether you resequence after, fix DocumentBuilder (if you do, please let me know the fix :), or not worry about it, because Word isn’t complaining.

    -Eric

  16. axefan says:

    Thanks for the speedy reply!

    I agree that the ids should be unique at the document level, and not just within a particular context.  I’m working on a fix now.

  17. axefan says:

    Hi Eric,

    I have an easy solution to the duplicate range id bug.  Here’s a quick breakdown followed by the code parts that have changed.  Let me know what you think.

    Bug: Range ids are not unique in output document.

    Fix: In FixRange, assign a temporary unique id to each valid range encountered.

        New method: GenerateId – used by FixRange to get unique id (Uses Guid.NewGuid).

        New method: FixRangeIds – calls FixRangeId for each range type.

        New method: FixRangeId – assigns sequential id to all ranges of a specific type.

        In BuildDocument (private), add call to FixRangeIds after AppendDocument loop.

    Here are the code parts that have changed.  If you’d rather have the full source so you can ‘DIFF’ it, just let me know where to send it.

    private static void BuildDocument(List<Source> sources, WordprocessingDocument output)
    {
        // This list is used to eliminate duplicate images
        List<ImageData> images = new List<ImageData>();
        output.AddMainDocumentPart();
        XDocument mainPart = output.MainDocumentPart.GetXDocument();
        mainPart.Add(new XElement(ns + "document", ns_attrs, new XElement(ns + "body")));
        if (sources.Count > 0)
        {
            output.CopyStartingParts(sources[0].Document, images);
            bool lastKeepSections = false;
            foreach (Source source in sources)
            {
                output.AppendDocument(source.Document, source.Contents, source.KeepSections, lastKeepSections, images);
                lastKeepSections = source.KeepSections;
            }
            FixRangeIds(mainPart);
        }
    }

     

    private static void FixRange(XDocument oldDoc, IEnumerable<XElement> paragraphs, XName startElement, XName endElement, XName idAttribute, XName refElement)
    {
        // Pass 1: make sure every start element has a matching end element.
        foreach (XElement start in paragraphs.Elements(startElement))
        {
            XElement end = null;
            string rangeId = start.Attribute(idAttribute).Value;
            // FirstOrDefault returns null when there is no match; First would throw.
            end = paragraphs.Elements(endElement).Where(e => e.Attribute(idAttribute).Value == rangeId).FirstOrDefault();
            if (end == null)
            {
                end = oldDoc.Descendants().Elements(endElement).Where(o => o.Attribute(idAttribute).Value == rangeId).FirstOrDefault();
                if (end != null)
                {
                    paragraphs.Last().Add(new XElement(end));
                    if (refElement != null)
                    {
                        XElement newRef = new XElement(refElement, new XAttribute(idAttribute, rangeId));
                        paragraphs.Last().Add(newRef);
                    }
                }
            }
            if (end != null)
            {
                // Assign a temporary unique id to the matched pair.
                rangeId = GenerateId();
                start.Attribute(idAttribute).Value = rangeId;
                end.Attribute(idAttribute).Value = rangeId;
            }
        }
        // Pass 2: make sure every end element has a matching start element.
        foreach (XElement end in paragraphs.Elements(endElement))
        {
            XElement start = null;
            string rangeId = end.Attribute(idAttribute).Value;
            start = paragraphs.Elements(startElement).Where(s => s.Attribute(idAttribute).Value == rangeId).FirstOrDefault();
            if (start == null)
            {
                start = oldDoc.Descendants().Elements(startElement).Where(o => o.Attribute(idAttribute).Value == rangeId).FirstOrDefault();
                if (start != null)
                    paragraphs.First().AddFirst(new XElement(start));
            }
            if (start != null)
            {
                rangeId = GenerateId();
                start.Attribute(idAttribute).Value = rangeId;
                end.Attribute(idAttribute).Value = rangeId;
            }
        }
    }

     

    public static String GenerateId()
    {
        string id = Guid.NewGuid().ToString();
        if (id == "00000000-0000-0000-0000-000000000000") throw new Exception("Unable to generate unique id!");
        return id.Replace("-", "").ToUpper();
    }

    private static void FixRangeIds(XDocument doc)
    {
        FixRangeId(doc, ns + "commentRangeStart", ns + "commentRangeEnd", ns + "id");
        FixRangeId(doc, ns + "bookmarkStart", ns + "bookmarkEnd", ns + "id");
        FixRangeId(doc, ns + "permStart", ns + "permEnd", ns + "id");
        FixRangeId(doc, ns + "moveFromRangeStart", ns + "moveFromRangeEnd", ns + "id");
        FixRangeId(doc, ns + "moveToRangeStart", ns + "moveToRangeEnd", ns + "id");
    }

    private static void FixRangeId(XDocument doc, XName startElement, XName endElement, XName idAttribute)
    {
        int id = 0;
        foreach (XElement start in doc.Descendants(startElement))
        {
            string rangeId = start.Attribute(idAttribute).Value;
            XElement end = doc.Descendants(endElement).Where(e => e.Attribute(idAttribute).Value == rangeId).First();
            start.Attribute(idAttribute).Value = id.ToString();
            end.Attribute(idAttribute).Value = id.ToString();
            id++;
        }
    }

  18. axefan says:

    Hi Eric,

    I have identified and fixed (hopefully) another bug in FixRange.  Also, my version of FixRange above still contains a bug.  This post should take care of both.

    Bug #1: The FixRange method misses ranges that occur at the body level (outside of any paragraph).

    Fix #1: Add two blocks to FixRange that handle ranges occurring at the body level.

           Add a LINQ extension method that enumerates the paragraph elements themselves rather than their children.

    Bug #2: FixRange assigns the temporary unique id to the wrong element when fixing a broken range.

    Fix #2: Add code to FixRange to target the newly created elements correctly.

    First, here’s the extension required by the code change.

    public static class Extensions
    {
        public static IEnumerable<XElement> SelfElements(this IEnumerable<XElement> source, XName name)
        {
            foreach (XElement element in source)
            {
                if (element.Name == name)
                    yield return element;
            }
        }
    }

     

    And here’s the new FixRange method.

     

    private static void FixRange(XDocument oldDoc, IEnumerable<XElement> paragraphs, XName startElement, XName endElement, XName idAttribute, XName refElement)
    {
        foreach (XElement start in paragraphs.SelfElements(startElement))
        {
            XElement end = null;
            string rangeId = start.Attribute(idAttribute).Value;
            IEnumerable<XElement> results = paragraphs.SelfElements(endElement).Where(e => e.Attribute(idAttribute).Value == rangeId);
            if (results.Any()) end = results.First();
            if (end == null)
            {
                results = oldDoc.Descendants().Elements(endElement).Where(o => o.Attribute(idAttribute).Value == rangeId);
                if (results.Any()) end = results.First();
                if (end != null)
                {
                    end = new XElement(end);
                    if (paragraphs.Last().Name == (ns + "sectPr"))
                    {
                        paragraphs.Last().AddBeforeSelf(end);
                    }
                    else
                    {
                        paragraphs.Last().Add(end);
                    }
                    if (refElement != null)
                    {
                        XElement newRef = new XElement(refElement, new XAttribute(idAttribute, rangeId));
                        paragraphs.Last().Add(newRef);
                    }
                }
            }
            if (end != null)
            {
                rangeId = GenerateId();
                start.Attribute(idAttribute).Value = rangeId;
                end.Attribute(idAttribute).Value = rangeId;
            }
        }
        foreach (XElement start in paragraphs.Elements(startElement))
        {
            XElement end = null;
            string rangeId = start.Attribute(idAttribute).Value;
            IEnumerable<XElement> results = paragraphs.Elements(endElement).Where(e => e.Attribute(idAttribute).Value == rangeId);
            if (results.Any()) end = results.First();
            if (end == null)
            {
                results = oldDoc.Descendants().Elements(endElement).Where(o => o.Attribute(idAttribute).Value == rangeId);
                if (results.Any()) end = results.First();
                if (end != null)
                {
                    end = new XElement(end);
                    paragraphs.Last().Add(end);
                    if (refElement != null)
                    {
                        XElement newRef = new XElement(refElement, new XAttribute(idAttribute, rangeId));
                        paragraphs.Last().Add(newRef);
                    }
                }
            }
            if (end != null)
            {
                rangeId = GenerateId();
                start.Attribute(idAttribute).Value = rangeId;
                end.Attribute(idAttribute).Value = rangeId;
            }
        }
        foreach (XElement end in paragraphs.SelfElements(endElement))
        {
            XElement start = null;
            string rangeId = end.Attribute(idAttribute).Value;
            IEnumerable<XElement> results = paragraphs.SelfElements(startElement).Where(s => s.Attribute(idAttribute).Value == rangeId);
            if (results.Any()) start = results.First();
            if (start == null)
            {
                results = oldDoc.Descendants().Elements(startElement).Where(o => o.Attribute(idAttribute).Value == rangeId);
                if (results.Any()) start = results.First();
                if (start != null)
                {
                    start = new XElement(start);

  19. Hi Bill, that is awesome!  Source code would be great, as I’ve tweaked the source a bit elsewhere, and it would be great to do a diff.  You can send it to white dot eric at microsoft dot com.

    -Eric

  20. semaphore_au says:

    Hi Eric,

    Sensational code. I spent weeks doing similar coding, but without the ease of XLinq. I’m gradually replacing my archaic methods with your code. I notice you are looking at the source, and I have one wee bug you may like to fix. In ImageData.WriteImage I think the line:

    s.Write(m_Image, 0, m_Image.GetUpperBound(0));

    should be:

    s.Write(m_Image, 0, m_Image.GetUpperBound(0) + 1);

    My wmf binaries were coming out corrupted, being one byte too short.

    -David
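
    The off-by-one described above comes from GetUpperBound(0) returning the highest valid index rather than the number of elements. Here is a minimal sketch of the difference, assuming C# 9 or later for the top-level statements; the five-byte array stands in for m_Image and the MemoryStream for the part stream, and both are hypothetical:

```csharp
using System;
using System.IO;

// Hypothetical stand-in for the m_Image field in ImageData.
byte[] image = { 10, 20, 30, 40, 50 };

// GetUpperBound(0) is the last valid index, so it is one less than Length.
int upperBound = image.GetUpperBound(0);

using var truncated = new MemoryStream();
truncated.Write(image, 0, upperBound);      // count is 4: the last byte is dropped

using var complete = new MemoryStream();
complete.Write(image, 0, image.Length);     // count is 5: the whole array

Console.WriteLine($"{truncated.Length} vs {complete.Length}");   // prints "4 vs 5"
```

    For a zero-based, one-dimensional array, Length is equivalent to GetUpperBound(0) + 1 and states the intent (a byte count) more directly.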

  21. Ganesh says:

    Hi Eric,

    I am trying to merge three documents.  I get the merged document, but I am facing one issue: the input documents (1.docx, 2.docx, 3.docx) each have a different header, and after merging, the first document’s header is applied to all the pages, so every page has the header of 1.docx.  I took the latest code from

    http://powertools.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=26378

    DocProc.zip

    Code

    List<Source> sources = new List<Source>();
    WordprocessingDocument part1 = WordprocessingDocument.Open("1.docx", false);
    sources.Add(new Source(part1, 0, false));
    WordprocessingDocument part2 = WordprocessingDocument.Open("2.docx", false);
    sources.Add(new Source(part2, 0, false));
    WordprocessingDocument part3 = WordprocessingDocument.Open("3.docx", false);
    sources.Add(new Source(part3, 0, false));
    DocumentBuilder.BuildDocument(sources, "Test1.docx");

    Please reply,

    Ganesh

  22. Hi Ganesh,  my question is, what do you want the behavior to be?  Do you want the merge to create three sections, each with its own header?  Or some other behavior?

    -Eric

  23. Ganesh says:

    Hi Eric,

    Yes, I want to merge the three documents, each of which has one page and contains a different header with styles.

    The final merged document shows the first document’s header on all the pages.

    How can I overcome this?

    Thanks

  24. Ganesh says:

    Hi Eric,

    One more question: I am trying to remove the headers after merging two documents.  Say 1.docx and 2.docx have headers and body contents.  After merging, the final document should contain only the body contents of 1.docx + 2.docx; I don’t want to include the header parts.  For that, I have commented out the call to the CopyHeaders method in documentBuilder.cs:

    CopyHeaders(oldDoc, newDoc, paragraphs, images);

    But when I try to open the merged document, I get an error saying that the final document cannot be opened because there are problems with the contents.

    Details

    Microsoft Office cannot open this file because some parts are missing or invalid.

    Location Word/Document.xml Line 4023, Column 57

    How can I remove the headers after merging into a new document?

    thanks

    Ganesh

  25. Hi Ganesh, I’ve written a post on how to work with headers when using DocumentBuilder:

    http://blogs.msdn.com/ericwhite/archive/2010/01/08/how-to-control-sections-when-using-openxml-powertools-documentbuilder.aspx

    To delete headers, you can find all of the w:sectPr elements in the body of the document and remove them.  I’ll put something together as time permits.  However, that post may give you enough information to explicitly control which sections you include in your destination document.

    -Eric
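
    The suggestion above (find the w:sectPr elements in the body and remove them) could look roughly like the following LINQ to XML sketch. It assumes C# 9 or later for the top-level statements, and the body markup is a hypothetical, simplified stand-in for a real main document part, not output from DocumentBuilder:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

// Hypothetical body: one section break stored in a paragraph's properties,
// plus the final section properties as the last child of w:body.
XElement body = new XElement(w + "body",
    new XElement(w + "p",
        new XElement(w + "pPr",
            new XElement(w + "sectPr",
                new XElement(w + "headerReference"))),
        new XElement(w + "r", new XElement(w + "t", "First document"))),
    new XElement(w + "p",
        new XElement(w + "r", new XElement(w + "t", "Second document"))),
    new XElement(w + "sectPr",
        new XElement(w + "headerReference")));

// Materialize the query first so the tree is not modified while it is
// being enumerated.
foreach (XElement sectPr in body.Descendants(w + "sectPr").ToList())
    sectPr.Remove();

Console.WriteLine(body.Descendants(w + "sectPr").Count());   // prints 0
```

    Note that w:sectPr also carries page size and margin settings, so removing the whole element discards those along with the header references. The header parts themselves would still be present in the package; this sketch only removes the markup that points to them.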

  26. Abhishek says:

    Hello Eric,

    I am not exactly using your code, since my needs are simpler.  However, the idea I am using is the same.  Basically, I have a template from which I read all the nodes starting with << and then, based on the inner text, replace each one with the corresponding data from the data source.  Code below:

    NameTable nt = new NameTable();
    XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
    nsManager.AddNamespace("w", wordmlNamespace);
    PackagePart packagePart;
    XmlNodeList nodes;
    // Verbatim string, so the backslashes are not treated as escape sequences
    string mergedFileName = @"C:\MailMergeTemplates\Merged" + DateTime.Now.Ticks.ToString() + ".docx";
    File.Copy(sourceFile, mergedFileName, true);
    using (Package p = Package.Open(mergedFileName))
    {
        packagePart = p.GetPart(uri);
        XmlDocument xdoc = new XmlDocument();
        xdoc.Load(packagePart.GetStream());
        nodes = xdoc.SelectNodes("//w:t[starts-with(text(),'«')]", nsManager);
        int ctr = 0;
        foreach (System.Xml.XmlNode node in nodes)
        {
            if (node.FirstChild.InnerText.Contains("Cases_First_Name"))
            {
                if (Participant.Properties.Contains("lw_firstname"))
                {
                    node.FirstChild.InnerText = Participant.Properties["lw_firstname"].ToString();
                }
                else
                {
                    node.FirstChild.InnerText = "";
                }

    Now my problem is that sometimes a merge field has no corresponding data (a null value in the data source).  In this case I set the field to "", but if that is the only field on the line, I get a blank line, which is sometimes incorrect.  For example, if address line 2 comes back blank, I will have an empty line between address line 1 and the city.  Is there a way I can simply remove that node?

  27. Hi Abhishek,

    Instead of looking for text nodes to replace, I recommend that you use content controls to delineate the text that you will replace.  Before comparing text for your pattern, you can concatenate adjacent text nodes.  Then, you can completely replace the content control with your retrieved data.  If you want to replace the text with other text that has identical formatting, before constructing the content that will replace the content control, you can retrieve the formatting from the underlying text of the content control.

    For your case, when you have retrieved data that you want to replace with nothing, you can simply delete the content control, which will remove the empty paragraph.  This will solve another problem, which is that you are not guaranteed that the ‘<<’ will be in a single text node.  Sooner or later, your code may fail because the << or Cases_First_Name strings will be split.

    Using DocumentBuilder with Content Controls for Document Assembly (http://blogs.msdn.com/ericwhite/archive/2009/04/21/using-documentbuilder-with-content-controls-for-document-assembly.aspx) contains an example of replacing a content control with other data.

    -Eric
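
    As a rough illustration of the content control approach described above, here is a minimal LINQ to XML sketch, assuming C# 9 or later for the top-level statements. The tag names, the data dictionary, and the Sdt helper are all hypothetical; the sketch handles only block-level content controls identified by w:tag, and it does not carry over run formatting:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

// Hypothetical helper: a block-level content control wrapping one paragraph.
XElement Sdt(string tag, string placeholder) =>
    new XElement(w + "sdt",
        new XElement(w + "sdtPr",
            new XElement(w + "tag", new XAttribute(w + "val", tag))),
        new XElement(w + "sdtContent",
            new XElement(w + "p",
                new XElement(w + "r", new XElement(w + "t", placeholder)))));

XElement body = new XElement(w + "body",
    Sdt("AddressLine1", "«line 1»"),
    Sdt("AddressLine2", "«line 2»"),
    Sdt("City", "«city»"));

// Hypothetical data source; AddressLine2 has no value.
var data = new Dictionary<string, string>
{
    ["AddressLine1"] = "1 Main Street",
    ["AddressLine2"] = null,
    ["City"] = "Springfield",
};

foreach (XElement sdt in body.Elements(w + "sdt").ToList())
{
    string tag = (string)sdt.Elements(w + "sdtPr").Elements(w + "tag")
        .Attributes(w + "val").FirstOrDefault();
    if (tag == null || !data.TryGetValue(tag, out string value))
        continue;                       // not one of our fields; leave it alone

    if (string.IsNullOrEmpty(value))
        sdt.Remove();                   // no data: drop the control and its paragraph
    else
        sdt.ReplaceWith(new XElement(w + "p",
            new XElement(w + "r", new XElement(w + "t", value))));
}

Console.WriteLine(string.Join(" | ", body.Elements(w + "p").Select(p => p.Value)));
// prints "1 Main Street | Springfield"
```

    Because the whole control is removed when there is no data, no empty paragraph is left between address line 1 and the city.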

  28. Irfan Change says:

    Hi Eric,

    When copying embedded objects you need to account for both EmbeddedPackagePart and EmbeddedObjectPart possibilities.  There is an easy solution however, all you need to do is reference the par

    In method: DocumentBuilder.CopyEmbeddedObjects(…) on line 973 or so, you can replace the code and declare parts as OpenXmlPart instead of EmbeddedObjectPart.

    So that the new method looks like:

    private static void CopyEmbeddedObjects(WordprocessingDocument oldDoc, WordprocessingDocument newDoc, IEnumerable<XElement> paragraphs)
    {
        foreach (XElement oleReference in paragraphs.Descendants(ns_o + "OLEObject"))
        {
            string relId = oleReference.Attribute(ns_r + "id").Value;
            OpenXmlPart oldPart = oldDoc.MainDocumentPart.GetPartById(relId);
            OpenXmlPart newPart = newDoc.MainDocumentPart.AddEmbeddedObjectPart(oldPart.ContentType);
            using (Stream oldObject = oldPart.GetStream(FileMode.Open, FileAccess.Read))
            using (Stream newObject = newPart.GetStream(FileMode.Create, FileAccess.ReadWrite))
            {
                int byteCount;
                byte[] buffer = new byte[65536];
                while ((byteCount = oldObject.Read(buffer, 0, 65536)) != 0)
                    newObject.Write(buffer, 0, byteCount);
            }
            oleReference.Attribute(ns_r + "id").Value = newDoc.MainDocumentPart.GetIdOfPart(newPart);
        }
    }

     

    Thanks, Irfan!  I’ve updated the source on CodePlex.

    -Eric

  29. Rocky M. says:

    Hi Eric,

    Thank you very much for the guideline for merging docx files. I was referred by Jason (author of docx4j) to this article and I have implemented document merging in Java based on this guideline and docx4j.

    There is one more item I'd like to add to the merging. It's for Content Controls (sdt objects).

    Word assigns each Content Control a unique ID. When merging, an ID may be duplicated, e.g. when a Content Control is copied to a new one. When Word opens a merged document with duplicated Content Control IDs, some of the Content Controls are not recognized; they are displayed as ordinary text, with no blue box around them. When the document is saved, those Content Controls are lost and only their content remains.

    My solution is to reassign the Content Control IDs after merging to make sure each ID is unique. It's working fine for me.

    Please let me know if you have a better idea.

    Cheers!

    Rocky
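
    In C# with LINQ to XML, the reassignment described above might be sketched as follows, assuming C# 9 or later for the top-level statements. The Sdt helper and the sample ids are hypothetical, and numbering the controls sequentially from 1 is an assumption made for illustration; any scheme that yields distinct numeric values would serve the same purpose:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

// Hypothetical helper: a content control whose id is stored in w:sdtPr/w:id.
XElement Sdt(string id) =>
    new XElement(w + "sdt",
        new XElement(w + "sdtPr",
            new XElement(w + "id", new XAttribute(w + "val", id))),
        new XElement(w + "sdtContent", new XElement(w + "p")));

// Hypothetical merged body in which two controls ended up with the same id.
XElement body = new XElement(w + "body", Sdt("12345"), Sdt("12345"), Sdt("777"));

// Assumption: sequential ids starting at 1 are acceptable.
int nextId = 1;
foreach (XElement idElement in body.Descendants(w + "sdtPr").Elements(w + "id"))
    idElement.SetAttributeValue(w + "val", nextId++);

var ids = body.Descendants(w + "sdtPr").Elements(w + "id")
    .Select(e => (string)e.Attribute(w + "val"))
    .ToList();
Console.WriteLine(string.Join(",", ids));   // prints "1,2,3"
```

    This sketch only looks at one XML tree. Content controls in other parts, such as headers and footers, would presumably need to share the same counter so that the ids stay unique across the whole document.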

  30. axefan says:

    Hi All,

    I found an external relationship that is not currently handled.  Linked images.

    Here's an updated version of CopyImages that handles them properly.

    private static void CopyImages(WordprocessingDocument oldDoc, WordprocessingDocument newDoc, IEnumerable<XElement> paragraphs, List<ImageData> images)
    {
        foreach (XElement imageReference in paragraphs.Descendants(ns_a + "blip"))
        {
            string relId = imageReference.Attribute(ns_r + "embed").Value;
            ImagePart oldPart = (ImagePart)oldDoc.MainDocumentPart.GetPartById(relId);
            ImageData temp = ManageImageCopy(oldPart, images);
            if (temp.ResourceID == null)
            {
                ImagePart newPart = newDoc.MainDocumentPart.AddImagePart(oldPart.ContentType);
                temp.ResourceID = newDoc.MainDocumentPart.GetIdOfPart(newPart);
                // Copy link reference
                XAttribute link = imageReference.Attribute(ns_r + "link");
                if (link != null)
                {
                    String linkId = link.Value;
                    // Use the method's own parameters (oldDoc/newDoc) for the source and target documents
                    ReferenceRelationship rel = oldDoc.MainDocumentPart.GetReferenceRelationship(linkId);
                    newDoc.MainDocumentPart.AddExternalRelationship(rel.RelationshipType, rel.Uri, rel.Id);
                }
                temp.WriteImage(newPart);
            }
            imageReference.Attribute(ns_r + "embed").Value = temp.ResourceID;
        }
    }

  31. axefan says:

    Hi Eric,

    I'm not sure if you're still taking input on this, but I've found another missed case.  Is this the best place to provide feedback?

    Anyway, I'm getting really close to finishing my implementation, but I recently noticed several source documents that contain bulleted lists where the numbering part was not created properly in the output file.

    The original CopyNumbering method misses the case where a numbering paragraph does not contain a numbering properties element (w:numPr).  Instead, it has a paragraph style reference (w:pStyle) to a style definition that contains the numbering properties element (w:numPr).

    The fix is easy: just add another loop that processes all w:pStyle elements where the parent (w:p) does not contain a w:numPr element.  This gives us all the paragraph styles for paragraphs not handled by the current code.  Next, get the matching style from the style part and get the style's w:numPr element and w:numId element, if any.  If there is no w:numPr or w:numId element, then the paragraph does not contain a list item.

    Once you have the w:numId element from the style, the rest of the code is the same.

    My only question at this point is this:  If a paragraph contains both a numbering properties element and a style that references a numbering properties element (seems to be the most common case), which one does Word use?  Currently, my implementation favors a numbering properties element in the paragraph, but that could easily be changed.
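
    The lookup described above might be sketched like this with LINQ to XML, assuming C# 9 or later for the top-level statements. The style ids, the Paragraph helper, and ResolveNumId are hypothetical names, not part of DocumentBuilder. The sketch prefers a w:numPr on the paragraph itself, matching the choice stated in the comment, and it does not follow w:basedOn chains between styles:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

// Hypothetical styles part: ListBullet carries the numbering, Normal does not.
XElement styles = new XElement(w + "styles",
    new XElement(w + "style", new XAttribute(w + "styleId", "ListBullet"),
        new XElement(w + "pPr",
            new XElement(w + "numPr",
                new XElement(w + "numId", new XAttribute(w + "val", "3"))))),
    new XElement(w + "style", new XAttribute(w + "styleId", "Normal")));

// Hypothetical helper: a paragraph with a style and, optionally, direct numbering.
XElement Paragraph(string styleId, string directNumId = null) =>
    new XElement(w + "p",
        new XElement(w + "pPr",
            new XElement(w + "pStyle", new XAttribute(w + "val", styleId)),
            directNumId == null ? null :
                new XElement(w + "numPr",
                    new XElement(w + "numId", new XAttribute(w + "val", directNumId)))));

// Returns the numId that applies to the paragraph, or null if it is not a list item.
string ResolveNumId(XElement paragraph)
{
    XElement pPr = paragraph.Element(w + "pPr");
    if (pPr == null) return null;

    // Numbering set directly on the paragraph is checked first.
    string direct = (string)pPr.Elements(w + "numPr").Elements(w + "numId")
        .Attributes(w + "val").FirstOrDefault();
    if (direct != null) return direct;

    // Otherwise fall back to the numbering defined by the paragraph style.
    string styleId = (string)pPr.Elements(w + "pStyle").Attributes(w + "val").FirstOrDefault();
    if (styleId == null) return null;

    return (string)styles.Elements(w + "style")
        .Where(s => (string)s.Attribute(w + "styleId") == styleId)
        .Elements(w + "pPr").Elements(w + "numPr").Elements(w + "numId")
        .Attributes(w + "val").FirstOrDefault();
}

Console.WriteLine(ResolveNumId(Paragraph("ListBullet")) ?? "(none)");       // prints "3"
Console.WriteLine(ResolveNumId(Paragraph("ListBullet", "7")) ?? "(none)");  // prints "7"
Console.WriteLine(ResolveNumId(Paragraph("Normal")) ?? "(none)");           // prints "(none)"
```

    Once a numId has been resolved either way, it can be handed to the same code that already copies numbering definitions for paragraphs with a direct w:numPr.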

  32. Kieron Dye says:

    Hi Eric,

    I think I've found a solution to the "Theme with images" problem. It turns out that the SDK currently does not create a part relationship when you call ThemePart.AddImagePart. The simple fix is to add the following code after the image has been added:

    doc.MainDocumentPart.ThemePart.CreateRelationshipToPart(newPart, temp.ResourceID);

    Hope this helps.

  33. Eugene Pliskin says:

    It looks like DocumentBuilder currently produces an invalid document if a footer or comment part in the source document contains references to hyperlinks or pictures.

    The references to the hyperlinks or pictures are copied into the footer or comment part of the output document, but the hyperlinks or images themselves are not.

    As a result, these references in the target document point to a non-existent hyperlink or image.

  34. S M Anisha says:

    HAI

    I am trying to take out only the Open XML parts of the document from the particular sources list. Help me!

    Thanks.

  35. guy-from-jackson says:

    this did not help me at all…. partly because i'm from jackson and can't read….but nevertheless, you suck

  36. E Robinson says:

    didnt understand a word of it. Far too complicated and not any examples

  37. Abbas says:

    Can I use this to write Java code?

  38. Pascal says:

    Hello Eric. I'm trying to mail merge a document with Open XML. Unfortunately, I can't really understand your code (I'm a novice C# developer). My data comes from a CSV file, and I have already loaded it into a list of objects like this:

    static void Main(string[] args)
    {
        string fileName = @"C:\Users\stagiaire.ikosoft\Documents\Modele1.dotx";
        using (WordprocessingDocument pkgDoc = WordprocessingDocument.Open(fileName, true))
        {
            string file_name = @"C:\Users\stagiaire.ikosoft\Documents\test.csv";
            // One Customer per CSV line, so earlier rows are not overwritten
            List<Customer> customers = new List<Customer>();
            using (StreamReader sr = new StreamReader(file_name))
            {
                string line = null;
                while ((line = sr.ReadLine()) != null)
                {
                    string[] tokens = line.Split(',');
                    customers.Add(new Customer
                    {
                        nom = tokens[0],
                        prenom = tokens[1],
                        code = tokens[2],
                    });
                }
            }
        }
    }

    How can I now merge this? Any tips would be a great help.

    Thank You, Pascal