Embedding Any File Type, Like PDF, in an Open XML File

In my last post, I showed you guys how to embed an Excel spreadsheet within a Word document without the need to invoke an OLE server. In today's post, I am going to show you how to embed any file in an Open XML file, using a PDF file in a Word document as the example. Note that, unlike last time, this approach does require you to invoke an OLE server to embed the file.

This post uses version 2 of the Open XML SDK.

If you just want to jump straight into the code, feel free to download this solution here.

Solution

To embed a PDF file into a Word document we can take the following actions:

  1. Create a template in Word that contains a content control that will be used to demarcate the region where the embedded object will be inserted
  2. Open up the Word document via the Open XML SDK and access its main document part
  3. Invoke the OLE server application associated with PDF files to create an IStorage and an image of the embedded object
  4. Add an image part to the document
  5. Feed the data from the generated image into the added image part
  6. Add an embedded object part to the document
  7. Feed the data from the generated IStorage into the embedded object part
  8. Determine the prog id of the application associated with PDF files
  9. Create a paragraph that contains the embedded object
  10. Locate the content control that will contain the embedded object
  11. Swap out the content control for the newly created paragraph
  12. Save changes made to the Word document

Note that the steps outlined above are just one method to accomplish this scenario. The steps above are very similar to my previous post showing you how to embed an Excel spreadsheet within a Word document. The main difference is in how we go about adding the embedded object to the Word document. No application, at least on my computer, has written out a subkey IPersistStorageType under HKCR\CLSID\{Apps_OLE_Storage_CLSID} for PDF files, which means there is no way for us to know the required structure of an IStorage containing a PDF file. Instead we are required to rely on the OLE server application associated with PDF files to generate the appropriate IStorage.

For the sake of this example, let's say I am starting with the following Word document:

Embed1

This document contains a content control, named "EmbedObject," which will contain my embedded object. In addition, let's say I have the following PDF file I wish to embed:

Embed2

The Code

As mentioned in my previous post, embedding an object in a document requires both a visual representation of the object and the underlying data. In this post, I am going to show you how to generate the IStorage and the image representing the embedded object by invoking the OLE server associated with PDF files.

To create the underlying data for a non-Office embedded object, we need to look up the prog id of the application associated with the file format extension. To get this data we need to look under HKCR\.XXX within the registry, where XXX is the file format extension (e.g., PDF). Under this key you should see at least two values: "(Default)" and "Content Type." The data stored in "(Default)" is the prog id of the application associated with the file format. On my computer, the prog id associated with PDF files is "AcroExch.Document."
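If you would rather do this lookup in code than in regedit, here is a minimal sketch of one way to go about it. The function names ExtensionKey and LookupProgId are my own placeholders, not part of any SDK. The registry read only happens on Windows (link against Advapi32.lib); on other platforms the sketch simply returns an empty string.

```cpp
#include <cstddef>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: returns the HKCR subkey name for a file name,
// e.g. L"report.PDF" -> L".pdf". Returns an empty string when the
// file name has no extension.
std::wstring ExtensionKey(const std::wstring& fileName) {
    const std::size_t dot = fileName.find_last_of(L'.');
    const std::size_t sep = fileName.find_last_of(L"\\/");
    if (dot == std::wstring::npos ||
        (sep != std::wstring::npos && dot < sep) ||
        dot + 1 == fileName.size()) {
        return L"";
    }
    std::wstring key = fileName.substr(dot);
    // Registry key names are case-insensitive; lowercasing just keeps
    // the result tidy.
    for (wchar_t& c : key) {
        if (c >= L'A' && c <= L'Z') c = static_cast<wchar_t>(c - L'A' + L'a');
    }
    return key;
}

// Hypothetical helper: reads the "(Default)" value under HKCR\.ext.
// Returns an empty string if nothing is registered (or off Windows).
std::wstring LookupProgId(const std::wstring& fileName) {
    const std::wstring key = ExtensionKey(fileName);
    if (key.empty()) return L"";
#ifdef _WIN32
    wchar_t buffer[256];
    DWORD size = sizeof(buffer);
    // Passing a null value name asks for the key's default value.
    const LSTATUS status = RegGetValueW(HKEY_CLASSES_ROOT, key.c_str(), nullptr,
                                        RRF_RT_REG_SZ, nullptr, buffer, &size);
    if (status != ERROR_SUCCESS) return L"";
    return std::wstring(buffer);
#else
    return L"";
#endif
}
```

On my computer I would expect LookupProgId(L"input.pdf") to come back with "AcroExch.Document", but the result depends entirely on which application has registered itself for PDF files on your machine.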

Since we don't know the structure of the embedded object we shouldn't use the content type associated with the file format extension. Instead, we should use the generic content type for embedded objects, which is "application/vnd.openxmlformats-officedocument.oleObject."

Our next step is to create the IStorage and an image representation for the embedded object. As mentioned in the Solution section above, we need to invoke the OLE Server associated with PDF files. Below is the C++ code needed to accomplish this task:

//********** This snippet is C++ code *************//
// CheckHr is an error-handling helper that is not shown in this post.
// emfFile was not declared in the original listing; it is assumed here
// to be the path of the .emf image to write, passed in as a parameter.
HRESULT PackageOleObject(LPCTSTR inputFile, LPCTSTR outputFile, LPCTSTR emfFile)
{
    HRESULT hr = S_OK;
    IStoragePtr pStorage = NULL;
    IOleObjectPtr pOle = NULL;
    IDataObjectPtr pdo = NULL;
    FORMATETC fetc;
    STGMEDIUM stgm;
    HENHMETAFILE hmeta = NULL;

    // Create a compound storage document.
    hr = StgCreateStorageEx(
        outputFile,
        STGM_READWRITE | STGM_SHARE_EXCLUSIVE | STGM_CREATE | STGM_TRANSACTED,
        STGFMT_DOCFILE,
        0,
        NULL,
        NULL,
        IID_IStorage,
        reinterpret_cast<void**>(&pStorage));
    CheckHr(hr);

    // Create OLE package from file.
    hr = OleCreateFromFile(CLSID_NULL, inputFile, ::IID_IOleObject,
        OLERENDER_NONE, NULL, NULL, pStorage, (void**)&pOle);
    CheckHr(hr);

    // Start the OLE server associated with the file.
    hr = OleRun(pOle);
    CheckHr(hr);

    hr = pOle->QueryInterface(IID_IDataObject, (void**)&pdo);
    CheckHr(hr);

    // Ask the server for an enhanced metafile rendering of the content.
    fetc.cfFormat = CF_ENHMETAFILE;
    fetc.dwAspect = DVASPECT_CONTENT;
    fetc.lindex = -1;
    fetc.ptd = NULL;
    fetc.tymed = TYMED_ENHMF;

    stgm.hEnhMetaFile = NULL;
    stgm.tymed = TYMED_ENHMF;
    hr = pdo->GetData(&fetc, &stgm);
    CheckHr(hr);

    // Create image metafile for object (writes a copy to emfFile).
    hmeta = CopyEnhMetaFile(stgm.hEnhMetaFile, emfFile);

    // Flush the transacted storage to outputFile.
    hr = pStorage->Commit(STGC_DEFAULT);
    CheckHr(hr);

    pOle->Close(0);
    DeleteEnhMetaFile(stgm.hEnhMetaFile);
    DeleteEnhMetaFile(hmeta);

    return hr;
}

The above C++ code snippet will create two output files that represent the IStorage and the image representation for our embedded object.
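Before feeding these two files into the package, it can be worth a quick sanity check that they are what we expect. The sketch below is my own addition and is not part of the downloadable solution; the helper names are made up. It relies on two facts about the formats: a compound file (the on-disk form of an IStorage) begins with the 8-byte signature D0 CF 11 E0 A1 B1 1A E1, and an enhanced metafile begins with a header record of type 1 and carries the signature " EMF" at byte offset 40.

```cpp
#include <cstddef>
#include <cstring>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Hypothetical helper: reads up to `count` bytes from the start of a file.
// Returns fewer bytes (possibly none) if the file is short or missing.
std::vector<unsigned char> ReadPrefix(const std::string& path, std::size_t count) {
    std::vector<unsigned char> bytes(count);
    std::ifstream in(path, std::ios::binary);
    in.read(reinterpret_cast<char*>(bytes.data()),
            static_cast<std::streamsize>(count));
    bytes.resize(in ? count : static_cast<std::size_t>(in.gcount()));
    return bytes;
}

// True if the bytes start with the compound file signature.
bool LooksLikeCompoundFile(const std::vector<unsigned char>& b) {
    static const unsigned char kMagic[8] =
        {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    return b.size() >= 8 && std::memcmp(b.data(), kMagic, 8) == 0;
}

// True if the bytes look like an EMF header: record type 1 (little-endian)
// at offset 0 and the " EMF" signature at offset 40.
bool LooksLikeEmf(const std::vector<unsigned char>& b) {
    static const unsigned char kType[4] = {0x01, 0x00, 0x00, 0x00};
    static const unsigned char kSig[4] = {0x20, 0x45, 0x4D, 0x46};
    return b.size() >= 44 &&
           std::memcmp(b.data(), kType, 4) == 0 &&
           std::memcmp(b.data() + 40, kSig, 4) == 0;
}
```

With the file names used later in this post, that would mean calling LooksLikeCompoundFile(ReadPrefix("input.pdf.bin", 8)) and LooksLikeEmf(ReadPrefix("output.emf", 44)). This only checks the headers, so it catches an empty or mixed-up file, not a subtly broken one.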

We are now ready to accomplish the rest of the steps. Here is how you add the appropriate image data and embedded object data to a Word file:

using (WordprocessingDocument myDoc = WordprocessingDocument.Open(output, true))
{
    MainDocumentPart mainPart = myDoc.MainDocumentPart;

    //Note that I created this emf file using my C++ solution
    ImagePart imagePart = mainPart.AddImagePart(ImagePartType.Emf);
    using (Stream emfStream = File.Open("output.emf", FileMode.Open))
    {
        // Dispose the stream once the part has copied the data
        imagePart.FeedData(emfStream);
    }

    EmbeddedObjectPart embeddedObjectPart =
        mainPart.AddEmbeddedObjectPart(@"application/vnd.openxmlformats-officedocument.oleObject");

    //Note that I created this bin file using my C++ solution
    using (Stream binStream = File.Open("input.pdf.bin", FileMode.Open))
    {
        embeddedObjectPart.FeedData(binStream);
    }

    ...
}

I should note that both the image and the embedded data were created using my C++ code that I showed you earlier in this post. The next step is to create a paragraph that represents our embedded object. Using the Document Reflector to help me out, I was able to create the following method:

static Paragraph CreateEmbeddedPDFParagraph(string imageId, string embedId, string progId)
{
    Paragraph p =
        new Paragraph(
            new Run(
                new EmbeddedObject(
                    new V.Shapetype(
                        new V.Stroke() { JoinStyle = V.StrokeJoinStyleValues.Miter },
                        new V.Formulas(
                            new V.Formula() { Equation = "if lineDrawn pixelLineWidth 0" },
                            new V.Formula() { Equation = "sum @0 1 0" },
                            new V.Formula() { Equation = "sum 0 0 @1" },
                            new V.Formula() { Equation = "prod @2 1 2" },
                            new V.Formula() { Equation = "prod @3 21600 pixelWidth" },
                            new V.Formula() { Equation = "prod @3 21600 pixelHeight" },
                            new V.Formula() { Equation = "sum @0 0 1" },
                            new V.Formula() { Equation = "prod @6 1 2" },
                            new V.Formula() { Equation = "prod @7 21600 pixelWidth" },
                            new V.Formula() { Equation = "sum @8 21600 0" },
                            new V.Formula() { Equation = "prod @7 21600 pixelHeight" },
                            new V.Formula() { Equation = "sum @10 21600 0" }),
                        new V.Path()
                        {
                            AllowGradientShape = V.BooleanValues.T,
                            ConnectionPointType = OVML.ConnectValues.Rectangle,
                            AllowExtrusion = V.BooleanValues.F
                        },
                        new OVML.Lock()
                        {
                            Extension = V.ExtensionHandlingBehaviorValues.Edit,
                            AspectRatio = OVML.BooleanValues.T
                        }
                    )
                    {
                        Id = "_x0000_t75",
                        CoordinateSize = "21600,21600",
                        Filled = V.BooleanValues.F,
                        Stroked = V.BooleanValues.F,
                        OptionalNumber = 75,
                        PreferRelative = V.BooleanValues.T,
                        EdgePath = "m@4@5l@4@11@9@11@9@5xe"
                    },
                    new V.Shape(
                        new V.ImageData() { Title = "", RelationshipId = imageId }
                    )
                    {
                        Id = "_x0000_i1025",
                        Style = "width:459pt;height:594pt",
                        Ole = V.BooleanEntryWithBlankValues.Empty,
                        Type = "#_x0000_t75"
                    },
                    new OVML.OleObject()
                    {
                        Type = OVML.OLEValues.Embed,
                        ProgId = progId,
                        ShapeId = "_x0000_i1025",
                        DrawAspect = OVML.OLEDrawAspectValues.Content,
                        ObjectId = "_1309181277",
                        Id = embedId
                    }
                )
                {
                    DxaOriginal = (UInt32Value)9180U,
                    DyaOriginal = (UInt32Value)11881U
                })
        );
    return p;
}
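A quick note on the numbers in that method, in case you need to adjust them for a different page size. As far as I can tell, DxaOriginal and DyaOriginal are expressed in twips (twentieths of a point), while the shape's Style uses points. The tiny sketch below, with a helper name I made up, shows the conversion so you can keep the two in sync.

```cpp
// Hypothetical helper: one point is 20 twips.
constexpr long PointsToTwips(long points) { return points * 20; }

// Width: 459pt in the shape style corresponds to DxaOriginal = 9180.
static_assert(PointsToTwips(459) == 9180, "459pt should be 9180 twips");

// Height: 594pt works out to 11880 twips, one less than the generated
// DyaOriginal value of 11881.
static_assert(PointsToTwips(594) == 11880, "594pt should be 11880 twips");
```

The one-twip difference in the height is presumably just rounding in the values Word wrote out; I have not tracked down exactly where it comes from.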

The last step of the solution is to swap out the content control for this newly created paragraph. Here is the code snippet to accomplish this task:

Paragraph p = CreateEmbeddedPDFParagraph(
    mainPart.GetIdOfPart(imagePart),
    mainPart.GetIdOfPart(embeddedObjectPart),
    "AcroExch.Document");

SdtBlock sdt = mainPart.Document.Descendants<SdtBlock>()
    .Where(s => s.GetFirstChild<SdtProperties>().GetFirstChild<Alias>().Val.Value
        .Equals("EmbedObject"))
    .First();

OpenXmlElement parent = sdt.Parent;
parent.InsertAfter(p, sdt);
sdt.Remove();
mainPart.Document.Save();

End Result

Running this code I should end up with a document that looks like the following:

Embed3

Upon activating the embedded object I will see the following:

Embed4

Let me know if you guys are interested in more solutions around embedded objects.

Zeyad Rajabi

Added video to blog post