Document Creation and Conversion with the OpenXML SDK and SharePoint 2010 Word Automation Services – Part 2

A long time ago, I wrote Part 1 of this post, based on the presentation I did at the 2010 SharePoint Conference in Sydney. If you're following along with the code, you may want to review that post so you can set up your development environment to match mine.

To quickly recap, the last post showed how to create Word (OpenXML, docx) documents programmatically and write them to disk using the OpenXML SDK (and therefore without the requirement for Word/Office on the machine creating the documents).

In this part, I'll extend the solution to write the documents to a document library in SharePoint and then use Word Automation Services to automatically convert the docx files to PDF format.

Setup

To follow along with this walkthrough without changes, set up your SharePoint instance (at https://localhost) as follows:

  • Create a document library called "Created Docs" in the root site
  • Create a document library called "Converted Docs" in the root site

If you use a remote server, a different site or different library names, you'll need to adjust the URI and path strings in the code below to match.
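One way to keep those adjustments in one place is to define the site URL and library names once and build every path from them. The sketch below is only an illustration: the SharePointPaths class and its members are hypothetical names that are not part of the downloadable solutions, and the values are the ones assumed in the setup above. It produces the two forms used later in this post, a server-relative path with literal spaces for uploading and a percent-encoded absolute URL for queuing a conversion.

```csharp
using System;

public static class SharePointPaths
{
    // Assumed values from the Setup section; change them to match your environment.
    public const string SiteUrl = "https://localhost";
    public const string CreatedLibrary = "Created Docs";
    public const string ConvertedLibrary = "Converted Docs";

    // Server-relative path with literal spaces, e.g. "/Created Docs/MyDoc.docx".
    public static string ServerRelativeUrl(string library, string fileName)
    {
        return "/" + library + "/" + fileName;
    }

    // Absolute URL with each segment percent-encoded,
    // e.g. "https://localhost/Converted%20Docs/MyDoc.pdf".
    public static string AbsoluteUrl(string library, string fileName)
    {
        return SiteUrl.TrimEnd('/') + "/" +
            Uri.EscapeDataString(library) + "/" +
            Uri.EscapeDataString(fileName);
    }

    public static void Main()
    {
        Console.WriteLine(ServerRelativeUrl(CreatedLibrary, "MyDoc.docx"));
        Console.WriteLine(AbsoluteUrl(ConvertedLibrary, "MyDoc.pdf"));
    }
}
```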

Writing to a SharePoint Library

Carrying on from last time, wire up the click event of the CreateOneDocumentOnSharePointButton:

Click event handler

    private void CreateOneDocumentOnSharePointButton_Click(object sender, EventArgs e)
    {
        gen.CreateOneDocumentOnSharePoint();
    }

Generate a stub for the CreateOneDocumentOnSharePoint() method in the DocGenerator class using the Ctrl+. technique made possible by the Visual Studio 2010 Productivity Power Tools.

Switch to the DocGenerator class and add a using statement to give you access to the SharePoint Client Libraries. Giving the namespace an alias will help disambiguate the File class later:

    using SPC = Microsoft.SharePoint.Client;

Using the SharePoint Client Libraries, it's very easy to write documents to a document library, and there's no need to write a document to a local drive. This means we'll use a different overload of the OpenXML SDK's WordprocessingDocument.Create() method that writes, not to a file, but to a MemoryStream.

CreateOneDocumentOnSharePoint

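    internal void CreateOneDocumentOnSharePoint()
    {
        SPC.ClientContext clientContext = new SPC.ClientContext("https://localhost");
        string fileUrl = "/Created Docs/MyVeryVeryCoolDoc.docx";

        sw.Reset();
        sw.Start();

        using (MemoryStream ms = gen.CreatePackage())
        {
            // Rewind so the upload reads the package from its first byte.
            ms.Seek(0, SeekOrigin.Begin);
            // The final argument (true) overwrites an existing file of the same name.
            SPC.File.SaveBinaryDirect(clientContext, fileUrl, ms, true);
        }

        sw.Stop();
        // Report sw.ElapsedMilliseconds here, as the parallel version later in this post does.
    }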

In this code, you create a new SharePoint.Client.ClientContext that gives access to the site (in this case at https://localhost, but if you've got things set up differently, change it here).

Create an overload of the CreatePackage() method in the DocumentCreator class that creates and populates a MemoryStream:

CreatePackage Overload

    internal MemoryStream CreatePackage()
    {
        MemoryStream ms = new MemoryStream();
        using (WordprocessingDocument package =
            WordprocessingDocument.Create(ms, WordprocessingDocumentType.Document))
        {
            CreateParts(package);
        }

        return ms;
    }

Back in CreateOneDocumentOnSharePoint(), move the pointer to the start of the MemoryStream and call the File.SaveBinaryDirect() method, passing in the ClientContext, a string indicating where the file should be written, the stream and a boolean that tells SharePoint whether or not to overwrite an existing file with the same name.
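The Seek() call is easy to forget, so here is a small sketch of why it matters. It is not part of the solution and uses only the base class library; StreamRewindDemo and BytesReadable are made-up names for illustration. After writing to a MemoryStream the position sits at the end, so anything that reads from the current position finds nothing until the stream is rewound.

```csharp
using System;
using System.IO;

public static class StreamRewindDemo
{
    // Counts how many bytes a consumer would read from the stream's current position.
    public static int BytesReadable(Stream s)
    {
        byte[] buffer = new byte[4096];
        int total = 0, n;
        while ((n = s.Read(buffer, 0, buffer.Length)) > 0)
        {
            total += n;
        }
        return total;
    }

    public static void Main()
    {
        using (MemoryStream ms = new MemoryStream())
        {
            byte[] payload = { 1, 2, 3, 4, 5 };
            ms.Write(payload, 0, payload.Length);

            int beforeRewind = BytesReadable(ms);  // position is at the end: 0 bytes
            ms.Seek(0, SeekOrigin.Begin);
            int afterRewind = BytesReadable(ms);   // position is at the start: 5 bytes

            Console.WriteLine("{0} {1}", beforeRewind, afterRewind);  // prints "0 5"
        }
    }
}
```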

Running the app and clicking the One document in SharePoint button shows that it's very fast - in my case 102 ms.

Writing 1 document to SharePoint took 0.1 seconds

Writing lots of documents is fast too - add an event handler to the CreateManyDocumentsOnSharePointButton:

Click Event Handler

    private void CreateManyDocumentsOnSharePointButton_Click(object sender, EventArgs e)
    {
        gen.CreateManyDocumentsOnSharePointInParallel((int)NumberOfDocumentsToCreate.Value);
    }

Then add a CreateManyDocumentsOnSharePointInParallel() method that uses a Parallel.For() loop to call CreatePackage() and File.SaveBinaryDirect() once for each document requested:

Create lots of SharePoint Docs

    internal void CreateManyDocumentsOnSharePointInParallel(int NumberOfDocs)
    {
        SPC.ClientContext clientContext = new SPC.ClientContext("https://localhost");
        string fileUrl = "/Created Docs/MyEvenCoolerDoc{0:D5}.docx";

        sw.Reset();
        sw.Start();

        Parallel.For(0, NumberOfDocs, i =>
        {
            using (MemoryStream ms = gen.CreatePackage())
            {
                ms.Seek(0, SeekOrigin.Begin);
                SPC.File.SaveBinaryDirect(clientContext, string.Format(fileUrl, i), ms, true);
            }
        });

        sw.Stop();

        System.Windows.Forms.MessageBox.Show(string.Format(
            "Wrote {3} documents to SharePoint ({1}{2}) in {0} ms (using parallel processing)",
            sw.ElapsedMilliseconds,
            clientContext.Url,
            fileUrl,
            NumberOfDocs));
    }

This is also pretty fast - in my case 40ms per document.

Writing 100 documents to SharePoint (in parallel) took 4 seconds

Navigating to the document library shows all those documents sitting just where you'd expect to see them:

Cool documents created en-masse
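If the {0:D5} placeholder in fileUrl is unfamiliar, the following sketch shows what it does. It leaves SharePoint out entirely and just builds the names inside a Parallel.For() loop; FileNameDemo and BuildFileUrls are hypothetical names used only for this illustration. D5 pads the loop index with leading zeros to five digits, so each iteration gets its own file name and the documents sort in creation order in the library.

```csharp
using System;
using System.Threading.Tasks;

public static class FileNameDemo
{
    // Builds one URL per index using the same format pattern as the upload code.
    public static string[] BuildFileUrls(string pattern, int count)
    {
        string[] urls = new string[count];
        // Each iteration writes only its own slot, so no locking is needed.
        Parallel.For(0, count, i =>
        {
            urls[i] = string.Format(pattern, i);
        });
        return urls;
    }

    public static void Main()
    {
        string[] urls = BuildFileUrls("/Created Docs/MyEvenCoolerDoc{0:D5}.docx", 100);
        Console.WriteLine(urls[0]);   // /Created Docs/MyEvenCoolerDoc00000.docx
        Console.WriteLine(urls[99]);  // /Created Docs/MyEvenCoolerDoc00099.docx
    }
}
```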

Converting Word Documents to a Fixed Format (PDF or XPS)

Up until now, we've not had to use Word (or any other Office client) as all we've been doing is generating documents, not rendering them. Just like you can create an HTML document without requiring a browser, it's perfectly valid to create a Word document (or any other OpenXML format document) without using Word.

However, to view the document, or to create a fixed version of it like PDF or XPS, it's necessary to render it. Up until the release of SharePoint 2010, the highest fidelity way to do this was to open the document in Word. Of course, doing that on the server was fraught with difficulty. Word is not designed to be a server-side tool - it throws (sometimes modal) dialogs, it spends a lot of resources on updating the screen and it's not optimised for multi-processor, large memory scenarios. When there is a user interacting with Word though, the bottleneck is rarely the computer.

The SharePoint team addressed this problem with the Word Automation Services feature in SharePoint 2010 (standard edition and higher). Word Automation Services is the client code from Word with the UI bits stripped out and optimised to run as a server process. All of the rendering engine is available for SharePoint to use without any of the issues (both technical and from a licensing point of view) of using Word on a server. There's lots of great info on Word Automation Services on MSDN and elsewhere; the first post in this series includes a list of resources.

Word Automation Services (WAS) document conversion jobs run as asynchronous server-side jobs that can be started either automatically (for example, when a document is placed in a folder) or programmatically. Either way, a job won't start immediately; it starts the next time the WAS scheduler runs. How often the scheduler runs is set in Central Administration - see the resources listed in Part 1 for details on how to set it up. I set it to the minimum interval - one minute.

Interacting programmatically with the service is pretty straightforward, but there are two gotchas:

  1. the .NET libraries are 3.5 only, so the project you create must be a .NET 3.5 project, and
  2. the calls will fail (with cryptic exceptions) if they're not made from a 64-bit process, so you must target either x64 or Any CPU, not x86.
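Because the second gotcha only shows up at run time, it can be worth checking for it before touching the service. This is a minimal sketch of such a guard, not something from the original solution; EnvironmentCheck is a hypothetical name. It relies on IntPtr.Size because Environment.Is64BitProcess was only added in .NET 4.0 and so isn't available to a 3.5 project.

```csharp
using System;

public static class EnvironmentCheck
{
    // IntPtr.Size is 8 in a 64-bit process and 4 in a 32-bit one.
    public static bool Is64BitProcess()
    {
        return IntPtr.Size == 8;
    }

    public static void Main()
    {
        // A .NET 3.5 project runs on CLR 2.0, so expect a 2.0.x version here.
        Console.WriteLine("CLR version: {0}", Environment.Version);
        Console.WriteLine(Is64BitProcess()
            ? "64-bit process."
            : "32-bit process: rebuild for x64 or Any CPU before calling the service.");
    }
}
```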

Create a new console application and make sure that the target framework is 3.5.

Create a new console application targeting Framework 3.5

Open the Visual Studio Configuration Manager dialog by dropping down the Solution Configurations drop-down on the Visual Studio Standard toolbar (or choosing Configuration Manager from the Build menu):

Choose Configuration Manager

Next, add a Solution Platform to target Any CPU (or x64).

Now you're ready to start building.

Converting a single document to PDF

Add references to the Microsoft.SharePoint and Microsoft.Office.Word.Server assemblies.

Add using statements for those assemblies:

    using Microsoft.SharePoint;
    using Microsoft.Office.Word.Server.Conversions;

Add a couple of static string fields to the class that you can adjust to suit your SharePoint configuration:

    // If you manually installed Word Automation Services, then replace the name
    // in the following line with the name that you assigned to the service when
    // you installed it.
    static string cWordServicesName = "Word Automation Services";
    static string siteUrl = "https://localhost";

Now you can initiate the conversion of a single document:

Convert a single document

    private static void SingleConv()
    {
        using (SPSite spSite = new SPSite(siteUrl))
        {
            ConversionJob job = new ConversionJob(cWordServicesName);
            job.UserToken = spSite.UserToken;
            job.Settings.UpdateFields = true;
            job.Settings.OutputFormat = SaveFormat.PDF;
            // The source document must already exist in Created Docs;
            // change the file names to match a document you have created.
            job.AddFile(siteUrl + "/Created%20Docs/MyAwesomeDoc.docx",
                siteUrl + "/Converted%20Docs/MyAwesomeDoc.pdf");
            // Start() only queues the job; conversion happens on the next scheduler run.
            job.Start();
            Console.WriteLine("Job ID: {0} started", job.JobId);
            Console.WriteLine("Press the any key ...");
            Console.ReadKey();
        }
    }

There are a few things to note here.

Firstly, you get a reference to the Site using the SharePoint libraries, not the SharePoint Client libraries that we used to write the Word docs to the list in the first place.

Next, you need to pass a user token to the new ConversionJob, and you get that from the SPSite user token.

Third, you specify the output format using the SaveFormat enumeration.

Finally, remember that the conversion runs asynchronously, so although you get a Job ID back, you don't get any more information about the job status (more on that when we do bulk conversions).

Converting documents to PDF en-masse

Converting whole libraries at once is also very easy. The ConversionJob class has an AddLibrary() method that takes as parameters a source and destination SPList object.

Converting whole libraries

    private static void BulkConv()
    {
        using (SPSite spSite = new SPSite(siteUrl))
        {
            Console.WriteLine("Starting conversion job");
            ConversionJob job = new ConversionJob(cWordServicesName);
            job.UserToken = spSite.UserToken;
            job.Settings.UpdateFields = true;
            job.Settings.OutputFormat = SaveFormat.PDF;
            job.Settings.OutputSaveBehavior = SaveBehavior.AlwaysOverwrite;
            SPList listToConvert = spSite.RootWeb.Lists["Created Docs"];
            SPList listToPopulate = spSite.RootWeb.Lists["Converted Docs"];
            job.AddLibrary(listToConvert, listToPopulate);
            job.Start();
            Console.WriteLine("Bulk conversion job {0} started", job.JobId);
            ConversionJobStatus status = new ConversionJobStatus(cWordServicesName,
                job.JobId, null);
            Console.WriteLine("Number of documents in conversion job: {0}", status.Count);
            while (true)
            {
                System.Threading.Thread.Sleep(5000);

                status.Refresh();
                if (status.Count == status.Succeeded + status.Failed)
                {
                    Console.WriteLine("{2} Completed, Successful: {0}, Failed: {1}",
                        status.Succeeded, status.Failed, DateTime.Now);
                    break;
                }
                Console.WriteLine("{2} In progress, Successful: {0}, Failed: {1}",
                    status.Succeeded, status.Failed, DateTime.Now);
            }

            Console.ReadKey();
        }
    }

Checking the status of the job is straightforward (as long as you have the JobId - a GUID uniquely identifying this conversion job). The ConversionJobStatus object holds information about the conversion job including how many documents are to be converted, how many have been converted successfully and how many have failed. Calling the Refresh() method gets the most up-to-date status and you can use that to poll for completion. Remember that jobs only start every n minutes, where n is the scheduler interval set in SharePoint Central Administration.

Converting documents is an asynchronous process
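The while (true) loop in BulkConv() waits forever if the job never finishes, which is fine for a demo but not much fun otherwise. The sketch below shows one way to put a time limit on that kind of polling. It is not part of the downloadable solution: Polling and PollUntil are hypothetical names, and the stand-in condition in Main() simply pretends to finish on the third check so the example runs without SharePoint.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class Polling
{
    // Calls isDone every 'interval' until it returns true or 'timeout' elapses.
    // Returns true if the condition was met, false if the time limit was reached.
    public static bool PollUntil(Func<bool> isDone, TimeSpan interval, TimeSpan timeout)
    {
        Stopwatch elapsed = Stopwatch.StartNew();
        while (true)
        {
            if (isDone()) return true;
            if (elapsed.Elapsed >= timeout) return false;
            Thread.Sleep(interval);
        }
    }

    public static void Main()
    {
        int checks = 0;
        // Stand-in condition: "finishes" on the third check.
        bool finished = PollUntil(() => ++checks >= 3,
            TimeSpan.FromMilliseconds(10), TimeSpan.FromSeconds(5));
        Console.WriteLine("Finished: {0} after {1} checks", finished, checks);
        // prints "Finished: True after 3 checks"
    }
}
```

With the status object from BulkConv(), the condition would call status.Refresh() and then compare status.Count with status.Succeeded + status.Failed. Note that the timeout only stops the waiting; it does not cancel the conversion job.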

The result is a SharePoint list full of PDF files, created without ever needing to open Word.

A library full of converted PDFs

A converted PDF in Adobe Reader

Conclusion

The combination of the OpenXML SDK and Word Automation Services makes server-side document creation simple, scalable and efficient. This is definitely a tool worth adding to your arsenal.

Source Code

I've zipped up the two solutions - the document creation (.NET 4.0) WinForms project and the document conversion (.NET 3.5) project - for you to download and play with. Note that they are NOT production ready - they're illustrative only. Use them at your peril, your mileage may vary, contents may be hot, no guarantees, etc. … you know the drill.

Document Creation Solution Download (241kB)
Document Conversion Solution Download (115kB)