PowerTools for Open XML Technical Overview

Open XML PowerTools

Design Goals

Processing Open XML documents using PowerShell is a powerful and compelling approach to creating and modifying Open XML documents on a server.

The Open XML PowerTools were designed to make it easy to create and modify documents.  The commands (or cmdlets) fall into five main categories:

·         Get information: Get-OpenXmlBackground, Get-OpenXmlHeader, etc.

·         Set information: Set-OpenXmlBackground, Set-OpenXmlHeader, etc.

·         Modify document: Add-OpenXmlDigitalSignature, Lock-OpenXmlDocument, Remove-OpenXmlComment, etc.

·         Create document: Export-OpenXmlSpreadsheet and Export-OpenXmlWordprocessing

·         Transform: Export-OpenXmlToHtml

In order to make these types of commands work properly within the PowerShell environment, the following objectives were set:

1.       You can chain together the “Set” and “Modify” commands using the pipe (|) capability of PowerShell.

2.       You can use the “Create” cmdlets as the first command in the chain.

3.       Command chains should be able to work on any number of documents.

4.       The “Get” command should be able to pipe to the “Set” command of the same or similar type. In this case, the Get command will only operate on a single document, but the Set could operate on any number of documents.

Examples

An example of command chaining:

Accept-OpenXmlChange -Path *.docx -PassThru | Add-OpenXmlDigitalSignature -Certificate me.pfx -PassThru | Set-OpenXmlHeader -HeaderType First -HeaderPath first.xml -PassThru | Lock-OpenXmlDocument

An example of Get/Set pairing:

Get-OpenXmlStyle -Path origin.docx | Set-OpenXmlStyle -Path new_*.docx

There are other examples of simple pairing that are not Get/Set, but the concept is the same. For example:

Get-OpenXmlBackground -Path origin.docx -Image | Add-OpenXmlPicture -Path new.docx -InsertionPoint '/w:document/w:body/w:p/w:r[1]'
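To make the Get/Set pairing concrete, here is a minimal sketch in Python rather than PowerShell.  It is not the PowerTools code: the in-memory "disk" and the get_style and set_style functions are hypothetical stand-ins.  It only illustrates that the "Get" side reads one document while the "Set" side applies the piped value to every document matching a wildcard.

```python
import fnmatch
from typing import Dict, List

# Hypothetical in-memory "disk": file name -> style name (an assumption for illustration)
files: Dict[str, str] = {
    "origin.docx": "Corporate",
    "new_a.docx": "Default",
    "new_b.docx": "Default",
    "other.docx": "Default",
}


def get_style(path: str) -> str:
    """'Get' side: reads from exactly one document."""
    return files[path]


def set_style(style: str, pattern: str) -> List[str]:
    """'Set' side: applies the piped value to every document matching the pattern."""
    targets = sorted(fnmatch.filter(files, pattern))
    for name in targets:
        files[name] = style
    return targets


# Roughly: Get-OpenXmlStyle -Path origin.docx | Set-OpenXmlStyle -Path new_*.docx
updated = set_style(get_style("origin.docx"), "new_*.docx")
print(updated)
```

Documents that do not match the wildcard (other.docx here) are left untouched.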

Classes

The Open XML PowerTools use just a few classes to reach these design goals.

DocumentFormat.OpenXml.Packaging.OpenXmlPackage: This class is the base class for WordprocessingDocument, SpreadsheetDocument and PresentationDocument.  These are the classes used to chain commands together.  It is important to realize that PowerShell pipes use objects, rather than text, for communication between commands.  As a result, these objects can have a current state, in addition to providing information through their properties.  These document objects are created whenever a “-Path” parameter is specified.  As a result, each object represents an open document connected to a particular file on the disk, very much as if you had opened that document in an editor application.  These documents could be opened for editing or for viewing only and must be closed when work is done.

The Open XML PowerTools commands maintain good control over these document objects so that explicit calls to open or close documents are generally not necessary.  Here are the rules for these objects:

1.       “Get” commands always open the documents as read-only and close them when processing is complete.

2.       “Set” and “Modify” commands open the documents for read/write if the “-Path” parameter is specified.  If the “-PassThru” parameter is used, then the documents are left open and the objects are written out for piping; otherwise the documents are closed when processing is complete.

3.       “Create” commands create a new document file and the object associated with it.  If the “-PassThru” parameter is used, then the document is left open and the object is written out for piping; otherwise the document is closed when creation is complete.

In other words, each document is opened and closed within a single command unless “-PassThru” passes it to the next command, which then becomes responsible for closing it or passing it on.  Because the object is passed from command to command, it stays associated with its file, so each command works on the same document that the previous command modified.
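The ownership rule can be sketched as follows.  This is a simplified model in Python, not the actual cmdlet code; the Document class and the modify_command function are hypothetical, and the action strings merely stand in for real modifications.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class Document:
    """Hypothetical stand-in for an open document bound to a file on disk."""
    path: str
    is_open: bool = True
    log: List[str] = field(default_factory=list)

    def close(self) -> None:
        self.is_open = False


def modify_command(docs: Iterable[Document], action: str,
                   pass_thru: bool = False) -> Iterator[Document]:
    """Apply `action` to each document; close it unless pass_thru is set."""
    for doc in docs:
        doc.log.append(action)   # stands in for the real modification
        if pass_thru:
            yield doc            # left open; the next command now owns it
        else:
            doc.close()          # last command in the chain closes the file


docs = [Document("a.docx"), Document("b.docx")]
stage1 = modify_command(docs, "accept-changes", pass_thru=True)
stage2 = modify_command(stage1, "set-header", pass_thru=True)
remaining = list(modify_command(stage2, "lock"))  # no pass_thru: nothing is emitted

print(remaining)
print([d.is_open for d in docs])
print(docs[0].log)
```

Each document flows through all three stages as the same object, and only the final stage, which does not pass it on, closes it.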

System.Drawing.Color: This object is used to pipe together the output from Get-OpenXmlBackground to Set-OpenXmlBackground when the “-Color” parameter is used.  Using this object makes it possible for the Set-OpenXmlBackground command to accept three different possible objects as input from the pipe.  (The other two are System.IO.FileInfo and the document objects discussed above.)  A string could have been used, but then it would have been ambiguous whether the string named a color or a file, so this specific object was used instead.  In addition, this object provides a number of properties and methods that could be useful in the PowerShell environment in cases where the color is not just being passed to a “Set” command.

System.IO.FileInfo: This object is used to pipe together the output from Get-OpenXmlBackground to Set-OpenXmlBackground when the “-Image” parameter is used.  Images can only be passed as file-based objects, so the Get-OpenXmlBackground command must create a file that contains the image.  The Set-OpenXmlBackground command then reads the image from that file and closes the file when processing is complete.  The FileInfo object was chosen to represent the file because it is the same object that is output by built-in commands, like Get-ChildItem (alias DIR).

XDocument: An XDocument is a simpler object that represents a block of XML.  These objects are not associated with files in any way, so they do not need to be opened or closed.  Commands that accept an XDocument as input from the pipe will also support reading XML from a file, but that file can never be written to the output and remain open.  The header and footer commands are examples of commands that use this object.

String: A few commands, like Get-OpenXmlWatermark, use this object.

Summary

The chaining goals were met by careful use of the document objects for piping.  The Get/Set pairing goal was met by using specific objects to pipe the two commands together.  Notice that there could have been a conflict between these two goals for the “Set” commands since they need to support both document and specific information objects as input.  However, PowerTools is designed to support multiple input object types with ease. For example, look at Set-OpenXmlBackground:

·         When it receives a document object as input, it knows to manipulate that document, and the “-Path” parameter should not be used.  Instead, the -ImagePath or -ColorName parameters define what should be set on the document or documents passed to it.

·         When it receives a Color object, it knows to set the color on the document or documents specified by the “-Path” parameter (and then they may be passed on if -PassThru is specified).

·         When it receives a FileInfo object, it knows to read the image from that file and set it on the document or documents specified by the “-Path” parameter.

With these relatively simple rules, all of the design goals are met in an intuitive way.
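The three cases for Set-OpenXmlBackground amount to dispatching on the type of the piped object.  A rough sketch of that idea, in Python and with hypothetical stand-in types (Color for System.Drawing.Color, pathlib.Path for FileInfo, Document for the document objects), might look like this.  It is not how the cmdlet itself is written.

```python
from dataclasses import dataclass
from functools import singledispatch
from pathlib import Path
from typing import List, Optional


@dataclass(frozen=True)
class Color:
    """Stand-in for System.Drawing.Color."""
    name: str


@dataclass
class Document:
    """Stand-in for an open document object."""
    path: str
    background: Optional[str] = None


@singledispatch
def set_background(piped, **params) -> List[Document]:
    # Plain strings end up here: a string could mean a color or a file name.
    raise TypeError(f"unsupported pipeline input: {type(piped).__name__}")


@set_background.register
def _(piped: Document, *, color_name: str) -> List[Document]:
    # A document arrived from the pipe, so the value comes from a parameter.
    piped.background = color_name
    return [piped]


@set_background.register
def _(piped: Color, *, targets: List[Document]) -> List[Document]:
    # A color arrived from the pipe, so the documents come from a parameter.
    for doc in targets:
        doc.background = piped.name
    return targets


@set_background.register
def _(piped: Path, *, targets: List[Document]) -> List[Document]:
    # A file arrived from the pipe: treat it as the image to apply.
    for doc in targets:
        doc.background = f"image:{piped.name}"
    return targets


doc_a, doc_b = Document("a.docx"), Document("b.docx")
set_background(doc_a, color_name="Red")
set_background(Color("Blue"), targets=[doc_b])
print(doc_a.background, doc_b.background)
set_background(Path("logo.png"), targets=[doc_a])
print(doc_a.background)
```

Because each input type selects its own handler, a piped color can never be mistaken for a file name, which is the ambiguity a plain string would have introduced.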

Implementation Details

One of the strengths of PowerTools is that the commands can be implemented in a very straightforward way that reflects the design goals.  The ability to pass full objects through the pipe simplifies the code greatly.

OpenXmlCmdlet Class

This base class supports the Path parameter that is used to specify document files and the “Document” parameter that represents documents passed as input through the pipe.  The base class essentially converts the files specified by the path into equivalent document objects so that the specific cmdlets can be implemented independently of these two methods of specifying documents.  The base class also handles creating backup files automatically, since that can only be done when the Path parameter is used.  It also calls ShouldProcess for any document that is being opened for modification.  The method that handles all this functionality is AllDocuments, which is called in the main processing loop of each specific cmdlet.
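As an illustration of what a method like AllDocuments has to do, here is a small Python model.  The class and member names are hypothetical, backups are only recorded rather than copied, and the real base class may order or structure these steps differently.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Sequence


@dataclass
class Doc:
    """Hypothetical stand-in for an open document object."""
    path: str


class CmdletBase:
    """Hypothetical analogue of the OpenXmlCmdlet base class."""

    def __init__(self,
                 paths: Sequence[str] = (),
                 documents: Sequence[Doc] = (),
                 suppress_backups: bool = False,
                 should_process: Callable[[str], bool] = lambda path: True) -> None:
        self.paths = list(paths)
        self.documents = list(documents)
        self.suppress_backups = suppress_backups
        self.should_process = should_process
        self.backups: List[str] = []

    def all_documents(self) -> Iterator[Doc]:
        # Documents piped in are already open: hand them over unchanged.
        yield from self.documents
        # Files named by path are confirmed, backed up, then "opened".
        for path in self.paths:
            if not self.should_process(path):
                continue
            if not self.suppress_backups:
                self.backups.append(path + ".bak")  # records the backup only
            yield Doc(path)


cmd = CmdletBase(paths=["a.docx", "b.docx"],
                 documents=[Doc("piped.docx")],
                 should_process=lambda path: path != "b.docx")
opened = [doc.path for doc in cmd.all_documents()]
print(opened)
print(cmd.backups)
```

A specific cmdlet then only needs to loop over all_documents() and never has to care whether a document came from a path or from the pipe.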

OpenXmlDocument Class

This is the base class for each of the specific document objects (WordprocessingDocument and SpreadsheetDocument).  It contains a few static functions to support “virtual” construction of the proper specific object.  It also contains methods for each of the operations that may be performed by specific cmdlets.  If the method can be supported for both Wordprocessing and Spreadsheet documents, then it is implemented in the base class.  Otherwise, a virtual function is used to allow specific implementation for each document type.
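The combination of a static factory with shared and type-specific methods can be sketched like this.  The Python class names and the describe and main_part methods are invented for illustration; only the part names word/document.xml and xl/workbook.xml reflect the conventional layout of Open XML packages.

```python
from abc import ABC, abstractmethod


class OpenXmlDoc(ABC):
    """Hypothetical analogue of the OpenXmlDocument base class."""

    def __init__(self, path: str) -> None:
        self.path = path

    @staticmethod
    def from_path(path: str) -> "OpenXmlDoc":
        """'Virtual' construction: pick the subclass from the file extension."""
        extension = path.rsplit(".", 1)[-1].lower()
        if extension == "docx":
            return WordDoc(path)
        if extension == "xlsx":
            return SheetDoc(path)
        raise ValueError(f"unsupported document type: {path}")

    def describe(self) -> str:
        # Works for both document types, so it lives in the base class.
        return f"{self.path} -> {self.main_part()}"

    @abstractmethod
    def main_part(self) -> str:
        """Differs per document type, so each subclass overrides it."""


class WordDoc(OpenXmlDoc):
    def main_part(self) -> str:
        return "word/document.xml"


class SheetDoc(OpenXmlDoc):
    def main_part(self) -> str:
        return "xl/workbook.xml"


for name in ("report.docx", "data.xlsx"):
    print(OpenXmlDoc.from_path(name).describe())
```

A cmdlet written against the base class can call describe() without knowing which kind of document it received.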

“Accessor” Classes

These classes are used to retrieve information from a specific part of a document or to modify the XML in a specific part of a document.  For example, the StyleAccessor class can get or replace the entire style library or add specific styles to the library.  It also contains functions to update the styles needed for the Index and Table cmdlets.  These classes contain no member variables other than the document itself so that there can never be a problem of synchronizing the state of the object with the current values in the document.
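The value of keeping no state besides the document can be seen in a small Python model.  The dictionary-based document and the method names here are hypothetical, not the real StyleAccessor interface.

```python
from typing import Dict, List


class StyleAccessor:
    """Hypothetical accessor: its only member is the document itself."""

    def __init__(self, document: Dict[str, List[str]]) -> None:
        self._document = document

    def styles(self) -> List[str]:
        # Always read from the document; nothing is cached on the accessor.
        return list(self._document["styles"])

    def add_style(self, name: str) -> None:
        if name not in self._document["styles"]:
            self._document["styles"].append(name)

    def replace_all(self, names: List[str]) -> None:
        self._document["styles"] = list(names)


document = {"styles": ["Normal"]}
first, second = StyleAccessor(document), StyleAccessor(document)
first.add_style("Heading1")
second.replace_all(["Title"])
print(first.styles())
```

Because neither accessor holds its own copy of the style list, a change made through one is immediately visible through the other.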

DocumentManager Class

This is the base class for the WordprocessingDocumentManager and the SpreadsheetDocumentManager classes.  Any functionality that doesn’t fall into a specific Accessor class, or that can be applied to either document type, falls into these classes.  For example, the method to transform a wordprocessing document into HTML is in the WordprocessingDocumentManager class.

Cmdlet Classes

Each cmdlet class follows a very specific, and relatively simple, structure.

1.       The cmdlet verb and name are specified. If the cmdlet can change the document, SupportsShouldProcess is set to true.

2.       The cmdlet class is defined with OpenXmlCmdlet as the base class.

3.       Any member variables are defined.  These should match parameters for the cmdlet exactly, or nearly so.  Member variables for documents and the SuppressBackups parameter are already implemented in the base class.

4.       Methods to support each parameter are defined and implemented to set the appropriate member variables.  If the cmdlet modifies documents, then SuppressBackups must be defined to set the member variable in the base class.  The “Set” cmdlets generally must accept the value to be set either as a parameter or as an equivalent object that is input through the pipe.  If a parameter takes its value from a file, then the name of the parameter should have Path at the end of it.

5.       The ProcessRecord method is overridden to implement the cmdlet functionality.  This function contains the primary error handling implemented with try-catch statements.  The main loop is generally a foreach loop on the base class method AllDocuments.  If the document is modified, then FlushParts should be called after processing and then the PassThru option should be used to determine if the object is closed or written to the output. If it is not modified, then the document is simply closed.
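The shape of the processing loop described in step 5 can be modeled roughly as below.  This is a Python sketch with hypothetical names (FakeDoc, process_record); a real cmdlet overrides ProcessRecord in .NET and reports errors through the PowerShell host rather than a list.

```python
from typing import Iterable, List


class FakeDoc:
    """Hypothetical document that records what was done to it."""

    def __init__(self, path: str, broken: bool = False) -> None:
        self.path = path
        self.broken = broken
        self.events: List[str] = []

    def flush_parts(self) -> None:
        self.events.append("flush")

    def close(self) -> None:
        self.events.append("close")


def process_record(docs: Iterable[FakeDoc], pass_thru: bool,
                   output: List[FakeDoc], errors: List[str]) -> None:
    try:
        for doc in docs:                 # stands in for the AllDocuments loop
            if doc.broken:
                raise ValueError(f"cannot modify {doc.path}")
            doc.events.append("modify")  # the cmdlet-specific work
            doc.flush_parts()            # flush after modification
            if pass_thru:
                output.append(doc)       # left open for the next command
            else:
                doc.close()
    except ValueError as exc:
        errors.append(str(exc))          # primary error handling lives here


good, bad = FakeDoc("a.docx"), FakeDoc("b.docx", broken=True)
output: List[FakeDoc] = []
errors: List[str] = []
process_record([good, bad], pass_thru=False, output=output, errors=errors)
print(good.events)
print(errors)
```

The cmdlet-specific work is a single line inside the loop; everything around it is the same for every cmdlet.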

As you will see when you look at the code, the individual cmdlets are very short.  This keeps the error handling and the parameters of the cmdlet separate from the processing done by the cmdlet.  It also allows the processing to vary for different document types since the cmdlet calls a method defined in the base class of the document.