Open XML File Formats: What is it, and how can I get started?

While I was at Tech Ed, a lot of people asked how to generate documents programmatically without Interop. Some of the business scenarios involved generating more than 5,000 documents, and some IT professionals wanted to know the best option. A great option for this business need is the Open XML File Formats.

Some people have been following the news and are ahead of most of us, already building solutions that generate documents using the Open XML File Formats. Others are not yet familiar with this technology and want to learn more, so here is a quick introduction: what it is, and how you can get started. I have to warn you that this is going to be a long blog entry, but I promise it's worth reading.

What is it?

The new formats improve file and data management, data recovery, and interoperability with line-of-business systems. They extend what is possible with the binary files of earlier versions. Any application that supports XML can access and work with data in the new file format. The application does not need to be part of the Microsoft Office system or even a Microsoft product. Users can also use standard transformations to extract or repurpose the data. In addition, security concerns are drastically reduced because the information is stored in XML, which is essentially plain text. Thus, the data can pass through corporate firewalls without hindrance.

The new Open XML File Formats take advantage of the Open Packaging Conventions, which describe the method for packaging information in a file format and describe metadata, parts, and relationships. The new Open XML Format, with a few minor exceptions, is written entirely in XML and is contained in a .zip file. This creates significant advantages over the old binary file format:

  • The file size is much smaller because of ZIP compression.
  • The file is much more robust because it is broken up into different document parts. Should one part become damaged (for example, a part describing headers), the rest of the document remains intact and still opens successfully.
  • The file is easier to work with programmatically because of the new structure. For example, it is easier to access embedded content, such as images, because they are stored in their native format inside the file.
  • Custom XML is also easier to work with because it is stored in its own part, separate from the XML that describes the bulk of a document.
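To make the point about embedded content concrete, here is a minimal Python sketch that pulls images out of a package with nothing but the standard zipfile module. It builds a tiny stand-in package in memory (not a complete, valid Word document) so it can run on its own. The helper names and the word/media folder are assumptions for illustration: that folder is where Word stores images by default, but a package is free to put them elsewhere.

```python
import io
import zipfile

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def build_sample_package() -> bytes:
    """Build a tiny stand-in package in memory (not a complete .docx)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as pkg:
        pkg.writestr("word/document.xml", "<document/>")
        # Images are stored as-is, in their native format.
        pkg.writestr("word/media/image1.png", PNG_SIGNATURE + b"fake image data")
    return buffer.getvalue()


def extract_media(package_bytes: bytes) -> dict:
    """Return {part name: raw bytes} for every part under a media folder."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as pkg:
        return {
            name: pkg.read(name)
            for name in pkg.namelist()
            if "/media/" in name
        }


media = extract_media(build_sample_package())
for name, data in media.items():
    # The bytes can be written straight to disk as a normal image file.
    print(name, len(data), data.startswith(PNG_SIGNATURE))
```

With a real document you would read the file's bytes from disk instead of calling build_sample_package.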

The old binary file format was created when priorities in software differed from the priorities of today. Back then, the ability to transfer a Word document from computer to computer using a floppy disc ranked very high, and the tight structure of a binary format worked well. As software advanced, other priorities became clear, such as the ability to write code against a file format and make it as robust as possible. XML is a clear solution.

Microsoft began to address this issue in previous versions of Microsoft Office by introducing SpreadsheetML and WordprocessingML. However, only now, with the 2007 release of Microsoft Office, have the goals that were conceived as far back as 1999 been accomplished fully. By including the XML File Format inside a ZIP container, the benefit of a small compressed file format is also realized. Excel 2007 and PowerPoint 2007 share this new file format technology, described by the Open Packaging Conventions. Together, the shared formats are called the Microsoft Office Open XML Formats. The new Word 2007 XML Format is the default file format, although the old binary file format is still available in the 2007 Microsoft Office system.

An easy way to look inside the new file format is to save a Word 2007 document in the new default format and then rename the file with a .zip extension. By double-clicking the renamed file, you can open and look at its contents. Inside the file, you can see the document parts that make up the file, along with the relationships that describe how the parts interact with one another. However, it is important to note that, with a few exceptions defined within the Open Packaging Conventions, the actual file directory structure is arbitrary. The relationships of the files within the package, not the file structure, are what determine file validity. You can rearrange and rename the parts of a Word 2007 file inside its .zip container if you update the relationships properly so that the document parts continue to relate to one another as designed. If the relationships are accurate, the file opens without error. The initial file structure in a Word 2007 file is simply the default structure created by Word. This default structure enables developers to determine the composition of Word 2007 files easily.

Contents of a sample document in a ZIP file
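The idea that relationships, not folder names, locate the parts can be sketched in a few lines of Python. The example below builds two tiny in-memory packages (stand-ins, not complete Word documents): one uses Word's default part name and the other a renamed part. In both cases the main part is found by reading the package-level relationships in _rels/.rels. The namespace and relationship type URIs are the ones published in the final Open XML standard, which I am assuming here; files written by early betas may use different URIs. The function names are mine.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
MAIN_DOC_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/"
    "relationships/officeDocument"
)


def build_sample_package(main_part: str) -> bytes:
    """Build a stand-in package whose main part lives at main_part."""
    rels = (
        f'<Relationships xmlns="{PKG_REL_NS}">'
        f'<Relationship Id="rId1" Type="{MAIN_DOC_TYPE}" Target="{main_part}"/>'
        "</Relationships>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as pkg:
        pkg.writestr("_rels/.rels", rels)
        pkg.writestr(main_part, "<document/>")
    return buffer.getvalue()


def find_main_part(package_bytes: bytes) -> str:
    """Locate the main document part by relationship type, not by path."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as pkg:
        root = ET.fromstring(pkg.read("_rels/.rels"))
        for rel in root.iter(f"{{{PKG_REL_NS}}}Relationship"):
            if rel.get("Type") == MAIN_DOC_TYPE:
                return rel.get("Target").lstrip("/")
    raise ValueError("no main document relationship found")


default_layout = find_main_part(build_sample_package("word/document.xml"))
renamed_layout = find_main_part(build_sample_package("content/main.xml"))
print(default_layout)
print(renamed_layout)
```

Code that follows the relationships this way keeps working when parts are moved or renamed, whereas code that hard-codes word/document.xml does not.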

How can I get started?

The easiest way to modify a Word 2007 XML file programmatically is to use the classes in the System.IO.Packaging namespace, available in the Microsoft® Windows® Software Development Kit (SDK) for Beta 2 of Windows Vista and WinFX Runtime Components. Using this technology, you can easily update header and footer files programmatically across numerous Word 2007 documents stored on a server.
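If you do not have the .NET packaging classes at hand, the same idea can be tried out with any ZIP library. The following is a rough Python sketch, not the System.IO.Packaging approach itself: it swaps the header part in several in-memory stand-in documents. The part name word/header1.xml is an assumption (it is Word's default name for the first header), and the helper names are made up for this example. A robust version would find the header through the document's relationships instead of a fixed path.

```python
import io
import zipfile
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
HEADER_PART = "word/header1.xml"  # assumption: Word's default header part name


def header_xml(text: str) -> str:
    """Return a minimal header part containing one paragraph of text."""
    return (
        f'<w:hdr xmlns:w="{W_NS}"><w:p><w:r>'
        f"<w:t>{escape(text)}</w:t></w:r></w:p></w:hdr>"
    )


def build_sample_document(header_text: str) -> bytes:
    """Build a stand-in document in memory (not a complete .docx)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as pkg:
        pkg.writestr("word/document.xml", "<document/>")
        pkg.writestr(HEADER_PART, header_xml(header_text))
    return buffer.getvalue()


def replace_part(package_bytes: bytes, part_name: str, new_content: str) -> bytes:
    """Copy the package, substituting new_content for one part."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as source, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
        if part_name not in source.namelist():
            raise KeyError(f"part not found: {part_name}")
        for name in source.namelist():
            if name == part_name:
                target.writestr(name, new_content.encode("utf-8"))
            else:
                target.writestr(name, source.read(name))
    return output.getvalue()


documents = {
    "report1.docx": build_sample_document("Draft"),
    "report2.docx": build_sample_document("Draft"),
}
updated = {
    name: replace_part(data, HEADER_PART, header_xml("Final"))
    for name, data in documents.items()
}
print(sorted(updated))
```

The archive is rewritten rather than patched because Python's zipfile module cannot replace an entry in place. On a server you would read each file's bytes from disk and write the result back.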

We recently published some resources that might interest you if you are trying to learn more about the Open XML File Formats:

Open XML Snippets

Microsoft Office Excel Snippets

  • Excel: Add Custom UI: This snippet adds a custom UI Ribbon part to a given workbook.
  • Excel: Delete Comments by a specific User: This snippet deletes all comments from a given user from a given workbook.
  • Excel: Delete Worksheet: This snippet deletes the specified worksheet from within a given workbook and resets the selected worksheet to the next one on the list. The snippet returns true if successful, false if not.
  • Excel: Delete Excel 4.0 Macro sheets: This snippet deletes all the Excel 4.0 Macro (XLM) sheets from a given workbook.
  • Excel: Retrieve hidden rows or columns: This snippet returns a list of hidden row numbers or column names from a given workbook and worksheet.
  • Excel: Export Chart: Given a workbook and title of a chart, this snippet exports the chart as a Chart (.crtx) file.
  • Excel: Get Cell Value: Given a workbook, worksheet and cell address, this snippet returns the value of the cell as a string.
  • Excel: Get Comments as XML: Given a workbook, this snippet returns all the comments as an XmlDocument.
  • Excel: Get Hidden Worksheets: This snippet returns a list containing the name and type of all hidden sheets in a given workbook.
  • Excel: Get Worksheet Information: This snippet returns a list containing the name and type of all sheets in a given workbook.
  • Excel: Get Cell for Reading: Given a workbook, worksheet and cell address, this snippet demonstrates how to navigate to the cell to retrieve its contents. The cell must exist for the function to find it.
  • Excel: Get Cell for Writing: Given a workbook, worksheet and cell address, this snippet demonstrates how to navigate to the cell to set its value. If the cell does not exist, the snippet creates it.
  • Excel: Insert Custom XML: Given a workbook and a custom XML value, this snippet inserts the custom XML into the workbook.
  • Excel: Insert Header or Footer: Given a workbook, worksheet and text to insert and a header or footer type, this snippet inserts the header or footer with the given text into the worksheet.
  • Excel: Insert a Numeric Value into a Cell: Given a workbook, worksheet, cell address and numeric value, this snippet inserts the value into the cell.
  • Excel: Insert a String Value into a Cell: Given a workbook, worksheet, cell address and string value, this snippet inserts the value into the cell.
  • Excel: Set Recalc Option: Given a workbook and a RecalcOption, this snippet sets the recalculation property to the new option.
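To give a feel for what a snippet such as Get Cell Value has to do, here is a rough Python sketch. It is not the downloadable snippet, just an illustration under some assumptions: the workbook part is at its default location (xl/workbook.xml), relationship targets are relative to the xl folder, and the namespace URIs are those of the final standard. The sample workbook is a minimal in-memory stand-in, and all function names are mine.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
DOC_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"
NS = {"m": MAIN}


def build_sample_workbook() -> bytes:
    """Build a stand-in workbook: A1 is a shared string, B1 is a number."""
    parts = {
        "xl/workbook.xml": (
            f'<workbook xmlns="{MAIN}" xmlns:r="{DOC_REL}"><sheets>'
            '<sheet name="Sales" sheetId="1" r:id="rId1"/></sheets></workbook>'
        ),
        "xl/_rels/workbook.xml.rels": (
            f'<Relationships xmlns="{PKG_REL}"><Relationship Id="rId1" '
            f'Type="{DOC_REL}/worksheet" Target="worksheets/sheet1.xml"/>'
            "</Relationships>"
        ),
        "xl/sharedStrings.xml": f'<sst xmlns="{MAIN}"><si><t>Region</t></si></sst>',
        "xl/worksheets/sheet1.xml": (
            f'<worksheet xmlns="{MAIN}"><sheetData><row r="1">'
            '<c r="A1" t="s"><v>0</v></c><c r="B1"><v>42.5</v></c>'
            "</row></sheetData></worksheet>"
        ),
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as pkg:
        for name, xml in parts.items():
            pkg.writestr(name, xml)
    return buffer.getvalue()


def get_cell_value(workbook_bytes: bytes, sheet_name: str, address: str):
    """Return the cell's value as a string, or None if it is not found."""
    with zipfile.ZipFile(io.BytesIO(workbook_bytes)) as pkg:
        workbook = ET.fromstring(pkg.read("xl/workbook.xml"))
        sheet = next((s for s in workbook.iterfind("m:sheets/m:sheet", NS)
                      if s.get("name") == sheet_name), None)
        if sheet is None:
            return None
        # The sheet element points to its part through a relationship id.
        rel_id = sheet.get(f"{{{DOC_REL}}}id")
        rels = ET.fromstring(pkg.read("xl/_rels/workbook.xml.rels"))
        target = next((r.get("Target")
                       for r in rels.iter(f"{{{PKG_REL}}}Relationship")
                       if r.get("Id") == rel_id), None)
        if target is None:
            return None
        worksheet = ET.fromstring(pkg.read("xl/" + target))
        cell = next((c for c in worksheet.iter(f"{{{MAIN}}}c")
                     if c.get("r") == address), None)
        if cell is None:
            return None
        value = cell.findtext("m:v", default="", namespaces=NS)
        if cell.get("t") == "s":
            # Shared strings are stored once; the cell holds an index.
            strings = ET.fromstring(pkg.read("xl/sharedStrings.xml"))
            return "".join(strings.findall("m:si", NS)[int(value)].itertext())
        return value


sample = build_sample_workbook()
print(get_cell_value(sample, "Sales", "A1"))
print(get_cell_value(sample, "Sales", "B1"))
```

The shared-string lookup is the step that most often surprises people: text cells do not hold their text directly.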

Microsoft Office PowerPoint Snippets

  • PowerPoint: Delete Comments by User: Given a presentation and a user name, this snippet deletes all comments by that user.
  • PowerPoint: Delete Slide by Title: Given a presentation and slide title, this snippet deletes the first instance of a slide with that title (titles are not unique).
  • PowerPoint: Get Slide Count: This snippet returns the number of slides in a given presentation.
  • PowerPoint: Get Slide Titles: Given a presentation, this snippet returns a list of the slide titles in the order presented.
  • PowerPoint: Modify Slide Title: Given a presentation, old slide title, and new slide title, this snippet changes the first instance of a slide with the given title to the new value. The snippet returns true if successful, false if not successful.
  • PowerPoint: Reorder Slides: Given a presentation, an original position, and a new position, this snippet attempts to move the slide from the original position to the new position within the deck. If the original position is outside the range of slides in the deck, the snippet uses the last slide. If the new position is outside the range of slides in the deck, the snippet puts the selected slide at the end of the deck. The snippet returns the location where the slide was placed, or -1 on failure.
  • PowerPoint: Replace Image: Given a presentation, slide title and image file, this snippet replaces the first image on the slide with the given image.
  • PowerPoint: Retrieve Slide Location by Title: Given a presentation and a slide title, this snippet returns the 0-based location of the first slide with a matching title.
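As a small illustration of the kind of work behind Get Slide Count, the Python sketch below counts the slide entries listed in the presentation part. It is an illustration only, not the downloadable snippet. It assumes the presentation part sits at its default location (ppt/presentation.xml) and uses the PresentationML namespace from the final standard. The sample file is a stripped-down stand-in: real slide entries also carry a relationship id that points to each slide part.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"


def build_sample_presentation(slide_count: int) -> bytes:
    """Build a stand-in presentation part listing slide_count slides."""
    slide_ids = "".join(
        f'<p:sldId id="{256 + i}"/>' for i in range(slide_count)
    )
    xml = (
        f'<p:presentation xmlns:p="{P_NS}">'
        f"<p:sldIdLst>{slide_ids}</p:sldIdLst></p:presentation>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as pkg:
        pkg.writestr("ppt/presentation.xml", xml)
    return buffer.getvalue()


def get_slide_count(presentation_bytes: bytes) -> int:
    """Count the slide entries in the presentation's slide id list."""
    with zipfile.ZipFile(io.BytesIO(presentation_bytes)) as pkg:
        root = ET.fromstring(pkg.read("ppt/presentation.xml"))
    return len(root.findall(f"{{{P_NS}}}sldIdLst/{{{P_NS}}}sldId"))


slide_count = get_slide_count(build_sample_presentation(3))
print(slide_count)
```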

Microsoft Office Word Snippets

  • Word: Accept Revisions: Given a document and an author name, this snippet accepts the revisions by that author.
  • Word: Add Header: Given a document and a stream containing valid header content, this snippet adds the stream content as a header in the document.
  • Word: Convert DOCM to DOCX: Given a macro-enabled document (.docm), this snippet removes the VBA project and converts the file to a macro-free Word Document (.docx).
  • Word: Remove Comments: Given a Word Document, this snippet removes all the comments.
  • Word: Remove Headers and Footers: This snippet removes all headers and footers from a given Word document.
  • Word: Remove Hidden Text: This snippet removes any hidden text in a given document.
  • Word: Replace Style: Given a document and valid style content, this snippet replaces the styles in the document with that content.
  • Word: Retrieve Application Property: Given a document name and an app property, this snippet returns the value of the property.
  • Word: Retrieve Core Property: Given a document name and a core property, this snippet returns the value of the property.
  • Word: Retrieve Custom Property: Given a document name and a custom property, this snippet returns the value of the property.
  • Word: Retrieve Table of Contents: Given a document name, this snippet returns a table of contents as an XmlDocument.
  • Word: Set Application Property: This snippet sets a property’s value given a document name, application property and value. The snippet returns the old value if successful.
  • Word: Set Core Property: Given a document name, a core property, and property value, this snippet sets the property value.
  • Word: Set Custom Property: Given a document name, a custom property, and a value, this snippet sets the property’s value. If the property does not exist, the snippet creates it. The snippet returns true if successful, false if not.
  • Word: Set Print Orientation: Given a document name, this snippet sets the print orientation for all sections in the document.
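The property snippets are among the simplest to picture, because core properties live in a small XML part of their own. Here is a rough Python sketch of retrieving one, again as an illustration and not the downloadable snippet. It assumes the core properties part is at its default location (docProps/core.xml). The sample values and the function names are invented for the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CP_NS = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_sample_document() -> bytes:
    """Build a stand-in document that contains only a core properties part."""
    core = (
        f'<cp:coreProperties xmlns:cp="{CP_NS}" xmlns:dc="{DC_NS}">'
        "<dc:title>Quarterly Report</dc:title>"
        "<dc:creator>Example Author</dc:creator>"
        "<cp:keywords>sample</cp:keywords>"
        "</cp:coreProperties>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as pkg:
        pkg.writestr("docProps/core.xml", core)
    return buffer.getvalue()


def get_core_property(document_bytes: bytes, name: str):
    """Return the text of the core property called name, or None."""
    with zipfile.ZipFile(io.BytesIO(document_bytes)) as pkg:
        root = ET.fromstring(pkg.read("docProps/core.xml"))
    for child in root:
        # Compare local names so callers need not know the namespace.
        if child.tag.rsplit("}", 1)[-1] == name:
            return child.text
    return None


sample_document = build_sample_document()
print(get_core_property(sample_document, "title"))
print(get_core_property(sample_document, "keywords"))
```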

Download them here!

Finally, if you want to stay current with new resources to work with the Open XML File Formats, go to the XML in Office Developer Portal. We launched this portal recently to create a special section of the MSDN Office Developer Center where you will find bloggers, technical articles, code samples, developer documentation, and multimedia presentations on working with XML in Office.

Happy Office XML programming!