Document Parsers in SharePoint (1 of 4): Overview

Now that I’ve talked about the built-in XML parser, and how you can use it to promote and demote document properties for XML files, you might be thinking: what about custom file types that aren’t XML? What if I’ve got proprietary binary file types whose properties I want to promote to, and demote from, the SharePoint list?

We’ve got you covered there as well.

For the next four entries, I’m going to go over in detail how to construct and register a custom parser that enables you to promote and demote properties between your custom file types and Windows SharePoint Services.

This information will get rolled into the next update of the WSS SDK, so consider this a preview in case you want to work with the parser framework right now.

Custom Document Parser Overview

Managing the metadata associated with your documents is one of the most powerful advantages of storing your enterprise content in WSS. However, keeping that information in sync between the document library and the document itself is a challenge. WSS provides the document parser infrastructure, which enables you to create and install custom document parsers that can parse your custom file types and update a document for changes made at the document library level, and vice versa. Using a document parser for your custom file types helps ensure that your document metadata stays synchronized between the document library and the document itself.

A document parser is a custom COM assembly that, by implementing the WSS document parser interface, does the following when invoked by WSS:

· Extracts document property values from a document of a certain file type, and passes those property values to WSS for promotion to the document library property columns.

· Receives document properties from WSS, and demotes those property values into the document itself.

This enables users to edit document properties in the document itself, and have the property values on the document library automatically updated to reflect their changes. Likewise, users can update property values at the document library level, and have those changes automatically written back into the document.
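
To make the two directions concrete, here’s a rough Python sketch of promotion and demotion. Everything in it is hypothetical: the file format (property lines of the form key=value, a blank line, then the body) is invented for illustration, and the promote and demote functions are not the WSS parser interface, which a real parser implements as a COM assembly.

```python
# Hypothetical illustration only. The "key=value header, blank line, body"
# format is invented, and these functions are not the WSS parser interface.

def promote(document: str) -> dict:
    """Extract property values from the document (document -> library)."""
    header, _, _body = document.partition("\n\n")
    properties = {}
    for line in header.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip header lines that are not key=value pairs
            properties[key.strip()] = value.strip()
    return properties


def demote(document: str, properties: dict) -> str:
    """Write property values back into the document (library -> document)."""
    _header, _, body = document.partition("\n\n")
    merged = promote(document)   # start from what the document already holds
    merged.update(properties)    # library-level values overwrite them
    header = "\n".join(f"{key}={value}" for key, value in merged.items())
    return header + "\n\n" + body


doc = "Title=Draft\nAuthor=Kim\n\nBody text"
columns = promote(doc)                    # values destined for library columns
updated = demote(doc, {"Title": "Final"}) # an edit made at the library level
```

In this toy model, editing Title in the library and demoting it changes only the header; the body of the document is left alone.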

I’ll talk about how WSS invokes document parsers, and how those parsers promote and demote document metadata, in my next entry.

Parser Requirements

For WSS to use a custom document parser, the document parser must meet the following conditions:

· The document parser must be a COM assembly that implements the document parser interface.

I’ll go over the details of the IParser interface in a later entry.

· The document parser assembly must be installed and registered on each front-end Web server in the WSS installation.

· You must add an entry for the document parser in DOCPARSE.XML, the file that contains the list of document parsers and the file types with which each is associated.

And I’ll give you the specifics of the document parser definition schema in a later entry as well. All in good time.

Parser Association

WSS selects the document parser to invoke based on the file type of the document to be parsed. A given document parser can be associated with multiple file types, but you can associate a given file type with only one parser.

To specify the file type or types that a custom document parser can parse, you add a node to the DOCPARSE.XML file. Each node in this file identifies a document parser assembly, and specifies the file type for which it is to be used. You can specify a file type by either file extension or program ID.

If you specify multiple document parsers for the same file type, WSS invokes the first document parser in the list associated with the file type.
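
The selection rule described above can be sketched in a few lines of Python. This is only a model of the behavior, not how WSS implements it: the REGISTRY list, its tuple layout, the parser names, and select_parser are all made up for illustration, and the list deliberately does not mimic the real parser definition schema. The exact-match comparison is also an assumption.

```python
from typing import Optional

# Hypothetical registry, in list order: (match kind, match value, parser name).
# The names and the layout are invented; this is not the real schema.
REGISTRY = [
    ("extension", "abc", "Contoso.AbcParser"),
    ("extension", "xyz", "Contoso.XyzParser"),        # one parser, two types
    ("progid", "Contoso.Drawing", "Contoso.XyzParser"),
    ("extension", "abc", "Fabrikam.AbcParser"),       # duplicate: never chosen
]


def select_parser(kind: str, value: str, registry=REGISTRY) -> Optional[str]:
    """Return the first parser in list order associated with the file type."""
    for entry_kind, entry_value, parser in registry:
        if entry_kind == kind and entry_value == value:
            return parser
    return None  # no parser is associated with this file type


by_extension = select_parser("extension", "abc")
by_progid = select_parser("progid", "Contoso.Drawing")
```

Because the loop returns on the first hit, the second entry for the abc extension is never reached, which is why a file type effectively has only one parser.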

WSS includes built-in document parsers for the following file types:

· OLE: includes DOC, XLS, PPT, MSG, and PUB file formats

· Office 2007 XML formats: includes DOCX, DOCM, PPTX, PPTM, XLSX, and XLSM file formats

· XML

· HTM: includes HTM, HTML, MHT, MHTM, and ASPX file formats

You cannot create a custom document parser for these file types. With the XML parser, you can use content types to specify which document properties you want to map to which content type columns, and where the document properties reside in your XML documents.
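
If you want a quick check of whether a file type is already claimed, the list above can be written down as a lookup table. This is a hedged Python sketch: the BUILT_IN table and both function names are mine, the table simply copies the list in this post, and treating "XML" as the xml extension is an assumption.

```python
from typing import Optional

# Built-in parsers and their file types, copied from the list in this post.
# Mapping the "XML" entry to the "xml" extension is an assumption.
BUILT_IN = {
    "OLE": {"doc", "xls", "ppt", "msg", "pub"},
    "Office 2007 XML": {"docx", "docm", "pptx", "pptm", "xlsx", "xlsm"},
    "XML": {"xml"},
    "HTM": {"htm", "html", "mht", "mhtm", "aspx"},
}


def built_in_parser_for(extension: str) -> Optional[str]:
    """Return the built-in parser that covers this extension, if any."""
    ext = extension.lower().lstrip(".")  # accept ".DOCX" as well as "docx"
    for parser_name, extensions in BUILT_IN.items():
        if ext in extensions:
            return parser_name
    return None


def can_use_custom_parser(extension: str) -> bool:
    """A custom parser is allowed only for types no built-in parser covers."""
    return built_in_parser_for(extension) is None
```

A made-up extension such as abc passes the check, while anything in the table above does not.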

Parser Deployment

To guarantee that WSS is able to invoke a given parser whenever necessary, you must install each parser assembly on each front-end Web server in your WSS installation. Because of this, you can specify only one parser for a given file type across a WSS installation.

The document parser framework does not include the ability to package and deploy a custom document parser as part of a SharePoint feature.

In my next post, I’ll discuss how the document parser actually parses documents and interacts with WSS.