Document Parsers in SharePoint (2 of 4): How Parsers Process Documents

Read part one here.

In my last entry, I gave you a brief overview of what document parsers are in Windows SharePoint Services V3 (WSS), and a high-level look at what you need to do to build a custom document parser for your own custom file types. Today we’re going to start digging a little deeper, and examine in detail how a parser interacts with WSS.

Document Parser Processing

When a file is uploaded, moved, or copied to a document library, WSS determines whether a parser is associated with the document's file type. If one is, WSS invokes the parser, passing it the document to be parsed and a property bag object. The parser extracts every property and its value from the document, not just the ones that match columns, and adds them to the property bag object.

WSS accesses the document property bag and determines which properties match the columns for the document. It then promotes those properties; that is, it writes each document property value to the matching document library column. WSS only promotes the properties that match columns applicable to the document. The columns applicable to a document are specified by:

· The document's content type, if it has been assigned a content type.

· The columns in the document library, if the document does not have a content type.

WSS also stores the entire document property collection in a hash table; this hash table can be accessed programmatically by using the SPFile.Properties property. There is no user interface to access the document properties hash table.

The following figure illustrates the document parsing process. In it, the parser extracts document properties from the document and writes them to the property bag. Of the four document properties, three are included in the document's content type. WSS promotes these properties to the document library; that is, it writes their property values to the appropriate columns. WSS does not promote the fourth document property, Status, even though the document library includes a matching column, because the document's content type does not include the Status column. WSS also writes all four document properties to a hash table that is stored with the document in the document library.
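To make the promotion rule concrete, here is a minimal sketch in Python. It is a toy model of the behavior described above, not the WSS API: the function names are made up for illustration, and apart from Status, the property names and values are hypothetical stand-ins for the ones in the figure.

```python
def applicable_columns(content_type_columns, library_columns):
    """Return the columns considered for promotion in this toy model.

    If the document has a content type, only that content type's columns
    apply; otherwise the document library's columns apply.
    """
    if content_type_columns is not None:
        return set(content_type_columns)
    return set(library_columns)


def promote(property_bag, content_type_columns, library_columns):
    """Return (promoted column values, full property table)."""
    columns = applicable_columns(content_type_columns, library_columns)
    promoted = {name: value for name, value in property_bag.items() if name in columns}
    all_properties = dict(property_bag)  # every property is kept, promoted or not
    return promoted, all_properties


# Hypothetical data: four properties, three of them in the content type.
property_bag = {"Title": "Q3 Report", "Author": "Kim", "Department": "Finance", "Status": "Draft"}
content_type_columns = ["Title", "Author", "Department"]
library_columns = ["Title", "Author", "Department", "Status"]

promoted, all_properties = promote(property_bag, content_type_columns, library_columns)
print(sorted(promoted))        # Status is absent: it is not in the content type
print(sorted(all_properties))  # all four properties are still stored
```

Passing None for content_type_columns models a document with no content type, in which case Status would be promoted because the library has a matching column.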

WSS can also invoke the parser to demote properties; that is, to write a column value into the matching property in the document itself. When WSS invokes the demotion function of the parser, it again passes the parser the document and a property bag object. In this case, the property bag object contains the properties that WSS expects the parser to demote into the document. The parser demotes the specified properties, and WSS saves the updated document back to the document library.

The figure below illustrates the document property demotion process. To update two document properties, WSS invokes the parser, passing it the document to be updated, and a property bag object containing the two document properties. The parser reads the property values from the property bag, and updates the properties in the document. When the parser finishes updating the document, it passes a parameter back to WSS that indicates that it has changed the document. WSS then saves the updated document to the document library.
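The demotion round trip can be sketched the same way. Again, this is a toy model rather than the actual parser interface: the demote function, the dictionary standing in for the document, and the sample values are all assumptions made for illustration. The boolean return value plays the role of the parameter that tells WSS the document was changed.

```python
def demote(document_properties, property_bag):
    """Write each value in the property bag into the document's properties.

    Returns True if any property in the document actually changed, which
    stands in for the "document changed" indication passed back to WSS.
    """
    changed = False
    for name, value in property_bag.items():
        if document_properties.get(name) != value:
            document_properties[name] = value
            changed = True
    return changed


# Hypothetical document with three properties; two are being updated.
document = {"Title": "Draft plan", "Author": "Kim", "Status": "Draft"}
bag = {"Title": "Final plan", "Status": "Approved"}

if demote(document, bag):
    # In the real flow, WSS would save the updated document at this point.
    print("document changed:", document)
```

Calling demote a second time with the same property bag returns False in this sketch, since nothing is left to update.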

Mapping Document Properties to Columns

Once the document parser writes the document properties to the property bag object, WSS promotes the document properties that match columns on the document library. To do this, WSS compares the document property name with the internal names of the columns in the document’s content type, or on the document library itself if the document doesn’t have a content type. When WSS finds a column whose internal name matches the document property, it promotes the document property value into that column for the document.

However, WSS also enables you to explicitly map a document property to a specific column. You specify this mapping information in the column’s field definition schema.

Mapping document properties to columns in the column’s field definition enables you to map document properties to columns that may or may not be named the same. For example, you can map the document property ‘Author’ to the ‘Creator’ column of a content type or document library.

To specify this mapping, add a ParserRef element to the field definition schema, as shown in the example below:

<Field Type="Text" Name="Creator" … >
  <ParserRefs>
    <ParserRef Name="Author" Assembly="myParser.Assembly" />
  </ParserRefs>
</Field>

The following elements are used to define a document property mapping:

ParserRefs

Optional. Represents a list of document parser references for this column.

ParserRef

Optional. Represents a single document parser reference. This element contains the following attributes:

· Name Required String. The name of the document property to be mapped to this column.

· Assembly Required String. The name of the document parser used.

A column’s field definition might contain multiple parser references, each specifying a different document parser.
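As a rough illustration of how these elements fit together, the Python sketch below reads a field definition and builds a lookup from (parser, document property) to column name. It only models the explicit mappings; the explicit_mappings helper is hypothetical, and the field definition is a trimmed, well-formed version of the example above with the elided attributes left out.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the Creator field definition (elided attributes omitted).
FIELD_XML = """
<Field Type="Text" Name="Creator">
  <ParserRefs>
    <ParserRef Name="Author" Assembly="myParser.Assembly"/>
  </ParserRefs>
</Field>
"""


def explicit_mappings(field_xml):
    """Return {(parser assembly, document property name): column name}."""
    field = ET.fromstring(field_xml)
    column = field.get("Name")
    return {
        (ref.get("Assembly"), ref.get("Name")): column
        for ref in field.findall("./ParserRefs/ParserRef")
    }


mappings = explicit_mappings(FIELD_XML)
print(mappings[("myParser.Assembly", "Author")])  # Creator
```

Keying the lookup on both the parser and the property name reflects the fact that one column can carry several parser references, each for a different parser.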

In addition, if you are working with files in multiple languages, you can use parser references to manage the mapping of document properties to the appropriate columns in each language, rather than having to build that functionality into the parser itself. The parser can generate a single document property, while you use parser references to make sure the property is mapped to the correct column for that language. For example, suppose a parser extracts a document property named ‘Singer’. You could then map that property to a column named ‘Cantador’, as in the example below:

<Field Type="Text" Name="Cantador" … >
  <ParserRefs>
    <ParserRef Name="Singer" CLSID="MyParser" />
    <ParserRef Name="Artist" Assembly="MP3Parser.Assembly" />
  </ParserRefs>
</Field>

To edit a column’s field definition schema programmatically, use the SPField.SchemaXml property. There is no equivalent user interface for specifying parsers for a column.
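In WSS itself you would do this in .NET code against the string held in that property. The XML edit is the same in any language, though, so here is a minimal Python sketch of just that step. The add_parser_ref helper is hypothetical, and the input is a trimmed field definition rather than a real schema pulled from a site.

```python
import xml.etree.ElementTree as ET


def add_parser_ref(schema_xml, property_name, assembly):
    """Return schema_xml with a ParserRef for (property_name, assembly) added.

    Creates the ParserRefs container if the field does not have one, and
    leaves the schema unchanged if the same reference is already present.
    """
    field = ET.fromstring(schema_xml)
    refs = field.find("ParserRefs")
    if refs is None:
        refs = ET.SubElement(field, "ParserRefs")
    already_there = any(
        ref.get("Name") == property_name and ref.get("Assembly") == assembly
        for ref in refs.findall("ParserRef")
    )
    if not already_there:
        ET.SubElement(refs, "ParserRef", {"Name": property_name, "Assembly": assembly})
    return ET.tostring(field, encoding="unicode")


schema = '<Field Type="Text" Name="Cantador" />'
schema = add_parser_ref(schema, "Artist", "MP3Parser.Assembly")
print(schema)
```

Making the helper skip a reference that already exists means it can be run repeatedly, for example from a deployment script, without piling up duplicate ParserRef elements.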

In the next entry, we'll discuss how WSS processes documents that contain their content type definition.