FS4SP - Document Signature Customization

In this post I'll describe how to customize the document signature in FAST Search Server 2010 for SharePoint (FS4SP).

documentsignature is the integer managed property (MP) that we use to de-duplicate results (also known as collapsing). Its value is an integer generated with SHA-256.

Get-FASTSearchMetadataManagedProperty -Name documentsignature

Name : documentsignature
Description : ID used for duplicate collapsing
Type : Integer
Queryable : True
StemmingEnabled : False
RefinementEnabled : False
MergeCrawledProperties : False
SubstringEnabled : False
DeleteDisallowed : True
MappingDisallowed : False
MaxIndexSize : 1024
MaxResultSize : 64
DecimalPlaces : 3
SortableType : SortableEnabled
SummaryType : Static

 

The document signature value is computed by the pipeline stage called "DocumentSignature", which is defined in:

%FASTSearch%\etc\pipelineconfig.xml

The stage runs AFTER crawled properties (CPs) have been mapped to managed properties (MPs), which means its input fields reference MPs, not CPs:

      <!-- Process managed properties -->
      <processor name="LanguageDetectorSecondPass" />
      <processor name="Tokenizer"/>
      <processor name="OffensiveContentFilter"/>
      <processor name="DocumentSignature"/>

The stage definition is as follows:

    <processor name="DocumentSignature" type="general" hidden="0">
      <load module="processors.DuplicateId" class="DuplicateId"/>
      <config>
       <param name="Output" value="documentsignature" type="string"/>
       <param name="Input" value="title:0:required body:1024:required documentsignaturecontribution" type="string"/>
      </config>
    </processor>

As you may have noticed, there are three input MPs here: title, body and documentsignaturecontribution.

How do you read the Input parameter?

field:<length>:<required>

length: limits the number of bytes taken into account for the signature computation. 0 means no limit (the entire content is used).

required: the field is mandatory for the signature computation. If the field has no value (for example because it is not mapped), no document signature is generated.

If neither length nor required is set (e.g. documentsignaturecontribution), the MP is taken into account only when it is populated, with no limit on the number of bytes.
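To make the syntax concrete, here is a small Python sketch that parses an Input value into its parts. This is only my reading of the format described above, not code taken from the DuplicateId processor; the names InputField and parse_input_spec are made up for the illustration.

```python
from typing import List, NamedTuple


class InputField(NamedTuple):
    name: str       # managed property name
    max_bytes: int  # 0 means no limit (entire content)
    required: bool  # True if the field must have a value


def parse_input_spec(spec: str) -> List[InputField]:
    """Parse a space-separated list of field[:length[:required]] entries.

    Hypothetical helper: it mirrors the format explained in this post,
    not the actual parsing code of the DocumentSignature stage.
    """
    fields = []
    for entry in spec.split():
        parts = entry.split(":")
        name = parts[0]
        # No length given is treated like 0, i.e. no byte limit.
        max_bytes = int(parts[1]) if len(parts) > 1 and parts[1] else 0
        required = len(parts) > 2 and parts[2] == "required"
        fields.append(InputField(name, max_bytes, required))
    return fields


spec = "title:0:required body:1024:required documentsignaturecontribution"
for field in parse_input_spec(spec):
    print(field)
```

With the default Input value this gives title (no limit, required), body (1024 bytes, required) and documentsignaturecontribution (no limit, optional).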

What now?

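Well, as seen above, FAST Search Server provides a mechanism to influence the document signature beyond the title and body MPs!

documentsignaturecontribution is a placeholder MP: the stage already references it, and it starts contributing to the signature once you create it and map a value to it (step 4 below).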
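To see how the three inputs interact, here is a minimal Python sketch of the rules described above. It is NOT the actual DuplicateId algorithm: the way fields are combined and the way the SHA-256 digest is reduced to an integer are assumptions of mine, and the sample documents are invented.

```python
import hashlib
from typing import Optional

# (field name, byte limit where 0 means no limit, required?) as in the stage's Input parameter
INPUT_FIELDS = [
    ("title", 0, True),
    ("body", 1024, True),
    ("documentsignaturecontribution", 0, False),
]


def sketch_signature(doc: dict) -> Optional[int]:
    """Illustrative only: follows the rules from this post, not FAST's real code."""
    digest = hashlib.sha256()
    for name, max_bytes, required in INPUT_FIELDS:
        value = doc.get(name)
        if not value:
            if required:
                return None  # a required field has no value: no signature
            continue         # optional field not populated: simply skipped
        data = value.encode("utf-8")
        if max_bytes:
            data = data[:max_bytes]  # only the first max_bytes bytes count
        digest.update(name.encode("utf-8") + b"\x00" + data + b"\x00")
    # Assumption: keep the first 8 bytes of the digest as the integer value.
    return int.from_bytes(digest.digest()[:8], "big")


# Invented sample documents
a = {"title": "Job offer", "body": "Same text"}
b = {"title": "Job offer", "body": "Same text"}
c = {"title": "Job offer", "body": "Same text", "documentsignaturecontribution": "HR"}

print(sketch_signature(a) == sketch_signature(b))  # True
print(sketch_signature(a) == sketch_signature(c))  # False
print(sketch_signature({"title": "No body"}))      # None
```

Documents a and b would be collapsed together, while c gets its own signature only because documentsignaturecontribution is populated.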

 

Example of Use

1. Create a custom list in SP2010 with 4 columns: DSTitle, DSContribution, DSBody and DSID (optional).

2. Run a crawl so that the newly discovered crawled properties are available for mapping.

 

3. Map DSTitle and DSBody to their corresponding MPs (title is already mapped, since I just renamed the Title column to DSTitle).

 

4. Create the documentsignaturecontribution managed property as Text (string) and map it to your custom column DSContribution.

 

5. Run a full crawl.

6. Based on my list (see step 1), I expect list items #1, #2 and #5 to be treated as distinct and list items #3 and #4 to be de-duplicated.

 

I can verify this in the FAST Search Center:

7. You can also verify it in the qrserver XML output.

This is the query we get from the FAST Search Center (note collapseon=batvdocumentsignature and collapsenum=1):

 

::1 - - [13/Mar/2013:16:56:45 +0100] "GET /cgi-bin/search?qtf_keyword:context=ssgid%3a%3a19496a77-5a8f-4b20-b36c-ba3ae70b5425%7cSPS-Location%3a%3a%7cSPS-Responsibility%3a%2c%3a%7c&hits=10&rpf_navigation:hits=50&rpf_navigation:enabled=True&spell=suggest&qtf_parsekw:timezone=3&type=kwall&qtf_teaser:dynlength=185&resubmitflags=1&language=en&query=emploi&sortby=%2bdefault&qtf_lemmatize=True&offset=0&version=14.0.0.0&collapseon=batvdocumentsignature&collapsenum=1&rpf_sortsimilar:enabled=False&qtf_parsekw:locale=eng&qtf_parsekw:localename=en-US&rpf_navigation:navigators=format(cutoff%3d0%2f0%2f20%2csort%3dfrequency%2fdescending)%2csitename(cutoff%3d0%2f0%2f20%2csort%3dfrequency%2fdescending)%2cauthor(cutoff%3d0%2f0%2f20%2csort%3dfrequency%2fdescending)%2cwrite(discretize%3dmanual%2f2012-03-13%2f2012-09-11%2f2013-02-11%2f2013-03-06%2f2013-03-12)%2ccompanies(cutoff%3d0%2f0%2f20%2csort%3dfrequency%2fdescending)%2caonamecolm(cutoff%3d0%2f0%2f20%2csort%3dfrequency%2fdescending)%2cowsmetadatafacetinfo&qtf_securityfql:uid=MCMud3xmYXN0ZXMwXG5pY29sYXN1&rpf_security:uid=MCMud3xmYXN0ZXMwXG5pY29sYXN1 HTTP/1.1" 200 27242 "" "" 0.0320 0.0000 0.0160 4 [truncated]
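That log line is hard to read, so here is a quick way to pull out the collapsing parameters with Python's standard library. The request string is a shortened excerpt of the line above (only parameters that actually appear in the log are kept); the variable names are mine.

```python
from urllib.parse import parse_qs, urlsplit

# Shortened excerpt of the logged request; every parameter comes from the log line above.
request = (
    "/cgi-bin/search?hits=10&query=emploi&offset=0"
    "&collapseon=batvdocumentsignature&collapsenum=1"
)

# Split off the query string and decode it into a dict of lists.
params = parse_qs(urlsplit(request).query)

print(params["query"][0])        # emploi
print(params["collapseon"][0])   # batvdocumentsignature
print(params["collapsenum"][0])  # 1
```

collapseon shows that the Search Center asks the qrserver to collapse on the document signature field, and collapsenum=1 presumably means one hit is kept per signature value (that last part is my reading, not something the log states).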

 

The collapsing on the qrserver:

FCOCOUNT indicates how many documents were collapsed into one search result; SITEID indicates the value of the field we collapsed on, here the document signature.

 

The SITEID from the qrserver is transmitted as the fcoid property on the front-end.

 

 

Et voilà!

I hope you find this feature useful: it lets you refine the out-of-the-box document signature (title and body:1024) and tailor it to your customers' requirements.

 

Keep in mind that documentsignaturecontribution has no byte-size limit. Avoid mapping a huge amount of data to it, as that would hurt document processing performance.

 

Enjoy. Stay tuned.