Microsoft re-engineered the search experience in SharePoint 2013 to take advantage of the best capabilities from FAST plus many new capabilities built from the ground up. Although much has been said about the query-side changes to search (result sources, query rules, the content by search web part, display templates, etc.), the feed side of search got similar love from Redmond. In this post I'll discuss a concept carried over from FAST that allows crawled content to be manually massaged before it is added to the search index. Several basic examples of this capability exist online, so I'll throw some advanced solution challenges at it. The solution adds a sentiment analysis score to indexed social activity, as outlined in the video below.
The content enrichment web service (CEWS) callout is a component of the content processing pipeline that enables organizations to augment content before it is added to the search index. CEWS can be any external SOAP-based web service that implements the IContentProcessingEnrichmentService interface. SharePoint can be configured to call CEWS with specific managed properties and (optionally) the raw binary file. CEWS can update existing managed property values and/or add completely new managed properties. The outputs of this enrichment service get merged into content before it is added to the search index. The CEWS callout can be used for numerous data cleansing, entity extraction, classification, and tagging scenarios such as:
- Perform sentiment analysis on social activity and augment activity with a sentiment score
- Translate a field or folder structure to a taxonomy term in the managed metadata service
- Derive an item property based on one or more other properties
- Perform lookups against line of business data and tag items with that data
- Parse the raw binary file for more advanced entity extraction
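At its core, the callout is just a WCF service implementing a single method. Here is a minimal synchronous sketch of that contract (the class name and the "SentimentScore" managed property are illustrative, and it assumes a reference to the Microsoft.Office.Server.Search.ContentProcessingEnrichment assembly):

```csharp
using System.Collections.Generic;
using Microsoft.Office.Server.Search.ContentProcessingEnrichment;
using Microsoft.Office.Server.Search.ContentProcessingEnrichment.PropertyTypes;

public class EnrichmentService : IContentProcessingEnrichmentService
{
    public ProcessedItem ProcessItem(Item item)
    {
        //anything returned here is merged into the item before it hits the index
        var output = new ProcessedItem { ItemProperties = new List<AbstractProperty>() };
        output.ItemProperties.Add(new Property<decimal> { Name = "SentimentScore", Value = 0.5m });
        return output;
    }
}
```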
The content enrichment web service is a synchronous callout in the content processing pipeline, so complex operations in CEWS can have a big impact on crawl duration and performance. An additional challenge is enriching content that hasn't changed (and thus doesn't get crawled): an item only goes through the content processing pipeline during a full crawl, or during an incremental/continuous crawl after the item is updated or marked dirty. When only the enriched properties need to change, a full crawl is the only out-of-the-box approach to content enrichment.
The solution outlined in this post addresses both of these challenges. It delivers an asynchronous CEWS callout and a process for marking an indexed item as dirty so it can be re-crawled without touching/updating the actual item. The solution has three primary components: a content enrichment web service, a custom SharePoint timer job that marks content in the crawl log for re-crawl, and a database that queues asynchronous results for the other components to reference.
|High-level Architecture of Async CEWS Solution|
Enrichment Queue (SQL Database)
Because of the asynchronous nature of the solution, operations will be running on different threads, some of which could be long running. In order to persist information between threads, I leveraged a single-table SQL database to queue asynchronously processed items. Here is the schema and description of that database table.
| Column | Description |
| --- | --- |
| Id | Integer identity column that serves as the unique id of the rows in the table |
| ItemPath | The absolute path to the item as provided by the crawler and crawl logs |
| ManagedProperty | The managed property that gets its value from an asynchronous operation |
| DataType | The data type of the managed property so the value can be cast correctly |
| CrawlDate | The date the item was sent through CEWS, which serves as a crawl timestamp |
| Value | The value derived from the asynchronous operation |
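In T-SQL, the queue table described above might be created like this (table name, column sizes, and types are assumptions inferred from the descriptions, not the exact schema from the solution download):

```sql
CREATE TABLE [dbo].[EnrichmentQueue]
(
    [Id]              INT IDENTITY (1, 1) NOT NULL PRIMARY KEY,
    [ItemPath]        NVARCHAR (2048)     NOT NULL, -- absolute path from the crawler/crawl logs
    [ManagedProperty] NVARCHAR (256)      NOT NULL, -- property populated asynchronously
    [DataType]        NVARCHAR (64)       NOT NULL, -- used to cast Value correctly
    [CrawlDate]       DATETIME            NOT NULL, -- when the item passed through CEWS
    [Value]           NVARCHAR (MAX)      NULL      -- result of the asynchronous operation
);
```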
Content Enrichment Web Service
As mentioned at the beginning of the post, the content enrichment web service callout is implemented by creating a web service that implements the IContentProcessingEnrichmentService interface. There are a number of great examples of this online, including on MSDN, so this post will instead focus on calling asynchronous operations from the callout. The main objective of making the CEWS callout asynchronous is to prevent the negative impact a long-running process could have on crawling content. The best way to do this in CEWS is to collect all the information we need in the callout, pass that information to a long-running process queue, update any items that already have values ready in the queue, and then release the callout thread (before the long-running process completes).
|Process Diagram of Async CEWS|
Below is the callout code, abridged here to its key pieces (the full listing is in the solution download). Note that I leveraged the Entity Framework for connecting to my enrichment queue database (the ContentEnrichmentEntities class):

```csharp
private readonly ProcessedItem processedItem = new ProcessedItem();

public ProcessedItem ProcessItem(Item item)
{
    //apply any queued values, then delete the async item from the database
    //save the changes
    //Start a new thread for this async operation (passing an AsyncData payload)
    return processedItem; //release the callout before the long-running work completes
}

//worker thread: POST the text to the sentiment service and queue the score
using (Stream requestStream = myRequest.GetRequestStream())
using (WebResponse response = myRequest.GetResponse())
asyncItem.Value = sentiment.ToString();

public class AsyncData //payload: item path, managed property, crawl date
```
The content enrichment web service is associated with a search service application using Windows PowerShell. The configuration offers a lot of flexibility around the managed properties going in and out of CEWS and the criteria for triggering the callout. In my example the trigger is empty, indicating that all items go through CEWS:
```powershell
$ssa = Get-SPEnterpriseSearchServiceApplication
```
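Building on the $ssa handle above, a fuller registration might look like the following (the endpoint URL and property names are placeholders; InputProperties/OutputProperties should match the managed properties your service reads and writes):

```powershell
$config = New-SPEnterpriseSearchContentEnrichmentConfiguration
$config.Endpoint = "http://localhost:818/ContentEnrichmentService.svc"  # placeholder URL
$config.InputProperties = "Path", "ContentSource"                       # properties sent to CEWS
$config.OutputProperties = "SentimentScore"                             # properties CEWS can write
$config.SendRawData = $false                                            # set to $true to receive the raw binary
Set-SPEnterpriseSearchContentEnrichmentConfiguration -SearchApplication $ssa -ContentEnrichmentConfiguration $config
```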
Timer Job (Force Re-Crawl)
The biggest challenge with an asynchronous enrichment approach is updating the index after the CEWS thread is released. No API exists to directly update items in the search index, so CEWS is the last opportunity to augment an item before it becomes available to users executing queries. The best we can do is kick off an asynchronous thread that queues enrichment data for the next crawl. Marking individual items for re-crawl is a critical component of the solution, because "the next crawl" will only pick up an item if a full crawl occurs or if the search connector believes the source item has been updated (which could be never). The crawl log in Central Administration provides a mechanism to mark individual indexed items for re-crawl:
|CrawlLogURLExplorer.aspx option to recrawl|
I decompiled the CrawlLogURLExplorer.aspx page and was pleased to find it leverages the Microsoft.Office.Server.Search.Administration.CrawlLog class, whose public RecrawlDocument method re-crawls items by path. This API essentially updates the item's entry in the crawl log so it looks like an error to the crawler, and thus gets picked up in the next incremental/continuous crawl.
So why a custom SharePoint timer job? An item may not yet be represented in the crawl log when our asynchronous thread completes (especially for new items), and calling RecrawlDocument on a path that does not exist in the crawl log does nothing. The timer job lets us mark items for re-crawl only when the most recent crawl is complete or started after the item's crawl timestamp. In short, it takes a minimum of two incremental crawls for a new item to get enrichment data with this asynchronous approach.
```csharp
//key members of the timer job (full bodies in the solution download):
public override void Execute(Guid targetInstanceId)  //marks eligible queue items for re-crawl
private EntityConnection GetEntityConnection()       //returns the formatted connection string
private void GetLatestCrawlTimes(SPSite site, out DateTime start, out DateTime stop)  //reads the latest crawl window
```
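Inside Execute, the actual re-crawl call is small. A hedged sketch of what happens for each eligible queue row (it assumes a reference to Microsoft.Office.Server.Search and that the ssa and itemPath variables are already resolved):

```csharp
using Microsoft.Office.Server.Search.Administration;

//ssa is the SearchServiceApplication; itemPath comes from the queue row
var crawlLog = new CrawlLog(ssa);
bool marked = crawlLog.RecrawlDocument(itemPath);
//RecrawlDocument flags the crawl log entry as an error so the next
//incremental/continuous crawl picks the item up again
```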
With these three solution components in place, we get the following before/after experience in search:
Content enrichment is a mature and powerful search customization. I hope this post helped illustrate a creative use of content enrichment that can take your search experience to the next level.
Code for solution: ContentEnrichmentServices.zip