Content enrichment service scaling and aggregation

WCF Routing Service with content based routing

The configuration of the content enrichment feature only supports a single web service endpoint. This can be limiting for a number of common scenarios:

  • You want to integrate more than one content enrichment web service into content processing.
  • Your web service is time-consuming and you want to load balance between different instances of the web service.
  • You have a scaled out search topology with multiple content processing components, and you want to scale up the web service to match the load.
  • You want fault tolerance for the web service.

There are several possible technologies one could consider to solve these scenarios. This blog post focuses on using the Windows Communication Foundation (WCF) Routing Service technology included in .NET Framework 4.0. You can also check out our upcoming blog post on how to deploy a network load balancing cluster for high performance and availability.

WCF Routing enables development of complex routing logic, load-balancing, and fault tolerance. All of these mechanisms support our underlying requirement for scaling out, but not all of them need to be implemented for all scenarios. We’ll look closer at routing logic in particular in this blog post.

In short, the benefits of WCF Routing are:

  • Simple and quick technology to implement and deploy.
  • Allows for a varying degree of routing logic complexity.
    • A set of pre-defined filters that can be customized.
    • Completely custom filter implementations.
  • Service aggregation through routing rules that inspect managed properties.
  • Fault tolerance through backup endpoints.
  • Load balancing through custom filter implementations.

Search topology and WCF Routing

We’ll start by recapping some basics, and then move on to a concrete example.

A search topology can consist of anywhere from 1 to n content processing components. The role of a content processing component is to parse and transform the data coming into the system before delivering it to the indexing component. This processing takes place in discrete processing flows that can range from 0 to n instances within a specific content processing component. The number of active flow instances depends on available resources and the amount of data being crawled. As a ballpark figure, expect the number of physical cores on the host multiplied by three; a host with 8 physical cores would run roughly 24 flow instances. There’s no guarantee that this rule of thumb will hold in future versions.

When content enrichment is enabled for the Search service application, all active flow instances will potentially call out to the configured web service endpoint for every document. Assuming the web service adds no processing delay of its own, it will receive roughly the same number of calls per second as the crawl rate (documents per second) of the farm. How much of a bottleneck the web service becomes, if at all, depends on the following factors:

  • The amount of resources consumed by the web service implementation.
  • The hardware specification of the web service host.
  • The number of calls per second.
  • The size of the configured payload sent and received between the content processing component and the web service, and the network topology between them.
  • The number of concurrent calls, which depends on the number of content processing components and active flow instances.

The following is a simple visual representation of how a search topology with two content processing components can be configured to communicate with a single WCF Routing Service. The WCF Routing Service in turn distributes incoming requests to the appropriately registered service endpoints based on a set of defined filters and the content of the received SOAP envelope. Each service implementation has a backup endpoint that will ensure high availability in case of a failure situation. Typically a CommunicationException or TimeoutException will cause the router to try the backup endpoint.

[Figure: Search topology]

Even though a single connector appears between nodes in the drawing, there will most likely be multiple HTTP connections at run time. The number of allowed active connections can be throttled through the <serviceThrottling> element of the service behavior in the web configuration file (for IIS hosting). By default the underlying connections are persistent, which creates less overhead than re-creating an HTTP connection for every call.
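
As a minimal sketch of such throttling, the fragment below adds a <serviceThrottling> element to a service behavior. The behavior name matches the one used for the router later in this post; the three limits are illustrative assumptions, not recommendations, and should be sized against the number of flow instances that can call the router at the same time.

<behavior name="RoutingServiceBehavior">
    <!-- Illustrative limits only; tune these against your own load -->
    <serviceThrottling
        maxConcurrentCalls="64"
        maxConcurrentSessions="64"
        maxConcurrentInstances="64" />
</behavior>

With two content processing components on 8-core hosts, roughly 48 flow instances could be calling at once, so a limit well below that would make callers queue at the router.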

Example implementation of content based routing

There may be situations where you have different web service implementations aimed at different types of content. You can pack all of them into a single service and handle requests differently depending on content, but in other cases you may know a priori that some content will be tougher to process and that it’s desirable to dedicate a particularly beefy host to those documents. Also important, maintainability of your service implementations may decrease if you have no separation of business logic. To show you how to achieve this, we’ll walk through an implementation of a WCF Routing Service where we do content-based routing predicated on the content source of an item.

Creating the WCF Routing Service

The following fictitious values are used in the example.

Role                    Value
Web Service 1           servicehost1.contoso.com
Web Service 1 backup    servicehost2.contoso.com
Web Service 2           servicehost3.contoso.com
Web Service 2 backup    servicehost4.contoso.com
WCF Routing Host        routinghost.contoso.com

While there are different ways of implementing a WCF Routing Service, and different levels of complexity, we’ll focus on a very simple router that we can express mostly declaratively through the web configuration file. Initially you’ll need to have Internet Information Services (IIS) set up on a server and create a new site (including a new directory on your local drive for the site).

Let’s start with the web.config file and look at the different sections in it separately before tying it all together in a full example. Every section described below will be a descendant of the <system.serviceModel> node. We’ll start with the binding used by both the router’s exposed service, and the clients it talks to.

Bindings

We’ve created a single basicHttpBinding where we’ve configured large values for the readerQuotas and the maxReceivedMessageSize. These values can be reduced later on once you know the limits you want to have in place. They are used to limit the allowed size and complexity of the received SOAP envelope.

<basicHttpBinding>
    <binding 
        name="basicHttpBinding_IContentProcessingEnrichmentService" 
        maxReceivedMessageSize="8388608">
        <readerQuotas 
            maxDepth="32" 
            maxStringContentLength="2147483647" 
            maxArrayLength="2147483647" 
            maxBytesPerRead="2147483647" 
            maxNameTableCharCount="2147483647" />
        <security mode="None" />
    </binding>
</basicHttpBinding>

Services

This is where we define the endpoint that the router uses to expose itself. We will configure the content enrichment feature in SharePoint to use this endpoint through the cmdlets later. Note that the baseAddress attribute is not required when hosting in IIS; it’s included here only to make clear which host this service runs on.

 <service 
    behaviorConfiguration="RoutingServiceBehavior" 
    name="System.ServiceModel.Routing.RoutingService">
    <host>
        <baseAddresses>
            <add baseAddress="http://routinghost.contoso.com:800"/>
        </baseAddresses>
    </host>
    <endpoint 
        name="RoutingServiceEndpoint" 
        address="" 
        binding="basicHttpBinding" 
        bindingConfiguration=
        "basicHttpBinding_IContentProcessingEnrichmentService"
        contract="System.ServiceModel.Routing.IRequestReplyRouter" />
</service>

Clients

Here we define the endpoints to the content enrichment web service implementations that the router will route to. These are not different from a normal implementation that you host in a single-service scenario. As can be seen in the following example, we’re configuring a total of four client endpoints. These cover our two different service implementations, with an additional backup for each in case of a failure.

 <client>
    <!-- Addresses use http because the binding sets security mode="None" -->
    <endpoint 
        name="Service1" 
        address="http://servicehost1.contoso.com:800/ContentEnrichmentService.svc"
        binding="basicHttpBinding" 
        bindingConfiguration="basicHttpBinding_IContentProcessingEnrichmentService" 
        contract="*" />
    <endpoint 
        name="Service1Backup" 
        address="http://servicehost2.contoso.com:800/ContentEnrichmentService.svc" 
        binding="basicHttpBinding" 
        bindingConfiguration="basicHttpBinding_IContentProcessingEnrichmentService" 
        contract="*" />
    <endpoint 
        name="Service2" 
        address="http://servicehost3.contoso.com:800/ContentEnrichmentService.svc" 
        binding="basicHttpBinding" 
        bindingConfiguration="basicHttpBinding_IContentProcessingEnrichmentService" 
        contract="*" />
    <endpoint 
        name="Service2Backup" 
        address="http://servicehost4.contoso.com:800/ContentEnrichmentService.svc" 
        binding="basicHttpBinding" 
        bindingConfiguration="basicHttpBinding_IContentProcessingEnrichmentService" 
        contract="*" />
</client>
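
Failover only helps if the router gives up on a slow primary quickly enough. As a sketch under that assumption, the client endpoints could use a second binding with a shorter sendTimeout than the WCF default of one minute, so that a hung service raises a TimeoutException and the backup endpoint is tried sooner. The binding name and the 10 second value are hypothetical, and the reader quotas are left at their defaults for brevity.

<basicHttpBinding>
    <!-- Hypothetical binding intended for the router's client endpoints only -->
    <binding 
        name="basicHttpBinding_EnrichmentClient" 
        maxReceivedMessageSize="8388608" 
        sendTimeout="00:00:10">
        <security mode="None" />
    </binding>
</basicHttpBinding>

Each client endpoint would then reference this name in its bindingConfiguration attribute. Pick a timeout below whatever timeout the caller itself applies, otherwise the caller gives up before the backup is tried.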

Service behavior

We need to create a service behavior where we reference the name of the filter table that will be defined in the next step. In addition, to enable full inspection of the SOAP envelopes in our XPath filters, we set the attribute routeOnHeadersOnly to false.

 <behavior name="RoutingServiceBehavior">
    <routing 
        filterTableName="ContentSourceFilters" 
        routeOnHeadersOnly="False"/>
</behavior>
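
To see why body inspection is needed, consider where the routed value lives. The fragment below is a simplified, hypothetical sketch of one property in the request, reconstructed only from the element names that the XPath filters in this post test for (Property, Name, and Value). The surrounding envelope elements are omitted, and the real message may carry additional attributes.

<!-- Simplified sketch; envelope and other properties omitted -->
<cc:Property xmlns:cc="http://schemas.microsoft.com/office/server/search/contentprocessing/2012/01/ContentProcessingEnrichment">
    <cc:Name>ContentSource</cc:Name>
    <cc:Value>Local Sharepoint Sites</cc:Value>
</cc:Property>

Because this data sits in the SOAP body rather than in a header, a router that only looks at headers cannot match on it.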

Routing

Here we define the filters and the filter table where we map the filters to normal endpoints and backup endpoints. The XPath expressions look for all Property nodes in the SOAP envelope by using a predicate that specifies the name of the property and the value. This predicate is used to match against specific content sources. There are various types of filters that we can use, but the XPath type is sufficient in speed and functionality for this example. To develop more complex scenarios, take a look at custom filters in the online WCF documentation.

 <routing>
    <namespaceTable>
        <!-- Define prefix for Content Enrichment namespace, 
        used in XPath filters -->
        <add 
            prefix="cc" 
            namespace="http://schemas.microsoft.com/office/server/search/contentprocessing/2012/01/ContentProcessingEnrichment"/>
    </namespaceTable>
    <!-- Filter definitions -->
    <filters>
        <filter 
            name = "Sharepoint" 
            filterType = "XPath" 
            filterData= 
"//cc:Property[cc:Name[. = 'ContentSource'] and 
cc:Value[. = 'Local Sharepoint Sites']]"/>
        <filter 
            name = "Fileshare" 
            filterType = "XPath" 
            filterData= 
"//cc:Property[cc:Name[. = 'ContentSource'] and 
cc:Value[. = 'Large Fileshare']]"/>
    </filters>
    <!-- Filter mappings -->
    <filterTables>
        <filterTable name="ContentSourceFilters">
            <add 
                filterName="Sharepoint" 
                endpointName="Service1" 
                backupList="BackupSharepoint"/>
            <add 
                filterName="Fileshare" 
                endpointName="Service2" 
                backupList="BackupFileshare"/>
        </filterTable>
    </filterTables>
    <!-- Backup lists -->
    <backupLists>
        <backupList name="BackupSharepoint">
            <add endpointName="Service1Backup" />
        </backupList>
        <backupList name="BackupFileshare">
            <add endpointName="Service2Backup" />
        </backupList>
    </backupLists>
</routing>
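
One thing the configuration above leaves open is what happens to an item whose content source matches neither filter. As a hedged sketch, assuming you would rather send such items to Service1 than leave them unmatched, a MatchAll filter can be added at a lower priority than the XPath filters. The filter name CatchAll and the priority values are illustrative. Entries with a higher priority are evaluated first, and lower ones are only consulted if nothing at the higher level matched.

<filters>
    <!-- The existing Sharepoint and Fileshare filters stay as they are -->
    <!-- Hypothetical fallback filter: matches every message -->
    <filter name="CatchAll" filterType="MatchAll"/>
</filters>
<filterTables>
    <filterTable name="ContentSourceFilters">
        <add filterName="Sharepoint" endpointName="Service1" 
             backupList="BackupSharepoint" priority="1"/>
        <add filterName="Fileshare" endpointName="Service2" 
             backupList="BackupFileshare" priority="1"/>
        <!-- Consulted only when neither priority 1 filter matches -->
        <add filterName="CatchAll" endpointName="Service1" 
             backupList="BackupSharepoint" priority="0"/>
    </filterTable>
</filterTables>

Keeping the catch-all at its own priority level also avoids a message matching two filters at once, which a request-reply router cannot resolve.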

Web.config

It’s time to tie it all together in a single configuration. The following example uses all the previous pieces to build a complete configuration file.

<?xml version="1.0"?>
<configuration>
    <system.serviceModel>
        <bindings>
            <basicHttpBinding>
                <binding 
                    name=
"basicHttpBinding_IContentProcessingEnrichmentService" 
                    maxReceivedMessageSize = "8388608">
                    <readerQuotas 
                        maxDepth="32" 
                        maxStringContentLength="2147483647" 
                        maxArrayLength="2147483647" 
                        maxBytesPerRead="2147483647" 
                        maxNameTableCharCount="2147483647" />
                    <security mode="None" />
                </binding>
            </basicHttpBinding>
        </bindings>
        <services>
            <service 
                behaviorConfiguration="RoutingServiceBehavior" 
                name="System.ServiceModel.Routing.RoutingService">
                <host>
                    <baseAddresses>
                        <add 
                            baseAddress=
                            "http://routinghost.contoso.com:800" />
                    </baseAddresses>
                </host>
                <endpoint 
                    name="RoutingServiceEndpoint" 
                    address="" 
                    binding="basicHttpBinding" 
                    bindingConfiguration=
"basicHttpBinding_IContentProcessingEnrichmentService" 
                    contract= "System.ServiceModel.Routing.IRequestReplyRouter" />
            </service>
        </services>
        <client>
            <!-- Addresses use http because the binding sets security mode="None" -->
            <endpoint 
                name="Service1" 
                address="http://servicehost1.contoso.com:800/ContentEnrichmentService.svc" 
                binding="basicHttpBinding" 
                bindingConfiguration="basicHttpBinding_IContentProcessingEnrichmentService" 
                contract="*" />
            <endpoint 
                name="Service1Backup" 
                address="http://servicehost2.contoso.com:800/ContentEnrichmentService.svc" 
                binding="basicHttpBinding" 
                bindingConfiguration="basicHttpBinding_IContentProcessingEnrichmentService" 
                contract="*" />
            <endpoint 
                name="Service2" 
                address="http://servicehost3.contoso.com:800/ContentEnrichmentService.svc" 
                binding="basicHttpBinding" 
                bindingConfiguration="basicHttpBinding_IContentProcessingEnrichmentService" 
                contract="*" />
            <endpoint 
                name="Service2Backup" 
                address="http://servicehost4.contoso.com:800/ContentEnrichmentService.svc" 
                binding="basicHttpBinding" 
                bindingConfiguration="basicHttpBinding_IContentProcessingEnrichmentService" 
                contract="*" />
        </client>
        <behaviors>
            <serviceBehaviors>
                <behavior 
                    name="RoutingServiceBehavior">
                    <routing 
                        filterTableName="ContentSourceFilters" 
                        routeOnHeadersOnly="False"/>
                </behavior>
            </serviceBehaviors>
        </behaviors>
        <routing>
            <namespaceTable>
                <add 
                    prefix="cc" 
                    namespace="http://schemas.microsoft.com/office/server/search/contentprocessing/2012/01/ContentProcessingEnrichment"/>
            </namespaceTable>
            <filters>
                <filter 
                    name = "Sharepoint" 
                    filterType = "XPath" 
                    filterData = 
 "//cc:Property[cc:Name = 'ContentSource' and 
 cc:Value = 'Local Sharepoint Sites']"/>
                <filter 
                    name = "Fileshare" 
                    filterType = "XPath" 
                    filterData =  
"//cc:Property[cc:Name = 'ContentSource' and 
 cc:Value = 'Large Fileshare']"/>
            </filters>
            <filterTables>
                <filterTable name="ContentSourceFilters">
                    <add 
                        filterName="Sharepoint" 
                        endpointName="Service1" 
                        backupList="BackupSharepoint"/>
                    <add 
                        filterName="Fileshare" 
                        endpointName="Service2" 
                        backupList="BackupFileshare"/>
                </filterTable>
            </filterTables>
            <backupLists>
                <backupList name="BackupSharepoint">
                    <add endpointName="Service1Backup" />
                </backupList>
                <backupList name="BackupFileshare">
                    <add endpointName="Service2Backup" />
                </backupList>
            </backupLists>
        </routing>
    </system.serviceModel>
</configuration>

The service file

The markup code of the service file needs to reference the RoutingService class and Routing assembly, rather than your own implementation/assembly, which would be the normal procedure. The code part can be just an empty implementation since it won’t be used.

<%@ ServiceHost 
    Language="C#" 
    Debug="true" 
    Service="System.ServiceModel.Routing.RoutingService, System.ServiceModel.Routing, version=4.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35" 
%>

Final remarks

To summarize, we’ve shown how WCF Routing can overcome some of the limitations of a single web service endpoint. The router itself is still a single point of failure, but that can be addressed with other load balancing mechanisms such as Network Load Balancing (NLB).

If you want to learn more about how to customize search with content enrichment, check out the official documentation on MSDN, and the other blog posts on content enrichment.