SP2013 Crawling *Explained: Enumeration/Discovery (Part 3a)


With VerboseEx logging enabled, the crawl of a single item (or, interchangeably, a "document") can generate more than 6000 ULS events, at times making troubleshooting a particular document analogous to finding a needle in a haystack. In this series of posts, I first described the high-level orchestration for crawling a document and then took a deep dive into the steps involved with starting a crawl. In this part 3a, I'll first cover the concept of enumeration. Then, in the upcoming posts, I'll continue to show the same overall flow, using key ULS events to illustrate a deep dive through an expected path when crawling an item.

Whereas the previous post primarily focused on the actions of the Master Crawl Component's Gatherer Admin role (e.g. orchestrating the content-source-level operation of crawling), this post will focus more on the interactions of the Gatherer role, which runs in each Crawl Component and orchestrates the steps to crawl a specific item (e.g. pulling a batch from the crawl queue, retrieving items from the content repository, feeding them to the CPC for processing, and handling the callbacks from the CPC). We will also focus on Search Connectors (also known as Protocol Handlers), which run in the mssdmn.exe process of each Crawl Component and perform all of the interactions between the Crawler and the content repository.

Enumeration: Discovering the items to be crawled

Simplistically speaking, enumeration is the process of the Crawl Component asking the content repository – whether it be SharePoint, a file share, a web site, or BCS – for links to child items in a given "container". In this context, a container can be a folder in a file share, a root entity in a BDC model, an HTML web page seed list, a SharePoint Web Application, Site Collection, Site [web], or List/Library. When defining a SharePoint Content Source, each of the start addresses is a root for a container.

As an analogy, think of the process that occurs when you open Windows file explorer and double-click on a folder. In response to the double-click, a request is made to the underlying OS to get the list of files in this folder so they can be rendered to the user. Conceptually, enumeration by the Crawl Component works the same way: a request about a container object is made to the content repository, and the content repository responds back with a reference to all of the items within that container (e.g. the "children").

For example, assume the Crawl Component makes a request for the parent file://root, which then emits the child links to file://root/subFolder1 and file://root/subFolder2. The reference to each of the child items (such as a URL or file path in this case) gets emitted back to the Gatherer, which then inserts them into the "crawl queue" (the MSSCrawlQueue table in a Crawl Store DB). From here, the parent item is then fed to the Content Processing Component, where it is processed and submitted to the Index like any other item.
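To make the analogy concrete, here is a minimal PowerShell sketch of that enumerate-and-queue loop against a file share. The \\server\root path is a placeholder, and this is purely an illustration of the pattern, not the Gatherer's actual logic:

# A minimal sketch of the enumerate-and-queue pattern, using the file system as the "content repository".
# The starting path is hypothetical; point it at any folder you can read.
$crawlQueue = New-Object 'System.Collections.Generic.Queue[string]'
$crawlQueue.Enqueue('\\server\root')

while ($crawlQueue.Count -gt 0) {
    $container = $crawlQueue.Dequeue()

    # "Enumeration": ask the repository for the children of this container
    foreach ($child in Get-ChildItem -LiteralPath $container) {
        Write-Output "Emit link $($child.FullName)"   # analogous to the 'Emit link' ULS events
        if ($child.PSIsContainer) {
            $crawlQueue.Enqueue($child.FullName)      # containers go back into the "crawl queue"
        }
        # a non-container would be "gathered" (retrieved) and fed to processing at this point
    }
}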

For SharePoint content, the Crawl Component makes the request to a Web Application via the Site Data web service (/_vti_bin/sitedata.asmx) hosted on that particular Web Application, but the overall concept and flow for enumeration remain exactly the same for all types of content.

From this point, each of the Gatherers [Crawl Components] works independently, where each earmarks a "batch" of documents from the crawl queue that it will begin processing. Although it is beyond the scope of this post, I have written much more about how multiple crawl components coordinate pulling/processing items and handle any database deadlocks that may occur. The point is that it is not deterministic which Crawl Component will pull these items from the crawl queue for processing.

So at some point after the parent item has been processed, the Gatherer from any of the Crawl Components will pull one or more of these links from the MSSCrawlQueue table using the stored procedure proc_MSS_GetNextCrawlBatch, and repeat the cycle by submitting requests to the content repository for those child items. In other words, it will start crawling the children. Keep in mind that it is entirely possible for a child object to be another container object, such as file://root/subFolder1, which has child items of its own (e.g. file://root/subFolder1/a, file://root/subFolder1/b, …, file://root/subFolder1/n). Here, these will be emitted back to the Gatherer, which then inserts them into the MSSCrawlQueue… from here, it's the figurative rinse and repeat.

It is also possible for this to be a link to a specific item, such as file://root/file.pptx. In this case, a copy of the item will be retrieved ("gathered") from the content repository, written to temporary storage by the Crawl Component, and then fed to the Content Processing Component (and eventually submitted to the Index after getting processed). The only real difference between a specific item and a container object is that the container also emits links that get inserted into the crawl queue.

Just to reiterate…

Logical items such as containers (e.g. a folder, a Web App, Site Collection, and so on) and SP List Items are essentially just collections of metadata that exist in the memory of a given Crawl Component (and do not get written to the Crawler's temporary disk). Although these items have no associated physical file, they are all still items: each gets assigned a DocId in the Crawl Store DB, gets fed to a CPC, and gets submitted to the Index just like any other item that does have an associated physical file.

 

For now, I will defer a deep dive into the scenario of crawling a specific item until part 4 in this blog series and just focus on enumerating the container objects here.

Search Connectors: Where the rubber meets the road

Although the Gatherer orchestrates the crawl of an item, it largely just coordinates/routes the item through the various stages of the crawl flow (e.g. pulling an item from the crawl queue, coordinating the retrieval from the content repository, feeding it to the CPC, waiting for callbacks of items being processed, and committing the item's status in the MSSCrawlUrl table in a Crawl Store DB). For the Gatherer, this process is largely the same regardless of the type of content being crawled (e.g. SharePoint, file shares, Web/HTTP, and BDC), and it is largely the same for specific items or containers.

The logic of how to interface with a particular type of content repository (e.g. SharePoint versus file share versus Web/HTTP versus BDC source) along with the logic for enumeration gets implemented in the various Search Connectors. To take this a step further, the Gatherer actually has no concept or understanding of enumeration – it simply gets a link from the crawl queue and invokes the appropriate type of Search Connector to handle the retrieval of the item. It is then up to the Search Connector to figure out what type of entity the link represents and how to handle it. If you've ever implemented a BDC connector, this is where the finder() method (to enumerate a container entity) versus the specificFinder() method (to retrieve the specific item) comes into play.
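To picture that division of labor, here is a purely conceptual PowerShell sketch (not the actual Gatherer code; the schemes are simply the access URL prefixes discussed in this post) showing that the routing decision is based solely on the link itself. Everything about enumeration happens on the connector side of this switch:

# Conceptual sketch only: the Gatherer routes by the link's protocol/scheme and knows nothing
# about enumeration; the invoked connector decides whether the link is a container or an item.
function Invoke-ConnectorFor([string]$accessUrl) {
    switch -Regex ($accessUrl) {
        '^sts4://'   { 'SharePoint connector: calls sitedata.asmx and may emit child links' }
        '^file://'   { 'File connector: enumerates the folder or gathers the file' }
        '^https?://' { 'Web/HTTP connector: issues an HTTP GET and parses <a href> links' }
        '^bdc'       { 'BDC connector: finder() for a container entity, specificFinder() for an item' }
        default      { "No connector registered for $accessUrl" }
    }
}

Invoke-ConnectorFor 'sts4://sp/contentdbid={-content-database-id-1-}/'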

The logic for incremental crawls also gets implemented in the various Search Connectors. Simplistically speaking, the purpose of an incremental crawl is to crawl all the items that have changed since some point in time (e.g. since the last crawl). Being said, you can see that this is just a special type of enumeration. This also makes sense because different content repositories define "changed" differently. For example, an Apache web server, a Windows file share, SharePoint, and a BCS connector each determine the last modified date for an item differently – each implements this process in its own way.

  • Although outside the scope of this post, the MSDN article linked here provides more details on implementing "Time stamp-Based Incremental Crawls" versus "Changelog-Based Incremental Crawls". This also reiterates the point that the logic for incremental crawls gets implemented in the Search Connector/Protocol Handler.

The TechNet article linked here lists the connectors that are installed by default in SharePoint products, and the content repositories to which they connect.
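As a side note, you can also check which protocol handlers are registered on a given crawl server. This is a hedged sketch: the registry path is what I would expect on an SP2013 crawl server (the "15.0" hive) and should be treated as an assumption for your build:

# Hedged sketch: list the protocol handlers registered locally on a crawl server.
# The key path is an assumption for SP2013 ('15.0'); adjust the version for your install.
$phKey = 'HKLM:\SOFTWARE\Microsoft\Office Server\15.0\Search\Setup\ProtocolHandlers'
if (Test-Path $phKey) {
    $values = Get-ItemProperty -Path $phKey
    (Get-Item $phKey).Property | ForEach-Object { '{0} -> {1}' -f $_, $values.$_ }
}

# Custom connectors registered with the SSA can be listed with the SharePoint cmdlets:
#   Get-SPEnterpriseSearchCrawlCustomConnector -SearchApplication (Get-SPEnterpriseSearchServiceApplication)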

Connectors for each type of content

As I mentioned above, the Gatherer processing logic is largely the same regardless of the type of content being crawled. However, I wanted to illustrate the high level processing occurring with a SharePoint crawl followed by that of a Web/HTTP crawl to highlight some key differences.

SharePoint Connector

The SharePoint Protocol Handler ("Connector") enumerates a Web Application through a series of iterative SOAP calls to the Site Data web service (/_vti_bin/sitedata.asmx) hosted on that particular Web Application.

It first asks the Web Application (via sitedata.asmx), "For this Web Application URL [Virtual Server], what Content Databases do you have?"

With Verbose logging enabled for the "Connectors:SharePoint" category (which gets logged by the mssdmn.exe process), you can view this request in ULS, such as:

Tag: ac3jn Category: Connectors:SharePoint    Level: Verbose
Message: GetContent CorrelationID –guid-
ObjectType VirtualServer Url http://sp/_vti_bin/sitedata.asmx
Search RequestDuration 52 SPIISLatency 0 SP RequestDuration 14

…and similarly, with VerboseEx, the resulting XML SOAP response (from sitedata.asmx) can be viewed as well. It looks somewhat like the following (the actual response is much more detailed with many other properties… this has been simplified to illustrate the point):

Tag: dm07 Category: Connectors:SharePoint    Level: VerboseEx
Message: GetContentDatabases received WS response
<VirtualServer>
  <ContentDatabases>
    <ContentDatabase ID="{-content-database-id-1-}" />
    <ContentDatabase ID="{-content-database-id-2-}" />
  </ContentDatabases>
</VirtualServer>
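As an aside, you can reproduce this same GetContent request outside of the Crawler to see the raw response for yourself. This is a hedged sketch: it assumes a test Web Application at http://sp, an account with permission to call the Site Data web service, and that the proxy generated by New-WebServiceProxy exposes GetContent with the parameter order shown (objectType, objectId, folderUrl, itemId, retrieveChildItems, securityOnly, lastItemIdOnPage):

# Hedged sketch: call sitedata.asmx directly to see the XML the SharePoint connector parses.
# 'http://sp' is a placeholder Web Application URL; run from a server/account with access.
$siteData = New-WebServiceProxy -Uri 'http://sp/_vti_bin/sitedata.asmx?WSDL' `
                                -UseDefaultCredential -Namespace 'SiteDataWS'

$lastItemIdOnPage = ''
# Args: objectType, objectId, folderUrl, itemId, retrieveChildItems, securityOnly, [ref] lastItemIdOnPage
$responseXml = $siteData.GetContent([SiteDataWS.ObjectType]::VirtualServer, '', '', '', $true, $false, [ref]$lastItemIdOnPage)

# Expect XML resembling the <VirtualServer>/<ContentDatabases> response shown above
([xml]$responseXml).OuterXml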

From this response, the SP Protocol Handler emits a link representing each Content DB, which then gets inserted into the crawl queue. The ULS message here (also logged by the mssdmn.exe process) would resemble the following:

Tag: dv7f Category: Connectors:SharePoint    Level: VerboseEx
Message: Emit link sts4://sp/contentdbid={-content-database-id-1-}/, DisplayURL=,
HighDateTime = 00000000, LowDateTime = 00000000, url=sts4://sp

For each emitted access URL above, the Gatherer then reports the corresponding messages (logged by the mssearch.exe process) as it inserts the link into a temporary table in the Crawl Store as a "candidate" for crawling.

Tag: dw3a Category: Crawler:GathererPlugin Level: VerboseEx
Message: CGatherAddLink::InsertLink: InsertRow on TempTable succeeded,
URL sts4://sp/siteurl=/contentdbid={-content-database-id-},
CrawlID 22575, SourceDocID 101

After validating that this item should be crawled (e.g. verifying crawl rules, server hops, page depth, and so on), the item is then committed/flushed to the MSSCrawlQueue here:

Tag: dw2z Category: Crawler:GathererPlugin Level: VerboseEx
Message: CGatherAddLink::AddLinkComplete: Commit TempTable succeeded,
CrawlID 22575, SourceDocID 101

From this point, each of the Gatherers works independently from the others, so it is not deterministic which one will perform the retrieval of the items in the MSSCrawlQueue. However, using the stored procedure proc_MSS_GetNextCrawlBatch, one of the Gatherers will then pull this link for the Content DB from the MSSCrawlQueue and, via the SP Protocol Handler, ask the applicable Web Application (again, via sitedata.asmx), "For this Content DB with the GUID [XYZ], what Site Collections do you have?" Again, the Web Application responds with another SOAP response containing each of the Site Collection[s] along with other metadata applicable to each Site Collection, emitting links to each.

The Crawler and Web Application continue this conversation to drill down on each of the Webs (e.g. the top-level site and all sub-sites) in the Site Collection, each of the lists/libraries in a Web, and each of the items in those lists/libraries. As before, the request and response for each can be found in ULS; however, the specific event IDs in the ULS messages related to each request/response will differ from the first example. Being said, it's often easiest to just filter on "received WS response" to see the various [site data] web service responses for each object type.

Sample Access URL Structures for SharePoint Content

  • Web App (Virtual Server)
    sts4://sp/
  • Content Database
    sts4://sp/contentdbid={ContentDBID}/
  • SP Site Collection
    sts4://sp/siteurl=/siteid={SPSiteID}/
  • SP Web
    sts4://sp/siteurl=/siteid={SPSiteID}/weburl=/webid={SPWebID}/
  • SP List (Library)
    sts4://sp/siteurl=/siteid={SPSiteID}/weburl=/webid={SPWebID}/listid={SPListID}/
  • SP Item
    sts4://sp/siteurl=/siteid={SPSiteID}/weburl=/webid={SPWebID}/
    listid={SPListID}/folderUrl=subFolderName/itemId=12345

 

On the content side, SharePoint tracks changes to items in the EventCache table of the applicable Content Database. As part of the SOAP response from the sitedata.asmx interactions during the crawl, the Web Application [being crawled] provides the SharePoint Protocol Handler with a Change Log Cookie that resembles "1;0;5e4720e3-3d6b-4217-8247-540aa1e3b90a;634872101987500000;10120", where the GUID identifies the source Content DB and 10120 specifies the particular row in the EventCache at which the change occurred.

During the incremental crawl, the SharePoint Protocol Handler supplies this Change Log Cookie when making the request to the Web Application (via the sitedata.asmx). With this, the Web Application can identify all events that have occurred for this Content DB since this point and thus, identify any content that has changed since the last crawl.
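As a small illustration, the cookie is just a semicolon-delimited string, so the two pieces called out above are easy to pull apart in PowerShell (the cookie value is the sample from above; my labeling of the other fields is an assumption rather than documented fact):

# Pull apart the sample Change Log Cookie shown above.
$cookie = '1;0;5e4720e3-3d6b-4217-8247-540aa1e3b90a;634872101987500000;10120'
$parts  = $cookie -split ';'

[pscustomobject]@{
    ContentDatabaseId = [guid]$parts[2]    # identifies the source Content DB
    EventCacheRowId   = [int64]$parts[4]   # the EventCache row at which the change occurred
}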

Web/HTTP Connector

SharePoint crawls Web/HTTP content as a spider crawl. The first page (typically either the default page for a URL or a specialized page called a "seed list" containing links to all other applicable items) gets retrieved via an HTTP GET request. With VerboseEx logging enabled on the "Connectors:HTTP" category, we can then see events such as the following:

Tag: du4u Category: Connectors:HTTP    Level: Verbose
Message: CHttpAccessorHelper::InitRequestInternal - successful request for 'http://parkinglot.chotchkies.lab'.

…once retrieved, it is then parsed by the Crawler:FilterDaemon (also running in the mssdmn.exe process).

Tag: e4z3 Category: Crawler:FilterDaemon    Level: VerboseEx
Message: Try to load DocFilter, URL = http://parkinglot.chotchkies.lab, CLSID = ,
DocFormat = text/html, Extension = , nsOverride = crawl-1, multithreaded = TRUE

The filter identifies the <a href> tags within the document, such as the following (still logged by the mssdmn.exe process):

Tag: amao7 Category: Crawler:FilterDaemon    Level: VerboseEx
Message: ProcessFilter Property A.HREF, Link http://parkinglot.chotchkies.lab/page1.html/
Source URL http://parkinglot.chotchkies.lab

Tag: amao7 Category: Crawler:FilterDaemon    Level: VerboseEx
Message: ProcessFilter Property A.HREF, Link http://parkinglot.chotchkies.lab/page2.html/
Source URL http://parkinglot.chotchkies.lab

…and these links found within this document get emitted and inserted into the MSSCrawlQueue (these appear the same as the "Crawler:GathererPlugin" messages shown above). And as described above, each of the Gatherers works independently from the others, and one will pull a batch of items from the crawl queue using the stored procedure proc_MSS_GetNextCrawlBatch… then rinse and repeat as discussed above.

Typically, the content source provides configurable boundary conditions (e.g. the number of server hops allowed in this crawl and/or the page depth) that prevent this spider crawl from effectively trying to crawl the Internet, so to speak. To illustrate this in a test environment, I created a content source that allowed a page depth of 5 and server hops of 3 (which is a giant footprint and should be avoided in most cases unless you intend to compete with your favored internet search provider).
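For reference, those two boundaries can be set on a Web content source through PowerShell. This is a sketch under a couple of assumptions: a single SSA in the farm, an existing Web content source I'm calling "ParkingLot", and that MaxPageEnumerationDepth and MaxSiteEnumerationDepth correspond to the page depth and server hop settings, respectively:

# Hedged sketch: set the spider-crawl boundaries used in my test on a Web content source.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$ssa = Get-SPEnterpriseSearchServiceApplication                 # assumes a single SSA
$cs  = Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Identity 'ParkingLot'

$cs.MaxPageEnumerationDepth = 5   # page depth
$cs.MaxSiteEnumerationDepth = 3   # server hops
$cs.Update()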

After running the crawl, I reviewed the "Crawl Queue" report from the SharePoint Crawl Health Reports. From this, we see a giant spike of gold (representing the "Links to process") at the beginning of the crawl, which corresponded with the number of links inserted into the Crawl Store's TempTable. However, many of these links were thrown away because they exceeded one or more of the parameters (e.g. server hops or page depth) - only a sub-set were allowed to be inserted into the crawl queue as candidates for crawling. The area in blue represents the number of items currently in the crawl queue and waiting to be picked up by a Gatherer.

When an item is crawled by the Web/HTTP Search Connector, the Crawler tracks the last modification date for the item using the value in the Last-Modified response header reported by the source web server. When performing an incremental crawl, the Web/HTTP Protocol Handler then makes the request for that item with the If-Modified-Since request header set to the Last-Modified value from the previous crawl. Assuming a Last-Modified date of "Thu, 22 Apr 2010 14:01:24 GMT", the request at crawl time might appear like the following:

GET http://www.harbar.net/presentations/spevo/ITIT112%20mythbusters.pdf HTTP/1.0
Cache-Control: no-cache
Accept: */*
From: motivated@officespace.lab
If-Modified-Since: Thu, 22 Apr 2010 14:01:24 GMT
User-Agent: Mozilla/4.0 (compatible; MSIE 4.01; Windows NT; MS Search 6.0 Robot)
Proxy-Connection: Keep-Alive
Host: www.harbar.net

And the response (assuming no changes) might show:

HTTP/1.1 304 Not Modified
Proxy-Connection: Keep-Alive
Connection: Keep-Alive
Via: 1.1 LCS-PRXY-01
Age: 0
Date: Tue, 11 Aug 2015 13:24:22 GMT

A closing anecdote and word of caution for BCS…

When I first started working with SharePoint Search, I would start my crawls, watch the status transition to Started, watch the Crawlers (e.g. with Fiddler or in live ULS) pick up the first few items, and then, at some point later, see the crawl really ramp up and start crawling the meat of the content source. This was even more apparent with multiple crawl components, where one was busy talking to the Site Data web service... but the others were virtually idle.

With large enough crawls, it was apparent that the entire content source did not have to be enumerated before crawling began, but there was something going on before the crawl really ramped up across all components. Originally, I just assumed that the Crawler had some sort of threshold that had to be reached before it ramped up - something like: ok, now we have x number of items - unleash the crawlers!  ...but I now realize that I was wrong 🙂

My "aha!" moment came when I realized that the Gatherer processing is the largely the same regardless of the type of content being crawled (e.g. SharePoint, file shares, Web/HTTP, and BDC) and it is largely the same for specific items or containers. As I noted above, the Gatherer actually has no concept or understanding of enumeration – it simply gets a link from the crawl queue and invokes the appropriate type of Search Connector to handle the retrieval of the item. It is then up to the Search Connector to figure out what type of entity the link represents and how to handle it. The logic for enumeration and incremental [enumeration] gets implemented in the various Search Connectors.

The behavior I observed was tied to the enumeration of containers. When enumerating a given container, the links for the child items do not get emitted back to the Gatherer until the request to the content repository fully completes. In other words, if a request is made for a folder containing 10 thousand items, you won't see "Emit link" messages in ULS trickling out one by one as items get discovered. Instead, you will see nothing reported by this thread in ULS until the request to the content repository returns... and for large containers, this may be seconds later, minutes later, or even hours later in extreme cases. Then, once the request completes, you will see a flood of messages as all of the links in this container get reported at the same time, which is when the items get added to the crawl queue.

The apparent delays I observed in the beginning of this anecdote were simply caused by there being no items in the crawl queue for the Gatherers to start processing. Once the enumerations of the top level objects began to flood items into the crawl queue, the Gatherers finally had something to process, thus giving the impression that they were being "unleashed".

This becomes much more apparent with very large BCS crawls (e.g. more than 1 to 2 million items in a single model) using the out-of-the-box connector or a lightly customized .NET BCS Connector... which is a problem scenario I happen to be hearing more and more lately. In this scenario, the model defines a single root entity, and all of the items in this view or table would be considered children of that root.

For a moment, go back to the file system analogy I used earlier in this post. Imagine I had a folder on my local hard drive containing more than a million files (and no sub-folders) and double-clicked on that folder. I haven't actually performed this test, but I would expect the underlying request and rendering in file explorer to take a LOOOONG time to fulfill and likely appear hung (e.g. "Not responding") at times. However, if those same files were grouped into 100 sub-folders instead, file explorer would load almost instantly.

A BCS crawl is largely doing the same thing. With one root, the connector makes a request to the content repository for ALL of the items in this one single request. In the SharePoint Limits and Boundaries document, the BCS limits state "The default maximum number of items per request the database connector can return is 2,000, and the absolute maximum is 1,000,000." Admittedly, this limit is not enforced for Search and is geared more towards the number of items being rendered per page. But, in my opinion, it does provide a ballpark data point as to where the Product Group starts thinking "this is big" for BCS.

Here is another data point. In the article "Enhancing the BDC model file for Search in SharePoint 2013", the section labeled Enumeration optimization when crawling external systems directly notes "Do not enumerate more than 100,000 items per call to the external system. Long-running enumerations can cause intermittent interruptions and prevent a crawl from completing."

Being said, I have seen plenty of BCS implementations in the 1-to-2 million item range for a single model where crawling worked without issue. And there are certain tricks you can do that might get you up to ~5 million items or more (e.g. only enumerate the ID of each item, expand the memory thresholds for the mssdmn, and batching). But I expect things to get wonky after this point (and especially as you get closer to the 10 million mark in a single model). The limitation here isn't Search in general or even the Crawler, which I've seen handle hundreds of millions of items.

The underlying issue here is simply tied to extremely large enumerations with too many items associated to a single root. And this is easy to prove. Just start a crawl and start watching your "Crawl Queue" report… it will undoubtedly be hours before the content source fulfills the enumeration request, and during this time, the crawl queue will appear virtually empty (holding just the one row for the item currently being enumerated).
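If you want to see this first-hand in a lab (directly querying Search databases is unsupported and should never be done against production), a read-only peek at the crawl queue depth tells the same story as the Crawl Queue health report. The server and database names below are placeholders:

# Lab-only, unsupported sketch: read-only peek at crawl queue depth in the Crawl Store DB.
# Requires the SQL Server PowerShell module (Invoke-Sqlcmd); names below are placeholders.
Invoke-Sqlcmd -ServerInstance 'SQLSERVER\INSTANCE' -Database 'Search_Service_CrawlStoreDB' `
              -Query 'SELECT COUNT(*) AS QueueDepth FROM dbo.MSSCrawlQueue WITH (NOLOCK)'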

Update: Here is a great series by Chad Hinton that dives into a BCS implementation to support segmented crawls for large databases (I've linked to Part 7 because it most aligns with what I've described above, but I do recommend reading the whole series if you are dealing with a large BCS crawl)

In coming posts: Continuing to peel back the onion…

So far in this series, we covered:

  1. A high-level view to orchestrating the crawl
  2. A deep dive look at starting the crawl
  3. Enumeration/Discovery
    1. Concepts                                     /* this post */
    2. Illustrating through ULS               /* next in the series */
    3. Simulating with PowerShell

 

And in coming posts, we will then deep dive into each of the following areas in separate posts:

  • Enumeration/Discovery: The process where the Crawl Component:
    • Asks for the links to items from the content repository (e.g. "what items do you have within this start address?")
    • And then stores these emitted/discovered links in the "crawl queue" (the MSSCrawlQueue table in the Crawl Store Database)
  • Gathering: The process of the Crawl Component retrieving [think: downloading] the enumerated items (e.g. using links from the crawl queue)
    • Each Crawl Component independently earmarks a small sub-set of items from the MSSCrawlQueue to be gathered/processed; once earmarked, the item is considered part of a "search transaction"
  • Feeding: The process of the Crawl Component submitting the gathered items to the Content Processing Component(s)
  • Processing: The process of the Content Processing Component converting the documents into a serialized object of Managed Properties (aka: "Mars Document")
    • The CPC will produce one Mars Document for each item gathered/fed for processing. In other words, the Mars Document is the end product from processing
  • Index Submission: The process of submitting the processed document (e.g. the Mars Document) to Index Components where it will be inserted into the index
    • Just before submission, a collection of Mars Documents get batched into a "Content Group", which gets submitted to the Index as an atomic operation
  • Callback Handling: Following Index Submission:
    • A callback message gets returned back to Content Processing Component regarding the status of the Index Submission
    • Then, another callback gets returned from the Content Processing Component back to the Crawl Component regarding the status of processing (which implicitly includes the status from Index Submission as well – In other words, processing cannot report success if Index Submission failed)
    • Finally, the transaction for the item completes when its status is persisted to the MSSCrawlUrl table and the item gets cleared from the MSSCrawlQueue table

       

…to be continued 🙂

Comments (4)

  1. Chad Hinton says:

    Love these posts!  Can't wait until the next one so I can compare the ULS message flow from my segmented BCS crawls to your findings.

  2. bspender says:

    Thanks Chad! And thank you for the series you wrote... I've just updated my post with a link to your series, and going to link to it again here 🙂

    Series Index:

    BCS Models - Part 1: Target Database  

            (chad-hinton.blogspot.com/.../bcs-models-part-1-target-database.html)

    BCS Models - Part 2: Initial BCS External Content Types

    BCS Models - Part 3: Crawl Results

    BCS Models - Part 4: Bigger Database

    BCS Models - Part 5: The Bigger Database

    BCS Models - Part 6: How to eat this elephant?

    BCS Models - Part 7: Changes to the BCS Model to support segmented crawl

    BCS Models - Part 8: Crawl Results

    BCS Models - Part 9: Crawl-Time Security

  3. Aligo says:

    Great series of posts! Are you going to continue posting deep dives for the remaining topics/steps?

    1. bspender says:

      Admittedly, I got sidetracked from writing when I began building the Search Health Reports (SRx) and shifting my focus to the Cloud Hybrid SSA... Being said, through that, I've been trying to do knowledge sharing via tools rather than blogs. Most likely, if I'm able to carve out the time to finish that series, I'll pivot at where the series currently stands and start diving more into the flow for the Cloud SSA (*most of what I've already written still applies to both - so it would largely be a look at how the Crawler processes the gathered content and pushes it to SPO).
