How to determine the number of changes an incremental crawl will process prior to initiating the crawl


I’ve had customers ask this question several times because they would like a good sense of how long an incremental crawl is going to take.  The number of changes an incremental crawl must process has a lot to do with why one incremental crawl takes 10 minutes and another takes several hours.  This isn’t the only reason an incremental crawl may take longer than expected, but it does give you some insight beforehand into how many changes you are dealing with. 


Understanding the Basics:


The incremental crawl process depends on the protocol handler being used.   This blog focuses solely on detecting changes against SharePoint sites, for which we use the SharePoint 3.0 (sts3) protocol handler.  


We first attempt to get the changes made since the last crawl.  We do this through the MSSDMN.exe process, which calls the SiteData web service.  The URL is:


http://servername/_vti_bin/sitedata.asmx


In my case, the URL is http://russmaxwfe/_vti_bin/sitedata.asmx


For incremental crawls we use the GetChanges method.  It is a SOAP call that passes the last change ID we received during the previous crawl, which enables the web service to return a list of all changes made since then. 
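To make the shape of that call concrete, here is a hypothetical sketch of the SOAP envelope a client could send to sitedata.asmx for GetChanges. The GUID and cookie value reuse the examples later in this post; the exact parameter names and namespace reflect the SiteData web service contract, but treat this as an illustration rather than captured crawler traffic.

```python
# Sketch: building a GetChanges SOAP request body for sitedata.asmx.
# The content database GUID and change-log cookie below are the example
# values from this post, not values from a live farm.

def build_get_changes_envelope(change_cookie: str) -> str:
    """Return a GetChanges SOAP request body for the SiteData web service."""
    return f"""<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetChanges xmlns="http://schemas.microsoft.com/sharepoint/soap/">
      <objectType>ContentDatabase</objectType>
      <contentDatabaseId>{{888eef75-d584-4edf-b242-f5161d4c3c44}}</contentDatabaseId>
      <LastChangeId>{change_cookie}</LastChangeId>
      <CurrentChangeId></CurrentChangeId>
      <Timeout>30</Timeout>
    </GetChanges>
  </soap:Body>
</soap:Envelope>"""

envelope = build_get_changes_envelope(
    "1;0;888eef75-d584-4edf-b242-f5161d4c3c44;633579660402500000;2386")
```

Posting this body to http://servername/_vti_bin/sitedata.asmx (with the appropriate SOAPAction header and credentials) is what the crawler effectively does on your behalf at the start of an incremental crawl.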


 


How to detect changes before starting an incremental crawl:


This can all be accomplished with a series of SQL queries.  I want to remind readers that performing updates or edits directly against these databases is 100% unsupported.  The first table you need to check is the MSSChangeLogCookies table within the Search database.  This table keeps track of the last change that the crawler processed for each content database.  Look at the ChangeLogCookie_new column; you’ll see several rows, and each value will look something like this:


 


1;0;888eef75-d584-4edf-b242-f5161d4c3c44;633579660402500000;2386


 


The GUID, 888eef75-d584-4edf-b242-f5161d4c3c44, identifies the content database we’re crawling against.  The last value, 2386, is the latest change ID.   So first, we need to find which content database this row is referencing.  To do this, we take the GUID and perform the following query against the Objects table of the configuration database:


 


select * from Objects with (NOLOCK) where ID = '888eef75-d584-4edf-b242-f5161d4c3c44'


 


This will output the name of the content database.  In my case, it’s MOSS_ContentDB.  So at this point, we know that the last change processed against MOSS_ContentDB is 2386. 
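Pulling those two pieces of information out of the cookie can also be sketched in a few lines. This is a hypothetical helper, not part of any product tooling; the field positions are inferred from the example value above, and the first two fields and the long numeric timestamp are carried through without interpretation.

```python
# Minimal sketch: unpacking an MSSChangeLogCookies cookie value.
# Field positions are inferred from the example in this post.

def parse_change_log_cookie(cookie: str) -> dict:
    parts = cookie.split(";")
    return {
        "content_db_guid": parts[2],      # GUID of the content database
        "timestamp": parts[3],            # opaque timestamp value, kept as-is
        "last_change_id": int(parts[4]),  # last change ID the crawler processed
    }

info = parse_change_log_cookie(
    "1;0;888eef75-d584-4edf-b242-f5161d4c3c44;633579660402500000;2386")
# info["content_db_guid"] is the GUID to look up in the Objects table,
# and info["last_change_id"] is the 2386 we query EventCache against.
```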


Now we need to determine all of the changes from 2386 up to the latest in MOSS_ContentDB.  The EventCache table within the content database contains every change up to the most recent.   So in our example, we need all of the changes with an ID greater than 2386, and we perform the following query:


 


select * from EventCache with (NOLOCK) where ID > 2386


 


The ID column will show you all changes after 2386.  The last row is the latest change, so in my case it’s 2396.  So before starting the incremental crawl, I know that the crawler will process 10 changes against this content database.   After running an incremental crawl, if I check the MSSChangeLogCookies table in the Search database, I’ll see the following:


The ChangeLogCookie_old column will contain:


1;0;888eef75-d584-4edf-b242-f5161d4c3c44;633579660402500000;2386 


The ChangeLogCookie_new column will now contain:


1;0;888eef75-d584-4edf-b242-f5161d4c3c44;633579795185130000;2396
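Comparing the two cookie values gives the number of changes the crawl just processed. A small sketch, assuming change IDs increase contiguously (true in this example, though gaps are possible in practice):

```python
# Sketch: number of changes processed between two crawls of the same
# content database, taken as the difference of the trailing change IDs.
# Assumes the IDs are contiguous, as in this post's example.

def changes_between(old_cookie: str, new_cookie: str) -> int:
    old_id = int(old_cookie.split(";")[-1])
    new_id = int(new_cookie.split(";")[-1])
    return new_id - old_id

n = changes_between(
    "1;0;888eef75-d584-4edf-b242-f5161d4c3c44;633579660402500000;2386",
    "1;0;888eef75-d584-4edf-b242-f5161d4c3c44;633579795185130000;2396")
# n is 10, matching the ten rows we saw in EventCache
```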


And the process repeats itself…

Comments (3)

  1. I’ve had the question from customers come up, about why there’s differences in their crawl times. It

  2. Bob Quinn says:

    great article and very straightforward way of calculating the number of changes!  are there any other components of an incremental crawl that will add measurable effort/time to the process?  sometimes we see an incremental crawl run for hours with only one change being picked up – we’re thinking it has something to do with security changes within the content / groups / users…but haven’t had success predicting the impact ahead of time yet.

  3. rose says:

    This is a good explanation for SharePoint sites.  How about if we are crawling non-SharePoint sites?  How does the crawler determine the changes?