PowerShell to Rebalance Crawl Store DBs in SP2013


In SharePoint 2013, simply adding a new Crawl Store DB does not cause the Search Service Application (SSA) to rebalance links among the stores. Administrators also cannot manually trigger a rebalance until the standard deviation of link counts across the existing Crawl Stores exceeds the threshold defined by the SSA property CrawlStoreImbalanceThreshold.

Once this threshold is exceeded, the Search Admin UI displays a control that allows the administrator to initiate the rebalancing process. Specifically, the CrawlStoresAreUnbalanced() method checks whether the standard deviation of link counts among all crawl stores is higher than the value defined by the SSA property CrawlStoreImbalanceThreshold. That said, you may have to set the threshold much lower than expected before CrawlStoresAreUnbalanced() evaluates to TRUE. Another SSA property, CrawlPartitionSplitThreshold, sets the item count above which a single host can be split across multiple Crawl Store DBs during the rebalancing process.
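To make the check concrete, here is a minimal sketch of a standard deviation test over per-store link counts. The formula SharePoint uses internally is not shown in this post, so the population form below is an assumption. The function name Get-LinkCountStdDev is hypothetical. The sample counts assume one store holding 203,860 links (the sum of the link counts listed in this post) plus two newly added, empty stores.

```powershell
# Hypothetical helper: population standard deviation of per-store link counts.
# Assumption: SharePoint's internal calculation may differ (e.g. sample form).
function Get-LinkCountStdDev {
    param([double[]]$LinkCounts)

    $mean = ($LinkCounts | Measure-Object -Average).Average
    $sumOfSquares = 0.0
    foreach ($count in $LinkCounts) {
        $sumOfSquares += [Math]::Pow($count - $mean, 2)
    }
    [Math]::Sqrt($sumOfSquares / $LinkCounts.Count)
}

# One populated crawl store and two newly added, empty ones
$linkCounts = 203860, 0, 0
$stdDev = Get-LinkCountStdDev -LinkCounts $linkCounts

# Compare against the two threshold values used in this post
foreach ($threshold in 10000000, 10000) {
    '{0,10} : unbalanced = {1}' -f $threshold, ($stdDev -gt $threshold)
}
```

Under these assumptions the deviation is roughly 96,000 links. That is far below the 10000000 threshold but above 10000, which would explain why the threshold has to be lowered so far before the check reports an imbalance.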

The following walkthrough shows these cmdlets and method calls end to end. They are largely derived from the CrawlStorePartitionManager class ( http://msdn.microsoft.com/en-us/library/microsoft.office.server.search.administration.crawlstorepartitionmanager ).

Prior to the rebalancing process, we can see that all links currently exist in a single Crawl Store DB:

Crawl Store DB Name    ContentSourceID    HostID    linkCount
V5_SSA_CrawlStore      1                  4         20,558
V5_SSA_CrawlStore      4                  1         157,671
V5_SSA_CrawlStore      6                  2         14,813
V5_SSA_CrawlStore      6                  3         10,818

 

$SSA = Get-SPEnterpriseSearchServiceApplication

New-SPEnterpriseSearchCrawlDatabase -SearchApplication $SSA -DatabaseName V5_SSA_CrawlStore2

New-SPEnterpriseSearchCrawlDatabase -SearchApplication $SSA -DatabaseName V5_SSA_CrawlStore3 

$foo = new-Object Microsoft.Office.Server.Search.Administration.CrawlStorePartitionManager($SSA)

$foo.CrawlStoresAreUnbalanced()

False 

$ssa.GetProperty("CrawlStoreImbalanceThreshold")

10000000      # 10 million (this is the default value)

$ssa.SetProperty("CrawlStoreImbalanceThreshold",10000)

# Verify in registry of Crawl Component that this changes to the new value

# ex: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\15.0\Search\Applications\1d330903-aad9-47e2-9373-f30e945c933c-crawl-0\CatalogNames

$foo.CrawlStoresAreUnbalanced()

True      # After lowering the threshold, it’s no longer “balanced”

$ssa.GetProperty("CrawlPartitionSplitThreshold")

10000000      # 10 million (this is the default value)

$ssa.SetProperty("CrawlPartitionSplitThreshold",50000)

# This allows any partition with more than 50,000 items to be split across Crawl Stores when rebalancing

$foo.BeginCrawlStoreRebalancing()

Guid

----

f9923696-76f1-482d-96cd-c10aedd92fa2

$foo.TimeToCompletion("f9923696-76f1-482d-96cd-c10aedd92fa2")

# Repeat as needed using GUID from above…

$foo.Completed("f9923696-76f1-482d-96cd-c10aedd92fa2")

True
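
Rather than re-running the check by hand, the wait can be wrapped in a small polling loop. The following is only a sketch: Wait-ForCondition is a hypothetical helper, and the stand-in condition simply reports completion on its third check so the example runs without a farm.

```powershell
# Hypothetical helper: poll a condition until it returns $true or a timeout passes.
function Wait-ForCondition {
    param(
        [scriptblock]$Condition,
        [int]$IntervalSeconds = 30,
        [int]$TimeoutSeconds = 3600
    )

    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while ((Get-Date) -lt $deadline) {
        if (& $Condition) { return $true }
        Start-Sleep -Seconds $IntervalSeconds
    }
    return $false
}

# Stand-in for the farm call so the sketch runs anywhere: done on the 3rd check.
$script:checks = 0
$finished = Wait-ForCondition -IntervalSeconds 1 -TimeoutSeconds 30 -Condition {
    $script:checks++
    $script:checks -ge 3
}
"Finished: $finished after $script:checks checks"

# On a farm, assuming $foo from above and $rebalanceId (a hypothetical variable)
# holding the GUID returned by BeginCrawlStoreRebalancing():
# Wait-ForCondition -Condition { $foo.Completed($rebalanceId) }
```

The timeout keeps the loop from running forever if the rebalance stalls. In that case the function returns False and the status can be checked manually.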

 

After the rebalance, use SQL queries such as the following to confirm the new distribution:

SELECT ContentSourceID, HostID, COUNT(*) AS linkCount FROM [V5_SSA_CrawlStore].[dbo].[MSSCrawlURL] with (nolock) group by ContentSourceID, HostID order by ContentSourceID, HostID

SELECT ContentSourceID, HostID, COUNT(*) AS linkCount FROM [V5_SSA_CrawlStore2].[dbo].[MSSCrawlURL] with (nolock) group by ContentSourceID, HostID order by ContentSourceID, HostID

SELECT ContentSourceID, HostID, COUNT(*) AS linkCount FROM [V5_SSA_CrawlStore3].[dbo].[MSSCrawlURL] with (nolock) group by ContentSourceID, HostID order by ContentSourceID, HostID

Crawl Store DB Name    ContentSourceID    HostID    linkCount
V5_SSA_CrawlStore      1                  4         20,558
V5_SSA_CrawlStore      6                  2         14,813
V5_SSA_CrawlStore      6                  3         10,818
V5_SSA_CrawlStore2     4                  1         77,836
V5_SSA_CrawlStore3     4                  1         79,835
 
This confirms that the Crawl Store DBs were rebalanced and illustrates the splitting of a single HostID across crawl stores (in this case, HostID 1 was split across CrawlStore2 and CrawlStore3).
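
The three per-database queries can also be combined into one statement that labels each row with its crawl store, giving a single result set shaped like the table above. The sketch below only builds the query text. It assumes the database names used in this post, and the CrawlStore column alias is my own addition.

```powershell
# Crawl store database names used in this post
$crawlStores = 'V5_SSA_CrawlStore', 'V5_SSA_CrawlStore2', 'V5_SSA_CrawlStore3'

# Build one SELECT per database, tagging each row with its database name
$selects = foreach ($db in $crawlStores) {
    "SELECT '$db' AS CrawlStore, ContentSourceID, HostID, COUNT(*) AS linkCount " +
    "FROM [$db].[dbo].[MSSCrawlURL] WITH (NOLOCK) " +
    "GROUP BY ContentSourceID, HostID"
}

# Join them into a single statement with one ORDER BY at the end
$query = ($selects -join "`nUNION ALL`n") + "`nORDER BY CrawlStore, ContentSourceID, HostID"
$query
```

The generated text can be pasted into a query window. Adding another crawl store later only requires appending its name to $crawlStores.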
 
Update: I’ve recently had several people reach out to me after reading this TechNet article, which states:

“In SharePoint Server 2010, host distribution rules are used to associate a host with a specific crawl database. Because of changes in the search system architecture, SharePoint Server 2013 does not use host distribution rules. Instead, Search service application administrators can determine whether the crawl database should be rebalanced by monitoring the Databases view in the crawl log”

 
Update: For reference, use the following PowerShell to determine the document counts being used by CrawlStoresAreUnbalanced() to calculate the standard deviation among all crawl stores:

$crawlLog = New-Object Microsoft.Office.Server.Search.Administration.CrawlLog $SSA

$dbHashtable = $crawlLog.GetCrawlDatabaseInfo()

$dbHashtable.Keys
    Guid
    ----
    5bf0290a-ad4c-4462-a7b2-6892be9431c1
    9e95a69e-7129-4d96-aaff-d577c4663cb3

$dbHashtable["5bf0290a-ad4c-4462-a7b2-6892be9431c1"]
    DocumentCount : 5094767
    Partitions    : {msdn.microsoft.com, technet.microsoft….
    ID            : 5bf0290a-ad4c-4462-a7b2-6892be9431c1
    Name          : V5_SSA_CrawlStoreToo

$dbHashtable["9e95a69e-7129-4d96-aaff-d577c4663cb3"]
    DocumentCount : 188343
    Partitions    : {{853da760-f456-4375-a77b-8e41bc218770}…
    ID            : 9e95a69e-7129-4d96-aaff-d577c4663cb3
    Name          : V5_SSA_CrawlStore
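
To compare the stores at a glance, the entries can be flattened into one row per crawl store. This is a sketch only: it uses a stand-in hashtable populated with the values shown above so it runs without a farm, and it assumes each entry exposes Name and DocumentCount as in that output. The PercentOfAll column is my own addition.

```powershell
# Stand-in for $crawlLog.GetCrawlDatabaseInfo(), using the values shown above.
# Assumption: each value exposes Name and DocumentCount as in that output.
$dbHashtable = @{
    '5bf0290a-ad4c-4462-a7b2-6892be9431c1' = [pscustomobject]@{ Name = 'V5_SSA_CrawlStoreToo'; DocumentCount = 5094767 }
    '9e95a69e-7129-4d96-aaff-d577c4663cb3' = [pscustomobject]@{ Name = 'V5_SSA_CrawlStore';    DocumentCount = 188343 }
}

$total = ($dbHashtable.Values | Measure-Object -Property DocumentCount -Sum).Sum

# One row per crawl store, largest first, with its share of all documents
$summary = $dbHashtable.Values |
    Sort-Object -Property DocumentCount -Descending |
    ForEach-Object {
        [pscustomobject]@{
            Name          = $_.Name
            DocumentCount = $_.DocumentCount
            PercentOfAll  = [Math]::Round(100 * $_.DocumentCount / $total, 1)
        }
    }
$summary | Format-Table -AutoSize
```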

 

Comments (5)

  1. Craig Humphrey says:

    Hi,

    when I go through this process on my farm, $foo.CrawlStoresAreUnbalanced() always comes back true, even straight after the $foo.BeginCrawlStoreRebalancing() has completed!

    For my Content Sources, it looks like one has been successfully split across two crawl store DBs, but the split is 39K to 104K, which doesn't seem particularly balanced. While another has ended up with all 871K items in the one crawl store DB.

    Even the link counts between my three crawl store DBs are imbalanced: 105K to 143K to 871K.

    FYI I'm currently working through a crawler load imbalance with MS Support and had only one crawl store DB. I haven't re-run my crawls yet, but now my crawl store DBs are unbalanced too…

    Any ideas?

    Thanks

    Craig

  2. bspender says:

    Hi Craig – the check for CrawlStoresAreUnbalanced() is purely a check of standard deviation among the crawl stores. If you have one large content source (e.g. tied to a single domain) and one or more small content sources, then the behavior you're seeing is completely expected…

    Keep in mind that we try to keep all items tied to the same host together (which is why the CrawlPartitionSplitThreshold is 10 MILLION items – if there are fewer than this number of items tied to the host, then the re-balancing won't break up those links across crawl stores). So in your case, the 871K items (assuming they are tied to the same host) would not be split.

    So think of the re-balancing at the host (e.g. domain name) level like the host distribution rules of SP2010. We try to keep each bucket of items together, but if it is sufficiently large, we can break it off into another crawl store (this also prevents the max size of a crawl store from limiting the number of items from any given host… because we can now break it off into a new crawl store).

    In other words, the goal of rebalancing is not to make each crawl store closely (or even similarly) balanced in terms of the number of links in each… but rather to balance the buckets as best as possible.

    That said, it is quite possible for the buckets to be balanced the best we can… while the actual standard deviation is still higher than your threshold (so CrawlStoresAreUnbalanced() would continue to report true).

    I hope this helps (apologies for the delayed response)

  3. Christoph Hannappel says:

    Hi,

    thanks for the good post. It helped me to split my Crawl Databases and understand what to expect. Those posts are unfortunately a rare thing 🙂

  4. SV says:

    The rebalancing shows completed. But the SSA status is still

    Administrative status Paused for:Refactoring

    Any idea?

  5. bspender says:

    It sounds like the SSA was paused with $SSA.PauseForIndexRepartitioning() which you would use when adding a new Index partition…

    Such as that described here:

    technet.microsoft.com/…/jj862355.aspx