Mitigating shard corruption in Elasticsearch


Recently, while troubleshooting a customer issue with Search in Team Foundation Server (TFS), we observed that the Elasticsearch (ES) cluster was consistently going into the RED state, even on a clean new install. The ES version packaged with TFS 2017 Update 2 is 2.4.1. (Note that it is a customized version, with some tuning done for Search in TFS.)

This post details the learnings from that investigation.

For a cluster to be in GREEN health, all shards need to be healthy. A shard can be considered healthy if http://{ElasticSearchUrl}/_cat/shards?v shows it in the STARTED state and assigned to a node. The ES cluster configuration for Search in TFS has a single node, and hence there are no replica shards.
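As a sketch of what "healthy" means here, the tabular output of the _cat/shards endpoint can be parsed to flag any shard that is not STARTED. The sample rows and the helper name below are illustrative, not from a real cluster:

```python
# Minimal sketch: parse the text output of _cat/shards?v and report every
# shard that is not in the STARTED state. Column order follows the
# _cat/shards format: index, shard, prirep, state, docs, store, ip, node.
sample = """\
index                shard prirep state      docs store ip        node
codesearchshared_0_0 0     p      STARTED    1200 2.1mb 127.0.0.1 node-0
codesearchshared_0_0 1     p      UNASSIGNED
"""

def unhealthy_shards(cat_shards_text):
    """Return (index, shard, state) for every shard not in STARTED state."""
    bad = []
    for line in cat_shards_text.splitlines()[1:]:  # skip the header row
        cols = line.split()
        if len(cols) >= 4 and cols[3] != "STARTED":
            bad.append((cols[0], cols[1], cols[3]))
    return bad

print(unhealthy_shards(sample))  # -> [('codesearchshared_0_0', '1', 'UNASSIGNED')]
```

On a single-node cluster like the TFS Search one, any UNASSIGNED row is a problem, since there is no replica to promote.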

If a shard gets into a corrupted state, we can find this out from the shard status output above and from the ES log files.

The following mitigation options can be employed to fix the issue:

Option 1

Get the status from http://{ElasticSearchUrl}/_cat/shards?v. In most cases, the ES log file will also indicate the location of a corrupted shard, e.g. Caused by: java.nio.file.AccessDeniedException: E:\TfsData\Search\IndexStore\TFS_Search_DEVSOURCE16\nodes\0\indices\codesearchshared_0_0\9\translog

In this case, when we know the exact location of the corrupted shard/index, do the following.

(You can use Sense to post ES commands. Refer here: https://www.elastic.co/blog/found-sense-a-cool-json-aware-interface-to-elasticsearch)

  1. From the shard health data, get the index and shard details of all unassigned shards. Repeat steps 2-4 for each of them.
  2. Do a POST indexName/_close.
  3. Go to the index data folder location (such as E:\TfsData\Search\IndexStore\TFS_SearchABC\nodes\0\indices\codesearchshared_0_0\7\translog, which can be found in the ES logs) and delete whatever is present in the shard folder.
  4. Do a POST indexName/_open.
  5. Restart ES. Executing http://{ElasticSearchUrl}/_cat/shards?v should now show no unassigned shards; the shard state should be STARTED or INITIALIZING.
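The steps above can be sketched as a small script. Everything here is illustrative (the cluster URL, the helper name, and the shard folder are assumptions taken from this post), and the file-deletion step is listed rather than executed so nothing is removed by accident:

```python
# Hedged sketch of Option 1 for a single corrupted shard: close the index,
# clean the shard folder on disk, then reopen the index.
ES_URL = "http://localhost:9200"  # assumption: single-node cluster on the default port

def mitigation_steps(index_name, shard_folder):
    """Return the ordered (action, target) steps: close, clean on disk, reopen."""
    return [
        ("POST", "%s/%s/_close" % (ES_URL, index_name)),
        ("DELETE-FILES", shard_folder),  # delete whatever is present in this folder
        ("POST", "%s/%s/_open" % (ES_URL, index_name)),
    ]

steps = mitigation_steps(
    "codesearchshared_0_0",
    r"E:\TfsData\Search\IndexStore\TFS_SearchABC\nodes\0\indices\codesearchshared_0_0\7\translog",
)
for action, target in steps:
    print(action, target)
```

Keeping the close/open calls symmetric around the on-disk cleanup matters: deleting translog files while the index is open can make the corruption worse.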

Option 2

Get the status from http://{ElasticSearchUrl}/_cat/shards?v. Unassigned shards are not necessarily corrupted; they may simply have failed to get assigned to a node. We can manually assign them to the node.

Run the following command:

POST /_cluster/reroute
{
    "commands" : [
        {
            "allocate" : {
                "index" : "{IndexName}",
                "shard" : {ShardId},
                "node" : "{NodeId}",
                "allow_primary" : true
            }
        }
    ]
}

Check the status from http://{ElasticSearchUrl}/_cat/shards?v. It should have assigned the shard to the node (the shard state should be STARTED or INITIALIZING).

Repeat for all unassigned shards.
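Since the body must be valid JSON (note the comma after "shard" — a missing comma makes ES reject the request), the reroute command can be built programmatically per unassigned shard. A sketch, with illustrative index/shard/node values:

```python
import json

# Sketch: build the reroute body for one unassigned shard.
# allow_primary=true accepts possible data loss when forcing allocation of
# a primary shard, so it should be used deliberately.
def reroute_body(index_name, shard_id, node_id):
    return json.dumps({
        "commands": [{
            "allocate": {
                "index": index_name,
                "shard": shard_id,
                "node": node_id,
                "allow_primary": True,
            }
        }]
    })

body = reroute_body("codesearchshared_0_0", 1, "node-0")
print(body)
```

One such body is POSTed to /_cluster/reroute for each unassigned shard reported by _cat/shards.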

If the shard is marked as failed, rerun the above command with the retry_failed flag to force the assignment:

POST /_cluster/reroute?retry_failed=true
{
    "commands" : [
        {
            "allocate" : {
                "index" : "{IndexName}",
                "shard" : {ShardId},
                "node" : "{NodeId}",
                "allow_primary" : true
            }
        }
    ]
}

Option 3

The last recommendation is to update the index settings as described below:

  • Do a GET indexName/_settings and save the returned data for use in the PUT below.
  • Do a POST indexName/_close.
  • Update the settings:

PUT indexName/_settings
{
    "index": {
        "shard.check_on_startup": "fix",
        "analysis": {
            "analyzer": {
                "contentanalyzer": {
                    "type": "custom",
                    "tokenizer": "contenttokenizer"
                },
                "pathanalyzer": {
                    "type": "custom",
                    "tokenizer": "pathtokenizer"
                }
            },
            "tokenizer": {
                "contenttokenizer": {
                    "type": "pattern",
                    "pattern": "(\\w+)|([^\\w\\s]?)",
                    "group": "0"
                },
                "pathtokenizer": {
                    "type": "path_hierarchy",
                    "delimiter": "\\"
                }
            }
        }
    }
}

With shard.check_on_startup set to "fix", segments that were reported as corrupted are automatically removed. This option may result in data loss; if the index has no data, or you are prepared to reindex, you are good to run this.

  • Finally, do a POST indexName/_open to reopen the index.
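Because a single missing comma or brace in a hand-edited settings body makes ES reject the whole PUT, it can help to assemble the payload programmatically and serialize it, so the JSON is well-formed by construction. A sketch using the analyzers and tokenizers from this post:

```python
import json

# Sketch: build the Option 3 settings body in Python so the JSON is
# guaranteed valid before it is PUT to indexName/_settings.
# Reminder: "fix" may discard corrupted segments (possible data loss).
settings = {
    "index": {
        "shard.check_on_startup": "fix",
        "analysis": {
            "analyzer": {
                "contentanalyzer": {"type": "custom", "tokenizer": "contenttokenizer"},
                "pathanalyzer": {"type": "custom", "tokenizer": "pathtokenizer"},
            },
            "tokenizer": {
                "contenttokenizer": {
                    "type": "pattern",
                    "pattern": "(\\w+)|([^\\w\\s]?)",  # regex: (\w+)|([^\w\s]?)
                    "group": "0",
                },
                "pathtokenizer": {"type": "path_hierarchy", "delimiter": "\\"},
            },
        },
    }
}

body = json.dumps(settings, indent=2)
print(body)
```

The resulting string is what gets sent as the PUT body; round-tripping it through json.loads is a cheap pre-flight validity check.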
