Thinking Big – Search Scale and Performance on a Budget

I recently came across Paul Nelson’s informative post on search scalability. I don’t know how long it’s been up there, but reading it made me think of customers I’ve spoken with recently who are looking to scale up their search deployments, but, due to tight budgets, want to do so without simply buying more hardware.

Paul focuses on document count as the main consideration for architecting scalable search, saying:

There is really only one dimension of size: The total count of documents in the system.

He goes on to describe several useful strategies for scaling search for “large” systems – those with document counts above 500 million. Importantly, in my opinion, he also points out that even medium-sized systems (10-100 million docs) will have special scaling needs depending on their performance requirements:

If these systems have any kind of query or index performance requirements — for example, it is a public web site with 10-30 queries per second, or that new documents arrive at a rate of 10 documents per second — then you will likely need an array of machines to handle your needs.   

I mostly want to reinforce and build on this second point. Effectively scaling search means getting the most out of your search infrastructure (i.e., maximizing the number of documents per unit of hardware), but scale and performance are two sides of the same coin: whether a system squeezes ten thousand or ten billion documents onto a machine, it must still satisfy the application’s performance requirements.

If you can’t just add hardware, what then? There are still options for getting more capacity out of a search system, provided it offers the right level of control for optimization and tuning. Understanding these options requires understanding how search system performance is measured and the trade-offs involved. Paul alludes to some of these trade-offs, but it’s worth providing a few more details and examples to drive the point home.

Search System Performance Metrics

Metrics for search system performance typically fall into two categories: query performance and indexing performance. In turn, these categories each have two measures associated with them:

Query performance

- Query latency (or response time) – the time it takes for a query to be processed and results to be returned.

- Query rate – the rate at which the system can process queries, usually measured in queries per second (QPS).

Indexing performance*

- Indexing latency – the time it takes for a document to be indexed and made available to search.

- Indexing rate – the rate at which the system can process and index documents, measured in documents per second.

*Indexing performance assumes systems that actually create an index or some other sort of database optimized for information retrieval. This rules out “federated search” engines, which rely on other systems to create and manage these indices.
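The two query-side measures can be captured with a tiny benchmarking harness. This is a minimal sketch: `fake_search` is a hypothetical stand-in for a real query call, which in practice would be an HTTP request to your engine's query endpoint.

```python
import statistics
import time

def measure_query_performance(search_fn, queries):
    """Run each query through search_fn; return (mean latency in seconds, QPS)."""
    latencies = []
    start = time.perf_counter()
    for q in queries:
        t0 = time.perf_counter()
        search_fn(q)  # issue the query and wait for results
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return statistics.mean(latencies), len(queries) / elapsed

# Hypothetical stand-in for a real search call.
def fake_search(query):
    time.sleep(0.001)  # simulate ~1 ms of query processing
    return []

avg_latency, qps = measure_query_performance(fake_search, ["budget", "scale"] * 50)
print(f"avg latency: {avg_latency * 1000:.1f} ms, rate: {qps:.0f} QPS")
```

Note that latency and rate are related but not interchangeable: a system can sustain a high query rate across many concurrent clients while individual responses are still slow.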

There are variations on these measurements – for example, you can track average or peak values for each. Document count per node (where a node is a processing/memory/storage unit on a network) impacts all of these measures, but there is also a balance between query performance and indexing performance that influences how many documents you can squeeze onto a single node. The perhaps obvious explanation: the more system resources you allocate to serving queries, the fewer resources you’ll have available for indexing, and vice versa.
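This trade-off can be made concrete with a back-of-envelope model: treat a node as a fixed budget of CPU-seconds per second, with query load and indexing load drawing from the same budget. All of the constants below are made up purely for illustration; real per-query and per-document costs depend heavily on the engine, the schema, and document size.

```python
CORES = 8                  # CPU budget per node, in CPU-seconds per second (assumed)
CPU_PER_QUERY = 0.05       # assumed CPU-seconds consumed per query
CPU_PER_DOC_INDEXED = 0.2  # assumed CPU-seconds to index one document

def remaining_index_rate(qps, cores=CORES):
    """Docs/sec of indexing headroom left after serving `qps` queries."""
    spare = cores - qps * CPU_PER_QUERY
    return max(spare, 0.0) / CPU_PER_DOC_INDEXED

print(remaining_index_rate(qps=20))   # light query load: lots of indexing headroom
print(remaining_index_rate(qps=150))  # heavy query load: little headroom left
```

The same budget read the other way tells you how much sustained indexing eats into the query rate you can serve – which is exactly why the same node supports very different document counts in different applications.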

Applications with rapidly changing content or very time-sensitive data place high demand on indexing performance. Others, like highly trafficked Web sites, place high demand on query performance. Different applications thus place different demands on scalability depending on their requirements across these dimensions. To take a specific example, consider an eDiscovery application that provides search across hundreds of millions of archived emails. The query rate and indexing latency requirements for this type of application are typically lower than what a reasonably popular social networking site with an equivalent document count might see. As a result, eDiscovery search applications can squeeze more documents onto each node than highly trafficked Web sites – even if they serve the same total number of documents.

For another comparison, large eCommerce sites can have extreme query performance requirements – in some cases handling several thousand queries per second at peak while still delivering sub-second responses. Even so, these sites can have relatively modest indexing performance requirements compared to, say, financial news applications, where content “freshness” – and therefore low indexing latency – is a priority.

Impact of Features

An often-neglected factor that impacts performance is feature set. Features like faceted search, results clustering, automatic query completion, and advanced query operators can each add incremental overhead to indexing performance, query performance, or both, depending on the feature and the system. For example, queries used for eDiscovery are sometimes crafted by teams of lawyers. This can result in queries made up of dozens or even hundreds of carefully selected search terms combined in a maze of (also carefully selected) Boolean, proximity, and other search operators.
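To illustrate how such a query can balloon, here is a sketch that assembles curated terms with Boolean and proximity operators. The syntax is Lucene-style and purely illustrative – the terms are invented, and operator support and syntax vary by engine.

```python
def any_of(terms):
    """OR a list of terms or phrases together inside parentheses."""
    return "(" + " OR ".join(terms) + ")"

# Hypothetical curated term lists, as a legal team might assemble them.
custodians = any_of(['"j. smith"', '"jane smith"', "jsmith"])
topics = any_of(["merger", "acquisition", "divestiture"])
phrases = any_of(['"off balance sheet"~3', '"side letter"~5'])  # ~N = proximity

query = f"{custodians} AND {topics} AND {phrases} AND NOT privileged"
print(query)
```

Multiply this pattern across hundreds of terms and the engine has far more clauses to parse, plan, and intersect than a typical ad hoc query.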

I remember one FAST partner describing how their legacy eDiscovery tool (built on relational database technology) took up to 2 weeks to process a particularly long and complex query. Needless to say, they were delighted when we demonstrated the same query taking only a few seconds. It was not sub-second, but the point is that they would have been happy with this particular query if it came back in a few hours. In fact, our conversations on optimization included whether we could squeeze more capacity (docs per node) by relaxing the query response time requirements to 10-15 seconds for these queries in their application.

Some search systems are faster than others, but parsing and evaluating very long, complex queries will generally take more cycles and resources than the usual one- or two-term ad hoc query. Relative to absolute document count, the individual impact of any one feature on performance and scale may be small, but taken as a whole, and for certain applications like the one in the example above, they can represent meaningful tuning options.

Know Your Options

The moral of the story is that getting enterprise search scale and performance right for large systems can be nuanced – especially if you’re on a tight budget. If you’ve embarked on, or are about to embark on, a large-scale enterprise search project, make sure you understand these performance considerations. Best-of-breed enterprise search platforms support many tuning strategies that factor in all the key dimensions of search performance and scale. Read your system’s deployment guide (if it comes with one) to understand these options.

Lastly, if you’re not sure whether your project has what might be considered demanding scale or performance requirements, consider getting some expert advice. Two good online forums you can tap for expert advice, and for a sense of whether your system might be considered “demanding,” are the Search Engine Developers group on Yahoo and the Enterprise Search Engine Professionals group on LinkedIn.


Comments (3)

  1. Paul says:

    Hey Nate:

    Excellent points.

    Some additional comments for readers (I know you know all of this already):

    1.  Very large systems need to be more concerned about 2nd order effects overwhelming the system. These are things like hardware failures, configuration complexity, network traffic, etc.

    This is why I often like splitting large systems into multiple smaller, independent systems when things get big. For example, two 5-row FAST installations may be, overall, more reliable than one large 10-row system for query scalability.

    2.  Just to emphasize that indexing is very expensive, and a high-bandwidth indexing application (i.e. very frequent updates) will use up a lot of resources.

    So, if possible, make some servers “static” – i.e. no new documents (deletes are okay, they execute very quickly). This will free up large amounts of resource for query and will allow you to pack on more documents per node increasing document count scalability. Of course, doing this is tricky and requires some careful architectural analysis.

    Going along with this, indexing is more efficient the more documents are included in a batch. For example, if indexing 10 documents requires X, then indexing 1000 documents may only require X*2. The larger the batch, the more efficient. This further emphasizes the value of centralizing indexing onto a small number of nodes – if possible.
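    [Editor's sketch of Paul's batching point: model each indexing call as a fixed overhead plus a much smaller per-document cost, so larger batches amortize the overhead. The constants are made up for illustration.]

```python
BATCH_OVERHEAD = 1.0   # assumed fixed cost per indexing call (arbitrary units)
COST_PER_DOC = 0.001   # assumed marginal cost per document

def indexing_cost(num_docs, batch_size):
    """Total cost to index num_docs documents submitted in batches of batch_size."""
    batches = -(-num_docs // batch_size)  # ceiling division
    return batches * BATCH_OVERHEAD + num_docs * COST_PER_DOC

print(indexing_cost(10_000, batch_size=10))    # many small batches: overhead dominates
print(indexing_cost(10_000, batch_size=1000))  # few large batches: overhead amortized
```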

    Of course, a lot of these ideas are non-standard configurations. 🙂

    Also, they have other issues – such as interfering with the default TF/IDF relevance ranking formula and increasing configuration complexity…

    Which is all to say: don’t bother unless you’re doing something really big. Otherwise, just buy an extra server or two and don’t worry about it.
