# The cost of trying too hard: String searching

There are many algorithms for fast string searching, but the running of a string search is inherently O(n), where n is the length of the string being searched: If m is the length of the string being searched for (which I will call the "target string"), then any algorithm that accesses fewer than n/m elements of the string being searched will have a gap of m unaccessed elements, which is enough room to hide the target string.

More advanced string searching algorithms can take advantage of characteristics of the target string, but in the general case, where the target string is of moderate size and is not pathological, all that the fancy search algorithms give you over the naive search algorithm is a somewhat smaller multiplicative constant.

In the overwhelming majority of cases, then, a naive search algorithm is adequate. As long as you're searching for normal strings and not edge cases like "Find `aaaaaaaaaaaaaaab` in the string `aaaaaaaaaaaaaabaaaaaaaaaaaaaaab`". If you have a self-similar target string, the running time of a naive search is O(mn) where m is the length of the target string. The effort in the advanced searching algorithms goes towards diminishing the effect of m, but pay for it by requiring preliminary analysis of the target string. If your searches are for "relatively short" "normal" target strings, then the benefit of this analysis doesn't merit the cost.

That's why nearly all library functions that do string searching use the naive algorithm. The naive algorithm is the correct algorithm over 99% of the time.

Tags

1. If you’re going to be searching the SAME long string over and over again, it is VERY much faster to build an index. (Assuming the string is non-pathological.)

Compare SQL’s LIKE ‘%butter%’ vs. CONTAINS(…, ‘butter’, …) on a large data set.

2. Michael Bishop says:

In the past, I needed a very fast way of searching a constant string for many substrings. I built a suffix trie from the data. This allows searching in O(m) time. You sacrifice memory, but when you are looking for short substrings in large text the speed difference is amazing.

3. Chris Moorhouse says:

"The naive algorithm is the correct algorithm over 99% of the time."

99% of projects? 99% of calls to search functions/methods? 99% of calls made to search functions/methods?

Maybe I just take Mr Lippert too seriously. ("250% of what, exactly?", Sep 1/05) :D

4. Simon Cooke says:

I still like Boyer-Moore because it only touches each byte once. There’s a certain elegance there that I really appreciate.

Of course, that’s all irrelevant because ultimately, Raymond’s spot on – most of the time there’s no advantage to using the more complex one.

Although you can make things better by optimizing the search for the first character of the string :)

5. Roger Clark says:

"trying too hard" really isn’t the right phrase to be using here.

6. Per Vognsen says:

Both this and the previous blog posting on splay trees would have been far more accurately titled "The cost of not understanding the implications of your choices", although this isn’t as catchy as yours.

The trend so far (if I may extrapolate from two data points) seems to be that you have an axe to grind with people who uncritically copy algorithms from textbooks. Well, yeah, people like that are pretty common, but copying algorithms uncritically is usually not among their chief failings…

7. People will occasionally say things like "Why doesn’t XYZ use ? It’s clearly superior to the brute-force algorithm you’re currently using." These people are trying too hard. In many cases, the fancy algorithm turns out to be worse than the dumb one.

8. Norman Diamond says:

I think this case is different from splay trees. If a smarter algorithm is used, is it really slower in the average case, or does it remain equal in the average case while wrapping up the pathological edge cases?

If you’re writing a library (write once and use many times), and if performance of the average case doesn’t deteriorate, then you still get medium-sized gains at cheap cost by choosing a better algorithm.

9. Per Vognsen says:

People will occasionally say things like "Why

>doesn’t XYZ use ? It’s clearly superior to the

>brute-force algorithm you’re currently using."

>These people are trying too hard. In many cases,

>the fancy algorithm turns out to be worse than

>the dumb one.

Ah, so you mean that they are trying too hard to apply concepts they were taught about in school (or elsewhere), say. Yes, this is definitely a common affliction.

A friend once related to me the story of one of his final projects for a class on networking concepts, which was to build a simple instant messenger service.

The project was to be built in Java, and one of the things they’d covered in class was RMI (the Java equivalent of .NET Remoting).

My friend had easily worked out a simple line-delimited TCP/IP-based protocol and had already prototyped and tested it via a simple command-line client, but one of the other team members was very unhappy. He kept wanting to know "How does RMI fit into this?!?". (RMI is the Java equivalent of .NET Remoting.)

10. It’s more than getting all enamored of fancy algorithms. It’s just recognizing when you’re overkilling the problem. Using a fancy hash table to keep track of 5 elements, for example.

11. Dean Harding says:

One of the thing about the Hashcode class in .NET is that it’s just so easy to use. And "everybody" knows that Hashtables aare as fast as you can get.

So it’s really annoying when you see people using Hashtables to store a dozen or so values, when a SortedList would be even faster due to it’s *much* lower overhead…

The problem is that people just don’t think about the problem enough…

12. Bart says:

Especially annoying is people using a UnsortedList ‘because N will be small’ and then using the app where the N doesn’t stay small enough.

Especially since they have a HashSet class which exposes the exact same methods used with the same contracts and there aren’t enough instances to worry about the overhead.

13. Jonathan Allen says:

In regards to using HashTable vs SortedList in .Net, most people have no way to know which to use. It isn’t like the documentation is real clear on when to use each, nor does anyone really have to time to performance test non-critical code.

Really all they need to do is add a couple of "when should I use this" lines to each class / method, and the rate of correct usage will go way up.

14. AC says:

The best example of overkill is the way Windows Update Agent API works. For any Windows computer to scan for updates, it has to download a CAB file which at the moment contains around fifty thousand(!) XML files totaling 50 MB, to decompress them, parse them etc.

Try that on Pentium II or III and dial-up connection. Instead, all the data could have been stored in a few tab separated files and then compressed.

My second favorite overkill example is the way IE stores cache and history data. It was so complicated that IE was for years unable to maintain its own cache — something remained forgotten all the time.

15. Ken Showman says:

The information about the Windows Update Agent (WUA) API that "AC" mentioned above is very misleading.

The Windows Update Agent (WUA) API is designed to search the local computer’s cache of update metadata. When WUA searches for updates, it actually uses a highly optimized protocol that only picks up new and changed updates, prunes out entire branches of updates based on OS and other applicability rules, omits languages that are not installed on the local machine, and then caches the results in a client-side database. In the usual case, almost nothing is downloaded when WUA performs a search; virtually all of the time "searching for updates" is actually spent running the various applicability rules, e.g. verifying that all of your previously installed patches are still installed and valid.

In order to support API callers on computers where there is no trusted server (such as a computer with no Internet connection), a feature was added to let the API caller supply the giant CAB mentioned above. This CAB is effectively a replacement for the entire WU server, and it includes all updates in all languages because it cannot be known at CAB-generation time which platform(s) the API caller will care about. The data is formatted in XML because that is the same format used by the "real" client-server protocol. The individual XML files in question are never written to disk, so the fact that there are 50,000 of them (as opposed to one giant XML element that contained everything) is beside the point.

Funny that AC suggests that using a tab-delimited file for the CAB scenario would somehow have been less work for the WUA team. That is definitely not the case.

16. Sean Barrett says:

Simon Cooke incorrectly claims Boyer-Moore touches each byte exactly once. Perhaps he is thinking of KMP; Boyer-Moore is one of the ‘skip m’ algorithms but it has worst cases that examine more than n characters total, even for no matches.

Also worth considering about the performance of an m-skipping algorithm is that if m is smaller than the cache-line size, you’ll still be touching every cache line of the (large) string, so any performance gain may be miniscule if the memory traffic dominates.

I invented a skip-m style algorithm for regular-expression searching 15 years ago, but I never did anything with it beyond publishing it on my website, because as Raymond says, it’s just not that interesting except for very long search strings.

17. Dustin Long says:

Given that "the world is moving towards webapps", it helps to implement the non-naive approach nowadays. Assuming you’re going to be making some internet accessible application, you should better be ready for some fairly pathological arguments to be coming your way.

18. AC says:

Ken: "Funny that AC suggests that using a tab-delimited file for the CAB scenario would somehow have been less work for the WUA team. That is definitely not the case."

It’s not about WUA team’s work, it’s how much each computer has to do to find out the updates it needs. Your arguments are all in the direction "we made it to work nicely with server, and if it doesn’t have the server, then it was easiest for the team to pack everything as 50000 XML files". I can imagine that this was the easiest for the team to do, but, sorry, that is still the example of the "overkill" design. Instead of nicely preparing the data on source side, once for all millions of users, it remains in the form that was "easiest".

I really consider "database to XML files" as one of the most inefficient forms for the export of database tables, and "tables to tab separated files" as one of the most efficient. Why would anybody need the names of the fields more than once per table, during the processing of the table?

19. Eric Lippert says:

99% of projects? 99% of calls to search functions/methods? 99% of calls made to search functions/methods?

In this case I would say "all of the above", and follow up with "99% underestimates by a considerable number of orders of magnitude".

I made the same proposal — turn this naive algorithm into something "better" — when I was a naive intern and needed to make some minor changes to the VB implementation of InStr.

Tim "Father of DOS" Paterson owned that code at the time and he taught me a valuable lesson. In the vast majority of real-world cases, the time taken to blaze through the search string looking for all the occurences of the first letter in the target string is far less than the time it takes to scan the string, build the index, blah blah blah. It’s the very definition of premature optimization.

We have thousands upon thousands of lines of real-world Visual Basic code in the compatibility lab and people are usually (a) looking for short strings inside other short strings, and (b) their string operations are gated on database or disk performance anyway, so who cares if the string search takes 200 nanoseconds vs 250?

Really, very few people do searches of the entire human genome using InStr in Visual Basic.

20. dottedmag says:

What the library writers don’t do generally is the documentation of complexity of the algorithms they use. The only such specification I seen so far was the C++ standard. You have to dig the sources usually.

21. ShameOnYou says:

The implementation details isn’t documented because the programmers are ashamed of implementing a unoptimized algoritm in a library used by millions.

22. The algorithm *is* optimized. It’s optimized for non-self-similar strings.