Last time on FABULOUS ADVENTURES:

Eric Lippert: The second term is asymptotically insignificant compared to the first, so we’ll throw it away, and declare that any sort on a list of n items has at least a worst case of O(n log n) comparisons.

Leo McGarry: We must inform the President!

And now, the thrilling conclusion.

To extract and analyze the Google queries from my logs, I used a bunch of one-off scripts, egregious hacks, and programs written in a whole bunch of different languages. The one-off script has little in common with the ship-it-to-customers code I write every day. One-off scripts do not need to be readable, testable, performant, robust, localizable, secure, extensible or even bug-free. The one-off script does a job once, and then I might delete it or toss it in my “scratch” directory or whatever. As long as it gets the job done, everything is fine.

First, let’s pull down every referrer page and use a regular expression to parse out the query string.

Set re = New RegExp
re.Pattern = "a href=""http:\/\/[^>]*google[^>]*q=([^&""<]*)[&""<]"
re.Global = True
Set HTTP = CreateObject("Microsoft.XMLHTTP")
URL = "http://blogs.msdn.com/ericlippert/admin/Referrers.aspx?pg="
For page = 1 To 2193
    HTTP.Open "GET", URL & page, False
    HTTP.Send
    Set Matches = re.Execute(HTTP.ResponseText)
    If Not Matches Is Nothing Then
        For Each Match In Matches
            For Each SubMatch In Match.SubMatches
                WScript.Echo SubMatch
            Next
        Next
    End If
Next

I used VBScript because I prefer VBScript’s syntax for parsing out submatches to JScript’s. See the stuff in the parens in the expression? That’s the *submatch* that I want to extract — the actual query portion of the URL.

Notice that I have the number of pages hard-coded. I knew how many pages there were, so why would I bother writing logic to deduce that? *This is a one-off*. Don’t like it? Fix it yourself, buddy.

I ran that program — which took quite a while, because downloading 2193 web pages isn’t cheap — and dumped the text to a file. I now had a file that looked like this:

dllmain

determine+character+encoding

how+to+use+loadpicture+in+asp

what+is+garbage+collector+exactly+works+in+.net

jscript+innertext

wscript+object+in+asp.net

vbscript+round+up

wshcontroller+activex

closures+jscript.net

c++++smart+pointers

etc. I was finding all those plus-for-space characters hard to read, so I loaded that thing into my editor and did a little vi magic to first replace all instances of c++ with CPLUSPLUS, then turned every + into a space, and then turned the CPLUSPLUSes back to c++. That gave me

dllmain

determine character encoding

how to use loadpicture in asp

what is garbage collector exactly works in .net

jscript innertext

wscript object in asp.net

vbscript round up

wshcontroller activex

closures jscript.net

c++ smart pointers

Then I used my editor to search for the entries that started with “how”, “what”, and so on, because I figured that the questions would be the funny ones.

I was also interested in things like “what’s are the top ten most popular words in these queries?”, just out of curiosity. Mike wrote a blog entry about that a while back, where he deduced from first principles how to solve the “what’s the most popular word in this text?” problem.

Let’s break that problem down into three subproblems:

(1) Obtain a list of words.

(2) Figure out the frequency of each word.

(3) Output the words, sorted by frequency.

Suppose that there are n words in the list obtained from (1). Recall that yesterday we came up with an O(n^2) solution to part (2) — for each word in the list, compare it to every word that comes after it in the list. If we added a counter, we could determine how many collisions each word had. To solve part (3), we could create a new list of (word, count) pairs, and then sort that list by frequency, somehow.
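That brute-force approach can be sketched like this, in Python rather than the C# used later in the post (the function name is mine, purely for illustration):

```python
def count_frequencies_quadratic(words):
    """Naive O(n^2) frequency count: compare each word to every later word."""
    counted = []
    pairs = []
    for i, word in enumerate(words):
        if word in counted:               # already tallied this word earlier
            continue
        counted.append(word)
        count = 1
        for later in words[i + 1:]:       # compare to every word after this one
            if later == word:
                count += 1
        pairs.append((word, count))
    # part (3): sort the (word, count) pairs by frequency, highest first
    pairs.sort(key=lambda p: p[1], reverse=True)
    return pairs
```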

Mike came up with a much better solution than that for part (2). Mike’s solution was to **first sort the word list**. Suppose you have the word list <the, cat, in, the, hat>. Sorting it takes O(n log n) steps, as we learned yesterday: <cat, hat, in, the, the> and suddenly finding the frequency of each word is easy. You no longer need to compare each word to ALL the words that come after it, just the first few. The sort is expensive but it made finding and counting the repeated words very cheap.
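A rough Python sketch of that sort-first idea (again, the helper name is mine, not from Mike's post):

```python
def count_frequencies_sorted(words):
    """Sort first (O(n log n)), then count runs of equal words in one O(n) pass."""
    ordered = sorted(words)               # the off-the-shelf sort does the heavy lifting
    pairs = []
    for word in ordered:
        if pairs and pairs[-1][0] == word:
            pairs[-1] = (word, pairs[-1][1] + 1)   # same run of repeats: bump the count
        else:
            pairs.append((word, 1))                # a new word starts a new run
    return pairs
```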

This is also a good solution because it uses off-the-shelf parts; it uses a built-in sort routine to do the heavy lifting.

How then do you solve part (3)? Mike again had the clever idea of re-using existing parts. Dump the word and frequency data into a DataTable object, and use the DataTable’s sorting methods to sort by frequency. Another solution that he came up with later was to implement IComparable on an object that stores the (word, frequency) pairs and sorts on frequency.

Can we do better than O(n log n)? Let’s try. And, as an added bonus, I’m going to use the new generic collection feature of C#, coming in Whidbey. If you haven’t seen this feature before, it’s pretty cool. To briefly summarize, you can now declare types like “*a dictionary which maps integers onto lists of strings*”. The syntax is straightforward; you’ll pick it up from context pretty easily, I’m sure.

Step one, read in the log and tokenize it using the convenient split method.

StreamReader streamReader = new StreamReader(@"c:\google.log");
string source = streamReader.ReadToEnd();
char[] punctuation = new char[] {' ', '\n', '\r', '\t', '(', ')', '"'};
string[] tokens = source.ToLower().Split(punctuation, true);

The true means “don’t include any empty items” — if we were splitting a string with two spaces in a row, for example, there would be an “empty” string between the two delimiters. (In the next version of the framework, this method will take an enumerated type to choose options instead of this rather opaque Boolean.)

We’ve solved subproblem (1). On to (2):

Dictionary<string, int> firstPass = new Dictionary<string, int>();
foreach (string token in tokens)
{
    if (!firstPass.ContainsKey(token))
        firstPass[token] = 1;
    else
        ++firstPass[token];
}

On my first pass through the list of tokens, I look in the first pass table to determine if I’ve seen this token before. If I have, then I increase its frequency count by one, otherwise I add it to the table with a frequency count of one. This loop is O(n) because lookups and insertions on dictionary objects are of constant cost no matter how many items are already in the dictionary. We’ve solved subproblem (2).

On to subproblem (3). Now I’m going to take my mapping from words to frequency and reverse it — turn it into a mapping from frequency to a list of words that have that frequency. I’m also going to keep track of the highest frequency item I’ve seen so far, because that will be convenient later.

As you can see, this uses the same trick as before. If the frequency is already in the table then add the current word to the word list, otherwise create a new word list.

int max = 1;
Dictionary<int, List<string>> secondPass = new Dictionary<int, List<string>>();
foreach (string key in firstPass.Keys)
{
    int count = firstPass[key];
    if (count > max)
        max = count;
    if (!secondPass.ContainsKey(count))
        secondPass[count] = new List<string>();
    secondPass[count].Add(key);
}

Clearly the number of unique words in the firstPass table cannot be larger than n, so this loop is also O(n). Inserting items into tables and onto the end of lists is of constant cost, so there’s no additional asymptotic load there.

OK, great, but how does this help to solve subproblem (3)? We need one more loop.

for (int i = max; i >= 1; --i)
{
    if (secondPass.ContainsKey(i))
    {
        Console.WriteLine(i);
        foreach (string s in secondPass[i])
        {
            Console.WriteLine(s);
        }
    }
}

Again, this straightforward loop cannot possibly print out more than n items and all the list and table operations take constant time, so this is O(n). The algorithm spits out the list

3362

vbscript

2253

jscript

990

object

910

in

879

array

[…]

1

[…]

forward

xbox

honest

**Three loops**, none of which is ever worse than O(n). Therefore, the whole algorithm is O(n). As you can see, I’ve sorted the Google query keywords by frequency using a worst-case O(n) algorithm, thereby providing a **counterexample** to my proof from last time that such a sort algorithm is **impossible**. Neat trick, proving a thing and then finding a counterexample! How’d I manage to do that?

**I’ve pulled a bait-and-switch.** What I proved yesterday was that any sort on a list **where the only way to deduce the ordering requires data comparison** is an O(n log n) problem. But nowhere in this algorithm do I make so much as a single data comparison! The internals of the table lookups do data comparisons to determine **equality**, but not **ordering**, and the comparisons to determine max and iterate the last loop are not comparisons between two data, so they don’t count.

This sort algorithm *doesn’t follow our generalized sort schema at all*, so clearly the proof doesn’t apply.

I can get away with not making any comparisons because I know additional facts about the data — facts which I did not assume in my proof. In this case, the relevant fact is that the thing which I am sorting — the list of frequencies — is a list of **n integers, all of which are between 1 and n**. If you can assume that, then as we’ve just seen, you don’t have to use comparisons at all. This algorithm is called **CountingSort** (for fairly obvious reasons). Another algorithm that has similar properties is **RadixSort**, to which a commenter alluded yesterday.
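A minimal CountingSort under exactly that assumption (n integers, each between 1 and n) can be sketched in a few lines of Python:

```python
def counting_sort(values):
    """Sort n integers, each between 1 and n, in O(n) time.

    No two data items are ever compared to each other; we only index
    into a tally array, which is why the O(n log n) lower bound for
    comparison sorts does not apply.
    """
    n = len(values)
    counts = [0] * (n + 1)        # counts[v] = how many times v appears
    for v in values:
        counts[v] += 1            # O(n): tally each value
    result = []
    for v in range(1, n + 1):     # O(n) total: emit each value counts[v] times
        result.extend([v] * counts[v])
    return result
```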

If you’re sorting lists of arbitrary strings or arbitrary floating point numbers where you *cannot* make any assumptions about the data then you’ll end up using some version of the generalized comparison sort from yesterday.

The moral of the story (aside from “watch out for bait-and-switch schemes”) is that hash tables, particularly generic hash tables, are incredibly useful off-the-shelf parts for a huge number of applications. We use them all over the place in our code because they’re **easy to use and extremely efficient** even when you throw lots of data at them.

But wait, something should be nagging at you. There is something missing from this analysis. Saying that this is an O(n) algorithm perhaps isn’t quite the whole story. What did I leave out?

Well, the biggest thing left out is the amount of memory/temporaries used.

🙂

I believe there are algorithms that can sort faster than quicksort, but quicksort uses (IIRC) O(ln(N)) memory, whereas they tend to use O(N).

In a production program, an algorithm that uses O(n) memory can kill the whole thing’s performance despite being fast in isolation.

Good point. For a list which mostly consists of unique words, this solution makes two entire in-memory copies of most of the data in the list. If the data are large and not copied by reference, that could be quite expensive.

Quicksort on the other hand only consumes memory in the form of stack frames, and each stack frame is of a constant size, and in the typical case there will be O(log n) stack frames max in use at any time. (The standard quicksort algorithm suffers from a n-squared worst case in rare cases, but getting into those details would take us far afield.)

There are other aspects that I’ve neglected in this analysis. (Hint: think about what I said above; what if the data are large?)

You’re also cheating and assuming that hash table operations are O(1). In reality, hash table complexity is… well… complex. Here’s a web page that contains various hash "attacks":

http://www.cs.rice.edu/~scrosby/hash/

Jeez, very cool.

Know something weird? AFAICT, the boolean parameter on Split() isn’t documented:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemstringclasssplittopic.asp

‘less I’m looking at the wrong Split().

The code below degrades to O(N^2) — and degrades quickly, even with a good hash algorithm, if your hash table is small (that is, size N or less). I seem to remember that you need a hash table size of 2N (or was it N^2?) for the hash table to be able to assure performance… but I don’t remember for sure.

In any case the problem is that secondPass.ContainsKey(i) walks a linked list containing up to N items with that hash value.

for (int i = max; i >= 1; --i)
{
    if (secondPass.ContainsKey(i))
    {
        Console.WriteLine(i);
        foreach (string s in secondPass[i])
        {
            Console.WriteLine(s);
        }
    }
}

Now that I think about it, all the loops degrade this way, this one is just clearest because it has the nested loops.

In actuality, this code is going to be very slow:

foreach (string key in firstPass.Keys)
{
    int count = firstPass[key];
    if (count > max)
        max = count;
    if (!secondPass.ContainsKey(count))
        secondPass[count] = new List<string>();
    secondPass[count].Add(key);
}

Because most of your words are going to appear once, the hash bucket for 1 is going to be huge and slow. This section will degrade to O(Z^2), where Z is the number of words that appear once.

> The code below degrades to O(N^2)

You’re assuming a very naive hash table algorithm. Sure, if you use a really terrible hash table, things go bad real fast.

As Raymond pointed out, it’s complicated. But for many situations, you can use extensible hashing and get pretty close to O(1) searches and inserts.

> Because most of your words are going to appear once, the hash bucket for 1 is going to be huge and slow

I think you’re misunderstanding the code. The hash bucket for one only has one thing in it, and that’s a linked list. Adding to the end of a linked list is O(1).

Also, I think you’re misunderstanding the order of the nested loop. Yes, some of those lists could be a large fraction of n, but the TOTAL length of EVERY list added together is n or less.

Is that clear? I mean, suppose you’ve got "the cat in the hat", and you have the hash table

1 cat, in, hat

2 the

the fact that the first list is long means that the second will be short. The total order of the nested loop can’t be more than O(n)

You’ve not told the dictionary how many items to expect, so it’s going to start with whatever the default minimum size is. When you try to add more than this, it has to create a new underlying array and copy all the items to it. This will be repeated if the new array still isn’t enough to hold all the items.

If Dictionary is implemented like Hashtable, it’s based on skipping forward to another bucket if the bucket selected by the hash is already in use. It will also rehash the current table once the number of hash collisions reaches a significant amount (load factor multiplied by the number of buckets in the table). I did a bit of digging with Reflector – I don’t have VS2005 CTP here, so I can’t confirm what Dictionary<> does.

When the underlying bucket array is expanded, the code searches for the smallest prime which is greater than the size you asked for. It has a small table of 70 primes (I assume starting around 100ish with a decent step between them); if you ask for a capacity bigger than that, it starts searching prime numbers by dividing your candidate by every odd number up to Sqrt(candidate). When it expands automatically, the minimum size is twice the current size plus 1.

Yes, best case is O(1), but if you get hash collisions or the table needs to rehash, it suddenly gets a lot slower.

I’m trying to think of the rough capacities required. The two obvious cases I can think of are where all the words are different – the first Dictionary needs a capacity of N, but the second only 1 – and where the words are all the same – both dictionaries only need capacity of 1. I’m not too good at this O() stuff!

Of course the real question is why were those "c++" terms written as "c++" and not "c%2B%2B"? Something fishy is going on there!

I missed the part where you do "ContainsKey" in less than O(log n). Can you explain?

Um, er, I’m going to “say something stupid™”.

Why care about performance for a one-off task? It’s a one-off task, you have a specific and not very large n, and you don’t actually care if it takes O(n), O(n log n), or O(n^2).

And, while theoretically O(n) is better than O(n log n), in practice O(n) might be SomethingBig * n and O(n log n) might be n log n and n might be 3, and for n = 3 bubble sort suddenly becomes the optimal algorithm 🙂

I don’t accept the O(1) hashtable operations claim either. Some (admittedly not very well designed) quick tests on my machine doing something very similar with integer keys and adding or incrementing a counter value every time a key is hit suggest something else:

size=32768, time/op=5.66E-06

size=131072, time/op=6.57E-06

size=524288, time/op=9.53E-06

(any more and I start swapping 🙂

I concede that hashtables can be setup to do lookups in O(1) time, but in the case of the .NET 1.1 hashtable at least, they probably aren’t (cursory inspection using reflector would appear to confirm this, but I may be missing something).

"I think you’re misunderstanding the code. The hash bucket for one only has one thing in it, and that’s a linked list. Adding to the end of a linked list is O(1)."

No I do understand it… I’m not talking about the insert part I’m talking about the checking part… you hash the index and then run down the *whole list* to check if an item already exists. If most of the items are only seen once then this degrades the hash from O(1) to O(N)

There are other problems with this algorithm but this is the most obvious one.

As an aside this algorithm (with some modification) looks like it might work quite well with multi-processor systems.

"Also, I think you’re misunderstanding the order of the nested loop. Yes, some of those lists could be a large fraction of n, but the TOTAL length of EVERY list added together is n or less.

Is that clear? I mean, suppose you’ve got "the cat in the hat", and you have the hash table

1 cat, in, hat

2 the

the fact that the first list is long means that the second will be short. The total order of the nested loop can’t be more than O(n) "

No this is not true, as I said above.

Because you walk the linked list for each item inserted, this becomes N^2.

To follow your example:

The cat in the hat is fun.

When you insert "is" it will take 4 steps to see if "is" exists. When you insert "fun" it will take 5 steps to see if "fun" exists, and so on… thus up to N steps N times, or N^2.

> you hash the index and then run down the *whole list* to check if an item already exists

No, I don’t. It’s a hash table FROM numbers TO lists of strings. Here’s where I hash the number:

if (!secondPass.ContainsKey(count))

and here’s where I insert the key into the list:

secondPass[count].Add(key);

But nowhere do I check to see if the key is already in the list. If I did, yes, that would be expensive. But I don’t need to because we can prove logically that there will NEVER be a collision on this list.

"Add" just adds the item to the end of a linked list, and that’s cheap.

> When you insert "is" it will take 4 steps to see if "is" exists

No, it won’t. At no time do I search a list. I search a hash table, yes, but not a list.

> I missed the part where you do "ContainsKey" in less than O(log n). Can you explain

Consider a hash table with chaining that rehashes whenever the average chain length exceeds, say, four. On average, such a hash table requires one hash and a maximum of four comparisons per lookup, no matter how big the table gets.

As Mike points out above, rehashing the table might get expensive. But don’t forget that the amortized cost of the rehashing is going to be spread over all the inserts, most of which will be very cheap. The cost of re-hashing is going to be O(n), but you only rehash about O(1/n) of the time, so it works out to O(1) per insert.
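To see why the doubling rehashes amortize out to O(1) per insert, here is a toy Python model of the bookkeeping (the initial capacity and doubling policy are illustrative assumptions, not the real Hashtable's growth rules):

```python
def total_rehash_work(n, initial_capacity=8):
    """Count the element-moves done by capacity-doubling rehashes
    while inserting n items one at a time.

    Each rehash moves every item currently in the table, but doubling
    means the rehash sizes form a geometric series, so the total work
    stays proportional to n: amortized O(1) per insert.
    """
    capacity = initial_capacity
    size = 0
    moves = 0
    for _ in range(n):
        if size == capacity:      # table full: rehash everything into a bigger table
            moves += size
            capacity *= 2
        size += 1
    return moves
```

For any n, the total rehash work is bounded by a small constant multiple of n, which is the amortized-cost argument in miniature.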

> Why care about performance for a one-off task?

I don’t. And clearly the time taken to fetch the 2193 web pages was way, way longer than the time taken to move the strings around in memory!

However, two things:

First, note that this is a VERY short algorithm that I put together very quickly. One of the aims of me writing this blog is to give scripters new techniques that they might not have been exposed to before. Lots of people do not understand the power of tables to make small, fast programs.

Second, this is a somewhat contrived example that I’m using as an educational opportunity. There have been a number of threads on JOS and elsewhere lately on the value or lack thereof of a mathematical approach to computing. I’m trying to illustrate the intersection between theoretical computer science and practical down-n-dirty programming.

That’s why I care what the order is.

> And, while theoretically O(n) is better than O(n log n), in practice O(n) might be SomethingBig * n

Indeed, in practice you cannot disregard the constant factor. For example, when I was a foolish intern I suggested to Tim Paterson that we rewrite the InStr method to use a "better" string search algorithm. InStr uses an O(n m) algorithm where n and m are the lengths of the search string and the pattern. There exist O(n + m) algorithms, which are obviously better for large n and m.

But the naive algorithm InStr uses is BLAZINGLY fast for common cases. Most of the work optimizes down to a single machine instruction. The algorithmically better methods do lots of expensive string preprocessing up front and end up being slower in typical cases.
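For reference, the naive O(n m) strategy being described is roughly the following (a Python sketch of the general technique, not the actual InStr implementation):

```python
def naive_find(haystack, needle):
    """O(n*m) substring search: try every alignment of the pattern.

    The worst case compares m characters at each of ~n positions, but
    in typical text the inner comparison fails on the first character
    almost immediately, which is why this is so fast in practice.
    """
    n, m = len(haystack), len(needle)
    for i in range(n - m + 1):
        # in native code this inner check is a tight memcmp-style loop
        if haystack[i:i + m] == needle:
            return i
    return -1
```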

"But nowhere do I check to see if the key is already in the list. If I did, yes, that would be expensive. But I don’t need to because we can prove logically that there will NEVER be a collision on this list. "

You do walk the list when you perform the secondPass.ContainsKey(count) — that is how hash tables work… if there is a collision you walk a linked list of items with that hash. AND YES there will be collisions… all words that occur once will hash to the same location, because the key is the number of times a word occurs.

For a good example of how this affects performance, see Raymond Chen’s link above.

—–

It is interesting that your last post was about typical performance for InStr; you have the same problem here… for the best input data this will run at O(N), for the typical case (e.g. lots of words that only appear once) it will run at O(N^2), and for almost all cases it will run at O(N log N) or worse.

I’m sorry to have to continue to be so contradictory, Hogan.

Yes, checking for the collision might walk a hash chain. (Using "list" to mean "hash chain" is confusing, particularly when I have a hash table of lists.)

But as I said above, if the hash table is written using an extensible-hashing-with-chaining algorithm then the average chain length is bounded to some small number.

In the JScript engine we use this algorithm in the name tables and thereby guarantee that the average number of comparisons is less than 2.5 no matter how big the table gets. Thus, it’s O(1) for collision detection, not O(n).

Raymond’s link is about a different topic — how to put data in a hash table such that the data is chosen to deliberately produce bad performance. That’s a separate topic that maybe I’ll cover later — writing a hash table to have good performance even when attacked is a hard problem. For my purposes, I’m assuming that the hash table has good performance on typical, non-hostile data.

Let me run through a small example to see if I can convince you that the many-unique-words case is O(n). Consider a list of three words, A, B and C.

In the first pass I produce this hash table:

A -> 1

B -> 1

C -> 1

You agree that each insert into this hash table is O(1) and hence building the hash table is O(n), right?

In the second pass, I build the "second pass" hash table like this. It starts empty. I extract (A, 1) from the first hash table and look up "1" in the second hash table. I don’t find it, so I insert

1 -> < >

That’s an O(1) operation.

Then I say "fetch me the list" and get back < >. That’s an O(1) operation.

Then I insert A into the list. That’s an O(1) operation, and now I have

1 -> < A >

OK, we’ve made it through the loop once. The second time through I extract (B, 1) from the first hash table. I look up "1" in the second hash table, which is an O(1) operation. But I don’t INSERT anything into the second hash table. I insert something into the LIST which I have allocated. I fetch a reference to the list and get back < A > , and I insert B onto the end of that list to produce <A, B>. I don’t change the hash table at all, I change the contents of the thing that the hash table refers to. So now we have

1 -> <A, B>

And we’ve made it through the second loop doing only O(1) operations.

Third time, same thing. Extract (C, 1), look up 1, get the list, insert C onto the end of the list.

Does that make sense?

Obviously Eric’s example requires that your list contains a pointer to the _end_ of the list as well as the head. Most non-academic linked list code I’ve used does: it simply makes your linked list more flexible at the cost of one pointer. It’s not quite a necessity for a doubly-linked list to have a pointer to the end, but it makes things more orthogonal (particularly if you use the pointer-to-pointer technique).

I agree that if you only had a pointer to the head of the list, it would be O(n), because you have to find the end if you want to add to it. But if you knew that, you’d add the new item to the head of the list rather than the end: we’re not guaranteeing which order the words with the same frequency appear in.

Correct — a double-linked list (or a single-linked list with a pointer to the end) is required if you want to do O(1) inserts on the end.

In fact, the List<> implementation is not a linked list at all. It’s a double-when-full array list. The amortized cost of insertions at the end of a DWF is also O(1). The down side is that the cost of insertions at the beginning is O(n).
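A toy double-when-full list in Python (a sketch of the growth policy only, not the real List<> implementation) makes the idea concrete:

```python
class DoubleWhenFull:
    """Sketch of a double-when-full array list: amortized O(1) appends.

    When the backing array fills up, a new array of twice the size is
    allocated and the items are copied over. Inserting at the front
    would require shifting every element, hence O(n) there.
    """

    def __init__(self):
        self._items = [None] * 4   # backing array, illustrative initial capacity
        self._count = 0

    def add(self, item):
        if self._count == len(self._items):
            # full: allocate double the capacity and copy everything across
            self._items = self._items + [None] * len(self._items)
        self._items[self._count] = item
        self._count += 1

    def to_list(self):
        return self._items[:self._count]
```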

I’ve always thought the naming of ArrayList (and obviously the new List<>) was weird. They’re not lists, they’re arrays. For a while I thought the name of ArrayList implied a deque<> (from STL), but digging around with Reflector cleared it up: the nearest STL equivalent is a vector<>.

I still think in C++ 😉

I assume that cache locality is a distinct factor in this kind of thing. If your behaviour is read-mostly, append new items rather than insert, then an array typically has better locality of reference than a linked list. Unless you do what MFC’s collections do and allocate whole blocks of nodes in one go, keeping freed nodes on a free list (and even then you get a lot of indirection through pointers).
