Google Desktop On The Cheap, Part Zero

Before I get started with my next series, two quick notes.

First, at long last I had dinner with famed JOS poster (and east coast technical evangelist for my product, VSTO) Philo the other night, as he happened to be in Redmond. I'm pleased to report that he's as witty, urbane and opinionated in real life as online, and I'm not just saying that because he expensed my salmon. Thanks! And blog more!

Second, liorean is thinking about putting together some sort of serialized-codeDOM-like markup language to describe JScript programs. We chatted about it a bit last week and I'm still not 100% sure what it's for, but it sure looks interesting. Anyone who has more free time than I do and wants to check it out, contact him here.

OK, onward.

I've said before that we designed the script languages to make it easy to quickly write simple scripts to automate web page functionality, choose what HTML to serve up in ASP, or write better batch files, but that they were explicitly not designed to be fully-fledged high-performance production languages for solving really complex or data-intensive problems.

I stand by that statement. However, that's not to say that it's impossible to develop some pretty heavy-duty software in script. Rather, it's to say that if you're going to choose script come hell or high water, you need to understand its performance characteristics and choose your algorithms carefully.

There's a problem I frequently have at work. I have billions of bytes of source code on my machine, and sometimes I really want to know where a particular function is defined, but for whatever reason I don't know where the source code is and I'm not able to debug into it. There are several possible solutions to this problem:

· grep/findstr/etc -- do a linear search of the billions of bytes for the particular string. Trouble is, this is extremely time-consuming.

· Install Microsoft's index server. That would require me to understand how it all works, and it seems to be tied to web servers and all kinds of goo I don't want to mess with just to grep the sources. And where's the fun in that?

· Google Desktop looks like it does exactly what I want. However, I didn't hear about it until after I had already written much of the code I'll be talking about over the next few entries, and besides, it doesn't install on my machine because it is incompatible with… my firewall. Huh? Why is a text indexer incompatible with my network security? Weird. I'm not touching that thing until they sort that out.

· You guessed it -- write my own indexing system in script, and then blog about it.

Easy, right? I mean, I already blogged about a very similar problem: how to efficiently find the words that occur most/least often in a body of text. We could solve the indexing problem the same way. Recall that my algorithm to solve that problem was basically to build a series of hash tables mapping words to the number of times they occurred. We could just modify that algorithm to map from words to a list of the files in which they occur. Then the search becomes incredibly fast -- you just ask the hash table for the list of files that contain the keyword you're looking for, and you're done.
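To make that concrete, here is a minimal JScript sketch of the naive approach. The helper names, the use of the FileSystemObject, and the word-splitting regular expression are my own assumptions for illustration, not code from that earlier entry.

// Naive in-memory inverted index: maps each word to the set of files containing it.
var fso = new ActiveXObject("Scripting.FileSystemObject");
var index = {};   // word -> object used as a set of file paths

// Hypothetical helper: add every word in one file to the index.
function indexFile(path)
{
    var f = fso.OpenTextFile(path, 1);              // 1 = ForReading
    var text = f.AtEndOfStream ? "" : f.ReadAll();
    f.Close();
    var words = text.split(/\W+/);
    for (var i = 0; i < words.length; ++i)
    {
        var word = words[i].toLowerCase();
        if (word == "")
            continue;
        var key = "w$" + word;   // prefix keeps words clear of built-in object properties
        if (index[key] == null)
            index[key] = {};
        index[key][path] = true;
    }
}

// The search is then a single hash table probe.
function filesContaining(word)
{
    var files = index["w$" + word.toLowerCase()];
    var result = [];
    if (files == null)
        return result;
    for (var path in files)
        result.push(path);
    return result;
}

A couple of dozen lines, and the lookup itself costs practically nothing.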

The trouble is, you very quickly discover that this solution does not scale to hundreds of millions of words in hundreds of thousands of files. It's just too big. The hash table wouldn't fit into the virtual memory space. And the script engine hash tables (the Scripting.Dictionary, or any JScript object) were never designed to handle that kind of load. The garbage collection in JScript alone would destroy performance after the first few hundred thousand words.

Another thing you quickly discover is that doing fast searches on the index, once you've got it, is the comparatively easy part. Creating the index in the first place, and updating it as files change, is the hard problem. One of the first things that springs to mind is that the index is itself going to be huge, and should probably spend most of its life on disk. Another is that the files which make up the index should themselves be relatively small, so that we can process them individually without having to load the whole world into garbage-collected memory. Basically, the index needs to be "paged" rather than monolithic, just like virtual memory is paged in and out of physical memory as it is needed, and the index pages need to be organized in such a way that we touch very few of them to get to the page we want.
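Just to make the "paged" idea concrete, here is one possible on-disk layout -- a strawman of my own for illustration, not necessarily where this series will end up. Keep one small index file per two-character word prefix, with one word per line followed by the files it appears in, so that answering a query means opening exactly one small page:

// Hypothetical paging scheme: one small index file per word prefix,
// e.g. "pr.txt" holds the lines for "print", "process", "prototype", ...
// Each line in a page looks like: word<tab>file1|file2|file3
var fso = new ActiveXObject("Scripting.FileSystemObject");
var indexFolder = "C:\\index";   // assumed location of the index pages

function pagePath(word)
{
    var prefix = (word + "__").substr(0, 2);    // pad one-letter words
    return fso.BuildPath(indexFolder, prefix + ".txt");
}

// A lookup touches exactly one small file instead of one giant index.
function lookUp(word)
{
    word = word.toLowerCase();
    var path = pagePath(word);
    if (!fso.FileExists(path))
        return [];
    var f = fso.OpenTextFile(path, 1);          // 1 = ForReading
    while (!f.AtEndOfStream)
    {
        var fields = f.ReadLine().split("\t");
        if (fields[0] == word)
        {
            f.Close();
            return fields[1].split("|");
        }
    }
    f.Close();
    return [];
}

The lookup side is trivial; building and updating those pages efficiently, without blowing the memory budget, is where all the interesting problems are.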

I've been kicking around a bunch of different ways to solve this problem -- multiway balanced trees, binary/multiway external mergesort, etc -- and I thought I might blog about some of my experiments over the next few weeks. I'm still heads-down at work (and I've started working on another book!) so this will likely be updated pretty sporadically.
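By way of a teaser, here is the basic building block of the external mergesort idea -- merging two already-sorted text files into a third, one line at a time, so that memory consumption stays constant no matter how big the files get. Take this as my own illustration of the general technique, not a preview of the actual code to come.

// Binary external merge: combine two sorted text files into one sorted file,
// reading and writing a line at a time so memory use stays constant.
var fso = new ActiveXObject("Scripting.FileSystemObject");

function mergeSortedFiles(pathA, pathB, pathOut)
{
    var a = fso.OpenTextFile(pathA, 1);             // 1 = ForReading
    var b = fso.OpenTextFile(pathB, 1);
    var out = fso.CreateTextFile(pathOut, true);    // true = overwrite
    var lineA = a.AtEndOfStream ? null : a.ReadLine();
    var lineB = b.AtEndOfStream ? null : b.ReadLine();
    while (lineA != null || lineB != null)
    {
        if (lineB == null || (lineA != null && lineA <= lineB))
        {
            out.WriteLine(lineA);
            lineA = a.AtEndOfStream ? null : a.ReadLine();
        }
        else
        {
            out.WriteLine(lineB);
            lineB = b.AtEndOfStream ? null : b.ReadLine();
        }
    }
    a.Close();
    b.Close();
    out.Close();
}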