Weak vs. Strong typing

Yes, I already know what you're thinking, but this post isn't about programming languages. It's about search.

We live today in an ocean of information, far more than we can ever cope with. But in some sense this information is strongly typed: things belong to categories. And this "strong typing" aspect enables us to approach search in a more effective way. That's why I was troubled when I saw my first search engine (Yahoo, if I remember correctly). That was not what I wanted to use to search for stuff! Despite this initial reaction, I grew more and more accustomed to it. And I probably forgot what I wanted to see from a search engine, because after years of Yahoo-style unstructured search, it's now harder to think outside the box.

But I still feel that the right way to approach this problem should be radically different. There has to be a type for the items you want to search, or for the attributes that you use as a filter. What exactly this type is, I don't know. At this point, I just suppose that there needs to be a certain way to categorize objects into classes.

One way to approach strong typing is to define a schema. This is the WinFS approach, where the schema defines the type. This approach might work perfectly for a few dozen information classes. A hand-made schema can then be used to define different "search namespaces" for things like contacts, music files, documents, etc. And this approach might be just what we need to structure all the information on our desktop.
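To make the idea of "search namespaces" concrete, here is a minimal sketch in Python. It is not the WinFS API; the classes, the sample items and the typed_search helper are all made up for illustration. Each hand-written class plays the role of one schema type, and a search is always scoped to one of them.

```python
from dataclasses import dataclass

# Hypothetical hand-made schema: each class is one "search namespace".
@dataclass(frozen=True)
class Contact:
    name: str
    email: str

@dataclass(frozen=True)
class MusicFile:
    title: str
    artist: str

# Made-up sample data standing in for the items on a desktop.
ITEMS = [
    Contact("Ann Table", "ann@example.com"),
    MusicFile("Table for Two", "Some Band"),
    MusicFile("Blue Sky", "Ann Table"),
]

def typed_search(items, item_type, **filters):
    """Return items of item_type whose named attributes contain each filter value."""
    results = []
    for item in items:
        if not isinstance(item, item_type):
            continue  # wrong namespace, skip it
        if all(value.lower() in getattr(item, field).lower()
               for field, value in filters.items()):
            results.append(item)
    return results

# Only music files whose title mentions "table"; the contact named
# "Ann Table" is never considered because it has a different type.
songs = typed_search(ITEMS, MusicFile, title="table")
```

The point of the sketch is that the type does the first round of filtering before any text matching happens.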

But this "manually-defined schema" approach won't scale to Internet size (if you ask me, it might not even work at the desktop size - just think about versioning). After all, we are dealing with obscene amounts of horribly unstructured, highly dynamic and sometimes wildly inconsistent data. How are we going to define a strongly-typed schema at the Internet level? That's not going to work.

So, what if this search engine discovers these categories on the fly? And what if this type is actively used during the search? For example, let's assume that you start the search with the word "table". The search engine will display an initial results page as a courtesy, but more importantly it will ask you "Table as a piece of furniture, or table as a dataset?". You click on one of the two hyperlinks (let's say on "table as a dataset"), and the search engine will give you a refined results page. Next, it will ask you about variations on the idea: is this related to SQL? Or Excel? etc. At each step, the search engine gives you various semantic categories of whatever you are searching for.
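The first round of that dialogue could look roughly like the Python sketch below. The SENSE_INDEX contents, the page names and the search_with_senses function are all hypothetical, and the index is filled in by hand here rather than discovered. The sketch also stops after one refinement, whereas the real thing would keep asking.

```python
# Hypothetical, hand-filled index: term -> sense -> matching pages.
SENSE_INDEX = {
    "table": {
        "piece of furniture": ["oak-dining-table.html", "coffee-tables.html"],
        "dataset": ["sql-create-table.html", "excel-pivot-table.html"],
    },
}

def search_with_senses(term, sense=None):
    """Return (results, question).

    Without a sense, return every page for the term plus a question that
    offers the known senses. With a sense, return only that sense's pages
    and no further question (this sketch refines only once).
    """
    senses = SENSE_INDEX.get(term, {})
    if sense is None:
        results = [page for pages in senses.values() for page in pages]
        question = None
        if len(senses) > 1:
            question = " or ".join(f"{term} as a {s}" for s in senses) + "?"
        return results, question
    return list(senses.get(sense, [])), None

# First round: courtesy results plus the disambiguation question.
first_results, question = search_with_senses("table")
# Second round: the user clicked on "table as a dataset".
refined_results, _ = search_with_senses("table", "dataset")
```

Each answer from the user shrinks the result set, which is the whole benefit of carrying a type along with the query.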

Note that the fact that a table can be either a dataset or a piece of furniture would be automatically discovered at the time the search indexes are built. At that time, when our search engine crawls the Internet, it sees that 20% of the sites use the word "table" in a furniture-related context, and 10% in the context of SQL.
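Here is a rough Python sketch of how such percentages could be tallied at indexing time. It dodges the hard part: the contexts and their vocabularies in CONTEXT_WORDS are hand-picked assumptions, while a real engine would have to discover them by itself. The sample pages and the context_shares function are made up as well.

```python
from collections import Counter

# Assumption: each context comes with a small, hand-picked vocabulary.
CONTEXT_WORDS = {
    "furniture": {"wood", "chair", "dining", "oak"},
    "SQL": {"sql", "select", "column", "database"},
}

# Made-up crawl results, one string per page.
PAGES = [
    "an oak dining table with a matching chair",
    "create a table in sql and select a column",
    "the weather today is sunny",
    "a wood table for the garden",
    "football results from last night",
]

def context_shares(term, pages):
    """Fraction of all pages that use term together with each context's words."""
    if not pages:
        return {context: 0.0 for context in CONTEXT_WORDS}
    counts = Counter()
    for text in pages:
        words = set(text.lower().split())
        if term not in words:
            continue  # the page does not mention the term at all
        for context, vocabulary in CONTEXT_WORDS.items():
            if words & vocabulary:
                counts[context] += 1
    return {context: counts[context] / len(pages) for context in CONTEXT_WORDS}

shares = context_shares("table", PAGES)
```

With the sample pages above, two of the five pages mention "table" next to furniture words and one next to SQL words, so the shares come out as 0.4 and 0.2.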

The discussion really becomes interesting when you start thinking about how you could actually implement this stuff. But that's another story.

I hope that the search industry is still in its early stages. And I won't be surprised if, ten years from now, everybody sees the evolution from weakly-typed to strongly-typed search as a natural one, in exactly the same way we see today the benefits of strong typing in modern programming languages...