My experiments with SPS Search and Filtering

It's been a long time since my last log. I have been busy learning SPS search filters and protocol handlers, and reading others' blogs rather than writing. One thing before I start off with the SPS stuff: the timeout attribute I mentioned in my last blog works only with sync calls and does not work with async calls/callbacks. I stand corrected on that, thanks to Roy Chastain.

As soon as I got my feet wet (which basically meant I had installed SPS server), Rahul pushed me into the dark alleys of search, custom filters and protocol handlers; I am not really sure what I did that irked him so much. Coming from the COM world, where everything that can be thought of was already documented, I found the muddy waters of IFilters and protocol handlers about as clear as the swamps of Florida, which I had visited last weekend. So I decided to write my own implementation of an IFilter and a protocol handler. I am done with the filter part now and will be working on the protocol handler in the coming weeks.

I guess the real problem for someone like me, who has not dealt much with searching and crawling components, is that it's difficult to imagine where and how our component fits in, and who calls our methods and when. Well, here is my take on what I understood of it: SPS uses the same IFilter-based filters as MS Indexing Server to filter out content; that part was simple enough.

The whole design of SPS search is divided into two individual parts:
1. CRAWLING: In this phase the server crawls a content source, e.g. a file share, an .html page etc. Crawling means that it tries to extract the text and properties of each document and index them in a flat file index. The crawler calls a protocol handler based on the type of content source; the protocol handler knows how to open that content source and then calls the IFilter implementation for the respective file type to get its properties and text.
2. SEARCHING: Once a content source has been indexed, any search can be performed on this index. Isn't this cool?
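
To make the crawling flow a bit more concrete, here is a rough sketch in plain C++ with no COM at all. Everything in it is made up for illustration: Filter, FilterRegistry, Document, extension_of and crawl are hypothetical names, and the real crawler, protocol handler and IFilter interfaces look nothing like this. It only shows the idea of picking a filter by file extension and storing whatever text it returns.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a filter: raw file content in, indexable text out.
using Filter = std::function<std::string(const std::string&)>;

// Hypothetical registry: file extension (with the dot) -> filter for that type.
using FilterRegistry = std::map<std::string, Filter>;

// Hypothetical document, as a protocol handler might hand it over.
struct Document {
    std::string name;     // file name, e.g. "users.ant"
    std::string content;  // raw content of the file
};

// Returns the extension including the dot, or "" when the name has none.
std::string extension_of(const std::string& name) {
    const std::string::size_type dot = name.rfind('.');
    return dot == std::string::npos ? std::string() : name.substr(dot);
}

// Toy crawl: for each document, look up the filter registered for its
// extension and store the filtered text in a name -> text "index".
std::map<std::string, std::string> crawl(const std::vector<Document>& source,
                                         const FilterRegistry& filters) {
    std::map<std::string, std::string> index;
    for (const Document& doc : source) {
        const auto it = filters.find(extension_of(doc.name));
        if (it == filters.end()) {
            continue;  // no filter registered for this extension: not indexed
        }
        assert(it->second && "a registered filter must be callable");
        index[doc.name] = it->second(doc.content);
    }
    return index;
}
```

Registering a lambda under ".ant" and calling crawl on a couple of Document values is enough to see that files without a registered filter simply never make it into the index.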

Jumping straight to filters. Filters are basically written to extract the content of a particular file type so it can be indexed efficiently. For example, if my app's document has a .ant extension and stores the names of users in a struct, I would need a filter to open the file, read the structures, get the name out of each struct and pass it to the indexer. That is the real power of IFilters.
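
Just to picture the .ant example, here is a minimal sketch of the "read the structures and get the names out" part. The record layout is pure assumption on my side (a 32-byte name plus an id); AntUserRecord, append_record and extract_names are hypothetical names, and a real filter would do this work behind the IFilter methods rather than in a free function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical fixed-size record of the imaginary .ant format.
struct AntUserRecord {
    char name[32];          // user name, NUL-padded
    std::uint32_t user_id;  // present in the file, ignored by this sketch
};

// Demo helper: serialize one record and append it to a byte buffer.
void append_record(std::vector<unsigned char>& bytes,
                   const std::string& name, std::uint32_t user_id) {
    AntUserRecord rec{};  // zero-filled, so the name ends up NUL-padded
    assert(name.size() <= sizeof rec.name);
    std::memcpy(rec.name, name.data(), name.size());
    rec.user_id = user_id;
    const unsigned char* raw = reinterpret_cast<const unsigned char*>(&rec);
    bytes.insert(bytes.end(), raw, raw + sizeof rec);
}

// Walks the buffer record by record and collects the user names.
// Trailing bytes that do not form a whole record are ignored.
std::vector<std::string> extract_names(const std::vector<unsigned char>& bytes) {
    std::vector<std::string> names;
    const std::size_t count = bytes.size() / sizeof(AntUserRecord);
    for (std::size_t i = 0; i < count; ++i) {
        AntUserRecord rec;
        std::memcpy(&rec, bytes.data() + i * sizeof rec, sizeof rec);
        // A name that fills all 32 bytes has no terminating NUL.
        const void* nul = std::memchr(rec.name, '\0', sizeof rec.name);
        const std::size_t len =
            nul ? static_cast<std::size_t>(static_cast<const char*>(nul) - rec.name)
                : sizeof rec.name;
        names.emplace_back(rec.name, len);
    }
    return names;
}
```

The names that come back from extract_names are what the filter would then hand over as text for the indexer.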

Now, how do we write one? Heh heh, all right, for all those code junkies (me included), here is a good MSDN link for more technical details: https://msdn.microsoft.com/library/default.asp?url=/library/en-us/odc_SP2003_ta/html/ODC_HowToWriteaFilter.asp?frame=true

To summarize: the protocol handler opens a content source, gets to a file and, based on its extension, calls the filter that has been registered for that extension to index it. So we are basically interested in these 4 methods in a filter:
1. Init: This is where you would do initialization for your filter.
2. GetChunk: This method gets called to read the file and get the next chunk of text or properties.
3. GetText: Once a chunk has been read, this is called if the chunk type we set in GetChunk is text, till it returns FILTER_E_NO_MORE_TEXT.
4. GetValue: Likewise, this is called if the chunk type we set in GetChunk is a property value, till it returns FILTER_E_NO_MORE_VALUES.
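
The calling pattern behind these 4 methods can be sketched as below. This is a toy model, not the real interface: ToyFilter, Chunk, ChunkKind, Status and drain are hypothetical names, the Status values are simplified stand-ins for the FILTER_E_* codes, and the real methods take STAT_CHUNK, WCHAR buffers and PROPVARIANT arguments instead of std::string. It only shows who calls what, and when the loops stop.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for the real FILTER_E_* result codes.
enum class Status { Ok, EndOfChunks, NoMoreText, NoMoreValues };
enum class ChunkKind { Text, Value };

struct Chunk {
    ChunkKind kind;
    std::string payload;
};

class ToyFilter {
public:
    // Init: reset the state and load the chunks this filter will hand out.
    void Init(std::vector<Chunk> chunks) {
        chunks_ = std::move(chunks);
        next_ = 0;
        consumed_ = true;  // nothing is current until GetChunk succeeds
    }

    // GetChunk: advance to the next chunk and report its kind.
    Status GetChunk(ChunkKind& kind) {
        if (next_ >= chunks_.size()) return Status::EndOfChunks;
        current_ = next_++;
        consumed_ = false;
        kind = chunks_[current_].kind;
        return Status::Ok;
    }

    // GetText: hand out the text of the current chunk exactly once.
    // (The toy folds the "wrong chunk kind" case into the same code.)
    Status GetText(std::string& out) {
        if (consumed_ || chunks_[current_].kind != ChunkKind::Text) return Status::NoMoreText;
        assert(current_ < chunks_.size());
        out = chunks_[current_].payload;
        consumed_ = true;
        return Status::Ok;
    }

    // GetValue: hand out the property value of the current chunk exactly once.
    Status GetValue(std::string& out) {
        if (consumed_ || chunks_[current_].kind != ChunkKind::Value) return Status::NoMoreValues;
        assert(current_ < chunks_.size());
        out = chunks_[current_].payload;
        consumed_ = true;
        return Status::Ok;
    }

private:
    std::vector<Chunk> chunks_;
    std::size_t next_ = 0;
    std::size_t current_ = 0;
    bool consumed_ = true;
};

// Plays the indexer: GetChunk until the end, and for every chunk call
// GetText or GetValue (depending on its kind) until it stops returning Ok.
std::vector<std::string> drain(ToyFilter& filter) {
    std::vector<std::string> seen;
    ChunkKind kind = ChunkKind::Text;
    while (filter.GetChunk(kind) == Status::Ok) {
        std::string piece;
        if (kind == ChunkKind::Text) {
            while (filter.GetText(piece) == Status::Ok) seen.push_back("text:" + piece);
        } else {
            while (filter.GetValue(piece) == Status::Ok) seen.push_back("value:" + piece);
        }
    }
    return seen;
}
```

The point to take away is that the filter never drives anything itself: it just keeps enough state to answer the next call, and the caller decides when to move on to the next chunk.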

That is it. That is all it takes to write a filter (well, not really: you also need a good understanding of COM, pointers, variant data types and C++ programming). The simplest way to get started is the simpFilt sample that comes with the SDK. To install the filter:
1. Compile it and regsvr32 the dll.
2. In Regedit go to HKLM\software\Microsoft\SPSSearch\ContentIndexCommon\Filters, add the name of the extension your filter will be crawling, and in the default value put the CLSID of your dll.
3. From the SPS portal site go to Site Settings->Configure Search and Indexing->Include File type and add the extension for your filter.
4. Now add a content source where files with the registered extension are stored and do a full crawl, and voila, your filter is up and running.

Probably the best way to get a deeper look is to open the filter project in VS.NET, attach to the mssdmn.exe process and put breakpoints on those 4 methods.

Currently I am writing an IFilter for Windows Media files (wmv and wma). It's gonna be pretty cool once it's ready. I had some trouble with the Media Format SDK, but a couple of links from Santosh later I was reading those properties like Arial Font 16. My filter can crawl the document and get the properties out, but the indexer somehow is not indexing them properly. I hope to sort that out this week.

Wow, this log turned out to be pretty long, but I will post the Media filter as soon as it's done.