Build a Custom Federated Search Connector in Microsoft Search Server (and SharePoint) – Solve Problems and Extend Your Ideas


I assume readers of this article already understand what federated search is. So we already know that to use the Federated Search web part in Search Server, you need to give it an RSS feed (the "OpenSearch" approach).

But not every application you want to search returns an RSS or Atom feed. Google, Baidu, and many other websites do not. So how can you federate search results from sites like these?

MSDN covers this case (quoted below):

http://msdn2.microsoft.com/en-us/library/bb931083.aspx

Scenario 2: Connecting to an External Search Site That Returns Results in HTML Format

Scenario background: The site is configured to use Anonymous access.

Possible solution: Use a Web application outside of the context of a SharePoint site, which contains a lightweight ASPX page that does the following:

  1. Submits a search request to the site by using the search terms passed in the initial request URL.

  2. Converts the results in the HTML response received from the external search site to RSS format.

  3. Returns the RSS XML in the response to the search server.

In this scenario, the federated connector’s Web application could be located on a remote server; however, a simpler solution is to create the Web application within the _layouts folder for the SharePoint site. For more information about creating this type of Web application, see How to: Modify Configuration Settings for an Application to Coexist with Windows SharePoint Services.

In a variation for this federated connector solution, you can add support for multiple external search sites by modifying the ASPX page to include details for more than one site within a case statement. The query template specified for these locations could then include a custom parameter that specifies which site in the case statement receives the federated query. Another variation is to combine the results for multiple external search providers, incorporating logic to order the results based on relevance.
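The case-statement variation quoted above might look roughly like the following sketch. The `site` parameter name, the `FederatedTargets` class, and the `search.example.com` provider are assumptions made up for illustration; only the Baidu URL comes from this article.

```csharp
using System;

public static class FederatedTargets
{
    // Maps a hypothetical "site" query parameter to the external search URL.
    // encodedTerms must already be percent-encoded for the target site.
    public static string BuildSearchUrl(string site, string encodedTerms)
    {
        switch ((site ?? string.Empty).ToLowerInvariant())
        {
            case "baidu":
                return "http://www.baidu.com/s?wd=" + encodedTerms;
            case "example":
                // Placeholder second provider; replace with a real search URL.
                return "http://search.example.com/?q=" + encodedTerms;
            default:
                throw new ArgumentException("Unknown site: " + site, "site");
        }
    }
}
```

With something like this in place, the query template for each federated location could carry the extra parameter, for example `Search.aspx?site=baidu&q={searchTerms}` (the page name is again hypothetical).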

Some people have already done a nice job with this, for example Andrew Woodward:

http://www.21apps.com/2008/01/search-server-2008-federated-sites-that.html

I would like to go a little further. Here I take Baidu as an example. Baidu is the biggest Internet search engine in China. (Google China? God knows where they are. Baidu has introduced many interesting applications that Chinese users love. Google China, on the other hand, is mostly famous for stealing the input method dictionary of another major Internet company, SOHU, to build its own Pinyin input method. After this was exposed to the public, they made a not-so-honest "apology" and said two interns had done it. That later became a popular phrase in China: anyone caught doing something evil says it was the fault of an intern or a temporary employee. What a shame for a "don't be evil" company. A little off topic.)

Baidu.com does not return an RSS feed. What's more, it serves its results in the GB2312 encoding. If you run a regex directly over the Baidu response without decoding it as GB2312 first, you will capture meaningless squares.

ASP.NET's Request.QueryString also has a limitation here: it decodes values using the application's configured request encoding (UTF-8 by default), so it cannot correctly process GB2312. The Page_Load method must therefore be changed to the following code:

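protected void Page_Load(object sender, EventArgs e)
    {
        if (Request.QueryString["q"] != null)
        {
            // Keep the raw, still percent-encoded query string instead of the
            // decoded Request.QueryString value, so it can be re-encoded later.
            // "query" is a string field declared on the page class.
            query = Request.Url.Query;
            // Strip the leading "?q=" (this assumes q is the first parameter).
            query = query.Remove(0, 3);
        }
    }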

This keeps the raw query string, so you can decode and re-encode it yourself. If you use QueryString, it decodes the value with the wrong character set, and the result is a disaster. Stupid, stupid, stupid. I want to slap the guy who wrote this method. Does he not know there is more than English in this world?

For example, my nickname, opal, is 猫眼石 in Chinese. When queried from IE, it is encoded as UTF-8, but Baidu can only consume GB2312.
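Removing the first three characters only works when q is the first parameter in the URL. A slightly more defensive sketch is shown below; the `RawQuery` class and `GetRawValue` method are hypothetical helpers, not part of ASP.NET. It splits the raw query string and returns the value still percent-encoded, whatever its position.

```csharp
using System;

public static class RawQuery
{
    // Returns the still percent-encoded value of parameter `name`,
    // or null when the parameter is not present in the raw query string.
    public static string GetRawValue(string rawQuery, string name)
    {
        if (string.IsNullOrEmpty(rawQuery)) return null;

        // Uri.Query includes the leading '?', so drop it before splitting.
        string trimmed = rawQuery.TrimStart('?');
        foreach (string pair in trimmed.Split('&'))
        {
            int eq = pair.IndexOf('=');
            string key = eq < 0 ? pair : pair.Substring(0, eq);
            if (string.Equals(key, name, StringComparison.OrdinalIgnoreCase))
            {
                return eq < 0 ? string.Empty : pair.Substring(eq + 1);
            }
        }
        return null;
    }
}
```

Inside Page_Load this could be called as `RawQuery.GetRawValue(Request.Url.Query, "q")` in place of the `Remove(0, 3)` call.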

In UTF-8, 猫眼石 is %E7%8C%AB%E7%9C%BC%E7%9F%B3.

In GB2312, 猫眼石 is %C3%A8%D1%DB%CA%AF.

These are quite different. If you search for %E7%8C%AB%E7%9C%BC%E7%9F%B3 and it is treated as a GB2312 string, the nine bytes become 4.5 two-byte Chinese characters, and none of them make sense.

Okay, complain less, do more. Next we need to decode the query string.
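To check encoded forms like these yourself, a small helper along the following lines can print the percent-encoded bytes of a string in any encoding. `PercentEncoding.Encode` is a hypothetical name used for illustration. Unlike HttpUtility.UrlEncode, it escapes every byte, ASCII letters included, which is fine for inspection.

```csharp
using System;
using System.Text;

public static class PercentEncoding
{
    // Percent-encodes every byte of `text` in the given encoding,
    // using uppercase hex digits so the output matches the values above.
    public static string Encode(string text, Encoding encoding)
    {
        StringBuilder sb = new StringBuilder();
        foreach (byte b in encoding.GetBytes(text))
        {
            sb.Append('%').Append(b.ToString("X2"));
        }
        return sb.ToString();
    }
}
```

Calling `PercentEncoding.Encode("\u732B\u773C\u77F3", Encoding.UTF8)` (the escapes spell 猫眼石) yields the nine-byte UTF-8 form. Passing `Encoding.GetEncoding("gb2312")` instead should yield the six-byte GB2312 form. On the .NET Framework that encoding is built in; on newer .NET runtimes it has to be enabled first with `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)`.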

 // Requires: using System.Net; using System.Text; using System.Text.RegularExpressions; using System.Web;
 private string getRssItemXml(string query)
    {
        Encoding gb2312 = Encoding.GetEncoding("gb2312");

        // First decode the query as UTF-8, because IE sends UTF-8 encoded strings to a UTF-8 based website.
        // You could change web.config to make this application use GB2312 instead, but that doesn't make sense.
        query = HttpUtility.UrlDecode(query, Encoding.UTF8);
        // Then re-encode it as GB2312, the only encoding Baidu can consume.
        query = HttpUtility.UrlEncode(query, gb2312);
        string url = string.Format("http://www.baidu.com/s?wd={0}", query);

        string strData;
        using (WebClient client = new WebClient())
        {
            byte[] byteData = client.DownloadData(url);
            // The returned results are also in GB2312, so decode the bytes explicitly.
            strData = gb2312.GetString(byteData);
        }

        // Captures the link, title, and description of each result from Baidu's HTML.
        Regex searchPattern = new Regex("\\)\" href=\"(?<link>.*?)\" target=\"_blank\"><font size=\"3\">(?<title>.*?)</font></a><br>(?<desc>.*?)<br>");
        StringBuilder sb = new StringBuilder();

        foreach (Match m in searchPattern.Matches(strData))
        {
            // CDATA keeps any HTML markup inside the captured text from breaking the XML.
            sb.AppendFormat("<item><title><![CDATA[{0}]]></title><link><![CDATA[{1}]]></link><description><![CDATA[{2}]]></description></item>", m.Groups["title"].Value, m.Groups["link"].Value, m.Groups["desc"].Value);
        }

        return sb.ToString();
    }
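The method above only produces the `<item>` elements. An RSS 2.0 document also needs the surrounding `<rss>` and `<channel>` elements with a title, link, and description. A minimal sketch of that wrapper follows; the `RssEnvelope` class and `Wrap` method are hypothetical names, not code from the download below.

```csharp
using System.Security;
using System.Text;

public static class RssEnvelope
{
    // Wraps pre-built <item> XML in a minimal RSS 2.0 channel.
    // The channel fields are XML-escaped; itemsXml is inserted as-is.
    public static string Wrap(string title, string link, string description, string itemsXml)
    {
        StringBuilder sb = new StringBuilder();
        sb.Append("<?xml version=\"1.0\" encoding=\"utf-8\"?>");
        sb.Append("<rss version=\"2.0\"><channel>");
        sb.AppendFormat("<title>{0}</title>", SecurityElement.Escape(title));
        sb.AppendFormat("<link>{0}</link>", SecurityElement.Escape(link));
        sb.AppendFormat("<description>{0}</description>", SecurityElement.Escape(description));
        sb.Append(itemsXml);
        sb.Append("</channel></rss>");
        return sb.ToString();
    }
}
```

The page would then presumably set `Response.ContentType = "text/xml"` and `Response.ContentEncoding = Encoding.UTF8` before writing the wrapped string, so the response matches the encoding named in the XML declaration.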
 

Then put this ASPX file on a website and have your federated search web part point to it, for example http://www.abc.com/Baidu.aspx?q={searchTerms}, and you get Baidu federated search in Microsoft Search Server 2008.

I put part of my work here:

http://cid-8007edf5c56fc334.skydrive.live.com/self.aspx/Microsoft%20Search%20Server/CaptureWeb.rar

It contains:

Baidu Federated Search Web Service

Baidu News Federated Search Web Service

iCiba (English-Chinese Dictionary) Federated Search Web Service

Dictionary.com Federated Search Web Service

Yes! You can put dictionaries on your federated search page, so anybody who searches for a word gets its meaning immediately. You can also set up triggers so this happens only for numbers, only for certain characters, and so on.
