Customizing the Blog Crawler for different formats


I’ve had several requests that require customizing the Blog Crawler.


 


The entire source code of the Blog Crawler is available, so it can be modified to crawl blogs other than http://blogs.msdn.com.


Currently, it saves the entire HTML retrieved from a blog’s URL. It converts relative links to absolute like so:


From     href="/Themes/Blogs/hover/style/style.css"


To         href="http://blogs.msdn.com/Themes/Blogs/hover/style/style.css"
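

The crawler's own conversion code isn't shown here, so the following is only a minimal sketch of one way to do it, assuming the relative link is root-relative (like /Themes/...) and using System.Uri to resolve it against the page URL. The module and function names are made up for illustration.

```vbnet
Imports System

Module RelativeLinkSketch
    ' Hypothetical helper: resolve an href found on a page against that page's URL.
    Function ToAbsoluteUrl(ByVal cPageUrl As String, ByVal cHref As String) As String
        Dim oAbsolute As Uri = Nothing
        If Uri.TryCreate(New Uri(cPageUrl), cHref, oAbsolute) Then
            Return oAbsolute.AbsoluteUri
        End If
        Return cHref ' leave anything that cannot be resolved untouched
    End Function

    Sub Main()
        Console.WriteLine(ToAbsoluteUrl("http://blogs.msdn.com/calvin_hsia/", "/Themes/Blogs/hover/style/style.css"))
    End Sub
End Module
```

An href that is already absolute comes back unchanged, so the helper can be applied to every link on the page.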


 


This lets the web control render the page with its CSS references and keeps all the links on the page live. When the page is rendered, linked resources such as CSS files and images are retrieved as needed. This is fairly slow and requires an Internet connection.


 


The FoxPro version of the crawler actually saves the HTML page as an MHT file (from IE, File->Save As->Type->Web Archive, single file), which means all images and CSS are stored in the file, so no online content is retrieved and the pages render much faster. I might update the VB version to save as an MHT file, perhaps as an option. Which way would you prefer?


 


A blog endpoint is an actual blog post permalink. Several blog URLs are not endpoints: for example, a URL may point to a summary of postings for a month or for a category.


The crawler determines whether a URL is an endpoint by counting the number of “/” characters in it, using one line of code. This can be changed easily to accommodate other blogs.


        fIsPostedEntry = cUrl.Replace("/", "").Length + 8 = cUrl.Length   ' if there are 8 forward slashes, then it's a blog entry ("http://blogs.msdn.com/calvin_hsia/archive/2006/05/16/599108.aspx")
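

To adapt this test to another blog, the slash count could be passed in rather than hard-coded. This is just a sketch: the function names and the nSlashes parameter are hypothetical, and the second URL in Main is a made-up non-permalink used only to show a negative case.

```vbnet
Imports System

Module EndpointSketch
    ' Count "/" characters the same way the crawler's one-liner does.
    Function CountSlashes(ByVal cUrl As String) As Integer
        Return cUrl.Length - cUrl.Replace("/", "").Length
    End Function

    ' nSlashes is a hypothetical parameter: 8 matches blogs.msdn.com permalinks.
    Function IsPostedEntry(ByVal cUrl As String, ByVal nSlashes As Integer) As Boolean
        Return CountSlashes(cUrl) = nSlashes
    End Function

    Sub Main()
        Console.WriteLine(IsPostedEntry("http://blogs.msdn.com/calvin_hsia/archive/2006/05/16/599108.aspx", 8))
        Console.WriteLine(IsPostedEntry("http://blogs.msdn.com/calvin_hsia/archive/2006/05.aspx", 8))
    End Sub
End Module
```

For a different site, count the slashes in one of its permalinks and pass that number instead of 8.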


 


The crawler assumes that every blog entry starts with the same root URL, like “http://blogs.msdn.com/calvin_hsia”. It crawls the blog’s home page, like http://blogs.msdn.com/calvin_hsia, finds all the links with the same root, and adds them to a table. If a link is an endpoint, the page is saved into the table as well.


 


The way the crawler parses out the published date is probably very specific to http://blogs.msdn.com:


                Case "div"  ' Parse out the Publish date
                    If fIsPostedEntry Then
                        Dim oC As Object
                        oC = .Attributes.GetNamedItem("class")
                        ' AndAlso short-circuits, so oC.Value is never read when oC is Nothing
                        If Not oC Is Nothing AndAlso Not oC.Value Is Nothing Then
                            Dim cClass As String = oC.Value
                            If (cClass = "postfoot" OrElse cClass = "posthead") AndAlso Not .innerText Is Nothing Then
                                ' Trim first so IndexOf and Substring work on the same string
                                Dim cText As String = .innerText.Replace("Published", "").Trim()
                                If cText.Length > 0 Then
                                    Try
                                        ' Keep everything up to and including the AM/PM marker
                                        cText = cText.Substring(0, cText.IndexOf(CStr(IIf(cText.IndexOf("AM") > 0, "AM", "PM"))) + 2)
                                        dtPubDate = DateTime.Parse(cText)
                                        fGotPubDate = True
                                    Catch ex As Exception
                                        System.Diagnostics.Debug.WriteLine("Date parse err: " + ex.Message)
                                    End Try
                                End If
                            End If
                        End If
                    End If
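

For a blog that labels its dates differently, the same idea could be pulled out into a function that takes the label as a parameter. This is a rough sketch rather than the crawler's actual code: TryParsePubDate and cLabel are names I made up, the sample text in Main is only an assumed example of what a post footer might contain, and the invariant culture is an assumption that suits English-language dates.

```vbnet
Imports System
Imports System.Globalization

Module PubDateSketch
    ' cLabel is the text the blog puts in front of the date ("Published" on blogs.msdn.com).
    Function TryParsePubDate(ByVal cText As String, ByVal cLabel As String, ByRef dtPubDate As DateTime) As Boolean
        If cText Is Nothing Then Return False
        cText = cText.Replace(cLabel, "").Trim()
        ' Find the AM/PM marker; everything after it (such as the author name) is dropped
        Dim nPos As Integer = cText.IndexOf("AM", StringComparison.Ordinal)
        If nPos < 0 Then nPos = cText.IndexOf("PM", StringComparison.Ordinal)
        If nPos < 0 Then Return False
        Return DateTime.TryParse(cText.Substring(0, nPos + 2), CultureInfo.InvariantCulture, DateTimeStyles.None, dtPubDate)
    End Function

    Sub Main()
        Dim dt As DateTime
        ' Assumed sample footer text, for illustration only
        If TryParsePubDate("Published Tuesday, May 16, 2006 2:30 PM by Calvin_Hsia", "Published", dt) Then
            Console.WriteLine(dt.ToString("yyyy-MM-dd HH:mm"))
        End If
    End Sub
End Module
```

Using TryParse instead of Parse avoids an exception for every div that happens not to contain a date.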


To find links to endpoints, the crawler looks for “archive/2” (as in “archive/2006”) in the URL. Some links on my blog came from comment spam and needed to be filtered out too.


 


                Case "a"    ' it's a link
                    ' Lower-case first so that both "%5f" and "%5F" are replaced
                    cLink = .Attributes("href").value.ToString.ToLower.Replace("%5f", "_")
                    If cLink.StartsWith(cBlogUrl) AndAlso cLink <> cCurrentLink Then    ' if it's to the blog
                        If (Not cLink.Contains("#")) AndAlso cLink.Contains("archive/2") Then   ' like archive/2006
                            If cLink.Contains("<") OrElse cLink.Contains("%") Then  ' some comment spam: skip it
                            Else
                                ' << got good link >>
                            End If
                        End If
                    End If
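

The same filter can be written as a standalone function, which makes it easier to try against another blog's URLs. Treat this as a sketch: IsGoodLink and the cArchiveMarker parameter are hypothetical names, and cBlogUrl and cCurrentLink are assumed to be lower-case already.

```vbnet
Imports System

Module LinkFilterSketch
    ' cArchiveMarker is a hypothetical parameter; "archive/2" matches blogs.msdn.com permalinks.
    Function IsGoodLink(ByVal cHref As String, ByVal cBlogUrl As String, ByVal cCurrentLink As String, ByVal cArchiveMarker As String) As Boolean
        If cHref Is Nothing Then Return False
        Dim cLink As String = cHref.ToLower().Replace("%5f", "_")
        ' Must belong to this blog and must not be the page being crawled
        If Not cLink.StartsWith(cBlogUrl, StringComparison.Ordinal) OrElse cLink = cCurrentLink Then Return False
        ' Skip in-page anchors and anything that is not under the archive path
        If cLink.Contains("#") OrElse Not cLink.Contains(cArchiveMarker) Then Return False
        ' Leftover "<" or "%" characters usually meant comment spam
        Return Not (cLink.Contains("<") OrElse cLink.Contains("%"))
    End Function

    Sub Main()
        Dim cRoot As String = "http://blogs.msdn.com/calvin_hsia"
        Console.WriteLine(IsGoodLink("http://blogs.msdn.com/Calvin%5FHsia/archive/2006/05/16/599108.aspx", cRoot, cRoot, "archive/2"))
        Console.WriteLine(IsGoodLink("http://blogs.msdn.com/calvin_hsia/archive/2006/05/16/599108.aspx#comments", cRoot, cRoot, "archive/2"))
    End Sub
End Module
```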


 


Changing the code to work with blogs other than http://blogs.msdn.com means seeing how much they differ in format and changing these areas of code.


 


For example, http://blogs is an internal Microsoft blogging site. It says “Posted on” rather than “Published on”, so that would need to be changed.