Customizing the Blog Crawler for different formats

06/15/2006

I’ve had several requests that require customizing the Blog Crawler.

The entire source code of the Blog Crawler is available, so it can be modified to crawl blogs other than those on https://blogs.msdn.com.

Currently, it saves the entire HTML retrieved from a blog’s URL. It converts relative links to absolute like so:

From href="/Themes/Blogs/hover/style/style.css"

To href="https://blogs.msdn.com/Themes/Blogs/hover/style/style.css"

This allows the web control to render the page with its CSS references and makes all the links on the page live. When the page is rendered, linked resources such as CSS and images are retrieved as needed. This is fairly slow and requires an online connection.
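Here's a minimal sketch of that kind of relative-to-absolute conversion, using System.Uri. It is not the crawler's actual code: the module and function names (AbsoluteLinkSketch, ToAbsolute) are made up for illustration, and it assumes each href is resolved against the URL of the page it was found on.

```vbnet
Imports System

Module AbsoluteLinkSketch
    ' Resolve an href against the URL of the page it was found on.
    ' An href that is already absolute comes back unchanged.
    Function ToAbsolute(pageUrl As String, href As String) As String
        Dim resolved As Uri = Nothing
        If Uri.TryCreate(New Uri(pageUrl), href, resolved) Then
            Return resolved.AbsoluteUri
        End If
        Return href ' leave anything that cannot be parsed untouched
    End Function

    Sub Main()
        Dim page As String = "https://blogs.msdn.com/calvin_hsia"
        Console.WriteLine(ToAbsolute(page, "/Themes/Blogs/hover/style/style.css"))
        ' prints https://blogs.msdn.com/Themes/Blogs/hover/style/style.css
    End Sub
End Module
```

Letting Uri do the resolution also covers hrefs like "../style.css" or "images/x.gif", which simple string concatenation would get wrong.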

The FoxPro version of the crawler actually saves the HTML page as an MHT file (from IE, File->Save As->Type->Web Archive, single file). This means all images and CSS are stored in the file, so no online content is retrieved and the pages render much faster. I might update the VB version to save as an MHT file, perhaps as an option. Which way would you prefer?

A blog endpoint is an actual blog post permalink. Many blog URLs are not endpoints: for example, they may be a summary of postings for a month, for a category, etc.

The crawler determines whether a URL is an endpoint by counting the number of “/” characters in it, with one line of code. This can be changed easily to accommodate other blogs.

fIsPostedEntry = cUrl.Replace("/", "").Length + 8 = cUrl.Length ' if there are 8 forward slashes, then it's a blog entry ("https://blogs.msdn.com/calvin_hsia/archive/2006/05/16/599108.aspx")
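To adapt this to another blog, the slash count could be pulled out into a parameter. The following is only a sketch: CountSlashes, IsPostedEntry and the expectedSlashes parameter are hypothetical names, not part of the crawler.

```vbnet
Imports System

Module EndpointSketch
    ' Number of "/" characters in the URL, computed the same way as the crawler's one-liner.
    Function CountSlashes(url As String) As Integer
        Return url.Length - url.Replace("/", "").Length
    End Function

    ' expectedSlashes is a hypothetical per-blog setting; 8 matches blogs.msdn.com permalinks.
    Function IsPostedEntry(url As String, Optional expectedSlashes As Integer = 8) As Boolean
        Return CountSlashes(url) = expectedSlashes
    End Function

    Sub Main()
        Console.WriteLine(IsPostedEntry("https://blogs.msdn.com/calvin_hsia/archive/2006/05/16/599108.aspx")) ' True
        Console.WriteLine(IsPostedEntry("https://blogs.msdn.com/calvin_hsia")) ' False
    End Sub
End Module
```

A blog whose permalinks sit at a different depth would just pass a different expectedSlashes value.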

The crawler assumes that every blog entry starts with the same root URL, like “https://blogs.msdn.com/calvin_hsia”. It crawls the blog’s home page, like https://blogs.msdn.com/calvin_hsia, finds all the links with the same root, and adds them to a table. If a link is an endpoint, the page is saved into the table as well.
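The same-root rule can be sketched as a simple breadth-first walk. This is an illustration under assumptions, not the crawler's code: the getLinks delegate stands in for "download the page and extract its hrefs", the in-memory site dictionary is made up, and the real crawler stores its results in a table rather than a list.

```vbnet
Imports System
Imports System.Collections.Generic

Module CrawlSketch
    ' Visit every page reachable from rootUrl through links that share the same root.
    Function Crawl(rootUrl As String,
                   getLinks As Func(Of String, IEnumerable(Of String))) As List(Of String)
        Dim seen As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase)
        Dim pending As New Queue(Of String)()
        Dim visited As New List(Of String)()
        seen.Add(rootUrl)
        pending.Enqueue(rootUrl)
        While pending.Count > 0
            Dim url As String = pending.Dequeue()
            visited.Add(url)
            For Each link As String In getLinks(url)
                ' Follow only links under the blog's root, and each of them at most once
                If link.StartsWith(rootUrl, StringComparison.OrdinalIgnoreCase) AndAlso seen.Add(link) Then
                    pending.Enqueue(link)
                End If
            Next
        End While
        Return visited
    End Function

    Sub Main()
        Dim root As String = "https://blogs.msdn.com/calvin_hsia"
        Dim post As String = root & "/archive/2006/05/16/599108.aspx"
        ' A made-up two-page site standing in for real downloads
        Dim site As New Dictionary(Of String, String())()
        site.Add(root, New String() {post, "https://example.com/elsewhere"})
        site.Add(post, New String() {root})
        Dim getLinks As Func(Of String, IEnumerable(Of String)) =
            Function(u) If(site.ContainsKey(u), site(u), New String() {})
        For Each u As String In Crawl(root, getLinks)
            Console.WriteLine(u) ' the home page, then the post; example.com is skipped
        Next
    End Sub
End Module
```

Each visited URL would then be run through the endpoint check to decide whether the page itself gets saved.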

The way the crawler parses out the published date is probably quite specific to https://blogs.msdn.com:

Case "div" ' Parse out the Publish date
    If fIsPostedEntry Then
        Dim oC As Object
        oC = .Attributes.GetNamedItem("class")
        ' AndAlso short-circuits, so oC.Value is never read when oC is Nothing
        If Not oC Is Nothing AndAlso Not oC.Value Is Nothing Then
            Dim cClass As String = oC.Value
            If (cClass = "postfoot" Or cClass = "posthead") AndAlso Not .innerText Is Nothing Then
                ' Trim first, so the AM/PM index below refers to the same string that gets cut
                Dim cText As String = .innerText.Replace("Published", "").Trim
                If cText.Length > 0 Then
                    Try
                        ' Keep everything up to and including the AM/PM marker
                        cText = cText.Substring(0, cText.IndexOf(CStr(IIf(cText.IndexOf("AM") > 0, "AM", "PM"))) + 2)
                        dtPubDate = DateTime.Parse(cText)
                        fGotPubDate = True
                    Catch ex As Exception
                        System.Diagnostics.Debug.WriteLine("Date parse err: " + ex.Message)
                    End Try
                End If
            End If
        End If
    End If

To find links to endpoints, the crawler looks for “archive/2” (as in “archive/2006”) in the URL. There were also some links on my blog from comment spam, which needed to be filtered out.

Case "a" ' it's a link
    cLink = .Attributes("href").value.Replace("%5f", "_").ToString.ToLower
    If cLink.StartsWith(cBlogUrl) And cLink <> cCurrentLink Then ' if it's to the blog
        If (Not cLink.Contains("#")) And cLink.Contains("archive/2") Then ' like archive/2006
            If cLink.Contains("<") OrElse cLink.Contains("%") Then ' some comment spam
                ' skip it
            Else
                << got good link >>
            End If
        End If
    End If
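For another blog, the link rules could be gathered into one function so that only the marker string needs to change. This is a sketch, not the crawler's code: IsEntryLink and its marker parameter are hypothetical, and unlike the excerpt above it lower-cases the link before undoing the %5f escape.

```vbnet
Imports System

Module LinkFilterSketch
    ' marker is a hypothetical parameter; "archive/2" is the value used for blogs.msdn.com.
    Function IsEntryLink(href As String, blogUrl As String, currentUrl As String,
                         Optional marker As String = "archive/2") As Boolean
        ' Lower-case first so that both %5f and %5F are turned back into "_"
        Dim link As String = href.ToLowerInvariant().Replace("%5f", "_")
        If Not link.StartsWith(blogUrl.ToLowerInvariant(), StringComparison.Ordinal) Then Return False
        If link = currentUrl.ToLowerInvariant() Then Return False
        If link.Contains("#") OrElse Not link.Contains(marker) Then Return False
        ' Leftover "<" or "%" characters were a sign of comment spam
        Return Not (link.Contains("<") OrElse link.Contains("%"))
    End Function

    Sub Main()
        Dim blog As String = "https://blogs.msdn.com/calvin_hsia"
        Dim post As String = blog & "/archive/2006/05/16/599108.aspx"
        Console.WriteLine(IsEntryLink(post, blog, blog))                 ' True
        Console.WriteLine(IsEntryLink(post & "#comments", blog, blog))   ' False
        Console.WriteLine(IsEntryLink(post & "?x=%3cb%3e", blog, blog))  ' False
    End Sub
End Module
```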

Changing the code to work with blogs other than https://blogs.msdn.com means seeing how much they differ in format and changing these areas of code.

For example, https://blogs is an internal Microsoft blogging site. It says “Posted on” rather than “Published on”, so the text that the date parser strips out would need to be changed.
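One way to handle that difference is to pass the label in rather than hard-coding it. Here's a minimal sketch under assumptions: TryParsePubDate and its label parameter are hypothetical, the two footer strings are made up, and it uses DateTime.TryParse with the invariant culture instead of a Try/Catch around DateTime.Parse.

```vbnet
Imports System
Imports System.Globalization

Module PubDateSketch
    ' label is the hypothetical per-site prefix, e.g. "Published" or "Posted on".
    Function TryParsePubDate(footerText As String, label As String, ByRef pubDate As DateTime) As Boolean
        If String.IsNullOrEmpty(footerText) Then Return False
        Dim cleaned As String = footerText.Replace(label, "").Trim()
        Dim pos As Integer = cleaned.IndexOf("AM", StringComparison.Ordinal)
        If pos < 0 Then pos = cleaned.IndexOf("PM", StringComparison.Ordinal)
        If pos < 0 Then Return False
        ' Keep everything up to and including the AM/PM marker
        Return DateTime.TryParse(cleaned.Substring(0, pos + 2), CultureInfo.InvariantCulture,
                                 DateTimeStyles.None, pubDate)
    End Function

    Sub Main()
        Dim d As DateTime
        ' Both footer strings are made up for illustration
        If TryParsePubDate("Published Tuesday, May 16, 2006 10:30 AM by someone", "Published", d) Then
            Console.WriteLine(d.ToString("yyyy-MM-dd HH:mm", CultureInfo.InvariantCulture)) ' 2006-05-16 10:30
        End If
        If TryParsePubDate("Posted on Tuesday, May 16, 2006 10:30 AM", "Posted on", d) Then
            Console.WriteLine(d.ToString("yyyy-MM-dd HH:mm", CultureInfo.InvariantCulture)) ' 2006-05-16 10:30
        End If
    End Sub
End Module
```

A site that formats its dates differently (no AM/PM marker, for instance) would need more than a new label.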
