I’ve had several requests that require customizing the Blog Crawler.
Currently, it saves the entire HTML retrieved from a blog’s URL, converting relative links to absolute ones.
This lets the web control render the page with its CSS references and makes all the links on the page live. When the page is rendered, linked resources such as CSS files and images are retrieved as needed. This is fairly slow and requires an online connection.
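The crawler's own conversion code isn't reproduced here, so the following is only a minimal sketch of one way to resolve a relative link against the page it came from, using System.Uri. The module name, function name, and the stylesheet path are made up for illustration.

```vb
Imports System

Module LinkFixer
    ' Resolve a possibly relative href against the URL of the page it was found on.
    ' Hypothetical helper, not the crawler's actual code.
    ' Returns the href unchanged if it cannot be resolved.
    Function ToAbsolute(ByVal cPageUrl As String, ByVal cHref As String) As String
        Dim oResult As Uri = Nothing
        If Uri.TryCreate(New Uri(cPageUrl), cHref, oResult) Then
            Return oResult.AbsoluteUri
        End If
        Return cHref
    End Function

    Sub Main()
        Dim cPage As String = "http://blogs.msdn.com/calvin_hsia/archive/2006/05/16/599108.aspx"
        ' Root-relative link (invented path): resolved against the host
        Console.WriteLine(ToAbsolute(cPage, "/themes/default/style.css"))
        ' Already absolute: comes back as is
        Console.WriteLine(ToAbsolute(cPage, "http://example.com/pic.gif"))
    End Sub
End Module
```

Letting Uri do the resolution avoids hand-written string replacement, which tends to miss cases like "../" or links that are already absolute.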
The FoxPro version of the crawler actually saves the HTML page as an MHT file (from IE, File->Save As->Type->Web Archive, single file), which means all images and CSS are stored in the file, so no online content is retrieved, and it’s much faster to render the pages. I might update the VB version to save as an MHT file, perhaps as an option. Which way would you prefer?
A blog endpoint is an actual blog post permalink. Several blog URLs are not endpoints: for example, they may be a summary of postings for the month, by category, etc.
The crawler determines whether a URL is an endpoint by counting the number of “/” characters in it, with one line of code. This can be changed easily to accommodate other blogs.
fIsPostedEntry = cUrl.Replace("/", "").Length + 8 = cUrl.Length ' if there are 8 slashes, then it's a blog entry ("http://blogs.msdn.com/calvin_hsia/archive/2006/05/16/599108.aspx")
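To make that easier to adapt, the slash count could be pulled out into a parameter. This is just a sketch: the module, the function names, and the nSlashes parameter are hypothetical, and the second URL in Main is an invented example of a non-endpoint page.

```vb
Imports System

Module EndpointCheck
    ' Count "/" characters the same way the crawler does:
    ' compare the length before and after removing them.
    Function CountSlashes(ByVal cUrl As String) As Integer
        Return cUrl.Length - cUrl.Replace("/", "").Length
    End Function

    ' nSlashes is a hypothetical per-blog setting; 8 matches blogs.msdn.com permalinks.
    Function IsPostedEntry(ByVal cUrl As String, Optional ByVal nSlashes As Integer = 8) As Boolean
        Return CountSlashes(cUrl) = nSlashes
    End Function

    Sub Main()
        ' A permalink with 8 slashes
        Console.WriteLine(IsPostedEntry("http://blogs.msdn.com/calvin_hsia/archive/2006/05/16/599108.aspx")) ' True
        ' An invented summary-style URL with only 6 slashes
        Console.WriteLine(IsPostedEntry("http://blogs.msdn.com/calvin_hsia/archive/2006/05.aspx")) ' False
    End Sub
End Module
```

For a blog whose permalinks have a different depth, only the nSlashes argument would need to change.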
The crawler assumes that every blog entry starts with the same root URL, like “http://blogs.msdn.com/calvin_hsia”. It crawls the blog’s home page (that same root URL), finds all the links with the same root, and adds them to a table. If a link is an endpoint, the page is saved into the table as well.
The way the crawler parses out the published date is probably very specific to http://blogs.msdn.com:
Case "div" ' Parse out the publish date
    If fIsPostedEntry Then
        Dim oC As Object
        oC = .Attributes.GetNamedItem("class")
        ' AndAlso short-circuits, so oC.Value is never touched when oC is Nothing
        If oC IsNot Nothing AndAlso oC.Value IsNot Nothing Then
            Dim cClass As String = oC.Value
            If (cClass = "postfoot" OrElse cClass = "posthead") AndAlso .innerText IsNot Nothing Then
                ' Trim first so the index below lines up with the string being cut
                Dim cText As String = .innerText.Replace("Published", "").Trim
                If cText.Length > 0 Then
                    Try
                        ' Keep everything up to and including the AM/PM marker
                        cText = cText.Substring(0, cText.IndexOf(CStr(IIf(cText.IndexOf("AM") > 0, "AM", "PM"))) + 2)
                        dtPubDate = DateTime.Parse(cText)
                        fGotPubDate = True
                    Catch ex As Exception
                        System.Diagnostics.Debug.WriteLine("Date parse err: " + ex.Message)
                    End Try
                End If
            End If
        End If
    End If
To find links to endpoints, the crawler looks for “archive/2” (as in “archive/2006”) in the URL. Some links on my blog came from comment spam, and those needed to be filtered out too.
Case "a" ' it's a link
    cLink = .Attributes("href").value.Replace("%5f", "_").ToString.ToLower
    If cLink.StartsWith(cBlogUrl) AndAlso cLink <> cCurrentLink Then ' if it's to the blog
        If (Not cLink.Contains("#")) AndAlso cLink.Contains("archive/2") Then ' like archive/2006
            If cLink.Contains("<") OrElse cLink.Contains("%") Then
                ' some comment spam: skip it
            Else
                ' << got good link >>
            End If
        End If
    End If
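The same checks can be collected into one function, which makes them easier to try out against sample links. This is a sketch rather than the crawler's code: the module and function names are illustrative, and it assumes cBlogUrl and cCurrentLink are already lowercase. Unlike the fragment above, it lowercases before replacing "%5f", so "%5F" is handled too.

```vb
Imports System

Module LinkFilter
    ' Returns True when cHref looks like a link to a post on the same blog.
    ' Hypothetical helper; assumes cBlogUrl and cCurrentLink are lowercase.
    Function IsGoodLink(ByVal cHref As String, ByVal cBlogUrl As String, ByVal cCurrentLink As String) As Boolean
        ' Lowercase first so both %5f and %5F become "_"
        Dim cLink As String = cHref.ToLower().Replace("%5f", "_")
        ' Must point into the blog and not back at the current page
        If Not cLink.StartsWith(cBlogUrl) OrElse cLink = cCurrentLink Then Return False
        ' Skip in-page anchors and anything that is not under archive/2xxx
        If cLink.Contains("#") OrElse Not cLink.Contains("archive/2") Then Return False
        ' A leftover "<" or "%" is treated as comment spam
        If cLink.Contains("<") OrElse cLink.Contains("%") Then Return False
        Return True
    End Function

    Sub Main()
        Dim cBlog As String = "http://blogs.msdn.com/calvin_hsia"
        Console.WriteLine(IsGoodLink("http://blogs.msdn.com/calvin%5Fhsia/archive/2006/05/16/599108.aspx", cBlog, cBlog)) ' True
        Console.WriteLine(IsGoodLink("http://blogs.msdn.com/calvin_hsia/archive/2006/05/16/599108.aspx#comments", cBlog, cBlog)) ' False
    End Sub
End Module
```

For another blog engine, the "archive/2" test is the part most likely to need changing.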
Changing the code to work with blogs other than http://blogs.msdn.com means seeing how much they differ in format and changing these areas of code.
For example, http://blogs is an internal Microsoft blogging site. It says “Posted on ” rather than “Published on ”, so that would need to be changed.
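One way to handle that difference would be to pass the label text in as a parameter instead of hard-coding it. The sketch below is only an illustration: the module, the function, and the cLabel parameter are hypothetical, the footer text in Main is invented, and it parses with the invariant culture, whereas the crawler's DateTime.Parse call uses the current culture.

```vb
Imports System
Imports System.Globalization

Module PubDateParser
    ' cLabel is the text that precedes the date, e.g. "Published" or "Posted on".
    ' Returns True and sets dtPubDate when a date could be parsed.
    Function TryParsePubDate(ByVal cText As String, ByVal cLabel As String, ByRef dtPubDate As DateTime) As Boolean
        Dim nStart As Integer = cText.IndexOf(cLabel)
        If nStart < 0 Then Return False
        cText = cText.Substring(nStart + cLabel.Length).Trim()
        ' Find the first AM or PM marker, whichever comes earlier
        Dim nEnd As Integer = cText.IndexOf("AM")
        Dim nPM As Integer = cText.IndexOf("PM")
        If nEnd < 0 OrElse (nPM >= 0 AndAlso nPM < nEnd) Then nEnd = nPM
        If nEnd < 0 Then Return False
        ' Keep everything up to and including the marker, then parse it
        Return DateTime.TryParse(cText.Substring(0, nEnd + 2), CultureInfo.InvariantCulture, DateTimeStyles.None, dtPubDate)
    End Function

    Sub Main()
        Dim dt As DateTime
        ' Invented footer text, for illustration only
        If TryParsePubDate("Posted on May 16, 2006 10:30 AM by someone", "Posted on", dt) Then
            Console.WriteLine(dt.ToString("yyyy-MM-dd HH:mm", CultureInfo.InvariantCulture)) ' 2006-05-16 10:30
        End If
    End Sub
End Module
```

Using TryParse instead of Parse inside a Try/Catch avoids raising an exception for every footer that doesn't contain a date.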