The VB version of the Blog Crawler

This is the VB.NET 2005 version of the Blog Crawler. It’s based on the FoxPro version, but it uses SQL Server Everywhere, so you can deploy it on your mobile device! It crawls a blog and stores all entries in a SQL Server Everywhere table. This includes blog comments and Cascading Style Sheets.

I had to wait to post this blog entry because the public CTP release of SQL Everywhere is today (announced at Tech Ed)!

To run it, you only need to copy a few files from this link (1.6 megabytes) into a directory on your machine and start BlogCrawl.Exe. No registration or install of any kind is required, except for the .NET Framework 2.0 (which is installed with Visual Studio 2005, or you can download the runtime). The source code is here and can be unzipped into the same folder. The program (including SQL/E) is totally isolated to the install folder, except for the My.Settings XML file, which stores your preferences in your local settings folder. It doesn’t touch your registry or install any other files.

When you start the program, the top part shows a grid of already crawled blog posts. The bottom part shows each post in a web control as it looked at the time of download. The links on the page are live. When first starting, there will be no data. If you click the Crawl button, it will start a background thread that scans the blog and downloads any entries that have not been downloaded yet. The status bar shows crawl progress.
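The crawl step described above can be sketched roughly as follows. This is a Python illustration, not the program's VB code (that is in the source download). The table layout, the `fetch` stand-in and the use of SQLite instead of SQL Server Everywhere are all assumptions made to keep the sketch self-contained.

```python
import os
import sqlite3
import tempfile
import threading

def fetch(url):
    """Hypothetical stand-in for the HTTP download; returns canned HTML."""
    return "<html><body>Post at " + url + "</body></html>"

def crawl(db_path, status):
    """Download every row not yet downloaded (followed = 0), reporting progress."""
    conn = sqlite3.connect(db_path)  # the worker thread opens its own connection
    try:
        pending = [row[0] for row in
                   conn.execute("SELECT url FROM posts WHERE followed = 0")]
        for i, url in enumerate(pending, start=1):
            conn.execute("UPDATE posts SET html = ?, followed = 1 WHERE url = ?",
                         (fetch(url), url))
            conn.commit()
            # Stand-in for the status bar text.
            status.append("Crawled %d of %d" % (i, len(pending)))
    finally:
        conn.close()

def demo():
    """Build a tiny database (assumed schema) and crawl it on a background thread."""
    db_path = os.path.join(tempfile.mkdtemp(), "blog.db")
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE posts ("
                 "url TEXT PRIMARY KEY, html TEXT, followed INTEGER DEFAULT 0)")
    conn.executemany(
        "INSERT INTO posts (url, html, followed) VALUES (?, ?, ?)",
        [("http://example.com/a", "<html>old</html>", 1),  # already downloaded
         ("http://example.com/b", None, 0),
         ("http://example.com/c", None, 0)])
    conn.commit()
    status = []
    worker = threading.Thread(target=crawl, args=(db_path, status))
    worker.start()
    worker.join()
    rows = conn.execute(
        "SELECT url, followed, html FROM posts ORDER BY url").fetchall()
    conn.close()
    return status, rows
```

Calling `demo()` leaves the already downloaded row untouched and fills in only the two pending ones, which is the behavior that makes a repeated crawl cheap.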

It takes about 20 minutes to crawl my blog and download my 240 posts. You can stop and continue the background thread at any time by hitting the same Crawl button. The data is stored as a SQL Mobile database in the same folder in a file called <blogname>.sdf.
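One common way to get a single button that starts, stops and continues a background worker is an event flag that the worker checks before each download. The Python sketch below shows that idea only; the class and method names are hypothetical, and the actual program may do this differently.

```python
import threading
import time

class PausableCrawl:
    """Background worker that one button can start, pause, and resume."""

    def __init__(self, urls):
        self.pending = list(urls)
        self.done = []
        self._go = threading.Event()   # set = crawling, clear = paused
        self._thread = None

    @property
    def paused(self):
        return self._thread is not None and not self._go.is_set()

    def on_crawl_button(self):
        """Hypothetical click handler: first click starts, later clicks toggle."""
        if self._thread is None:
            self._go.set()
            self._thread = threading.Thread(target=self._run, daemon=True)
            self._thread.start()
        elif self._go.is_set():
            self._go.clear()           # pause before the next download
        else:
            self._go.set()             # continue where the crawl left off

    def _run(self):
        while self.pending:
            self._go.wait()            # blocks here while paused
            url = self.pending.pop(0)
            time.sleep(0.001)          # stand-in for the real download
            self.done.append(url)

    def wait_finished(self):
        self._thread.join()
```

Because the flag is only checked between downloads, a pause takes effect after the entry in progress has finished, so no entry is left half stored.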

You can type a search string in the textbox and click the Search button to limit the records in the grid to those blog posts containing the search string.
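A search like this usually comes down to a parameterized LIKE query. Here is a minimal Python sketch using SQLite as a stand-in for SQL Server Everywhere; the `posts` table and its `title` and `body` columns are assumed names, not the program's actual schema.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory table with a few sample rows (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (title TEXT, body TEXT)")
    conn.executemany("INSERT INTO posts VALUES (?, ?)", [
        ("Blog Crawler", "A crawler that stores posts in a database"),
        ("Regular Expressions", "Use a regular expression to find hyperlinks"),
        ("Sounds", "Turn off the Information Bar sound"),
    ])
    conn.commit()
    return conn

def search_posts(conn, text):
    """Return titles of posts whose body contains text.

    The value is passed as a parameter rather than pasted into the SQL.
    Note that % and _ inside text still act as LIKE wildcards.
    """
    pattern = "%" + text + "%"
    rows = conn.execute(
        "SELECT title FROM posts WHERE body LIKE ? ORDER BY title", (pattern,))
    return [row[0] for row in rows]
```

An empty search string matches every row, which corresponds to showing the full grid again.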

It’s customized for blogs hosted on https://blogs.msdn.com: it parses out the blog entry publication date and determines which pages are blog posts and which are just intermediate pages (like the listing of February posts). I haven’t tested it with all the various blog CSS styles, but the source can be modified.
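To give a feel for the post versus intermediate page decision, here is a small Python sketch that classifies a link by its URL. It assumes that a post URL ends in `/archive/yyyy/mm/dd/<name>.aspx` and that anything else is an intermediate page. That layout, the `someblog` name and the `classify` function are assumptions for illustration; the actual program may well read the date from the page content instead.

```python
import re
from datetime import date

# Assumed post URL layout: .../archive/yyyy/mm/dd/<name>.aspx
POST_URL = re.compile(r"/archive/(\d{4})/(\d{2})/(\d{2})/[^/]+\.aspx$",
                      re.IGNORECASE)

def classify(url):
    """Return ("post", publication_date) or ("intermediate", None)."""
    match = POST_URL.search(url)
    if match is None:
        return ("intermediate", None)
    year, month, day = (int(part) for part in match.groups())
    try:
        return ("post", date(year, month, day))
    except ValueError:
        # Digits in the right places but not a real calendar date.
        return ("intermediate", None)
```

For example, `classify("https://blogs.msdn.com/someblog/archive/2006/02.aspx")` is treated as an intermediate page because it has no day component.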

The program defaults to crawling my blog, but allows you to switch to other blogs. Click the Blog Options button to crawl your favorite blog.

If you change the Followed value for a particular entry to 0, the next crawl will recrawl that link. This is useful if you want to pick up the latest comments on a post.
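In SQL terms, that reset is a one-line UPDATE. The Python sketch below uses SQLite as a stand-in database; only the Followed column comes from the program, while the `posts` table name and `url` column are assumptions.

```python
import sqlite3

def mark_for_recrawl(conn, url):
    """Set Followed back to 0 for one entry; returns the number of rows changed."""
    cur = conn.execute("UPDATE posts SET followed = 0 WHERE url = ?", (url,))
    conn.commit()
    return cur.rowcount

def pending_urls(conn):
    """Entries the next crawl would download again."""
    rows = conn.execute("SELECT url FROM posts WHERE followed = 0 ORDER BY url")
    return [row[0] for row in rows]

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (url TEXT PRIMARY KEY, followed INTEGER)")
    conn.executemany("INSERT INTO posts VALUES (?, ?)",
                     [("http://example.com/a", 1), ("http://example.com/b", 1)])
    conn.commit()
    before = pending_urls(conn)                       # nothing pending yet
    changed = mark_for_recrawl(conn, "http://example.com/b")
    return before, changed, pending_urls(conn)
```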

It uses the new My.Settings feature to persist user settings, such as window position and which blog was last crawled. The new SplitContainer class lets you move the splitter bar between the grid and the web control, and the SplitterDistance is persisted in My.Settings.

One of my machines was playing a sound while my web crawler was crawling. The culprit was Control Panel->Sounds->Sound->Windows Explorer->Information Bar.

See also

SQL Mobile Books Online

Use Regular Expressions to get hyperlinks in blogs