Webcrawl a blog to retrieve all entries locally: RSS on steroids


Today’s sample shows how to create a web crawler in the background. This crawler starts with a web page, looks for all links on that page, and follows all those links. The links are filtered to my blog, but generalizing the code to search the entire web or some other site is trivial (if you have enough disk space<g>). (VB.Net version to appear soon on this blog.)


 


I was doing a search on my blog for “ancestors” via the Search box on the sidebar on the left, and there were no results. Strange, I thought, so I used MSN search for my site:


 


http://search.msn.com/results.aspx?FORM=TOOLBR&q=ancestors+site%3Ahttp%3A%2F%2Fblogs.msdn.com%2FCalvin_Hsia%2F


 


That search succeeded: it came up with the expected blog entry.


 


This incident reminded me of the fact that I’ve done a lot of work to create my blog, but I depend on a 3rd party to maintain it. There are hundreds of code samples, with links to references. If the blog server were to disappear for some reason, so would all my content. I wanted to retrieve all my blog content into a local table. Then I can manipulate it any way I want.


 


In particular, suppose I want to read my entire blog. I would have to do a lot of manual clicking to get to the month/day of the post, and then I might have missed something because I’m manually crawling. That’s pretty cumbersome. Also, I can have all of a blog available while offline, updating when connected.


 


So I wrote a code sample below that crawls my blog, looking for all the blog posts, and shows them in a form which has search capability. Because it’s all local, searching and navigating from post to post is extremely fast. The entry is displayed in a web control, so the page looks just like it would online and the hyperlinks are all live.


You can start a web crawl by pushing the Crawl button. You can interrupt the web crawl by typing ‘Q’ (<esc> will cancel the automation of the IE SaveAs dialog). The next time the crawl runs, it will resume where it left off. Crawling acts as if you were subscribed to my blog via RSS. Once you have all current content, Crawling again later will just add any new content. The saved content is the entire blog entry web page, including any comments.  As an exercise, readers are encouraged to make the web crawling execute on a background thread!


 


A crawl starts at the main page http://blogs.msdn.com/Calvin_Hsia, which shows any new content and has links on the side bar for any other posts. The page is loaded and then parsed for any links. Any links pointing to my blog are inserted into a table if they’re not there already. Then the table is scanned for any unfollowed links and the process repeats. If a page is a leaf node (currently any link with 8 backslashes) then the Publication date is parsed, and the file is saved in the MHT field in the table. The link parsing was a little complicated due to some comment spam reducing measures and some broken links when the blog host server switched software.


 


You will probably have to modify the code if you want to do the same for other blogs. For example, some blogs may have the Publication date in a different place. Others may have archive links elsewhere or in a different format.


 


I experimented with using HTTPGet


 


cTempFile=ADDBS(GETENV(“TEMP”))+SYS(3)+”.htm”


LOCAL oHTTP as “winhttp.winhttprequest.5.1”


LOCAL cHTML


oHTTP=NEWOBJECT(“winhttp.winhttprequest.5.1”)


oHTTP.Open(“GET”,”http://blogs.msdn.com/calvin_hsia/archive/2004/06/28/168054.aspx”,.f.)


oHTTP.Send()


STRTOFILE(ohTTP.ResponseText,cTempFile)


oIE=CREATEOBJECT(“InternetExplorer.Application”)


oIE.Visible=1


oIE.Navigate(cTempFile)


 


But the content looked pretty bad, because of the CSS references, pictures, etc.


 


Being able to automate IE was helpful, but how do you parse the HTML for the links to each blog entry? I thought about using an XSLT, but that was fairly complex. I used the IE Document model IHTMLDocument,to search through the HTML nodes for links.


 


IE has a feature that saves a web page to a single file: Web Archive, single file(*.mht) from the File->SaveAs menu option. So I used Windows Scripting Host to automate this feature.


 


Making the code run in a background thread is trivial: just use the ThreadClass from here.


 


See also :


            Wite your own RSS News/Blog aggregator in <100 lines of code


Use a simple XSLT to read the RSS feed from a blog,


Do you like reading a blog author? Retrieve all blog entries locally for reading/searching using XML, XSLT, XPATH


Generating VBScript to read a blog


   


CLEAR ALL


CLEAR


#define WAIT_TIMEOUT                     258


#define ERROR_ALREADY_EXISTS             183


#define WM_USER                                                              0x400


 


SET EXCLUSIVE OFF


SET SAFETY OFF


SET ASSERTS ON


PUBLIC oBlogForm as Form


oBlogForm=creat(“BlogForm”,“blogs.msdn.com/Calvin_Hsia”)


oBlogForm.Visible=1


DEFINE CLASS BlogForm AS Form


          Height=_screen.Height-80


          Width = 900


          AllowOutput=0


          left=170


          cBlogUrl=“”


          oThreadMgr=null


          ADD OBJECT txtSearch as textbox WITH width=200


          ADD OBJECT cmdSearch as CommandButton WITH left=210,caption=“\<Search”


          ADD OBJECT cmdCrawl as CommandButton WITH left=310,caption=“\<Crawl”


          ADD OBJECT cmdQuit as CommandButton WITH left=410,caption=“\<Quit”


          ADD OBJECT oGrid as Grid WITH ;


                   width = thisform.Width,;


                   top=20,;


                   ReadOnly=1,;


                   Anchor=15


          ADD OBJECT oWeb as cWeb WITH ;


                   top=230,;


                   height=thisform.Height-250,;


                   width = thisform.Width,;


                   Anchor=15


          ADD OBJECT lblStatus as label WITH top = thisform.Height-18,width = thisform.Width,anchor=4,caption=“”


          PROCEDURE Init(cUrl as String)


                   this.cBlogUrl=cUrl


                   IF !FILE(“blogs.dbf”)


                             CREATE table Blogs(title c(250),pubdate t,link c(100),followed i, Stored t,mht m)


                             INDEX on link TAG link


                             INDEX on pubdate TAG pubdate DESCENDING


                             INSERT INTO Blogs (link) VALUES (cUrl)     && jump start the table with a link


                             INSERT INTO blogs (link) VALUES (‘http://blogs.msdn.com/vsdata/archive/2004/03/18/92346.aspx’)        && early blogs


                             INSERT INTO blogs (link) VALUES (‘http://blogs.msdn.com/vsdata/archive/2004/03/31/105159.aspx’)


                             INSERT INTO blogs (link) VALUES (‘http://blogs.msdn.com/vsdata/archive/2004/04/05/107986.aspx’)


                             INSERT INTO blogs (link) VALUES (‘http://blogs.msdn.com/vsdata/archive/2004/05/12/130612.aspx’)


                             INSERT INTO blogs (link) VALUES (‘http://blogs.msdn.com/vsdata/archive/2004/06/16/157451.aspx’)


                   ENDIF


                   USE blogs SHARED    && reopen shared


                   this.RequeryData()


                   this.RefrGrid


          PROCEDURE RequeryData


                   LOCAL cTxt, cWhere


                   cTxt=ALLTRIM(thisform.txtSearch.value)


                   cWhere= “!EMPTY(mht)”


                   IF LEN(cTxt)>0


                             cWhere=cWhere+” and ATC(cTxt, mht)>0″


                   ENDIF


                   SELECT * FROM blogs WHERE  &cWhere ORDER BY pubdate DESC INTO CURSOR Result


                   thisform.lblStatus.caption=“# records =”+TRANSFORM(_tally)


                   WITH this.oGrid


                             .RecordSource= “Result”


                             .Column1.FontSize=14


                             .Column1.Width=this.Width-120


                             .RowHeight=25


                   ENDWITH


                   thisform.refrGrid     


          PROCEDURE RefrGrid


                   cFilename=ADDBS(GETENV(“temp”))+SYS(3)+“.mht”


                   STRTOFILE(mht,cFilename)


                   thisform.oWeb.Navigate(cFilename)


          PROCEDURE oGrid.AfterRowColChange(nColIndex as Integer)


                   IF this.rowcolChange=1       && row changed


                             thisform.RefrGrid


                   ENDIF


          PROCEDURE cmdQuit.Click


                   thisform.Release


          PROCEDURE cmdCrawl.Click


                   thisform.txtSearch.value=“”


                   fBackgroundThread=.t.       && if you want to run on background thread


                   IF this.Caption = “\<Crawl”


                             thisform.lblStatus.caption= “Blog crawl start”


                             CreateCrawlProc()


                             IF fBackgroundThread


                                      this.Caption=“Stop \<Crawl”


                                      *Get ThreadManager from http://blogs.msdn.com/calvin_hsia/archive/2006/05/23/605465.aspx


                                      thisform.oThreadMgr=NEWOBJECT(“ThreadManager”,“threads.prg”)


                                       thisform.oThreadMgr.CreateThread(“MyThreadFunc”,thisform.cBlogUrl,“oBlogForm.CrawlDone”)


                                      thisform.lblStatus.caption= “Background Crawl Thread Created”


                             ELSE


                                      LOCAL oBlogCrawl


                                      oBlogCrawl=NEWOBJECT(“BlogCrawl”,“MyThreadFunc.prg”,“”,thisform.cBlogUrl)          && the class def resides in MyThreadFunc.prg


                                      thisform.CrawlDone


                             ENDIF


                   ELSE


                             this.Caption=“\<Crawl”


                             IF fBackgroundThread AND TYPE(“thisform.oThreadMgr”)=“O”


                                      thisform.lblStatus.caption= “Attempting thread stop”


                                      thisform.oThreadMgr.SendMsgToStopThreads()


                             ENDIF


                   ENDIF


          PROCEDURE CrawlDone


                   thisform.oThreadMgr=null


                   thisform.cmdCrawl.caption=“\<Crawl”


                   thisform.lblStatus.caption= “Crawl done”


                   this.RequeryData()


          PROCEDURE cmdSearch.Click


                   thisform.RequeryData


          PROCEDURE destroy


                   IF USED(“result”)


                             USE IN result


                   ENDIF


                   SELECT Blogs


                   SET MESSAGE TO


                   SET FILTER TO


                   SET ORDER TO LINK   && LINK


ENDDEFINE


 


DEFINE CLASS cweb as olecontrol


          oleclass=“shell.explorer.2”


          PROCEDURE refreshxxx


                   NODEFAULT


          PROCEDURE TitleChange(cText as String)


                   thisform.caption=cText


          PROCEDURE Navigatecomplete2(pDisp AS VARIANT, URL AS VARIANT) AS VOID


                   IF url=GETENV(“TEMP”)


                             ERASE (url)


                   ENDIF


ENDDEFINE


 


 


PROCEDURE CreateCrawlProc as String      && Create the Thread proc, which includes the crawling class


TEXT TO cstrVFPCode TEXTMERGE NOSHOW && generate the task to run: MyThreadFunc


**************************************************


**************************************************


          PROCEDURE MyThreadFunc(p2)      && p2 is the 2nd param to MyDoCmd


                   TRY


                             DECLARE integer GetCurrentThreadId in WIN32API


                             DECLARE integer PostMessage IN WIN32API integer hWnd, integer nMsg, integer wParam, integer lParam


                             cParm=SUBSTR(p2,AT(“,”,p2)+1)


                             hWnd=INT(VAL(p2))


                             oBlogCrawl=CREATEOBJECT(“BlogCrawl”,cParm)


                   CATCH TO oEx


                             DECLARE integer MessageBoxA IN WIN32API integer,string,string,integer


                             MESSAGEBOXA(0,oEx.details+” “+oEx.message,TRANSFORM(oex.lineno),0)


                   ENDTRY


                   PostMessage(hWnd, WM_USER, 0, GetCurrentThreadId())


DEFINE CLASS BlogCrawl as session


          oWeb=0


          oWSH=0


          fStopCrawl=.f.


          hEvent=0


          cMonths=“January   February  March     April     May       June      July      August    September October   November  December 


          cCurrentLink=“”


          PROCEDURE init(cBlogUrl)


                   LOCAL fDone,nRec,nStart


                   nStart=SECONDS()


                   DECLARE integer CreateEvent IN WIN32API integer lpEventAttributes, integer bManualReset, integer bInitialState, string lpName


                   DECLARE integer CloseHandle IN WIN32API integer


                   DECLARE integer WaitForSingleObject IN WIN32API integer hHandle, integer dwMilliseconds


                   DECLARE integer GetLastError IN WIN32API


                   this.hEvent = CreateEvent(0,0,0,“VFPAbortThreadEvent”) && Get the existing event


                   IF this.hEvent = 0


                             THROW “Creating event error:”+TRANSFORM(GetLastError())


                   ENDIF


                   ?“Start Crawl”


                   DECLARE integer GetWindowText IN WIN32API integer, string @, integer


                   DECLARE integer Sleep IN WIN32API integer


                   this.oWeb=CREATEOBJECT(“InternetExplorer.Application”)


                   this.oWeb.visible=1


                   this.oweb.top=0


                   this.oweb.left=0


                   this.oweb.width=500


                   this.oWSH=CREATEOBJECT(“Wscript.Shell”)


                   USE blogs ORDER 1


                   REPLACE link WITH cBlogUrl, followed WITH 0       && set flag to indicate this page needs to be retrieved and crawled


                   this.fStopCrawl=.f.


                   fDone = .f.


                   DO WHILE !fDone AND NOT this.fStopCrawl


                             fDone=.t.


                             GO TOP


                             SCAN WHILE NOT this.fStopCrawl


                                      nRec=RECNO()


                                      IF followed = 0


                                                REPLACE followed WITH 1


                                                this.BlogCrawl(ALLTRIM(link))


                                                IF this.fStopCrawl


                                                          GO nRec


                                                          REPLACE followed WITH 0    && restore flag


                                                ENDIF


                                                fDone = .f.


                                      ENDIF


                             ENDSCAN


                   ENDDO


                   ?“Done Crawl”,SECONDS()-nStart


          PROCEDURE BlogCrawl(cUrl)


                   LOCAL fGotUrl,cTitle


                   fGotUrl = .f.


                   DO WHILE !fGotUrl    && loop until we’ve got the target url in IE with no Error


                             this.oweb.navigate2(cUrl)


                             DO WHILE this.oweb.ReadyState!=4


                                      ?“Loading “+cUrl


                                      Sleep(1000)   && yield processor


                                      IF this.IsThreadAborted()


                                                this.fStopCrawl=.t.


                                                ?“Aborting Crawl”


                                                RETURN


                                      ENDIF


                             ENDDO


                             cTitle=SPACE(250)


                             nLen=GetWindowText(this.oWeb.HWND,@cTitle,LEN(cTitle))


                             cTitle=LEFT(cTitle,nLen)


                             IF EMPTY(cTitle) OR UPPER(ALLTRIM(cTitle))=“ERROR” OR (“http”$LOWER(cTitle) AND “400”$cTitle)


                                      ?“Error retrieving “,cUrl,” Retrying”


                             ELSE


                                      fGotUrl = .t.


                             ENDIF


                   ENDDO


                   this.cCurrentLink=cUrl


                   IF OCCURS(“/”,cUrl)=8          &&http://blogs.msdn.com/calvin_hsia/archive/2005/08/09/449347.aspx


                             cMht=this.SaveAsMHT(cTitle) && save the page before we parse


                             IF this.fStopCrawl


                                      RETURN .f.


                             ENDIF


                             REPLACE title WITH STRTRAN(STRTRAN(cTitle,” – Microsoft Internet Explorer”,“”),“Calvin Hsia’s WebLog : “,“”),;


                                       mht WITH cMht,Stored WITH DATETIME()


                             IF EMPTY(title)         && for some reason, the page wasn’t retrieved


                                      REPLACE followed WITH 0


                             ENDIF


                   ENDIF


                   ?“Parsing HTML”


                   this.ProcessNodes(this.oWeb.Document,0) && Recur through html nodes to find links


                   ?“Done Parsing HTML”


          PROCEDURE ProcessNodes(oNode,nLev)    && recursive routine to look through HTML


                   LOCAL i,j,dt,cClass,oC,cLink


                   IF this.IsThreadAborted() OR nLev > 30     && limit recursion levels


                             RETURN


                   ENDIF


                   WITH oNode


                             DO CASE


                             CASE LOWER(.NodeName)=“div”    && look for pub date


                                      IF OCCURS(“/”,this.cCurrentLink)=8 && # of backslashes in blog leaf entry


                                                oC=.Attributes.GetnamedItem(“class”)


                                                IF !ISNULL(oC) AND !EMPTY(oC.Value)


                                                          cClass=oC.Value


                                                          IF cClass=“postfoot” OR cClass = “posthead”


                                                                   cText=ALLTRIM(STRTRAN(.innerText,“Published”,“”))


                                                                   IF !EMPTY(cText)


                                                                             dt=this.ToDateTime(cText)


                                                                             IF SEEK(this.cCurrentLink,“blogs”)


                                                                                      REPLACE pubdate WITH dt


                                                                             ELSE


                                                                                      ASSERT .f.


                                                                             ENDIF


                                                                   ENDIF


                                                          ENDIF


                                                ENDIF


                                      ENDIF


                             CASE .nodeName=“A”


                                      cLink=LOWER(STRTRAN(.Attributes(“href”).value,“%5f”,“_”))


                                      IF ATC(“http://blogs.msdn.com/calvin_hsia/”,cLink)>0


                                                IF ATC(“#”,cLink)=0 AND ATC(“archive/2”,cLink)>0


                                                          *http://blogs.msdn.com/calvin_hsia/archive/2004/10/11/<a%20rel=


 


                                                          IF “<“$cLink   && comment spam prevention:


*http://blogs.msdn.com/calvin_hsia/archive/2004/10/11/240992.aspx


*<a rel=”nofollow” target=”_new” href=”<a rel=”nofollow” target=”_new” href=”http://www.53dy.com“>http://www.53dy.com</a>                                                       


                                                          ELSE


                                                                    *http://blogs.msdn.com/calvin_hsia/archive/2004/11/16/visual%20foxpro


                                                                   IF “%”$cLink


                                                                    *http://blogs.msdn.com/calvin_hsia/archive/2004/11/16/258422.aspx


                                                                   * broken link: host updated software for category links


                                                                   *<A title=”Visual Foxpro” href=”http://blogs.msdn.com/calvin_hsia/archive/2004/11/16/Visual%20Foxpro“>Visual Foxpro</A>


*                                                                           SET STEP ON


                                                                   ELSE


                                                                             IF !SEEK(cLink,“Blogs”)


                                                                                      INSERT INTO Blogs (link) VALUES (cLink)


                                                                             ENDIF


                                                                   ENDIF


                                                          ENDIF


                                                ENDIF


                                      ENDIF


                             ENDCASE


                             FOR i = 0 TO .childNodes.length-1


                                      this.ProcessNodes(.childNodes(i),nLev+1)


                             ENDFOR


                   ENDWITH


                   PROCEDURE ToDateTime(cText as String) as Datetime


                             *Friday, April 01, 2005 11:30 AM by Calvin_Hsia


                             LOCAL dt as Datetime


                             ASSERT GETWORDNUM(cText,6)$“AM PM”


                             nHr = INT(VAL(GETWORDNUM(cText,5)))


                             IF GETWORDNUM(cText,6)=“PM” AND nhr < 12


                                      nHr=nHr+12


                             ENDIF


                             dt=CTOT(GETWORDNUM(ctext,4) + “/” +; && Year


                                      TRANSFORM(INT(1+(AT(GETWORDNUM(cText,2),this.cMonths)-1)/10)) + “/” +;          && month


                                      TRANSFORM(VAL(GETWORDNUM(cText,3)))  + “T” +;     && day of month


                                      TRANSFORM(nHr)+“:”+;               && hour


                                      TRANSFORM(VAL(SUBSTR(GETWORDNUM(cText,5),4))))  && minute


                             ASSERT !EMPTY(dt)


                   RETURN dt


          PROCEDURE SaveAsMHT(cTitle as String) as String


                   fRetry = .t.


                   DO WHILE fRetry


                             fRetry = .f.


                             WITH this.oWSH


                                      .AppActivate(cTitle)   && bring IE to the foreground


                                      TEMPFILEMHT= “c:\t.mht”    && temp file can be constant


                                      ERASE (TEMPFILEMHT)


                                      .SendKeys(“%fa”+TEMPFILEMHT+“{tab}w%s”)    && Alt-F (File Menu) S (Save As) type Web Archive  Alt-S


                                      nTries=5


                                      DO WHILE !FILE(TEMPFILEMHT)      && wait til IE saves the file


                                                Sleep(5000)


                                                nTries=nTries-1


                                                IF nTries=0


                                                          fRetry=.t.


                                                          EXIT


                                                ENDIF


                                                IF this.IsThreadAborted()


                                                          this.fStopCrawl=.t.


                                                          ?“Aborting crawl”


                                                          RETURN “”


                                                ENDIF


                                      ENDDO


                                      sleep(100)


                             ENDWITH


                   ENDDO


                   RETURN FILETOSTR(TEMPFILEMHT)


          RETURN


          PROCEDURE IsThreadAborted as Boolean


                   IF WaitForSingleObject(this.hEvent,0) = WAIT_TIMEOUT


                             RETURN .f.


                   ENDIF


                   RETURN .t.


          PROCEDURE destroy


                   this.oWeb.visible=.f.


                   this.oWeb=0


                   CloseHandle(this.hEvent)


ENDDEFINE


**************************************************


**************************************************


ENDTEXT


          STRTOFILE(cstrVFPCode,“MyThreadFunc.prg”)


          COMPILE MyThreadFunc.prg


*SELECT PADR(TRANSFORM(YEAR(pubdate)),4)+”/”+PADL(MONTH(pubdate),2,”0″) as mon,COUNT(*) FROM blogs WHERE !EMPTY(mht) GROUP BY mon ORDER BY mon DESC INTO CURSOR result


 


RETURN

Comments (21)

  1. Calvin,

    That’s really beautiful code… Excellent pointers for diverse applications deployed with Visual FoxPro.  Thanks for commenting/posting it!!

  2. SednaY says:

    I’ve updated the VFP MT example based on this (at http://codegallery.gotdotnet.com/SednaY) to be sort of like .NET:

    Example Use:

    * t=CREATEOBJECT(‘testserver.thread’)

    * t.start(5,"do c:MTmyVFPMyThreadFunc WITH p2")

    * && start method params:(1)#threads,(2)VFP code to MT,(3)Silent mode

    * ?t.check && returns .T. if completed

    * t=null && cleanup

    Simple, fast, efficient – it is VFP!!

  3. I wanted to update a couple zip files of the VB version of my Blog Crawler (to be posted soon) with the…

  4. This is the VB.Net 2005 version of the Blog Crawler. It’s based on the Foxpro version, but.it uses SQL…

  5. The EventHandler function allows you to connect some code to an object’s event interface. For example,…

  6. Sometimes you run a program and you don’t want it to show any dialogs or User Interface at all. For example,…

  7. Calvin has written a blog crawler with both VFP and VB.NET versions that allows you to back up your own…

  8. ClaudeFox says:

    I’ve updated this VFP Web Crawler to more closely match the VB.Net version.  Check it out at:

    http://www.codeplex.com/vfpwebcrawler

    All source is included…

  9. SQLExpress is free and comes with Visual Studio, but the sample Northwind database isn’t included. You

  10. SQLExpress is free and comes with Visual Studio, but the sample Northwind database isn’t included. You

  11. SednaY says:

    I updated a version of this code to include an easy to use VFP project and

    -ability to specify number of threads

    -better switching between blogs

    -debug option to make crawling visible

    See VFPWebCrawler 2.0 at:

    http://www.codeplex.com/VFPWebcrawler

  12. I spent a few hours at a local company called 2Bot ( http://www.2bot.com/ ) which makes a 3-D printer

  13. I received a question: Simply, is there a way of interrupting a vfp sql query once it has started short