Web-scraping with VB’s XML support


There was an interesting article about using VB’s XML support for generating HTML: http://www.infoq.com/news/2009/02/MVC-VB.


I’ve been using VB and XML for the reverse purpose — scraping web pages to retreive information. I enjoy sailing, and I wanted to find historical data on windspeeds to know when would be the best time of year to set out on a long trip. (Answer: March and April have the best winds around Seattle).


I found an excellent site, to scrape from, http://www.almanac.com/, which has historical weather data for many places around the country. The first step in scraping is copyright law. Facts alone are not copyrightable, but the act of selecting and compiling facts is a creative work and so the compilation is copyrightable. Hence, for instance, a telephone directory is protected by copyright. So too is almanac.com’s compilation. And that’s why I only scraped their pages for my own personal use.


The almanac has URLs like this: http://www.almanac.com/weatherhistory/oneday.php?number=994014&wban=99999&day=1&month=4&year=2008&searchtype=. It’s easy to see what the format is, and generate similar URLs myself.


 


The code to parse XHTML:


I looked at the HTML source code of a page from the almanac in Notepad, figured out its structure, and wrote some simple XML queries to dig into it. (Note: the function “Fetch” fetches HTML pages from the web, but converts them into XHTML ready for VB XML queries. More on that later). Here’s the VB code. I highlighted the XML queries.


 


Option Strict On


Imports System.Net


Imports System.IO


Imports <xmlns:xhtml=http://www.w3.org/1999/xhtml>


 


 


Module Module1


 


    Dim Places As Integer() = {994014}


    Dim Years As Integer() = {2008}


    Dim Months As Integer() = {4, 5}


 


    Sub Main()


        Console.WriteLine(“{1}{0}{2}{0}{3}{0}{4}{0}{5}{0}{6}{0}{7}{0}{8}”, vbTab, “Date (Y/M/D)”, “Location”, “Temp (^F)”, “Precipitation (in)”, “Visibility (miles)”, “Wind Mean (mph)”, “Wind Sustained (mph)”, “Wind Gust (mph)”)


        For Each year As Integer In Years


            For Each month As Integer In Months


                Dim d = New DateTime(year, month, 1)


                Dim dnm = New DateTime(If(d.Month = 12, d.Year + 1, d.Year), If(d.Month = 12, 1, d.Month + 1), d.Day)


                Dim lastDay = CInt((dnm – d).TotalDays)


                For day As Integer = 1 To lastDay


                    For Each place As Integer In Places


                        Dim url = String.Format(“http://www.almanac.com/weatherhistory/oneday.php?number={0}&wban=99999&day={1}&month={2}&year={3}&searchtype=”, place, day, month, year)


                        Dim fn = Fetch(url)


                        Dim xml = XElement.Load(fn)


                        Dim body = (From i In xml…<xhtml:div> Where i.GetAttr(“class”) = “yui-u first”).FirstOrDefault


                        If body Is Nothing Then Continue For


                        Dim title = body.<xhtml:h2>.Value.ToString.Replace(“,”, ” “)


                        If title.ToLower.StartsWith(“no data”) Then Continue For


                        Dim temp, precipitation, visibility, windMean, windSustained, windGust As Double?


                        Dim data = From i In body…<xhtml:td>


                        For Each td In data


                            Dim text = td.<xhtml:p>.FirstOrDefault


                            If text Is Nothing Then Continue For


                            Dim category = text.Value.Replace(vbCrLf, ” “).Replace(vbCr, ” “).Replace(vbLf, ” “).ToLower


                            text = td.<xhtml:b>.FirstOrDefault


                            If text Is Nothing Then Continue For


                            Dim svalue = text.Value.Replace(vbCrLf, ” “).Replace(vbCr, ” “).Replace(vbLf, ” “).ToLower


                            Dim value = 0.0 : If Not Double.TryParse(svalue, value) Then Continue For


                            If category Like “mean temperature” Then temp = value


                            If category Like “total precipitation” Then precipitation = value


                            If category Like “visibility” Then visibility = value


                            If category Like “mean wind speed” Then windMean = value


                            If category Like “maximum sustained” Then windSustained = value


                            If category Like “maximum gust” Then windGust = value


                        Next


                        Dim s = String.Format(“{0:0000}/{1:00}/{2:00}”, year, month, day)


                        Console.WriteLine(“{1}{0}{2}{0}{3}{0}{4}{0}{5}{0}{6}{0}{7}{0}{8}”, vbTab, s, title, temp, precipitation, visibility, windMean, windSustained, windGust)


                    Next


                Next


            Next


        Next


    End Sub


End Module


 


Fetching pages: HTML into XHTML


Goal: to use VB’s XML support for reading the web page. That’s because VB has such nice syntax (I find it easier than xpath, or beautiful soup, or the alternatives). The problem is that most web-pages are written in a sloppy kind of HTML that might render okay but certainly can’t be loaded into XElement.Load.


Solution: download Tidy, an awesome open-source library and executable for, well, tidying HTML into proper XHTML. I downloaded “tidy.exe” and put it into my windows directory, so I could execute it without messing around with the path.


The above code calls a function “Fetch”. This is the one that fetches pages, and invokes “tidy” to clean up the html. Here is the implementation of Fetch. It uses a function “InputAndOutputToEnd” to redirect input and output of tidy.exe when it runs it. I wrote about InputAndOutputToEnd last month.


 


Module Helpers


 


    ”’ <summary>


    ”’ GetAttr: x.GetAttr(“attr”) is equivalent to x.@attr. It’s here to work around a MONO bug: MONO


    ”’ will throw an exception on x.@attr if the attribute is absent; the CLR doesn’t. This function


    ”’ also doesn’t throw.


    ”’ </summary>


    <System.Runtime.CompilerServices.Extension()> Function GetAttr(ByVal e As XElement, ByVal attr As String) As String


        If e Is Nothing Then Return “”


        For Each a In e.Attributes


            If String.Compare(attr, a.Name.LocalName, True) = 0 Then Return a.Value


        Next


        Return “”


    End Function


 


    ”’ <summary>


    ”’ Fetch: this function fetches the given Url and saves it into a cache in a temporary directory.


    ”’ It returns the filename. If the Url had given back “text/html”, then this function invokes


    ”’ “tidy.exe” (from http://tidy.sourceforge.net/) to turn the html into valid XHTML such as can


    ”’ be read with XElement.Load. The function will throw an exception if anything bad happened,


    ”’ e.g. WebException or BadUriException. If asked to fetch a url but this url had already been downloaded


    ”’ previously, and the previous download was no more than “CacheAtLeastDays” old and hadn’t


    ”’ been deleted, then the previous download is used. The idea is that our program might well hammer


    ”’ web-services, and we don’t want to be too cruel on them, so even if they didn’t specify caching


    ”’ for a page then we might still want to cache it. (If the webservice specified a cache longer than


    ”’ CacheAtLeastDays, then any number of internet proxies along the way might cache it, and so


    ”’ CacheAtLeastDays is a minimum rather than a maximum.) This function is not protected against


    ”’ multiple threads calling it. There might be contention if multiple threads call it and try to


    ”’ download and write to the same file. Note: in the cache, URLs are escaped then truncated to 240


    ”’ characters. So if they were longer than that (e.g. long query strings) then there’ll be cache


    ”’ conflicts and the wrong data might be returned.


    ”’ </summary>


    Function Fetch(ByVal Url As String, Optional ByVal CacheAtLeastDays As Double = 7) As String


        Dim dir = IO.Path.GetTempPath & My.Application.Info.AssemblyName & “\fetch”


        If Not Directory.Exists(dir) Then Directory.CreateDirectory(dir)


        ‘ Note: if the directory already existed, then CreateDirectory just proceeds silently without fuss.


 


        Dim fn = dir & “\” & Uri.EscapeDataString(Url.Replace(“http://”, “”).Replace(“/”, “_”)).Replace(“%”, “#”)


        ‘ MONO: If you try to XElement.Load(fn) where fn includes %escapes, then it tries to unescape them.


        ‘ So we make sure there are no %escapes in the filename.  (CLR doesn’t have this quirk.)


 


        fn = fn.Substring(0, Math.Min(240, fn.Length))


        ‘ MONO on unix: is fine so long as every directory/filename component is <=240 characters.


        ‘ CLR on windows: requires the entire path “fn” to be <=240 characters.


        ‘ http://blogs.msdn.com/bclteam/archive/2007/02/13/long-paths-in-net-part-1-of-3-kim-hamilton.aspx


 


        If File.Exists(fn) Then


            Dim age = DateTime.Now – File.GetLastWriteTime(fn)


            If age.TotalDays <= CacheAtLeastDays Then Return fn


            File.Delete(fn)


        End If


 


        Dim x = WebRequest.Create(Url)


        Using r = x.GetResponse


            Dim t = “”


            Using rs As New StreamReader(r.GetResponseStream)


                t = rs.ReadToEnd


            End Using


            If Not r.ContentType.StartsWith(“text/html”) Then


                My.Computer.FileSystem.WriteAllText(fn, t, False, Text.Encoding.UTF8)


                Return fn


            End If


            Using tidy As New System.Diagnostics.Process


                Dim cmd = “tidy”


                Dim args = “-asxml -numeric -quiet –doctype omit”


                ‘ MONO: XElement.Load throws an exception if DOCTYPE is present. CLR doesn’t. Hence we omit the DOCTYPE.


                tidy.StartInfo.FileName = cmd


                tidy.StartInfo.Arguments = args


                tidy.StartInfo.UseShellExecute = False


                tidy.StartInfo.RedirectStandardInput = True


                tidy.StartInfo.RedirectStandardOutput = True


                tidy.StartInfo.RedirectStandardError = True


                tidy.Start()


                Dim err = “”, op = “”


                tidy.InputAndOutputToEnd(t, op, err)


                tidy.WaitForExit(5000)


                If tidy.HasExited Then


                    ‘ We had already asked (“-numeric”) for tidy to escape non-ascii characters. But


                    ‘ nonetheless, XElement.Load will throw an exception if there are any, and we really


                    ‘ don’t want that, so we’ll do belt-and-braces here:


                    Dim op2 As New Text.StringBuilder(op.Length)


                    For i = 0 To op.Length – 1


                        Dim c = AscW(op(i))


                        If (c >= 32 AndAlso c < 127) OrElse c = 13 OrElse c = 10 OrElse c = 9 Then


                            op2.Append(op(i))


                        End If


                    Next


                    My.Computer.FileSystem.WriteAllText(fn, op2.ToString, False, Text.Encoding.ASCII)


                    Return fn


                End If


                tidy.Kill()


                tidy.WaitForExit(2000)


            End Using


        End Using


        Return “”


    End Function


 


 


    ”’ <summary>


    ”’ InputAndOutputToEnd: Given a started process, this lets you supply a string as input if you want,


    ”’ and will read all output and error to the end. This function has no timeout: if we give it an input string


    ”’ but the process fails to read it to completion, or if we ask for standard-output/error but the process


    ”’ fails to close these streams, then the function will block indefinitely. The function will throw


    ”’ an exception if there was an error reading from the streams. The caller is expected to have started


    ”’ the process before calling the function, and the caller is expected to wait for the process to close


    ”’ and to dispose of it afterwards. If the caller uses this function, then the caller should do no


    ”’ other input/output to the process.


    ”’ </summary>


    <Runtime.CompilerServices.Extension()> Sub InputAndOutputToEnd(ByVal p As Diagnostics.Process, ByVal StandardInput As String, ByRef StandardOutput As String, ByRef StandardError As String)


        If p Is Nothing Then Throw New ArgumentException(“process must be non-null”, “p”)


        ‘ Assume p has started. Alas there’s no way to check.


        If p.StartInfo.UseShellExecute Then Throw New ArgumentException(“Set StartInfo.UseShellExecute to false”)


        If (p.StartInfo.RedirectStandardInput <> (StandardInput IsNot Nothing)) Then Throw New ArgumentException(“Provide a non-null Input only when StartInfo.RedirectStandardInput”)


        If (p.StartInfo.RedirectStandardOutput <> (StandardOutput IsNot Nothing)) Then Throw New ArgumentException(“Provide a non-null Output only when StartInfo.RedirectStandardOutput”)


        If (p.StartInfo.RedirectStandardError <> (StandardError IsNot Nothing)) Then Throw New ArgumentException(“Provide a non-null Error only when StartInfo.RedirectStandardError”)


       


        ‘ MSDN notes, http://msdn.microsoft.com/en-us/library/system.diagnostics.processstartinfo.redirectstandardoutput.aspx,


        ‘ that “Synchronous read operations introduce a dependency between the caller reading from the StandardOutput stream


        ‘ and the child process writing to that stream. These dependencies can cause deadlock conditions.” We avoid the deadlock


        ‘ by running in a separate thread.


       


        Dim outputData As New InputAndOutputToEndData


        Dim errorData As New InputAndOutputToEndData


       


        If p.StartInfo.RedirectStandardOutput Then


            outputData.Stream = p.StandardOutput


            outputData.Thread = New Threading.Thread(AddressOf InputAndOutputToEndProc)


            outputData.Thread.Start(outputData)


        End If


        If p.StartInfo.RedirectStandardError Then


            errorData.Stream = p.StandardError


            errorData.Thread = New Threading.Thread(AddressOf InputAndOutputToEndProc)


            errorData.Thread.Start(errorData)


        End If


       


        If p.StartInfo.RedirectStandardInput Then


            p.StandardInput.Write(StandardInput)


            p.StandardInput.Close()


        End If


       


        If p.StartInfo.RedirectStandardOutput Then outputData.Thread.Join() : StandardOutput = outputData.Output


        If p.StartInfo.RedirectStandardError Then errorData.Thread.Join() : StandardError = errorData.Output


        If outputData.Exception IsNot Nothing Then Throw outputData.Exception


        If errorData.Exception IsNot Nothing Then Throw errorData.Exception


    End Sub


 


    Private Class InputAndOutputToEndData


        Public Thread As Threading.Thread


        Public Stream As IO.StreamReader


        Public Output As String


        Public Exception As Exception


    End Class


 


    Private Sub InputAndOutputToEndProc(ByVal data_ As Object)


        Dim data = DirectCast(data_, InputAndOutputToEndData)


        Try : data.Output = data.Stream.ReadToEnd : Catch e As Exception : data.Exception = e : End Try


    End Sub


 


End Module 


 


 


Comments (7)

  1. Don’t trust Tidy too much – its HTML parser is far from perfect, and when it fails, you do not get valid XHTML as output, even if you asked for it. One example of something it can’t handle are MSOffice HTML extensions (such as VML), and the non-standard-compliant way Office uses to declare namespaces in HTML documents it produces. And there are quite a few pages on the Web made by using "Save as HTML" in word.

    Apart from writing an HTML parser from scratch, the only other reasonable option is to use IE (or rather HTMLDocument coclass and IHTMLDocument interface) to parse it, and then walk its DOM. Along these lines:

    http://www.codeproject.com/KB/IP/parse_html.aspx

  2. joko nardi says:

    Hi…i saw your contents -> great..can you give me sample to download 🙂

    thanks

    jk

  3. Wooldridge says:

    Amazing blog! Do you have any recommendations for aspiring

    writers? I'm hoping to start my own site soon but

    I'm a little lost on everything. Would you recommend starting with a free platform like WordPress

    or go for a paid option? There are so many options out there that I'm

    completely confused .. Any recommendations? Many thanks!

  4. Diggles says:

    Wow, that's what I was seeking for, what a information! existing here at this webpage, thanks admin of this site.

  5. Carlin says:

    It's impressive that you are getting thoughts from this piece of

    writing as well as from our argument made here.

  6. Dolan says:

    Someone necessarily lend a hand to make severely articles I might state.

    That is the first time I frequented your web page and so far?

    I surprised with the research you made to create

    this particular post extraordinary. Wonderful process!

  7. scrape-defender says:

    Thank you for this excellent read. scraping the websites is increasing so it is necessary to take security services.

Skip to main content