Web-scraping with VB’s XML support

There was an interesting article about using VB's XML support for generating HTML: http://www.infoq.com/news/2009/02/MVC-VB.

I've been using VB and XML for the reverse purpose -- scraping web pages to retreive information. I enjoy sailing, and I wanted to find historical data on windspeeds to know when would be the best time of year to set out on a long trip. (Answer: March and April have the best winds around Seattle).

I found an excellent site, to scrape from, http://www.almanac.com/, which has historical weather data for many places around the country. The first step in scraping is copyright law. Facts alone are not copyrightable, but the act of selecting and compiling facts is a creative work and so the compilation is copyrightable. Hence, for instance, a telephone directory is protected by copyright. So too is almanac.com's compilation. And that's why I only scraped their pages for my own personal use.

The almanac has URLs like this: http://www.almanac.com/weatherhistory/oneday.php?number=994014&wban=99999&day=1&month=4&year=2008&searchtype=. It's easy to see what the format is, and generate similar URLs myself.


The code to parse XHTML:

I looked at the HTML source code of a page from the almanac in Notepad, figured out its structure, and wrote some simple XML queries to dig into it. (Note: the function "Fetch" fetches HTML pages from the web, but converts them into XHTML ready for VB XML queries. More on that later). Here's the VB code. I highlighted the XML queries.


Option Strict On

Imports System.Net

Imports System.IO

Imports <xmlns:xhtml="http://www.w3.org/1999/xhtml">



Module Module1


    Dim Places As Integer() = {994014}

    Dim Years As Integer() = {2008}

    Dim Months As Integer() = {4, 5}


    Sub Main()

        Console.WriteLine("{1}{0}{2}{0}{3}{0}{4}{0}{5}{0}{6}{0}{7}{0}{8}", vbTab, "Date (Y/M/D)", "Location", "Temp (^F)", "Precipitation (in)", "Visibility (miles)", "Wind Mean (mph)", "Wind Sustained (mph)", "Wind Gust (mph)")

        For Each year As Integer In Years

            For Each month As Integer In Months

                Dim d = New DateTime(year, month, 1)

                Dim dnm = New DateTime(If(d.Month = 12, d.Year + 1, d.Year), If(d.Month = 12, 1, d.Month + 1), d.Day)

                Dim lastDay = CInt((dnm - d).TotalDays)

                For day As Integer = 1 To lastDay

                    For Each place As Integer In Places

                        Dim url = String.Format("http://www.almanac.com/weatherhistory/oneday.php?number={0}&wban=99999&day={1}&month={2}&year={3}&searchtype=", place, day, month, year)

                        Dim fn = Fetch(url)

                        Dim xml = XElement.Load(fn)

                        Dim body = (From i In xml...<xhtml:div> Where i.GetAttr("class") = "yui-u first").FirstOrDefault

                        If body Is Nothing Then Continue For

                        Dim title = body.<xhtml:h2>.Value.ToString.Replace(",", " ")

                        If title.ToLower.StartsWith("no data") Then Continue For

                        Dim temp, precipitation, visibility, windMean, windSustained, windGust As Double?

                        Dim data = From i In body...<xhtml:td>

                        For Each td In data

                            Dim text = td.<xhtml:p>.FirstOrDefault

                            If text Is Nothing Then Continue For

                            Dim category = text.Value.Replace(vbCrLf, " ").Replace(vbCr, " ").Replace(vbLf, " ").ToLower

                            text = td.<xhtml:b>.FirstOrDefault

                            If text Is Nothing Then Continue For

                            Dim svalue = text.Value.Replace(vbCrLf, " ").Replace(vbCr, " ").Replace(vbLf, " ").ToLower

                            Dim value = 0.0 : If Not Double.TryParse(svalue, value) Then Continue For

                            If category Like "mean temperature" Then temp = value

                            If category Like "total precipitation" Then precipitation = value

                            If category Like "visibility" Then visibility = value

                            If category Like "mean wind speed" Then windMean = value

                            If category Like "maximum sustained" Then windSustained = value

                            If category Like "maximum gust" Then windGust = value


                        Dim s = String.Format("{0:0000}/{1:00}/{2:00}", year, month, day)

                        Console.WriteLine("{1}{0}{2}{0}{3}{0}{4}{0}{5}{0}{6}{0}{7}{0}{8}", vbTab, s, title, temp, precipitation, visibility, windMean, windSustained, windGust)





    End Sub

End Module


Fetching pages: HTML into XHTML

Goal: to use VB's XML support for reading the web page. That's because VB has such nice syntax (I find it easier than xpath, or beautiful soup, or the alternatives). The problem is that most web-pages are written in a sloppy kind of HTML that might render okay but certainly can't be loaded into XElement.Load.

Solution: download Tidy, an awesome open-source library and executable for, well, tidying HTML into proper XHTML. I downloaded "tidy.exe" and put it into my windows directory, so I could execute it without messing around with the path.

The above code calls a function "Fetch". This is the one that fetches pages, and invokes "tidy" to clean up the html. Here is the implementation of Fetch. It uses a function "InputAndOutputToEnd" to redirect input and output of tidy.exe when it runs it. I wrote about InputAndOutputToEnd last month.


Module Helpers


    ''' <summary>

    ''' GetAttr: x.GetAttr("attr") is equivalent to x.@attr. It's here to work around a MONO bug: MONO

    ''' will throw an exception on x.@attr if the attribute is absent; the CLR doesn't. This function

    ''' also doesn't throw.

    ''' </summary>

    <System.Runtime.CompilerServices.Extension()> Function GetAttr(ByVal e As XElement, ByVal attr As String) As String

        If e Is Nothing Then Return ""

        For Each a In e.Attributes

            If String.Compare(attr, a.Name.LocalName, True) = 0 Then Return a.Value


        Return ""

    End Function


    ''' <summary>

    ''' Fetch: this function fetches the given Url and saves it into a cache in a temporary directory.

    ''' It returns the filename. If the Url had given back "text/html", then this function invokes

    ''' "tidy.exe" (from http://tidy.sourceforge.net/) to turn the html into valid XHTML such as can

    ''' be read with XElement.Load. The function will throw an exception if anything bad happened,

    ''' e.g. WebException or BadUriException. If asked to fetch a url but this url had already been downloaded

    ''' previously, and the previous download was no more than "CacheAtLeastDays" old and hadn't

    ''' been deleted, then the previous download is used. The idea is that our program might well hammer

    ''' web-services, and we don't want to be too cruel on them, so even if they didn't specify caching

    ''' for a page then we might still want to cache it. (If the webservice specified a cache longer than

    ''' CacheAtLeastDays, then any number of internet proxies along the way might cache it, and so

    ''' CacheAtLeastDays is a minimum rather than a maximum.) This function is not protected against

    ''' multiple threads calling it. There might be contention if multiple threads call it and try to

    ''' download and write to the same file. Note: in the cache, URLs are escaped then truncated to 240

    ''' characters. So if they were longer than that (e.g. long query strings) then there'll be cache

    ''' conflicts and the wrong data might be returned.

    ''' </summary>

    Function Fetch(ByVal Url As String, Optional ByVal CacheAtLeastDays As Double = 7) As String

        Dim dir = IO.Path.GetTempPath & My.Application.Info.AssemblyName & "\fetch"

        If Not Directory.Exists(dir) Then Directory.CreateDirectory(dir)

        ' Note: if the directory already existed, then CreateDirectory just proceeds silently without fuss.


        Dim fn = dir & "\" & Uri.EscapeDataString(Url.Replace("http://", "").Replace("/", "_")).Replace("%", "#")

        ' MONO: If you try to XElement.Load(fn) where fn includes %escapes, then it tries to unescape them.

        ' So we make sure there are no %escapes in the filename.  (CLR doesn't have this quirk.)


        fn = fn.Substring(0, Math.Min(240, fn.Length))

        ' MONO on unix: is fine so long as every directory/filename component is <=240 characters.

        ' CLR on windows: requires the entire path "fn" to be <=240 characters.

        ' http://blogs.msdn.com/bclteam/archive/2007/02/13/long-paths-in-net-part-1-of-3-kim-hamilton.aspx


        If File.Exists(fn) Then

            Dim age = DateTime.Now - File.GetLastWriteTime(fn)

            If age.TotalDays <= CacheAtLeastDays Then Return fn


        End If


        Dim x = WebRequest.Create(Url)

        Using r = x.GetResponse

            Dim t = ""

            Using rs As New StreamReader(r.GetResponseStream)

                t = rs.ReadToEnd

            End Using

            If Not r.ContentType.StartsWith("text/html") Then

                My.Computer.FileSystem.WriteAllText(fn, t, False, Text.Encoding.UTF8)

                Return fn

            End If

            Using tidy As New System.Diagnostics.Process

                Dim cmd = "tidy"

                Dim args = "-asxml -numeric -quiet --doctype omit"

                ' MONO: XElement.Load throws an exception if DOCTYPE is present. CLR doesn't. Hence we omit the DOCTYPE.

                tidy.StartInfo.FileName = cmd

                tidy.StartInfo.Arguments = args

                tidy.StartInfo.UseShellExecute = False

                tidy.StartInfo.RedirectStandardInput = True

                tidy.StartInfo.RedirectStandardOutput = True

                tidy.StartInfo.RedirectStandardError = True


                Dim err = "", op = ""

                tidy.InputAndOutputToEnd(t, op, err)


                If tidy.HasExited Then

                    ' We had already asked ("-numeric") for tidy to escape non-ascii characters. But

                    ' nonetheless, XElement.Load will throw an exception if there are any, and we really

                    ' don't want that, so we'll do belt-and-braces here:

                    Dim op2 As New Text.StringBuilder(op.Length)

                    For i = 0 To op.Length - 1

                        Dim c = AscW(op(i))

                        If (c >= 32 AndAlso c < 127) OrElse c = 13 OrElse c = 10 OrElse c = 9 Then


                        End If


                    My.Computer.FileSystem.WriteAllText(fn, op2.ToString, False, Text.Encoding.ASCII)

                    Return fn

                End If



            End Using

        End Using

        Return ""

    End Function



    ''' <summary>

    ''' InputAndOutputToEnd: Given a started process, this lets you supply a string as input if you want,

    ''' and will read all output and error to the end. This function has no timeout: if we give it an input string

    ''' but the process fails to read it to completion, or if we ask for standard-output/error but the process

    ''' fails to close these streams, then the function will block indefinitely. The function will throw

    ''' an exception if there was an error reading from the streams. The caller is expected to have started

    ''' the process before calling the function, and the caller is expected to wait for the process to close

    ''' and to dispose of it afterwards. If the caller uses this function, then the caller should do no

    ''' other input/output to the process.

    ''' </summary>

    <Runtime.CompilerServices.Extension()> Sub InputAndOutputToEnd(ByVal p As Diagnostics.Process, ByVal StandardInput As String, ByRef StandardOutput As String, ByRef StandardError As String)

        If p Is Nothing Then Throw New ArgumentException("process must be non-null", "p")

        ' Assume p has started. Alas there's no way to check.

        If p.StartInfo.UseShellExecute Then Throw New ArgumentException("Set StartInfo.UseShellExecute to false")

        If (p.StartInfo.RedirectStandardInput <> (StandardInput IsNot Nothing)) Then Throw New ArgumentException("Provide a non-null Input only when StartInfo.RedirectStandardInput")

        If (p.StartInfo.RedirectStandardOutput <> (StandardOutput IsNot Nothing)) Then Throw New ArgumentException("Provide a non-null Output only when StartInfo.RedirectStandardOutput")

        If (p.StartInfo.RedirectStandardError <> (StandardError IsNot Nothing)) Then Throw New ArgumentException("Provide a non-null Error only when StartInfo.RedirectStandardError")


        ' MSDN notes, http://msdn.microsoft.com/en-us/library/system.diagnostics.processstartinfo.redirectstandardoutput.aspx,

        ' that "Synchronous read operations introduce a dependency between the caller reading from the StandardOutput stream

        ' and the child process writing to that stream. These dependencies can cause deadlock conditions." We avoid the deadlock

        ' by running in a separate thread.


        Dim outputData As New InputAndOutputToEndData

        Dim errorData As New InputAndOutputToEndData


        If p.StartInfo.RedirectStandardOutput Then

            outputData.Stream = p.StandardOutput

            outputData.Thread = New Threading.Thread(AddressOf InputAndOutputToEndProc)


        End If

        If p.StartInfo.RedirectStandardError Then

            errorData.Stream = p.StandardError

            errorData.Thread = New Threading.Thread(AddressOf InputAndOutputToEndProc)


        End If


        If p.StartInfo.RedirectStandardInput Then



        End If


        If p.StartInfo.RedirectStandardOutput Then outputData.Thread.Join() : StandardOutput = outputData.Output

        If p.StartInfo.RedirectStandardError Then errorData.Thread.Join() : StandardError = errorData.Output

        If outputData.Exception IsNot Nothing Then Throw outputData.Exception

        If errorData.Exception IsNot Nothing Then Throw errorData.Exception

    End Sub


    Private Class InputAndOutputToEndData

        Public Thread As Threading.Thread

        Public Stream As IO.StreamReader

        Public Output As String

        Public Exception As Exception

    End Class


    Private Sub InputAndOutputToEndProc(ByVal data_ As Object)

        Dim data = DirectCast(data_, InputAndOutputToEndData)

        Try : data.Output = data.Stream.ReadToEnd : Catch e As Exception : data.Exception = e : End Try

    End Sub


End Module 



Comments (7)
  1. Don’t trust Tidy too much – its HTML parser is far from perfect, and when it fails, you do not get valid XHTML as output, even if you asked for it. One example of something it can’t handle are MSOffice HTML extensions (such as VML), and the non-standard-compliant way Office uses to declare namespaces in HTML documents it produces. And there are quite a few pages on the Web made by using "Save as HTML" in word.

    Apart from writing an HTML parser from scratch, the only other reasonable option is to use IE (or rather HTMLDocument coclass and IHTMLDocument interface) to parse it, and then walk its DOM. Along these lines:


  2. joko nardi says:

    Hi…i saw your contents -> great..can you give me sample to download 🙂



  3. Wooldridge says:

    Amazing blog! Do you have any recommendations for aspiring

    writers? I'm hoping to start my own site soon but

    I'm a little lost on everything. Would you recommend starting with a free platform like WordPress

    or go for a paid option? There are so many options out there that I'm

    completely confused .. Any recommendations? Many thanks!

  4. Diggles says:

    Wow, that's what I was seeking for, what a information! existing here at this webpage, thanks admin of this site.

  5. Carlin says:

    It's impressive that you are getting thoughts from this piece of

    writing as well as from our argument made here.

  6. Dolan says:

    Someone necessarily lend a hand to make severely articles I might state.

    That is the first time I frequented your web page and so far?

    I surprised with the research you made to create

    this particular post extraordinary. Wonderful process!

  7. scrape-defender says:

    Thank you for this excellent read. scraping the websites is increasing so it is necessary to take security services.

Comments are closed.

Skip to main content