Microcode: PowerShell Scripting Tricks: Scripting The Web (Part 1) (Get-Web)

Several of the last posts have tackled how to take the wild world of data and start to turn it into PowerShell objects, so that it’s easier to make heads or tails out of it.  Once all of that data is in a form that PowerShell can use more effectively in the object pipeline, you can use PowerShell to slice and dice that data in wonderfully effective ways.

Data mining is a strange art, and I find that scripting languages are normally a little better at it than compiled languages. PowerShell can be incredibly useful for data miners for two important reasons.  First, it can draw on a very wide array of technologies (COM, WMI, .NET, SQL, regular expressions, web services, the command line) in order to get structured data.  Second, and more importantly, since you can easily create objects on the fly in PowerShell, and since PowerShell has wonderful string processing, you can often use PowerShell to extract structured data out of unstructured data.
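As a small illustration of that second point, here is a rough sketch that turns a couple of loosely structured sentences into objects. The sample text, the pattern, and the property names (Name, Points, Date) are all made up for this example; real data will need its own pattern.

```powershell
# Hypothetical sample text; any semi-structured string would do.
$text = @'
Alice scored 42 points on 2009-01-12
Bob scored 17 points on 2009-01-13
'@

# Named groups in the regular expression become the properties of each object.
$pattern = '(?<Name>\w+) scored (?<Points>\d+) points on (?<Date>\d{4}-\d{2}-\d{2})'

$scores = foreach ($m in [regex]::Matches($text, $pattern)) {
    New-Object PSObject -Property @{
        Name   = $m.Groups['Name'].Value
        Points = [int]$m.Groups['Points'].Value
        Date   = $m.Groups['Date'].Value
    }
}

# Once the text is a set of objects, the usual pipeline tools apply.
$scores | Sort-Object Points -Descending
```

Each match becomes one object, so sorting, filtering, and grouping work the same way they would on any other PowerShell output.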

Being able to pull data out of the mist and give it form is a very valuable skill, because people do not think in structured data.  While it might be useful to the computing world to have most information in structured data, most people disseminating information don’t think or record their thoughts with rigorous structure.  However, many people do record their thoughts in semi-rigorous structure, like the sentences and paragraphs you’re reading now.

The fact of the matter is that tons of data is floating around on the web, and it takes only a little bit of work and a small amount of art to extract it.  This is because the web that people record their thoughts in is largely written in HTML, so it is possible to learn a few ways to pull the data out using the little structure that exists.

The first piece of the toolkit for extracting data from the Web is a function I’ve called Get-Web.  Get-Web simply downloads web pages, and it wraps part of the System.Net.WebClient class.

Using WebClient has pros and cons.  The biggest pro is that it relies on .NET, rather than on a particular browser, which means that you can use it without IE.  The biggest con is that a lot of web pages check which browser is making the request in order to change how they display, so what WebClient downloads may not match what you see in a browser.
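If a site does change its output based on the browser, one thing that sometimes helps is sending a User-Agent header with the request. Here is a minimal sketch of how that looks on a WebClient. The user-agent string is invented for illustration, and Get-Web as written does not expose a parameter for it, so treat this as something you could add rather than something the function already does.

```powershell
$webClient = New-Object Net.WebClient

# Hypothetical user-agent string; substitute whatever the target site expects.
$webClient.Headers.Add('User-Agent', 'Mozilla/5.0 (compatible; Get-Web sketch)')

# Show the header that will be sent with the next download.
$webClient.Headers['User-Agent']
```

Whether this is enough depends entirely on the site; some pages decide what to show using script that runs in the browser, and no header will change that.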

Get-Web is below.  As with Get-HashtableAsObject, I’m using comments to declare some nifty inline help.

function Get-Web($url, 
    [switch]$self,
    $credential, 
    $toFile,
    [switch]$bytes)
{
    #.Synopsis
    #    Downloads a file from the web
    #.Description
    #    Uses System.Net.Webclient (not the browser) to download data
    #    from the web.
    #.Parameter self
    #    Uses the default credentials when downloading that page (for downloading intranet pages)
    #.Parameter credential
    #    The credentials to use to download the web data
    #.Parameter url
    #    The page to download, as a full URL (e.g. https://www.msn.com/)
    #.Parameter toFile
    #    The file to save the web data to
    #.Parameter bytes
    #    Download the data as bytes   
    #.Example
    #    # Downloads www.live.com and outputs it as a string
    #    Get-Web https://www.live.com/
    #.Example
    #    # Downloads www.msn.com and saves it to a file
    #    Get-Web https://www.msn.com/ -toFile www.msn.com.html
    $webclient = New-Object Net.Webclient
    if ($credential) {
        $webClient.Credentials = $credential
    }
    if ($self) {
        $webClient.UseDefaultCredentials = $true
    }
    if ($toFile) {
        if (-not "$toFile".Contains(":")) {
            $toFile = Join-Path $pwd $toFile
        }
        $webClient.DownloadFile($url, $toFile)
    } else {
        if ($bytes) {
            $webClient.DownloadData($url)
        } else {
            $webClient.DownloadString($url)
        }
    }
}
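One detail in the -toFile branch deserves a note.  .NET resolves relative paths against the process's current directory, which is not necessarily the same as PowerShell's current location, so the function joins a relative file name with $pwd before handing it to DownloadFile.  The function detects a full path by looking for a colon.  A possible alternative, sketched below, is to ask .NET whether the path is rooted, which should also recognize UNC paths.  Resolve-TargetPath is a hypothetical helper name used only for this example; it is not part of Get-Web.

```powershell
# Hypothetical helper: returns rooted paths unchanged and resolves
# relative paths against PowerShell's current location.
function Resolve-TargetPath($path) {
    if ([IO.Path]::IsPathRooted("$path")) {
        "$path"
    } else {
        Join-Path $pwd "$path"
    }
}

# A relative file name ends up under the current PowerShell location.
Resolve-TargetPath 'www.msn.com.html'
```

Like the original check, this sketch assumes the current location is a file system path.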

To try Get-Web, simply point it at any web page:

   Get-Web https://en.wikipedia.org/

To save a page to disk:

   Get-Web https://www.msn.com/ -toFile www.msn.com.html

Just downloading the data is only the first step.  Being able to download web data gets the mist into a bottle, but it doesn’t help you give it form.  The next piece will cover how to pull the data out of the web and into PowerShell.

Hope this helps,

James Brundage [MSFT]