Fun with PowerShell: Part 1 – Loading Demo Data

I know I am relatively late to the game when it comes to PowerShell, but I’ve found myself using it more and more lately in places where I would traditionally write a bit of C# code. Yesterday as I worked to put together yet another demo I created an interesting script worth sharing.

For the demo, I wanted to use a couple of years worth of the airline on-time performance data published by the Bureau of Transportation Statistics. The problem with this data is that it is made available monthly and is packed in a zip file. The zip file is problematic since Hive cannot natively read them and decompressing the monthly data file results in files that are somewhere north of 220Kb in size.

The solution to this problem was pretty simple. I needed to extract the data file from the *.zip file and recompress it into a more Hadoop friendly format (GZip).

Overview of the Script

The PowerShell script below performs the following tasks iteratively for a specified range of years and months:

  1. Checks for the configured storage account and container, creating them if they do not exist.
  2. Downloads the zipped data file from the https://www.transtats.bts.gov website.
  3. Unblocks the downloaded zip file.
  4. Decompresses the zip file to a working directory.
  5. Compresses the CSV data file using GZip.
  6. Uploads the GZip file to the specified storage account and container.

To run the script, you simplify specify:

  • The local path to your Azure Publish Settings file
  • An Azure Storage Account Name
  • A Container name, where the data (blobs) will be loaded to
  • The geographical location for the Azure Storage Account (only used if the script will create storage account)
  • A local path to a working directory (the script will create it if it does not exist)
  • The year and month ranges for data you want to load

The script is provided in-line for your review and you can download the script file directly from HERE.

 $publishSettings = "<<PUBLISH SETTINGS FILE>>"
$storageAccountName = "<<STORAGE ACCOUNT NAME>>"
$containerName = "<<CONTAINER NAME>>"
$location = "East US"
$workingDir = "c:\temp\"

$beginYear = 2011
$endYear = 2011
$startMonth = 1
$endMonth = 12

#Start-up 
Import-Module 'C:\Program Files (x86)\Microsoft SDKs\Azure\PowerShell\ServiceManagement\Azure\Azure.psd1'
Import-AzurePublishSettingsFile $publishSettings

if ((Test-Path -Path $workingDir) -ne $True)
{
    Write-Host "Creating Working Directory...."
    New-Item $workingDir -ItemType Directory
}
else 
{
    Remove-Item "$workingDir\*.*"
}

# Check the storage account, create if it doesn't exist
Write-Host "Checking Storage Account..."
$stgAcct = Get-AzureStorageAccount | Where-Object {$_.StorageAccountName -eq $storageAccountName }

if (!$stgAcct){
    Write-Host "Storage Account Not Found! Creating it...."
    New-AzureStorageAccount -Location $location -StorageAccountName $storageAccountName
}

# Get the storage account key
$storageAccountKey = Get-AzureStorageKey $storageAccountName | %{ $_.Primary }

# Create a storage context object
$storageContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey  

# Check the storage container, create if it doesn't exist
Write-Host "Checking Storage Container..."
$stgContainer = Get-AzureStorageContainer -Context $storageContext | Where-Object {$_.Name -eq $containerName }

if (!$stgContainer){
   Write-Host "Storage Container not found! Creating it...."
   New-AzureStorageContainer -Name $containerName -Context $storageContext
}

$year = $beginYear
$month = $startMonth

while ($year -le $endYear)
{
    while ($month -le $endMonth)
    {
        $url = "https://www.transtats.bts.gov/Download/On_Time_On_Time_Performance_{0}_{1}.zip" -f $year, $month
        $file = "$workingDir\{0}_{1}.zip" -f $year, $month

        #File Download
        Write-Host "Downloading File for $Month/$Year..."
        $webClient = New-Object System.Net.WebClient
        $webClient.DownloadFile($url, $file)

        #Unblock the file since we downloaded it from the net
        Unblock-File $file

        #Unzip the files
        Write-Host "Unzipping File for $Month/$Year..."
        $shell = New-Object -ComObject Shell.Application
        $files = $shell.NameSpace($file).Items()
        $shell.Namespace($workingDir).CopyHere($files)

        $csvFile = Get-ChildItem -path $workingDir -Filter *.csv | Foreach-Object {$_.BaseName}

        $inFile = "$workingDir\$csvFile.csv"
        $outFile = "$workingDir\$csvFile.gz"

        #GZip the CSV data file
        Write-Host "Creating GZip for $Month/$Year..." 
        $input = New-Object System.IO.FileStream $inFile, ([IO.FileMode]::Open), ([IO.FileAccess]::Read), ([IO.FileShare]::Read)
 
        $buffer = New-Object byte[]($input.Length)
        $byteCount = $input.Read($buffer, 0, $input.Length)
 
        if ($byteCount -ne $input.Length)
        {
            $input.Close()
            Write-Host "Failure reading $inFile."
            exit
        }

        $input.Close()
 
        $output = New-Object System.IO.FileStream $outFile, ([IO.FileMode]::Create), ([IO.FileAccess]::Write), ([IO.FileShare]::None)
        $gzipStream = New-Object System.IO.Compression.GzipStream $output, ([IO.Compression.CompressionMode]::Compress)
 
        $gzipStream.Write($buffer, 0, $buffer.Length)
        $gzipStream.Close()

        $output.Close()
        
        Write-Host "Copying File to Azure for $Month/$Year..."
        Set-AzureStorageBlobContent -Blob "$year/$month.gz" -File $outFile -Container $containerName -Context $storageContext -Force

        Write-Host "Cleaning up working directory..."
        Remove-Item "$workingDir\*.*"

        $month = $month + 1
    }

    $year = $year + 1
    $month = $startMonth
}

Write-Host "Processing Complete!"

Please keep in mind that I am not (nor will I ever claim to be) a PowerShell expert and the the code is provided-as-is and does not contain any error checking or handling logic. Enjoy.

Till next time!

Chris