When running Pig in a production environment, you'll likely have one or more Pig Latin scripts that run on a recurring basis (daily, weekly, monthly, etc.) that need to locate their input data based on when or where they are run. For example, you may have a Pig job that performs daily log ingestion by geographic region. It would be costly and error prone to manually edit the script to reference the location of the input data each time log data needs to be ingested. Ideally, you'd like to pass the date and geographic region to the Pig script as parameters at the time the script is executed. Fortunately, Pig provides this capability via parameter substitution. There are four different mechanisms to define parameters that can be referenced in a Pig Latin script:
- Parameters can be defined as command line arguments; each parameter is passed to Pig as a separate argument using -param switches at script execution time
- Parameters can be defined in a parameter file that's passed to Pig using the -param_file command line argument when the script is executed
- Parameters can be defined inside Pig Latin scripts using the "%declare" preprocessor statement
- Default parameter values can be defined inside Pig Latin scripts using the "%default" preprocessor statement
You can use none, one, or any combination of these options.
Let's look at an example Pig script that could be run to perform IIS log ingestion. The script loads an IIS log and filters it for requests that didn't complete with a status code of 200 or 201.
Note that parameter names in Pig Latin scripts are preceded by a dollar sign, $. For example, the LOAD statement references six parameters: $WASB_SCHEME, $ROOT_FOLDER, $YEAR, $MONTH, $DAY, and $INPUTFILE.
Note also that the script makes use of the %default preprocessor statement to define default values for the WASB_SCHEME and ROOT_FOLDER parameters:
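The original script isn't reproduced here, but based on the parameters it references, a minimal sketch of such a script might look like the following (the storage path layout, field names, and load schema are illustrative assumptions, not the author's exact script):

```
-- Default values for parameters that rarely change between runs
%default WASB_SCHEME 'wasb://';
%default ROOT_FOLDER 'iislogs';

-- Build the input path from the parameters supplied at execution time
LOG = LOAD '$WASB_SCHEME$ROOT_FOLDER/$YEAR/$MONTH/$DAY/$INPUTFILE'
      USING PigStorage(' ')
      AS (log_date:chararray, log_time:chararray, cs_uri_stem:chararray, sc_status:int);

-- Keep only requests that did not complete with a 200 or 201 status code
ERRORS = FILTER LOG BY sc_status != 200 AND sc_status != 201;

DUMP ERRORS;
```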
Specifying Parameters in a Parameter File
Parameters are defined as key-value pairs. Below is an example parameter file that defines four parameters referenced by the above script: YEAR, MONTH, DAY, and INPUTFILE. The YEAR key has a value of 2014, the MONTH key has a value of 07, the DAY key has a value of 27, and the INPUTFILE key has a value of iis.log:
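Given those values, the contents of the parameter file (Parameters.txt in this example) would look something like the following; per the Pig documentation, each key-value pair goes on its own line, and lines beginning with # are treated as comments:

```
# Parameters referenced by LoadLog.pig
YEAR=2014
MONTH=07
DAY=27
INPUTFILE=iis.log
```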
The Pig preprocessor locates parameters in the Pig script by searching for the parameter name prepended with a dollar sign, $, and substitutes the value of the key for the parameter. You can pass the parameter file to Pig using the -param_file command line argument:
pig -param_file d:\users\rdpuser\documents\parameters.txt -f d:\users\rdpuser\documents\LoadLog.pig
Specifying Parameters on the Command Line
The second method of passing parameters to your Pig script at execution time is to pass each parameter as a separate key-value pair using individual -param arguments.
pig -param "YEAR=2014" -param "MONTH=07" -param "DAY=27" -param "INPUTFILE=iis.log" -f d:\users\rdpuser\documents\LoadLog.pig
Note: On Windows, key-value pairs must be enclosed in double quotes because the Windows cmd shell treats the equal sign, =, as an argument delimiter.
Testing Parameter Substitution Using the -dryrun Command Line Option
Before submitting the Pig script to the cluster's Templeton endpoint for execution using PowerShell, let's make sure that parameter substitution will work as desired. There's a useful Pig command line parameter, -dryrun, that can be used to test parameter substitution. The -dryrun option directs Pig to substitute parameter values for parameters in the Pig script, write the resulting script to a file named <original_script_name>.substituted and shut down without executing the script. The best way to try -dryrun is to enable remote access to your cluster, and use RDP to log into your HDInsight cluster's active headnode. Once you're logged in, you can execute PIG.CMD interactively as demonstrated below. Pig will report the name and location of the substituted file before it shuts down:
C:\apps\dist\pig-0.12.1.2.1.3.0-1887\bin>pig -param_file d:\users\rdpuser\documents\parameters.txt -param "MONTH=08" -param "DAY=24" -dryrun -f d:\users\rdpuser\documents\LoadLog.pig
. . .
2014-08-24 15:58:37,625 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file D:\Users\dansha/.pigbootup not found
Precedence Rules for Parameter Substitution
Note the warning messages that showed up in the -dryrun output. If a parameter is defined more than once, precedence rules determine its final value. The following precedence order comes from Apache Pig's parameter substitution documentation, listed from highest to lowest precedence:
- Parameters defined using a declare preprocessor statement have the highest precedence
- Parameters defined on the command line using -param have the second highest precedence
- Parameters defined in parameter files have the third highest precedence
- Parameters defined using the default preprocessor statement have the lowest precedence
Given the above precedence rules, even though the MONTH and DAY parameters were defined in the parameter file, the individual command line parameters specified with the -param arguments overrode them.
Below is the content of the LoadLog.pig.substituted file output by the -dryrun run. Note that every parameter was replaced with a value: some came from the parameter file, some from the -param arguments, and others from the %default preprocessor statements.
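The substituted file itself isn't reproduced here, but to illustrate the idea: in the substituted output every $NAME reference is replaced with its winning value, so a LOAD statement built from the parameters might come out like this (path layout assumed for illustration; note MONTH and DAY reflect the command-line values 08 and 24, not the parameter-file values):

```
LOG = LOAD 'wasb://iislogs/2014/08/24/iis.log'
      USING PigStorage(' ')
      AS (log_date:chararray, log_time:chararray, cs_uri_stem:chararray, sc_status:int);
```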
Submitting a Pig Job that Uses Parameters with PowerShell
Now, let's bring it all together with an example that demonstrates how to use the Azure HDInsight PowerShell cmdlets to submit a Pig job that uses command line parameters and a parameter file.
There are a couple of things in the script that are worth closer examination. First, any files the job references must be copied to one of the storage accounts the target HDInsight cluster is configured to use; this gives the Templeton server access to the files when setting the job up for execution. For the example we've been working through, we needed to copy the Pig Latin script, LoadLog.pig, and the parameter file, Parameters.txt, to Azure Blob storage using the Set-AzureStorageBlobContent cmdlet.
# Get storage context
$AzureStorageContext = New-AzureStorageContext -StorageAccountName $BlobStorageAccount -StorageAccountKey $PrimaryStorageKey
# Copy pig script and parameter file up to Azure storage where they can be accessed by the Templeton server while setting up the job for execution
Set-AzureStorageBlobContent -File C:\src\Hadoop\Pig\LoadLog.pig -BlobType Block -Container $DefaultStorageContainer -Context $AzureStorageContext -Blob "$ScriptsFolder/$ScriptName"
Set-AzureStorageBlobContent -File C:\src\Hadoop\Pig\ParamFile.txt -BlobType Block -Container $DefaultStorageContainer -Context $AzureStorageContext -Blob "$ScriptsFolder/$ParamFile"
Passing Command Line Options via PowerShell
Passing parameters to Pig jobs via the PowerShell cmdlets can be a bit confusing, and we've received a number of inquiries about how to go about it. With that in mind, the most important thing to call out in the job submission script is how to pass parameters to a Pig script using the -param and -param_file Pig command line arguments. Command line arguments are specified when the Pig job is defined with the New-AzureHDInsightPigJobDefinition cmdlet: they must be passed as an array of String objects using the -Arguments parameter, with each command line element stored as a separate array entry. This is straightforward for command line options that are switches with no associated values, like "-verbose", "-warning", and "-stop_on_failure"; each of these is added as a separate entry in the $pigParams array:
$pigParams = "-verbose","-warning","-stop_on_failure"
However, things get trickier for command line arguments that have associated values. Each Pig parameter is passed using a -param command line argument followed directly by its associated key-value pair, which is added to the $pigParams array as a separate, but adjacent, array entry.
For example, consider the first line of code below, where the INPUTFILE parameter is added to the $pigParams array. First the command line argument, "-param", is added; then the key-value pair associated with it, "INPUTFILE=$InputFile", is added as the adjacent array entry. The pattern simply repeats for each successive command line parameter.
$pigParams += "-param","INPUTFILE=$InputFile"
$pigParams += "-param","MONTH=$Month"
$pigParams += "-param","DAY=$Day"
For the parameter file, the "-param_file" argument is added to the $pigParams array followed by a separate, but adjacent, array entry that specifies the parameter file name. Finally, the $pigParams are passed to New-AzureHDInsightPigJobDefinition using the -Arguments parameter.
$pigParams += "-param_file","$param_file"
# Create pig job definition
$pigJobDefinition = New-AzureHDInsightPigJobDefinition -File $PigScript -Arguments $pigParams
The job definition created by New-AzureHDInsightPigJobDefinition is then used by the Start-AzureHDInsightJob cmdlet to submit the Pig script to the Azure HDInsight cluster for execution:
$pigJob = Start-AzureHDInsightJob -Subscription $subscriptionName -Cluster $clusterDnsName -JobDefinition $pigJobDefinition
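After submission you'll typically want to wait for the job to finish and then retrieve its output. A minimal sketch using the same classic Azure HDInsight cmdlets (the $subscriptionName and $clusterDnsName variables are assumed to be defined as above; the timeout value is arbitrary):

```powershell
# Block until the job completes, or give up after an hour
Wait-AzureHDInsightJob -Subscription $subscriptionName -Job $pigJob -WaitTimeoutInSeconds 3600

# Retrieve the job's standard output and standard error for inspection
Get-AzureHDInsightJobOutput -Subscription $subscriptionName -Cluster $clusterDnsName -JobId $pigJob.JobId -StandardOutput
Get-AzureHDInsightJobOutput -Subscription $subscriptionName -Cluster $clusterDnsName -JobId $pigJob.JobId -StandardError
```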
I hope this post clears up questions some have had about how to pass parameters to Pig jobs via PowerShell, and that you found it informative. Please let us know how we are doing, and what kind of content you would like us to write about in the future.