Submitting an HDInsight Pig Job

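Pig is a commonly used Hadoop component, and Azure lets you submit Pig jobs through PowerShell. When the Pig Latin script is short, you can pass its text directly through the -Query parameter of New-AzureHDInsightPigJobDefinition, as in the following example: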

 $clusterName = "HDIDemo"

 # Inline Pig Latin: load the raw log, then store it under /home/mytest1 in the default container.
 $QueryString = "intxt1 = load 'wasb://hdirawdata@teststorage.blob.core.chinacloudapi.cn/userbehavior.log' ;" +
                "store intxt1 into 'wasb:///home/mytest1' ;"

 $pigJobDefinition = New-AzureHDInsightPigJobDefinition -Query $QueryString

 # Submit the job and wait up to one hour for it to finish.
 $pigJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $pigJobDefinition

 Wait-AzureHDInsightJob -Job $pigJob -WaitTimeoutInSeconds 3600

 # Print the job's standard error stream, which is where Pig writes its execution log.
 Write-Host "Display the standard error output ..." -ForegroundColor Green
 Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $pigJob.JobId -StandardError
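
The example above prints the job log but does not check whether the job actually succeeded. The following is a minimal sketch of such a check. It assumes that the job object returned by Wait-AzureHDInsightJob exposes State and ExitCode properties; Test-PigJobSucceeded is a hypothetical helper name, not part of the Azure module.

```powershell
# Sketch: decide whether a finished HDInsight job succeeded.
# Assumption: the job object has State and ExitCode properties.
function Test-PigJobSucceeded {
    param($Job)

    # A missing job object (for example after a timeout) counts as a failure.
    if ($null -eq $Job) { return $false }

    # Compare State as a string so the check works whether it is an enum or plain text.
    return ([string]$Job.State -eq 'Completed') -and ($Job.ExitCode -eq 0)
}

# Usage (after Start-AzureHDInsightJob), shown as comments because it needs a live cluster:
#   $finished = Wait-AzureHDInsightJob -Job $pigJob -WaitTimeoutInSeconds 3600
#   if (-not (Test-PigJobSucceeded $finished)) {
#       Write-Warning "Pig job did not complete successfully; inspect the standard error output."
#   }
```

Keeping the check in a small function makes it reusable for both the -Query and the -File variants of the job definition.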

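When the Pig Latin script is too long, submission fails with the error "The input line is too long", which occurs because the batch submitted in a single call is too long. In that case, the way around it is to put the script in a Pig Latin file and reference that file from the job definition. The steps are as follows: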

1. Save the following Pig Latin script in a file with the .pig extension (PigLatinTest.pig in this example), and store the file in Azure blob storage.

 intxt1 = load 'wasb://hdirawdata@teststorage.blob.core.chinacloudapi.cn/userbehavior.log';
 store intxt1 into 'wasb:///home/mytest1';

2. Use the following commands to reference the Pig Latin script and run the Pig job:

 $clusterName = "HDIDemo"

 # Folder in the cluster's default storage for job status files (example path; adjust as needed).
 $statusFolder = "/pigjob/status"

 # Reference the uploaded script with -File instead of passing its text with -Query.
 $pigJobDefinition = New-AzureHDInsightPigJobDefinition -File "wasb://hdirawdata@teststorage.blob.core.chinacloudapi.cn/PigLatinTest.pig" -StatusFolder $statusFolder -Verbose

 $pigJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $pigJobDefinition

 Wait-AzureHDInsightJob -Job $pigJob -WaitTimeoutInSeconds 3600

 # Print the job's standard error stream, which is where Pig writes its execution log.
 Write-Host "Display the standard error output ..." -ForegroundColor Green
 Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $pigJob.JobId -StandardError
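
Step 1 says to store the .pig file in Azure blob storage but does not show the upload. The sketch below is one possible way to do it with the classic Azure storage cmdlets New-AzureStorageContext and Set-AzureStorageBlobContent. Get-WasbUri and Publish-PigScript are hypothetical helper names, the local path in the usage comment is an assumption, and the -Endpoint value assumes the Azure China storage suffix that appears in the URLs above; parameter availability may differ between module versions.

```powershell
# Build the wasb:// URI that -File expects for a blob.
# The default endpoint suffix matches the Azure China URLs used in this article.
function Get-WasbUri {
    param(
        [Parameter(Mandatory = $true)][string]$Container,
        [Parameter(Mandatory = $true)][string]$StorageAccount,
        [Parameter(Mandatory = $true)][string]$BlobName,
        [string]$EndpointSuffix = 'core.chinacloudapi.cn'
    )
    $path = $BlobName.TrimStart('/')
    return ('wasb://{0}@{1}.blob.{2}/{3}' -f $Container, $StorageAccount, $EndpointSuffix, $path)
}

# Upload a local .pig file to the container root and return its wasb:// URI.
# Assumption: the classic Azure PowerShell storage cmdlets are installed.
function Publish-PigScript {
    param(
        [Parameter(Mandatory = $true)][string]$LocalPath,
        [Parameter(Mandatory = $true)][string]$Container,
        [Parameter(Mandatory = $true)][string]$StorageAccount,
        [Parameter(Mandatory = $true)][string]$StorageAccountKey,
        [string]$EndpointSuffix = 'core.chinacloudapi.cn'
    )
    $blobName = Split-Path -Path $LocalPath -Leaf
    $context = New-AzureStorageContext -StorageAccountName $StorageAccount `
        -StorageAccountKey $StorageAccountKey -Endpoint $EndpointSuffix
    # -Force overwrites an existing blob with the same name.
    Set-AzureStorageBlobContent -File $LocalPath -Container $Container `
        -Blob $blobName -Context $context -Force | Out-Null
    return Get-WasbUri -Container $Container -StorageAccount $StorageAccount `
        -BlobName $blobName -EndpointSuffix $EndpointSuffix
}

# Usage, shown as comments because it needs real storage credentials:
#   $uri = Publish-PigScript -LocalPath 'C:\scripts\PigLatinTest.pig' -Container 'hdirawdata' `
#       -StorageAccount 'teststorage' -StorageAccountKey $storageKey
#   $pigJobDefinition = New-AzureHDInsightPigJobDefinition -File $uri
```

Returning the URI from the upload helper keeps the blob name and the -File argument in sync, which avoids a mismatch between the uploaded file and the script the job tries to run.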