Internals of Hadoop Pig Operators as MapReduce Jobs

I was recently asked to show that Pig scripts actually run as MapReduce jobs. To explain this in a very simple way, I created the following example:

 

  1. Read a text file using Pig Script
  2. Dump the content of the file

 

As the session below shows, the “load” statement alone starts nothing; a MapReduce job was initiated only when the “dump” command was used:

  c:\apps\dist>pig
 2012-02-09 05:19:12,777 [main] INFO org.apache.pig.Main - Logging error messages to: c:\apps\dist\pig_1328764752777.log
 2012-02-09 05:19:13,198 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.114.226.34:9000
 2012-02-09 05:19:13,652 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.114.226.34:9010
 grunt> raw = load 'avkashwordfile.txt'; 
 
 grunt> dump raw; 
 2012-02-09 05:19:46,542 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
 2012-02-09 05:19:46,542 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
 2012-02-09 05:19:46,761 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: raw: Store(hdfs://10.114.226.34:9000/tmp/temp-1709215369/tmp275450578:org.apache.pig.impl.io.InterStorage) - scope-1 Operator Key: scope-1)
 2012-02-09 05:19:46,776 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
 2012-02-09 05:19:46,823 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
 2012-02-09 05:19:46,823 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
 2012-02-09 05:19:46,995 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
 2012-02-09 05:19:47,026 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
 2012-02-09 05:19:48,308 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
 2012-02-09 05:19:48,339 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
 2012-02-09 05:19:48,839 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
 2012-02-09 05:19:48,870 [Thread-6] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
 2012-02-09 05:19:48,870 [Thread-6] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
 2012-02-09 05:19:48,886 [Thread-6] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
 2012-02-09 05:19:51,183 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201202082253_0006
 2012-02-09 05:19:51,183 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://10.114.226.34:50030/jobdetails.jsp?jobid=job_201202082253_0006
 2012-02-09 05:20:15,198 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
 2012-02-09 05:20:16,198 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
 2012-02-09 05:20:21,198 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
 2012-02-09 05:20:30,932 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
 2012-02-09 05:20:30,932 [main] INFO org.apache.pig.tools.pigstats.PigStats - Script Statistics:
 
 HadoopVersion PigVersion UserId StartedAt FinishedAt Features
 0.20.203.1-SNAPSHOT 0.8.1-SNAPSHOT avkash 2012-02-09 05:19:46 2012-02-09 05:20:30 UNKNOWN
 
 Success!
 
 Job Stats (time in seconds):
 JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
 job_201202082253_0006 1 0 12 12 12 0 0 0 raw MAP_ONLY hdfs://10.114.226.34:9000/tmp/temp-1709215369/tmp275450578,
 
 Input(s):
 Successfully read 15 records (482 bytes) from: "hdfs://10.114.226.34:9000/user/avkash/avkashwordfile.txt"
 
 Output(s):
 Successfully stored 15 records (183 bytes) in: "hdfs://10.114.226.34:9000/tmp/temp-1709215369/tmp275450578"
 
 Counters:
 Total records written : 15
 Total bytes written : 183
 Spillable Memory Manager spill count : 0
 Total bags proactively spilled: 0
 Total records proactively spilled: 0
 
 Job DAG:
 job_201202082253_0006
 
 
 2012-02-09 05:20:30,948 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
 2012-02-09 05:20:30,979 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
 2012-02-09 05:20:30,979 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
 (avkash)
 (amit)
 (akhil)
 (avkash)
 (hello)
 (world)
 (hello)
 (state)
 (avkash)
 (akhil)
 (world)
 (state)
 (world)
 (state)
 (hello)
 grunt>
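
The job statistics above report 1 map, 0 reduces, and the feature MAP_ONLY: the records are written to a temporary location and then read back for display. To make that concrete, here is a minimal Python sketch of what such a map-only job does conceptually. It is a model for illustration only, not Pig's implementation. The function names map_only_job and format_tuple are hypothetical, and the tab split is an assumption that mirrors the default field delimiter of PigStorage, Pig's default load function.

```python
# Conceptual model of the map-only job seen in the session above.
# This is NOT Pig's code; names here are hypothetical and for illustration.

def map_only_job(lines):
    """Turn each input line into a tuple of fields, with no reduce phase.

    Assumption: fields are separated by tabs, as with Pig's default
    PigStorage loader. A line without tabs becomes a one-field tuple.
    """
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        records.append(tuple(fields))
    return records


def format_tuple(record):
    """Render a tuple in the parenthesized style of the dump output."""
    return "(" + ",".join(record) + ")"


if __name__ == "__main__":
    # First three words from the dump output above.
    sample = ["avkash\n", "amit\n", "akhil\n"]

    # Step 1: the map phase produces records (stored in a temp location).
    stored = map_only_job(sample)

    # Step 2: the stored records are read back and printed.
    for record in stored:
        print(format_tuple(record))
```

Running the sketch prints (avkash), (amit), and (akhil), each on its own line, in the same form as the first three lines of the dump output. Nothing in this script needs records with the same key to be brought together, which is why no reduce phase appears; operators such as GROUP generally add one.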