Apache Hadoop on Windows Azure Part 10 - Running a JavaScript Map/Reduce Job from Interactive JavaScript Console

Microsoft distribution of Apache Hadoop on Windows Azure, let you run JavaScript Map/Reduce jobs directly from web based Interactive JavaScript Console. To start with lets write a JavaScript code for Map/Reduce wordcount jobs as below:

 

FileName #Wordcount.js:

 var map = function (key, value, context) {
 var words = value.split(/[^a-zA-Z]/);
 for (var i = 0; i < words.length; i++) {
 if (words[i] !== "") {
 context.write(words[i].toLowerCase(), 1);
 }
 }
 };var reduce = function (key, values, context) {
 var sum = 0;
 while (values.hasNext()) {
 sum += parseInt(values.next());
 }
 context.write(key, sum);
 };
 

After that you can upload this wordcount.js file to HDFS and verify it as below:

 js> fs.put()
 js> #ls
 Found 2 items
 drwxr-xr-x   - avkash supergroup          0 2012-01-02 20:25 /user/avkash/.oink
 -rw-r--r--   3 avkash supergroup        418 2012-01-02 20:17 /user/avkash/wordcount.js

Now you can create a folder name “wordsfolder” and upload a few txt files. We will use this folder as input folder to run the word count map/reduce job.

 js> #ls
 Found 3 items
 drwxr-xr-x   - avkash supergroup          0 2012-01-02 20:25 /user/avkash/.oink
 -rw-r--r--   3 avkash supergroup        418 2012-01-02 20:17 /user/avkash/wordcount.js
 drwxr-xr-x   - avkash supergroup          0 2012-01-02 20:22 /user/avkash/wordsfolder
  

js> #ls wordsfolder

Found 3 items

-rw-r--r--   3 avkash supergroup    1395667 2012-01-02 20:22 /user/avkash/wordsfolder/davinci.txt

-rw-r--r--   3 avkash supergroup     674762 2012-01-02 20:22 /user/avkash/wordsfolder/outlineofscience.txt

-rw-r--r--   3 avkash supergroup    1573044 2012-01-02 20:22 /user/avkash/wordsfolder/ulysses.txt

  

Now we can run the JavaScript Map/Reduce job to count the top 15 words in descending order in the folder name “top15words” as below:

 js> from("wordsfolder").mapReduce("wordcount.js", "word, count:long").orderBy("count DESC").take(15).to("top15words")
 View Log
  

If you click the “View Log” link above in a new tab, you can see the activity about Map/Reduce job which I have added at the end of this blog:

  

Finally when the job is completed, the following folder “top15words” will be created as below:

 js> #ls 
 Found 4 items
 drwxr-xr-x   - avkash supergroup          0 2012-01-02 20:26 /user/avkash/.oink
 drwxr-xr-x   - avkash supergroup          0 2012-01-02 20:31 /user/avkash/top15words
 -rw-r--r--   3 avkash supergroup        418 2012-01-02 20:17 /user/avkash/wordcount.js
 drwxr-xr-x   - avkash supergroup          0 2012-01-02 20:22 /user/avkash/wordsfolder
  

Now we can read the data from the “top15words” folder:

 js> file = fs.read("top15words")
 the    47430
 of     25263
 and    18664
 a      14213
 in     13125
 to     12634
 is     7876
 that   7057
 it     7005
 on     5081
 he     5037
 with   4931
 his    4314
 as     4289
 by     4119

Let’s parse the data also:

 js> data = parse(file.data,"word, count:long")
 [
     0: {
         word: "the"
         count: 47430
     }
     1: {
         word: "of"
         count: 25263
     }
     2: {
         word: "and"
         count: 18664
     }
     3: {
         word: "a"
         count: 14213
     }
     4: {
         word: "in"
         count: 13125
     }
     5: {
         word: "to"
         count: 12634
     }
     6: {
         word: "is"
         count: 7876
     }
     7: {
         word: "that"
         count: 7057
     }
     8: {
         word: "it"
         count: 7005
     }
     9: {
         word: "on"
         count: 5081
     }
     10: {
         word: "he"
         count: 5037
     }
     11: {
         word: "with"
         count: 4931
     }
     12: {
         word: "his"
         count: 4314
     }
     13: {
         word: "as"
         count: 4289
     }
     14: {
         word: "by"
         count: 4119
     }

]

Finally lets create a line graph from the results:

Here is the Map/Reduce Job results:

 

2012-01-02 20:26:52,304 [main] INFO  org.apache.pig.Main - Logging error messages to: c:\apps\dist\bin\pig_1325536012304.log

2012-01-02 20:26:52,570 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.26.104.45:9000

2012-01-02 20:26:53,038 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.26.104.45:9010

2012-01-02 20:26:53,304 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: ORDER_BY,LIMIT,NATIVE

2012-01-02 20:26:53,304 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.

2012-01-02 20:26:53,507 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - (Name: q2: Store(hdfs://10.26.104.45:9000/user/avkash/top15words:org.apache.pig.builtin.PigStorage) - scope-12 Operator Key: scope-12)

2012-01-02 20:26:53,523 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false

2012-01-02 20:26:53,742 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 5

2012-01-02 20:26:53,742 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 5

2012-01-02 20:26:53,945 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job

2012-01-02 20:26:53,992 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3

2012-01-02 20:26:55,179 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job

2012-01-02 20:26:55,210 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.

2012-01-02 20:26:55,710 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete

2012-01-02 20:26:55,835 [Thread-4] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 3

2012-01-02 20:26:55,835 [Thread-4] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 3

2012-01-02 20:26:55,882 [Thread-4] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1

2012-01-02 20:26:57,226 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201021955_0002

2012-01-02 20:26:57,226 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://10.26.104.45:50030/jobdetails.jsp?jobid=job_201201021955_0002

2012-01-02 20:27:28,772 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 10% complete

2012-01-02 20:27:40,771 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 20% complete

2012-01-02 20:27:42,646 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1

2012-01-02 20:27:43,209 [main] INFO  org.apache.hadoop.mapred.JobClient - Running job: job_201201021955_0003

2012-01-02 20:27:44,224 [main] INFO  org.apache.hadoop.mapred.JobClient -  map 0% reduce 0%

2012-01-02 20:28:12,223 [main] INFO  org.apache.hadoop.mapred.JobClient -  map 100% reduce 0%

2012-01-02 20:28:36,223 [main] INFO  org.apache.hadoop.mapred.JobClient -  map 100% reduce 100%

2012-01-02 20:28:47,222 [main] INFO  org.apache.hadoop.mapred.JobClient - Job complete: job_201201021955_0003

2012-01-02 20:28:47,222 [main] INFO  org.apache.hadoop.mapred.JobClient - Counters: 25

2012-01-02 20:28:47,222 [main] INFO  org.apache.hadoop.mapred.JobClient -   Job Counters

2012-01-02 20:28:47,222 [main] INFO  org.apache.hadoop.mapred.JobClient -     Launched reduce tasks=1

2012-01-02 20:28:47,222 [main] INFO  org.apache.hadoop.mapred.JobClient -     SLOTS_MILLIS_MAPS=32061

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     Total time spent by all reduces waiting after reserving slots (ms)=0

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     Total time spent by all maps waiting after reserving slots (ms)=0

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     Launched map tasks=1

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     Data-local map tasks=1

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     SLOTS_MILLIS_REDUCES=21531

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -   File Output Format Counters

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     Bytes Written=424066

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -   FileSystemCounters

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     FILE_BYTES_READ=11850310

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     HDFS_BYTES_READ=3597791

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     FILE_BYTES_WRITTEN=17819374

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     HDFS_BYTES_WRITTEN=424066

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -   File Input Format Counters

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     Bytes Read=3597657

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -   Map-Reduce Framework

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     Reduce input groups=39491

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     Map output materialized bytes=5924329

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     Combine output records=0

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     Map input records=77934

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     Reduce shuffle bytes=0

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     Reduce output records=39491

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     Spilled Records=1890066

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     Map output bytes=4664279

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     Combine input records=0

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     Map output records=630022

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     SPLIT_RAW_BYTES=134

2012-01-02 20:28:47,238 [main] INFO  org.apache.hadoop.mapred.JobClient -     Reduce input records=630022

2012-01-02 20:28:47,238 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 40% complete

2012-01-02 20:28:47,238 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job

2012-01-02 20:28:47,238 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3

2012-01-02 20:28:48,629 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job

2012-01-02 20:28:48,644 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.

2012-01-02 20:28:49,035 [Thread-24] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1

2012-01-02 20:28:49,035 [Thread-24] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

2012-01-02 20:28:49,035 [Thread-24] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1

2012-01-02 20:28:50,050 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201021955_0004

2012-01-02 20:28:50,050 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://10.26.104.45:50030/jobdetails.jsp?jobid=job_201201021955_0004

2012-01-02 20:29:17,550 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete

2012-01-02 20:29:20,049 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete

2012-01-02 20:29:25,049 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete

2012-01-02 20:29:29,549 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job

2012-01-02 20:29:29,549 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3

2012-01-02 20:29:30,768 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job

2012-01-02 20:29:30,830 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.

2012-01-02 20:29:31,205 [Thread-34] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1

2012-01-02 20:29:31,205 [Thread-34] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

2012-01-02 20:29:31,205 [Thread-34] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1

2012-01-02 20:29:31,330 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 60% complete

2012-01-02 20:29:32,252 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201021955_0005

2012-01-02 20:29:32,252 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://10.26.104.45:50030/jobdetails.jsp?jobid=job_201201021955_0005

2012-01-02 20:30:11,251 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 70% complete

2012-01-02 20:30:12,251 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 70% complete

2012-01-02 20:30:17,251 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 70% complete

2012-01-02 20:30:22,251 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 70% complete

2012-01-02 20:30:27,250 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 70% complete

2012-01-02 20:30:32,750 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 80% complete

2012-01-02 20:30:37,250 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 80% complete

2012-01-02 20:30:42,250 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 80% complete

2012-01-02 20:30:46,765 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job

2012-01-02 20:30:46,765 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3

2012-01-02 20:30:47,937 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job

2012-01-02 20:30:47,984 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.

2012-01-02 20:30:48,406 [Thread-45] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1

2012-01-02 20:30:48,406 [Thread-45] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

2012-01-02 20:30:48,406 [Thread-45] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1

2012-01-02 20:30:48,484 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 80% complete

2012-01-02 20:30:49,390 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_201201021955_0006

2012-01-02 20:30:49,390 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - More information at: https://10.26.104.45:50030/jobdetails.jsp?jobid=job_201201021955_0006

2012-01-02 20:31:17,889 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete

2012-01-02 20:31:19,389 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete

2012-01-02 20:31:24,389 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete

2012-01-02 20:31:34,389 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 90% complete

2012-01-02 20:31:48,982 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete

2012-01-02 20:31:48,998 [main] INFO  org.apache.pig.tools.pigstats.PigStats - Script Statistics:

 

HadoopVersion      PigVersion         UserId    StartedAt FinishedAt         Features

0.20.203.1-SNAPSHOT 0.8.1-SNAPSHOT     avkash    2012-01-02 20:26:53 2012-01-02 20:31:48          ORDER_BY,LIMIT,NATIVE

 

Success!

 

Job Stats (time in seconds):

JobId     Maps      Reduces   MaxMapTime         MinMapTIme         AvgMapTime          MaxReduceTime      MinReduceTime      AvgReduceTime      Alias     Feature   Outputs

job_201201021955_0002        1         0         15        15        15        0         0         0          q0        MAP_ONLY 

job_201201021955_0004        1         0         12        12        12        0         0         0          q1        MAP_ONLY 

job_201201021955_0005        1         1         11        11        11        21        21        21          q2        SAMPLER  

job_201201021955_0006        1         1         12        12        12        18        18        18          q2        ORDER_BY,COMBINER  hdfs://10.26.104.45:9000/user/avkash/top15words,

job_D:/Users/avkash/AppData/Local/Temp/MRjs1699097122276446870.jar__0001     0         0         0          0         0         0         0         0                  NATIVE   

 

Input(s):

Successfully read 77934 records (3644014 bytes) from: "hdfs://10.26.104.45:9000/user/avkash/wordsfolder"

Successfully read 39491 records (424454 bytes) from: "hdfs://10.26.104.45:9000/user/avkash/.oink/output2/mr/out"

 

Output(s):

Successfully stored 15 records (132 bytes) in: "hdfs://10.26.104.45:9000/user/avkash/top15words"

 

Counters:

Total records written : 15

Total bytes written : 132

Spillable Memory Manager spill count : 0

Total bags proactively spilled: 0

Total records proactively spilled: 0

 

Job DAG:

job_201201021955_0002        ->          job_D:/Users/avkash/AppData/Local/Temp/MRjs1699097122276446870.jar__0001,

job_D:/Users/avkash/AppData/Local/Temp/MRjs1699097122276446870.jar__0001     ->          job_201201021955_0004,

job_201201021955_0004        ->        job_201201021955_0005,

job_201201021955_0005        ->        job_201201021955_0006,

job_201201021955_0006

 

 

2012-01-02 20:31:49,092 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

 

 

Keywords: Windows Azure, Hadoop, Apache, BigData, Cloud, MapReduce