Running Apache Mahout at Hadoop on Windows Azure (www.hadooponazure.com)

Once you have access to Hadoop on Windows Azure, you can run any Mahout sample on the head node. Here I run the original Apache Mahout (https://mahout.apache.org/) sample, which is derived from the clustering sample on Mahout's website (https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data).

Step 1: RDP to your head node and open the Hadoop command line window.
Here you can simply launch mahout to see what happens.

Step 2: Download the necessary data file from the Internet:

Please download the synthetic control data from https://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data and place it at c:\apps\dist\mahout\examples\bin\work\synthetic_control.data
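
Before running the driver script, it can help to confirm that the download is intact. The sketch below is only an illustration and is not part of the Mahout scripts: the function names (download, load_series, check_shape) are hypothetical, and the expected shape is an assumption (600 rows, matching the "Map input records=600" counter in the log in Step 3, with 60 whitespace-separated values per row).

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL taken from Step 2 of this post.
DATA_URL = ("https://archive.ics.uci.edu/ml/databases/"
            "synthetic_control/synthetic_control.data")


def download(dest, url=DATA_URL):
    """Download the data file to dest, creating parent folders if needed."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(dest))
    return dest


def load_series(path):
    """Parse whitespace-separated floats, one time series per non-empty line."""
    rows = []
    with open(path) as handle:
        for line in handle:
            if line.strip():
                rows.append([float(value) for value in line.split()])
    return rows


def check_shape(rows, expected_rows=600, expected_cols=60):
    """Return True when every row has expected_cols values and the row count matches.

    The defaults (600 x 60) are an assumption about the synthetic control data set.
    """
    return (len(rows) == expected_rows
            and all(len(row) == expected_cols for row in rows))
```

For example, check_shape(load_series(r"c:\apps\dist\mahout\examples\bin\work\synthetic_control.data")) should return True if the file arrived complete under those assumptions.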

Step 3: Go to the folder c:\apps\dist\mahout\examples\bin, run the command "build-cluster-syntheticcontrol.cmd" and select the desired clustering algorithm from the driver script.

 c:\Apps\dist\mahout\examples\bin>build-cluster-syntheticcontrol.cmd
 "Please select a number to choose the corresponding clustering algorithm"
 "1. canopy clustering"
 "2. kmeans clustering"
 "3. fuzzykmeans clustering"
 "4. dirichlet clustering"
 "5. meanshift clustering"
 Enter your choice:1
 "ok. You chose 1 and we'll use canopy Clustering"
 "DFS is healthy... "
 "Uploading Synthetic control data to HDFS"
 rmr: cannot remove testdata: No such file or directory.
 "Successfully Uploaded Synthetic control data to HDFS "
 "Running on hadoop, using HADOOP_HOME=c:\Apps\dist"
 c:\Apps\dist\bin\hadoop jar c:\Apps\dist\mahout\mahout-examples-0.5-job.jar org.apache.mahout.driver.MahoutDriver org.apache.mahout.clustering.syntheticcontrol.canopy.Job
 12/03/06 00:50:10 WARN driver.MahoutDriver: No org.apache.mahout.clustering.syntheticcontrol.canopy.Job.props found on classpath, will use command-line arguments only
 12/03/06 00:50:10 INFO canopy.Job: Running with default arguments
 12/03/06 00:50:17 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
 12/03/06 00:50:18 INFO input.FileInputFormat: Total input paths to process : 1
 12/03/06 00:50:20 INFO mapred.JobClient: Running job: job_201203052259_0001
 12/03/06 00:50:21 INFO mapred.JobClient: map 0% reduce 0%
 12/03/06 00:51:00 INFO mapred.JobClient: map 100% reduce 0%
 12/03/06 00:51:11 INFO mapred.JobClient: Job complete: job_201203052259_0001
 12/03/06 00:51:11 INFO mapred.JobClient: Counters: 16
 12/03/06 00:51:11 INFO mapred.JobClient: Job Counters
 12/03/06 00:51:11 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=33969
 12/03/06 00:51:11 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
 12/03/06 00:51:11 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
 12/03/06 00:51:11 INFO mapred.JobClient: Launched map tasks=1
 12/03/06 00:51:11 INFO mapred.JobClient: Data-local map tasks=1
 12/03/06 00:51:11 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
 12/03/06 00:51:11 INFO mapred.JobClient: File Output Format Counters
 12/03/06 00:51:11 INFO mapred.JobClient: Bytes Written=335470
 12/03/06 00:51:11 INFO mapred.JobClient: FileSystemCounters
 12/03/06 00:51:11 INFO mapred.JobClient: FILE_BYTES_READ=130
 12/03/06 00:51:11 INFO mapred.JobClient: HDFS_BYTES_READ=288508
 12/03/06 00:51:11 INFO mapred.JobClient: FILE_BYTES_WRITTEN=21557
 12/03/06 00:51:11 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=335470
 12/03/06 00:51:11 INFO mapred.JobClient: File Input Format Counters
 12/03/06 00:51:11 INFO mapred.JobClient: Bytes Read=288374
 12/03/06 00:51:11 INFO mapred.JobClient: Map-Reduce Framework
 12/03/06 00:51:11 INFO mapred.JobClient: Map input records=600
 12/03/06 00:51:11 INFO mapred.JobClient: Spilled Records=0
 12/03/06 00:51:11 INFO mapred.JobClient: Map output records=600
 12/03/06 00:51:11 INFO mapred.JobClient: SPLIT_RAW_BYTES=134
 12/03/06 00:51:11 INFO canopy.CanopyDriver: Build Clusters Input: output/data Out: output Measure: org.apache.mahout.common.distance.EuclideanDistanceMeasure@1997c1d8 t1: 80.0 t2: 55.0
 12/03/06 00:51:11 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
 12/03/06 00:51:12 INFO input.FileInputFormat: Total input paths to process : 1
 12/03/06 00:51:13 INFO mapred.JobClient: Running job: job_201203052259_0002
 12/03/06 00:51:14 INFO mapred.JobClient: map 0% reduce 0%
 12/03/06 00:51:58 INFO mapred.JobClient: map 100% reduce 0%
 12/03/06 00:52:16 INFO mapred.JobClient: map 100% reduce 100%
 12/03/06 00:52:27 INFO mapred.JobClient: Job complete: job_201203052259_0002
 12/03/06 00:52:27 INFO mapred.JobClient: Counters: 25
 12/03/06 00:52:27 INFO mapred.JobClient: Job Counters
 12/03/06 00:52:27 INFO mapred.JobClient: Launched reduce tasks=1
 12/03/06 00:52:27 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=30345
 12/03/06 00:52:27 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
 12/03/06 00:52:27 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
 12/03/06 00:52:27 INFO mapred.JobClient: Launched map tasks=1
 12/03/06 00:52:27 INFO mapred.JobClient: Data-local map tasks=1
 12/03/06 00:52:27 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=15968
 12/03/06 00:52:27 INFO mapred.JobClient: File Output Format Counters
 12/03/06 00:52:27 INFO mapred.JobClient: Bytes Written=6615
 12/03/06 00:52:27 INFO mapred.JobClient: FileSystemCounters
 12/03/06 00:52:27 INFO mapred.JobClient: FILE_BYTES_READ=14296
 12/03/06 00:52:27 INFO mapred.JobClient: HDFS_BYTES_READ=335597
 12/03/06 00:52:27 INFO mapred.JobClient: FILE_BYTES_WRITTEN=73063
 12/03/06 00:52:27 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=6615
 12/03/06 00:52:27 INFO mapred.JobClient: File Input Format Counters
 12/03/06 00:52:27 INFO mapred.JobClient: Bytes Read=335470
 12/03/06 00:52:27 INFO mapred.JobClient: Map-Reduce Framework
 12/03/06 00:52:27 INFO mapred.JobClient: Reduce input groups=1
 12/03/06 00:52:27 INFO mapred.JobClient: Map output materialized bytes=13906
 12/03/06 00:52:27 INFO mapred.JobClient: Combine output records=0
 12/03/06 00:52:27 INFO mapred.JobClient: Map input records=600
 12/03/06 00:52:27 INFO mapred.JobClient: Reduce shuffle bytes=0
 12/03/06 00:52:27 INFO mapred.JobClient: Reduce output records=6
 12/03/06 00:52:27 INFO mapred.JobClient: Spilled Records=50
 12/03/06 00:52:27 INFO mapred.JobClient: Map output bytes=13800
 12/03/06 00:52:27 INFO mapred.JobClient: Combine input records=0
 12/03/06 00:52:27 INFO mapred.JobClient: Map output records=25
 12/03/06 00:52:27 INFO mapred.JobClient: SPLIT_RAW_BYTES=127
 12/03/06 00:52:27 INFO mapred.JobClient: Reduce input records=25
 12/03/06 00:52:27 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
 12/03/06 00:52:27 INFO input.FileInputFormat: Total input paths to process : 1
 12/03/06 00:52:28 INFO mapred.JobClient: Running job: job_201203052259_0003
 12/03/06 00:52:29 INFO mapred.JobClient: map 0% reduce 0%
 12/03/06 00:53:46 INFO mapred.JobClient: map 100% reduce 0%
 12/03/06 00:58:20 INFO mapred.JobClient: Job complete: job_201203052259_0003
 12/03/06 00:58:20 INFO mapred.JobClient: Counters: 16
 12/03/06 00:58:20 INFO mapred.JobClient: Job Counters
 12/03/06 00:58:20 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=30407
 12/03/06 00:58:20 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
 12/03/06 00:58:20 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
 12/03/06 00:58:20 INFO mapred.JobClient: Rack-local map tasks=1
 12/03/06 00:58:20 INFO mapred.JobClient: Launched map tasks=1
 12/03/06 00:58:20 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
 12/03/06 00:58:20 INFO mapred.JobClient: File Output Format Counters
 12/03/06 00:58:20 INFO mapred.JobClient: Bytes Written=340891
 12/03/06 00:58:20 INFO mapred.JobClient: FileSystemCounters
 12/03/06 00:58:20 INFO mapred.JobClient: FILE_BYTES_READ=130
 12/03/06 00:58:21 INFO mapred.JobClient: HDFS_BYTES_READ=342212
 12/03/06 00:58:21 INFO mapred.JobClient: FILE_BYTES_WRITTEN=22251
 12/03/06 00:58:21 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=340891
 12/03/06 00:58:21 INFO mapred.JobClient: File Input Format Counters
 12/03/06 00:58:21 INFO mapred.JobClient: Bytes Read=335470
 12/03/06 00:58:21 INFO mapred.JobClient: Map-Reduce Framework
 12/03/06 00:58:21 INFO mapred.JobClient: Map input records=600
 12/03/06 00:58:21 INFO mapred.JobClient: Spilled Records=0
 12/03/06 00:58:21 INFO mapred.JobClient: Map output records=600
 12/03/06 00:58:21 INFO mapred.JobClient: SPLIT_RAW_BYTES=127
 C-0{n=21 c=[29.552, 33.073, 35.876, 36.375, 35.118, 32.761, 29.566, 26.983, 25.272, 24.967, 25.691, 28.252, 30.994, 33.088, 34.015, 34.349, 32.826, 31.053, 29.116, 27.975, 27.879, 28.103, 28.775, 30.585, 31.049, 31.652, 31.956, 31.278, 30.719, 29.901, 29.545, 30.207, 30.672, 31.366, 31.032, 31.567, 30.610, 30.204, 29.266, 29.753, 29.296, 29.930, 31.207, 31.191, 31.474, 32.154, 31.746, 30.771, 30.250, 29.807, 29.543, 29.397, 29.838, 30.489, 30.705, 31.503, 31.360, 30.827, 30.426, 30.399] r=[0.979, 3.352, 5.334, 5.851, 4.868, 3.000, 3.376, 4.812, 5.159, 5.596, 4.940, 4.793, 5.415, 5.014, 5.155, 4.262, 4.891, 5.475, 6.626, 5.691, 5.240, 4.385, 5.767, 7.035, 6.238, 6.349, 5.587, 6.006, 6.282, 7.483, 6.872, 6.952, 7.374, 8.077, 8.676, 8.636, 8.697, 9.066, 9.835, 10.148, 10.091, 10.175, 9.929, 10.241, 9.824, 10.128, 10.595, 9.799, 10.306, 10.036, 10.069, 10.058, 10.008, 10.335, 10.160, 10.249, 10.222, 10.081, 10.274, 10.145]}
 Weight: Point:
 ……...
 ……..
 …….
 
 1.0: [27.414, 25.397, 26.460, 31.978, 26.125, 27.463, 30.489, 34.929, 27.558, 30.686, 27.511, 32.269, 32.834, 27.129, 24.991, 32.610, 25.387, 32.674, 34.607, 33.519, 29.012, 28.705, 32.116, 29.121, 26.424, 33.452, 33.623, 29.457, 35.025, 26.607, 34.442, 34.847, 28.897, 34.439, 32.011, 34.816, 27.773, 11.549, 20.219, 19.678, 14.715, 14.384, 15.556, 9.573, 10.636, 16.639, 17.236, 19.643, 18.317, 15.323, 19.106, 11.455, 16.888, 18.269, 11.583, 1
 12/03/06 00:58:24 INFO driver.MahoutDriver: Program took 493470 ms
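
The log above shows the canopy job running with a Euclidean distance measure and thresholds t1: 80.0 and t2: 55.0. To illustrate what those two thresholds do, here is a minimal single-machine sketch of the classic canopy technique. It is not Mahout's MapReduce implementation, which may differ in details; the function names are hypothetical and chosen only for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_clusters(points, t1, t2, distance=euclidean):
    """Classic canopy clustering sketch (illustrative, not Mahout's code).

    t1 is the loose threshold: points closer than t1 to a canopy center
    are members of that canopy.
    t2 is the tight threshold: points closer than t2 to a canopy center
    are removed from the candidate list and cannot start a new canopy.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # the next unclaimed point starts a canopy
        members = [center]
        still_remaining = []
        for point in remaining:
            d = distance(center, point)
            if d < t1:
                members.append(point)
            if d >= t2:
                still_remaining.append(point)  # may still seed or join another canopy
        remaining = still_remaining
        canopies.append((center, members))
    return canopies
```

With the thresholds from the log, a call would look like canopy_clusters(rows, t1=80.0, t2=55.0), where rows is a list of 60-value series. Because points between t2 and t1 stay in the candidate list, a point can belong to more than one canopy.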
 

After the Mahout job was completed the output was stored as below:

js> #ls
Found 3 items
drwxr-xr-x   - avkash supergroup          0 2012-03-06 01:05 /user/avkash/.oink
drwxr-xr-x   - avkash supergroup          0 2012-03-06 00:52 /user/avkash/output
drwxr-xr-x   - avkash supergroup          0 2012-03-06 00:49 /user/avkash/testdata

js> #ls /user/avkash/output
Found 3 items
drwxr-xr-x   - avkash supergroup          0 2012-03-06 00:53 /user/avkash/output/clusteredPoints
drwxr-xr-x   - avkash supergroup          0 2012-03-06 00:52 /user/avkash/output/clusters-0
drwxr-xr-x   - avkash supergroup          0 2012-03-06 00:51 /user/avkash/output/data

 

Now let’s analyze the Mahout cluster output using the clusterdump utility:

 

The clusterdump utility takes 3 parameters:

  1. --seqFileDir – the path to the folder holding the cluster sequence files (in this case output/clusters-0)
  2. --pointsDir – the path to the folder holding the clustered points (in this case output/clusteredPoints)
  3. --output – the path where you want the analysis result to be created.
    1. Note that this parameter writes the analysis result as a text file on the local machine, not on HDFS
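
If you want to drive clusterdump from a script rather than typing it by hand, the three parameters above can be assembled into an argument list. This is only a sketch: the helper name clusterdump_command is hypothetical, and it assumes the mahout launcher is on the PATH of the Hadoop command line window.

```python
def clusterdump_command(seq_file_dir, points_dir, output, executable="mahout"):
    """Build the argument list for the clusterdump utility.

    Only the three flags described in this post are used; `executable`
    is an assumption about how the launcher is invoked on your node.
    """
    return [
        executable, "clusterdump",
        "--seqFileDir", seq_file_dir,
        "--pointsDir", points_dir,
        "--output", output,
    ]


command = clusterdump_command(
    "output\\clusters-0", "output\\clusteredPoints", "clusteranalyze.txt"
)
# Show the command line that would be run.
print(" ".join(command))
```

The resulting list can be handed to subprocess.run(command, check=True) from the folder that contains the mahout launcher.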

Run the command as below:

 c:\Apps\dist\mahout\examples\bin>mahout clusterdump --seqFileDir output\clusters-0 --pointsDir output\clusteredPoints --output clusteranalyze.txt
 
 "Running on hadoop, using HADOOP_HOME=c:\Apps\dist"
 c:\Apps\dist\bin\hadoop jar c:\Apps\dist\mahout\mahout-examples-0.5-job.jar org.apache.mahout.driver.MahoutDriver clusterdump --seqFileDir output\clusters-0 --pointsDir output\clusteredPoints --output clusteranalyze.txt
 12/03/06 21:05:53 WARN driver.MahoutDriver: No clusterdump.props found on classpath, will use command-line arguments only
 12/03/06 21:05:53 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=clusteranalyze.txt, --pointsDir=output\clusteredPoints, --seqFileDir=output\clusters-0, --startPhase=0, --tempDir=temp}
 12/03/06 21:05:55 INFO driver.MahoutDriver: Program took 2031 ms
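
To work with the dump programmatically, the cluster summary lines can be parsed with a small script. The sketch below assumes the text file uses the same line format as the canopy console output shown earlier, i.e. C-0{n=21 c=[...] r=[...]} with dense, comma-separated vectors; that is an assumption about the file, and the names here are hypothetical.

```python
import re

# Matches lines such as: C-0{n=21 c=[29.552, 33.073] r=[0.979, 3.352]}
CLUSTER_LINE = re.compile(
    r"^(?P<name>[^{\s]+)\{n=(?P<n>\d+) "
    r"c=\[(?P<center>[^\]]*)\] r=\[(?P<radius>[^\]]*)\]\}$"
)


def _floats(text):
    """Convert a comma-separated string of numbers into a list of floats."""
    return [float(value) for value in text.split(",") if value.strip()]


def parse_cluster_line(line):
    """Return a dict for a cluster summary line, or None for any other line."""
    match = CLUSTER_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "name": match.group("name"),
        "n": int(match.group("n")),            # number of points in the cluster
        "center": _floats(match.group("center")),
        "radius": _floats(match.group("radius")),
    }
```

Applying parse_cluster_line to each line of clusteranalyze.txt and keeping the results that are not None would give one record per cluster; the "Weight: Point:" header and the point lines are skipped.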

 

Now if you open the folder on your machine, you will find “clusteranalyze.txt” as below:

 

Opening clusteranalyze.txt shows the data as below:

 

Cluster Dumper Reference: