Hadoop adventures with Microsoft HDInsight

What is HDInsight? HDInsight is the product name for Microsoft's installation of Hadoop and its Hadoop-on-Azure service. HDInsight is Microsoft's 100% Apache-compatible Hadoop distribution, supported by Microsoft. HDInsight, available both on Windows Server and as a Windows Azure service, empowers organizations with new insights into previously untouched unstructured data, while connecting to the…

Programmatically retrieving Task ID and Unique Reducer ID in MapReduce

For each Mapper and Reducer you can get both the task attempt ID and the task ID. This can be done when you set up your map using the Context object. You may also know that when setting up a Reducer, a unique reducer ID is used inside the reducer class's setup method. You can get this ID…
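A minimal sketch of this, assuming the new `org.apache.hadoop.mapreduce` API (the mapper class name here is illustrative, not from the post):

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.TaskAttemptID;

public class IdAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context) {
        // Full attempt ID, e.g. attempt_201204171714_0001_m_000000_0
        TaskAttemptID attemptId = context.getTaskAttemptID();
        // Task ID without the attempt suffix, e.g. task_201204171714_0001_m_000000
        String taskId = attemptId.getTaskID().toString();
        System.err.println("attempt=" + attemptId + " task=" + taskId);
    }
}
```

The same `getTaskAttemptID()` call is available in a reducer's `setup(Context)` method, since both contexts extend `TaskAttemptContext`.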

Programmatically setting the number of reducers for a MapReduce job in a Hadoop cluster

When submitting a Map/Reduce job to a Hadoop cluster, you can provide the number of map tasks for the job, and the number of reducers created depends on the mappers' input and the Hadoop cluster's capacity. Alternatively, you can simply submit the job and the Map/Reduce framework will adjust it per the cluster configuration. So setting the total number…
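A minimal driver sketch showing the explicit route (the job name is a placeholder; `new Job(conf, name)` is the Hadoop 1.x-era constructor):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word-count");
        job.setNumReduceTasks(4);   // exactly 4 reduce tasks, regardless of input size
        System.out.println(job.getNumReduceTasks()); // 4
        // ... set mapper/reducer/input/output, then submit the job ...
    }
}
```

The equivalent from the command line is passing `-D mapred.reduce.tasks=4` to a job that uses `ToolRunner`, since `setNumReduceTasks` just sets that same property.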

Processing already sorted data with Hadoop Map/Reduce jobs without performance overhead

While working with Map/Reduce jobs in Hadoop, it is quite possible that you already have “sorted data” stored in HDFS. As you may know, the sort step runs not only after the map process in a map task but also in the merge process during a reduce task, so re-sorting data that is already sorted would be a…
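One common way to sidestep the sort/merge machinery entirely — an assumption on my part, not necessarily the approach the post goes on to describe — is to run a map-only job: with zero reduce tasks there is no shuffle, sort, or merge phase at all, and mapper output is written straight to HDFS.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "map-only-pass");
        // Zero reducers => no shuffle/sort/merge; records keep their input order
        job.setNumReduceTasks(0);
        // ... set mapper/input/output, then submit ...
    }
}
```

This only helps, of course, when the job doesn't actually need to group records by key.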

How to submit Hadoop Map/Reduce jobs in multiple command shells to run in parallel

Sometimes you need to run multiple Map/Reduce jobs on the same Hadoop cluster, but opening several Hadoop command shells (Hadoop terminals) can be troublesome. Note that, depending on your Hadoop cluster's size and configuration, you can run only a limited number of Map/Reduce jobs in parallel; however, if you need to do so, here is…
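An alternative to juggling multiple shells — a sketch, not necessarily the trick the post describes — is to submit several jobs from a single driver: `Job.submit()` returns immediately, unlike `job.waitForCompletion(true)`, so the cluster can schedule the jobs side by side.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ParallelSubmitDriver {
    public static void main(String[] args) throws Exception {
        List<Job> jobs = new ArrayList<Job>();
        for (int i = 0; i < 3; i++) {
            Job job = new Job(new Configuration(), "parallel-job-" + i);
            // ... set mapper/reducer/input/output per job here ...
            job.submit();        // non-blocking submission
            jobs.add(job);
        }
        for (Job job : jobs) {   // then wait for all of them to finish
            while (!job.isComplete()) Thread.sleep(5000);
        }
    }
}
```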

Listing currently running Hadoop Jobs and Killing running Jobs

When you have jobs running in Hadoop, you can use the Map/Reduce web view to list the currently running jobs. But what if you need to kill a running job because it has started malfunctioning or, in the worst case, is stuck in an infinite loop? I have seen several scenarios…
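From the shell this is `hadoop job -list` and `hadoop job -kill <job_id>`; the programmatic equivalent can be sketched with the classic `org.apache.hadoop.mapred` client API (the job ID below is a made-up example):

```java
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.RunningJob;

public class JobKiller {
    public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        for (JobStatus status : client.getAllJobs()) {     // list every known job
            System.out.println(status.getJobID() + " state=" + status.getRunState());
        }
        // Kill one job by its ID
        RunningJob job = client.getJob(JobID.forName("job_201204171714_0001"));
        if (job != null && !job.isComplete()) {
            job.killJob();
        }
    }
}
```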

How to troubleshoot MapReduce jobs in Hadoop

When writing MapReduce programs you are definitely going to hit problems such as infinite loops, crashes in MapReduce, incomplete jobs, etc. Here are a few things which will help you isolate these problems:   Map/Reduce log files: All MapReduce job activity is logged by default in Hadoop. By default, log files are…

How to chain multiple MapReduce jobs in Hadoop

When running MapReduce jobs it is possible to have several MapReduce steps in the overall job scenario, meaning the last reduce output is used as input for the next map job: Map1 -> Reduce1 -> Map2 -> Reduce2 -> Map3… While searching for an answer for my own MapReduce job, I stumbled upon several cool new…
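The simplest form of this chain can be sketched as two `Job` objects run back to back, with the first job's output directory fed to the second as input (paths and job names here are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Path input = new Path("/data/raw");
        Path intermediate = new Path("/data/step1");
        Path output = new Path("/data/final");

        Job job1 = new Job(new Configuration(), "step-1");
        FileInputFormat.addInputPath(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);
        if (!job1.waitForCompletion(true)) System.exit(1); // stop the chain on failure

        Job job2 = new Job(new Configuration(), "step-2");
        FileInputFormat.addInputPath(job2, intermediate);  // Reduce1 output -> Map2 input
        FileOutputFormat.setOutputPath(job2, output);
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}
```

Checking each `waitForCompletion` result matters: running step 2 over a half-written intermediate directory after a failed step 1 is a classic source of silent bad output.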

How to wipe out the DFS in Hadoop?

If you format only the Namenode, it will remove the metadata stored by the Namenode; however, all the temporary storage and Datanode blocks will still be there. To remove temporary storage and all the Datanode blocks, you need to delete the main Hadoop storage directory from every node. This directory is defined by the hadoop.tmp.dir…

Running Apache Mahout at Hadoop on Windows Azure (www.hadooponazure.com)

Once you have access to Hadoop on Windows Azure enabled, you can run any Mahout sample on the head node. I am just trying to run an original Apache Mahout (http://mahout.apache.org/) sample, derived from the clustering sample on Mahout's website (https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data). Step 1: RDP to your head node and open the Hadoop command line…
