Apache Hadoop on Windows Azure: A few tips and tricks for managing your Hadoop cluster in Windows Azure

In a Hadoop cluster, the NameNode communicates with all the other nodes. Apache Hadoop on Windows Azure has the following XML files, which contain the primary settings for Hadoop:

 

C:\Apps\Dist\conf\hdfs-site.xml

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>

    <name>dfs.permissions</name>

    <value>false</value>

  </property>

  <property>

    <name>dfs.replication</name>

    <value>3</value>

  </property>

  <property>

    <name>dfs.datanode.max.xcievers</name>

    <value>4096</value>

  </property>

  <property>

    <name>dfs.name.dir</name> <!-- the NameNode data directory -->

    <value>c:\hdfs\nn</value>

  </property>

  <property>

    <name>dfs.data.dir</name> <!-- the DataNode data directory -->

    <value>c:\hdfs\dn</value>

  </property>

</configuration>
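If you would rather inspect these values from code than by opening the file, Hadoop's Configuration API can load them. Below is a minimal sketch, assuming the conf path shown above; the class name ShowHdfsSettings is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ShowHdfsSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Pull in the file shown above explicitly so the dfs.* keys
        // resolve even outside a Hadoop daemon process.
        conf.addResource(new Path("C:/Apps/Dist/conf/hdfs-site.xml"));
        System.out.println("dfs.replication = " + conf.get("dfs.replication"));
        System.out.println("dfs.name.dir    = " + conf.get("dfs.name.dir"));
        System.out.println("dfs.data.dir    = " + conf.get("dfs.data.dir"));
    }
}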

 

 

C:\Apps\Dist\conf\core-site.xml

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>

    <name>hadoop.tmp.dir</name>

    <value>/hdfs/tmp</value>

    <description>A base for other temporary directories.</description>

  </property>

  <property>

    <name>fs.default.name</name>

    <value>hdfs://10.26.104.45:9000</value> <!-- after the role starts, the VM's assigned IP address is filled in here -->

  </property>

  <property>

    <name>io.file.buffer.size</name>

    <value>131072</value>

  </property>

</configuration>
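The fs.default.name value is the address client code uses to reach the NameNode. As a quick connectivity check, the sketch below lists the HDFS root directory; substitute the IP address your own role was assigned (the class name ListHdfsRoot is illustrative):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsRoot {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same URI as fs.default.name in core-site.xml above.
        FileSystem fs = FileSystem.get(URI.create("hdfs://10.26.104.45:9000"), conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}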

 

C:\Apps\Dist\conf\mapred-site.xml:

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>

    <name>mapred.job.tracker</name>

    <value>10.26.104.45:9010</value>

  </property>

  <property>

    <name>mapred.local.dir</name>

    <value>/hdfs/mapred/local</value>

  </property>

  <property>

    <name>mapred.tasktracker.map.tasks.maximum</name>

    <value>2</value>

  </property>

  <property>

    <name>mapred.tasktracker.reduce.tasks.maximum</name>

    <value>1</value>

  </property>

  <property>

    <name>mapred.child.java.opts</name>

    <value>-Xmx1024m</value>

  </property>

  <property>

    <name>mapreduce.client.tasklog.timeout</name>

    <value>6000000</value>

  </property>

  <property>

    <name>mapred.task.timeout</name>

    <value>6000000</value>

  </property>

  <property>

    <name>mapreduce.reduce.shuffle.connect.timeout</name>

    <value>600000</value>

  </property>

  <property>

    <name>mapreduce.reduce.shuffle.read.timeout</name>

    <value>600000</value>

  </property>

</configuration>
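Most of these values can also be overridden per job instead of cluster-wide. A minimal sketch using the classic JobConf API (the class name ConfigureJob is illustrative):

import org.apache.hadoop.mapred.JobConf;

public class ConfigureJob {
    public static void main(String[] args) {
        JobConf job = new JobConf();
        // Per-job overrides of the cluster-wide defaults above.
        job.setLong("mapred.task.timeout", 6000000L);     // same value as mapred-site.xml
        job.set("mapred.child.java.opts", "-Xmx1024m");   // per-task JVM heap
        System.out.println("mapred.task.timeout = "
                + job.getLong("mapred.task.timeout", 600000L));
    }
}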

 

You can certainly make changes to the settings above; however, you would then need to restart the NameNode. For example, the command below formats the DFS filesystem (note that formatting erases existing HDFS metadata, so use it only when reinitializing the cluster):

 

  • C:\Apps\Dist> hadoop namenode -format

 

For more commands, you can check the Hadoop command-line help:

c:\apps\dist>hadoop

Usage: hadoop [--config confdir] COMMAND

where COMMAND is one of:

  namenode -format     format the DFS filesystem

  secondarynamenode    run the DFS secondary namenode

  namenode             run the DFS namenode

  datanode             run a DFS datanode

  dfsadmin             run a DFS admin client

  mradmin              run a Map-Reduce admin client

  fsck                 run a DFS filesystem checking utility

  fs                   run a generic filesystem user client

  balancer             run a cluster balancing utility

  jobtracker           run the MapReduce job Tracker node

  pipes                run a Pipes job

  tasktracker          run a MapReduce task Tracker node

  job                  manipulate MapReduce jobs

  queue                get information regarding JobQueues

  version              print the version

  jar <jar>            run a jar file

 

  distcp <srcurl> <desturl> copy file or directories recursively

  archive -archiveName NAME <src>* <dest> create a hadoop archive

  daemonlog            get/set the log level for each daemon

or

  CLASSNAME            run the class named CLASSNAME

Most commands print help when invoked w/o parameters.
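The last option, CLASSNAME, runs any class on the Hadoop classpath with the cluster configuration already loaded. A minimal sketch of such a class, using ToolRunner so it also understands the generic -D property=value options (the class name ShowConfig is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ShowConfig extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // Print the values picked up from core-site.xml.
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));
        System.out.println("hadoop.tmp.dir  = " + conf.get("hadoop.tmp.dir"));
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new ShowConfig(), args));
    }
}

Once compiled onto the classpath, it would be invoked as c:\apps\dist>hadoop ShowConfig.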

 

 

You can also modify the following configuration, which controls Java logging; however, you would then need to relaunch the Java process:

C:\Apps\Dist\conf\log4j.properties:

hadoop.log.file=hadoop.log

log4j.rootLogger=${hadoop.root.logger}, EventCounter

log4j.threshhold=ALL

 

#

# TaskLog Appender

#

 

#Default values

hadoop.tasklog.taskid=null

hadoop.tasklog.noKeepSplits=4

hadoop.tasklog.totalLogFileSize=100

hadoop.tasklog.purgeLogSplits=true

hadoop.tasklog.logsRetainHours=12

 

log4j.appender.TLA=org.apache.hadoop.mapred.TaskLogAppender

log4j.appender.TLA.taskId=${hadoop.tasklog.taskid}

log4j.appender.TLA.totalLogFileSize=${hadoop.tasklog.totalLogFileSize}

 

log4j.appender.TLA.layout=org.apache.log4j.PatternLayout

log4j.appender.TLA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n

 

# FSNamesystem Audit logging

log4j.logger.org.apache.hadoop.fs.FSNamesystem.audit=WARN

 

# Custom Logging levels

#log4j.logger.org.apache.hadoop.mapred.JobTracker=DEBUG

#log4j.logger.org.apache.hadoop.mapred.TaskTracker=DEBUG

#log4j.logger.org.apache.hadoop.fs.FSNamesystem=DEBUG

 

# Jets3t library

log4j.logger.org.jets3t.service.impl.rest.httpclient.RestS3Service=ERROR

 

# Event Counter Appender

# Sends counts of logging messages at different severity levels to Hadoop Metrics.

log4j.appender.EventCounter=org.apache.hadoop.log.metrics.EventCounter
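These properties govern the log4j output of every Hadoop daemon and task. Your own code running on the cluster can write into the same logs through Commons Logging, which Hadoop uses internally; a minimal sketch (the class name LoggingExample is illustrative):

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class LoggingExample {
    private static final Log LOG = LogFactory.getLog(LoggingExample.class);

    public static void main(String[] args) {
        LOG.info("routed through the appenders configured in log4j.properties");
        LOG.debug("emitted only if this logger's level is DEBUG");
        // Uncommenting the "Custom Logging levels" lines above is how you
        // would enable DEBUG output for specific Hadoop daemons.
    }
}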

 

Resources:

https://hadoop.apache.org/common/docs/current/cluster_setup.html

https://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/