Apache Hadoop on Windows Azure: A few tips and tricks for managing your Hadoop cluster in Windows Azure

In a Hadoop cluster, the NameNode communicates with all the other nodes. Apache Hadoop on Windows Azure has the following XML files, which contain the primary settings for Hadoop:

 

C:\Apps\Dist\conf\hdfs-site.xml

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>

    <name>dfs.permissions</name>

    <value>false</value>

  </property>

  <property>

    <name>dfs.replication</name>

    <value>3</value>

  </property>

  <property>

    <name>dfs.datanode.max.xcievers</name>

    <value>4096</value>

  </property>

  <property>

    <name>dfs.name.dir</name> <!-- the NameNode data directory -->

    <value>c:\hdfs\nn</value>

  </property>

  <property>

    <name>dfs.data.dir</name> <!-- the DataNode data directory -->

    <value>c:\hdfs\dn</value>

  </property>

</configuration>
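If you would rather inspect these values from code than by opening the file, Hadoop's Configuration API can load them. Below is a minimal sketch, assuming the conf path shown above; the class name ShowHdfsSettings is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ShowHdfsSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Pull in the file shown above explicitly so the dfs.* keys
        // resolve even outside a Hadoop daemon process.
        conf.addResource(new Path("C:/Apps/Dist/conf/hdfs-site.xml"));
        System.out.println("dfs.replication = " + conf.get("dfs.replication"));
        System.out.println("dfs.name.dir    = " + conf.get("dfs.name.dir"));
        System.out.println("dfs.data.dir    = " + conf.get("dfs.data.dir"));
    }
}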

 

 

C:\Apps\Dist\conf\core-site.xml

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>

    <name>hadoop.tmp.dir</name>

    <value>/hdfs/tmp</value>

    <description>A base for other temporary directories.</description>

  </property>

  <property>

    <name>fs.default.name</name>

    <value>hdfs://10.26.104.45:9000</value> <!-- after the role starts, the VM's assigned IP address is filled in here -->

  </property>

  <property>

    <name>io.file.buffer.size</name>

    <value>131072</value>

  </property>

</configuration>
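The fs.default.name value is the address client code uses to reach the NameNode. As a quick connectivity check, the sketch below lists the HDFS root directory; substitute the IP address your own role was assigned (the class name ListHdfsRoot is illustrative):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsRoot {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same URI as fs.default.name in core-site.xml above.
        FileSystem fs = FileSystem.get(URI.create("hdfs://10.26.104.45:9000"), conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}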

 

C:\Apps\Dist\conf\mapred-site.xml:

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>

    <name>mapred.job.tracker</name>

    <value>10.26.104.45:9010</value>

  </property>

  <property>

    <name>mapred.local.dir</name>

    <value>/hdfs/mapred/local</value>

  </property>

  <property>

    <name>mapred.tasktracker.map.tasks.maximum</name>

    <value>2</value>

  </property>

  <property>

    <name>mapred.tasktracker.reduce.tasks.maximum</name>

    <value>1</value>

  </property>

  <property>

    <name>mapred.child.java.opts</name>

    <value>-Xmx1024m</value>

  </property>

  <property>

    <name>mapreduce.client.tasklog.timeout</name>

    <value>6000000</value>

  </property>

  <property>

    <name>mapred.task.timeout</name>

    <value>6000000</value>

  </property>

  <property>

    <name>mapreduce.reduce.shuffle.connect.timeout</name>

    <value>600000</value>

  </property>

  <property>

    <name>mapreduce.reduce.shuffle.read.timeout</name>

    <value>600000</value>

  </property>

</configuration>
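Most of these values can also be overridden per job instead of cluster-wide. A minimal sketch using the classic JobConf API (the class name ConfigureJob is illustrative):

import org.apache.hadoop.mapred.JobConf;

public class ConfigureJob {
    public static void main(String[] args) {
        JobConf job = new JobConf();
        // Per-job overrides of the cluster-wide defaults above.
        job.setLong("mapred.task.timeout", 6000000L);     // same value as mapred-site.xml
        job.set("mapred.child.java.opts", "-Xmx1024m");   // per-task JVM heap
        System.out.println("mapred.task.timeout = "
                + job.getLong("mapred.task.timeout", 600000L));
    }
}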

 

You can certainly make changes to the settings above; however, you would then need to restart the NameNode. For example, the command below formats the DFS filesystem (note that formatting erases existing HDFS metadata, so use it only when reinitializing the cluster):

 

  • C:\Apps\Dist> hadoop namenode -format

 

For more commands, you can check the Hadoop command-line help:

c:\apps\dist>hadoop

Usage: hadoop [--config confdir] COMMAND

where COMMAND is one of:

  namenode -format     format the DFS filesystem

  secondarynamenode    run the DFS secondary namenode

  namenode             run the DFS namenode

  datanode             run a DFS datanode

  dfsadmin             run a DFS admin client

  mradmin              run a Map-Reduce admin client

  fsck                 run a DFS filesystem checking utility

  fs                   run a generic filesystem user client

  balancer             run a cluster balancing utility

  jobtracker           run the MapReduce job Tracker node

  pipes                run a Pipes job

  tasktracker          run a MapReduce task Tracker node

  job                  manipulate MapReduce jobs

  queue                get information regarding JobQueues

  version              print the version

  jar <jar>            run a jar file

 

  distcp <srcurl> <desturl> copy file or directories recursively

  archive -archiveName NAME <src>* <dest> create a hadoop archive

  daemonlog            get/set the log level for each daemon

or

  CLASSNAME            run the class named CLASSNAME

Most commands print help when invoked w/o parameters.
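The last option, CLASSNAME, runs any class on the Hadoop classpath with the cluster configuration already loaded. A minimal sketch of such a class, using ToolRunner so it also understands the generic -D property=value options (the class name ShowConfig is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ShowConfig extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // Print the values picked up from core-site.xml.
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));
        System.out.println("hadoop.tmp.dir  = " + conf.get("hadoop.tmp.dir"));
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new ShowConfig(), args));
    }
}

Once compiled onto the classpath, it would be invoked as c:\apps\dist>hadoop ShowConfig.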

 

 

You can also modify the following configuration, which controls Java logging; however, you would then need to relaunch the Java process:

C:\Apps\Dist\conf\log4j.properties:

hadoop.log.file=hadoop.log

log4j.rootLogger=${hadoop.root.logger}, EventCounter

log4j.threshhold=ALL

 

#

# TaskLog Appender

#

 

#Default values

hadoop.tasklog.taskid=null

hadoop.tasklog.noKeepSplits=4

hadoop.tasklog.totalLogFileSize=100

hadoop.tasklog.purgeLogSplits=true

hadoop.tasklog.logsRetainHours=12

 

log4j.appender.TLA=org.apache.hadoop.mapred.TaskLogAppender

log4j.appender.TLA.taskId=${hadoop.tasklog.taskid}

log4j.appender.TLA.totalLogFileSize=${hadoop.tasklog.totalLogFileSize}

 

log4j.appender.TLA.layout=org.apache.log4j.PatternLayout

log4j.appender.TLA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n

 

# FSNamesystem Audit logging

log4j.logger.org.apache.hadoop.fs.FSNamesystem.audit=WARN

 

# Custom Logging levels

#log4j.logger.org.apache.hadoop.mapred.JobTracker=DEBUG

#log4j.logger.org.apache.hadoop.mapred.TaskTracker=DEBUG

#log4j.logger.org.apache.hadoop.fs.FSNamesystem=DEBUG

 

# Jets3t library

log4j.logger.org.jets3t.service.impl.rest.httpclient.RestS3Service=ERROR

 

# Event Counter Appender

# Sends counts of logging messages at different severity levels to Hadoop Metrics.

log4j.appender.EventCounter=org.apache.hadoop.log.metrics.EventCounter
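These properties govern the log4j output of every Hadoop daemon and task. Your own code running on the cluster can write into the same logs through Commons Logging, which Hadoop uses internally; a minimal sketch (the class name LoggingExample is illustrative):

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class LoggingExample {
    private static final Log LOG = LogFactory.getLog(LoggingExample.class);

    public static void main(String[] args) {
        LOG.info("routed through the appenders configured in log4j.properties");
        LOG.debug("emitted only if this logger's level is DEBUG");
        // Uncommenting the "Custom Logging levels" lines above is how you
        // would enable DEBUG output for specific Hadoop daemons.
    }
}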

 

Resources:

https://hadoop.apache.org/common/docs/current/cluster_setup.html

https://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/