Keys to understand relationship between MapReduce and HDFS

Map Task (HDFS data localization):

The unit of input for a map task is an HDFS data block of the input file. The map task functions most efficiently if the data block it has to process is available locally on the node on which the task is scheduled. This approach is called HDFS data localization.

 

An HDFS data locality miss occurs if the data needed by the map task is not available locally. In such a case, the map task will request the data from another node in the cluster: an operation that is expensive and time consuming, leading to inefficiencies and, hence, delay in job completion.

 

Clients, Data Nodes, and HDFS Storage:

Input data is uploaded to the HDFS file system in either of following two ways:

  1. An HDFS client has a large amount of data to place into HDFS.
  2. An HDFS client is constantly streaming data into HDFS.

 

Both these scenarios have the same interaction with HDFS, except that in the streaming case, the client waits for enough data to fill a data block before writing to HDFS. Data is stored in HDFS in large blocks, generally 64 to 128 MB or more in size. This storage approach allows easy parallel processing of data.

 

HDFS-SITE.XML

<property>

<name>dfs.block.size</name>

<value>134217728</value> ç 128MB Block size

</property>

 

OR

 

<property>

<name>dfs.block.size</name>

<value>67108864</value> ç 64MB Block size (Default is this value is not set)

</property>

 

Block Replication Factor:

During the process of writing to HDFS, the blocks are generally replicated to multiple data nodes for redundancy. The number of copies, or the replication factor, is set to a default of 3 and can be modified by the cluster administrator as below:

 

HDFS-SITE.XML

<property>

<name>dfs.replication</name>

<value>3</value>

</property>

<property>

 

When the replication factor is three, HDFS’s placement policy is to:

-     Put one replica on one node in the local rack,

-     Another on a node in a different (remote) rack,

-     Last on a different node in the same remote rack.

 

When a new data block is stored on a data node, the data node initiates a replication process to replicate the data onto a second data node. The second data node, in turn, replicates the block to a third data node, completing the replication of the block.

 

With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.

 

Keywords: Hadoop, MapReduce, HDFS, Performance,