Implementing Hadoop Rack Awareness with PowerShell

This post walks through building a PowerShell script for enabling Rack Awareness in Hadoop. While several example scripts can be found online for Linux, samples for Windows are less common.

Hadoop divides data into multiple file blocks and stores them on different machines. By default all machines are deemed to be on the same rack, so if rack awareness is not configured there is the possibility that all replicated copies of a block end up in the same physical rack. Although rare, this could result in data loss when that rack fails. This can be avoided by explicitly configuring Rack Awareness.

Rack Awareness is implemented using the property “topology.script.file.name” in the core-site.xml configuration file. This property names a script that accepts a list of IP addresses and returns the corresponding list of rack identifiers.

If “topology.script.file.name” is not configured, the default value “/default-rack” is returned for any IP address; thus all nodes are considered to be on the same rack.
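The contract can be pictured with a short sketch. The function name Resolve-Rack, the sample addresses, and the rack names below are hypothetical and only illustrate the shape of the input and output: one rack identifier per address, in the same order, with the default used for anything unknown.

```powershell
# Hypothetical illustration of the topology script contract:
# N addresses in, N rack identifiers out, in the same order.
function Resolve-Rack {
    param([string[]] $Addresses)

    # Made-up address-to-rack assignments, for illustration only
    $known = @{
        "10.0.1.11" = "/rack1"
        "10.0.2.12" = "/rack2"
    }

    $racks = foreach ($address in $Addresses) {
        if ($known.ContainsKey($address)) { $known[$address] } else { "/default-rack" }
    }

    # Return the identifiers as a single space-separated line
    $racks -join " "
}

Resolve-Rack -Addresses "10.0.1.11", "10.0.9.99", "10.0.2.12"
# /rack1 /default-rack /rack2
```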

This sample implementation uses PowerShell to relate a machine name to a predefined rack identifier. One can of course implement any logic one needs. The implementation consists of three files:

  • hadoop-rack-configuration.cmd – Stub CMD file called by Hadoop
  • hadoop-rack-configuration.ps1 – PowerShell file called by the CMD stub
  • hadoop-rack-configurations.txt – Mapping file from machine names to rack identifiers

Each of these files will now be defined.

hadoop-rack-configuration.cmd

The stub CMD file calls the PS1 file as follows:

@echo off
 
set cmd_args=%*
set root_dir=%HADOOP_HOME%\bin
 
set script_path=%root_dir%\hadoop-rack-configuration.ps1
set config_path=%root_dir%\hadoop-rack-configurations.txt
 
PowerShell -NoProfile -ExecutionPolicy Bypass -Command "& '%script_path%' -config_path '%config_path%' %cmd_args%"

This file acts as a stub. It passes the IP addresses and the path of a mapping file (in this case a mapping from machine name to rack) to the PowerShell file that implements the necessary mapping logic.

The @echo off command is needed to ensure that the only output is what the PowerShell file emits through its Write-Output command.

hadoop-rack-configurations.txt

This sample maps an IP address to a rack identifier by way of the actual machine name, using a mapping file. A sample file would be:

HDP-DATA01.hdp.lab    /hdp/rack1
HDP-DATA02.hdp.lab    /hdp/rack2
HDP-DATA03.hdp.lab    /hdp/rack1
HDP-DATA04.hdp.lab    /hdp/rack2

The machine name and rack identifier are separated by a TAB. In this implementation the machines are part of the domain “hdp.lab”, and as such fully qualified names are used in the mapping.
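As a minimal sketch of how such a file can be read, the snippet below splits each line on whitespace and loads the pairs into a case-insensitive hashtable. The function name ConvertTo-RackTable is a hypothetical helper, and the in-memory sample lines stand in for the file contents.

```powershell
function ConvertTo-RackTable {
    param([string[]] $Lines)

    # Case-insensitive keys, so HDP-DATA01 and hdp-data01 resolve to the same entry
    $table = New-Object System.Collections.Hashtable -ArgumentList ([StringComparer]::InvariantCultureIgnoreCase)

    foreach ($line in $Lines) {
        $fields = $line.Trim() -split "\s+"
        # Ignore blank or incomplete lines
        if ($fields.Count -ge 2) { $table[$fields[0]] = $fields[1] }
    }

    $table
}

# Sample lines (tab-separated) with a blank line in the middle
$sample = "HDP-DATA01.hdp.lab`t/hdp/rack1", "", "HDP-DATA02.hdp.lab`t/hdp/rack2"
$table = ConvertTo-RackTable -Lines $sample
$table["hdp-data02.HDP.LAB"]
# /hdp/rack2
```

Because the split is on any run of whitespace, this sketch accepts spaces as well as TABs between the two columns.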

hadoop-rack-configuration.ps1

The PowerShell file first resolves each IP address to a machine name, then uses the mapping file to look up that machine's rack identifier. Names that are not in the mapping file get a default rack.

A sample PowerShell file would be as follows:

param([string] $config_path) # the remaining arguments (the IP addresses) arrive in $args
 
$listofIPs = $args | % {$_.Trim()};
$configurationFile = $config_path;
 
$DEFAULT_RACK = "/default/rack";
 
function Get-LookupTable {
 
    $hash = New-Object Collections.Hashtable([StringComparer]::InvariantCultureIgnoreCase);
 
    if (Test-Path $configurationFile) {
        # Skip blank or incomplete lines so that one bad line does not abort the rest of the file
        Get-Content $configurationFile | ForEach-Object {
            $split = $_.Trim() -split "\s+";
            if ($split.Count -ge 2) { $hash[$split[0]] = $split[1] }
        };
    }
 
    return $hash;
}
 
function Get-ComputerNameByIP {
    param($ipAddress = $null)
    
    begin { }
 
    process {
        if (($ipAddress -and $_) -or (!$ipAddress -and !$_)) {
            throw "Please use either pipeline or input parameter";
        } else {
            $result = $null;
            $lookup = if ($ipAddress) { $ipAddress.Trim(); } else { $_.Trim(); }
            if ($lookup -as [ipaddress]) {
                # If the reverse DNS lookup fails, fall through and return the raw input
                try { $result = [System.Net.Dns]::GetHostByAddress($lookup); } catch { }
            }
            if ($result) 
            { 
                [string]$result.HostName;
            } 
            else
            {
                $lookup;
            }
        }
    }
 
    end  { }
}
 
function Get-ComputerRack
{
    begin
    {
        $lookupTable = Get-LookupTable;  
    }
 
    process
    {
        $lookup = ($_).trim();
 
        if ($lookupTable.ContainsKey($lookup)) {
            $lookupTable[$lookup];
        } else {
            $DEFAULT_RACK;
        }
    }
 
    end { }
}
 
 
$listofNames = $listofIPs | Get-ComputerNameByIP;
$listofRacks = $listofNames | Get-ComputerRack
write-output ($listofRacks -join " ")

Of course, this approach means you can implement any mapping logic you need in the PS1 file.
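As one example of alternative logic, the sketch below derives the rack from the third octet of an IPv4 address instead of consulting a mapping file. The subnet-per-rack layout, the function name Get-RackFromSubnet, and the /hdp/rackN naming are assumptions made for illustration; they are not part of the sample above.

```powershell
function Get-RackFromSubnet {
    param([string] $Address)

    $parsed = $Address -as [ipaddress]

    # Anything that is not a valid IPv4 address falls back to the default rack
    if (-not $parsed -or $parsed.AddressFamily -ne [System.Net.Sockets.AddressFamily]::InterNetwork) {
        return "/default/rack"
    }

    # Assumed layout: each rack has its own /24, so the third octet identifies the rack
    $octets = $parsed.GetAddressBytes()
    "/hdp/rack{0}" -f $octets[2]
}

Get-RackFromSubnet "10.0.2.15"    # /hdp/rack2
Get-RackFromSubnet "not-an-ip"    # /default/rack
```

A rule like this avoids maintaining a mapping file, but it only works when the network layout really does follow the rack layout.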

Deploying

Once you have written your PowerShell implementation, the final step is to deploy it to your cluster.

As mentioned, the core-site.xml is updated to include this additional node:

<property>
  <name>topology.script.file.name</name>
  <value>hadoop-rack-configuration.cmd</value>
</property>

Finally the files need to be deployed to the cluster. One has to copy the CMD and PS1 files, along with the TXT mapping file that the CMD stub looks for in the same directory, to the %HADOOP_BIN_PATH% directory.

Once deployed, the script can be queried to find each node's non-default rack identifier.
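Before relying on the cluster to call the script, it can be worth checking its output by hand. The sketch below is a hypothetical helper (Test-TopologyOutput is not part of the sample files) that checks a line of output against the addresses that were passed in: one identifier per address, each starting with a slash.

```powershell
function Test-TopologyOutput {
    param([string[]] $Addresses, [string] $Output)

    # Split the output line into rack identifiers, dropping empty entries
    $racks = @($Output.Trim() -split "\s+" | Where-Object { $_ })

    # Identifiers that do not look like a rack path
    $invalid = @($racks | Where-Object { $_ -notmatch "^/" })

    ($racks.Count -eq $Addresses.Count) -and ($invalid.Count -eq 0)
}

Test-TopologyOutput -Addresses "10.0.1.11", "10.0.2.12" -Output "/hdp/rack1 /hdp/rack2"   # True
Test-TopologyOutput -Addresses "10.0.1.11", "10.0.2.12" -Output "/hdp/rack1"              # False
```

Against the deployed stub, the output line would come from a call such as & "$env:HADOOP_HOME\bin\hadoop-rack-configuration.cmd" 10.0.1.11 10.0.2.12, assuming HADOOP_HOME is set in the session; the addresses here are placeholders.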