Compile and build a specific Hadoop source code branch using an Azure VM


Sometimes you may want to test a Hadoop feature that exists only in a specific source branch and has not yet shipped in a binary release. For example, I want to try accessing Azure Data Lake Store (ADLS) via its WebHDFS endpoint. Access to ADLS requires OAuth2, and OAuth2 support for WebHDFS was added in Hadoop 2.8 (HDFS-8155), so it is not available in the current Hadoop 2.7.x releases.

The Hadoop source code is available in the mirrored GitHub repo https://github.com/apache/hadoop. The code for version 2.8 lives in the branch appropriately called "branch-2.8".

Deploy Azure VM with Ubuntu 14.04-LTS

As described in the building instructions for Hadoop, "the easiest way to get an environment with all the appropriate tools is by means of the provided Docker config" (for Linux or Mac). Since my primary laptop runs Windows 10, I will deploy an Ubuntu 14.04 LTS virtual machine in my Azure subscription, use it to build the Hadoop 2.8 binary tar.gz file, download the resulting file, and delete the VM once I am done.

I am using a VM of size Standard_DS2, created from the Canonical Ubuntu 14.04 LTS Azure gallery image https://portal.azure.com/#create/Canonical.UbuntuServer1404LTS-ARM
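
For readers who would rather script the deployment than click through the portal, the following is a rough sketch using the Azure CLI. The resource group name, VM name, region, and image URN are assumptions made for illustration and are not taken from the deployment described here; the image in particular may no longer be offered, so check what `az vm image list` reports first. With DRY_RUN=1 the commands are only printed.

```shell
# Hypothetical names and region; adjust them to your own subscription.
RG="${RG:-hadoop-build-rg}"
VM="${VM:-hadoop-build-vm}"
LOCATION="${LOCATION:-eastus}"
# Assumed image URN for Ubuntu 14.04 LTS; verify that it still exists.
IMAGE="${IMAGE:-Canonical:UbuntuServer:14.04.5-LTS:latest}"

# Run the given command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create a dedicated resource group and the build VM inside it.
create_build_vm() {
  run az group create --name "$RG" --location "$LOCATION" &&
  run az vm create --resource-group "$RG" --name "$VM" \
    --image "$IMAGE" --size Standard_DS2 \
    --admin-username azureuser --generate-ssh-keys
}

# Deleting the whole resource group removes the VM together with
# its disk, network interface and public IP.
delete_build_vm() {
  run az group delete --name "$RG" --yes --no-wait
}
```

Putting the VM in its own resource group is a design choice of this sketch: it makes the cleanup at the end a single command.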

Install Docker on Ubuntu 14.04

After the VM is deployed, I SSH into it using its public IP and quickly install Docker following the instructions for Ubuntu 14.04 from https://docs.docker.com/engine/installation/linux/ubuntulinux/

sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates
sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
echo "deb https://apt.dockerproject.org/repo ubuntu-trusty main" | sudo tee --append /etc/apt/sources.list.d/docker.list
sudo apt-get update
sudo apt-get purge lxc-docker
apt-cache policy docker-engine
sudo apt-get install linux-image-extra-$(uname -r)
sudo apt-get install docker-engine
sudo service docker start
sudo docker run hello-world

By default, I am not able to run "docker run hello-world" under my user account (i.e. azureuser) without sudo. When I try, I get the message "docker: Cannot connect to the Docker daemon. Is the docker daemon running on this host?" This happens because the Docker daemon's Unix socket is owned by root by default, so other users can reach it only with sudo.

To let azureuser run docker without sudo, I follow the instructions from Docker: create a group called "docker", add my user to that group, log out, log back in, and try docker run again.

sudo groupadd docker
sudo usermod -aG docker `whoami`
logout

After logging back in, I now can run "docker run hello-world" without problems.
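
The group change can also be checked from a script. Here is a minimal sketch; the helper names `has_group` and `can_use_docker` are made up for illustration. The first tests whether a group name appears in a space-separated list such as the output of `id -nG`, and the second relies on `docker info` exiting with a non-zero status when the daemon cannot be reached.

```shell
# Succeeds if group $1 appears in the space-separated list $2.
has_group() {
  wanted="$1"
  for g in $2; do
    if [ "$g" = "$wanted" ]; then
      return 0
    fi
  done
  return 1
}

# Succeeds if the current user can talk to the Docker daemon without sudo.
can_use_docker() {
  docker info >/dev/null 2>&1
}

# Example (commented out so that pasting the block has no side effects):
#   has_group docker "$(id -nG)" && can_use_docker && echo "docker works without sudo"
```

If `has_group` succeeds but `can_use_docker` still fails, the most likely explanation is that the session predates the group change, which is why logging out and back in is needed.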

Clone Hadoop 2.8 Branch

Since I want to compile the branch called "branch-2.8" specifically, I use Git to clone only that branch into a directory under my home directory (/home/azureuser/hadoop-2.8) with this command:

git clone -b branch-2.8 --single-branch https://github.com/apache/hadoop.git hadoop-2.8
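
Because the same command works for any branch, it can be wrapped in a small helper. This is only a sketch: the function name `clone_hadoop_branch` and the default directory naming rule (strip a leading "branch-" and prefix "hadoop-") are conventions invented here, not part of Hadoop or Git.

```shell
HADOOP_REPO="https://github.com/apache/hadoop.git"

# Clone a single branch of the Hadoop repo.
#   $1  branch name, e.g. branch-2.8 or trunk
#   $2  optional target directory (default: hadoop-<branch without "branch-">)
clone_hadoop_branch() {
  branch="${1:-}"
  if [ -z "$branch" ]; then
    echo "usage: clone_hadoop_branch BRANCH [DIR]" >&2
    return 2
  fi
  dest="${2:-hadoop-${branch#branch-}}"
  git clone -b "$branch" --single-branch "$HADOOP_REPO" "$dest"
}

# Example (commented out so that pasting the block does not start a clone):
#   clone_hadoop_branch branch-2.8    # clones into ./hadoop-2.8
```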

Start Docker Container with Hadoop Build Environment

Following instructions from https://github.com/apache/hadoop/blob/trunk/BUILDING.txt, I start the Hadoop build environment using the provided script:

cd hadoop-2.8/
./start-build-env.sh

This process will take some time (~5-10 min) since it installs all of the required build environment tools (JDK, Maven, etc.) in the container.

Building Hadoop within the Docker Container

After the creation process is finished, I see my Hadoop Dev docker container running.

I try to start the Maven binary distribution build without native code, without running the tests, and without documentation.

mvn package -Pdist -DskipTests -Dtar

Resolving Permissions Error

However, I get a permissions error regarding the /home/azureuser/.m2 directory (used by Maven).

To fix this problem, I exit the docker container and set the ownership of the /home/azureuser/.m2 directory to azureuser:azureuser.

sudo chown azureuser:azureuser ~/.m2
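
One plausible cause, not confirmed here, is that the directory did not exist on the host when the container first mounted it, so it was created with root ownership. Under that assumption, creating the directory as the regular user before running start-build-env.sh may avoid the error altogether. A small sketch (the function name `ensure_m2` is hypothetical):

```shell
# Create the Maven directory as the current user and confirm it is writable.
#   $1  optional path (default: $HOME/.m2)
ensure_m2() {
  m2="${1:-$HOME/.m2}"
  mkdir -p "$m2" || return 1
  if [ ! -w "$m2" ]; then
    echo "not writable: $m2 (fix with: sudo chown $(id -un):$(id -gn) $m2)" >&2
    return 1
  fi
  echo "ok: $m2"
}

# Example (commented out so that pasting the block has no side effects):
#   ensure_m2 && ./start-build-env.sh
```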

Restarting Container and Starting Maven Build

After the permission problem is resolved, I restart the docker container:

cd hadoop-2.8/
./start-build-env.sh

Once within the container, I again try to start the Maven build and package:

mvn package -Pdist -DskipTests -Dtar
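
For reference, the pieces of this command map onto the build options mentioned earlier (no native code, no tests run, no documentation) roughly as annotated below. The wrapper function `build_hadoop_dist` is a hypothetical convenience; the profile and property names come from BUILDING.txt, which also describes adding the "native" and "docs" profiles for a fuller distribution.

```shell
# package      run the Maven lifecycle up to the "package" phase
# -Pdist       activate the "dist" profile that assembles the distribution
# -DskipTests  compile the tests but do not run them
# -Dtar        also produce the .tar.gz archive
#
#   $1  optional extra profiles, e.g. "native" or "native,docs"
build_hadoop_dist() {
  profiles="dist"
  if [ -n "${1:-}" ]; then
    profiles="dist,$1"
  fi
  mvn package "-P$profiles" -DskipTests -Dtar
}

# Examples (commented out so that pasting the block does not start a build):
#   build_hadoop_dist                # same command as above
#   build_hadoop_dist native,docs    # with native code and documentation
```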

This process will take some time to complete. For me, on the Standard_DS2 Azure VM, it took about 9 minutes.

Download Binary Distribution File

After the build process is complete, the resultant files are found in the hadoop-dist/target directory.

I download the hadoop-2.8.0-SNAPSHOT.tar.gz file (about 200 MB) from the Ubuntu Azure VM to my local machine (e.g. using WinSCP or MobaXterm SFTP).
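
Before copying a file of this size, it can help to confirm the exact file name and record a checksum to compare after the transfer. A minimal sketch, assuming GNU coreutils `sha256sum` is available on the VM; the function name `find_dist_tarball` is hypothetical:

```shell
# Print every hadoop-*.tar.gz directly inside the given directory.
# Returns non-zero when no such file exists.
#   $1  optional directory (default: hadoop-dist/target)
find_dist_tarball() {
  dir="${1:-hadoop-dist/target}"
  found=1
  for f in "$dir"/hadoop-*.tar.gz; do
    if [ -f "$f" ]; then
      echo "$f"
      found=0
    fi
  done
  return "$found"
}

# Example, run from the source root on the VM
# (commented out so that pasting the block has no side effects):
#   sha256sum "$(find_dist_tarball)"
```

On a Windows machine, the downloaded copy can be hashed with PowerShell's `Get-FileHash`, which uses SHA-256 by default, and the two values compared.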

I also store this file as a block blob in an Azure Storage container so that I can quickly download it from there without rebuilding (https://avdatarepo1.blob.core.windows.net:443/hadoop/hadoop-2.8.0-SNAPSHOT.tar.gz).
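
The upload can be scripted as well. The sketch below uses the Azure CLI and assumes the CLI is already authenticated for the storage account; the function name `upload_tarball` is hypothetical, and the account and container are passed in as arguments rather than hard-coded.

```shell
# Upload a file as a block blob; the blob keeps the file's base name.
#   $1  storage account name
#   $2  container name
#   $3  path to the local file
upload_tarball() {
  account="${1:-}"
  container="${2:-}"
  file="${3:-}"
  if [ -z "$account" ] || [ -z "$container" ] || [ -z "$file" ]; then
    echo "usage: upload_tarball ACCOUNT CONTAINER FILE" >&2
    return 2
  fi
  az storage blob upload \
    --account-name "$account" \
    --container-name "$container" \
    --name "$(basename "$file")" \
    --file "$file" \
    --type block
}

# Example (commented out so that pasting the block does not upload anything):
#   upload_tarball <your-account> hadoop hadoop-dist/target/hadoop-2.8.0-SNAPSHOT.tar.gz
```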

Once I have the binary distribution file ready, I can go ahead and delete my Azure VM.

Conclusion

It is convenient and quick to use an Azure VM running Ubuntu 14.04-LTS and Docker to set up a temporary Hadoop build environment. Although in this case I specifically built the "branch-2.8" branch, the same process can be used to build other Hadoop branches (or trunk) from source.

I’m looking forward to your feedback and questions via Twitter https://twitter.com/ArsenVlad
