Working on Hadoop code on Windows

As a member of the HDInsight team, I worked a bit on Hadoop code on Windows and contributed a couple of JIRAs there (JIRA is the bug tracking system Apache uses; contributing code usually involves filing JIRAs and posting patches there). Even if you don't want to contribute to Hadoop (though it's fun), it's useful to be able to dive into the Hadoop code when you're running your MapReduce jobs against it, to see what's going on under the hood and maybe debug problems. Whatever your reasons may be, in this post I'll walk you through how I personally set up my development environment for working on Hadoop on Windows in PowerShell.

Preliminaries

  1. Install the JDK. At the time of writing, compiling Hadoop doesn't work (at least not easily) on JDK 8 because its Javadoc is stricter, so you'll probably be better off installing JDK 7 until that's fixed (or you can be the hero that fixes it).
  2. Install Git.
  3. Install the Microsoft Windows SDK needed to compile some of the native libraries in Hadoop. Note: if like me you had Visual Studio 2012 installed before starting this, you may need to uninstall the VS 2010 Redistributable packages for this install to succeed (you may see an obscure error message in the log about "ProductFamily" missing otherwise).
  4. Install Cygwin. I was OK just going with the default included packages. To follow along with this post smoothly, please enter "C:\HadoopTrunk\Tools\cygwin" as the install directory.
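
Before moving on, it can help to confirm that the preliminaries actually landed where the later steps expect them. Here is a minimal sketch of such a check. The function name Test-HadoopPrereqs and the list of command names (java, javac, git) are my own assumptions for illustration and not part of the Hadoop-Dev.ps1 script; it also doesn't check the Windows SDK or the JDK version.

```powershell
# Sketch: report whether each prerequisite looks installed.
# Assumptions: java, javac and git are expected on PATH, and Cygwin
# lives in the directory suggested in the list above.
function Test-HadoopPrereqs {
    param(
        [string[]] $Commands = @('java', 'javac', 'git'),
        [string] $CygwinDir = 'C:\HadoopTrunk\Tools\cygwin'
    )
    foreach ($name in $Commands) {
        # Get-Command returns nothing (instead of failing) when the tool is missing.
        $cmd = Get-Command $name -ErrorAction SilentlyContinue | Select-Object -First 1
        $location = if ($cmd) { $cmd.Definition } else { $null }
        New-Object psobject -Property @{
            Item     = $name
            Found    = [bool]$cmd
            Location = $location
        }
    }
    New-Object psobject -Property @{
        Item     = 'cygwin'
        Found    = (Test-Path $CygwinDir)
        Location = $CygwinDir
    }
}

Test-HadoopPrereqs | Format-Table Item, Found, Location -AutoSize
```

Any row with Found set to False points at the item to revisit before continuing.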

The juicy stuff

Once the preliminaries have (hopefully) gone off without a hitch, we get to the slightly more exciting parts. We'll mostly be using a PowerShell script I wrote and posted on PoshCode here. The script is meant to be modified and has a lot of hard-coded assumptions, so you may want to get familiar with it, but if you're lucky you may be able to just follow along below without looking at it.

  1. Create the directory "C:\HadoopTrunk\Scripts" and download the Hadoop-Dev.ps1 script there.

  2. In a new Administrator (*) PowerShell session, dot-source the script (which ensures that all the functions and variables in there get into your session):

     . C:\HadoopTrunk\Scripts\Hadoop-Dev.ps1
    
  3. Get Ant: Get-Ant

  4. Get Maven: Get-Maven

  5. Get Protobuf: Get-ProtoBuf

  6. Get the source code: Get-Trunk

  7. This step will hopefully become obsolete soon, but at the time of writing there's an active bug where Hadoop on trunk doesn't function that well if Java is installed in a path that has spaces in it, which is the default case if Java is installed in "Program Files". So if this bug is still active and this applies to you, you may need to apply the patch for it: Apply-SpacePatch

  8. Build it! This is the most time-consuming step (about 10 minutes on my laptop) and the most error-prone one, so good luck: Build-Package

  9. If your build works, you should be able to configure a single-node cluster where all the important processes launch on your own box: Configure-SingleNode

  10. If all the above went well, you should have a directory (by default "C:\YarnSingleNode") that has everything configured to just work. To run the single-node setup, just run: Run-SingleNode
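
To decide whether step 7 applies to you, a quick check like the one below may be enough. This is only a sketch: the function name Test-JavaPathHasSpace is hypothetical, and it assumes your JDK location is given by the JAVA_HOME environment variable, which may not match how the script or the build actually locates Java.

```powershell
# Sketch: does the Java install path contain a space (the situation step 7 is about)?
# Assumption: the JDK location is taken from the JAVA_HOME environment variable.
function Test-JavaPathHasSpace {
    param([string] $JavaHome = $env:JAVA_HOME)
    if ([string]::IsNullOrEmpty($JavaHome)) {
        Write-Warning 'JAVA_HOME is not set, so the Java install path is unknown.'
        return $false
    }
    return $JavaHome.Contains(' ')
}

# True suggests the patch from step 7 is probably needed.
Test-JavaPathHasSpace
```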

So now, if you followed all these steps, you should have four PowerShell windows open running the four main Hadoop processes: ResourceManager and NodeManager for YARN, and NameNode and DataNode for DFS. If you don't have that after the last step, look through the logs, which I put by default in "C:\YarnSingleNode\logs".
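
If one of the processes didn't come up, scanning the logs for error lines is usually the fastest way in. The following is a rough sketch, assuming the logs are plain text files and that failures are marked with the upper-case levels ERROR or FATAL; the function name Find-LogErrors is my own and not part of the Hadoop-Dev.ps1 script.

```powershell
# Sketch: show the last few ERROR/FATAL lines found under the log directory.
# Assumptions: logs are plain text and use upper-case ERROR/FATAL level names.
function Find-LogErrors {
    param(
        [string] $LogDir = 'C:\YarnSingleNode\logs',
        [int] $Last = 20
    )
    if (-not (Test-Path $LogDir)) {
        Write-Warning "Log directory '$LogDir' does not exist."
        return
    }
    Get-ChildItem -Path $LogDir -Recurse |
        Where-Object { -not $_.PSIsContainer } |
        Select-String -Pattern '\b(ERROR|FATAL)\b' -CaseSensitive |
        Select-Object -Last $Last |
        ForEach-Object { '{0}:{1}: {2}' -f $_.Filename, $_.LineNumber, $_.Line.Trim() }
}

Find-LogErrors
```

Each output line has the form file:line: text, so you can jump straight to the matching spot in the log.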

Editing the code in Eclipse

A great way to edit and work with the Hadoop code is to use Eclipse. If you want to do that, it should be fairly simple:

  1. Download Eclipse and extract it, say to "C:\Eclipse".

  2. In your PowerShell window from above (or a new one where mvn is in the PATH), run the following:

     md C:\HadoopTrunk\Workspace
     cd C:\HadoopTrunk\hadoop-common
     mvn '-Declipse.workspace="C:\HadoopTrunk\Workspace"' eclipse:configure-workspace
     mvn eclipse:eclipse
    

    Note: If the last command above doesn't work, you may need to run the following and then re-run it:

     cd C:\HadoopTrunk\hadoop-common\hadoop-maven-plugins
     mvn install
    
  3. Open eclipse.exe from where you extracted it.

  4. Select "C:\HadoopTrunk\Workspace" as your workspace.

  5. File->Import->General->Existing Projects into Workspace. Use C:\HadoopTrunk\hadoop-common as your root directory and select all projects there.

This should get all the code into Eclipse, mostly building and navigable. I personally had to do the following steps to get the hadoop-common test code to compile as well:

  1. Right-click hadoop-common, Build Path->Configure Build Path
  2. Source tab, Link Source. Location: "C:\HadoopTrunk\hadoop-common\hadoop-common-project\hadoop-common\target\generated-test-sources\java". Name: "gen-test-sources". Finish.

* You may be able to get away with running as a normal user, but I ran into trouble compiling Hadoop because it creates the hadoop.dll file with the wrong permissions and then can't package it. If you know why that is and can work around it, please do post a comment.