Hadoop for .NET Developers: Programmatically Loading Data to HDFS

NOTE This post is one in a series on Hadoop for .NET Developers.

In the last post in this series, we discussed how to manually load data to a cluster. While this is fine for occasional needs, a programmatic approach is usually preferred. To enable this, Hadoop exposes a REST interface (WebHDFS) over HTTP on port 50070. While you could program data loads directly against that interface, the .NET SDK provides a WebHDFS client to simplify the process.
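To give a feel for what the client wraps, here is a minimal sketch of how a WebHDFS REST URL is put together. It assumes the standard WebHDFS layout of /webhdfs/v1/<path>?op=<OPERATION>&user.name=<user> and uses only the base class library. The WebHdfsUrl class and its Build method are names made up for this illustration; they are not part of the .NET SDK.

```csharp
using System;

public static class WebHdfsUrl
{
    // Hypothetical helper: composes a WebHDFS REST URL of the form
    // http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>&user.name=<user>
    public static Uri Build(Uri nameNode, string hdfsPath, string operation, string userName)
    {
        var builder = new UriBuilder(nameNode)
        {
            // HDFS paths are absolute, so strip the leading slash before appending
            Path = "/webhdfs/v1/" + hdfsPath.TrimStart('/'),
            Query = "op=" + Uri.EscapeDataString(operation) +
                    "&user.name=" + Uri.EscapeDataString(userName)
        };
        return builder.Uri;
    }
}

public static class Program
{
    public static void Main()
    {
        // LISTSTATUS is the WebHDFS operation that lists a directory
        Uri listUrl = WebHdfsUrl.Build(
            new Uri("http://localhost:50070"), "/demo/ufo/in", "LISTSTATUS", "hadoop");
        Console.WriteLine(listUrl.AbsoluteUri);
    }
}
```

Issuing an HTTP GET against a URL like this one is roughly what a directory listing amounts to at the REST level; the client in the SDK saves us from assembling these by hand.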

To use the WebHDFS client, you must know which storage system the target cluster uses. By default, the WebHDFS client assumes the target cluster employs HDFS. (If you are loading to HDInsight in Azure, ASV (Azure Blob storage) is employed there, and slightly different steps are required with the client to load data. Examples of loading data to HDInsight in Azure using the WebHDFS client are available in the online documentation.) In this post, we will focus on using the WebHDFS client against our local desktop development cluster, which uses HDFS. These steps work equally well whether the cluster is local or remote to the application, so long as the cluster uses HDFS:

1. Launch Visual Studio and create a new C# console application.

2. Using the NuGet package manager, add the Microsoft .NET API for Hadoop WebClient package to your project.

3. If it is not already open, open the Program.cs file and add the following directive:

using Microsoft.Hadoop.WebHDFS;

4. In the Main method, add the following code:

//set variables
string srcFileName = @"c:\temp\ufo_awesome.tsv";
string destFolderName = "/demo/ufo/in";
string destFileName = "ufo_awesome.tsv";

//connect to hadoop cluster
Uri myUri = new Uri("http://localhost:50070");
string userName = "hadoop";
WebHDFSClient myClient = new WebHDFSClient(myUri, userName);

//drop destination directory (if exists)
myClient.DeleteDirectory(destFolderName, true).Wait();
           
//create destination directory
myClient.CreateDirectory(destFolderName).Wait();

 
//load file to destination directory
myClient.CreateFile(srcFileName, destFolderName + "/" + destFileName).Wait(); 

//list file contents of destination directory
Console.WriteLine();
Console.WriteLine("Contents of " + destFolderName);

myClient.GetDirectoryStatus(destFolderName).ContinueWith(
     ds => ds.Result.Files.ToList().ForEach(
     f => Console.WriteLine("\t" + f.PathSuffix)
     ));

//keep command window open until user presses enter
Console.ReadLine();

5. Run the application to load the file to the destination folder. 

Most of the code is pretty straightforward. It starts by connecting to our local cluster at http://localhost:50070. The hadoop user is a default user account within Hadoop and will be identified as the "owner" of the file we upload.

Once connected, an instruction to delete the destination folder (along with anything already in it) is sent, followed by an instruction to create the folder again. We could have been more sophisticated here, but this simple code gets the job done.

A file is then created within HDFS using the contents of the source file. As we are running this on our local desktop development cluster, the source file happens to reside on the name node but that is not a requirement for the WebHDFS client. The only requirement is that the source file is accessible to our application and the destination folder is accessible.

Once the file is created, the contents of the destination folder are retrieved and printed to the console screen for us to review.
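At the REST level, a directory listing corresponds to the WebHDFS LISTSTATUS operation, which returns JSON with one entry per file; the pathSuffix field in each entry is presumably where the client's PathSuffix property comes from. The sketch below parses such a response. It assumes System.Text.Json is available (it ships with newer versions of .NET, not with the older .NET Framework), the sample JSON is hand-written for illustration rather than captured from a cluster, and ListStatusParser is a made-up name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class ListStatusParser
{
    // Hypothetical helper: pulls the pathSuffix of every entry out of a
    // LISTSTATUS response shaped like {"FileStatuses":{"FileStatus":[...]}}
    public static List<string> PathSuffixes(string json)
    {
        var names = new List<string>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            JsonElement entries = doc.RootElement
                .GetProperty("FileStatuses")
                .GetProperty("FileStatus");
            foreach (JsonElement entry in entries.EnumerateArray())
            {
                names.Add(entry.GetProperty("pathSuffix").GetString());
            }
        }
        return names;
    }
}

public static class Program
{
    public static void Main()
    {
        // Hand-written sample with only the fields this sketch needs
        string sample =
            @"{""FileStatuses"":{""FileStatus"":[{""pathSuffix"":""ufo_awesome.tsv"",""type"":""FILE""}]}}";

        foreach (string name in ListStatusParser.PathSuffixes(sample))
        {
            Console.WriteLine("\t" + name);
        }
    }
}
```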

The only tricky element in this code is the use of the Wait method on the various calls. The REST calls made by the WebHDFS client are asynchronous: each one returns a task immediately, and calling Wait on that task blocks until it completes before the next line runs. To the best of my knowledge, there is no requirement for your code to wait before proceeding, but because each step in this example depends on the one before it (the folder must exist before a file can be created in it), waiting makes sense. More details on parallel task execution and the use of the Wait method can be found in the .NET Task Parallel Library documentation.
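To see the effect of Wait in isolation, here is a small sketch that needs no cluster. StepAsync is a hypothetical stand-in for a client call (it just pauses briefly and records its name), and the WaitDemo class is invented for this illustration. The first method blocks on each task with Wait, as the code in this post does; the second does the same work with await, which is an alternative if you would rather not block the calling thread.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class WaitDemo
{
    // Hypothetical stand-in for an asynchronous client call: finishes after
    // a short delay and records its name so the completion order is visible.
    public static async Task StepAsync(string name, List<string> log)
    {
        await Task.Delay(50);
        lock (log) { log.Add(name); }
    }

    // Blocking style: each Wait holds the current thread until the task ends,
    // so the steps complete strictly one after another.
    public static List<string> RunBlocking()
    {
        var log = new List<string>();
        StepAsync("delete", log).Wait();
        StepAsync("create", log).Wait();
        StepAsync("upload", log).Wait();
        return log;
    }

    // Awaited style: same ordering, but the thread is released while waiting.
    public static async Task<List<string>> RunAwaitedAsync()
    {
        var log = new List<string>();
        await StepAsync("delete", log);
        await StepAsync("create", log);
        await StepAsync("upload", log);
        return log;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", RunBlocking()));
        // Blocking on Result is fine in a console application
        Console.WriteLine(string.Join(", ", RunAwaitedAsync().Result));
    }
}
```

One practical difference worth knowing: if a task fails, Wait surfaces the failure wrapped in an AggregateException, while await rethrows the original exception.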