Hadoop for .NET Developers: Programmatically Loading Data to HDFS


NOTE This post is one in a series on Hadoop for .NET Developers.

In the last blog post in this series, we discussed how to manually load data to a cluster.  While this is fine for occasional needs, a programmatic approach is typically preferred.  To enable this, Hadoop presents a REST interface on HTTP port 50070.  And while you could program data loads directly against that interface, the .NET SDK makes available a WebHDFS client to simplify the process.
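To see what the SDK wraps, the REST interface can also be exercised with a plain HTTP request. Here is a minimal sketch using HttpClient; LISTSTATUS is a standard WebHDFS operation, while the /demo path and the hadoop user name are assumptions for illustration:

```csharp
using System;
using System.Net.Http;

class RawWebHdfsDemo
{
    static void Main()
    {
        // WebHDFS URLs take the form /webhdfs/v1/<path>?op=<OPERATION>
        var url = "http://localhost:50070/webhdfs/v1/demo?op=LISTSTATUS&user.name=hadoop";

        using (var http = new HttpClient())
        {
            // the name node answers with a JSON document describing the directory
            string json = http.GetStringAsync(url).Result;
            Console.WriteLine(json);
        }
    }
}
```

The WebHDFS client used in the steps below issues requests of exactly this shape on your behalf, so you rarely need to build these URLs by hand.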

To make use of the WebHDFS client, you must know which storage system is used within the cluster to which you are loading data.  By default, the WebHDFS client assumes the target cluster employs HDFS.  (If you are loading to HDInsight in Azure, Azure Storage Vault (ASV) is employed there, and slightly different steps are required with the client to load data.  Examples of loading data to HDInsight in Azure using the WebHDFS client are available in online documentation.)  In this post, we will focus on the use of the WebHDFS client against our local desktop development cluster, which makes use of HDFS.   These steps work equally well whether the cluster is local or remote to the application, so long as the cluster makes use of HDFS:

1. Launch Visual Studio and create a new C# console application.

2. Using the NuGet package manager, add the Microsoft .NET API for Hadoop WebClient package to your project.

3. If it is not already open, open the Program.cs file and add the following directive:

using Microsoft.Hadoop.WebHDFS;

4. In the main method add the following code:

//set variables
string srcFileName = @"c:\temp\ufo_awesome.tsv";
string destFolderName = "/demo/ufo/in";
string destFileName = "ufo_awesome.tsv";

//connect to hadoop cluster
Uri myUri = new Uri("http://localhost:50070");
string userName = "hadoop";
WebHDFSClient myClient = new WebHDFSClient(myUri, userName);

//drop destination directory (if exists)
myClient.DeleteDirectory(destFolderName, true).Wait();
           
//create destination directory
myClient.CreateDirectory(destFolderName).Wait();

 
//load file to destination directory
myClient.CreateFile(srcFileName, destFolderName + "/" + destFileName).Wait();

//list file contents of destination directory
Console.WriteLine();
Console.WriteLine("Contents of " + destFolderName);

myClient.GetDirectoryStatus(destFolderName).ContinueWith(
     ds => ds.Result.Files.ToList().ForEach(
     f => Console.WriteLine("\t" + f.PathSuffix)
     )).Wait();

//keep command window open until user presses enter
Console.ReadLine();

5. Run the application to load the file to the destination folder. 

Most of the code is pretty straightforward.  It starts with connecting to our local cluster, identified as http://localhost:50070. The hadoop user is a default user account within Hadoop which will be identified as the “owner” of the file we upload.

Once connected, an instruction to delete the destination folder (if it exists) is sent across, followed by an instruction to create the destination folder.  We could have been more sophisticated here, but this simple code gets the job done.

A file is then created within HDFS using the contents of the source file. As we are running this on our local desktop development cluster, the source file happens to reside on the name node, but that is not a requirement for the WebHDFS client.  The only requirements are that the source file is accessible to our application and that the destination folder is accessible.

Once the file is created, the contents of the destination folder are retrieved and printed to the console screen for us to review.
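To verify the upload end to end, the file can also be read back through the same client. This is a minimal sketch, assuming the package exposes an OpenFile method that issues the WebHDFS OPEN operation and returns the raw HTTP response (the path and connection details mirror the example above):

```csharp
using System;
using Microsoft.Hadoop.WebHDFS;

class ReadBackDemo
{
    static void Main()
    {
        // same connection details as the load example above
        var client = new WebHDFSClient(new Uri("http://localhost:50070"), "hadoop");

        // OpenFile is assumed here to return the HTTP response whose body
        // is the file's contents
        var response = client.OpenFile("/demo/ufo/in/ufo_awesome.tsv").Result;
        string contents = response.Content.ReadAsStringAsync().Result;

        // print only the first few hundred characters as a spot check
        Console.WriteLine(contents.Substring(0, Math.Min(200, contents.Length)));
    }
}
```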

The only tricky element of this code is the use of the Wait method on the various calls. The REST calls made by the WebHDFS client are asynchronous by default, and the Wait method forces the code to wait for their completion before proceeding to the next line.  To the best of my knowledge, there is no requirement for your code to wait before proceeding, but as this example is very linear in its execution, waiting makes sense.  More details on parallel task execution and the use of the Wait method are found here.
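For completeness, the same linear sequence can be written with async/await rather than blocking Wait calls. Here is a sketch assuming the same WebHDFSClient methods shown above:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Hadoop.WebHDFS;

class AsyncLoadDemo
{
    // the same delete / create / upload sequence, awaited instead of blocked
    static async Task LoadAsync()
    {
        var client = new WebHDFSClient(new Uri("http://localhost:50070"), "hadoop");
        await client.DeleteDirectory("/demo/ufo/in", true);
        await client.CreateDirectory("/demo/ufo/in");
        await client.CreateFile(@"c:\temp\ufo_awesome.tsv", "/demo/ufo/in/ufo_awesome.tsv");
    }

    static void Main()
    {
        // a console Main cannot be async in older versions of C#,
        // so we block once at the very top of the call chain
        LoadAsync().Wait();
    }
}
```

The awaited version reads more naturally when the sequence grows beyond a few calls, but for a short linear example like this one, either style works.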


Comments (10)

  1. Max says:

    What about security?

  2. I'm still very much learning Hadoop, so please take my comments on security with a grain of salt.  But from my perspective, security in Hadoop (as it stands today) does not appear to be strict.  Instead, the use of user names seems more of a means to avoid inadvertently stomping on other folks' work as opposed to imposing hard conditions on data access.

  3. Anoop says:

    Good intro. As documentation on this is sparse, here are a few articles I published earlier on building .NET apps for HDInsight:

    http://www.amazedsaint.com/…/Hadoop

  4. Anoop –

    Thanks for the link to your blog.  You've done a lot of work to help folks understand this stuff. I'll add it to the post that starts this series so that anyone that comes here can tap into your work.  Thanks again.

  5. Andres says:

    Good article. I'm using your code as a guide to create files in a Hadoop cluster in Amazon. However, I'm facing issues with big files (>1 MB) because the connection gets reset every time, and I haven't been able to solve the problem even by changing the timeout for the request. Have you faced any problems using big files?

  6. @Andres –

    Sorry but I don't have any experience doing this in AWS. If your cluster used HDFS, I would expect the WebHDFS client to work the same with it. If it used Amazon S3 storage, it may or may not work as described above. Have you tried Azure? 🙂

  7. Pachi says:

    Good article

  8. tajinder says:

    hello guys, I am also learning Hadoop and I have to complete my masters project. I would really appreciate it if someone can help me.

    tajindersingh4487@gmail,com is my email id

    cheers

  9. Byrd,Jonathan says:

    may be related to @Andres' question, but has anyone used WebHDFS with Cloudera (CDH)? I would imagine it's the same. Although in our environment we have access to load the files directly onto the Linux box and then simply copy them over within the Linux environment using normal hdfs copy commands.