Hadoop for .NET Developers: Understanding Hadoop Streaming

09/09/2013

NOTE: This post is one in a series on Hadoop for .NET Developers.

In the last post, we built a simple MapReduce job using C#. But Hadoop is a Java-based platform, so how can we run a MapReduce job written in a .NET language? The answer is Hadoop Streaming.

In a nutshell, Hadoop Streaming is a capability that allows any executable to serve as the mapper and/or reducer in a MapReduce job. MapReduce exchanges data with the executable through its standard input and output, i.e. stdin and stdout. You can write and compile your executables, load them to the cluster, and launch a streaming job from the command line yourself. Alternatively, you can let the .NET SDK handle the setup and execution of the streaming job on your behalf, which is exactly what happened in the last post.
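To make the stdin/stdout exchange concrete, here is a minimal sketch of what a streaming mapper does, written as a word-count example. It is not the code the .NET SDK generates; the function name `run_mapper` and the sample input are assumptions made for illustration. Python is used only to keep the sketch short; a C# console application would follow the same pattern with `Console.In` and `Console.Out`.

```python
import io


def run_mapper(stdin, stdout):
    """Read text lines from stdin and write one 'word<TAB>1' line per word."""
    for line in stdin:
        for word in line.split():
            # Key and value are separated by a tab, one record per line.
            stdout.write(word.lower() + "\t1\n")


# Local simulation with in-memory streams (hypothetical sample input).
# In a real streaming job the call would be run_mapper(sys.stdin, sys.stdout).
fake_input = io.StringIO("Hello Hadoop\nhello streaming\n")
fake_output = io.StringIO()
run_mapper(fake_input, fake_output)
print(fake_output.getvalue(), end="")
```

Passing the streams in as parameters keeps the logic testable without a cluster: the same function works against in-memory text locally and against the real standard streams when Hadoop launches the executable.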

As flexible as Hadoop Streaming is, it comes with a few constraints. First, the executable(s) must be able to run on the data nodes in your cluster. With the .NET SDK, this means a dependency on the .NET Framework 4.0 (which is already available on your HDInsight data nodes). As a result, you cannot write MapReduce jobs using .NET languages unless Hadoop is deployed on Windows.

Next, Hadoop Streaming works with files in a limited number of formats. For Hadoop Streaming on HDInsight, the formats supported by default are line-oriented text (with carriage return+line feed delimiters) and JSON. If you need to process files in other formats, you can pass in the names of those files as the job's input and rely on the flexibility of the map method and the .NET Framework to read them yourself.

Finally, Hadoop Streaming requires data to flow from your Mappers and Reducers as text in a key+tab+value format. If you are using the .NET SDK to write your MapReduce job (as we did in the last post), then the .NET SDK will handle this last constraint for you.
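As an illustration of the key+tab+value format, the sketch below shows a reducer that parses such lines and sums the values per key. It assumes the input arrives sorted by key, which is how Hadoop Streaming delivers mapper output to a reducer, and that everything before the first tab is the key. The names `parse_record` and `run_reducer` and the sample data are hypothetical, and this is not what the .NET SDK emits internally.

```python
import io
from itertools import groupby


def parse_record(line):
    """Split a 'key<TAB>value' line at the first tab."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


def run_reducer(stdin, stdout):
    """Sum integer values per key; assumes lines are already sorted by key."""
    records = (parse_record(line) for line in stdin)
    for key, group in groupby(records, key=lambda record: record[0]):
        total = sum(int(value) for _, value in group)
        stdout.write(f"{key}\t{total}\n")


# Local simulation with sorted mapper output (hypothetical sample data).
# In a real streaming job the call would be run_reducer(sys.stdin, sys.stdout).
sorted_input = io.StringIO("hadoop\t1\nhello\t1\nhello\t1\nstreaming\t1\n")
reduced = io.StringIO()
run_reducer(sorted_input, reduced)
print(reduced.getvalue(), end="")
```

Because the reducer only ever sees a flat stream of text lines, it has to detect key boundaries itself; `groupby` does that here by grouping consecutive lines that share a key.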

If you’d like to learn a bit more about Hadoop Streaming in Hadoop, please check out this resource.

