To better support configuring the Streaming environment whilst running .Net Streaming jobs, I have made a change to the “Generics based Framework for .Net Hadoop MapReduce Job Submission” code.
I have fixed a few bugs around setting job configuration options that were being controlled by the submission code. More importantly, however, I have added support for two additional command-line submission options:
- jobconf – to support setting standard job configuration values
- cmdenv – to support setting environment variables which are accessible to the Map/Reduce code
The full set of options for the command line submission is now:
-help (Required=false) : Display Help Text
-input (Required=true) : Input Directory or Files
-output (Required=true) : Output Directory
-mapper (Required=true) : Mapper Class
-reducer (Required=true) : Reducer Class
-combiner (Required=false) : Combiner Class (Optional)
-format (Required=false) : Input Format |Text(Default)|Binary|Xml|
-numberReducers (Required=false) : Number of Reduce Tasks (Optional)
-numberKeys (Required=false) : Number of MapReduce Keys (Optional)
-outputFormat (Required=false) : Reducer Output Format |Text|Binary| (Optional)
-file (Required=true) : Processing Files (Must include Map and Reduce Class files)
-nodename (Required=false) : XML Processing Nodename (Optional)
-cmdenv (Required=false) : List of Environment Variables for Job (Optional)
-jobconf (Required=false) : List of Job Configuration Parameters (Optional)
-debug (Required=false) : Turns on Debugging Options
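To illustrate how these options fit together, a submission might look like the following sketch. The executable name, class names, assembly, and paths are hypothetical placeholders, assuming the framework's submission console is invoked in the usual fashion:

```shell
# Hypothetical submission (executable, assembly, and paths are placeholders):
MSDN.Hadoop.Submission.Console.exe \
  -input "wordcount/input" \
  -output "wordcount/output" \
  -mapper "MyJobs.WordCountMapper, MyJobs" \
  -reducer "MyJobs.WordCountReducer, MyJobs" \
  -file "C:\Jobs\MyJobs.dll"
```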
The job configuration option supports providing a list of standard job options. As an example, to set the name of a job and compress the Map output, which could improve performance, one would add:
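The configuration values are passed as name/value pairs; `mapred.job.name` and `mapred.compress.map.output` are the standard Hadoop property names for these two settings, though the job name itself is illustrative:

```shell
# Appended to the normal submission command line:
-jobconf "mapred.job.name=.Net WordCount" -jobconf "mapred.compress.map.output=true"
```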
For a complete list of options one would need to review the Hadoop documentation.
The command environment option supports setting environment variables that are accessible to the Streaming process. Note, however, that Hadoop replaces any non-alphanumeric characters in the variable name with an underscore “_” character. As an example, if one wanted to pass a filter value into the Streaming process, one would add:
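Again the values are passed as name/value pairs (the variable name and value here are illustrative). Bearing in mind the character substitution mentioned above, a name such as `data.filter` would surface inside the process as `data_filter`:

```shell
# Appended to the normal submission command line:
-cmdenv "data.filter=Mobile"
```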
This would then be accessible in the Streaming process.
To support providing better feedback into the running Hadoop environment I have added a new static class, named Context, to the code. The Context object contains the original FormatKeys() and GetKeys() operations, along with the following additions:
The code contained in this Context object, although simple, will hopefully provide some abstraction over the idiosyncrasies of using the Streaming interface.
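For those curious how this feedback reaches Hadoop: a Streaming task reports counters and status updates by writing specially formatted lines to standard error, which is the mechanism a helper class like Context can wrap. A minimal sketch of the underlying reporter protocol (the group, counter, and status text are illustrative):

```shell
# Hadoop Streaming reporter protocol: lines written to stderr in this
# format update job counters and the task status shown by the framework.
echo "reporter:counter:MyJobGroup,RecordsProcessed,1" >&2
echo "reporter:status:Processing record batch" >&2
```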