It has been a few months since I made a change to the “Generics based Framework for .Net Hadoop MapReduce Job Submission” code. However, while putting together a sample for a Reduce side join, I came across an issue around the usage of partitioners. As such, I decided to add support for custom partitioners and comparators before stamping the release as version 1.0.
The submission options to be added will be -numberPartitionKeys (for partitioning the data) and -comparatorOption (for sorting). Before talking about these options, let's cover a little of the Hadoop Streaming Documentation.
A Useful Partitioner Class
Hadoop has a library class, KeyFieldBasedPartitioner, that allows the MapReduce framework to partition the map outputs on key segments rather than the whole key. Consider the following job submission options:
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner -Dstream.num.map.output.key.fields=4 -Dmapred.text.key.partitioner.options=-k1,2
The -Dstream.num.map.output.key.fields=4 option specifies that the map output keys will have 4 fields. In this case the -Dmapred.text.key.partitioner.options=-k1,2 option tells the MapReduce framework to partition the map outputs by the first two fields of the key, rather than by all four fields. This guarantees that all the key/value pairs with the same first two fields in the keys will be partitioned to the same reducer.
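To make the effect of partitioning on a key prefix concrete, here is a minimal Python sketch. It is not Hadoop's implementation: the `partition` function, the tab separator, the sample keys and the use of `zlib.crc32` as the hash are all assumptions chosen for illustration. The only point it makes is that keys sharing the first two fields always land on the same reducer.

```python
import zlib

def partition(key, num_reducers, num_partition_fields=2, sep="\t"):
    """Pick a reducer index from the leading fields of a separated key.

    Illustrative only: Hadoop's KeyFieldBasedPartitioner uses its own hash,
    so the actual reducer numbers will differ from the ones computed here.
    """
    prefix = sep.join(key.split(sep)[:num_partition_fields])
    return zlib.crc32(prefix.encode("utf-8")) % num_reducers

# Hypothetical four-field keys; the first two records share fields 1 and 2.
keys = [
    "2012\tUK\tLondon\t7",
    "2012\tUK\tLeeds\t3",
    "2012\tUS\tBoston\t5",
]

for k in keys:
    print(partition(k, num_reducers=4), k.split("\t"))
```

The two "2012 / UK" records always receive the same reducer index, even though their full four-field keys differ.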
To simplify this specification the submission framework supports the following command line options:
-numberKeys 4 -numberPartitionKeys 2
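As a rough sketch of the mapping between the two forms, the following Python function builds the Hadoop Streaming options shown earlier from the two simplified values. The function name `streaming_partition_args` is hypothetical, and how the framework performs this translation internally is an assumption; only the option strings themselves come from the text above.

```python
from typing import List

def streaming_partition_args(number_keys: int, number_partition_keys: int) -> List[str]:
    """Translate -numberKeys / -numberPartitionKeys into streaming options.

    Hypothetical helper: it reproduces the option strings quoted in the
    post, not the framework's actual internal code.
    """
    if not 1 <= number_partition_keys <= number_keys:
        raise ValueError("partition keys must be between 1 and the number of keys")
    return [
        # Generic -D options are placed before command options such as -partitioner.
        f"-Dstream.num.map.output.key.fields={number_keys}",
        f"-Dmapred.text.key.partitioner.options=-k1,{number_partition_keys}",
        "-partitioner",
        "org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner",
    ]

print(" ".join(streaming_partition_args(4, 2)))
```

Calling it with 4 and 2 yields the same three settings as the longer job submission options above.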
When writing Streaming applications, one normally has to do the partition processing in the Reducer. However, with these options the framework will correctly send the appropriate data to the Reducer. This was the biggest change that needed to be made to the Framework.
This partition processing is important for handling Reduce side joins; the topic of my next blog post.
A Useful Comparator Class
Hadoop also has a library class, KeyFieldBasedComparator, that provides field-based sorting of the key data. Consider the following job submission options:
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator -D mapred.text.key.comparator.options=-k2,2nr
Here the -D mapred.text.key.comparator.options=-k2,2nr option tells the MapReduce framework to sort the outputs on the second key field. The -n specifies that the sorting is numerical and the -r specifies that the result should be reversed. The sort options are similar to a Unix sort, but here are some simple examples:
Specific sort specification:
For the general sort specification the -k flag is used, which allows you to specify the sorting key using the x, y values, as in the sample above.
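The ordering that -k2,2nr asks for can be sketched in a few lines of Python. This is only an illustration of the sort semantics, not of how KeyFieldBasedComparator is implemented; the function name, the tab separator and the sample keys are assumptions.

```python
def sort_k2_numeric_reverse(keys, sep="\t"):
    """Order keys by their second field, numerically, largest first.

    Mirrors the intent of -k2,2nr: field 2 only (k2,2), numeric (n), reversed (r).
    """
    return sorted(keys, key=lambda k: float(k.split(sep)[1]), reverse=True)

# Hypothetical two-field keys.
keys = ["a\t3", "b\t10", "c\t1"]

for k in sort_k2_numeric_reverse(keys):
    print(k.split("\t"))
```

The key with 10 in its second field comes first, followed by 3 and then 1. A plain text sort would instead compare "10" and "3" character by character, which is why the numeric flag matters.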
To simplify this specification the submission framework supports the following command line option:
The framework then takes care of setting the appropriate job configuration values.
Although one could define the Partitioner and Comparator options using the job configuration parameters, hopefully these new options make the process a lot simpler. In the case of the Partitioner options, it also allows the framework to easily identify the difference between the number of sorting keys and the number of partitioner keys. This allows the correct data to be sent to each Reducer.
As mentioned, in my next post I will cover how to use the Framework with these options for doing a Reduce side join.