Processing already-sorted data with Hadoop Map/Reduce jobs without performance overhead

While working with Map/Reduce jobs in Hadoop, it is quite possible that you already have sorted data stored in HDFS. As you may know, sorting happens not only after the map phase of a map task but also during the merge phase of a reduce task, so sorting already-sorted data again is a significant performance overhead. In this situation you may want your Map/Reduce job not to sort the data at all.


Note: If you have tried changing map.sort.class to a no-op, you will have found that it does not work either.
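For reference, that failed attempt looks roughly like the sketch below, written against the Hadoop 1.x org.apache.hadoop.util.IndexedSorter interface that map.sort.class implementations must satisfy (the class name here is illustrative):

    import org.apache.hadoop.util.IndexedSortable;
    import org.apache.hadoop.util.IndexedSorter;
    import org.apache.hadoop.util.Progressable;

    // A do-nothing sorter plugged in via map.sort.class. It leaves map
    // output records in arrival order, but the spill-and-merge machinery
    // downstream still assumes every run is sorted, so the job misbehaves
    // instead of simply skipping the sort.
    public class NoOpSorter implements IndexedSorter {
      @Override
      public void sort(IndexedSortable s, int l, int r) {
        // intentionally empty: do not reorder anything
      }

      @Override
      public void sort(IndexedSortable s, int l, int r, Progressable rep) {
        // intentionally empty
      }
    }

Wiring it in with conf.setClass("map.sort.class", NoOpSorter.class, IndexedSorter.class) is exactly the dead end the note above describes.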


So the questions are:

  • Is it possible to force Map/Reduce not to sort the data again (since it is already sorted) after the map phase?
  • Or, how do you run Map/Reduce jobs in a way that lets you control whether the results come out sorted or unsorted?


So if you do not need the results to be sorted, the following Hadoop patch would be a great place to start:
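Assuming the patch is applied, using it should come down to flipping the mapred.map.output.sort switch it introduces. The property name is taken from the review comments quoted below; the driver class and I/O choices in this sketch are illustrative, and the old JobConf API is used since the patch targets the 0.20/1.0 line:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;

    public class NoSortJobDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(NoSortJobDriver.class);
        conf.setJobName("process-presorted-data");

        // The switch the patch introduces: skip sorting of map output.
        // Per the review comments below, combiners are NOT compatible
        // with this flag, so do not set one on such a job.
        conf.setBoolean("mapred.map.output.sort", false);

        // Tab-separated Text/Text records; mapper and reducer are left
        // at the old-API identity defaults for brevity.
        conf.setInputFormat(KeyValueTextInputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
      }
    }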


Note: Before using the above patch, I would suggest reading the following comments from Robert about it:

  • Combiners are not compatible with mapred.map.output.sort. Is there a reason why we could not make combiners work with this, so long as they must follow the same assumption that they will not get sorted input? If the algorithms you are thinking about would never get any benefit from a combiner, could you also add the check in the client. I would much rather have the client blow up with an error instead of waiting for my map tasks to launch and then blow up 4+ times before I get the error.
  • In your test you never validate that the output is what you expected it to be. That may be hard as it may not be deterministic because there is no sorting, but it would be nice to have something verify that the code did work as expected. Not just that it did not crash.
  • mapred-default.xml. Please add mapred.map.output.sort to mapred-default.xml. Include with it a brief explanation of what it does.
  • There is no documentation or examples. This is a new feature that could be very useful to lots of people, but if they never know it is there it will not be used. Could you include in your patch updates to the documentation about how to use this, and some useful examples, preferably simple. Perhaps an example computing CTR would be nice.
  • Performance. The entire reason for this change is to improve performance, but I have not seen any numbers showing a performance improvement. No numbers at all, in fact. It would be great if you could include here some numbers along with the code you used for your benchmark and a description of your setup. I have spent time on different performance teams and performance improvement efforts, from a huge search engine to an OS on a cell phone, and the one thing I have learned is that you have to go off of the numbers because, well, at least for me, my intuition is often wrong and what I thought would make it faster slowed it down instead.
  • Trunk. This patch is specific to 0.20/1.0 line. Before this can get merged into the 0.20/1.0 lines we really need an equivalent patch for trunk, and possibly 0.21, 0.22, and 0.23. This is so there are no regressions. It may be a while off after you get the 1.0 patch cleaned up though.
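On the CTR (click-through rate) example Robert asks for: purely as an illustration (none of this comes from the patch, and the "click"/"impression" tagging convention is invented), here is a reducer that tallies clicks and impressions per ad id in a hash map and emits the ratios when it closes. Because all of its state lives in the hash map, it stays correct whether or not its input arrives sorted or grouped by key, which is exactly the kind of algorithm for which the framework's sort is wasted work. It does assume the set of distinct ad ids fits in memory.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Order-independent CTR reducer: all state lives in the hash map,
    // so sorted input buys it nothing.
    public class CtrReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, DoubleWritable> {

      // ad id -> {clicks, impressions}
      private final Map<String, long[]> counts = new HashMap<String, long[]>();
      private OutputCollector<Text, DoubleWritable> out;

      public void reduce(Text adId, Iterator<Text> events,
          OutputCollector<Text, DoubleWritable> output, Reporter reporter)
          throws IOException {
        out = output; // saved so close() can emit the final ratios
        long[] c = counts.get(adId.toString());
        if (c == null) {
          c = new long[2];
          counts.put(adId.toString(), c);
        }
        while (events.hasNext()) {
          // assumed convention: the mapper tags each event "click" or "impression"
          if ("click".equals(events.next().toString())) {
            c[0]++;
          } else {
            c[1]++;
          }
        }
      }

      public void close() throws IOException {
        for (Map.Entry<String, long[]> e : counts.entrySet()) {
          long clicks = e.getValue()[0];
          long impressions = e.getValue()[1];
          double ctr = impressions == 0 ? 0.0 : (double) clicks / impressions;
          out.collect(new Text(e.getKey()), new DoubleWritable(ctr));
        }
      }
    }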

Keywords: Hadoop, Map/Reduce, Job Performance, Hadoop Patch