NOTE: This post is one in a series on Hadoop for .NET Developers.
In Hadoop, data processing is tackled through MapReduce jobs. A job consists of basic configuration information, e.g. paths to input files and an output folder, and is executed by Hadoop’s MapReduce layer as a series of tasks. These tasks execute map and reduce functions to turn input data into output results.
To illustrate how MapReduce works, consider a simple input file consisting of tab-delimited records (on the far left of the diagram below). For illustrative purposes, each line will be labeled A through F.
A MapReduce job is specified against this input file and through some internal logic, let’s say MapReduce determines that records A through C should be executed by one map task (Map Task 0) and records D through F should be executed by another (Map Task 1).
When the MapReduce job was defined, a Mapper class was identified. The Mapper class contains a map function. Each map task executes the map function against each of its input records, so Map Task 0 executes the map function three times: once against Record A, again against Record B, and once more against Record C. Similarly, Map Task 1 executes the map function against each of its three records, D through F.
The map function accepts a single line of incoming data and is expected to emit output in the form of key-value combinations. A typical map function emits a single key and value derived from the incoming data, but a map function may also generate zero or multiple key-value combinations for a given input. In our example, there is a one-to-one correspondence between input data lines and the key-value combinations emitted by the map function.
What then should the keys and values emitted by the map function be? In our example, keys are identified as k1, k2, and k3, with various values associated with each. What is provided for the key and value is entirely up to the Mapper class developer, but typically the key is some value around which data can be grouped, and the value is something derived from the input (a delimited list of values, an XML or JSON fragment, or something even more complex) that will be useful to the reduce function. If that isn't 100% clear right now, hopefully it will be once we've covered the reduce function.
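To make the map step concrete, here is a minimal sketch in Python. This is not the actual Hadoop API (Hadoop mappers are typically Java classes, or are written against the Hadoop streaming/.NET SDK interfaces); the function and variable names here are ours, and we assume each tab-delimited line looks like "key<TAB>value".

```python
def map_function(line):
    """Emit zero or more (key, value) pairs for one input line.

    Assumed input format (for illustration): "key<TAB>value".
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        return []  # a map function may legitimately emit nothing for a record
    key, value = fields[0], fields[1]
    return [(key, value)]

# Example: three records handled by one hypothetical map task
records = ["k1\t10", "k2\t20", "k1\t5"]
pairs = [pair for record in records for pair in map_function(record)]
# pairs == [("k1", "10"), ("k2", "20"), ("k1", "5")]
```

Note that the function returns a list of pairs rather than a single pair, which mirrors the point above: a map function may emit zero, one, or many key-value combinations per input record.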
Once the map tasks have completed their work, the data output from their various calls of the map function is sorted and shuffled so that all values associated with a given key are grouped together. Reduce tasks are then generated, and keys and their values are handed over to those tasks for processing. In our example, MapReduce has generated two reduce tasks, handing the values for k1 and k2 to Reduce Task 0 and the values for k3 to Reduce Task 1.
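The sort-and-shuffle phase can be sketched as a simple grouping operation. Again, this is an illustration of the idea rather than Hadoop's actual implementation (which sorts and moves data across the cluster); the names below are ours.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group all values emitted by the map tasks under their keys,
    mimicking MapReduce's sort-and-shuffle phase."""
    grouped = defaultdict(list)
    for key, value in sorted(pairs):  # sort by key, then collect values
        grouped[key].append(value)
    return dict(grouped)

# Combined output of all map tasks in our hypothetical job
map_output = [("k1", "10"), ("k2", "20"), ("k1", "5"), ("k3", "7")]
grouped = shuffle(map_output)
# grouped == {"k1": ["10", "5"], "k2": ["20"], "k3": ["7"]}
```

After this step, each key appears exactly once, paired with the full array of values emitted for it, which is exactly the shape of input the reduce function expects.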
As part of the MapReduce job definition, a Reducer class was identified. This class contains a reduce function, and the reduce task supplies this function with a key and its array of associated values. The reduce function typically generates a summarized value derived from the array of values for the given key, but as before, the developer retains quite a bit of flexibility in terms of what output is emitted and how it is derived.
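Continuing the sketch, a reduce function receives one key and the array of values grouped under it, and returns a summarized result. Here we assume the values are numeric strings and simply sum them; a real Reducer could emit anything.

```python
def reduce_function(key, values):
    """Summarize the array of values for one key; here, a simple sum."""
    return (key, sum(int(v) for v in values))

# The grouped data a reduce task might receive after sort-and-shuffle
grouped = {"k1": ["10", "5"], "k2": ["20"], "k3": ["7"]}
results = [reduce_function(key, values) for key, values in grouped.items()]
# results == [("k1", 15), ("k2", 20), ("k3", 7)]
```

Each reduce task would run this function once per key assigned to it and write its results to its own output file, as described below.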
The reduce function output for each reduce task is written to a file in the output folder identified as part of the MapReduce job's configuration. One file is produced per reduce task, so our example with two reduce tasks produces two output files. These files can be accessed individually, but more typically they are combined into a single output file using the getmerge command (at the command line) or similar functionality.
If that explanation makes sense, let's add a little complexity to the story. Not every job consists of both a Mapper and a Reducer class. At a minimum, a MapReduce job must have a Mapper class, and if all the data processing work can be tackled through the map function, that is the end of the road. In that scenario, the output files align with the map tasks and are found under similar names in the output folder identified in the job configuration.
In addition, MapReduce jobs can accept a Combiner class. The Combiner class is defined like the Reducer class (and has a reduce function), but it runs on the data associated with a single map task. The idea behind the Combiner class is that some data can be "reduced" (summarized) before it is sorted and shuffled. Sorting and shuffling data to get it over to the reduce tasks is an expensive operation, and the Combiner class can serve as an optimizer, reducing the amount of data moving between tasks. Combiner classes are by no means required, but you should consider using them when you need to squeeze performance out of your MapReduce job.
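The combiner idea can be sketched the same way: run a reduce-style summarization over a single map task's output before anything is shuffled. This sketch assumes, as above, numeric string values that can be summed; the names are ours for illustration.

```python
def combine(map_task_output):
    """Apply a reduce-style summarization to ONE map task's output
    before it is shuffled, shrinking the data sent between tasks."""
    grouped = {}
    for key, value in map_task_output:
        grouped[key] = grouped.get(key, 0) + int(value)
    return list(grouped.items())

# Map Task 0 emitted three pairs; the combiner collapses the duplicate key
task0_output = [("k1", "10"), ("k1", "5"), ("k2", "20")]
combined = combine(task0_output)
# combined == [("k1", 15), ("k2", 20)] -- two pairs shuffled instead of three
```

In this toy case the savings are trivial, but on a task emitting millions of pairs with many repeated keys, pre-summarizing locally can substantially cut the data moved during sort-and-shuffle. Note this only works when the reduce operation tolerates being applied in stages, as a sum does.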