Map/Reduce - in Functional Programming & Parallel Processing Perspectives

Map/Reduce is a very popular term pair in today's technical community, mainly due to the popularity of its "inventor" - Google.

But in fact, the terms and concepts of map & reduce existed in the programming language community long before G company's successful paper "MapReduce: Simplified Data Processing on Large Clusters", which appeared in OSDI '04.

In this article, I want to summarize what this term pair means in the functional programming literature and the parallel processing literature, respectively.

I - Map/Reduce in Functional Programming Perspective

  Functional programming has a long history in academia, but it has not been widely adopted in developer communities yet. It has some beautiful features compared with the imperative languages we use daily. The higher-order function is one of them: basically, it means that functions can be used as input parameters or return values in a function definition.

  Among the various higher-order functions, map, fold and filter are the most popular ones (a runnable sketch follows the three definitions below):

- Map is a higher-order function that applies a given function (a.k.a. the transformer) element-wise to a list of elements and returns the list of results. The transformer is a function that is applied to each element and produces a new element.

for example: map toLower "abcDEFG12!@#" produces "abcdefg12!@#"

- Fold (a.k.a. reduce, accumulate) is a higher-order function that processes a list of elements in some order (using a combiner function) and builds up a return value. The combiner is a function that takes two arguments, the result accumulated so far and the next element of the list, and produces a new accumulated result.

for example: foldl (+) 0 [1..5] produces 15, which is the sum of all the elements.

- Filter is a higher-order function that processes a list of elements in some order and produces a result containing exactly those original elements for which a given predicate returns true. The predicate is a function that takes one element as its input and returns either true or false.

for example: filter isAlpha "$#!+abcDEF657" produces "abcDEF"
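
  To make these concrete, here is a minimal runnable Haskell sketch of the three examples above (toLower and isAlpha come from Data.Char):

    import Data.Char (toLower, isAlpha)

    main :: IO ()
    main = do
      -- map: apply the transformer to every element
      print (map toLower "abcDEFG12!@#")      -- "abcdefg12!@#"
      -- fold: combine all the elements with (+), starting from 0
      print (foldl (+) 0 [1..5])              -- 15
      -- filter: keep only the elements that satisfy the predicate
      print (filter isAlpha "$#!+abcDEF657")  -- "abcDEF"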

 
 Essentially, these three higher-order functions apply an operation to some list/array and produce a result: map transforms each element, filter keeps some elements and drops the rest, and reduce combines all the elements.

 
 Pure functional languages, such as Haskell and Lisp, and some multi-paradigm languages, such as Python, have built-in functions named map and reduce (or close variants such as fold). C# 3.0 introduced some functional features in the LINQ subsystem, where map is called Select and reduce is called Aggregate.

  More concrete examples can be found in [2].

II - Map/Reduce in Parallel Processing Perspective

  Map/Reduce is both a programming model and an implementation runtime. The programming model is what you use to express your computation tasks, while the implementation runtime is the set of software components that realize what the model promises.

  This model is called map/reduce, but the meanings of the two terms differ somewhat from their functional counterparts:
  - the map semantics are the same as in functional programming languages: the transformer (the mapper in Google's paper) is applied to each element of the list;
  - the reduce semantics differ: here, the combiner (the reducer in Google's paper) is applied to multiple sub-sections of the elements in the list and thus produces multiple reduce results, whereas in a functional programming language it is applied to all the elements and produces only one result.

  Conceptually, how are the elements divided into multiple sub-sections?
  To solve this problem, the model introduces some structure on the elements that are produced by the mapper and consumed by the reducer: each element/record has two parts, a key and a value. All the elements are then partitioned according to the key. The records with the same key form a sub-section and are passed to a reducer invocation as a whole.
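
  As an illustration, here is a toy, single-machine simulation of the model in Haskell (the word-count task and the mapper/reducer definitions are my own illustration, not code from Google's paper):

    import qualified Data.Map as M

    -- mapper: each input record (a line of text) yields (key, value) pairs
    mapper :: String -> [(String, Int)]
    mapper line = [(w, 1) | w <- words line]

    -- reducer: combines all the values that share a single key
    reducer :: String -> [Int] -> Int
    reducer _key counts = sum counts

    -- group the mapper output by key, then apply the reducer per sub-section
    mapReduce :: [String] -> M.Map String Int
    mapReduce inputs =
      M.mapWithKey reducer
        (M.fromListWith (++) [ (k, [v]) | (k, v) <- concatMap mapper inputs ])

    main :: IO ()
    main = print (mapReduce ["to be or not", "to be"])
    -- fromList [("be",2),("not",1),("or",1),("to",2)]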

  From the implementation's perspective, the most important advantage of this programming model is that it enables automatic parallelization and distribution of large-scale data processing:

  - the mapper is applied to each record independently, so the map phase is a data-parallel problem by itself; we just need to distribute the input data, split on record boundaries, among the processing nodes (see the sketch after this list).
  - the reducer is applied to one sub-section at a time, so we just need to distribute the sub-sections among the processing nodes.
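
  As a single-machine sketch of the first point, the parMap combinator from Haskell's parallel package evaluates each mapper invocation in parallel; this only illustrates the data-parallel idea on one machine, it is not a distributed runtime:

    import Control.Parallel.Strategies (parMap, rdeepseq)

    -- the same toy mapper as above: one input record in, (key, value) pairs out
    mapper :: String -> [(String, Int)]
    mapper line = [(w, 1) | w <- words line]

    -- evaluate every mapper invocation in parallel; this is safe because
    -- mapper invocations do not communicate with each other
    parallelMapPhase :: [String] -> [[(String, Int)]]
    parallelMapPhase = parMap rdeepseq mapper

    main :: IO ()  -- build with: ghc -threaded, run with: +RTS -N
    main = print (parallelMapPhase ["to be or not", "to be"])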

  Another implementation problem is failover - what should be done when a failure happens?
  Simple! The runtime just re-executes the specific failed mapper/reducer; the other mappers/reducers are not bothered at all. Because there is no communication among the mappers, nor among the reducers, this re-execution is semantically correct.
  Since the input of a mapper is persisted in a reliable storage system, recovering from a failed mapper only requires re-executing that mapper. But the input of a reducer (which is also the output of the mappers) is persisted on the workers' local storage, so a re-executed reducer may find some of its input unavailable (for example, because a node holding intermediate data crashed). In that situation, recovering from a failed reducer requires re-executing both some mappers and the reducer itself.

[Reference]
1. Functional Programming
2. Higher order functions - map, fold and filter
3. Map/Reduce/Filter in Python
4. Map/Reduce in PHP
5. Google's MapReduce Programming Model — Revisited
6. MapReduce: Simplified Data Processing on Large Clusters
7. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters