## Google’s MapReduce Programming Model — Revisited

### BibTeX

@MISC{Lämmel_google’smapreduce,

author = {Ralf Lämmel},

title = {Google’s MapReduce Programming Model — Revisited},

year = {}

}

### Abstract

Google’s MapReduce programming model serves for processing large data sets in a massively parallel manner. We deliver the first rigorous description of the model including its advancement as Google’s domain-specific language Sawzall. To this end, we reverse-engineer the seminal papers on MapReduce and Sawzall, and we capture our findings as an executable specification. We also identify and resolve some obscurities in the informal presentation given in the seminal papers. We use typed functional programming (specifically Haskell) as a tool for design recovery and executable specification. Our development comprises three components: (i) the basic program skeleton that underlies MapReduce computations; (ii) the opportunities for parallelism in executing MapReduce computations; (iii) the fundamental characteristics of Sawzall’s aggregators as an advancement of the MapReduce approach. Our development does not formalize the more implementational aspects of an actual, distributed execution of MapReduce computations.
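Component (i) of the abstract, the basic program skeleton, can be sketched in a few lines of Haskell. This is a minimal sketch under stated assumptions, not the paper's specification: the helper names `mapPerKey`, `groupByKey` and `reducePerKey` follow the excerpt quoted in the citation contexts, whereas the type signature, the helper bodies, the `Maybe` result of REDUCE and the `wordOccurrenceCount` example are assumptions made for illustration.

```haskell
import qualified Data.Map as Map
import Data.Map (Map)

-- Illustrative skeleton: map each input pair, group the intermediate
-- pairs by key, then reduce each group. Types and bodies are assumed.
mapReduce
  :: Ord k2
  => (k1 -> v1 -> [(k2, v2)])   -- MAP: one input pair to many intermediate pairs
  -> (k2 -> [v2] -> Maybe v3)   -- REDUCE: one group to an optional result
  -> Map k1 v1                  -- input key/value pairs
  -> Map k2 v3                  -- output key/value pairs
mapReduce mAP rEDUCE = reducePerKey . groupByKey . mapPerKey
  where
    -- Apply MAP to every input pair and concatenate the results.
    mapPerKey    = concatMap (uncurry mAP) . Map.toList
    -- Collect all values per intermediate key, preserving their order.
    groupByKey   = Map.fromListWith (flip (++)) . map (\(k, v) -> (k, [v]))
    -- Apply REDUCE per group; groups reduced to Nothing are dropped.
    reducePerKey = Map.mapMaybeWithKey rEDUCE

-- Hypothetical example: count word occurrences across named documents.
wordOccurrenceCount :: Map String String -> Map String Int
wordOccurrenceCount = mapReduce mAP rEDUCE
  where
    mAP _docName text = [(w, 1) | w <- words text]
    rEDUCE _word      = Just . sum
```

Loaded into GHCi, `wordOccurrenceCount (Map.fromList [("doc1", "a b a"), ("doc2", "b c")])` evaluates to `fromList [("a",2),("b",2),("c",1)]`. Because each of the three phases is a pure function over collections, the sketch also hints at where component (ii), the opportunities for parallelism, would apply.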

### Citations

1734 | MapReduce: Simplified data processing on large clusters
- Dean, Ghemawat
- 2004
Citation Context: ... Google’s MapReduce programming model [10] serves for processing large data sets in a massively parallel manner (subject to a ‘MapReduce implementation’). The programming model is based on the following, simple concepts: (i) iteration over ...

922 | The Google file system
- Ghemawat, Gobioff, et al.
- 2003
Citation Context: ...the seminal MapReduce paper, which is based on large networked clusters of commodity machines with local store while also exploiting other bits of Google infrastructure such as the Google file system [13]. The strategy reflects that the chief challenge is network performance in the view of the scarce resource network bandwidth. The main trick is to exploit locality of data. That is, parallelism is ali...

874 | Design Patterns
- Gamma, Helm, et al.
- 1994
Citation Context: ...and succinctness of types is also essential for making them useful in reflection on designs. In fact, how would we somehow systematically reflect on designs other than based on types? Design patterns [12] may come to mind. However, we contend that their actual utility for the problem at hand is non-obvious, but one may argue that we are in the process of discovering a design pattern, and we inform our...

443 | Can programming be liberated from the von Neumann style? A functional style and its algebra of programs
- Backus
- 1978
Citation Context: ... Haskell illustrations that can be safely skipped by the reader with proficiency in typed functional programming. Illustration of map: Let us double all numbers in a list: Haskell-prompt> map ((*) 2) [1,2,3] [2,4,6] Here, the expression ‘((*) 2)’ denotes multiplication by 2. In (Haskell’s) lambda notation, ‘((*) 2)’ can also be rendered as ‘\x -> 2*x’. Illustration of foldl: Let us compute the sum of a...

302 | Functional programming with bananas, lenses, envelopes and barbed wire
- Meijer, Fokkinga, et al.
- 1991
Citation Context: ...> a -> [a] -> a reduce = foldl Asides on folding • For the record, we mention that the combinators map and foldl can actually be both defined in terms of the right-associative fold operation, foldr [23, 20]. Hence, foldr can be considered as the fundamental recursion scheme for list traversal. The functions that are expressible in terms of foldr are also known as ‘list catamorphisms’ or ‘bananas’. We in...

287 | Why functional programming matters
- Hughes
- 1989

229 | The Gamma database machine project
- DeWitt, Ghandeharizadeh, et al.
- 1990
Citation Context: ...ed mapping. That is, grouping could be performed for any fraction of intermediate data and distributed grouping results could be merged centrally, just as in the case of a parallel-merge-all strategy [11]. Parallel map over groups: Reduction is performed for each group (which is a key with a list of values) separately. Again, the pattern of a list map applies here; total data parallelism is admitted f...

220 | Common Lisp, the Language
- Steele
- 1990
Citation Context: ...effects and meet certain algebraic properties. Given the quoted reference to Lisp, let us recall the map and reduce combinators of Lisp. The following two quotes stem from “Common Lisp, the Language” [30]: map result-type function sequence &rest more-sequences “The function must take as many arguments as there are sequences provided; at least one sequence must be provided. The result of map is a seq...

217 | An introduction to the theory of lists
- Bird
- 1987
Citation Context: ... illustrations that can be safely skipped by the reader with proficiency in typed functional programming. Illustration of map: Let us double all numbers in a list: Haskell-prompt> map ((*) 2) [1,2,3] [2,4,6] Here, the expression ‘((*) 2)’ denotes multiplication by 2. In (Haskell’s) lambda notation, ‘((*) 2)’ can also be rendered as ‘\x -> 2*x’. Illustration of foldl: Let us compute the sum of all numbe...

193 | Programming parallel algorithms
- Blelloch
- 1996
Citation Context: ...parallelism Parallel map over input: Input data is processed such that key/value pairs are processed one by one. It is well known that this pattern of a list map is amenable to total data parallelism [27, 28, 5, 29]. That is, in principle, the list map may be executed in parallel at the granularity level of single elements. Clearly, MAP must be a pure function so that the order of processing key/value pairs does...

182 | Interpreting the data: Parallel analysis with Sawzall
- Pike, Dorward, et al.
- 2005
Citation Context: ...more demanding) programming models for subproblems. In the present paper, we deliver the first rigorous description of the model including its advancement as Google’s domain-specific language Sawzall [26]. To this end, we reverse-engineer the seminal MapReduce and Sawzall papers, and we capture our findings as an executable specification. We also identify and resolve some obscurities in the informal p...

135 | Models and languages for parallel computation
- Skillicorn, Talia
- 1998
Citation Context: ...parallelism Parallel map over input: Input data is processed such that key/value pairs are processed one by one. It is well known that this pattern of a list map is amenable to total data parallelism [27, 28, 5, 29]. That is, in principle, the list map may be executed in parallel at the granularity level of single elements. Clearly, MAP must be a pure function so that the order of processing key/value pairs does...

105 | Routing, merging and sorting on parallel models of computation
- Borodin, Hopcroft
- 1985
Citation Context: ... illustrations that can be safely skipped by the reader with proficiency in typed functional programming. Illustration of map: Let us double all numbers in a list: Haskell-prompt> map ((*) 2) [1,2,3] [2,4,6] Here, the expression ‘((*) 2)’ denotes multiplication by 2. In (Haskell’s) lambda notation, ‘((*) 2)’ can also be rendered as ‘\x -> 2*x’. Illustration of foldl: Let us compute the sum of all numbe...

91 | Architecture-Independent Parallel Computation
- Skillicorn
- 1990

88 | Algebra of Programming
- Bird, de Moor
- 1997
Citation Context: ... Haskell illustrations that can be safely skipped by the reader with proficiency in typed functional programming. Illustration of map: Let us double all numbers in a list: Haskell-prompt> map ((*) 2) [1,2,3] [2,4,6] Here, the expression ‘((*) 2)’ denotes multiplication by 2. In (Haskell’s) lambda notation, ‘((*) 2)’ can also be rendered as ‘\x -> 2*x’. Illustration of foldl: Let us compute the sum of a...

88 | Haskell: The Craft of Functional Programming
- Thompson
- 1999
Citation Context: ...e checking, full type inference, powerful abstraction forms, compositionality, and algebraic reasoning style. This insight has been described more appropriately by Hughes, Thompson, and surely others [19, 33]. Acknowledgments: I would like to acknowledge feedback I received through presentations on the subject: University of Innsbruck (January 2006), University of Nottingham (January 2006), University of ...

85 | Algebraic data types and program transformation
- Malcolm
- 1990
Citation Context: ...ifying its ingredients, in fact, by naming the monoid for reduction. Hence, the actual reduction can still be composed together in different ways. That is, we can form a list homomorphism in two ways [21, 20]: -- Separated phases for mapping and reduction phasing f = mconcat . map f -- first map, then reduce -- Mapping and reduction ‘fused’ fusing f = foldr (mappend . f) mempty -- map and reduce combined ...

60 | A tutorial on the universality and expressiveness of fold
- Hutton
- 1999
Citation Context: ...> a -> [a] -> a reduce = foldl Asides on folding • For the record, we mention that the combinators map and foldl can actually be both defined in terms of the right-associative fold operation, foldr [23, 20]. Hence, foldr can be considered as the fundamental recursion scheme for list traversal. The functions that are expressible in terms of foldr are also known as ‘list catamorphisms’ or ‘bananas’. We in...

53 | A theory of overloading
- Stuckey, Sulzmann
Citation Context: ... implies programming convenience because the aggregator type can be therefore ‘inferred’. (Multi-parameter type classes with functional dependencies go beyond Haskell 98, but they are well understood [31], well implemented and widely used.) In the case of non-collection-like aggregators, e equals m, and mInsert equals mappend. In the case of collection-like aggregators, we designate a new type to emis...

49 | The under-appreciated unfold
- Gibbons, Jones
- 1998
Citation Context: ...otentially many intermediate key/value pairs. For the record, we mention that the typical kind of MAP function could be characterized as an instance of unfolding (also known as anamorphisms or lenses [23, 16, 1]). 3 The MapReduce abstraction: We will now enter reverse-engineering mode with the goal to extract an executable specification (in fact, a relatively simple Haskell function) that captures the abstr...

43 | Calculating Compilers
- Meijer
- 1992
Citation Context: ...ual pass over the input. We can exploit the monoid of tuples. The so-called ‘banana split’ property of foldr implies that the results of multiple passes coincide with the projections of a single pass [22, 20]. Thus, the Sawzall program is translated to a Haskell program (which is, by the way, shorter and more polymorphic) as follows: firstSawzall x = (Sum 1, Sum x, Sum (x*x)) The monoid of triplets is rea...

42 | Parallel programming with list homomorphisms
- Cole
- 1995
Citation Context: ...nation is incomplete. 5.2 List homomorphisms: The Sawzall paper does not say so, but we contend that the essence of a Sawzall program is to identify the characteristic arguments of a list homomorphism [4, 9, 28, 17, 8]: a function to be mapped over the list elements as well as the monoid to be used for the reduction. (A monoid is a simple algebraic structure: a set, an associative operation, and its unit.) List ...

28 | Sorting morphisms
- Augusteijn
- 1998
Citation Context: ... Haskell illustrations that can be safely skipped by the reader with proficiency in typed functional programming. Illustration of map: Let us double all numbers in a list: Haskell-prompt> map ((*) 2) [1,2,3] [2,4,6] Here, the expression ‘((*) 2)’ denotes multiplication by 2. In (Haskell’s) lambda notation, ‘((*) 2)’ can also be rendered as ‘\x -> 2*x’. Illustration of foldl: Let us compute the sum of a...

27 | Calculating Functional Programs
- Gibbons
- 2002
Citation Context: ... mapReduce function: mapReduce mAP rEDUCE input = reducePerKey $ groupByKey $ mapPerKey input where ... For the record, the systematic use of function combinators like ‘.’ leads to ‘point-free’ style [3, 14, 15]. The term ‘point’ refers to explicit arguments, such as input in the illustrative code snippets, listed above. That is, a point-free definition basically only uses function combinators but captures n...

27 | Systematic efficient parallelization of scan and other list homomorphisms
- Gorlatch
- 1996
Citation Context: ...nation is incomplete. 5.2 List homomorphisms: The Sawzall paper does not say so, but we contend that the essence of a Sawzall program is to identify the characteristic arguments of a list homomorphism [4, 9, 28, 17, 8]: a function to be mapped over the list elements as well as the monoid to be used for the reduction. (A monoid is a simple algebraic structure: a set, an associative operation, and its unit.) List ...

14 | Parallelizing conditional recurrences
- Chin, Darlington, et al.
Citation Context: ...nation is incomplete. 5.2 List homomorphisms: The Sawzall paper does not say so, but we contend that the essence of a Sawzall program is to identify the characteristic arguments of a list homomorphism [4, 9, 28, 17, 8]: a function to be mapped over the list elements as well as the monoid to be used for the reduction. (A monoid is a simple algebraic structure: a set, an associative operation, and its unit.) List ...

11 | Fast parallel sorting algorithms
- Hirschberg
- 1978
Citation Context: ...an be avoided. Parallel grouping of intermediate data: The grouping of intermediate data by key, as needed for the reduce phase, is essentially a sorting problem. Various parallel sorting models exist [18, 6, 32]. If we assume a distributed map phase, then it is reasonable to anticipate grouping to be aligned with distributed mapping. That is, grouping could be performed for any fraction of intermediate data ...

11 | Foundations of Parallel Programming. Number 6
- Skillicorn
- 1994
Citation Context: ...parallelism Parallel map over input: Input data is processed such that key/value pairs are processed one by one. It is well known that this pattern of a list map is amenable to total data parallelism [27, 28, 5, 29]. That is, in principle, the list map may be executed in parallel at the granularity level of single elements. Clearly, MAP must be a pure function so that the order of processing key/value pairs does...

10 | A pointless derivation of radixsort
- Gibbons
- 1999
Citation Context: ... mapReduce function: mapReduce mAP rEDUCE input = reducePerKey $ groupByKey $ mapPerKey input where ... For the record, the systematic use of function combinators like ‘.’ leads to ‘point-free’ style [3, 14, 15]. The term ‘point’ refers to explicit arguments, such as input in the illustrative code snippets, listed above. That is, a point-free definition basically only uses function combinators but captures n...

10 | Lexically scoped type variables
- Jones, Shields
- 2004
Citation Context: ...y to input-value mapping -> ? ? ? -- What’s the result and its type? More Haskell trivia: We must note that our specification relies on a Haskell 98 extension for lexically scoped type variables [25]. This extension allows us to reuse type variables from the signature of the top-level function mapReduce in the signatures of local helpers such as mapPerKey. The discovery of mapPerKey’s result ty...

3 | Parallel database sorting
- Taniar, Rahayu
- 2002
Citation Context: ...an be avoided. Parallel grouping of intermediate data: The grouping of intermediate data by key, as needed for the reduce phase, is essentially a sorting problem. Various parallel sorting models exist [18, 6, 32]. If we assume a distributed map phase, then it is reasonable to anticipate grouping to be aligned with distributed mapping. That is, grouping could be performed for any fraction of intermediate data ...
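
Several of the excerpts above describe a Sawzall program as a list homomorphism: a function mapped over the elements plus a monoid used for the reduction. The following sketch assembles the quoted fragments (`phasing`, `fusing`, `firstSawzall`) into a loadable module. The type signatures, and the restriction of `firstSawzall` to `Int`, are assumptions added for illustration; the excerpt itself describes the original as more polymorphic.

```haskell
import Data.Monoid (Sum(..))

-- Separated phases: first map, then reduce with the monoid.
phasing :: Monoid m => (a -> m) -> [a] -> m
phasing f = mconcat . map f

-- Mapping and reduction fused into a single right fold.
fusing :: Monoid m => (a -> m) -> [a] -> m
fusing f = foldr (mappend . f) mempty

-- Count, sum, and sum of squares in one pass, using the monoid of
-- triples (Int is an assumption made to keep the example concrete).
firstSawzall :: Int -> (Sum Int, Sum Int, Sum Int)
firstSawzall x = (Sum 1, Sum x, Sum (x * x))
```

For the input `[1,2,3]`, both `phasing firstSawzall` and `fusing firstSawzall` yield a triple whose components unwrap to 3, 6 and 14. Because a monoid's operation is associative, the list could equally be split into chunks that are reduced independently and then combined.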