## A Model of Computation for MapReduce (2010)

Venue: Proc. ACM-SIAM SODA

Citations: 29 (5 self)

### BibTeX

@INPROCEEDINGS{Karloff10amodel,

author = {Howard Karloff and Siddharth Suri and Sergei Vassilvitskii},

title = {A Model of Computation for MapReduce},

booktitle = {Proc. ACM-SIAM SODA},

year = {2010}

}

### Abstract

In recent years the MapReduce framework has emerged as one of the most widely used parallel computing platforms for processing data on terabyte and petabyte scales. Used daily at companies such as Yahoo!, Google, Amazon, and Facebook, and adopted more recently by several universities, it allows for easy parallelization of data-intensive computations over many machines. One key feature of MapReduce that differentiates it from previous models of parallel computation is that it interleaves sequential and parallel computation. We propose a model of efficient computation using the MapReduce paradigm. Since MapReduce is designed for computations over massive data sets, our model limits the number of machines and the memory per machine to be substantially sublinear in the size of the input. On the other hand, we place very loose restrictions on the computational power of any individual machine: our model allows each machine to perform sequential computations in time polynomial in the size of the original input. We compare MapReduce to the PRAM model of computation. We prove a simulation lemma showing that a large class of PRAM algorithms can be efficiently simulated via MapReduce. The strength of MapReduce, however, lies in the fact that it uses both sequential and parallel computation. We demonstrate how algorithms can take advantage of this fact to compute an MST of a dense graph in only two rounds, as opposed to Ω(log(n)) rounds needed in the standard PRAM model. We show how to evaluate a wide class of functions using the MapReduce framework. We conclude by applying this result to show how to compute some basic algorithmic problems such as undirected s-t connectivity in the MapReduce framework.
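The two-round MST claim can be made concrete with a small sketch. The abstract does not spell out the algorithm, so the scheme below is an assumption: vertices are split into `k` blocks, round 1 computes a minimum spanning forest on the subgraph induced by each pair of blocks (each pair would go to its own reducer, simulated here by a plain loop), and round 2 runs a sequential MST on the surviving edges. The names `kruskal`, `two_round_mst`, and the round-robin partition are hypothetical choices for illustration.

```python
from itertools import combinations


def kruskal(vertices, edges):
    """Sequential minimum spanning forest (Kruskal with union-find).

    Edges are (weight, u, v) tuples; returns the chosen edges as a list.
    """
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    forest = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((w, u, v))
    return forest


def two_round_mst(vertices, edges, k):
    """Illustrative two-round scheme (an assumption, not taken from the abstract)."""
    if k < 2:
        # With a single block there is nothing to parallelize.
        return kruskal(vertices, edges)

    # Hypothetical partition: vertex at position i goes to block i % k.
    blocks = [set(vertices[i::k]) for i in range(k)]

    # Round 1: one "reducer" per pair of blocks keeps only the minimum
    # spanning forest of the subgraph induced by that pair.
    survivors = set()
    for a, b in combinations(range(k), 2):
        part = blocks[a] | blocks[b]
        local = [(w, u, v) for (w, u, v) in edges if u in part and v in part]
        survivors.update(kruskal(part, local))

    # Round 2: a single machine finishes sequentially on the sparsified graph.
    return kruskal(vertices, survivors)
```

An edge dropped in round 1 is the heaviest edge on some cycle of an induced subgraph, so it cannot be needed for a minimum spanning tree of the whole graph; that is why the sequential second round on the survivors still returns a tree of minimum total weight. In a real deployment the block count would be chosen so that each induced subgraph fits in the sublinear memory of one machine.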

### Citations

1703 | MapReduce: Simplified data processing on large clusters
- Dean, Ghemawat
- 2004
Citation Context ...AT&T Labs-Research, howard@research.att.com † Yahoo! Research, suri@yahoo-inc.com ‡ Yahoo! Research, sergei@yahoo-inc.com of available data. The MapReduce framework was originally developed at Google [4], but has recently seen wide adoption and has become the de facto standard for large scale data analysis. Publicly available statistics indicate that MapReduce is used to process more than 10 petabyte...

1308 | Introduction to Parallel Algorithms and Architectures: Arrays
- Leighton
- 1992
Citation Context ...SP, proposed by Valiant [12]. These three models are all architecture independent. Other researchers have studied architecture-dependent models, such as the fixed-connection network model described in [10]. Since the most prevalent model in theoretical computer science is the PRAM, it seems most appropriate to compare our MapReduce model to it. In a PRAM, an arbitrary number of processors, sharing an u...

1130 | A Bridging Model for Parallel Computation
- Valiant
- 1990
Citation Context ...see [1] for a survey of them. While the most popular by far for theoretical study is the PRAM, probably the next two most popular are LogP, proposed by Culler et al. [2], and BSP, proposed by Valiant [12]. These three models are all architecture independent. Other researchers have studied architecture-dependent models, such as the fixed-connection network model described in [10]. Since the most prevale...

895 | Approximation Algorithms
- Vazirani
- 2001
Citation Context ...$O(n^{2-2\epsilon})$. Furthermore, the space of the reducer is restricted to $O(n^{1-\epsilon})$; therefore for all $k$, $|k| + s(V_{k,r})$ is $O(n^{1-\epsilon})$. Using Graham's greedy algorithm for the minimum makespan scheduling problem [7, 13], we can conclude that the maximum number of bits mapped to any one machine is no more than the average load per machine plus the maximum size of any $\langle k, V_{k,r} \rangle$ pair. Thus, ≤ $s(V_r) + s(K_r)$ number of ma...

497 | LogP: Towards a realistic model of parallel computation
- Culler, Karp, et al.
- 1993
(Show Context)
Citation Context ... been proposed in the literature; see [1] for a survey of them. While the most popular by far for theoretical study is the PRAM, probably the next two most popular are LogP, proposed by Culler et al. =-=[2]-=-, and BSP, proposed by Valiant [12]. These three models are all architecture independent. Other researchers have studied architecture-dependent models, such as the fixedconnection network model descri... |

135 | Google news personalization: scalable online collaborative filtering
- Das, Datar, et al.
Citation Context ...and sequential computation. In recent years several nontrivial MapReduce algorithms have emerged, from computing the diameter of a graph [9] to implementing the EM algorithm to cluster massive data sets [3]. Each of these algorithms gives some insights into what can be done in a MapReduce framework; however, there is a lack of rigorous algorithmic analyses of the issues involved. In this work we begin b...

31 | DOULION: counting triangles in massive graphs with a coin
- Tsourakakis, Kang, et al.
- 2009
Citation Context ...lizable computations. Kang et al. [9] show how to use MapReduce to compute diameters of massive graphs, taking as an example a webgraph with 1.5 billion nodes and 5.5 billion arcs. Tsourakakis et al. [11] use MapReduce for counting the total number of triangles in a graph. Motivated by personalized news results, Das et al. [3] implement the EM clustering algorithm on MapReduce. Overall, each of these ...

19 | On distributing symmetric streaming computations
- Feldman, Muthukrishnan, et al.
- 2008
Citation Context ... the total number of machines available to be substantially sublinear in the data size. Time Finally, there is a question of the total running time available. In a major difference from previous work [6], we do not restrict the power of the individual reducer, except that we require that both the map and the reduce functions run in time polynomial in the original input length in order to ensure effic...

11 | HADI: Fast diameter estimation and mining in massive graphs with Hadoop
- Kang, Tsourakakis, et al.
- 2008
Citation Context ...ed models of parallel computation because it interleaves parallel and sequential computation. In recent years several nontrivial MapReduce algorithms have emerged, from computing the diameter of a graph [9] to implementing the EM algorithm to cluster massive data sets [3]. Each of these algorithms gives some insights into what can be done in a MapReduce framework; however, there is a lack of rigorous al...

10 | Bounds on multiprocessing anomalies and related packing algorithms
- Graham
- 1972
Citation Context ...$O(n^{2-2\epsilon})$. Furthermore, the space of the reducer is restricted to $O(n^{1-\epsilon})$; therefore for all $k$, $|k| + s(V_{k,r})$ is $O(n^{1-\epsilon})$. Using Graham's greedy algorithm for the minimum makespan scheduling problem [7, 13], we can conclude that the maximum number of bits mapped to any one machine is no more than the average load per machine plus the maximum size of any $\langle k, V_{k,r} \rangle$ pair. Thus, ≤ $s(V_r) + s(K_r)$ number of ma...

5 | A survey of models of parallel computation
- Campbell
- 1997
Citation Context ...ls of parallel computation. After that we discuss other works that use MapReduce. 4.1 Comparing MapReduce and PRAMs. Numerous models of parallel computation have been proposed in the literature; see [1] for a survey of them. While the most popular by far for theoretical study is the PRAM, probably the next two most popular are LogP, proposed by Culler et al. [2], and BSP, proposed by Valiant [12]. T...

3 | Yahoo! partners with four top universities to advance cloud computing systems and application research. Yahoo! press release, 2009
- Yahoo
- 2009
Citation Context ...s more than 10 petabytes of information per day at Google alone [5]. An open source version, called Hadoop, has recently been developed, and is seeing increased adoption both in industry and academia [14]. Over 70 companies use Hadoop including Yahoo!, Facebook, Adobe, and IBM [8]. Moreover, Amazon's Elastic Compute Cloud (EC2) is a Hadoop cluster where users can upload large data sets and rent proces...