Results 1  10
of
34
A Model of Computation for MapReduce
 Proc. ACMSIAM SODA
, 2010
"... In recent years the MapReduce framework has emerged as one of the most widely used parallel computing platforms for processing data on terabyte and petabyte scales. Used daily at companies such as Yahoo!, Google, Amazon, and Facebook, and adopted more recently by several universities, it allows for ..."
Abstract

Cited by 103 (7 self)
 Add to MetaCart
(Show Context)
In recent years the MapReduce framework has emerged as one of the most widely used parallel computing platforms for processing data on terabyte and petabyte scales. Used daily at companies such as Yahoo!, Google, Amazon, and Facebook, and adopted more recently by several universities, it allows for easy parallelization of data intensive computations over many machines. One key feature of MapReduce that differentiates it from previous models of parallel computation is that it interleaves sequential and parallel computation. We propose a model of efficient computation using the MapReduce paradigm. Since MapReduce is designed for computations over massive data sets, our model limits the number of machines and the memory per machine to be substantially sublinear in the size of the input. On the other hand, we place very loose restrictions on the computational power of of any individual machine— our model allows each machine to perform sequential computations in time polynomial in the size of the original input. We compare MapReduce to the PRAM model of computation. We prove a simulation lemma showing that a large class of PRAM algorithms can be efficiently simulated via MapReduce. The strength of MapReduce, however, lies in the fact that it uses both sequential and parallel computation. We demonstrate how algorithms can take advantage of this fact to compute an MST of a dense graph in only two rounds, as opposed to Ω(log(n)) rounds needed in the standard PRAM model. We show how to evaluate a wide class of functions using the MapReduce framework. We conclude by applying this result to show how to compute some basic algorithmic problems such as undirected st connectivity in the MapReduce framework. 1
PrivacyPreserving Access of Outsourced Data via Oblivious RAM Simulation
, 2011
"... Suppose a client, Alice, has outsourced her data to an external storage provider, Bob, because he has capacity for her massive data set, of size n, whereas her private storage is much smaller—say, of size O(n1/r), for some constant r> 1. Alice trusts Bob to maintain her data, but she would like t ..."
Abstract

Cited by 66 (9 self)
 Add to MetaCart
(Show Context)
Suppose a client, Alice, has outsourced her data to an external storage provider, Bob, because he has capacity for her massive data set, of size n, whereas her private storage is much smaller—say, of size O(n1/r), for some constant r> 1. Alice trusts Bob to maintain her data, but she would like to keep its contents private. She can encrypt her data, of course, but she also wishes to keep her access patterns hidden from Bob as well. We describe schemes for the oblivious RAM simulation problem with a small logarithmic or polylogarithmic amortized increase in access times, with a very high probability of success, while keeping the external storage to be of size O(n). To achieve this, our algorithmic contributions include a parallel MapReduce cuckoohashing algorithm and an externalmemory dataoblivious sorting algorithm.
Fast clustering using MapReduce
 In KDD
, 2011
"... Clustering problems have numerous applications and are becoming more challenging with the growing size of data available. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the ..."
Abstract

Cited by 24 (4 self)
 Add to MetaCart
(Show Context)
Clustering problems have numerous applications and are becoming more challenging with the growing size of data available. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the practical and popular clustering problems kcenter and kmedian. We develop fast clustering algorithms with constant factor approximation guarantees. From a theoretical perspective, we give the first analysis showing several clustering algorithms are in MRC 0, a theoretical MapReduce class introduced by Karloff et al. [26]. Our algorithms use sampling to decrease the data size and run a time consuming clustering algorithm such as local search or Lloyd’s algorithm on the reduced data set. Our algorithms have sufficient flexibility to be used in practice since they run in a constant number of MapReduce rounds. We complement these results by performing experiments using our algorithms. We compare the empirical performance of our algorithms to several sequential and parallel algorithms for the kmedian problem. The experiments show that our algorithms ’ solutions are similar or better than the other algorithms, while running faster than any other parallel algorithm that was tested for sufficiently large data sets. 1.
Mergeable Summaries
"... We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a single summary on the union of the two data sets, while preserving the error and size guarantees. This property means t ..."
Abstract

Cited by 22 (7 self)
 Add to MetaCart
(Show Context)
We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a single summary on the union of the two data sets, while preserving the error and size guarantees. This property means that the summaries can be merged in a way like other algebraic operators such as sum and max, which is especially useful for computing summaries on massive distributed data. Several data summaries are trivially mergeable by construction, most notably all the sketches that are linear functions of the data sets. But some other fundamental ones like those for heavy hitters and quantiles, are not (known to be) mergeable. In this paper, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for εapproximate heavy hitters, there is a deterministic mergeable summary of size O(1/ε); for εapproximate quantiles, there is a deterministic summary of size O ( 1 log(εn)) that has a restricted form of mergeability, ε and a randomized one of size O ( 1 1 log3/2) with full mergeε ε ability. We also extend our results to geometric summaries such as εapproximations and εkernels. We also achieve two results of independent interest: (1) we provide the best known randomized streaming bound for εapproximate quantiles that depends only on ε, of size O ( 1 1 log3/2), and (2) we demonstrate that the MG and the ε ε SpaceSaving summaries for heavy hitters are isomorphic. Supported by NSF under grants CNS0540347, IIS07
Communication steps for parallel query processing.
 In Proceedings of the 32nd ACM Symposium on Principles of Database Systems, PODS,
, 2013
"... ABSTRACT We consider the problem of computing a relational query q on a large input database of size n, using a large number p of servers. The computation is performed in rounds, and each server can receive only O(n/p 1−ε ) bits of data, where ε ∈ [0, 1] is a parameter that controls replication. We ..."
Abstract

Cited by 22 (4 self)
 Add to MetaCart
(Show Context)
ABSTRACT We consider the problem of computing a relational query q on a large input database of size n, using a large number p of servers. The computation is performed in rounds, and each server can receive only O(n/p 1−ε ) bits of data, where ε ∈ [0, 1] is a parameter that controls replication. We examine how many global communication steps are needed to compute q. We establish both lower and upper bounds, in two settings. For a single round of communication, we give lower bounds in the strongest possible model, where arbitrary bits may be exchanged; we show that any algorithm requires ε ≥ 1−1/τ * , where τ * is the fractional vertex cover of the hypergraph of q. We also give an algorithm that matches the lower bound for a specific class of databases. For multiple rounds of communication, we present lower bounds in a model where routing decisions for a tuple are tuplebased. We show that for the class of treelike queries there exists a tradeoff between the number of rounds and the space exponent ε. The lower bounds for multiple rounds are the first of their kind. Our results also imply that transitive closure cannot be computed in O(1) rounds of communication.
Sorting, searching, and simulation in the MapReduce framework
 in Proc. of the 22nd International Symp. on Algorithms and Computation
"... In this paper, we study the MapReduce framework from an algorithmic standpoint and demonstrate the usefulness of our approach by designing and analyzing efficient MapReduce algorithms for fundamental sorting, searching, and simulation problems. This study is motivated by a goal of ultimately puttin ..."
Abstract

Cited by 19 (5 self)
 Add to MetaCart
(Show Context)
In this paper, we study the MapReduce framework from an algorithmic standpoint and demonstrate the usefulness of our approach by designing and analyzing efficient MapReduce algorithms for fundamental sorting, searching, and simulation problems. This study is motivated by a goal of ultimately putting the MapReduce framework on an equal theoretical footing with the wellknown PRAM and BSP parallel models, which would benefit both the theory and practice of MapReduce algorithms. We describe efficient MapReduce algorithms for sorting, multisearching, and simulations of parallel algorithms specified in the BSP and CRCW PRAM models. We also provide some applications of these results to problems in parallel computational geometry for the MapReduce framework, which result in efficient MapReduce algorithms for sorting, 2 and 3dimensional convex hulls, and fixeddimensional linear programming. For the case when mappers and reducers have a memory/messageI/O size of M = Θ(N ), for a small constant > 0, all of our MapReduce algorithms for these applications run in a constant number of rounds. ar X iv
Densest Subgraph in Streaming and MapReduce
"... The problem of finding locally dense components of a graph is an important primitive in data analysis, with wideranging applications from community mining to spam detection and the discovery of biological network modules. In this paper we present new algorithms for finding the densest subgraph in t ..."
Abstract

Cited by 16 (3 self)
 Add to MetaCart
(Show Context)
The problem of finding locally dense components of a graph is an important primitive in data analysis, with wideranging applications from community mining to spam detection and the discovery of biological network modules. In this paper we present new algorithms for finding the densest subgraph in the streaming model. For any ɛ> 0, our algorithms make O(log 1+ɛ n) passes over the input and find a subgraph whose density is guaranteed to be within a factor 2(1 + ɛ) of the optimum. Our algorithms are also easily parallelizable and we illustrate this by realizing them in the MapReduce model. In addition we perform extensive experimental evaluation on massive realworld graphs showing the performance and scalability of our algorithms in practice. 1.
The Continuous Distributed Monitoring Model
, 2013
"... In the model of continuous distributed monitoring, a number of observers each see a stream of observations. Their goal is to work together to compute a function of the union of their observations. This can be as simple as counting the total number of observations, or more complex nonlinear function ..."
Abstract

Cited by 8 (0 self)
 Add to MetaCart
(Show Context)
In the model of continuous distributed monitoring, a number of observers each see a stream of observations. Their goal is to work together to compute a function of the union of their observations. This can be as simple as counting the total number of observations, or more complex nonlinear functions such as tracking the entropy of the induced distribution. Assuming that it is too costly to simply centralize all the observations, it becomes quite challenging to design solutions which provide a good approximation to the current answer, while bounding the communication cost of the observers, and their other resources such as their space usage. This survey introduces this model, and describe a selection results in this setting, from the simple counting problem to a variety of other functions that have been studied.
DOT: A Matrix Model for Analyzing, Optimizing and Deploying Software for Big Data Analytics in Distributed Systems
"... ..."
(Show Context)
Dataoblivious graph drawing model and algorithms. arXiv preprint arXiv:1209.0756
, 2012
"... Abstract. We study graph drawing in a cloudcomputing context where data is stored externally and processed using a small local working storage. We show that a number of classic graph drawing algorithms can be efficiently implemented in such a framework where the client can maintain privacy while co ..."
Abstract

Cited by 6 (0 self)
 Add to MetaCart
(Show Context)
Abstract. We study graph drawing in a cloudcomputing context where data is stored externally and processed using a small local working storage. We show that a number of classic graph drawing algorithms can be efficiently implemented in such a framework where the client can maintain privacy while constructing a drawing of her graph. 1