Results 1-10 of 103
Counting triangles and the curse of the last reducer
In WWW, 2011
Cited by 72 (1 self)
The clustering coefficient of a node in a social network is a fundamental measure that quantifies how tightly knit the community is around the node. Its computation can be reduced to counting the number of triangles incident on the particular node in the network. In case the graph is too big to fit into memory, this is a nontrivial task, and previous researchers showed how to estimate the clustering coefficient in this scenario. A different avenue of research is to perform the computation in parallel, spreading it across many machines. In recent years MapReduce has emerged as a de facto programming paradigm for parallel computation on massive data sets. The main focus of this work is to give MapReduce algorithms for counting triangles, which we use to compute clustering coefficients. Our contributions are twofold. First, we describe a sequential triangle counting algorithm and show how to adapt it to the MapReduce setting. This algorithm achieves a factor of 10-100 speedup over the naive approach. Second, we present a new algorithm designed specifically for the MapReduce framework. A key feature of this approach is that it allows for a smooth tradeoff between the memory available on each individual machine and the total memory available to the algorithm, while keeping the total work done constant. Moreover, this algorithm can use any triangle counting algorithm as a black box and distribute the computation across many machines. We validate our algorithms on real-world datasets comprising millions of nodes and over a billion edges. Our results show that both algorithms deal effectively with skew in the degree distribution and lead to dramatic speedups over the naive implementation.
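The reduction from clustering coefficients to per-node triangle counts can be illustrated with a small sequential sketch. It uses a degree-ordered node iterator, one common way to limit the work done at high-degree nodes; the abstract does not spell out the paper's algorithm, so this is an assumption rather than a reproduction of it, and the function name `triangles_and_clustering` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def triangles_and_clustering(edges):
    """Return (triangles per node, clustering coefficient per node).

    Assumes an undirected simple graph given as (u, v) pairs with
    mutually comparable node ids. Illustrative sketch only.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    # Total order on nodes: by degree, ties broken by node id.
    rank = {v: (len(nbrs), v) for v, nbrs in adj.items()}

    triangles = {v: 0 for v in adj}
    for v, nbrs in adj.items():
        # Only pair up neighbours ranked above v, so each triangle is
        # found exactly once, at its lowest-ranked corner.
        higher = [u for u in nbrs if rank[u] > rank[v]]
        for u, w in combinations(higher, 2):
            if w in adj[u]:
                triangles[v] += 1
                triangles[u] += 1
                triangles[w] += 1

    # Clustering coefficient: triangles at v divided by the number of
    # neighbour pairs, d * (d - 1) / 2. Defined as 0 when d < 2.
    coeff = {}
    for v, nbrs in adj.items():
        d = len(nbrs)
        coeff[v] = 2.0 * triangles[v] / (d * (d - 1)) if d > 1 else 0.0
    return triangles, coeff
```

For the graph with edges (1,2), (2,3), (1,3), (3,4), nodes 1, 2, and 3 each lie on one triangle, so node 1 has coefficient 1 while node 3, which has three neighbours, has coefficient 1/3.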
Privacy-Preserving Access of Outsourced Data via Oblivious RAM Simulation
2011
Cited by 66 (9 self)
Suppose a client, Alice, has outsourced her data to an external storage provider, Bob, because he has capacity for her massive data set, of size n, whereas her private storage is much smaller, say, of size O(n^(1/r)), for some constant r > 1. Alice trusts Bob to maintain her data, but she would like to keep its contents private. She can encrypt her data, of course, but she also wishes to keep her access patterns hidden from Bob. We describe schemes for the oblivious RAM simulation problem with a small logarithmic or polylogarithmic amortized increase in access times, with a very high probability of success, while keeping the external storage to be of size O(n). To achieve this, our algorithmic contributions include a parallel MapReduce cuckoo-hashing algorithm and an external-memory data-oblivious sorting algorithm.
The Declarative Imperative: Experiences and Conjectures in Distributed Logic
Cited by 50 (5 self)
The rise of multicore processors and cloud computing is putting enormous pressure on the software community to find solutions to the difficulty of parallel and distributed programming. At the same time, there is more (and more varied) interest in data-centric programming languages than at any time in computing history, in part because these languages parallelize naturally. This juxtaposition raises the possibility that the theory of declarative database query languages can provide a foundation for the next generation of parallel and distributed programming languages. In this paper I reflect on my group's experience over seven years using Datalog extensions to build networking protocols and distributed systems. Based on that experience, I present a number of theoretical conjectures that may both interest the database community and clarify important practical issues in distributed computing. Most importantly, I make a case for database researchers to take a leadership role in addressing the impending programming crisis. This is an extended version of an invited lecture at the ACM PODS 2010 conference [32].
Parallel evaluation of conjunctive queries.
2011
Cited by 26 (4 self)
The availability of large data centers with tens of thousands of servers has led to the popular adoption of massive parallelism for data analysis on large datasets. Several query languages exist for running queries on massively parallel architectures, some based on the MapReduce infrastructure, others using proprietary implementations. Motivated by this trend, this paper analyzes the parallel complexity of conjunctive queries. We propose a very simple model of parallel computation that captures these architectures, in which the complexity parameter is the number of parallel steps requiring synchronization of all servers. We study the complexity of conjunctive queries and give a complete characterization of the queries which can be computed in one parallel step. These form a strict subset of hierarchical queries, and in U(x, y, z, w), and combinations thereof, which we call tall-flat queries. We describe an algorithm for computing in parallel any tall-flat query, and prove that any query that is not tall-flat cannot be computed in one step in this model. Finally, we present extensions of our results to queries that are not tall-flat.
Scalable K-Means++
Cited by 24 (2 self)
Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherently sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means that have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
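The idea of replacing k sequential passes with a few parallel-friendly ones can be sketched as follows. In each round, every point is sampled independently with probability proportional to its squared distance from the current centers, and the oversampled candidates are then reduced to k by weighted k-means++ seeding. The abstract does not give these details, so the sampling rule, the parameter names `oversample` and `rounds`, and both function names are assumptions made for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def weighted_kmeanspp(cands, weights, k, rng):
    """Sequential k-means++ seeding on a small weighted candidate set."""
    chosen = [rng.choices(cands, weights=weights, k=1)[0]]
    while len(chosen) < k:
        scores = [w * min(sq_dist(c, s) for s in chosen)
                  for c, w in zip(cands, weights)]
        if sum(scores) == 0:  # every candidate coincides with a chosen center
            break
        chosen.append(rng.choices(cands, weights=scores, k=1)[0])
    return chosen


def oversampling_init(points, k, oversample, rounds, seed=0):
    """Pick up to k initial centers using a few oversampling rounds."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    for _ in range(rounds):
        # One pass over the data: squared distance to the nearest center.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        cost = sum(d2)
        if cost == 0:
            break
        # Points are sampled independently of each other, so this step
        # could be split across machines.
        centers += [p for p, d in zip(points, d2)
                    if rng.random() < min(1.0, oversample * d / cost)]
    # Weight each candidate by how many points it is nearest to.
    weights = [0] * len(centers)
    for p in points:
        nearest = min(range(len(centers)),
                      key=lambda i: sq_dist(p, centers[i]))
        weights[nearest] += 1
    return weighted_kmeanspp(centers, weights, k, rng)
```

The number of rounds is independent of k, which is the point of the approach: the expensive passes over the full data shrink to a handful, and only the final reclustering of the small candidate set is sequential.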
Fast clustering using MapReduce
In KDD, 2011
Cited by 24 (4 self)
Clustering problems have numerous applications and are becoming more challenging with the growing size of data available. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the practical and popular clustering problems k-center and k-median. We develop fast clustering algorithms with constant factor approximation guarantees. From a theoretical perspective, we give the first analysis showing several clustering algorithms are in MRC^0, a theoretical MapReduce class introduced by Karloff et al. [26]. Our algorithms use sampling to decrease the data size and run a time-consuming clustering algorithm such as local search or Lloyd's algorithm on the reduced data set. Our algorithms have sufficient flexibility to be used in practice since they run in a constant number of MapReduce rounds. We complement these results by performing experiments using our algorithms. We compare the empirical performance of our algorithms to several sequential and parallel algorithms for the k-median problem. The experiments show that our algorithms' solutions are similar to or better than those of the other algorithms, while running faster than any other parallel algorithm that was tested for sufficiently large data sets.
Towards Optimal Resource Provisioning for Running MapReduce Programs in Public Clouds
Cited by 23 (1 self)
Running MapReduce programs in the public cloud introduces an important problem: how to optimize resource provisioning to minimize the financial charge for a specific job? In this paper, we study the whole process of MapReduce processing and build up a cost function that explicitly models the relationship between the amount of input data, the available system resources (Map and Reduce slots), and the complexity of the Reduce function for the target MapReduce job. The model parameters can be learned from test runs with a small number of nodes. Based on this cost model, we can solve a number of decision problems, such as finding the amount of resources that minimizes the financial cost under a time deadline or minimizes the time under a given financial budget. Experimental results show that this cost model performs well on tested MapReduce programs.
Communication steps for parallel query processing.
In Proceedings of the 32nd ACM Symposium on Principles of Database Systems (PODS), 2013
Cited by 22 (4 self)
We consider the problem of computing a relational query q on a large input database of size n, using a large number p of servers. The computation is performed in rounds, and each server can receive only O(n/p^(1−ε)) bits of data, where ε ∈ [0, 1] is a parameter that controls replication. We examine how many global communication steps are needed to compute q. We establish both lower and upper bounds, in two settings. For a single round of communication, we give lower bounds in the strongest possible model, where arbitrary bits may be exchanged; we show that any algorithm requires ε ≥ 1 − 1/τ*, where τ* is the fractional vertex cover number of the hypergraph of q. We also give an algorithm that matches the lower bound for a specific class of databases. For multiple rounds of communication, we present lower bounds in a model where routing decisions for a tuple are tuple-based. We show that for the class of tree-like queries there exists a tradeoff between the number of rounds and the space exponent ε. The lower bounds for multiple rounds are the first of their kind. Our results also imply that transitive closure cannot be computed in O(1) rounds of communication.
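As a worked instance of the one-round bound, consider the triangle query. This query is chosen here as an illustration and is not named in the abstract. Its hypergraph has three variables and three binary atoms, and giving each variable weight 1/2 covers every atom with total weight 3/2, which is the minimum fractional vertex cover:

```latex
q(x,y,z) = R(x,y),\; S(y,z),\; T(z,x), \qquad
\tau^{*}(q) = \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{2} = \tfrac{3}{2}, \qquad
\varepsilon \;\ge\; 1 - \frac{1}{\tau^{*}(q)} = 1 - \frac{2}{3} = \frac{1}{3}.
```

Under the stated bound, a one-round algorithm for this query therefore needs each server to receive on the order of n/p^(2/3) bits rather than n/p.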
Fast greedy algorithms in MapReduce and streaming
In SPAA, 2013
Cited by 22 (1 self)
Greedy algorithms are practitioners' best friends: they are intuitive, simple to implement, and often lead to very good solutions. However, implementing greedy algorithms in a distributed setting is challenging since the greedy choice is inherently sequential, and it is not clear how to take advantage of the extra processing power. Our main result is a powerful sampling technique that aids in parallelization of sequential algorithms. We then show how to use this primitive to adapt a broad class of greedy algorithms to the MapReduce paradigm; this class includes maximum cover and submodular maximization subject to p-system constraints. Our method yields efficient algorithms that run in a logarithmic number of rounds, while obtaining solutions that are arbitrarily close to those produced by the standard sequential greedy algorithm. We begin with algorithms for modular maximization subject to a matroid constraint, and then extend this approach to obtain approximation algorithms for submodular maximization subject to knapsack or p-system constraints. Finally, we empirically validate our algorithms, and show that they achieve the same solution quality as standard greedy algorithms but run in substantially fewer rounds.
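To make the sequential bottleneck concrete, here is a minimal sketch of the standard greedy algorithm for maximum cover, the baseline the abstract compares against. It is not the paper's sampling technique; the function name `greedy_max_cover` and the input format (a dict from set name to a Python set) are assumptions for illustration.

```python
def greedy_max_cover(sets, k):
    """Pick up to k named sets, each time taking the largest marginal gain.

    `sets` maps a name to a set of elements. Returns (chosen names,
    covered elements). Illustrative sequential baseline only.
    """
    remaining = dict(sets)
    covered = set()
    chosen = []
    for _ in range(k):
        if not remaining:
            break
        # Each choice depends on everything picked so far, which is why
        # the k iterations cannot simply be run side by side.
        best = max(remaining, key=lambda name: len(remaining[name] - covered))
        if not remaining[best] - covered:
            break  # nothing left to gain
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered
```

With sets a = {1, 2, 3}, b = {3, 4}, c = {4, 5, 6, 7}, d = {1, 7} and k = 2, the first pick is c (gain 4) and the second is a (gain 3), covering all seven elements.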