Results 1 - 10 of 23
PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations
IEEE International Conference on Data Mining, 2009
"... Abstract—In this paper, we describe PEGASUS, an open source Peta Graph Mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node and finding the connected components. As the size of graphs reaches several Giga-, Tera- or P ..."
Abstract
-
Cited by 128 (26 self)
- Add to MetaCart
(Show Context)
Abstract—In this paper, we describe PEGASUS, an open source Peta Graph Mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node, and finding the connected components. As the size of graphs reaches several Giga-, Tera- or Peta-bytes, the necessity for such a library grows as well. To the best of our knowledge, PEGASUS is the first such library, implemented on top of the HADOOP platform, the open source version of MAPREDUCE. Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components, etc.) are essentially repeated matrix-vector multiplications. In this paper we describe a very important primitive for PEGASUS, called GIM-V (Generalized Iterated Matrix-Vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up with the number of available machines, (b) running time linear in the number of edges, and (c) more than 5 times faster performance than the non-optimized version of GIM-V. Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available web graphs, thanks to Yahoo!, with ≈ 6.7 billion edges. Keywords: PEGASUS; graph mining; Hadoop.
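The GIM-V primitive can be made concrete with a small sketch. Below is a minimal single-machine rendering of one GIM-V step, assuming the paper's three-operation interface (combine2, combineAll, assign); PEGASUS itself executes this as Hadoop MapReduce stages over a distributed edge file. The connected-components instantiation shown is standard minimum-label propagation.

```python
# A minimal, single-machine sketch of the GIM-V primitive: one iteration of
# v' = M x_G v, parameterized by three operations. In PEGASUS these run as
# Hadoop MapReduce stages over an edge list; here plain dicts stand in.

def gim_v_step(edges, v, combine2, combine_all, assign):
    """edges: iterable of (i, j, m_ij); v: dict vertex -> value."""
    partial = {}  # i -> combine2 results gathered from i's incoming products
    for i, j, m_ij in edges:
        partial.setdefault(i, []).append(combine2(m_ij, v[j]))
    return {i: assign(v[i], combine_all(partial.get(i, []))) for i in v}

# Example: connected components. Each vertex holds a component id; one step
# lowers it to the minimum id among itself and its neighbors.
edges = [(0, 1, 1), (1, 0, 1), (1, 2, 1), (2, 1, 1), (3, 4, 1), (4, 3, 1)]
v = {i: i for i in range(5)}
for _ in range(5):  # iterate to convergence
    v = gim_v_step(edges, v,
                   combine2=lambda m, vj: vj,
                   combine_all=lambda xs: min(xs) if xs else float("inf"),
                   assign=lambda vi, s: min(vi, s))
print(v)  # {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}
```

Other operations the abstract lists, such as PageRank, are obtained the same way by swapping the three operations.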
A provable time and space efficient implementation of NESL
In International Conference on Functional Programming, 1996
"... In this paper we prove time and space bounds for the implementation of the programming language NESL on various parallel machine models. NESL is a sugared typed J-calculus with a set of array primitives and an explicit parallel map over arrays. Our results extend previous work on provable implementa ..."
Abstract
-
Cited by 87 (10 self)
- Add to MetaCart
(Show Context)
In this paper we prove time and space bounds for the implementation of the programming language NESL on various parallel machine models. NESL is a sugared typed λ-calculus with a set of array primitives and an explicit parallel map over arrays. Our results extend previous work on provable implementation bounds for functional languages by considering space and by including arrays. For modeling the cost of NESL we augment a standard call-by-value operational semantics to return two cost measures: a DAG representing the sequential dependence in the computation, and a measure of the space taken by a sequential implementation. We show that a NESL program with w work (nodes in the DAG), d depth (levels in the DAG), and s sequential space can be implemented on a p-processor butterfly network, hypercube, or CRCW PRAM using O(w/p + d log p) time and O(s + dp log p) reachable space. For programs with sufficient parallelism these bounds are optimal in that they give linear speedup and use space within a constant factor of the sequential space.
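To make the two cost measures concrete, here is an illustrative sketch (not NESL's actual semantics) that computes work w and depth d for a toy computation DAG and evaluates the O(w/p + d log p) scheduled-time bound the abstract proves:

```python
# Illustrative sketch: work w (node count) and depth d (longest path) of a
# small computation DAG, plus the w/p + d log p scheduled-time bound at p=2.
import math

def work_and_depth(dag):
    """dag: dict node -> list of predecessor nodes (must be acyclic)."""
    depth = {}
    def d(u):
        if u not in depth:
            depth[u] = 1 + max((d(p) for p in dag[u]), default=0)
        return depth[u]
    for u in dag:
        d(u)
    return len(dag), max(depth.values())

# A parallel map over 4 elements feeding a binary reduction tree.
dag = {"x0": [], "x1": [], "x2": [], "x3": [],
       "s01": ["x0", "x1"], "s23": ["x2", "x3"],
       "sum": ["s01", "s23"]}
w, d = work_and_depth(dag)
p = 2
print(w, d, w / p + d * math.log2(p))  # 7 3 6.5
```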
Idempotent work stealing
In PPoPP, 2009
"... Load balancing is a technique which allows efficient parallelization of irregular workloads, and a key component of many applications and parallelizing runtimes. Work-stealing is a popular technique for implementing load balancing, where each parallel thread maintains its own work set of items and o ..."
Abstract
-
Cited by 32 (6 self)
- Add to MetaCart
(Show Context)
Load balancing is a technique which allows efficient parallelization of irregular workloads, and is a key component of many applications and parallelizing runtimes. Work stealing is a popular technique for implementing load balancing, where each parallel thread maintains its own work set of items and occasionally steals items from the sets of other threads. The conventional semantics of work stealing guarantee that each inserted task is eventually extracted exactly once. However, correctness of a wide class of applications allows for relaxed semantics, because either: i) the application already explicitly checks that no work is repeated, or ii) the application can tolerate repeated work. In this paper, we introduce idempotent work stealing, and present several new algorithms that exploit the relaxed semantics to deliver better performance. The semantics of the new algorithms guarantee that each inserted task is eventually extracted at least once, instead of exactly once. On mainstream processors, algorithms for conventional work stealing require special atomic instructions or store-load memory ordering fence instructions in the owner's critical path operations. In general, these instructions are substantially slower than regular memory access instructions. By exploiting the relaxed semantics, our algorithms avoid these instructions in the owner's operations. We evaluated our algorithms using common graph problems and micro-benchmarks and compared them to well-known conventional work stealing algorithms, the Cilk THE and Chase-Lev algorithms. We found that our best algorithm (with LIFO extraction) outperforms existing algorithms in nearly all cases, and often by significant margins.
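As a shape-only illustration of the LIFO variant, here is a simplified Python rendering; the real algorithms live in C-level shared memory, and Python cannot express the fence-free memory orderings that provide the speedup, so a lock stands in for the thieves' compare-and-swap. The point it shows: the owner's put and take perform only plain reads and writes, and a racing steal may hand out a task the owner also extracts, which at-least-once semantics permits.

```python
# Simplified sketch of a LIFO idempotent work-stealing structure (shape
# only). Owner operations touch plain fields; thieves compare-and-swap an
# (count, tag) "anchor", so a racing steal may duplicate a task the owner
# also takes -- tasks are extracted at least once, never lost.
import threading

class IdempotentLifo:
    def __init__(self):
        self.tasks = []
        self.anchor = (0, 0)               # (number of tasks, version tag)
        self._cas_lock = threading.Lock()  # stands in for hardware CAS

    def put(self, task):                   # owner only: no CAS, no fence
        n, tag = self.anchor
        if n == len(self.tasks):
            self.tasks.append(task)
        else:
            self.tasks[n] = task           # reuse a slot freed by take/steal
        self.anchor = (n + 1, tag + 1)     # only put bumps the tag

    def take(self):                        # owner only: plain read/write
        n, tag = self.anchor
        if n == 0:
            return None
        task = self.tasks[n - 1]
        self.anchor = (n - 1, tag)
        return task

    def steal(self):                       # thieves: CAS on the anchor
        while True:
            n, tag = self.anchor
            if n == 0:
                return None
            task = self.tasks[n - 1]       # read the task BEFORE the CAS
            with self._cas_lock:           # emulated compare-and-swap
                if self.anchor == (n, tag):
                    self.anchor = (n - 1, tag)
                    return task
            # anchor moved: retry; the task just read may be re-extracted
```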
Connected Components on Distributed Memory Machines
Parallel Algorithms: 3rd DIMACS Implementation Challenge, October 17-19, 1994, volume 30 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 1994
"... . The efforts of the theory community to develop efficient PRAM algorithms often receive little attention from application programmers. Although there are PRAM algorithm implementations that perform reasonably on shared memory machines, they often perform poorly on distributed memory machines, where ..."
Abstract
-
Cited by 24 (1 self)
- Add to MetaCart
(Show Context)
The efforts of the theory community to develop efficient PRAM algorithms often receive little attention from application programmers. Although there are PRAM algorithm implementations that perform reasonably on shared memory machines, they often perform poorly on distributed memory machines, where the cost of remote memory accesses is relatively high. We present a hybrid approach to solving the connected components problem, whereby a PRAM algorithm is merged with a sequential algorithm and then optimized to create an efficient distributed memory implementation. The sequential algorithm handles local work on each processor, and the PRAM algorithm handles interactions between processors. Our hybrid algorithm uses the Shiloach-Vishkin CRCW PRAM algorithm on a partition of the graph distributed over the processors and sequential breadth-first search within each local subgraph. The implementation uses the Split-C language developed at Berkeley, which provides a global address space and al...
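A sequential simulation of the hybrid scheme reads roughly as follows; the partitioning, the example graph, and the simplified hook-and-compress loop are illustrative choices, not the paper's implementation:

```python
# Sequential simulation of the hybrid idea: BFS labels components inside
# each processor's partition, then Shiloach-Vishkin-style hooking and
# pointer jumping merge labels across cross-partition edges.
from collections import defaultdict, deque

def hybrid_cc(n, edges, owner):
    """owner[v] = processor holding v; returns a component label per vertex."""
    local_adj, cross = defaultdict(list), []
    for u, v in edges:
        if owner[u] == owner[v]:
            local_adj[u].append(v); local_adj[v].append(u)
        else:
            cross.append((u, v))
    # Phase 1: sequential BFS within each local subgraph.
    label = [None] * n
    for s in range(n):
        if label[s] is None:
            label[s] = s
            q = deque([s])
            while q:
                u = q.popleft()
                for w in local_adj[u]:
                    if label[w] is None:
                        label[w] = s; q.append(w)
    # Phase 2: hook-and-compress over cross-partition edges.
    parent = list(range(n))
    changed = True
    while changed:
        changed = False
        for u, v in cross:          # hook the larger-indexed root under the smaller
            ru, rv = parent[label[u]], parent[label[v]]
            if ru != rv:
                lo, hi = min(ru, rv), max(ru, rv)
                parent[hi] = lo; changed = True
        for v in range(n):          # pointer jumping (compress to roots)
            while parent[v] != parent[parent[v]]:
                parent[v] = parent[parent[v]]
    return [parent[label[v]] for v in range(n)]

# Usage: vertices 0-1 on proc 0, 2-3 on proc 1, 4-5 on proc 2.
print(hybrid_cc(6, [(0, 1), (1, 2), (2, 3), (4, 5)],
                [0, 0, 1, 1, 2, 2]))  # [0, 0, 0, 0, 4, 4]
```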
An Optimal Randomized Logarithmic Time Connectivity Algorithm for the EREW PRAM
1996
"... Improving a long chain of works we obtain a randomised EREW PRAM algorithm for finding the connected components of a graph G = (V; E) with n vertices and m edges in O(logn) time using an optimal number of O((m + n)= log n) processors. The result returned by the algorithm is always correct. The pr ..."
Abstract
-
Cited by 13 (1 self)
- Add to MetaCart
Improving a long chain of works, we obtain a randomised EREW PRAM algorithm for finding the connected components of a graph G = (V, E) with n vertices and m edges in O(log n) time using an optimal number of O((m + n)/log n) processors. The result returned by the algorithm is always correct. The probability that the algorithm will not complete in O(log n) time is o(n^-c) for any c > 0.

1 Introduction
Finding the connected components of an undirected graph is perhaps the most basic algorithmic graph problem. While the problem is trivial in the sequential setting, it seems that elaborate methods should be used to solve the problem efficiently in the parallel setting. A considerable number of researchers investigated the complexity of the problem in various parallel models including, in particular, various members of the PRAM family. In this work we consider the EREW PRAM model, the weakest member of this family, and obtain, for the first time, a parallel connectivity algorith...
Towards Modeling the Performance of a Fast Connected Components Algorithm on Parallel Machines
In Proceedings of Supercomputing '95, 1996
"... : We present and analyze a portable, high-performance algorithm for finding connected components on modern distributed memory multiprocessors. The algorithm is a hybrid of the classic DFS on the subgraph local to each processor and a variant of the Shiloach-Vishkin PRAM algorithm on the global colle ..."
Abstract
-
Cited by 12 (3 self)
- Add to MetaCart
(Show Context)
We present and analyze a portable, high-performance algorithm for finding connected components on modern distributed memory multiprocessors. The algorithm is a hybrid of the classic DFS on the subgraph local to each processor and a variant of the Shiloach-Vishkin PRAM algorithm on the global collection of subgraphs. We implement the algorithm in Split-C and measure performance on the Cray T3D, the Meiko CS-2, and the Thinking Machines CM-5 using a class of graphs derived from cluster dynamics methods in computational physics. On a 256 processor Cray T3D, the implementation outperforms all previous solutions by an order of magnitude. A characterization of graph parameters allows us to select graphs that highlight key performance features. We study the effects of these parameters and machine characteristics on the balance of time between the local and global phases of the algorithm and find that edge density, surface-to-volume ratio, and relative communication cost dominate perform...
Detecting atmospheric rivers in large climate datasets
In Proceedings of the 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC '11), ACM, 2011
"... Extreme precipitation events on the western coast of North America are often traced to an unusual weather phenomenon known as atmospheric rivers. Although these storms may provide a significant fraction of the total water to the highly managed western US hydrological system, the resulting intense we ..."
Abstract
-
Cited by 6 (4 self)
- Add to MetaCart
(Show Context)
Extreme precipitation events on the western coast of North America are often traced to an unusual weather phenomenon known as atmospheric rivers. Although these storms may provide a significant fraction of the total water to the highly managed western US hydrological system, the resulting intense weather poses severe risks to human and natural infrastructure through flooding and wind damage. To aid the understanding of this phenomenon, we have developed an efficient detection algorithm suitable for analyzing large amounts of data. In addition to detecting actual events in the recent observed historical record, this detection algorithm can be applied to global climate model output, providing a new model validation methodology. Comparing the statistical behavior of simulated atmospheric river events in models to observations will enhance confidence in projections of future extreme storms. Our detection algorithm is based on a thresholding condition on the total column integrated water vapor established by Ralph et al. (2004), followed by a connected component labeling procedure that groups the mesh points into connected regions in space. We develop an efficient parallel implementation of the algorithm and demonstrate good weak and strong scaling. We process a 30-year simulation output on 10,000 cores in under 3 seconds.
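The two-stage pipeline (threshold, then connected component labeling) is easy to sketch serially; the 2 cm threshold and the size filter below are illustrative placeholders rather than the paper's exact Ralph et al. criteria, and the parallel decomposition is omitted:

```python
# Minimal serial sketch of the detection pipeline: threshold the integrated
# water vapor field, then group grid points with connected component
# labeling. Threshold and size filter are illustrative placeholders.
import numpy as np
from scipy import ndimage

def detect_candidate_regions(iwv, threshold_cm=2.0, min_cells=50):
    """iwv: 2D array of total column integrated water vapor (cm)."""
    mask = iwv > threshold_cm            # thresholding condition
    labels, n = ndimage.label(mask)      # connected component labeling
    regions = []
    for k in range(1, n + 1):
        cells = np.argwhere(labels == k)
        if len(cells) >= min_cells:      # drop tiny components
            regions.append(cells)        # candidate atmospheric river
    return regions

# Usage on synthetic data standing in for model output:
rng = np.random.default_rng(0)
field = rng.gamma(2.0, 0.8, size=(180, 360))   # fake IWV field in cm
print(len(detect_candidate_regions(field)))
```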
Finding Strongly Connected Components in Parallel using O(log^2 n) Reachability Queries
2007
"... We give a randomized (Las-Vegas) parallel algorithm for computing strongly connected components of a graph with n vertices and m edges. The runtime is dominated by O(log 2 n) parallel reachability queries; i.e. O(log 2 n) calls to a subroutine that computes the descendants of a given vertex in a giv ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
(Show Context)
We give a randomized (Las Vegas) parallel algorithm for computing strongly connected components of a graph with n vertices and m edges. The runtime is dominated by O(log^2 n) parallel reachability queries; i.e., O(log^2 n) calls to a subroutine that computes the descendants of a given vertex in a given digraph. Our algorithm also topologically sorts the strongly connected components. Using Ullman and Yannakakis's [21] techniques for the reachability subroutine gives our algorithm runtime Õ(t) using mn/t^2 processors for any (n^2/m)^(1/3) ≤ t ≤ n. On sparse graphs, this improves the number of processors needed to compute strongly connected components and topological sort within time n^(1/3) ≤ t ≤ n from the previously best known (n/t)^3 [19] to (n/t)^2.

1 Introduction and main results
Breadth-first and depth-first search have many applications in the analysis of directed graphs. Breadth-first search can be used to compute the vertices that are reachable from a given vertex and directed spanning trees. Depth-first search can solve these problems, determine if a graph is acyclic, topologically sort an acyclic graph, and compute strongly connected components (SCCs) [20]. Efforts ...
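The query-driven structure can be seen in the classic divide-and-conquer SCC skeleton that this algorithm refines: each round issues a forward and a backward reachability query from a pivot, peels off the pivot's SCC, and recurses on three disjoint vertex sets. The paper's contribution is pivot selection that bounds the number of query rounds by O(log^2 n); the sketch below shows only the skeleton.

```python
# Reachability-based SCC skeleton: two reachability queries per pivot,
# the pivot's SCC is the intersection, and the rest splits in three.
def reach(adj, sources, allowed):
    """Vertices in `allowed` reachable from `sources` (one reachability query)."""
    seen, stack = set(), [v for v in sources if v in allowed]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        stack.extend(w for w in adj[u] if w in allowed and w not in seen)
    return seen

def sccs(adj, radj, vertices):
    if not vertices:
        return []
    pivot = next(iter(vertices))
    fwd = reach(adj, [pivot], vertices)    # descendants of the pivot
    bwd = reach(radj, [pivot], vertices)   # ancestors of the pivot
    scc = fwd & bwd                        # the pivot's strongly connected comp.
    rest = [fwd - scc, bwd - scc, vertices - fwd - bwd]
    return [scc] + [c for part in rest for c in sccs(adj, radj, part)]

# Usage: 0 <-> 1 form an SCC; 2 and 3 are singletons.
adj  = {0: [1], 1: [0, 2], 2: [3], 3: []}
radj = {0: [1], 1: [0], 2: [1], 3: [2]}
print(sccs(adj, radj, {0, 1, 2, 3}))  # e.g. [{0, 1}, {2}, {3}]
```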
Multi-resolution Social Network Community Identification and Maintenance on Big Data Platform
"... Abstract—Community identification in social networks is of great interest and with dynamic changes to its graph representation and content, the incremental maintenance of community poses significant challenges in computation. Moreover, the intensity of community engagement can be distinguished at mu ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
Abstract—Community identification in social networks is of great interest, and with dynamic changes to a network's graph representation and content, the incremental maintenance of community poses significant challenges in computation. Moreover, the intensity of community engagement can be distinguished at multiple levels, resulting in a multi-resolution community representation that has to be maintained over time. In this paper, we first formalize this problem using the k-core metric projected at multiple k values, so that multiple community resolutions are represented with multiple k-core graphs. We then present distributed algorithms to construct and maintain a multi-k-core graph, implemented on the scalable big-data platform Apache HBase. Our experimental evaluation results demonstrate orders of magnitude speedup from maintaining the multi-k-core incrementally over complete reconstruction. Our algorithms thus enable practitioners to create and maintain communities at multiple resolutions on different topics in rich social network content simultaneously. Keywords: community identification; Big Data analytics; k-core; dynamic social networks; distributed computing.
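For reference, the k-core metric itself (the maximal subgraph in which every vertex keeps degree at least k) is computed by the standard peeling procedure sketched below; evaluating it at several k values yields the multi-resolution view. The paper's actual contribution, distributed incremental maintenance on HBase, is beyond this sketch.

```python
# Standard k-core peeling: repeatedly remove vertices of degree < k.
from collections import deque

def k_core(adj, k):
    """adj: dict vertex -> set of neighbors; returns vertices of the k-core."""
    deg = {v: len(ns) for v, ns in adj.items()}
    q = deque(v for v, d in deg.items() if d < k)
    removed = set()
    while q:
        v = q.popleft()
        if v in removed:
            continue
        removed.add(v)
        for w in adj[v]:
            if w not in removed:
                deg[w] -= 1
                if deg[w] < k:       # w no longer qualifies: peel it too
                    q.append(w)
    return set(adj) - removed

# Usage: a triangle plus a pendant vertex; the 2-core drops the pendant.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print({k: k_core(adj, k) for k in (1, 2, 3)})
# {1: {0, 1, 2, 3}, 2: {0, 1, 2}, 3: set()}
```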
A Simple and Practical Linear-Work Parallel Algorithm for Connectivity
"... Graph connectivity is a fundamental problem in computer science with many important applications. Sequentially, connectivity can be done in linear work easily using breadth-first search or depth-first search. There have been many parallel algorithms for con-nectivity, however the simpler parallel al ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
(Show Context)
Graph connectivity is a fundamental problem in computer science with many important applications. Sequentially, connectivity can be done in linear work easily using breadth-first search or depth-first search. There have been many parallel algorithms for connectivity; however, the simpler parallel algorithms require superlinear work, and the linear-work polylogarithmic-depth parallel algorithms are very complicated and not amenable to implementation. In this work, we address this gap by describing a simple and practical expected linear-work, polylogarithmic-depth parallel algorithm for graph connectivity. Our algorithm is based on a recent parallel algorithm for generating low-diameter graph decompositions by Miller et al. [44], which uses parallel breadth-first searches. We discuss a (modest) variant of their decomposition algorithm which preserves the theoretical complexity while leading to simpler and faster implementations. We experimentally compare the connectivity algorithms using both the original decomposition algorithm and our modified decomposition algorithm. We also experimentally compare against the fastest existing parallel connectivity implementations (which are not theoretically linear-work and polylogarithmic-depth) and show that our implementations are competitive for various input graphs. In addition, we compare our implementations to sequential connectivity algorithms and show that on 40 cores we achieve good speedup relative to the sequential implementations for many input graphs. We discuss the various optimizations used in our implementations and present an extensive experimental analysis of the performance. Our algorithm is the first parallel connectivity algorithm that is both theoretically and practically efficient.
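A sequential toy simulation of the decompose-and-contract loop, assuming a Miller et al. style exponential-shift decomposition: each vertex draws a shift from an exponential distribution, BFS balls grown from all vertices with start times offset by the shifts carve the graph into low-diameter clusters, inter-cluster edges are contracted, and the process repeats until no edges remain. The value beta = 0.2 and the Dijkstra-style simulation are illustrative choices, not the paper's implementation:

```python
# Sequential simulation of decompose-and-contract connectivity.
import heapq, random

def decompose(vertices, adj, beta=0.2):
    """Assign each vertex to a cluster via exponentially shifted BFS balls."""
    delta = {v: random.expovariate(beta) for v in vertices}
    start = max(delta.values())
    heap = [(start - delta[v], v, v) for v in vertices]  # (time, vertex, center)
    heapq.heapify(heap)
    cluster = {}
    while heap:
        d, v, c = heapq.heappop(heap)
        if v in cluster:
            continue
        cluster[v] = c                      # first ball to reach v claims it
        for w in adj[v]:
            if w not in cluster:
                heapq.heappush(heap, (d + 1, w, c))
    return cluster

def connectivity(vertices, edges):
    label = {v: v for v in vertices}        # original vertex -> current vertex
    while edges:                            # fresh random shifts every round
        adj = {v: [] for v in vertices}
        for u, v in edges:
            adj[u].append(v); adj[v].append(u)
        cluster = decompose(vertices, adj)
        label = {orig: cluster[label[orig]] for orig in label}
        vertices = set(cluster.values())    # clusters become next round's vertices
        edges = {(min(cluster[u], cluster[v]), max(cluster[u], cluster[v]))
                 for u, v in edges if cluster[u] != cluster[v]}
    return label                            # equal labels = same component

# Usage: vertices 0-2 end up sharing one label, 3-4 another.
print(connectivity({0, 1, 2, 3, 4}, [(0, 1), (1, 2), (3, 4)]))
```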