Results 1–10 of 23
PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations
 IEEE INTERNATIONAL CONFERENCE ON DATA MINING
, 2009
Abstract

Cited by 128 (26 self)
Abstract—In this paper, we describe PEGASUS, an open source Peta Graph Mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node and finding the connected components. As the size of graphs reaches several Giga-, Tera- or Petabytes, the necessity for such a library grows too. To the best of our knowledge, PEGASUS is the first such library, implemented on top of the HADOOP platform, the open source version of MAPREDUCE. Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components, etc.) are essentially repeated matrix-vector multiplications. In this paper we describe a very important primitive for PEGASUS, called GIM-V (Generalized Iterated Matrix-Vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up on the number of available machines, (b) linear running time on the number of edges, and (c) more than 5 times faster performance over the non-optimized version of GIM-V. Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available Web graphs, thanks to Yahoo!, with ≈ 6.7 billion edges. Keywords: PEGASUS; graph mining; Hadoop
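The GIM-V primitive the abstract names can be illustrated in a few lines of plain Python (the paper's actual implementation is Hadoop MapReduce over edge files). The operation names combine2, combineAll, and assign follow the paper's terminology; the dict-of-adjacency-lists layout and the connected-components instantiation (combine2 returns the neighbor's value, combineAll and assign are min) are a simplified sketch, not the paper's code.

```python
# Minimal sketch of GIM-V: v'_i = assign(v_i, combineAll_j(combine2(m_ij, v_j))),
# iterated to a fixed point. adj maps node -> list of (neighbor, edge_weight).

def gimv_step(adj, v, combine2, combine_all, assign):
    """One GIM-V iteration over all nodes."""
    out = {}
    for i in adj:
        partials = [combine2(m, v[j]) for j, m in adj[i]]
        out[i] = assign(v[i], combine_all(partials)) if partials else v[i]
    return out

def connected_components(adj, nodes, max_iter=100):
    """Connected components as a GIM-V instance: propagate minimum node ids."""
    v = {i: i for i in nodes}  # each node starts in its own component
    for _ in range(max_iter):
        nv = gimv_step(adj, v,
                       combine2=lambda m, x: x,  # binary adjacency: m * x == x
                       combine_all=min,
                       assign=min)
        if nv == v:  # fixed point reached
            break
        v = nv
    return v
```

On an undirected path 0-1-2 plus an isolated node 3 (edges stored in both directions), the iteration converges to component ids {0: 0, 1: 0, 2: 0, 3: 3}.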
A Provable Time and Space Efficient Implementation of NESL
 In International Conference on Functional Programming
, 1996
Abstract

Cited by 87 (10 self)
In this paper we prove time and space bounds for the implementation of the programming language NESL on various parallel machine models. NESL is a sugared typed λ-calculus with a set of array primitives and an explicit parallel map over arrays. Our results extend previous work on provable implementation bounds for functional languages by considering space and by including arrays. For modeling the cost of NESL we augment a standard call-by-value operational semantics to return two cost measures: a DAG representing the sequential dependence in the computation, and a measure of the space taken by a sequential implementation. We show that a NESL program with w work (nodes in the DAG), d depth (levels in the DAG), and s sequential space can be implemented on a p-processor butterfly network, hypercube, or CRCW PRAM using O(w/p + d log p) time and O(s + dp log p) reachable space. For programs with sufficient parallelism these bounds are optimal in that they give linear speedup and use space within a constant factor of the sequential space.
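The work/depth accounting behind the O(w/p + d log p) bound can be made concrete with a balanced parallel sum: the reduction DAG has w = n - 1 nodes and d = ⌈log₂ n⌉ levels. The sketch below evaluates the bound numerically with all constants set to 1; it is an illustration of the cost model, not the paper's implementation.

```python
# Work/depth of a binary-tree sum over n values, and the resulting
# O(w/p + d log p) time bound from the abstract (constant factor 1).
import math

def reduction_costs(n):
    """Work and depth of a balanced binary-tree sum over n values."""
    work = n - 1                     # one DAG node per binary addition
    depth = math.ceil(math.log2(n))  # levels of the reduction tree
    return work, depth

def time_bound(work, depth, p):
    """Evaluate the w/p + d log p term for p processors."""
    return work / p + depth * math.log2(p)

w, d = reduction_costs(1 << 20)  # summing 2^20 values: w = 1048575, d = 20
# on p = 64 processors the w/p term dominates: 16383.98... + 120
```

For programs like this, where w/p dominates d log p, the bound gives the linear speedup the abstract calls "sufficient parallelism".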
Idempotent work stealing
 in PPoPP, 2009
Abstract

Cited by 32 (6 self)
Load balancing is a technique which allows efficient parallelization of irregular workloads, and a key component of many applications and parallelizing runtimes. Work stealing is a popular technique for implementing load balancing, where each parallel thread maintains its own work set of items and occasionally steals items from the sets of other threads. The conventional semantics of work stealing guarantee that each inserted task is eventually extracted exactly once. However, correctness of a wide class of applications allows for relaxed semantics, because either: (i) the application already explicitly checks that no work is repeated, or (ii) the application can tolerate repeated work. In this paper, we introduce idempotent work stealing, and present several new algorithms that exploit the relaxed semantics to deliver better performance. The semantics of the new algorithms guarantee that each inserted task is eventually extracted at least once, instead of exactly once. On mainstream processors, algorithms for conventional work stealing require special atomic instructions or store-load memory-ordering fence instructions in the owner's critical path operations. In general, these instructions are substantially slower than regular memory access instructions. By exploiting the relaxed semantics, our algorithms avoid these instructions in the owner's operations. We evaluated our algorithms using common graph problems and microbenchmarks and compared them to well-known conventional work stealing algorithms: the Cilk THE and Chase-Lev algorithms. We found that our best algorithm (with LIFO extraction) outperforms existing algorithms in nearly all cases, and often by significant margins.
Connected Components on Distributed Memory Machines
 Parallel Algorithms: 3rd DIMACS Implementation Challenge, October 17-19, 1994, volume 30 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science
, 1994
Abstract

Cited by 24 (1 self)
The efforts of the theory community to develop efficient PRAM algorithms often receive little attention from application programmers. Although there are PRAM algorithm implementations that perform reasonably on shared memory machines, they often perform poorly on distributed memory machines, where the cost of remote memory accesses is relatively high. We present a hybrid approach to solving the connected components problem, whereby a PRAM algorithm is merged with a sequential algorithm and then optimized to create an efficient distributed memory implementation. The sequential algorithm handles local work on each processor, and the PRAM algorithm handles interactions between processors. Our hybrid algorithm uses the Shiloach-Vishkin CRCW PRAM algorithm on a partition of the graph distributed over the processors and sequential breadth-first search within each local subgraph. The implementation uses the Split-C language developed at Berkeley, which provides a global address space and al...
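The hybrid scheme described above can be sketched serially: BFS labels components inside each processor's local subgraph, then a Shiloach-Vishkin-style hook-and-compress pass merges labels across cut edges. The two-function decomposition and data layout are our simplification of the distributed Split-C implementation, not its actual code.

```python
# Phase 1: sequential BFS within one processor's local subgraph.
# Phase 2: union the local labels across cut (inter-processor) edges
# by hooking larger roots onto smaller ones, with path compression
# standing in for the PRAM pointer-jumping step.
from collections import deque

def local_bfs_labels(nodes, local_edges):
    """Label each node with the smallest node of its local component."""
    adj = {u: [] for u in nodes}
    for u, v in local_edges:
        adj[u].append(v)
        adj[v].append(u)
    label = {}
    for s in sorted(nodes):
        if s in label:
            continue
        label[s] = s
        q = deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in label:
                    label[w] = s
                    q.append(w)
    return label

def merge_across_cuts(label, cut_edges):
    """Hook local labels together across cut edges, then compress."""
    parent = {l: l for l in set(label.values())}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # compression ~ pointer jumping
            x = parent[x]
        return x

    for u, v in cut_edges:                 # hooking phase
        ru, rv = find(label[u]), find(label[v])
        if ru != rv:
            parent[max(ru, rv)] = min(ru, rv)
    return {u: find(l) for u, l in label.items()}
```

With partition {0,1} holding edge (0,1), partition {2,3,4} holding edge (2,3), and cut edge (1,2), the merge produces one component {0,1,2,3} and leaves 4 alone.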
An Optimal Randomized Logarithmic Time Connectivity Algorithm for the EREW PRAM
, 1996
Abstract

Cited by 13 (1 self)
Improving on a long chain of works, we obtain a randomised EREW PRAM algorithm for finding the connected components of a graph G = (V, E) with n vertices and m edges in O(log n) time using an optimal number of O((m + n)/log n) processors. The result returned by the algorithm is always correct. The probability that the algorithm will not complete in O(log n) time is o(n^(-c)) for any c > 0.
1 Introduction
Finding the connected components of an undirected graph is perhaps the most basic algorithmic graph problem. While the problem is trivial in the sequential setting, it seems that elaborate methods should be used to solve the problem efficiently in the parallel setting. A considerable number of researchers investigated the complexity of the problem in various parallel models including, in particular, various members of the PRAM family. In this work we consider the EREW PRAM model, the weakest member of this family, and obtain, for the first time, a parallel connectivity algorith...
Towards Modeling the Performance of a Fast Connected Components Algorithm on Parallel Machines
 In Proceedings of Supercomputing '95
, 1996
Abstract

Cited by 12 (3 self)
We present and analyze a portable, high-performance algorithm for finding connected components on modern distributed memory multiprocessors. The algorithm is a hybrid of the classic DFS on the subgraph local to each processor and a variant of the Shiloach-Vishkin PRAM algorithm on the global collection of subgraphs. We implement the algorithm in Split-C and measure performance on the Cray T3D, the Meiko CS-2, and the Thinking Machines CM-5 using a class of graphs derived from cluster dynamics methods in computational physics. On a 256-processor Cray T3D, the implementation outperforms all previous solutions by an order of magnitude. A characterization of graph parameters allows us to select graphs that highlight key performance features. We study the effects of these parameters and machine characteristics on the balance of time between the local and global phases of the algorithm and find that edge density, surface-to-volume ratio, and relative communication cost dominate perform...
Detecting atmospheric rivers in large climate datasets
 In Proceedings of the 2nd international workshop on Petascale data analytics: challenges and opportunities (PDAC '11). ACM
, 2011
Abstract

Cited by 6 (4 self)
Extreme precipitation events on the western coast of North America are often traced to an unusual weather phenomenon known as atmospheric rivers. Although these storms may provide a significant fraction of the total water to the highly managed western US hydrological system, the resulting intense weather poses severe risks to the human and natural infrastructure through severe flooding and wind damage. To aid the understanding of this phenomenon, we have developed an efficient detection algorithm suitable for analyzing large amounts of data. In addition to detecting actual events in the recent observed historical record, this detection algorithm can be applied to global climate model output, providing a new model validation methodology. Comparing the statistical behavior of simulated atmospheric river events in models to observations will enhance confidence in projections of future extreme storms. Our detection algorithm is based on a thresholding condition on the total column integrated water vapor established by Ralph et al. (2004) followed by a connected component labeling procedure to group the mesh points into connected regions in space. We develop an efficient parallel implementation of the algorithm and demonstrate good weak and strong scaling. We process a 30-year simulation output on 10,000 cores in under 3 seconds.
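The two-stage pipeline described, threshold the integrated water vapor field, then group above-threshold mesh points into connected regions, can be sketched serially with a flood fill on a 2D grid. The grid layout and 4-connectivity are illustrative assumptions; the paper's contribution is the parallel implementation of this labeling at scale.

```python
# Threshold a 2D field and label its connected above-threshold regions
# (4-connected flood fill). Returns a label grid (-1 = below threshold)
# and the number of regions found.
from collections import deque

def label_regions(field, threshold):
    rows, cols = len(field), len(field[0])
    labels = [[-1] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if field[r][c] < threshold or labels[r][c] != -1:
                continue
            labels[r][c] = next_id          # flood-fill one new region
            q = deque([(r, c)])
            while q:
                y, x = q.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and field[ny][nx] >= threshold
                            and labels[ny][nx] == -1):
                        labels[ny][nx] = next_id
                        q.append((ny, nx))
            next_id += 1
    return labels, next_id
```

In the full pipeline the field would be the column-integrated water vapor and the threshold the Ralph et al. condition; here any numeric grid works, and two disjoint above-threshold patches yield two distinct region ids.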
Finding Strongly Connected Components in Parallel using O(log² n) Reachability Queries
, 2007
Abstract

Cited by 5 (0 self)
We give a randomized (Las Vegas) parallel algorithm for computing strongly connected components of a graph with n vertices and m edges. The runtime is dominated by O(log² n) parallel reachability queries; i.e., O(log² n) calls to a subroutine that computes the descendants of a given vertex in a given digraph. Our algorithm also topologically sorts the strongly connected components. Using Ullman and Yannakakis's [21] techniques for the reachability subroutine gives our algorithm runtime Õ(t) using mn/t² processors for any (n²/m)^(1/3) ≤ t ≤ n. On sparse graphs, this improves the number of processors needed to compute strongly connected components and topological sort within time n^(1/3) ≤ t ≤ n from the previously best known (n/t)³ [19] to (n/t)².
1 Introduction and main results
Breadth-first and depth-first search have many applications in the analysis of directed graphs. Breadth-first search can be used to compute the vertices that are reachable from a given vertex and directed spanning trees. Depth-first search can: solve these problems, determine if a graph is acyclic, topologically sort an acyclic graph and compute strongly connected components (SCCs) [20]. Efforts
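The way reachability queries drive SCC computation can be sketched with the classic divide-and-conquer recursion: one forward and one backward query from a pivot give its SCC as their intersection, and the remainder splits into three parts that contain no SCC crossing between them. This serial sketch shows only that structure; the paper's contribution is bounding the number of query rounds to O(log² n), which the naive pivot recursion below does not guarantee.

```python
# SCCs via reachability queries: reach() is the "descendants" subroutine
# the abstract treats as a black box (here, a plain DFS restricted to a
# vertex subset). radj is the reversed adjacency for backward queries.

def reach(adj, sources, nodes):
    """Vertices of `nodes` reachable from `sources` via edges in adj."""
    seen = set(s for s in sources if s in nodes)
    stack = list(seen)
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def sccs(adj, radj, nodes):
    """Strongly connected components of the subgraph induced on nodes."""
    if not nodes:
        return []
    pivot = next(iter(nodes))
    fwd = reach(adj, [pivot], nodes)   # forward reachability query
    bwd = reach(radj, [pivot], nodes)  # backward reachability query
    scc = fwd & bwd                    # the pivot's SCC
    out = [scc]
    # no SCC crosses between these three parts, so recurse independently
    for part in (fwd - scc, bwd - scc, nodes - fwd - bwd):
        out.extend(sccs(adj, radj, part))
    return out
```

On the digraph with cycle 0 → 1 → 2 → 0 and extra edge 2 → 3, the recursion finds the components {0, 1, 2} and {3} regardless of pivot choice.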
Multiresolution Social Network Community Identification and Maintenance on Big Data Platform
Abstract

Cited by 2 (1 self)
Abstract—Community identification in social networks is of great interest, and with dynamic changes to its graph representation and content, the incremental maintenance of community poses significant challenges in computation. Moreover, the intensity of community engagement can be distinguished at multiple levels, resulting in a multiresolution community representation that has to be maintained over time. In this paper, we first formalize this problem using the k-core metric projected at multiple k values, so that multiple community resolutions are represented with multiple k-core graphs. We then present distributed algorithms to construct and maintain a multi-k-core graph, implemented on the scalable big-data platform Apache HBase. Our experimental evaluation results demonstrate orders of magnitude speedup by maintaining multi-k-core incrementally over complete reconstruction. Our algorithms thus enable practitioners to create and maintain communities at multiple resolutions on different topics in rich social network content simultaneously. Keywords: community identification; Big Data analytics; k-core; dynamic social networks; distributed computing
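The k-core metric the abstract builds on is the maximal subgraph in which every vertex has degree at least k, and "projecting at multiple k values" means computing one core per k. The peeling sketch below only shows the metric itself, serially; the paper's contribution is the incremental, distributed maintenance on HBase, which this does not attempt.

```python
# k-core by repeated peeling: delete any vertex whose remaining degree
# drops below k until none remain, then report the surviving vertices.
# adj maps vertex -> list of neighbors (undirected, both directions).

def k_core(adj, k):
    """Vertex set of the k-core of an undirected graph."""
    deg = {u: len(vs) for u, vs in adj.items()}
    alive = set(adj)
    peel = [u for u in alive if deg[u] < k]
    while peel:
        u = peel.pop()
        if u not in alive:
            continue
        alive.discard(u)
        for v in adj[u]:
            if v in alive:
                deg[v] -= 1
                if deg[v] < k:   # neighbor just fell below k
                    peel.append(v)
    return alive

def multi_k_core(adj, ks):
    """Project the graph at several resolutions: one core per k."""
    return {k: k_core(adj, k) for k in ks}
```

On a triangle {0, 1, 2} with a pendant vertex 3 attached to 0, the 1-core is everything, the 2-core is the triangle, and the 3-core is empty, giving the nested multiresolution structure the abstract describes.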
A Simple and Practical Linear-Work Parallel Algorithm for Connectivity
Abstract

Cited by 2 (2 self)
Graph connectivity is a fundamental problem in computer science with many important applications. Sequentially, connectivity can be done in linear work easily using breadth-first search or depth-first search. There have been many parallel algorithms for connectivity; however, the simpler parallel algorithms require superlinear work, and the linear-work polylogarithmic-depth parallel algorithms are very complicated and not amenable to implementation. In this work, we address this gap by describing a simple and practical expected linear-work, polylogarithmic-depth parallel algorithm for graph connectivity. Our algorithm is based on a recent parallel algorithm for generating low-diameter graph decompositions by Miller et al. [44], which uses parallel breadth-first searches. We discuss a (modest) variant of their decomposition algorithm which preserves the theoretical complexity while leading to simpler and faster implementations. We experimentally compare the connectivity algorithms using both the original decomposition algorithm and our modified decomposition algorithm. We also experimentally compare against the fastest existing parallel connectivity implementations (which are not theoretically linear-work and polylogarithmic-depth) and show that our implementations are competitive for various input graphs. In addition, we compare our implementations to sequential connectivity algorithms and show that on 40 cores we achieve good speedup relative to the sequential implementations for many input graphs. We discuss the various optimizations used in our implementations and present an extensive experimental analysis of the performance. Our algorithm is the first parallel connectivity algorithm that is both theoretically and practically efficient.
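The overall scheme, decompose the graph into low-diameter clusters using BFS with random shifts, contract each cluster, and repeat until no edges remain, can be sketched serially. This is a simplified variant of the random-shift idea: we use each vertex's exponential shift directly as a BFS start delay (simulated via a multi-source priority queue), which differs in constants from the max-shift formulation of Miller et al., and beta is an illustrative choice.

```python
# Connectivity by repeated decompose-and-contract. adj maps vertex ->
# set of neighbors and must be symmetric (undirected graph).
import heapq
import random

def ldd_clusters(adj, beta=0.2, rng=None):
    """Assign each vertex to the delayed BFS source reaching it first."""
    rng = rng or random
    shift = {v: rng.expovariate(beta) for v in adj}
    # vertex v joins the source u minimizing shift[u] + hops(u, v):
    # a multi-source shortest path with per-source start offsets
    pq = [(shift[v], v, v) for v in adj]
    heapq.heapify(pq)
    cluster = {}
    while pq:
        d, src, v = heapq.heappop(pq)
        if v in cluster:
            continue
        cluster[v] = src
        for w in adj[v]:
            if w not in cluster:
                heapq.heappush(pq, (d + 1, src, w))
    return cluster

def connectivity(adj, seed=0):
    """Component labels via repeated decomposition and contraction."""
    rng = random.Random(seed)
    label = {v: v for v in adj}           # original vertex -> current name
    g = {u: set(vs) for u, vs in adj.items()}
    while any(g.values()):                # loop while edges remain
        cluster = ldd_clusters(g, rng=rng)
        ng = {c: set() for c in set(cluster.values())}
        for u in g:                       # contract: keep inter-cluster edges
            for v in g[u]:
                cu, cv = cluster[u], cluster[v]
                if cu != cv:
                    ng[cu].add(cv)
                    ng[cv].add(cu)
        label = {v: cluster[label[v]] for v in label}
        g = ng
    return label
```

Clusters only grow along edges, so distinct components are never merged, and the loop runs until every component has collapsed to a single vertex; the fresh random shifts each round make progress overwhelmingly likely, mirroring the expected-work guarantee the abstract claims for the real algorithm.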