Results 1–10 of 24
A Fast, Parallel Spanning Tree Algorithm for Symmetric Multiprocessors (SMPs) (Extended Abstract)
, 2004
Abstract

Cited by 47 (13 self)
Our study in this paper focuses on implementing parallel spanning tree algorithms on SMPs. Spanning tree is an important problem in the sense that it is the building block for many other parallel graph algorithms and also because it is representative of a large class of irregular combinatorial problems that have simple and efficient sequential implementations and fast PRAM algorithms, but often have no known efficient parallel implementations. In this paper we present a new randomized algorithm and implementation with superior performance that for the first time achieves parallel speedup on arbitrary graphs (both regular and irregular topologies) when compared with the best sequential implementation for finding a spanning tree. This new algorithm uses several techniques to give an expected running time that scales linearly with the number p of processors for suitably large inputs (n > p²). As the spanning tree problem is notoriously hard for any parallel implementation to achieve reasonable speedup, our study may shed new light on implementing PRAM algorithms for shared-memory parallel computers. The main results of this paper are: 1. A new and practical spanning tree algorithm for symmetric multiprocessors that exhibits parallel speedups on graphs with regular and irregular topologies; and 2. An experimental study of parallel spanning tree algorithms that reveals the superior performance of our new approach compared with the previous algorithms. The source code for these algorithms is freely available from our web site hpc.ece.unm.edu.
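As a point of reference for the "best sequential implementation" this abstract compares against, a minimal BFS-based sequential spanning tree might look like the sketch below (names and representation are illustrative, not taken from the paper):

```python
from collections import deque

def spanning_tree(adj, root=0):
    """Return the parent map of a BFS spanning tree of a connected graph.

    adj: dict mapping vertex -> list of neighbors (undirected graph).
    """
    parent = {root: root}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u      # (u, v) becomes a tree edge
                queue.append(v)
    return parent

# A 4-cycle with a chord: any spanning tree has n - 1 = 3 edges.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0, 1]}
parent = spanning_tree(adj)
tree_edges = {(min(u, p), max(u, p)) for u, p in parent.items() if u != p}
print(len(tree_edges))  # 3 tree edges for 4 vertices
```

This linear-time baseline is exactly what makes spanning tree hard to speed up in parallel: there is very little sequential work to beat.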
Fast Shared-Memory Algorithms for Computing the Minimum Spanning Forest of Sparse Graphs
, 2006
On the architectural requirements for efficient execution of graph algorithms
 In Proc. 34th Int’l Conf. on Parallel Processing (ICPP)
, 2005
Abstract

Cited by 27 (10 self)
Combinatorial problems such as those from graph theory pose serious challenges for parallel machines due to non-contiguous, concurrent accesses to global data structures with low degrees of locality. The hierarchical memory systems of symmetric multiprocessor (SMP) clusters optimize for local, contiguous memory accesses, and so are inefficient platforms for such algorithms. Few parallel graph algorithms outperform their best sequential implementation on SMP clusters due to long memory latencies and high synchronization costs. In this paper, we consider the performance and scalability of two graph algorithms, list ranking and connected components, on two classes of shared-memory computers: symmetric multiprocessors such as the Sun Enterprise servers and multithreaded architectures
Parallel Implementation of Algorithms for Finding Connected Components in Graphs
, 1997
Abstract

Cited by 25 (1 self)
In this paper, we describe our implementation of several parallel graph algorithms for finding connected components. Our implementation, with virtual processing, is on a 16,384-processor MasPar MP-1 using the language MPL. We present extensive test data on our code. In our previous projects [21, 22, 23], we reported the implementation of an extensible parallel graph algorithms library. We developed general implementation and fine-tuning techniques without expending too much effort on optimizing each individual routine. We also handled the issue of implementing virtual processing. In this paper, we describe several algorithms and fine-tuning techniques that we developed for the problem of finding connected components in parallel; many of the fine-tuning techniques are of general interest and should be applicable to code for other problems. We present data on the execution time and memory usage of our various implementations.
Connected Components Algorithms for Mesh-Connected Parallel Computers
 Parallel Algorithms: 3rd DIMACS Implementation Challenge, October 17-19, 1994, Volume 30 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science
, 1995
Abstract

Cited by 14 (0 self)
We present a new CREW PRAM algorithm for finding connected components. For a graph G with n vertices and m edges, algorithm A0 requires at most O(log n) parallel steps and performs O((n + m) log n) work in the worst case. The advantage our algorithm has over others in the literature is that it can be adapted to a 2-D mesh-connected communication model in which all CREW operations are replaced by O(log n) parallel row and column operations without increasing the time complexity. We present the mapping of A0 to a mesh-connected computer and describe two implementations, A1 and A2. Algorithm A1, which uses an adjacency matrix to represent the graph, performs O(n² log n) work. Hence, it only achieves work efficiency on dense graphs. The second implementation, A2, uses a sparse representation of the adjacency matrix and again performs O(log n) row and column operations but reduces the work to O((m + n) log n) on all graphs. We report MasPar MP-1 performance figures for implementati...
Towards Modeling the Performance of a Fast Connected Components Algorithm on Parallel Machines
 In Proceedings of Supercomputing '95
, 1996
Abstract

Cited by 12 (3 self)
We present and analyze a portable, high-performance algorithm for finding connected components on modern distributed-memory multiprocessors. The algorithm is a hybrid of the classic DFS on the subgraph local to each processor and a variant of the Shiloach-Vishkin PRAM algorithm on the global collection of subgraphs. We implement the algorithm in Split-C and measure performance on the Cray T3D, the Meiko CS-2, and the Thinking Machines CM-5 using a class of graphs derived from cluster dynamics methods in computational physics. On a 256-processor Cray T3D, the implementation outperforms all previous solutions by an order of magnitude. A characterization of graph parameters allows us to select graphs that highlight key performance features. We study the effects of these parameters and machine characteristics on the balance of time between the local and global phases of the algorithm and find that edge density, surface-to-volume ratio, and relative communication cost dominate perform...
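The Shiloach-Vishkin variant used in the global phase above alternates "hooking" (attaching one component's root under a smaller-labeled root) with pointer jumping. A sequential simulation of that idea, under assumed names and not the paper's Split-C code, might look like:

```python
def connected_components_sv(n, edges):
    """Shiloach-Vishkin-style connected components, simulated sequentially.

    n: number of vertices; edges: list of (u, v) pairs.
    Returns labels where labels[u] == labels[v] iff u and v are connected.
    """
    parent = list(range(n))
    changed = True
    while changed:
        changed = False
        # Hooking: attach the larger root under the smaller one.
        for u, v in edges:
            ru, rv = parent[u], parent[v]
            if parent[ru] != parent[rv]:
                hi, lo = max(parent[ru], parent[rv]), min(parent[ru], parent[rv])
                parent[hi] = lo
                changed = True
        # Pointer jumping: shortcut every vertex to its root.
        for u in range(n):
            while parent[u] != parent[parent[u]]:
                parent[u] = parent[parent[u]]
    return parent

labels = connected_components_sv(6, [(0, 1), (1, 2), (3, 4)])
print(labels)  # [0, 0, 0, 3, 3, 5]: components {0, 1, 2}, {3, 4}, {5}
```

Because hooking always attaches a larger label under a smaller one, the parent pointers can never form a cycle, and the loop terminates once no edge connects two different roots.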
Finding Connected Components in MapReduce in Logarithmic Rounds
Abstract

Cited by 10 (0 self)
Given a large graph G = (V, E) with millions of nodes and edges, how do we compute its connected components efficiently? Recent work addresses this problem in MapReduce, where a fundamental tradeoff exists between the number of MapReduce rounds and the communication of each round. Denoting by d the diameter of the graph, and by n the number of nodes in the largest component, all prior techniques for MapReduce either require a linear, Θ(d), number of rounds, or a quadratic, Θ(n|V| + |E|), communication per round. We propose here two efficient MapReduce algorithms: (i) Hash-Greater-to-Min, which is a randomized algorithm based on PRAM techniques, requiring O(log n) rounds and O(|V| + |E|) communication per round, and (ii) Hash-to-Min, which is a novel algorithm, provably finishing in O(log n) iterations for path graphs. The proof technique used for Hash-to-Min is novel but not tight, and it is actually faster than Hash-Greater-to-Min in practice. We conjecture that it requires 2 log d rounds and 3(|V| + |E|) communication per round, as demonstrated in our experiments. Using secondary sorting, a standard MapReduce feature, we scale Hash-to-Min to graphs with very large connected components. Our techniques for connected components can be applied to clustering as well. We propose a novel algorithm for agglomerative single-linkage clustering in MapReduce. This is the first MapReduce algorithm for clustering in at most O(log n) rounds, where n is the size of the largest cluster. We show the effectiveness of all our algorithms through detailed experiments on large synthetic as well as real-world datasets.
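The Hash-to-Min rule can be simulated locally to see why it converges: in each round every node sends its whole known cluster to the smallest member, and the smallest member to everyone else. This is a sketch of the high-level rule only, not the paper's MapReduce implementation:

```python
def hash_to_min_round(clusters):
    """One Hash-to-Min round. clusters: dict node -> set of known nodes."""
    inbox = {v: {v} for v in clusters}
    for v, cluster in clusters.items():
        m = min(cluster | {v})
        inbox[m] |= cluster | {v}     # send the whole cluster to the minimum
        for u in cluster:
            inbox[u].add(m)           # send the minimum to everyone else
    return inbox

# Path graph 0 - 1 - 2 - 3, initialized with adjacency sets.
clusters = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
while True:
    nxt = hash_to_min_round(clusters)
    if nxt == clusters:
        break
    clusters = nxt
print(clusters[0])  # {0, 1, 2, 3}: the minimum node learns its whole component
```

At the fixpoint, the minimum node of each component holds the entire component, and every other node knows that minimum, which is exactly the labeling the paper's secondary-sorting step exploits.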
The Euler tour technique and parallel rooted spanning tree
 In Proc. Int’l Conf. on Parallel Processing (ICPP)
, 2004
Abstract

Cited by 7 (4 self)
Many parallel algorithms for graph problems start with finding a spanning tree and rooting the tree to define some structural relationship on the vertices, which can then be used by the problem-specific computations that follow. The generic procedure is to find an unrooted spanning tree and then root the spanning tree using the Euler tour technique. With a randomized work-time optimal unrooted spanning tree algorithm and work-time optimal list ranking, finding rooted spanning trees can be done work-time optimally on EREW PRAM w.h.p. Yet the Euler tour technique assumes a circular adjacency list as given, and constructing the circular adjacency list on the fly for the tree found by a spanning tree algorithm is not without cost. In fact our experiments show that this "hidden" step of constructing a circular adjacency list can take as much time as the spanning tree and list ranking steps combined. In this paper we present new efficient algorithms that find rooted spanning trees without using the Euler tour technique and incur little or no overhead over the underlying spanning tree algorithms. We also present two new approaches that construct Euler tours efficiently when the circular adjacency list is not given. One is a deterministic PRAM algorithm and the other is a randomized algorithm in the symmetric multiprocessor (SMP) model. The randomized algorithm takes a novel approach to the problems of constructing the Euler tour and rooting a tree: it computes a rooted spanning tree first, then constructs an Euler tour directly for the tree using depth-first traversal. The tour constructed is cache-friendly, with adjacent edges in the
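The approach described at the end, rooting the tree first and then emitting the Euler tour by depth-first traversal, can be sketched sequentially. Names and the tree representation are illustrative, not the paper's:

```python
def euler_tour(children, root):
    """Euler tour (sequence of directed tree edges) of a rooted tree,
    built by iterative depth-first traversal.

    children: dict mapping node -> list of children (leaves may be absent).
    """
    tour, stack = [], [(root, iter(children.get(root, [])))]
    while stack:
        u, it = stack[-1]
        child = next(it, None)
        if child is None:
            stack.pop()
            if stack:
                tour.append((u, stack[-1][0]))   # retreat edge back to parent
        else:
            tour.append((u, child))              # advance edge into the subtree
            stack.append((child, iter(children.get(child, []))))
    return tour

#      0
#     / \
#    1   2
tour = euler_tour({0: [1, 2]}, 0)
print(tour)  # [(0, 1), (1, 0), (0, 2), (2, 0)]
```

Each tree edge appears exactly twice (once in each direction), so a tree with n vertices yields a tour of 2(n - 1) directed edges, without ever materializing a circular adjacency list.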
Connected-Components Algorithms for Mesh-Connected Parallel Computers
 Presented at the 3rd DIMACS Implementation Challenge Workshop
, 1995
Abstract

Cited by 6 (0 self)
We present efficient parallel algorithms for finding the connected components of sparse and dense graphs using a mesh-connected parallel computer. We start with a PRAM algorithm with work complexity O(n² log n). The algorithm performs O(log n) reduction and broadcast operations within the rows and columns of a mesh-connected computer. Next, a representation of the adjacency matrix for a sparse graph with m edges is chosen that preserves the communication structure of the algorithm but improves the work bound to O((n + m) log n). This work bound can be improved to the optimal O(n + m) bound through the use of graph contraction. In architectures like the MasPar MP-1 and MP-2, parallel row and column operations of the form described achieve high performance relative to the unrestricted concurrent accesses typically found in parallel connected-components algorithms for sparse graphs, and exhibit no locality dependence. We present MasPar MP-1 performance figures for implementations of the a...
Lock-free parallel algorithms: An experimental study
 In Proceedings of the 11th International Conference on High Performance Computing
, 2004
Abstract

Cited by 5 (2 self)
Lock-free shared data structures in the setting of distributed computing have received a fair amount of attention. Major motivations for lock-free data structures include increasing the fault tolerance of a (possibly heterogeneous) system and getting rid of the problems associated with critical sections, such as priority inversion and deadlock. For parallel computers with closely coupled processors and shared memory, these issues are no longer major concerns. While many of the results are applicable especially when the model used is shared-memory multiprocessors, no prior studies have considered improving the performance of a parallel implementation by way of lock-free programming. As a matter of fact, in practice lock-free data structures in a distributed setting often do not perform as well as those that use locks. As the data structures and algorithms for parallel computing are often drastically different from those in distributed computing, it is possible that lock-free programs perform better. In this paper we compare the similarities and differences of lock-free programming in distributed and parallel computing environments and explore the possibility of adapting lock-free programming to parallel computing to improve performance. Lock-free programming also provides a new way of simulating PRAM and asynchronous PRAM algorithms on current parallel machines.