Parallel Implementation of Algorithms for Finding Connected Components in Graphs
, 1997
Cited by 25 (1 self)
In this paper, we describe our implementation of several parallel graph algorithms for finding connected components. Our implementation, with virtual processing, is on a 16,384-processor MasPar MP-1 using the language MPL. We present extensive test data on our code. In our previous projects [21, 22, 23], we reported the implementation of an extensible parallel graph algorithms library. We developed general implementation and fine-tuning techniques without expending too much effort on optimizing each individual routine. We also handled the issue of implementing virtual processing. In this paper, we describe several algorithms and fine-tuning techniques that we developed for the problem of finding connected components in parallel; many of the fine-tuning techniques are of general interest, and should be applicable to code for other problems. We present data on the execution time and memory usage of our various implementations.
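For readers unfamiliar with the problem, the following is a minimal sketch of one generic way to label connected components in synchronous rounds (minimum-label hooking followed by one pointer-jumping step). It is a sequential Python simulation, not the authors' MPL code; the function name `connected_components` and the edge-list input format are assumptions made for illustration.

```python
def connected_components(n, edges):
    """Label each vertex 0..n-1 with the smallest vertex id in its component.

    Sequential simulation of synchronous rounds; a generic sketch,
    not the algorithms implemented in the paper.
    """
    label = list(range(n))          # every vertex starts as its own component
    changed = True
    while changed:
        # Hooking: each endpoint adopts the smaller label seen across an edge.
        new = list(label)
        for u, v in edges:
            m = min(label[u], label[v])
            if m < new[u]:
                new[u] = m
            if m < new[v]:
                new[v] = m
        # Shortcutting: one pointer-jumping step, label[v] <- label[label[v]].
        new = [new[new[v]] for v in range(n)]
        changed = new != label      # stop once a full round changes nothing
        label = new
    return label


if __name__ == "__main__":
    print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))
```

On a real parallel machine each round's loops would run across processors; here they are ordinary loops so the logic can be checked on a single machine.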
Evaluating Arithmetic Expressions Using Tree Contraction: A Fast and Scalable Parallel Implementation for Symmetric Multiprocessors (SMPs)
 Proc. 9th Int’l Conf. on High Performance Computing (HiPC 2002), volume 2552 of Lecture Notes in Computer Science
, 2002
Cited by 23 (7 self)
The ability to provide uniform shared-memory access to a significant number of processors in a single SMP node brings us much closer to the ideal PRAM parallel computer. In this paper, we develop new techniques for designing a uniform shared-memory algorithm from a PRAM algorithm and present the results of an extensive experimental study demonstrating that the resulting programs scale nearly linearly across a significant range of processors and across the entire range of instance sizes tested. This linear speedup with the number of processors is one of the first ever attained in practice for intricate combinatorial problems. The example we present in detail here is for evaluating arithmetic expression trees using the algorithmic techniques of list ranking and tree contraction; this problem is not only of interest in its own right, but is representative of a large class of irregular combinatorial problems that have simple and efficient sequential implementations and fast PRAM algorithms, but have no known efficient parallel implementations. Our results thus offer promise for bridging the gap between the theory and practice of shared-memory parallel algorithms.
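List ranking, one of the two techniques named above, can be illustrated with the textbook pointer-jumping scheme. The sketch below is a sequential Python simulation of the synchronous rounds and is not the SMP implementation described in the paper; the name `list_rank` and the convention that the tail points to itself are assumptions.

```python
def list_rank(succ):
    """Return rank[i] = number of links from node i to the tail of the list.

    succ[i] is the successor of node i; the tail satisfies succ[i] == i.
    Simulates pointer jumping: each round doubles the distance covered.
    """
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    nxt = list(succ)
    rounds = max(1, (n - 1).bit_length())   # enough rounds to span n - 1 links
    for _ in range(rounds):
        # Both arrays are rebuilt from the old values, as in a synchronous step.
        new_rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        new_nxt = [nxt[nxt[i]] for i in range(n)]
        rank, nxt = new_rank, new_nxt
    return rank


if __name__ == "__main__":
    # The list 0 -> 2 -> 3 -> 1, stored out of order, with node 1 as the tail.
    print(list_rank([2, 1, 3, 1]))
```

The number of rounds is logarithmic in the list length, which is what makes the scheme attractive on a PRAM even though the total work exceeds that of a sequential traversal.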
Using PRAM Algorithms on a Uniform-Memory-Access Shared-Memory Architecture
 Proc. 5th Int’l Workshop on Algorithm Engineering (WAE 2001), volume 2141 of Lecture Notes in Computer Science
, 2001
Cited by 21 (11 self)
The ability to provide uniform shared-memory access to a significant number of processors in a single SMP node brings us much closer to the ideal PRAM parallel computer. In this paper, we develop new techniques for designing a uniform shared-memory algorithm from a PRAM algorithm and present the results of an extensive experimental study demonstrating that the resulting programs scale nearly linearly across a significant range of processors (from 1 to 64) and across the entire range of instance sizes tested. This linear speedup with the number of processors is, to our knowledge, the first ever attained in practice for intricate combinatorial problems. The example we present in detail here is a graph decomposition algorithm that also requires the computation of a spanning tree; this problem is not only of interest in its own right, but is representative of a large class of irregular combinatorial problems that have simple and efficient sequential implementations and fast PRAM algorithms, but have no known efficient parallel implementations. Our results thus offer promise for bridging the gap between the theory and practice of shared-memory parallel algorithms.
Connected Components Algorithms For Mesh-Connected Parallel Computers
 Parallel Algorithms: 3rd DIMACS Implementation Challenge, October 17-19, 1994, volume 30 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science
, 1995
Cited by 12 (0 self)
We present a new CREW PRAM algorithm for finding connected components. For a graph G with n vertices and m edges, algorithm A_0 requires at most O(log n) parallel steps and performs O((n + m) log n) work in the worst case. The advantage our algorithm has over others in the literature is that it can be adapted to a 2D mesh-connected communication model in which all CREW operations are replaced by O(log n) parallel row and column operations without increasing the time complexity. We present the mapping of A_0 to a mesh-connected computer and describe two implementations, A_1 and A_2. Algorithm A_1, which uses an adjacency matrix to represent the graph, performs O(n^2 log n) work. Hence, it only achieves work efficiency on dense graphs. The second implementation, A_2, uses a sparse representation of the adjacency matrix and again performs O(log n) row and column operations but reduces the work to O((m + n) log n) on all graphs. We report MasPar MP-1 performance figures for implementati...
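The claim that A_1 is work-efficient only on dense graphs follows directly from the two work bounds stated in the abstract. A short worked comparison, where W_{A_1} and W_{A_2} are shorthand introduced here for the work of the two implementations:

```latex
\[
  \frac{W_{A_1}}{W_{A_2}}
  = \frac{n^{2}\log n}{(m+n)\log n}
  = \frac{n^{2}}{m+n}
\]
% Sparse graph, m = \Theta(n):      ratio = \Theta(n), so A_1 does a factor of n more work.
% Dense graph,  m = \Theta(n^{2}):  ratio = \Theta(1), so the two bounds coincide.
```

The ratio compares the asymptotic bounds only; constant factors and the measured MasPar timings are not implied by it.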
Language and Library Support for Practical PRAM Programming
, 1997
Cited by 4 (1 self)
We investigate the well-known PRAM model of parallel computation as a practical parallel programming model. The two components of this project are a general-purpose PRAM programming language called Fork95, and a library, called PAD, of efficient, basic parallel algorithms and data structures. We outline the primary features of Fork95 as they apply to the implementation of PAD. We give a brief overview of PAD and sketch the implementation of library routines for prefix sums and bucket sorting. Both language and library can be used with the SB-PRAM, an emulation of the PRAM in hardware.
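As an illustration of the prefix-sums primitive mentioned above, here is a minimal sketch of the standard doubling scan, simulated sequentially in Python. It is not Fork95 code and does not reproduce the PAD routine; the function name `prefix_sums` is an assumption.

```python
def prefix_sums(xs):
    """Inclusive prefix sums by doubling: O(log n) synchronous rounds.

    In round with offset d, position i adds the value held at position i - d.
    Each round builds a fresh list, mimicking a synchronous PRAM step.
    """
    out = list(xs)
    d = 1
    while d < len(out):
        out = [out[i] + out[i - d] if i >= d else out[i]
               for i in range(len(out))]
        d *= 2
    return out


if __name__ == "__main__":
    print(prefix_sums([3, 1, 4, 1, 5]))
```

This variant performs O(n log n) additions in total; work-optimal scans exist, but the doubling form is the shortest to state and check.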
Some Results on Ongoing Research on Parallel Implementation of Graph Algorithms
, 1997
Cited by 2 (2 self)
In high performance computing, three recognized important points are usability, scalability and portability. Until recently, no model seemed to satisfy all three; a few recently proposed models try to fulfill these goals. Among them, the BSP-like CGM model seemed to us well adapted to ease the path from algorithm design to real implementations. Many algorithms have been designed, but few implementations have been carried out to demonstrate the practical relevance of this model. In this article, we test this model on an irregular problem. We present the results of implementations of permutation graph algorithms written in two different models: the PRAM and the BSP-like CGM model. These implementations have been made on a CM-5 and a PC cluster. We compare the results of these implementations with the performance of sequential code for this problem. With a classical problem in graph theory, we validate the BSP-like CGM model: it is possible to write portable code o...
Feasibility, Portability, . . . Grained Graph Algorithms
, 2000
We study the relationship between the design and analysis of graph algorithms in coarse grained parallel models and the behavior of the resulting code on today's parallel machines and clusters. We conclude that the coarse grained multicomputer model (CGM) is well suited to designing competitive algorithms, and that it is therefore now possible to aim to develop portable, predictable and efficient parallel code for graph problems.
The Handling of Graphs on PC Clusters: A Coarse Grained Approach
, 2000
We study the relationship between the design and analysis of graph algorithms in coarse grained parallel models and the behavior of the resulting code on clusters. We conclude that the coarse grained multicomputer model (CGM) is well suited to designing competitive algorithms, and that it is therefore now possible to aim to develop portable, predictable and efficient parallel code for graph problems on clusters.
Finding Connected Components in Graphs
, 1996
In this paper, we describe our implementation of several parallel graph algorithms for finding connected components. Our implementation, with virtual processing, is on a 16,384-processor MasPar MP-1 using the language MPL. We present extensive test data on our code. In our previous projects [21, 22, 23], we reported the implementation of an extensible parallel graph algorithms library. We developed general implementation and fine-tuning techniques without expending too much effort on optimizing each individual routine. We also handled the issue of implementing virtual processing. In this paper, we describe several algorithms and fine-tuning techniques that we developed for the problem of finding connected components in parallel; many of the fine-tuning techniques are of general interest, and should be applicable to code for other problems. We present data on the execution time and memory usage of our various implementations.