## Towards Modeling the Performance of a Fast Connected Components Algorithm on Parallel Machines (1996)

Venue: In Proceedings of Supercomputing '95

Citations: 12 (3 self)

### BibTeX

@INPROCEEDINGS{Lumetta96towardsmodeling,

author = {Steven S. Lumetta and Arvind Krishnamurthy and David E. Culler},

title = {Towards Modeling the Performance of a Fast Connected Components Algorithm on Parallel Machines},

booktitle = {In Proceedings of Supercomputing '95},

year = {1996}

}

### Abstract

We present and analyze a portable, high-performance algorithm for finding connected components on modern distributed memory multiprocessors. The algorithm is a hybrid of the classic DFS on the subgraph local to each processor and a variant of the Shiloach-Vishkin PRAM algorithm on the global collection of subgraphs. We implement the algorithm in Split-C and measure performance on the Cray T3D, the Meiko CS-2, and the Thinking Machines CM-5 using a class of graphs derived from cluster dynamics methods in computational physics. On a 256-processor Cray T3D, the implementation outperforms all previous solutions by an order of magnitude. A characterization of graph parameters allows us to select graphs that highlight key performance features. We study the effects of these parameters and machine characteristics on the balance of time between the local and global phases of the algorithm and find that edge density, surface-to-volume ratio, and relative communication cost dominate perform...
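The two-phase structure described in the abstract can be illustrated with a small sequential sketch. This is not the authors' Split-C implementation: the function names (`local_phase`, `global_phase`, `connected_components`) and the `owner` array mapping each node to a processor are hypothetical, and the global phase uses a simplified hook-and-pointer-jump loop in the spirit of Shiloach-Vishkin rather than the paper's variant.

```python
def local_phase(owner, edges):
    """Collapse each processor-local component into one representative node."""
    n = len(owner)
    adj = [[] for _ in range(n)]
    cross = []  # edges whose endpoints live on different processors
    for u, v in edges:
        if owner[u] == owner[v]:
            adj[u].append(v)
            adj[v].append(u)
        else:
            cross.append((u, v))
    rep = [-1] * n
    for start in range(n):
        if rep[start] != -1:
            continue
        rep[start] = start
        stack = [start]  # iterative depth-first search over local edges only
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if rep[y] == -1:
                    rep[y] = start
                    stack.append(y)
    # Reduced global graph: cross edges rewritten between representatives.
    reduced = sorted({(min(rep[u], rep[v]), max(rep[u], rep[v])) for u, v in cross})
    return rep, reduced


def global_phase(num_nodes, edges):
    """Simplified hooking plus pointer jumping (assumption: sequential stand-in)."""
    parent = list(range(num_nodes))
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            ru, rv = parent[u], parent[v]
            if ru == rv:
                continue
            hi, lo = max(ru, rv), min(ru, rv)
            if parent[hi] == hi:  # hook only a current root, under the smaller label
                parent[hi] = lo
                changed = True
        for x in range(num_nodes):  # pointer jumping flattens every tree to a star
            while parent[x] != parent[parent[x]]:
                parent[x] = parent[parent[x]]
    return parent


def connected_components(owner, edges):
    """Label every node with the smallest node id in its component."""
    rep, reduced = local_phase(owner, edges)
    root = global_phase(len(owner), reduced)
    return [root[rep[x]] for x in range(len(owner))]


# Six nodes split across two hypothetical processors; node 5 is isolated.
labels = connected_components([0, 0, 0, 1, 1, 1], [(0, 1), (1, 2), (2, 3), (3, 4)])
print(labels)  # [0, 0, 0, 0, 0, 5]
```

In this sketch the global phase only ever sees one edge, (0, 3), because the local phase has already collapsed nodes 0 to 2 and nodes 3 to 4 into single representatives. That reduction is the reason the abstract gives for separating the local and global phases.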

### Citations

1403 | Robot Vision
- Horn
- 1986
Citation Context: ... connected components of a graph has broad importance in both computer and computational science. In computer vision, for example, edge detection and object recognition depend on connected components [16]. Connected components algorithms have also advanced the study of physical phenomena, including properties of magnetic materials near critical temperatures. However, the problem offers a unique challe...

255 | Nonuniversal critical dynamics in Monte Carlo simulations, Phys
- Swendsen, Wang
- 1987
Citation Context: ...algorithm developed by Greiner [12]; Greiner used 2D30 rather than 2D40 in his work. (S-W), reduce correlation time for the simulations. For example, the correlation time using S-W grows as O(L^0.35) [30, 31] for a two-dimensional Ising model, allowing much larger samples to be studied. At the heart of S-W is a connected components algorithm. The S-W algorithm repeatedly generates a random graph, finds th...

155 | Parallel programming in Split-C
- Culler, Dusseau, et al.
- 1993
Citation Context: ...d generality, but still rarely obtain good performance. We present a fast, portable, general-purpose algorithm for finding connected components on a distributed memory machine. Implemented in Split-C [8], the algorithm is a hybrid of the classic depth-first search on the subgraph local to each processor and a variant of the Shiloach-Vishkin PRAM algorithm on the global collection of subgraphs. On a 2...

116 | An O(log n) parallel connectivity algorithm
- Shiloach, Vishkin
- 1982
Citation Context: ...th-first search. Parallel solutions have received a great deal of attention from theorists, and have proven difficult. Algorithms such as Shiloach-Vishkin obtain good results with the CRCW PRAM model [3, 11, 12, 28], which assumes uniform memory access time and Copyright 1995 by the Association for Computing Machinery, Inc. (ACM). Permission to make digital or hard copies of part or all of this work for personal...

55 | Scaling parallel programs for multiprocessors: methodology and examples
- Singh, Hennessy, et al.
- 1993
Citation Context: ...n mesh dimension and edge probability. For each graph, we scale the size of the graph with the number of processors, so that the nodes per processor is held constant, i.e., memory constrained scaling [29]. For each data point, we average execution time for twenty graph instances with the specified degree and edge probability. The result is presented as a normalized rate: millions of nodes processed pe...

48 | A case for NOWs
- Anderson, Culler, et al.
- 1995
Citation Context: ...bserved speedup on our three platforms and outlines the possibilities for less tightly integrated systems, where greater computational performance is obtained by sacrificing communication performance [1]. Finally, the modeling process serves as a case study to aid in the understanding of other algorithms. The remainder of the paper is structured as follows: Section 2 describes the pieces of our study...

47 | Computing connected components on parallel computers
- Chandra, Sarwate
- 1979
Citation Context: ... and do not necessarily reflect the views of any organization. arbitrary bandwidth to any memory location, but the inherent contention in the algorithm makes even EREW solutions much more challenging [6, 15, 18, 20]. Implementation of the theoretical work has been restricted to shared-memory machines [12] and SIMD machines with very slow processors [12, 17, 23]. Many practical solutions have been developed indep...

46 | An Optimal Randomized Parallel Algorithm for Finding Connected Components in a Graph
- Gazit
- 1991
Citation Context: ...th-first search. Parallel solutions have received a great deal of attention from theorists, and have proven difficult. Algorithms such as Shiloach-Vishkin obtain good results with the CRCW PRAM model [3, 11, 12, 28], which assumes uniform memory access time and ...

45 | Experience with Active Messages on the Meiko CS-2
- Schauser, Scheiman
- 1995
Citation Context: ...-C also gives our implementation portability, with versions running on the Cray T3D, the IBM SP-1 and SP-2, the Intel Paragon, the Thinking Machines CM-5, the Meiko CS-2, and networks of workstations [2, 24, 27, 34]. 2.3 Parallel platforms We consider three large-scale parallel machines: the Cray T3D, the Meiko CS-2, and the Thinking Machines CM-5. These machines offer a range of computational and communication ...

31 | New connectivity and MSF algorithms for Ultracomputer and PRAM
- Awerbuch, Shiloach
- 1987
Citation Context: ...th-first search. Parallel solutions have received a great deal of attention from theorists, and have proven difficult. Algorithms such as Shiloach-Vishkin obtain good results with the CRCW PRAM model [3, 11, 12, 28], which assumes uniform memory access time and ...

26 | Fast connected components algorithms for the EREW PRAM
- Karger, Nisan, et al.
- 1999
Citation Context: ... arbitrary bandwidth to any memory location, but the inherent contention in the algorithm makes even EREW solutions much more challenging [6, 15, 18, 20]. Implementation of the theoretical work has been restricted to shared-memory machines [12] and SIMD machines with very slow processors [12, 17, 23]. Many practical solutions have been developed indep...

25 | Parallel Implementation of Algorithms for Finding Connected Components in Graphs
- Hsu, Ramachandran, et al.
- 1997
Citation Context: ... makes even EREW solutions much more challenging [6, 15, 18, 20]. Implementation of the theoretical work has been restricted to shared-memory machines [12] and SIMD machines with very slow processors [12, 17, 23]. Many practical solutions have been developed independently of theoretical work for modern MIMD massively parallel platforms (MPP's) [7, 10, 13, 14, 21, 26] and vector machines [9, 26]. With the exce...

23 | Cluster Monte Carlo algorithms
- Wang, Swendsen
- 1990
Citation Context: ...algorithm developed by Greiner [12]; Greiner used 2D30 rather than 2D40 in his work. (S-W), reduce correlation time for the simulations. For example, the correlation time using S-W grows as O(L^0.35) [30, 31] for a two-dimensional Ising model, allowing much larger samples to be studied. At the heart of S-W is a connected components algorithm. The S-W algorithm repeatedly generates a random graph, finds th...

22 | Parallel Algorithms for Image Histogramming and Connected Components with an Experimental
- Bader, JáJá
- 1994
Citation Context: ...actical solutions have been developed independently of theoretical work for modern MIMD massively parallel platforms (MPP's) [7, 10, 13, 14, 21, 26] and vector machines [9, 26]. With the exception of [5], which focuses on 2D graphs for robot vision, these solutions typically emphasize performance over portability, scalability, and generality, but still rarely obtain good performance. We present a fas...

22 | Connected components on distributed memory machines
- Krishnamurthy, Lumetta, et al.
- 1994
Citation Context: ...e subgraph local to a processor, resulting in a much smaller graph for the global phase. The optimized algorithm follows. For more detail on the data structures or on the process of optimization, see [22]. 1. Local Phase. Perform a local DFS on each processor's portion of the graph, collapsing each local connected component into a representative node. Mark each component of this global graph with a un...

16 | A comparison of parallel algorithms for connected components
- Greiner
- 1994

11 | Connected component labeling on coarse grain parallel computers: An experimental study
- Choudhary, Thakur
- 1994
Citation Context: ...ines [12] and SIMD machines with very slow processors [12, 17, 23]. Many practical solutions have been developed independently of theoretical work for modern MIMD massively parallel platforms (MPP's) [7, 10, 13, 14, 21, 26] and vector machines [9, 26]. With the exception of [5], which focuses on 2D graphs for robot vision, these solutions typically emphasize performance over portability, scalability, and generality, but...

9 | A Parallel Cluster Labeling Method for Monte Carlo Dynamics
- Flanigan, Tamayo
- 1992
Citation Context: ...ines [12] and SIMD machines with very slow processors [12, 17, 23]. Many practical solutions have been developed independently of theoretical work for modern MIMD massively parallel platforms (MPP's) [7, 10, 13, 14, 21, 26] and vector machines [9, 26]. With the exception of [5], which focuses on 2D graphs for robot vision, these solutions typically emphasize performance over portability, scalability, and generality, but...

8 | Implementing an Efficient Portable Global Memory Layer on Distributed Memory Multiprocessors
- Luna
- 1994
Citation Context: ...-C also gives our implementation portability, with versions running on the Cray T3D, the IBM SP-1 and SP-2, the Intel Paragon, the Thinking Machines CM-5, the Meiko CS-2, and networks of workstations [2, 24, 27, 34]. 2.3 Parallel platforms We consider three large-scale parallel machines: the Cray T3D, the Meiko CS-2, and the Thinking Machines CM-5. These machines offer a range of computational and communication ...

7 | Empirical Evaluation of the CRAY T3D: A compiler perspective
- Arpaci, Culler, et al.
- 1995
Citation Context: ...-C also gives our implementation portability, with versions running on the Cray T3D, the IBM SP-1 and SP-2, the Intel Paragon, the Thinking Machines CM-5, the Meiko CS-2, and networks of workstations [2, 24, 27, 34]. 2.3 Parallel platforms We consider three large-scale parallel machines: the Cray T3D, the Meiko CS-2, and the Thinking Machines CM-5. These machines offer a range of computational and communication ...

6 | Connected-Components Algorithms for Mesh-Connected Parallel Computers
- Kumar, Goddard, et al.
- 1997
Citation Context: ... makes even EREW solutions much more challenging [6, 15, 18, 20]. Implementation of the theoretical work has been restricted to shared-memory machines [12] and SIMD machines with very slow processors [12, 17, 23]. Many practical solutions have been developed independently of theoretical work for modern MIMD massively parallel platforms (MPP's) [7, 10, 13, 14, 21, 26] and vector machines [9, 26]. With the exce...

4 | A study of connected component labeling algorithms on the MPP
- Hambrusch, TeWinkel
- 1988
Citation Context: ...ines [12] and SIMD machines with very slow processors [12, 17, 23]. Many practical solutions have been developed independently of theoretical work for modern MIMD massively parallel platforms (MPP's) [7, 10, 13, 14, 21, 26] and vector machines [9, 26]. With the exception of [5], which focuses on 2D graphs for robot vision, these solutions typically emphasize performance over portability, scalability, and generality, but...

4 | A Vectorized Algorithm for Cluster Formation in the Swendsen-Wang Dynamics
- Mino
- 1991

2 | Finding Connected Components in O(log n log log n
- Chong, Lam
- 1993
Citation Context: ... arbitrary bandwidth to any memory location, but the inherent contention in the algorithm makes even EREW solutions much more challenging [6, 15, 18, 20]. Implementation of the theoretical work has been restricted to shared-memory machines [12] and SIMD machines with very slow processors [12, 17, 23]. Many practical solutions have been developed indep...

2 | Vectorized Cluster Search
- Evertz
- 1992
Citation Context: ...rocessors [12, 17, 23]. Many practical solutions have been developed independently of theoretical work for modern MIMD massively parallel platforms (MPP's) [7, 10, 13, 14, 21, 26] and vector machines [9, 26]. With the exception of [5], which focuses on 2D graphs for robot vision, these solutions typically emphasize performance over portability, scalability, and generality, but still rarely obtain good pe...

2 | Swendsen-Wang Dynamics of Large 2D Critical Ising Models
- Kertész, Stauffer
- 1992

1 | Parallelization of the 2D Swendsen-Wang Algorithm
- Hackl, Matuttis, et al.
- 1993

1 | Connected Components in O(log^(3/2) n) Parallel Time for the CREW PRAM
- Johnson, Metaxas
- 1991

1 | Split-C on the Meiko CS-2, http://HTTP.CS.Berkeley.EDU/~chad/meiko.ps
- Yoshikawa
- 1995

1 | Empirical Evaluation of the Cray T3D: a compiler perspective, to appear
- Arpaci et al.
- 1995
Citation Context: ...ves our implementation portability. Versions of Split-C exist on the Cray T3D, the IBM SP-1 and SP-2, the Intel Paragon, the Thinking Machines Corp. CM-5, the Meiko CS-2, and networks of workstations [2, 20, 23, 29]. Although our algorithm accepts arbitrary graphs as input, obtaining optimal performance requires a reasonable partitioning of the graph across processors to enhance locality and load balancing. Part...

1 | Experience with Active Messages on the Meiko CS-2, to appear
- Schauser, Scheiman
- 1995
Citation Context: ...ves our implementation portability. Versions of Split-C exist on the Cray T3D, the IBM SP-1 and SP-2, the Intel Paragon, the Thinking Machines Corp. CM-5, the Meiko CS-2, and networks of workstations [2, 20, 23, 29]. Although our algorithm accepts arbitrary graphs as input, obtaining optimal performance requires a reasonable partitioning of the graph across processors to enhance locality and load balancing. Part...

1 | Split-C on the Meiko CS-2, http://www.CS.Berkeley.EDU/~chad/meiko.ps
- Yoshikawa
- 1995
Citation Context: ...ves our implementation portability. Versions of Split-C exist on the Cray T3D, the IBM SP-1 and SP-2, the Intel Paragon, the Thinking Machines Corp. CM-5, the Meiko CS-2, and networks of workstations [2, 20, 23, 29]. Although our algorithm accepts arbitrary graphs as input, obtaining optimal performance requires a reasonable partitioning of the graph across processors to enhance locality and load balancing. Part...