Parallel Implementation of Algorithms for Finding Connected Components in Graphs
, 1997
Cited by 22 (1 self)
In this paper, we describe our implementation of several parallel graph algorithms for finding connected components. Our implementation, with virtual processing, is on a 16,384-processor MasPar MP-1 using the language MPL. We present extensive test data on our code. In our previous projects [21, 22, 23], we reported the implementation of an extensible parallel graph algorithms library. We developed general implementation and fine-tuning techniques without expending too much effort on optimizing each individual routine. We also handled the issue of implementing virtual processing. In this paper, we describe several algorithms and fine-tuning techniques that we developed for the problem of finding connected components in parallel; many of the fine-tuning techniques are of general interest, and should be applicable to code for other problems. We present data on the execution time and memory usage of our various implementations.
The Queue-Read Queue-Write PRAM Model: Accounting for Contention in Parallel Algorithms
- Proc. 5th ACM-SIAM Symp. on Discrete Algorithms
, 1997
Cited by 21 (10 self)
This paper introduces the queue-read queue-write (QRQW) parallel random access machine (PRAM) model, which permits concurrent reading and writing to shared-memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. Prior to this work there were no formal complexity models that accounted for the contention to memory locations, despite its large impact on the performance of parallel programs. The QRQW PRAM model reflects the contention properties of most commercially available parallel machines more accurately than either the well-studied CRCW PRAM or EREW PRAM models: the CRCW model does not adequately penalize algorithms with high contention to shared-memory locations, while the EREW model is too strict in its insistence on zero contention at each step. The QRQW PRAM is strictly more powerful than the EREW PRAM. This paper shows a separation of log n between the two models, and presents faster and more efficient QRQW algorithms for several basic problems, such as linear compaction, leader election, and processor allocation. Furthermore, we present a work-preserving emulation of the QRQW PRAM with only logarithmic slowdown on Valiant’s BSP model, and hence on hypercube-type noncombining networks, even when latency, synchronization, and memory granularity overheads are taken into account. This matches the best-known emulation result for the EREW PRAM, and considerably improves upon the best-known efficient emulation for the CRCW PRAM on such networks. Finally, the paper presents several lower bound results for this model, including lower bounds on the time required for broadcasting and for leader election.
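The contention rule in this abstract can be illustrated with a small sketch. It is not taken from the paper: the function names and the exact cost rule (one time unit per access queued at the busiest location, with a minimum of one unit) are assumptions made only to contrast the three models the abstract mentions.

```python
from collections import Counter


def max_contention(accesses):
    """Largest number of accesses aimed at any single memory location.

    `accesses` lists the addresses touched in one step, one entry per
    processor access (a hypothetical encoding chosen for this sketch).
    """
    counts = Counter(accesses)
    return max(counts.values(), default=0)


def step_cost(accesses, model="qrqw"):
    """Illustrative cost of one step under a simplified cost rule (assumed)."""
    k = max_contention(accesses)
    if model == "erew":
        # EREW insists on zero contention: each location has at most one access.
        if k > 1:
            raise ValueError("EREW forbids concurrent access to one location")
        return 1
    if model == "crcw":
        # CRCW does not penalize contention: every step costs one unit.
        return 1
    # QRQW: accesses to one location queue up, so cost grows with contention.
    return max(1, k)
```

For example, `step_cost([5, 5, 5, 2])` charges 3 under the queued rule, while the same step costs 1 under `model="crcw"` and is rejected under `model="erew"`.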
The Evaluation of Massively Parallel Array Architectures
, 1994
Cited by 13 (7 self)
Computer Science. To the memory of my mother. Acknowledgments: This dissertation would not have been possible without the help of many people. First, I would like to thank my committee for their many helpful comments and suggestions. Specifically, I thank Al Hanson, who taught me about computer vision; Wayne Burleson, who taught me about VLSI; and Don Towsley, who taught me about performance evaluation. Most especially, I’d like to thank my committee chair and my advisor and mentor for my entire graduate career, Chip Weems. Besides teaching me about architecture and writing, he suggested the final form of the topic and pulled me out of many blind alleys, and his vast store of knowledge was a constant help. Many other professors at UMass also contributed to my knowledge of computer science and so helped me with this dissertation. I would especially like to thank Arny Rosenberg, who not only taught me theory but, more importantly, how and where to apply it, and Ed Riseman, whose boundless energy and optimism serve as a model for all of us. The first level of discussion and comments is always with the fellow graduate students in one’s
Implementation of Parallel Graph Algorithms on a Massively Parallel SIMD Computer with Virtual Processing
, 1995
Cited by 10 (3 self)
We describe our implementation of several PRAM graph algorithms on the massively parallel computer MasPar MP-1 with 16,384 processors. Our implementation incorporated virtual processing and we present extensive test data. In a previous project [13], we reported the implementation of a set of parallel graph algorithms with the constraint that the maximum input size was restricted to be no more than the physical number of processors on the MasPar. The MasPar language MPL that we used for our code does not support virtual processing. In this paper, we describe a method of simulating virtual processors on the MasPar. We re-coded and fine-tuned our earlier parallel graph algorithms to incorporate the usage of virtual processors. Under the current implementation scheme, there is no limit on the number of virtual processors that one can use in the program as long as there is enough main memory to store all the data required during the computation. We also give two general optimization techniq...
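The abstract does not spell out how virtual processors are simulated, so the following is only a generic sketch of the idea, not the authors' MPL scheme. It assumes a hypothetical layout in which virtual processor `v` lives on physical processor `v % num_physical` in layer `v // num_physical`, and the layers are processed one after another.

```python
import math


def num_layers(num_virtual, num_physical):
    """Number of virtual processors each physical processor must emulate."""
    return math.ceil(num_virtual / num_physical)


def run_virtual(num_virtual, num_physical, op, data):
    """Apply `op` to one data item per virtual processor.

    The outer loop runs sequentially over layers. The inner loop stands in
    for the physical processors, which would act in lock step on a SIMD
    machine. The layout and names here are assumptions for illustration.
    """
    out = list(data)
    for layer in range(num_layers(num_virtual, num_physical)):
        for pe in range(num_physical):
            v = layer * num_physical + pe
            if v < num_virtual:  # processors with no work in the last layer stay idle
                out[v] = op(data[v])
    return out
```

With 10 virtual processors on 4 physical ones, `num_layers(10, 4)` is 3, and the last layer keeps only two physical processors busy. Memory per physical processor grows with the number of layers, which matches the abstract's remark that the limit is available main memory rather than processor count.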
Efficient Massively Parallel Implementation of Some Combinatorial Algorithms
, 1996
Cited by 9 (2 self)
We describe our implementation of several efficient parallel algorithms on the massively parallel SIMD machine MasPar MP-1 with virtual processing. The MPL language that we used on the MasPar MP-1 does not support virtual processing. In this paper, we describe the implementation of virtual processing for several combinatorial algorithms using the MPL language. We present our data allocation scheme for virtual processing and code rewriting rules for converting a code that uses no virtual processors into a code with virtual processing. We then describe the implementation of virtual processing and the fine-tuning of a set of commonly used routines. In coding these routines, we tried different underlying (deterministic and randomized) algorithms. We present the performance data for our different implementations. We also compared the performance of several of the parallel routines with their sequential implementations. The performance of our code tracks theoretical predictions quite well fo...
Special Issue on Group Communication Systems
- In Communications of the ACM
, 1996
Cited by 7 (4 self)
The high latency of memory operations is a problem in both sequential and parallel computing. Multithreading is a technique that can be used to eliminate the delays caused by this latency: a processor executes other processes (threads) while one process is waiting for the completion of a memory operation. In this paper we investigate the implementation of multithreading at the processor level. As a result, we outline and evaluate a MultiThreaded VLIW processor Architecture with functional unit Chaining (MTAC), which is specially designed for PRAM-style parallelism. According to our experiments, MTAC offers remarkably better performance than a basic pipelined RISC architecture, and chaining improves the exploitation of instruction-level parallelism to a level where the achieved speedup corresponds to the number of functional units in a processor.
An Environment for Evaluating Architectures for Spatially Mapped Computation: System Architecture and Preliminary Results
, 1993
Cited by 5 (3 self)
An environment which addresses several problems in evaluating massively parallel array architectures is described. A realistic workload including a series of applications currently being used as building blocks in vision research has been constructed. Both flexibility in architectural parameter selection and simulation efficiency are maintained by combining virtual machine emulation with trace driven simulation. The trade-off between fairness to diverse target architectures and programmability of the test programs is addressed through the use of operator and application libraries. Initial results are presented indicating the appropriate balance between register file and cache to optimize performance under varying levels of processor element virtualization. This paper also appears in the Proceedings of Computer Architectures for Machine Perception '93. Authors' address: Department of Computer Science; University of Massachusetts; Amherst, MA 01003; NetAd: {herbordt,weems}@cs.uma...
Comparison of MasPar MP-1 and MP-2 Communication Operations
- Institut für Programmstrukturen und Datenorganisation, Fakultät für Informatik, Universität
, 1993
Cited by 3 (0 self)
Report 01/93 [Pre93] describes the findings of a series of communication measurements performed on a MasPar MP-1 series MP-1216A machine. The current report covers the same measurements performed on an MP-2 series MP-2216 machine. It compares the results and outlines and discusses the main differences. In these measurements, raw router communication was sometimes faster, sometimes slower on the MP-2 than on the MP-1, depending on the parameters of the communication requested. The relative performance of the MP-2 varied between 93% and 120%. Xnet communication was faster in all cases (performance 100% to 175%). Complex functions from the communication library were also always faster (performance 100% to 180%). Some of these results contradict technical specifications for the MP-1 and MP-2 published by MasPar.
A Data Parallel Augmenting Path Algorithm for the Dense Linear Many-To-One Assignment Problem
, 1995
The purpose of this study is to describe a data parallel primal-dual augmenting path algorithm for the dense linear many-to-one assignment problem also known as semi-assignment. This problem could for instance be described as assigning n persons to m( n) job groups. The algorithm is tailored specifically for massive SIMD parallelism and employs, in this context, a new efficient breadth-first-search augmenting path technique which is shown to be faster than the shortest augmenting path search normally used in sequential algorithms for this problem. We show that the best known sequential computational complexity of O(mn^2) for dense problems, is reduced to the parallel complexity of O(mn), on a machine with n processors supporting reductions in O(1) time. The algorithm is easy to implement efficiently on commercially available massively parallel computers. A range of numerical experiments are performed on a Connection Machine CM200 and a MasPar MP-2. The tests show the good performa...

