Results 1-10 of 37
Programming Parallel Algorithms
, 1996
Abstract

Cited by 191 (9 self)
In the past 20 years there has been tremendous progress in developing and analyzing parallel algorithms. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some of these algorithms are efficient only in a theoretical framework, many are quite efficient in practice or have key ideas that have been used in efficient implementations. This research on parallel algorithms has not only improved our general understanding of parallelism but in several cases has led to improvements in sequential algorithms. Unfortunately there has been less success in developing good languages for programming parallel algorithms, particularly languages that are well suited for teaching and prototyping algorithms. There has been a large gap between languages
LoPC: Modeling Contention in Parallel Algorithms
, 1997
Abstract

Cited by 45 (9 self)
Parallel algorithm designers need computational models that take first-order system costs into account but are also simple enough to use in practice. This paper introduces the LoPC model, which is inspired by the LogP model but accounts for contention for message-processing resources in parallel algorithms on a multiprocessor or network of workstations. LoPC takes the L, o, and P parameters directly from the LogP model and uses them to predict the cost of contention, C.
Accounting for memory bank contention and delay in high-bandwidth multiprocessors
 In Proc. 7th ACM Symp. on Parallel Algorithms and Architectures
, 1997
Abstract

Cited by 32 (5 self)
For years, the computation rate of processors has been much faster than the access rate of memory banks, and this divergence in speeds has been constantly increasing in recent years. As a result, several shared-memory multiprocessors consist of more memory banks than processors. The object of this paper is to provide a simple model (with only a few parameters) for the design and analysis of irregular parallel algorithms that will give a reasonable characterization of performance on such machines. For this purpose, we extend Valiant's bulk-synchronous parallel (BSP) model with two parameters: a parameter for memory bank delay, the minimum time for servicing requests at a bank, and a parameter for memory bank expansion, the ratio of the number of banks to the number of processors. We call this model the (d, x)-BSP. We show experimentally that the (d, x)-BSP captures the impact of bank contention and delay on the CRAY C90 and J90 for irregular access patterns, without modeling machine-specific details of these machines. The model has clarified the performance characteristics of several unstructured algorithms on the CRAY C90 and J90, and allowed us to explore tradeoffs and optimizations for these algorithms. In addition to modeling individual algorithms directly, we also consider the use of the (d, x)-BSP as a bridging model for emulating a very high-level abstract model, the Parallel Random Access Machine (PRAM). We provide matching upper and lower bounds for emulating the EREW and QRQW PRAMs on the (d, x)-BSP.
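The two parameters named in this abstract can be illustrated with a small sketch. It is not the paper's cost formula: it only computes a lower bound on the time one batch of memory requests needs when the busiest bank must serve its requests one per d time units. The function names and the modulo address-to-bank mapping are assumptions made for illustration, and x is assumed to be an integer here.

```python
from collections import Counter


def bank_loads(addresses, num_procs, x):
    """Count requests per bank, assuming (hypothetically) address % num_banks mapping."""
    num_banks = x * num_procs  # expansion x = banks / processors
    return Counter(addr % num_banks for addr in addresses)


def min_bank_time(addresses, num_procs, x, d):
    """Lower bound on service time: the busiest bank needs d time units per request."""
    loads = bank_loads(addresses, num_procs, x)
    return d * max(loads.values(), default=0)


# Four requests on a 4-processor machine with bank delay d = 6.
requests = [0, 8, 16, 1]
print(min_bank_time(requests, num_procs=4, x=2, d=6))  # 8 banks: three requests hit bank 0
print(min_bank_time(requests, num_procs=4, x=4, d=6))  # 16 banks: the load spreads out
```

Under these assumptions, a larger expansion x spreads the same irregular access pattern over more banks and lowers the bound, while a larger delay d scales it up.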
Efficient Low-Contention Parallel Algorithms
 the 1994 ACM Symp. on Parallel Algorithms and Architectures
, 1994
Abstract

Cited by 31 (12 self)
The queue-read, queue-write (QRQW) parallel random access machine (PRAM) model permits concurrent reading and writing to shared memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. The QRQW PRAM model reflects the contention properties of most commercially available parallel machines more accurately than either the well-studied CRCW PRAM or EREW PRAM models, and can be efficiently emulated with only logarithmic slowdown on hypercube-type non-combining networks. This paper describes fast, low-contention, work-optimal, randomized QRQW PRAM algorithms for the fundamental problems of load balancing, multiple compaction, generating a random permutation, parallel hashing, and distributive sorting. These logarithmic or sublogarithmic time algorithms considerably improve upon the best known EREW PRAM algorithms for these problems, while avoiding the high-contention steps typical of CRCW PRAM algorithms. An illustrative expe...
Parallel Sorting With Limited Bandwidth
 in Proc. 7th ACM Symp. on Parallel Algorithms and Architectures
, 1995
Abstract

Cited by 26 (5 self)
We study the problem of sorting on a parallel computer with limited communication bandwidth. By using the recently proposed PRAM(m) model, where p processors communicate through a small, globally shared memory consisting of m bits, we focus on the tradeoff between the amount of local computation and the amount of interprocessor communication required for parallel sorting algorithms. We prove a lower bound of $\Omega(n \log m / m)$ on the time to sort n numbers in an exclusive-read variant of the PRAM(m) model. We show that Leighton's Columnsort can be used to give an asymptotically matching upper bound in the case where m grows as a fractional power of n. The bounds are of a surprising form, in that they have little dependence on the parameter p. This implies that attempting to distribute the workload across more processors while holding the problem size and the size of the shared memory fixed will not improve the optimal running time of sorting in this model. We also show that bot...
Parallel Implementation of Algorithms for Finding Connected Components in Graphs
, 1997
Abstract

Cited by 25 (1 self)
In this paper, we describe our implementation of several parallel graph algorithms for finding connected components. Our implementation, with virtual processing, is on a 16,384-processor MasPar MP-1 using the language MPL. We present extensive test data on our code. In our previous projects [21, 22, 23], we reported the implementation of an extensible parallel graph algorithms library. We developed general implementation and fine-tuning techniques without expending too much effort on optimizing each individual routine. We also handled the issue of implementing virtual processing. In this paper, we describe several algorithms and fine-tuning techniques that we developed for the problem of finding connected components in parallel; many of the fine-tuning techniques are of general interest, and should be applicable to code for other problems. We present data on the execution time and memory usage of our various implementations.
The Queue-Read Queue-Write PRAM Model: Accounting for Contention in Parallel Algorithms
 Proc. 5th ACM-SIAM Symp. on Discrete Algorithms
, 1997
Abstract

Cited by 23 (10 self)
This paper introduces the queue-read queue-write (QRQW) parallel random access machine (PRAM) model, which permits concurrent reading and writing to shared-memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. Prior to this work there were no formal complexity models that accounted for the contention to memory locations, despite its large impact on the performance of parallel programs. The QRQW PRAM model reflects the contention properties of most commercially available parallel machines more accurately than either the well-studied CRCW PRAM or EREW PRAM models: the CRCW model does not adequately penalize algorithms with high contention to shared-memory locations, while the EREW model is too strict in its insistence on zero contention at each step. The QRQW PRAM is strictly more powerful than the EREW PRAM. This paper shows a separation of log n between the two models, and presents faster and more efficient QRQW algorithms for several basic problems, such as linear compaction, leader election, and processor allocation. Furthermore, we present a work-preserving emulation of the QRQW PRAM with only logarithmic slowdown on Valiant's BSP model, and hence on hypercube-type non-combining networks, even when latency, synchronization, and memory granularity overheads are taken into account. This matches the best-known emulation result for the EREW PRAM, and considerably improves upon the best-known efficient emulation for the CRCW PRAM on such networks. Finally, the paper presents several lower bound results for this model, including lower bounds on the time required for broadcasting and for leader election.
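The contrast among the three models described in this abstract can be sketched in a few lines. This is a simplified illustration rather than the paper's formal definition: it charges a step only for its memory contention and ignores local computation, and the function names and the list-of-locations input format are assumptions.

```python
from collections import Counter


def max_contention(accesses):
    """Largest number of accesses aimed at any single memory location in one step."""
    counts = Counter(accesses)
    return max(counts.values(), default=0)


def step_cost(accesses, model="qrqw"):
    """Charge for one step under a simplified reading of each PRAM variant."""
    k = max_contention(accesses)
    if model == "crcw":
        return 1  # concurrent access is free, so contention is not penalized
    if model == "erew":
        if k > 1:
            raise ValueError("EREW forbids more than one access per location")
        return 1
    if model == "qrqw":
        return max(1, k)  # accesses to one location queue up
    raise ValueError(f"unknown model: {model}")


# Three processors touch location 0 and one touches location 1.
step = [0, 0, 0, 1]
print(step_cost(step, "crcw"))
print(step_cost(step, "qrqw"))
```

In this sketch the same step is unit cost under CRCW, illegal under EREW, and costs as much as the longest queue under QRQW, which is the middle ground the abstract describes.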
Models and Resource Metrics for Parallel and Distributed Computation
 PROC. 28TH ANNUAL HAWAII INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES
, 1989
Abstract

Cited by 20 (4 self)
This paper presents a framework of using resource metrics to characterize the various models of parallel computation. Our framework reflects the approach of recent models to abstract architectural details into several generic parameters, which we call resource metrics. We examine the different resource metrics chosen by different parallel models, categorizing the models into four classes: the basic synchronous models, and extensions of the basic models which more accurately reflect practical machines by incorporating notions of asynchrony, communication cost and memory hierarchy. We then present a new parallel computation model, the LogP-HMM model, as an illustration of design principles based on the framework of resource metrics. The LogP-HMM model extends an existing parameterized network model (LogP) with a sequential hierarchical memory model (HMM) characterizing each processor. The result accurately captures both network communication costs and the effects of multileveled memory ...
Delayed path coupling and generating random permutations via distributed stochastic processes
, 1999
Abstract

Cited by 19 (3 self)
We analyze various stochastic processes for generating permutations almost uniformly at random in distributed and parallel systems. All our protocols are simple, elegant and are based on performing disjoint transpositions executed in parallel. The challenging problem of our concern is to prove that the output configurations in our processes reach almost uniform probability distribution very rapidly, i.e. in a (low) polylogarithmic time. For the analysis of the aforementioned protocols we develop a novel technique, called delayed path coupling, for proving rapid mixing of Markov chains. Our approach is an extension of the path coupling method of Bubley and Dyer. We apply delayed path coupling to three stochastic processes for generating random permutations. For one