Results 1 - 6 of 6
Efficient Low-Contention Parallel Algorithms
 the 1994 ACM Symp. on Parallel Algorithms and Architectures
, 1994
Cited by 31 (12 self)
The queue-read, queue-write (QRQW) parallel random access machine (PRAM) model permits concurrent reading and writing to shared memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. The QRQW PRAM model reflects the contention properties of most commercially available parallel machines more accurately than either the well-studied CRCW PRAM or EREW PRAM models, and can be efficiently emulated with only logarithmic slowdown on hypercube-type non-combining networks. This paper describes fast, low-contention, work-optimal, randomized QRQW PRAM algorithms for the fundamental problems of load balancing, multiple compaction, generating a random permutation, parallel hashing, and distributive sorting. These logarithmic or sublogarithmic time algorithms considerably improve upon the best known EREW PRAM algorithms for these problems, while avoiding the high-contention steps typical of CRCW PRAM algorithms. An illustrative expe...
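As a rough illustration of the cost rule stated in this abstract, the sketch below charges each step the largest number of processors that touch any single memory location. The function names, the representation of a step as a list of addresses, and the unit cost constant are assumptions made for this example; they are not taken from the paper.

```python
from collections import Counter


def step_contention(addresses):
    """Largest number of processors accessing one location in a single step."""
    if not addresses:
        return 0
    return max(Counter(addresses).values())


def qrqw_time(steps):
    """Total time of a sequence of steps under a queue-style cost rule.

    Assumption: a step costs its maximum contention, and at least 1 unit.
    """
    return sum(max(1, step_contention(step)) for step in steps)


# One exclusive-access step, then a step where four processors hit address 7.
steps = [[0, 1, 2], [7, 7, 7, 7]]
print(qrqw_time(steps))
```

Under this rule an exclusive-access step costs one unit, while a step in which k processors queue on the same location costs k units, which is why low-contention algorithms aim to keep k small.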
Optimal Parallel Approximation Algorithms for Prefix Sums and Integer Sorting (Extended Abstract)
Cited by 8 (5 self)
Parallel prefix computation is perhaps the most frequently used subroutine in parallel algorithms today. Its time complexity on the CRCW PRAM is Θ(lg n / lg lg n) using a polynomial number of processors, even in a randomized setting. Nevertheless, there are a number of nontrivial applications that have been shown to be solvable using only an approximate version of the prefix sums problem. In this paper we resolve the issue of approximating parallel prefix by introducing an algorithm that runs in O(lg* n) time with very high probability, using n / lg* n processors, which is optimal in terms of both work and running time. Our approximate prefix sums are guaranteed to come within a factor of (1 + ε) of the values of the true sums in a "consistent fashion", where ε is o(1). We achieve this result through the use of a number of interesting new techniques, such as over-certification and estimate-focusing, as well ...
Approximate Parallel Prefix Computation and Its Applications
, 1993
Cited by 4 (1 self)
In this paper we address two fundamental problems in parallel algorithm design, parallel prefix sums and integer sorting, and show that both of them can be approximately solved very quickly on a randomized CRCW PRAM. In the case of prefix sums the approximation is in terms of the accuracy of the sums and in the case of integer sorting it is in terms of allowing some gaps between consecutive elements in the ordered list. By introducing approximation in these ways we are able to solve these problems in o(lg lg n) time, and thus avoid the near-logarithmic lower bounds by Beame and Hastad that hold for the exact versions of these problems. Nevertheless, we demonstrate that these approximations are strong enough to be used as subroutines in fast randomized algorithms for some well-known problems in parallel computational geometry. Perhaps the most succinct way to describe the power of the new tools which are presented is by observing that prior to this work it was known how to solve the i...
Balanced PRAM Simulations via Moving Threads and Hashing
Cited by 4 (0 self)
We present a novel approach to parallel computing, where (virtual) PRAM processors are represented as lightweight threads, and each physical processor is capable of managing several threads. Instead of moving read and write requests, and replies between processor&memory pairs (and caches), we move the lightweight threads. Consequently, the processor load balancing problem reduces to the problem of producing evenly distributed memory references. In PRAM computations, this can be achieved by properly hashing the shared memory into the processor&memory pairs. We describe the idea of moving threads, and show that the moving threads framework provides a natural validation for Brent's theorem in work-optimal PRAM simulation situations on mesh of trees, coated mesh, and OCPC-based distributed memory machines (DMMs). We prove that an EREW PRAM computation C requiring work W and time T can be implemented work-optimally on those p-processor DMMs with high probability, if W =\Omega (p \De...
Simple Fast Parallel Hashing by Oblivious Execution
 AT&T Bell Laboratories
, 1994
Cited by 4 (2 self)
A hash table is a representation of a set in a linear size data structure that supports constant-time membership queries. We show how to construct a hash table for any given set of n keys in O(lg lg n) parallel time with high probability, using n processors on a weak version of a CRCW PRAM. Our algorithm uses a novel approach of hashing by "oblivious execution" based on probabilistic analysis to circumvent the parity lower bound barrier at the near-logarithmic time level. The algorithm is simple and is sketched as follows: 1. Partition the input set into buckets by a random polynomial of constant degree. 2. For t := 1 to O(lg lg n) do (a) Allocate M_t memory blocks, each of size K_t. (b) Let each bucket select a block at random, and try to injectively map its keys into the block using a random linear function. Buckets that fail carry on to the next iteration. The crux of the algorithm is a careful a priori selection of the parameters M_t and K_t. The algorithm uses only O(lg lg...
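The two steps sketched in this abstract can be simulated sequentially as below. This is a minimal sketch under stated assumptions, not the paper's algorithm: the abstract does not give the parameter schedule, so the choices of M_t and K_t here are arbitrary placeholders, and the prime P, the polynomial degree, the round limit, and all function names are likewise hypothetical. Keys are assumed to be non-negative integers smaller than P.

```python
import random

P = (1 << 61) - 1  # assumed prime modulus; keys must be smaller than this


def random_poly(degree, rng):
    """Random polynomial of the given degree over the integers mod P."""
    coeffs = [rng.randrange(1, P)] + [rng.randrange(P) for _ in range(degree)]

    def h(x):
        acc = 0
        for c in coeffs:  # Horner evaluation
            acc = (acc * x + c) % P
        return acc

    return h


def build_table(keys, rng, max_rounds=64):
    keys = list(set(keys))
    n = max(1, len(keys))
    # Step 1: partition keys into buckets with a constant-degree polynomial.
    h = random_poly(3, rng)
    buckets = {}
    for key in keys:
        buckets.setdefault(h(key) % n, []).append(key)
    active = list(buckets.values())
    placed = []  # entries: (round, block, a, b, slot -> key)
    # Step 2: buckets repeatedly try to claim a block and map into it.
    for t in range(1, max_rounds + 1):
        if not active:
            break
        m_t = 2 * len(active)                        # placeholder for M_t
        k_t = 2 * max(len(b) for b in active) ** 2   # placeholder for K_t
        claims = {}
        for idx in range(len(active)):
            claims.setdefault(rng.randrange(m_t), []).append(idx)
        failed = []
        for block, idxs in claims.items():
            if len(idxs) > 1:  # several buckets chose this block: all fail
                failed.extend(idxs)
                continue
            bucket = active[idxs[0]]
            a, b = rng.randrange(1, P), rng.randrange(P)
            slots = {(a * key + b) % P % k_t: key for key in bucket}
            if len(slots) == len(bucket):  # the linear map was injective
                placed.append((t, block, a, b, slots))
            else:
                failed.append(idxs[0])
        active = [active[i] for i in failed]  # failures carry on
    if active:
        raise RuntimeError("some buckets were not placed; raise max_rounds")
    return h, placed


rng = random.Random(0)
keys = rng.sample(range(10**6), 500)
h, placed = build_table(keys, rng)
print(len(placed), "buckets placed in", max(p[0] for p in placed), "rounds")
```

The sketch only demonstrates the retry structure (claim a random block, test a random linear function for injectivity, carry failures forward). It makes no attempt to reproduce the O(lg lg n) round bound, which the abstract attributes to the careful choice of M_t and K_t.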
An Empirical Analysis of Parallel Random Permutation Algorithms on
Cited by 1 (0 self)
We compare parallel algorithms for random permutation generation on symmetric multiprocessors (SMPs). Algorithms considered are the sorting-based algorithm, Anderson's shuffling algorithm, the dart-throwing algorithm, and Sanders' algorithm. We investigate the impact of synchronization method, memory access pattern, cost of generating random numbers and other parameters on the performance of the algorithms. Within the range of inputs used and processors employed, Anderson's algorithm is preferable due to its simplicity when random number generation is relatively costly, while Sanders' algorithm has superior performance due to good cache performance when a fast random number generator is available. There is no definite winner across all settings. In fact, we predict our new dart-throwing algorithm performs best when synchronization among processors becomes costly and memory access is relatively fast. We also compare the performance of our parallel implementations with the sequential implementation. It is unclear without extensive experimental studies whether fast parallel algorithms beat efficient sequential algorithms due to mismatch between model and architecture. Our implementations achieve speedups up to 6 with 12 processors on the Sun E4500.
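For readers unfamiliar with dart throwing, the sketch below shows the generic textbook idea in a sequential simulation: each item repeatedly picks a random slot in an oversized array until it finds an empty one, and the occupied slots are then read out in order. It is not the authors' new variant, whose details the abstract does not give. The `slack` factor and the function name are assumptions, and a real parallel version would claim slots with an atomic operation rather than a plain loop.

```python
import random


def dart_throwing_permutation(n, rng, slack=2):
    """Return a random permutation of range(n) by generic dart throwing."""
    size = max(1, slack * n)  # board larger than n keeps retries rare
    board = [None] * size
    for item in range(n):
        while True:  # throw darts until an empty slot is hit
            slot = rng.randrange(size)
            if board[slot] is None:
                board[slot] = item
                break
    # Compact the board: reading occupied slots in order gives the result.
    return [item for item in board if item is not None]


perm = dart_throwing_permutation(10, random.Random(1))
print(perm)
```

A larger `slack` lowers the expected number of retries per item at the cost of more memory and a longer compaction pass, which is the kind of trade-off between memory access and synchronization cost that the abstract discusses.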