Results 1–10 of 11
Efficient Low-Contention Parallel Algorithms
1996
Abstract

Cited by 34 (14 self)
The queue-read, queue-write (QRQW) parallel random access machine (PRAM) model permits concurrent reading and writing to shared memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. The QRQW PRAM model reflects the contention properties of most commercially available parallel machines more accurately than either the well-studied CRCW PRAM or EREW PRAM models, and can be efficiently emulated with only logarithmic slowdown on hypercube-type non-combining networks. This paper describes fast, low-contention, work-optimal, randomized QRQW PRAM algorithms for the fundamental problems of load balancing, multiple compaction, generating a random permutation, parallel hashing, and distributive sorting. These logarithmic or sublogarithmic time algorithms considerably improve upon the best known EREW PRAM algorithms for these problems, while avoiding the high-contention steps typical of CRCW PRAM algorithms. An illustrative experiment demonstrates the performance advantage of a new QRQW random permutation algorithm when compared with the popular EREW algorithm. Finally, this paper presents new randomized algorithms for integer sorting and general sorting.
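The QRQW cost measure in the abstract above charges each step the length of the longest access queue at any single memory location. A minimal sketch of that accounting (the function name and list-of-addresses representation are ours, not the paper's):

```python
from collections import Counter

def qrqw_step_cost(accesses):
    """Contention cost of one QRQW PRAM step: proportional to the
    maximum number of processors accessing any single shared-memory
    location in that step (the longest queue)."""
    if not accesses:
        return 0
    queues = Counter(accesses)          # queue length per location
    return max(queues.values())

# One step where 4 processors read location 7 and one reads location 3:
print(qrqw_step_cost([7, 7, 7, 7, 3]))  # prints 4 (EREW forbids this
                                        # step; CRCW would charge 1)
```

This makes the model's position concrete: the same step costs 1 on a CRCW PRAM, is illegal on an EREW PRAM, and costs 4 here.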
Optimal Parallel Approximation Algorithms for Prefix Sums and Integer Sorting (Extended Abstract)
Abstract

Cited by 9 (5 self)
Parallel prefix computation is perhaps the most frequently used subroutine in parallel algorithms today. Its time complexity on the CRCW PRAM is Θ(lg n / lg lg n) using a polynomial number of processors, even in a randomized setting. Nevertheless, there are a number of nontrivial applications that have been shown to be solvable using only an approximate version of the prefix sums problem. In this paper we resolve the issue of approximating parallel prefix by introducing an algorithm that runs in O(lg n) time with very high probability, using n / lg n processors, which is optimal in terms of both work and running time. Our approximate prefix sums are guaranteed to come within a factor of (1 + ε) of the values of the true sums in a "consistent fashion", where ε is o(1). We achieve this result through the use of a number of interesting new techniques, such as over-certification and estimate-focusing, as well ...
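The (1 + ε) guarantee quoted above can be phrased as a simple check. A hypothetical verifier, assuming the common one-sided convention that each approximate sum never undershoots the true sum; the paper's exact "consistent fashion" condition may be stronger than what is checked here:

```python
def check_approx_prefix_sums(xs, approx, eps):
    """Check that approx[i] lies between the true prefix sum
    x[0]+...+x[i] and (1 + eps) times it -- a one-sided reading
    of the (1 + eps)-factor guarantee in the abstract."""
    true = 0.0
    for x, a in zip(xs, approx):
        true += x
        if not (true <= a <= (1 + eps) * true):
            return False
    return True

# True sums of [1, 2, 3] are [1, 3, 6]; each estimate is within 10%:
check_approx_prefix_sums([1, 2, 3], [1.05, 3.1, 6.2], eps=0.1)
```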
Balanced PRAM Simulations via Moving Threads and Hashing
Abstract

Cited by 4 (0 self)
We present a novel approach to parallel computing, where (virtual) PRAM processors are represented as lightweight threads, and each physical processor is capable of managing several threads. Instead of moving read and write requests, and replies, between processor&memory pairs (and caches), we move the lightweight threads. Consequently, the processor load balancing problem reduces to the problem of producing evenly distributed memory references. In PRAM computations, this can be achieved by properly hashing the shared memory into the processor&memory pairs. We describe the idea of moving threads, and show that the moving threads framework provides a natural validation for Brent's theorem in work-optimal PRAM simulation situations on mesh of trees, coated mesh, and OCPC based distributed memory machines (DMMs). We prove that an EREW PRAM computation C requiring work W and time T can be implemented work-optimally on those p-processor DMMs with high probability, if W = Ω(p ...
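The load-balancing step described above rests on hashing shared-memory addresses evenly across the p processor&memory pairs. A minimal sketch using a random linear hash over a prime field, a standard universal family; the abstract says only "one randomly chosen hash function", so the paper's exact family is an assumption here:

```python
import random

def make_memory_hash(p, prime=2_147_483_647):
    """Return a randomly chosen hash mapping shared-memory addresses
    (integers in [0, prime)) to one of p processor&memory pairs.
    A linear hash over a prime field is an assumed, standard choice."""
    a = random.randrange(1, prime)
    b = random.randrange(prime)
    return lambda addr: ((a * addr + b) % prime) % p

h = make_memory_hash(8)   # distribute addresses over 8 memory modules
module = h(123456)        # some module index in 0..7
```

Because the threads, not the memory requests, move, an even spread of hashed addresses directly yields an even spread of threads over the physical processors.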
Approximate parallel prefix computations and its applications
in Proc. 7th International Parallel Processing Symposium (IEEE, Los Alamitos, 1993)
Simple Fast Parallel Hashing by Oblivious Execution
AT&T Bell Laboratories, 1994
Abstract

Cited by 3 (1 self)
A hash table is a representation of a set in a linear size data structure that supports constant-time membership queries. We show how to construct a hash table for any given set of n keys in O(lg lg n) parallel time with high probability, using n processors on a weak version of a CRCW PRAM. Our algorithm uses a novel approach of hashing by "oblivious execution" based on probabilistic analysis to circumvent the parity lower bound barrier at the near-logarithmic time level. The algorithm is simple and is sketched by the following:
1. Partition the input set into buckets by a random polynomial of constant degree.
2. For t := 1 to O(lg lg n) do
   (a) Allocate Mt memory blocks, each of size Kt.
   (b) Let each bucket select a block at random, and try to injectively map its keys into the block using a random linear function. Buckets that fail carry on to the next iteration.
The crux of the algorithm is a careful a priori selection of the parameters Mt and Kt. The algorithm uses only O(lg lg...
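The two steps sketched above can be simulated sequentially. In this illustration the block schedule is a placeholder: each failing bucket simply retries with a fresh block of size 2m² + 1 for m keys, which makes every attempt succeed with probability above 1/2 by a birthday-bound argument. The paper's actual contribution is precisely the careful a priori choice of Mt and Kt, which this sketch does not reproduce:

```python
import random

def oblivious_hash(keys, prime=2_147_483_647):
    """Sequential simulation of the two-phase scheme: bucket by a
    random degree-2 polynomial, then place each bucket injectively
    in its own block via a random linear function, retrying on
    failure. Assumes distinct integer keys in [0, prime)."""
    n = len(keys)
    nb = max(1, n)                               # number of buckets (illustrative)
    c = [random.randrange(1, prime) for _ in range(3)]
    bucket_of = lambda k: (c[0] + c[1] * k + c[2] * k * k) % prime % nb
    buckets = {}
    for k in keys:
        buckets.setdefault(bucket_of(k), []).append(k)
    table = {}                                   # key -> (block id, slot)
    pending = list(buckets.values())
    while pending:                               # step 2: retry failed buckets
        still = []
        for bucket in pending:
            K = 2 * len(bucket) ** 2 + 1         # block size (illustrative)
            a = random.randrange(1, prime)
            b = random.randrange(prime)
            slots = [((a * k + b) % prime) % K for k in bucket]
            if len(set(slots)) == len(bucket):   # injective: place the bucket
                blk = len(table)                 # fresh block id per bucket
                for k, s in zip(bucket, slots):
                    table[k] = (blk, s)
            else:
                still.append(bucket)             # failed: carry on
        pending = still
    return table
```

Looking up a key then costs one polynomial and one linear evaluation, matching the constant-time membership query claim.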
An Empirical Analysis of Parallel Random Permutation Algorithms on
Abstract

Cited by 3 (0 self)
We compare parallel algorithms for random permutation generation on symmetric multiprocessors (SMPs). Algorithms considered are the sorting-based algorithm, Anderson's shuffling algorithm, the dart-throwing algorithm, and Sanders' algorithm. We investigate the impact of synchronization method, memory access pattern, cost of generating random numbers, and other parameters on the performance of the algorithms. Within the range of inputs used and processors employed, Anderson's algorithm is preferable due to its simplicity when random number generation is relatively costly, while Sanders' algorithm has superior performance due to good cache performance when a fast random number generator is available. There is no definite winner across all settings. In fact we predict our new dart-throwing algorithm performs best when synchronization among processors becomes costly and memory access is relatively fast. We also compare the performance of our parallel implementations with the sequential implementation. It is unclear without extensive experimental studies whether fast parallel algorithms beat efficient sequential algorithms due to mismatch between model and architecture. Our implementations achieve speedups up to 6 with 12 processors on the Sun E4500.
Sequential Random Permutation, List Contraction and Tree Contraction are Highly Parallel
Abstract

Cited by 1 (1 self)
We show that simple sequential randomized iterative algorithms for random permutation, list contraction, and tree contraction are highly parallel. In particular, if iterations of the algorithms are run as soon as all of their dependencies have been resolved, the resulting computations have logarithmic depth (parallel time) with high probability. Our proofs make an interesting connection between the dependence structure of two of the problems and random binary trees. Building upon this analysis, we describe linear-work, polylogarithmic-depth algorithms for the three problems. Although asymptotically no better than the many prior parallel algorithms for the given problems, their advantages include very simple and fast implementations, and returning the same result as the sequential algorithm. Experiments on a 40-core machine show reasonably good performance relative to the sequential algorithms.
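The "run each iteration as soon as its dependencies resolve" idea can be measured directly for random permutation via the sequential Knuth shuffle: iteration i swaps A[i] and A[H[i]], so it depends on the latest earlier iterations that touched those two cells. A small simulation of that dependence depth; the exact dependence convention modeled here is our assumption:

```python
import random

def shuffle_dependence_depth(n, seed=0):
    """Depth of the dependence structure of the sequential Knuth
    shuffle, where iteration i (swapping A[i] and A[j]) waits only
    for the earlier iterations that touched cells i and j."""
    rng = random.Random(seed)
    finish = [0] * n                 # finish time of last op on each cell
    depth = 0
    for i in range(n - 1, 0, -1):    # standard backward Knuth shuffle
        j = rng.randrange(i + 1)     # swap A[i] <-> A[j], j <= i
        t = max(finish[i], finish[j]) + 1
        finish[i] = finish[j] = t
        depth = max(depth, t)
    return depth

print(shuffle_dependence_depth(10**5))  # typically O(log n), far below n
```

Under this model, the n - 1 sequential iterations collapse to a critical path that grows only logarithmically, which is the abstract's claim for random permutation.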
SIMPLE FAST PARALLEL HASHING BY OBLIVIOUS EXECUTION
Abstract
Abstract. A hash table is a representation of a set in a linear size data structure that supports constant-time membership queries. We show how to construct a hash table for any given set of n keys in O(lg lg n) parallel time with high probability, using n processors on a weak version of a concurrent-read concurrent-write parallel random access machine (CRCW PRAM). Our algorithm uses a novel approach of hashing by "oblivious execution" based on probabilistic analysis. The algorithm is simple and has the following structure:
1. Partition the input set into buckets by a random polynomial of constant degree.
2. For t := 1 to O(lg lg n) do
   (a) Allocate Mt memory blocks, each of size Kt.
   (b) Let each bucket select a block at random, and try to injectively map its keys into the block using a random linear function. Buckets that fail carry on to the next iteration.
The crux of the algorithm is a careful a priori selection of the parameters Mt and Kt. The algorithm uses only O(lg lg n) random words and can be implemented in a work-efficient manner.
Converting High Probability into Nearly-Constant Time, with Applications to Parallel Hashing
1991
Abstract
Yossi Matias, Uzi Vishkin (University of Maryland & Tel-Aviv University). Abstract: We present a new paradigm for efficient randomized parallel algorithms that needs Õ(log n) time, where Õ(x) means "O(x) expected". It leads to: (1) constructing a perfect hash function for n elements in O(log n log(log n)) time and O(n) operations; (2) an algorithm for generating a random permutation in O(log n) time, using n processors, or in O(log n log(log n)) time and O(n) operations; and (3) an efficient optimizer: consider a parallel algorithm that runs in t time using p processors; since at each time unit some of the processors may be idle, we let x, the total number of actual operations, be the sum over all non-idle processors at every time unit; assuming the algorithm belongs to a certain kind, it can be adapted to run in O(t + log n log(log n)) time (additive overhead!) using x/(t + log n log(log n)) processors. We also get an optimal integer sorting algorithm. Given...
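Result (3) is a Brent-style rescheduling: x total operations done in t parallel time can be replayed with only additive time overhead on proportionally fewer processors. A toy evaluation of the stated bounds; base-2 logarithms and the dropped constants are our choices:

```python
import math

def optimized_schedule(x, t, n):
    """Evaluate the bounds from result (3): new running time
    t + log n * log(log n) and processor count x divided by that
    time (asymptotic constants omitted throughout)."""
    overhead = math.log2(n) * math.log2(math.log2(n))
    new_time = t + overhead
    procs = max(1, int(x / new_time))
    return new_time, procs

# 10^6 operations done in t = 100 parallel steps on n = 10^6 items:
tb, pb = optimized_schedule(x=10**6, t=100, n=10**6)
```

The point of the "additive overhead" remark is visible here: the time grows only by the fixed log n · log(log n) term, while the processor count shrinks by the full ratio x / (t + overhead).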
Balanced PRAM Simulations via Moving Threads and Hashing
Abstract
Abstract: We present a novel approach to parallel computing, where (virtual) PRAM processors are represented as lightweight threads, and each physical processor is capable of managing several threads. Instead of moving read and write requests, and replies, between processor&memory pairs (and caches), we move the lightweight threads. Consequently, the processor load balancing problem reduces to the problem of producing evenly distributed memory references. In PRAM computations, this can be achieved by properly hashing the shared memory into the processor&memory pairs. We describe the idea of moving threads, and show that the moving threads framework provides a natural validation for Brent's theorem in work-optimal PRAM simulation situations on mesh of trees, coated mesh, and OCPC based distributed memory machines (DMMs). We prove that an EREW PRAM computation C requiring work W and time T can be implemented work-optimally on those p-processor DMMs with high probability, if W = Ω(p · T · max(D, log p)), where D is the diameter of the underlying routing machinery. The computation is work-optimal regardless of how (virtual) PRAM processors terminate or initiate new PRAM processors during the computation. Our result is based on using only one randomly chosen hash function and on showing how the threads (PRAM processors) can spawn new threads in required time on a p-processor OCPC, 2-dimensional mesh of trees, 2-dimensional coated mesh, and 3-dimensional coated mesh. A deterministic spawning algorithm is provided for all cases, although a randomized algorithm would be sufficient due to the randomized nature of time-processor optimal PRAM simulations.