Results 1 - 10
of
14
Provably efficient scheduling for languages with fine-grained parallelism
- IN PROC. SYMPOSIUM ON PARALLEL ALGORITHMS AND ARCHITECTURES
, 1995
"... Many high-level parallel programming languages allow for fine-grained parallelism. As in the popular work-time framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A ..."
Abstract
-
Cited by 68 (22 self)
- Add to MetaCart
Many high-level parallel programming languages allow for fine-grained parallelism. As in the popular work-time framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A common concern in executing such programs is to schedule tasks to processors dynamically so as to minimize not only the execution time, but also the amount of space (memory) needed. Without careful scheduling, the parallel execution on p processors can use a factor of p or larger more space than a sequential implementation of the same program. This paper first identifies a class of parallel schedules that are provably efficient in both time and space. For any
Efficient Low-Contention Parallel Algorithms
- the 1994 ACM Symp. on Parallel Algorithms and Architectures
, 1994
"... The queue-read, queue-write (qrqw) parallel random access machine (pram) model permits concurrent reading and writing to shared memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. The qrqw pram model reflects the contention prope ..."
Abstract
-
Cited by 29 (11 self)
- Add to MetaCart
The queue-read, queue-write (qrqw) parallel random access machine (pram) model permits concurrent reading and writing to shared memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. The qrqw pram model reflects the contention properties of most commercially available parallel machines more accurately than either the well-studied crcw pram or erew pram models, and can be efficiently emulated with only logarithmic slowdown on hypercubetype non-combining networks. This paper describes fast, low-contention, work-optimal, randomized qrqw pram algorithms for the fundamental problems of load balancing, multiple compaction, generating a random permutation, parallel hashing, and distributive sorting. These logarithmic or sublogarithmic time algorithms considerably improve upon the best known erew pram algorithms for these problems, while avoiding the high-contention steps typical of crcw pram algorithms. An illustrative expe...
The Queue-Read Queue-Write PRAM Model: Accounting for Contention in Parallel Algorithms
- Proc. 5th ACM-SIAM Symp. on Discrete Algorithms
, 1997
"... Abstract. This paper introduces the queue-read queue-write (qrqw) parallel random access machine (pram) model, which permits concurrent reading and writing to shared-memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. Prior to thi ..."
Abstract
-
Cited by 21 (10 self)
- Add to MetaCart
Abstract. This paper introduces the queue-read queue-write (qrqw) parallel random access machine (pram) model, which permits concurrent reading and writing to shared-memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. Prior to this work there were no formal complexity models that accounted for the contention to memory locations, despite its large impact on the performance of parallel programs. The qrqw pram model reflects the contention properties of most commercially available parallel machines more accurately than either the well-studied crcw pram or erew pram models: the crcw model does not adequately penalize algorithms with high contention to shared-memory locations, while the erew model is too strict in its insistence on zero contention at each step. The�qrqw pram is strictly more powerful than the erew pram. This paper shows a separation of log n between the two models, and presents faster and more efficient qrqw algorithms for several basic problems, such as linear compaction, leader election, and processor allocation. Furthermore, we present a work-preserving emulation of the qrqw pram with only logarithmic slowdown on Valiant’s bsp model, and hence on hypercube-type noncombining networks, even when latency, synchronization, and memory granularity overheads are taken into account. This matches the bestknown emulation result for the erew pram, and considerably improves upon the best-known efficient emulation for the crcw pram on such networks. Finally, the paper presents several lower bound results for this model, including lower bounds on the time required for broadcasting and for leader election.
Using Difficulty of Prediction to Decrease Computation: Fast Sort, Priority Queue and Convex Hull on Entropy Bounded Inputs
"... There is an upsurge in interest in the Markov model and also more general stationary ergodic stochastic distributions in theoretical computer science community recently (e.g. see [Vitter,KrishnanSl], [Karlin,Philips,Raghavan92], [Raghavan9 for use of Markov models for on-line algorithms, e.g., cashi ..."
Abstract
-
Cited by 15 (4 self)
- Add to MetaCart
There is an upsurge in interest in the Markov model and also more general stationary ergodic stochastic distributions in theoretical computer science community recently (e.g. see [Vitter,KrishnanSl], [Karlin,Philips,Raghavan92], [Raghavan9 for use of Markov models for on-line algorithms, e.g., cashing and prefetching). Their results used the fact that compressible sources are predictable (and vise versa), and showed that on-line algorithms can improve their performance by prediction. Actual page access sequences are in fact somewhat compressible, so their predictive methods can be of benefit. This paper investigates the interesting idea of decreasing computation by using learning in the opposite way, namely to determine the difficulty of prediction. That is, we will ap proximately learn the input distribution, and then improve the performance of the computation when the input is not too predictable, rather than the reverse. To our knowledge,
Optimal Parallel Approximation Algorithms for Prefix Sums and Integer Sorting (Extended Abstract)
"... Parallel prefix computation is perhaps the most frequently used subroutine in parallel algorithms today. Its time complexity on the CRCWPRAM is \Theta(lg n= lg lg n) using a polynomial number of processors, even in a randomized setting. Nevertheless, there are a number of non-trivial applications t ..."
Abstract
-
Cited by 8 (5 self)
- Add to MetaCart
Parallel prefix computation is perhaps the most frequently used subroutine in parallel algorithms today. Its time complexity on the CRCWPRAM is \Theta(lg n= lg lg n) using a polynomial number of processors, even in a randomized setting. Nevertheless, there are a number of non-trivial applications that have been shown to be solvable using only an approximate version of the prefix sums problem. In this paper we resolve the issue of approximating parallel prefix by introducing an algorithm that runs in O(lg n) time with very high probability, using n= lg n processors, which is optimal in terms of both work and running time. Our approximate prefix sums are guaranteed to come within a factor of (1 + ffl) of the values of the true sums in a "consistent fashion", where ffl is o(1). We achieve this result through the use of a number of interesting new techniques, such as overcertification and estimate-focusing, as well ...
Retrieval of scattered information by EREW, CREW, and CRCW PRAMs
- In Proc. 3rd Scand. Workshop on Alg. Theory
, 1992
"... Abstract. The k-compaction problem arises when k out of n cells in an array are non-empty and the contents of these cells must be moved to the first k locations in the array. Parallel algorithms for k-compaction have obvious applications in processor allocation and load balancing; k-compaction is al ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
Abstract. The k-compaction problem arises when k out of n cells in an array are non-empty and the contents of these cells must be moved to the first k locations in the array. Parallel algorithms for k-compaction have obvious applications in processor allocation and load balancing; k-compaction is also an important subroutine in many recently developed parallel algorithms. We show that any EREW PRAM that solves the k-compaction problem requires Ω ( √ log n) time, even if the number of processors is arbitrarily large and k = 2. On the CREW PRAM, we show that every n-processor algorithm for k-compaction problem requires Ω(log log n) time, even if k = 2. Finally, we show that O(log k) time can be achieved on the ROBUST PRAM, a very weak CRCW PRAM model.
The Random Adversary: A Lower-Bound Technique For Randomized Parallel Algorithms
- in Proc. of the 3rd SODA (ACM
, 1997
"... . The random-adversary technique is a general method for proving lower bounds on randomized parallel algorithms. The bounds apply to the number of communication steps, and they apply regardless of the processors' instruction sets, the lengths of messages, etc. This paper introduces the ra ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
.<F3.82e+05> The random-adversary technique is a general method for proving lower bounds on randomized parallel algorithms. The bounds apply to the number of communication steps, and they apply regardless of the processors' instruction sets, the lengths of messages, etc. This paper introduces the random-adversary technique and shows how it can be used to obtain lower bounds on randomized parallel algorithms for load balancing, compaction, padded sorting, and finding Hamiltonian cycles in random graphs. Using the random-adversary technique, we obtain the first lower bounds for randomized parallel algorithms which are provably faster than their deterministic counterparts (specifically, for load balancing and related problems).<F4.005e+05> Key words.<F3.82e+05> parallel algorithms, parallel computation, PRAM model, randomized parallel algorithms, expected time, lower bounds, load balancing<F4.005e+05> AMS subject classifications.<F3.82e+05> 68Q10, 68Q22, 68Q25<F4.005e+05> PII.<F3.82e+05> ...
A Note on Reducing Parallel Model Simulations to Integer Sorting
, 1995
"... We show that simulating a step of a fetch&add pram model on an erew pram model can be made as efficient as integer sorting. In particular, we present several efficient reductions of the simulation problem to various integer sorting problems. By using some recent algorithms for integer sorting, we ge ..."
Abstract
-
Cited by 4 (3 self)
- Add to MetaCart
We show that simulating a step of a fetch&add pram model on an erew pram model can be made as efficient as integer sorting. In particular, we present several efficient reductions of the simulation problem to various integer sorting problems. By using some recent algorithms for integer sorting, we get simulation algorithms on crew and erew that take o(n lg n) operations where n is the number of processors in the simulated crcw machine. Previous simulations were using \Theta(n lg n) operations. Some of the more interesting simulation results are obtained by using a bootstrapping technique with a crcw pram algorithm for hashing. 1 Introduction The concurrent-read concurrent-write (crcw) pram programmer's model is commonly used for designing parallel algorithms. On the other hand, the weaker exclusive-write pram models are sometimes considered closer to realization. Therefore, while it is more convenient to design algorithms for the stronger crcw model, an extra effort is sometimes neede...
Balanced PRAM Simulations via Moving Threads and Hashing
"... : We present a novel approach to parallel computing, where (virtual) PRAM processors are represented as light-weight threads, and each physical processor is capable of managing several threads. Instead of moving read and write requests, and replies between processor&memory pairs (and caches), we mov ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
: We present a novel approach to parallel computing, where (virtual) PRAM processors are represented as light-weight threads, and each physical processor is capable of managing several threads. Instead of moving read and write requests, and replies between processor&memory pairs (and caches), we move the light-weight threads. Consequently, the processor load balancing problem reduces to the problem of producing evenly distributed memory references. In PRAM computations, this can be achieved by properly hashing the shared memory into the processor&memory pairs. We describe the idea of moving threads, and show that the moving threads framework provides a natural validation for Brent's theorem in work-optimal PRAM simulation situations on mesh of trees, coated mesh, and OCPC based distributed memory machines (DMMs). We prove that an EREW PRAM computation C requiring work W and time T , can be implemented work-optimally on those p-processor DMMs with high probability, if W =\Omega (p \De...
Simple Fast Parallel Hashing by Oblivious Execution
- AT&T Bell Laboratories
, 1994
"... A hash table is a representation of a set in a linear size data structure that supports constanttime membership queries. We show how to construct a hash table for any given set of n keys in O(lg lg n) parallel time with high probability, using n processors on a weak version of a crcw pram. Our algo ..."
Abstract
-
Cited by 3 (2 self)
- Add to MetaCart
A hash table is a representation of a set in a linear size data structure that supports constanttime membership queries. We show how to construct a hash table for any given set of n keys in O(lg lg n) parallel time with high probability, using n processors on a weak version of a crcw pram. Our algorithm uses a novel approach of hashing by "oblivious execution" based on probabilistic analysis to circumvent the parity lower bound barrier at the near-logarithmic time level. The algorithm is simple and is sketched by the following: 1. Partition the input set into buckets by a random polynomial of constant degree. 2. For t := 1 to O(lg lg n) do (a) Allocate M t memory blocks, each of size K t . (b) Let each bucket select a block at random, and try to injectively map its keys into the block using a random linear function. Buckets that fail carry on to the next iteration. The crux of the algorithm is a careful a priori selection of the parameters M t and K t . The algorithm uses only O(lg lg...

