Results 1  10
of
17
Provably efficient scheduling for languages with finegrained parallelism
 IN PROC. SYMPOSIUM ON PARALLEL ALGORITHMS AND ARCHITECTURES
, 1995
"... Many highlevel parallel programming languages allow for finegrained parallelism. As in the popular worktime framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A ..."
Abstract

Cited by 92 (27 self)
 Add to MetaCart
Many highlevel parallel programming languages allow for finegrained parallelism. As in the popular worktime framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A common concern in executing such programs is to schedule tasks to processors dynamically so as to minimize not only the execution time, but also the amount of space (memory) needed. Without careful scheduling, the parallel execution on p processors can use a factor of p or larger more space than a sequential implementation of the same program. This paper first identifies a class of parallel schedules that are provably efficient in both time and space. For any
Wavelet synopses with error guarantees
 In Proceedings of the 2002 ACM SIGMOD international conference on Management of data
, 2002
"... ABSTRACT Recent work has demonstrated the effectiveness of the wavelet decomposition in reducing large amounts of data to compact sets of ..."
Abstract

Cited by 81 (1 self)
 Add to MetaCart
ABSTRACT Recent work has demonstrated the effectiveness of the wavelet decomposition in reducing large amounts of data to compact sets of
An optical simulation of shared memory
, 1994
"... We present a workoptimal randomized algorithm for simulating a shared memory machine (pram) on an optical communication parallel computer (ocpc). The ocpc model is motivated by the potential of optical communication for parallel computation. The memory of an ocpc is divided into modules, one module ..."
Abstract

Cited by 35 (3 self)
 Add to MetaCart
We present a workoptimal randomized algorithm for simulating a shared memory machine (pram) on an optical communication parallel computer (ocpc). The ocpc model is motivated by the potential of optical communication for parallel computation. The memory of an ocpc is divided into modules, one module per processor. Each memory module only services a request on a timestep if it receives exactly one memory request. Our algorithm simulates each step of an n lg lg nprocessor erew pram on an nprocessor ocpc in O(lg lg n) expected delay. (The probability that the delay is longer than this is at most n; for any constant.) The best previous simulation, due to Valiant, required (lg n) expected delay.
Efficient LowContention Parallel Algorithms
, 1996
"... The queueread, queuewrite (qrqw) parallel random access machine (pram) model permits concurrent reading and writing to shared memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. The qrqw pram model re ects the contention propert ..."
Abstract

Cited by 34 (14 self)
 Add to MetaCart
(Show Context)
The queueread, queuewrite (qrqw) parallel random access machine (pram) model permits concurrent reading and writing to shared memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. The qrqw pram model re ects the contention properties of most commercially available parallel machines more accurately than either the wellstudied crcw pram or erew pram models, and can be e ciently emulated with only logarithmic slowdown on hypercubetype noncombining networks. This paper describes fast, lowcontention, workoptimal, randomized qrqw pram algorithms for the fundamental problems of load balancing, multiple compaction, generating a random permutation, parallel hashing, and distributive sorting. These logarithmic or sublogarithmic time algorithms considerably improve upon the best known erew pram algorithms for these problems, while avoiding the highcontention steps typical of crcw pram algorithms. An illustrative experiment demonstrates the performance advantage of a new qrqw random permutation algorithm when compared with the popular erew algorithm. Finally, this paper presents new randomized algorithms for integer sorting and general sorting.
The QueueRead QueueWrite PRAM Model: Accounting for Contention in Parallel Algorithms
 Proc. 5th ACMSIAM Symp. on Discrete Algorithms
, 1997
"... Abstract. This paper introduces the queueread queuewrite (qrqw) parallel random access machine (pram) model, which permits concurrent reading and writing to sharedmemory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. Prior to thi ..."
Abstract

Cited by 28 (11 self)
 Add to MetaCart
(Show Context)
Abstract. This paper introduces the queueread queuewrite (qrqw) parallel random access machine (pram) model, which permits concurrent reading and writing to sharedmemory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. Prior to this work there were no formal complexity models that accounted for the contention to memory locations, despite its large impact on the performance of parallel programs. The qrqw pram model reflects the contention properties of most commercially available parallel machines more accurately than either the wellstudied crcw pram or erew pram models: the crcw model does not adequately penalize algorithms with high contention to sharedmemory locations, while the erew model is too strict in its insistence on zero contention at each step. The�qrqw pram is strictly more powerful than the erew pram. This paper shows a separation of log n between the two models, and presents faster and more efficient qrqw algorithms for several basic problems, such as linear compaction, leader election, and processor allocation. Furthermore, we present a workpreserving emulation of the qrqw pram with only logarithmic slowdown on Valiant’s bsp model, and hence on hypercubetype noncombining networks, even when latency, synchronization, and memory granularity overheads are taken into account. This matches the bestknown emulation result for the erew pram, and considerably improves upon the bestknown efficient emulation for the crcw pram on such networks. Finally, the paper presents several lower bound results for this model, including lower bounds on the time required for broadcasting and for leader election.
Aqua project white paper
, 1997
"... Viswanath Poosala z In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries, whenever possible. The goal is to provide an estimated response in orders of magnitude less time than the time to compute an exact answer, by avoiding o ..."
Abstract

Cited by 19 (10 self)
 Add to MetaCart
(Show Context)
Viswanath Poosala z In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries, whenever possible. The goal is to provide an estimated response in orders of magnitude less time than the time to compute an exact answer, by avoiding or minimizing the number of accesses to the base data. This white paper describes the Approximate QUery Answering (AQUA) Project underway in the Information Sciences Research Center at Bell Labs. We present a framework for an approximate query engine that observes new data as it arrives and maintains small synopsis data structures on that data. These data structures are used to provide fast, approximate answers to a broad class of queries. We describe metrics for evaluating approximate query answers. We also present new synopsis data structures, and new techniques for approximate query answers. We report on the goals and status of the Aqua project, and plans for future work.
Optimal Parallel Approximation Algorithms for Prefix Sums and Integer Sorting (Extended Abstract)
"... Parallel prefix computation is perhaps the most frequently used subroutine in parallel algorithms today. Its time complexity on the CRCWPRAM is \Theta(lg n= lg lg n) using a polynomial number of processors, even in a randomized setting. Nevertheless, there are a number of nontrivial applications t ..."
Abstract

Cited by 9 (5 self)
 Add to MetaCart
Parallel prefix computation is perhaps the most frequently used subroutine in parallel algorithms today. Its time complexity on the CRCWPRAM is \Theta(lg n= lg lg n) using a polynomial number of processors, even in a randomized setting. Nevertheless, there are a number of nontrivial applications that have been shown to be solvable using only an approximate version of the prefix sums problem. In this paper we resolve the issue of approximating parallel prefix by introducing an algorithm that runs in O(lg n) time with very high probability, using n= lg n processors, which is optimal in terms of both work and running time. Our approximate prefix sums are guaranteed to come within a factor of (1 + ffl) of the values of the true sums in a "consistent fashion", where ffl is o(1). We achieve this result through the use of a number of interesting new techniques, such as overcertification and estimatefocusing, as well ...
Approximate parallel prefix computations and its applications
 in Proc. 7th International Parallel Processing Symposium (IEEE, Los Alamitos
, 1993
"... ..."
An Effective Load Balancing Policy for Geometric Decaying Algorithms
"... Parallel algorithms are often first designed as a sequence of rounds, where each round includes any number of independent constant time operations. This socalled worktime presentation is then followed by a processor scheduling implementation ona more concrete computational model. Many parallel alg ..."
Abstract

Cited by 4 (3 self)
 Add to MetaCart
Parallel algorithms are often first designed as a sequence of rounds, where each round includes any number of independent constant time operations. This socalled worktime presentation is then followed by a processor scheduling implementation ona more concrete computational model. Many parallel algorithms are geometricdecaying in the sense that the sequence of work loads is upper bounded by a decreasing geometric series. A standard scheduling implementation of such algorithms consists of a repeated application of load balancing. We present a more effective, yet as simple, policy for the utilization of load balancing in geometric decaying algorithms. By making a more careful choice of when and how often load balancing should be employed, and by using a simple amortization argument, we showthat the number of required applications of load balancing should be nearlyconstant. The policy is not restricted to any particular model of parallel computation, and, up to a constant factor, it is the best possible.
A Note on Reducing Parallel Model Simulations to Integer Sorting
, 1995
"... We show that simulating a step of a fetch&add pram model on an erew pram model can be made as efficient as integer sorting. In particular, we present several efficient reductions of the simulation problem to various integer sorting problems. By using some recent algorithms for integer sorting, w ..."
Abstract

Cited by 3 (2 self)
 Add to MetaCart
We show that simulating a step of a fetch&add pram model on an erew pram model can be made as efficient as integer sorting. In particular, we present several efficient reductions of the simulation problem to various integer sorting problems. By using some recent algorithms for integer sorting, we get simulation algorithms on crew and erew that take o(n lg n) operations where n is the number of processors in the simulated crcw machine. Previous simulations were using \Theta(n lg n) operations. Some of the more interesting simulation results are obtained by using a bootstrapping technique with a crcw pram algorithm for hashing. 1 Introduction The concurrentread concurrentwrite (crcw) pram programmer's model is commonly used for designing parallel algorithms. On the other hand, the weaker exclusivewrite pram models are sometimes considered closer to realization. Therefore, while it is more convenient to design algorithms for the stronger crcw model, an extra effort is sometimes neede...