Results 1 - 10
of
12
Provably efficient scheduling for languages with fine-grained parallelism
- IN PROC. SYMPOSIUM ON PARALLEL ALGORITHMS AND ARCHITECTURES
, 1995
"... Many high-level parallel programming languages allow for fine-grained parallelism. As in the popular work-time framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A ..."
Abstract
-
Cited by 68 (22 self)
- Add to MetaCart
Many high-level parallel programming languages allow for fine-grained parallelism. As in the popular work-time framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A common concern in executing such programs is to schedule tasks to processors dynamically so as to minimize not only the execution time, but also the amount of space (memory) needed. Without careful scheduling, the parallel execution on p processors can use a factor of p or larger more space than a sequential implementation of the same program. This paper first identifies a class of parallel schedules that are provably efficient in both time and space. For any
Wavelet synopses with error guarantees
- In Proceedings of the 2002 ACM SIGMOD international conference on Management of data
, 2002
"... ABSTRACT Recent work has demonstrated the effectiveness of the wavelet de-composition in reducing large amounts of data to compact sets of ..."
Abstract
-
Cited by 63 (1 self)
- Add to MetaCart
ABSTRACT Recent work has demonstrated the effectiveness of the wavelet de-composition in reducing large amounts of data to compact sets of
Efficient Low-Contention Parallel Algorithms
- the 1994 ACM Symp. on Parallel Algorithms and Architectures
, 1994
"... The queue-read, queue-write (qrqw) parallel random access machine (pram) model permits concurrent reading and writing to shared memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. The qrqw pram model reflects the contention prope ..."
Abstract
-
Cited by 29 (11 self)
- Add to MetaCart
The queue-read, queue-write (qrqw) parallel random access machine (pram) model permits concurrent reading and writing to shared memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. The qrqw pram model reflects the contention properties of most commercially available parallel machines more accurately than either the well-studied crcw pram or erew pram models, and can be efficiently emulated with only logarithmic slowdown on hypercubetype non-combining networks. This paper describes fast, low-contention, work-optimal, randomized qrqw pram algorithms for the fundamental problems of load balancing, multiple compaction, generating a random permutation, parallel hashing, and distributive sorting. These logarithmic or sublogarithmic time algorithms considerably improve upon the best known erew pram algorithms for these problems, while avoiding the high-contention steps typical of crcw pram algorithms. An illustrative expe...
The Queue-Read Queue-Write PRAM Model: Accounting for Contention in Parallel Algorithms
- Proc. 5th ACM-SIAM Symp. on Discrete Algorithms
, 1997
"... Abstract. This paper introduces the queue-read queue-write (qrqw) parallel random access machine (pram) model, which permits concurrent reading and writing to shared-memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. Prior to thi ..."
Abstract
-
Cited by 21 (10 self)
- Add to MetaCart
Abstract. This paper introduces the queue-read queue-write (qrqw) parallel random access machine (pram) model, which permits concurrent reading and writing to shared-memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. Prior to this work there were no formal complexity models that accounted for the contention to memory locations, despite its large impact on the performance of parallel programs. The qrqw pram model reflects the contention properties of most commercially available parallel machines more accurately than either the well-studied crcw pram or erew pram models: the crcw model does not adequately penalize algorithms with high contention to shared-memory locations, while the erew model is too strict in its insistence on zero contention at each step. The�qrqw pram is strictly more powerful than the erew pram. This paper shows a separation of log n between the two models, and presents faster and more efficient qrqw algorithms for several basic problems, such as linear compaction, leader election, and processor allocation. Furthermore, we present a work-preserving emulation of the qrqw pram with only logarithmic slowdown on Valiant’s bsp model, and hence on hypercube-type noncombining networks, even when latency, synchronization, and memory granularity overheads are taken into account. This matches the bestknown emulation result for the erew pram, and considerably improves upon the best-known efficient emulation for the crcw pram on such networks. Finally, the paper presents several lower bound results for this model, including lower bounds on the time required for broadcasting and for leader election.
Aqua project white paper
, 1997
"... Viswanath Poosala z In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries, whenever possible. The goal is to provide an estimated response in orders of magnitude less time than the time to compute an exact answer, by avoiding o ..."
Abstract
-
Cited by 16 (10 self)
- Add to MetaCart
Viswanath Poosala z In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries, whenever possible. The goal is to provide an estimated response in orders of magnitude less time than the time to compute an exact answer, by avoiding or minimizing the number of accesses to the base data. This white paper describes the Approximate QUery Answering (AQUA) Project underway in the Information Sciences Research Center at Bell Labs. We present a framework for an approximate query engine that observes new data as it arrives and maintains small synopsis data structures on that data. These data structures are used to provide fast, approximate answers to a broad class of queries. We describe metrics for evaluating approximate query answers. We also present new synopsis data structures, and new techniques for approximate query answers. We report on the goals and status of the Aqua project, and plans for future work.
Optimal Parallel Approximation Algorithms for Prefix Sums and Integer Sorting (Extended Abstract)
"... Parallel prefix computation is perhaps the most frequently used subroutine in parallel algorithms today. Its time complexity on the CRCWPRAM is \Theta(lg n= lg lg n) using a polynomial number of processors, even in a randomized setting. Nevertheless, there are a number of non-trivial applications t ..."
Abstract
-
Cited by 8 (5 self)
- Add to MetaCart
Parallel prefix computation is perhaps the most frequently used subroutine in parallel algorithms today. Its time complexity on the CRCWPRAM is \Theta(lg n= lg lg n) using a polynomial number of processors, even in a randomized setting. Nevertheless, there are a number of non-trivial applications that have been shown to be solvable using only an approximate version of the prefix sums problem. In this paper we resolve the issue of approximating parallel prefix by introducing an algorithm that runs in O(lg n) time with very high probability, using n= lg n processors, which is optimal in terms of both work and running time. Our approximate prefix sums are guaranteed to come within a factor of (1 + ffl) of the values of the true sums in a "consistent fashion", where ffl is o(1). We achieve this result through the use of a number of interesting new techniques, such as overcertification and estimate-focusing, as well ...
A Note on Reducing Parallel Model Simulations to Integer Sorting
, 1995
"... We show that simulating a step of a fetch&add pram model on an erew pram model can be made as efficient as integer sorting. In particular, we present several efficient reductions of the simulation problem to various integer sorting problems. By using some recent algorithms for integer sorting, we ge ..."
Abstract
-
Cited by 4 (3 self)
- Add to MetaCart
We show that simulating a step of a fetch&add pram model on an erew pram model can be made as efficient as integer sorting. In particular, we present several efficient reductions of the simulation problem to various integer sorting problems. By using some recent algorithms for integer sorting, we get simulation algorithms on crew and erew that take o(n lg n) operations where n is the number of processors in the simulated crcw machine. Previous simulations were using \Theta(n lg n) operations. Some of the more interesting simulation results are obtained by using a bootstrapping technique with a crcw pram algorithm for hashing. 1 Introduction The concurrent-read concurrent-write (crcw) pram programmer's model is commonly used for designing parallel algorithms. On the other hand, the weaker exclusive-write pram models are sometimes considered closer to realization. Therefore, while it is more convenient to design algorithms for the stronger crcw model, an extra effort is sometimes neede...
Approximate Parallel Prefix Computation and Its Applications
, 1993
"... In this paper we address two fundamental problems in parallel algorithm design---parallel prefix sums and integer sorting---and show that both of them can be approximately solved very quickly on a randomized CRCW PRAM. In the case of prefix sums the approximation is in terms of the accuracy of the s ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
In this paper we address two fundamental problems in parallel algorithm design---parallel prefix sums and integer sorting---and show that both of them can be approximately solved very quickly on a randomized CRCW PRAM. In the case of prefix sums the approximation is in terms of the accuracy of the sums and in the case of integer sorting it is in terms of allowing some gaps between consecutive elements in the ordered list. By introducing approximation in these ways we are able to solve these problems in o(lg lg n) time, and thus avoid the near-logarithmic lower bounds by Beame and Hastad that hold for the exact versions of these problems. Nevertheless, we demonstrate that these approximations are strong enough to be used as subroutines in fast randomized algorithms for some well-known problems in parallel computational geometry. Perhaps the most succinct way to describe the power of the new tools which are presented is by observing that prior to this work it was known how to solve the i...
An Effective Load Balancing Policy for Geometric Decaying Algorithms
"... Parallel algorithms are often first designed as a sequence of rounds, where each round includes any number of independent constant time operations. This so-called work-time presentation is then followed by a processor scheduling implementation ona more concrete computational model. Many parallel alg ..."
Abstract
-
Cited by 3 (3 self)
- Add to MetaCart
Parallel algorithms are often first designed as a sequence of rounds, where each round includes any number of independent constant time operations. This so-called work-time presentation is then followed by a processor scheduling implementation ona more concrete computational model. Many parallel algorithms are geometric-decaying in the sense that the sequence of work loads is upper bounded by a decreasing geometric series. A standard scheduling implementation of such algorithms consists of a repeated application of load balancing. We present a more effective, yet as simple, policy for the utilization of load balancing in geometric decaying algorithms. By making a more careful choice of when and how often load balancing should be employed, and by using a simple amortization argument, we showthat the number of required applications of load balancing should be nearly-constant. The policy is not restricted to any particular model of parallel computation, and, up to a constant factor, it is the best possible.
Fast, Efficient Mutual and Self Simulations for Shared Memory and Reconfigurable Mesh
- in Proceedings of the 7th IEEE Symposium on Parallel and Distributed Processing
, 1995
"... This paper studies relations between the parallel random access machine (pram) model, and the reconfigurable mesh (rmesh) model, by providing mutual simulations between the models. We present an algorithm simulating one step of an (n lg lg n)- processor crcw pram on an n \Theta n rmesh with delay O ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
This paper studies relations between the parallel random access machine (pram) model, and the reconfigurable mesh (rmesh) model, by providing mutual simulations between the models. We present an algorithm simulating one step of an (n lg lg n)- processor crcw pram on an n \Theta n rmesh with delay O(lg lg n) with high probability. We use our pram simulation to obtain the first efficient self-simulation algorithm of an rmesh with general switches: An algorithm running on an n \Theta n rmesh is simulated on a p \Theta p rmesh with delay O((n=p) 2 + lg n lg lg p) with high probability, which is optimal for all p n= p lg n lg lg n. Finally, we consider the simulation of rmesh on the pram. We show that a 2 \Theta n rmesh can be optimally simulated on a crcw pram in \Theta(ff(n)) time, where ff(\Delta) is the slow-growing inverse Ackermann function. In contrast, a pram with polynomial number of processors cannot simulate the 3 \Theta n rmesh in less than \Omega\Gammaha n= lg lg n) e...

