Results 1 - 10
of
25
Programming Parallel Algorithms
, 1996
"... In the past 20 years there has been treftlendous progress in developing and analyzing parallel algorithftls. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some ofthese algorithms are efficient only in a th ..."
Abstract
-
Cited by 163 (7 self)
- Add to MetaCart
In the past 20 years there has been treftlendous progress in developing and analyzing parallel algorithftls. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some ofthese algorithms are efficient only in a theoretical framework, many are quite efficient in practice or have key ideas that have been used in efficient implementations. This research on parallel algorithms has not only improved our general understanding ofparallelism but in several cases has led to improvements in sequential algorithms. Unf:ortunately there has been less success in developing good languages f:or prograftlftling parallel algorithftls, particularly languages that are well suited for teaching and prototyping algorithms. There has been a large gap between languages
Prefix Sums and Their Applications
"... Experienced algorithm designers rely heavily on a set of building blocks and on the tools needed to put the blocks together into an algorithm. The understanding of these basic blocks and tools is therefore critical to the understanding of algorithms. Many of the blocks and tools needed for parallel ..."
Abstract
-
Cited by 79 (2 self)
- Add to MetaCart
Experienced algorithm designers rely heavily on a set of building blocks and on the tools needed to put the blocks together into an algorithm. The understanding of these basic blocks and tools is therefore critical to the understanding of algorithms. Many of the blocks and tools needed for parallel
Optimal Doubly Logarithmic Parallel Algorithms Based On Finding All Nearest Smaller Values
, 1993
"... The all nearest smaller values problem is defined as follows. Let A = (a 1 ; a 2 ; : : : ; an ) be n elements drawn from a totally ordered domain. For each a i , 1 i n, find the two nearest elements in A that are smaller than a i (if such exist): the left nearest smaller element a j (with j ! i) a ..."
Abstract
-
Cited by 36 (7 self)
- Add to MetaCart
The all nearest smaller values problem is defined as follows. Let A = (a 1 ; a 2 ; : : : ; an ) be n elements drawn from a totally ordered domain. For each a i , 1 i n, find the two nearest elements in A that are smaller than a i (if such exist): the left nearest smaller element a j (with j ! i) and the right nearest smaller element a k (with k ? i). We give an O(log log n) time optimal parallel algorithm for the problem on a CRCW PRAM. We apply this algorithm to achieve optimal O(log log n) time parallel algorithms for four problems: (i) Triangulating a monotone polygon, (ii) Preprocessing for answering range minimum queries in constant time, (iii) Reconstructing a binary tree from its inorder and either preorder or postorder numberings, (vi) Matching a legal sequence of parentheses. We also show that any optimal CRCW PRAM algorithm for the triangulation problem requires \Omega\Gammauir log n) time. Dept. of Computing, King's College London, The Strand, London WC2R 2LS, England. ...
Scan Primitives for Vector Computers
- In Proceedings Supercomputing '90
, 1990
"... This paper describes an optimized implementation of a set of scan (also called allprefix -sums) primitives on a single processor of a CRAY Y-MP, and demonstrates that their use leads to greatly improved performance for several applications that cannot be vectorized with existing compiler technology. ..."
Abstract
-
Cited by 34 (9 self)
- Add to MetaCart
This paper describes an optimized implementation of a set of scan (also called allprefix -sums) primitives on a single processor of a CRAY Y-MP, and demonstrates that their use leads to greatly improved performance for several applications that cannot be vectorized with existing compiler technology. The algorithm used to implement the scans is based on an algorithm for parallel computers and is applicable with minor modifications to any register-based vector computer. On the CRAY Y-MP, the asymptotic running time of the plus-scan is about 2.25 times that of a vector add, and is within 20% of optimal. An important aspect of our implementation is that a set of segmented versions of these scans are only marginally more expensive than the unsegmented versions. These segmented versions can be used to execute a scan on multiple data sets without having to pay the vector startup cost (n 1=2 ) for each set. The paper describes a radix sorting routine based on the scans that is 13 times faster ...
Parallelization via Context Preservation
- In IEEE Intl Conference on Computer Languages
, 1998
"... Abstract program schemes, such as scan or homomorphism, can capture a wide range of data parallel programs. While versatile, these schemes are of limited practical use on their own. A key problem is that the more natural sequential specifications may not have associative combine operators required b ..."
Abstract
-
Cited by 17 (16 self)
- Add to MetaCart
Abstract program schemes, such as scan or homomorphism, can capture a wide range of data parallel programs. While versatile, these schemes are of limited practical use on their own. A key problem is that the more natural sequential specifications may not have associative combine operators required by these schemes. As a result, they often fail to be immediately identified. To resolve this problem, we propose a method to systematically derive parallel programs from sequential definitions. This method is special in that it can automatically invent auxiliary functions needed by associative combine operators. Apart from a formalisation, we also provide new theorems, based on the notion of context preservation, to guarantee parallelization for a precise class of sequential programs. 1 Introduction It is well-recognised that a key problem of parallel computing remains the development of efficient and correct parallel software. This task is further complicated by the variety of parallel arc...
Acceleration of First and Higher Order Recurrences on Processors with Instruction Level Parallelism
- In Sixth International Workshop on Languages and Compilers for Parallel Computing
, 1993
"... This report describes parallelization techniques for accelerating a broad class of recurrences on processors with instruction level parallelism. We introduce a new technique, called blocked back-substitution, which has lower operation count and higher performance than previous methods. The blocked b ..."
Abstract
-
Cited by 10 (2 self)
- Add to MetaCart
This report describes parallelization techniques for accelerating a broad class of recurrences on processors with instruction level parallelism. We introduce a new technique, called blocked back-substitution, which has lower operation count and higher performance than previous methods. The blocked back-substitution technique requires unrolling and non-symmetric optimization of innermost loop iterations. We present metrics to characterize the performance of software-pipelined loops and compare these metrics for a range of height reduction techniques and processor architectures.
Optimal Parallel Approximation Algorithms for Prefix Sums and Integer Sorting (Extended Abstract)
"... Parallel prefix computation is perhaps the most frequently used subroutine in parallel algorithms today. Its time complexity on the CRCWPRAM is \Theta(lg n= lg lg n) using a polynomial number of processors, even in a randomized setting. Nevertheless, there are a number of non-trivial applications t ..."
Abstract
-
Cited by 8 (5 self)
- Add to MetaCart
Parallel prefix computation is perhaps the most frequently used subroutine in parallel algorithms today. Its time complexity on the CRCWPRAM is \Theta(lg n= lg lg n) using a polynomial number of processors, even in a randomized setting. Nevertheless, there are a number of non-trivial applications that have been shown to be solvable using only an approximate version of the prefix sums problem. In this paper we resolve the issue of approximating parallel prefix by introducing an algorithm that runs in O(lg n) time with very high probability, using n= lg n processors, which is optimal in terms of both work and running time. Our approximate prefix sums are guaranteed to come within a factor of (1 + ffl) of the values of the true sums in a "consistent fashion", where ffl is o(1). We achieve this result through the use of a number of interesting new techniques, such as overcertification and estimate-focusing, as well ...
Thinking in Parallel: Some Basic DataParallel Algorithms and Techniques
- College Park, MD
, 1993
"... PRAM-On-Chip Explicit Multi-Threading (XMT) platform is provided through the XMT home page www.umiacs.umd.edu/users/vishkin/XMT and the class home page. Comments are welcome: please write to me using my last name at umd.edu ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
PRAM-On-Chip Explicit Multi-Threading (XMT) platform is provided through the XMT home page www.umiacs.umd.edu/users/vishkin/XMT and the class home page. Comments are welcome: please write to me using my last name at umd.edu
Solving Linear Recurrences with Loop Raking
, 1992
"... We present a variation of the partition method for solving linear recurrences that is wellsuited to vector multiprocessors. The algorithm fully utilizes both vector and multiprocessor capabilities, and reduces the number of memory accesses and temporary memory requirements as compared to the more co ..."
Abstract
-
Cited by 6 (4 self)
- Add to MetaCart
We present a variation of the partition method for solving linear recurrences that is wellsuited to vector multiprocessors. The algorithm fully utilizes both vector and multiprocessor capabilities, and reduces the number of memory accesses and temporary memory requirements as compared to the more commonly used version of the partition method. Our variation uses a general loop restructuring technique called loop raking. We describe an implementation of this technique on the CRAY Y-MP C90, and present performance results for first- and second-order linear recurrences. On a single processor of the C90 our implementations are up to 7.3 times faster than the corresponding optimized library routines in SCILIB, an optimized mathematical library supplied by Cray Research. On 4 processors, we gain an additional speedup of at least 3.7. List of symbols 1 infinity \Phi plus within circle \Omega times within circle b; c floor ß approximately equal to Lib Partition F77 Raked 1 Proc 4 Procs Lo...
Acceleration of Algebraic Recurrences on Processors with Instruction Level Parallelism
- In Proceedings of LCPC-6
, 1993
"... Architectures with instruction level parallelism such as VLIW and superscalar processors provide parallelism in the form of a limited number of pipelined functional units. For these architectures, recurrence height reduction techniques provide significant speedups when they are properly applied. Thi ..."
Abstract
-
Cited by 5 (4 self)
- Add to MetaCart
Architectures with instruction level parallelism such as VLIW and superscalar processors provide parallelism in the form of a limited number of pipelined functional units. For these architectures, recurrence height reduction techniques provide significant speedups when they are properly applied. This paper introduces a new technique, called blocked back-substitution,

