Results 1–10 of 26
Programming Parallel Algorithms
, 1996
Abstract

Cited by 191 (9 self)
In the past 20 years there has been tremendous progress in developing and analyzing parallel algorithms. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some of these algorithms are efficient only in a theoretical framework, many are quite efficient in practice or have key ideas that have been used in efficient implementations. This research on parallel algorithms has not only improved our general understanding of parallelism but in several cases has led to improvements in sequential algorithms. Unfortunately there has been less success in developing good languages for programming parallel algorithms, particularly languages that are well suited for teaching and prototyping algorithms. There has been a large gap between languages ...
Prefix Sums and Their Applications
Abstract

Cited by 95 (2 self)
Experienced algorithm designers rely heavily on a set of building blocks and on the tools needed to put the blocks together into an algorithm. The understanding of these basic blocks and tools is therefore critical to the understanding of algorithms. Many of the blocks and tools needed for parallel ...
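The prefix-sums (scan) building block these papers center on can be sketched as a sequential reference. This Python sketch is illustrative only; the papers themselves target PRAM and vector machines, and the function names here are my own:

```python
def exclusive_scan(xs, op=lambda a, b: a + b, identity=0):
    """Exclusive scan: out[i] = xs[0] op ... op xs[i-1]; out[0] = identity."""
    out, acc = [], identity
    for x in xs:
        out.append(acc)
        acc = op(acc, x)
    return out

def inclusive_scan(xs, op=lambda a, b: a + b, identity=0):
    """Inclusive scan: out[i] = xs[0] op ... op xs[i]."""
    pre = exclusive_scan(xs, op, identity)
    return [op(p, x) for p, x in zip(pre, xs)]
```

For example, `exclusive_scan([3, 1, 7, 0, 4])` returns `[0, 3, 4, 11, 11]` and `inclusive_scan([3, 1, 7, 0, 4])` returns `[3, 4, 11, 11, 15]`. Because `op` is associative, the same result can be computed by a parallel machine in logarithmic depth, which is why scan serves as a building block.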
Scan Primitives for Vector Computers
 In Proceedings Supercomputing '90
, 1990
Abstract

Cited by 38 (9 self)
This paper describes an optimized implementation of a set of scan (also called all-prefix-sums) primitives on a single processor of a CRAY Y-MP, and demonstrates that their use leads to greatly improved performance for several applications that cannot be vectorized with existing compiler technology. The algorithm used to implement the scans is based on an algorithm for parallel computers and is applicable with minor modifications to any register-based vector computer. On the CRAY Y-MP, the asymptotic running time of the plus-scan is about 2.25 times that of a vector add, and is within 20% of optimal. An important aspect of our implementation is that a set of segmented versions of these scans are only marginally more expensive than the unsegmented versions. These segmented versions can be used to execute a scan on multiple data sets without having to pay the vector startup cost (n_{1/2}) for each set. The paper describes a radix sorting routine based on the scans that is 13 times faster ...
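The semantics of the segmented plus-scan mentioned in the abstract can be sketched sequentially: a flag vector marks segment starts, and the running sum restarts at each flag. This minimal Python sketch (my own, not the paper's vectorized implementation) shows only the intended meaning:

```python
def segmented_plus_scan(values, flags):
    """Inclusive segmented plus-scan: running sums restart wherever
    flags[i] == 1 (each 1 marks the start of a new segment)."""
    out, acc = [], 0
    for v, f in zip(values, flags):
        acc = v if f else acc + v
        out.append(acc)
    return out
```

For example, `segmented_plus_scan([1, 2, 3, 4, 5], [1, 0, 1, 0, 0])` returns `[1, 3, 3, 7, 12]`: two independent scans over the segments `[1, 2]` and `[3, 4, 5]` executed in a single pass, which is what lets one scan over many data sets pay the vector startup cost only once.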
Optimal Doubly Logarithmic Parallel Algorithms Based On Finding All Nearest Smaller Values
, 1993
Abstract

Cited by 37 (7 self)
The all nearest smaller values problem is defined as follows. Let A = (a_1, a_2, ..., a_n) be n elements drawn from a totally ordered domain. For each a_i, 1 ≤ i ≤ n, find the two nearest elements in A that are smaller than a_i (if such exist): the left nearest smaller element a_j (with j < i) and the right nearest smaller element a_k (with k > i). We give an O(log log n) time optimal parallel algorithm for the problem on a CRCW PRAM. We apply this algorithm to achieve optimal O(log log n) time parallel algorithms for four problems: (i) Triangulating a monotone polygon, (ii) Preprocessing for answering range minimum queries in constant time, (iii) Reconstructing a binary tree from its inorder and either preorder or postorder numberings, (iv) Matching a legal sequence of parentheses. We also show that any optimal CRCW PRAM algorithm for the triangulation problem requires Ω(log log n) time. Dept. of Computing, King's College London, The Strand, London WC2R 2LS, England. ...
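The problem statement can be made concrete with the standard sequential O(n) stack sweep (a reference implementation only; the paper's contribution is the doubly logarithmic parallel algorithm, which this does not reproduce):

```python
def all_nearest_smaller_values(a):
    """For each index i, return the indices of the nearest strictly
    smaller element to the left and to the right (None if absent).
    Two monotone-stack sweeps, O(n) total."""
    n = len(a)
    left, stack = [None] * n, []
    for i in range(n):
        while stack and a[stack[-1]] >= a[i]:
            stack.pop()                      # not smaller: can never be an answer
        left[i] = stack[-1] if stack else None
        stack.append(i)
    right, stack = [None] * n, []
    for i in range(n - 1, -1, -1):
        while stack and a[stack[-1]] >= a[i]:
            stack.pop()
        right[i] = stack[-1] if stack else None
        stack.append(i)
    return left, right
```

Each element is pushed and popped at most once per sweep, giving the linear bound; the listed applications (monotone-polygon triangulation, range-minimum preprocessing, tree reconstruction, parenthesis matching) all reduce to these left/right indices.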
Parallelization via Context Preservation
 In IEEE Intl Conference on Computer Languages
, 1998
Abstract

Cited by 18 (16 self)
Abstract program schemes, such as scan or homomorphism, can capture a wide range of data parallel programs. While versatile, these schemes are of limited practical use on their own. A key problem is that the more natural sequential specifications may not have the associative combine operators required by these schemes. As a result, they often fail to be immediately identified. To resolve this problem, we propose a method to systematically derive parallel programs from sequential definitions. This method is special in that it can automatically invent auxiliary functions needed by associative combine operators. Apart from a formalisation, we also provide new theorems, based on the notion of context preservation, to guarantee parallelization for a precise class of sequential programs. 1 Introduction It is well-recognised that a key problem of parallel computing remains the development of efficient and correct parallel software. This task is further complicated by the variety of parallel arc...
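The kind of derivation described, where a sequential loop has no associative combine operator until an auxiliary value is invented, can be illustrated with a classic example: maximum prefix sum. This sketch is mine, not the paper's formalism; the auxiliary value here is the segment total that must travel alongside the answer:

```python
from functools import reduce

def mps_seq(xs):
    """Sequential specification: maximum prefix sum (empty prefix counts as 0)."""
    s = m = 0
    for x in xs:
        s += x
        m = max(m, s)
    return m

def combine(l, r):
    """Associative combine on (segment_sum, max_prefix_sum) pairs.
    The segment sum is the invented auxiliary value: without it,
    max_prefix_sum alone cannot be combined across a split point."""
    s1, m1 = l
    s2, m2 = r
    return (s1 + s2, max(m1, s1 + m2))

def mps_par(xs):
    """Same result as mps_seq, but expressed as a reduction with an
    associative operator, hence evaluable as a balanced parallel tree."""
    s, m = reduce(combine, [(x, max(x, 0)) for x in xs], (0, 0))
    return m
```

The sequential loop updates `m` using the running sum `s`, so `m` alone is not a homomorphic result; pairing it with `s` restores associativity, which is exactly the shape of auxiliary function the paper's method derives automatically.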
Acceleration of First and Higher Order Recurrences on Processors with Instruction Level Parallelism
 In Sixth International Workshop on Languages and Compilers for Parallel Computing
, 1993
Abstract

Cited by 11 (2 self)
This report describes parallelization techniques for accelerating a broad class of recurrences on processors with instruction level parallelism. We introduce a new technique, called blocked back-substitution, which has lower operation count and higher performance than previous methods. The blocked back-substitution technique requires unrolling and non-symmetric optimization of innermost loop iterations. We present metrics to characterize the performance of software-pipelined loops and compare these metrics for a range of height reduction techniques and processor architectures.
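The height-reduction idea behind back-substitution can be sketched for a first-order recurrence x[i] = a[i]*x[i-1] + b[i]. The 2-way version below is an expository Python sketch under my own naming, not the report's blocking: substituting x[i] into x[i+1] leaves one multiply-add on the critical path per two elements, with the extra coefficient products off the critical path (independent work an ILP processor can overlap):

```python
def recurrence_naive(a, b, x0):
    """Reference: x[i] = a[i]*x[i-1] + b[i], serial chain of length n."""
    xs, x = [], x0
    for ai, bi in zip(a, b):
        x = ai * x + bi
        xs.append(x)
    return xs

def recurrence_blocked2(a, b, x0):
    """2-way back-substitution: x[i+1] = (a[i+1]*a[i])*x[i-1]
    + (a[i+1]*b[i] + b[i+1]), halving the serial dependence chain."""
    n = len(a)
    xs, x, i = [0] * n, x0, 0
    while i + 1 < n:
        A = a[i] * a[i + 1]            # off the critical path
        B = a[i + 1] * b[i] + b[i + 1]  # off the critical path
        xs[i] = a[i] * x + b[i]
        x = A * x + B                   # critical path: 1 step per 2 elements
        xs[i + 1] = x
        i += 2
    if i < n:                           # odd tail element
        x = a[i] * x + b[i]
        xs[i] = x
    return xs
```

The blocked version does more total multiplies than the naive loop (the "operation count" trade-off the abstract discusses) but shortens the dependence height, which is what pipelined functional units reward.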
Optimal Parallel Approximation Algorithms for Prefix Sums and Integer Sorting (Extended Abstract)
Abstract

Cited by 8 (5 self)
Parallel prefix computation is perhaps the most frequently used subroutine in parallel algorithms today. Its time complexity on the CRCW PRAM is Θ(lg n / lg lg n) using a polynomial number of processors, even in a randomized setting. Nevertheless, there are a number of nontrivial applications that have been shown to be solvable using only an approximate version of the prefix sums problem. In this paper we resolve the issue of approximating parallel prefix by introducing an algorithm that runs in O(lg n) time with very high probability, using n / lg n processors, which is optimal in terms of both work and running time. Our approximate prefix sums are guaranteed to come within a factor of (1 + ε) of the values of the true sums in a "consistent fashion", where ε is o(1). We achieve this result through the use of a number of interesting new techniques, such as overcertification and estimate-focusing, as well ...
Thinking in parallel: Some basic data-parallel algorithms and techniques
 In use as class notes since
, 1993
Abstract

Cited by 7 (1 self)
Copyright 1992–2009, Uzi Vishkin. These class notes reflect the theoretical part in the Parallel ...
Solving Linear Recurrences with Loop Raking
, 1992
Abstract

Cited by 7 (4 self)
We present a variation of the partition method for solving linear recurrences that is well-suited to vector multiprocessors. The algorithm fully utilizes both vector and multiprocessor capabilities, and reduces the number of memory accesses and temporary memory requirements as compared to the more commonly used version of the partition method. Our variation uses a general loop restructuring technique called loop raking. We describe an implementation of this technique on the CRAY Y-MP C90, and present performance results for first- and second-order linear recurrences. On a single processor of the C90 our implementations are up to 7.3 times faster than the corresponding optimized library routines in SCILIB, an optimized mathematical library supplied by Cray Research. On 4 processors, we gain an additional speedup of at least 3.7. ...
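The partition method this work builds on can be sketched sequentially for a first-order recurrence x[i] = a[i]*x[i-1] + b[i]. This Python sketch (my own) shows only the partitioning logic; loop raking proper is a strided-memory restructuring so that each vector lane walks a contiguous chunk, which a scalar sketch cannot model:

```python
def first_order_recurrence(a, b, x0=0.0):
    """Reference: x[i] = a[i]*x[i-1] + b[i], with x[-1] = x0."""
    xs, x = [], x0
    for ai, bi in zip(a, b):
        x = ai * x + bi
        xs.append(x)
    return xs

def first_order_partitioned(a, b, x0=0.0, p=4):
    """Partition method: split into p chunks; each chunk's net effect
    is an affine map x -> A*x + B computed independently (the parallel
    part), then the p chunk-start values are chained serially."""
    n = len(a)
    if n == 0:
        return []
    size = -(-n // p)  # ceil(n / p)
    chunks = [(a[i:i + size], b[i:i + size]) for i in range(0, n, size)]
    # Phase 1 (independent per chunk): summarize each chunk as (A, B).
    summaries = []
    for ca, cb in chunks:
        A, B = 1.0, 0.0
        for ai, bi in zip(ca, cb):
            A, B = ai * A, ai * B + bi
        summaries.append((A, B))
    # Phase 2 (short serial chain over p summaries): chunk-start values.
    starts, cur = [], x0
    for A, B in summaries:
        starts.append(cur)
        cur = A * cur + B
    # Phase 3 (independent per chunk): fill in each chunk from its start.
    out = []
    for (ca, cb), s in zip(chunks, starts):
        out.extend(first_order_recurrence(ca, cb, s))
    return out
```

Phases 1 and 3 are embarrassingly parallel across the p chunks; only the length-p chain in phase 2 is serial, which is why the method maps well onto vector multiprocessors.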
Acceleration of Algebraic Recurrences on Processors with Instruction Level Parallelism
 In Proceedings of LCPC6
, 1993
Abstract

Cited by 5 (4 self)
Architectures with instruction level parallelism such as VLIW and superscalar processors provide parallelism in the form of a limited number of pipelined functional units. For these architectures, recurrence height reduction techniques provide significant speedups when they are properly applied. This paper introduces a new technique, called blocked back-substitution, ...