ZPL: An Array Sublanguage
Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing, 1993
Cited by 30 (10 self)
The notion of isolating the "common case" is a well-known computer science principle. This paper describes ZPL, a language that treats data parallelism as a common case of MIMD parallelism. This separation of concerns has many benefits. It allows us to define a clean and concise language for describing data parallel computations, and this in turn leads to efficient parallel execution. Our particular language also provides mechanisms for handling boundary conditions. We introduce the concepts, constructs and semantics of our new language, and give a simple example that contrasts ZPL with other data parallel languages.
Parallelizing While Loops for Multiprocessor Systems
In Proceedings of the 9th International Parallel Processing Symposium, 1995
Cited by 29 (13 self)
Current parallelizing compilers treat while loops and do loops with conditional exits as sequential constructs because their iteration space is unknown. Motivated by the fact that these types of loops arise frequently in practice, we have developed techniques that can be used to automatically transform them for parallel execution. We succeed in parallelizing loops involving linked list traversals, something that has not been done before. This is an important problem since linked list traversals arise frequently in loops with irregular access patterns, such as sparse matrix computations. The methods can even be applied to loops whose data dependence relations cannot be analyzed at compile time. We outline a cost/performance analysis that can be used to decide when the methods should be applied. Since, as we show, the expected speedups are significant, our conclusion is that they should almost always be applied, provided there is sufficient parallelism available in the original loop. We present experimental results on loops from the PERFECT Benchmarks and sparse matrix packages which substantiate our conclusion that these techniques can yield significant speedups.
Efficient parallel algorithms for chordal graphs
Cited by 23 (0 self)
We give the first efficient parallel algorithms for recognizing chordal graphs, finding a maximum clique and a maximum independent set in a chordal graph, finding an optimal coloring of a chordal graph, finding a breadth-first search tree and a depth-first search tree of a chordal graph, recognizing interval graphs, and testing interval graphs for isomorphism. The key to our results is an efficient parallel algorithm for finding a perfect elimination ordering.
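To make the central notion in this abstract concrete: a perfect elimination ordering lists the vertices so that, for each vertex, its neighbors that come later in the ordering are pairwise adjacent. The sketch below is a plain sequential checker for that property, not the parallel algorithm of the paper; the function name and the adjacency-dictionary representation are assumptions made for illustration.

```python
from itertools import combinations


def is_perfect_elimination_ordering(adj, order):
    """Return True if `order` is a perfect elimination ordering of the graph.

    `adj` maps each vertex to the set of its neighbors (hypothetical
    representation chosen for this sketch).
    """
    position = {v: i for i, v in enumerate(order)}
    for v in order:
        # Neighbors of v that appear after v in the ordering.
        later = [u for u in adj[v] if position[u] > position[v]]
        # They must form a clique.
        for a, b in combinations(later, 2):
            if b not in adj[a]:
                return False
    return True


# A triangle (0, 1, 2) with a pendant vertex 3 attached to 2: chordal.
chordal = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
# A chordless 4-cycle: not chordal, so no ordering can pass the check.
four_cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}

print(is_perfect_elimination_ordering(chordal, [3, 0, 1, 2]))
print(is_perfect_elimination_ordering(four_cycle, [0, 1, 2, 3]))
```

A graph is chordal exactly when some ordering passes this check, which is why an efficient way to find such an ordering underlies the other results listed in the abstract.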
RSA Hardware Implementation
1995
Cited by 18 (0 self)
Introduction to Arithmetic for Digital System Designers. New York, NY: Holt, Rinehart and Winston, 1982.
[14] Ç. K. Koç and C. Y. Hung. Multi-operand modulo addition using carry save adders. Electronics Letters, 26(6):361-363, 15th March 1990.
[15] Ç. K. Koç and C. Y. Hung. Bit-level systolic arrays for modular multiplication. Journal of VLSI Signal Processing, 3(3):215-223, 1991.
[16] M. Kochanski. Developing an RSA chip. In H. C. Williams, editor, Advances in Cryptology - CRYPTO 85, Proceedings, Lecture Notes in Computer Science, No. 218, pages 350-357. New York, NY: Springer-Verlag, 1985.
[17] I. Koren. Computer Arithmetic Algorithms. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[18] D. C. Kozen. The Design and Analysis of Algorithms. New York, NY: Springer-Verlag, 1992.
[19] R. Ladner and M. Fischer. Parallel prefix computation. Journal of the ACM, 27(4):831-838, October 1980.
[20] S.
Generic Downwards Accumulations
Science of Computer Programming, 2000
Cited by 17 (2 self)
A downwards accumulation is a higher-order operation that distributes information downwards through a data structure, from the root towards the leaves. The concept was originally introduced in an ad hoc way for just a couple of kinds of tree. We generalize the concept to an arbitrary regular datatype; the resulting definition is co-inductive.
1 Introduction
The notion of scans or accumulations on lists is well known, and has proved very fruitful for expressing and calculating with programs involving lists [4]. Gibbons [7, 8] generalizes the notion of accumulation to various kinds of tree; that generalization too has proved fruitful, underlying the derivations of a number of tree algorithms, such as the parallel prefix algorithm for prefix sums [15, 8], Reingold and Tilford's algorithm for drawing trees tidily [21, 9], and algorithms for query evaluation in structured text [16, 23]. There are two varieties of accumulation on lists: leftwards and rightwards. Leftwards accumulation ...
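As a rough illustration of the idea, the sketch below implements a simple special case of a downwards accumulation on one kind of tree: every label is replaced by the result of folding a binary operator along the path from the root to that node. The `Tree` class and the function name are hypothetical, and the paper's generic, co-inductive definition is considerably more general than this.

```python
from dataclasses import dataclass, field


@dataclass
class Tree:
    """A tree whose nodes carry a label and any number of children."""
    label: object
    children: list = field(default_factory=list)


def downwards_accumulate(tree, op):
    """Replace each label by the fold of `op` over the root-to-node path.

    Information flows from the root towards the leaves: each child
    receives the value already accumulated for its parent.
    """
    def go(node, acc):
        return Tree(acc, [go(child, op(acc, child.label))
                          for child in node.children])
    return go(tree, tree.label)


example = Tree(1, [Tree(2, [Tree(4)]), Tree(3)])
path_sums = downwards_accumulate(example, lambda a, b: a + b)
print(path_sums)
```

With addition as the operator, each node ends up holding the sum of the labels on its path from the root, which is the tree analogue of a prefix sum on a list.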
Run-time Parallelization: A Framework for Parallel Computation
1995
Cited by 16 (8 self)
The goal of parallelizing, or restructuring, compilers is to detect and exploit parallelism in sequential programs written in conventional languages. Current parallelizing compilers do a reasonable job of extracting parallelism from programs with regular, statically analyzable access patterns. However, if the memory access pattern of the program is input data dependent, then static data dependence analysis and consequently parallelization is impossible. Moreover, in this case the compiler cannot apply privatization and reduction parallelization, the transformations that have been proven to be the most effective in removing data dependences and increasing the amount of exploitable parallelism in the program. Typical examples of irregular, dynamic applications are complex simulations such as SPICE for circuit simulation, DYNA-3D for structural mechanics modeling, DMOL for quantum mechanical simulation of molecules, and CHARMM for molecular dynamics simulation of organic systems. Therefore, since irregular programs represent a large and important fraction of applications, an automatable framework for run-time parallelization is needed to complement existing and future static compiler techniques. In this thesis,
Implementation of Parallel Graph Algorithms on a Massively Parallel SIMD Computer with Virtual Processing
1995
Cited by 10 (3 self)
We describe our implementation of several PRAM graph algorithms on the massively parallel computer MasPar MP-1 with 16,384 processors. Our implementation incorporated virtual processing and we present extensive test data. In a previous project [13], we reported the implementation of a set of parallel graph algorithms with the constraint that the maximum input size was restricted to be no more than the physical number of processors on the MasPar. The MasPar language MPL that we used for our code does not support virtual processing. In this paper, we describe a method of simulating virtual processors on the MasPar. We re-coded and fine-tuned our earlier parallel graph algorithms to incorporate the usage of virtual processors. Under the current implementation scheme, there is no limit on the number of virtual processors that one can use in the program as long as there is enough main memory to store all the data required during the computation. We also give two general optimization techniq...
An Efficient Parallel Algorithm That Finds Independent Sets Of Guaranteed Size
1990
Cited by 9 (0 self)
Every graph with $n$ vertices and $m$ edges has an independent set containing at least $n^2/(2m+n)$ vertices. We present a parallel algorithm that finds an independent set of this size and runs in $O(\log^3 n)$ time on a CRCW PRAM with $O((m+n)\,\alpha(m,n)/\log^2 n)$ processors, where $\alpha(m,n)$ is a functional inverse of Ackermann's function. The ideas used in the design of this algorithm are also used to design an algorithm that, with the same resources, finds a vertex coloring satisfying certain minimality conditions.
Key words. Turán's theorem, independent set, NC, graph, parallel computation, deterministic
AMS(MOS) subject classifications. 68Q22, 68R10, 68R05
1. Introduction. This paper presents a fast parallel algorithm that, given a graph $G$, finds an independent set of $G$ whose size is bounded from below. The bound depends on the number $n$ of vertices and number $m$ of edges of $G$, and cannot be improved in these terms. Since constructing a maximum independent set is NP-hard, it cannot be so...
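For intuition about the size bound, the sketch below uses the classical sequential minimum-degree greedy heuristic, which is known to return an independent set with at least n^2/(2m + n) vertices. It is not the parallel algorithm described in the abstract; the function names and the adjacency-dictionary input are assumptions made for this example.

```python
def greedy_independent_set(adj):
    """Sequential min-degree greedy: pick a vertex of minimum remaining
    degree, add it to the set, delete it and its neighbors, repeat.

    `adj` maps each vertex to the set of its neighbors (hypothetical
    representation for this sketch).
    """
    remaining = {v: set(ns) for v, ns in adj.items()}
    chosen = []
    while remaining:
        v = min(remaining, key=lambda u: len(remaining[u]))
        chosen.append(v)
        removed = remaining[v] | {v}
        for u in removed:
            remaining.pop(u)
        # Drop edges that pointed at deleted vertices.
        for neighbors in remaining.values():
            neighbors -= removed
    return chosen


def guaranteed_size(n, m):
    """Lower bound n^2 / (2m + n) from the abstract."""
    return n * n / (2 * m + n)


# A 5-cycle: n = 5, m = 5, so the bound is 25/15.
cycle = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
result = greedy_independent_set(cycle)
print(result, guaranteed_size(5, 5))
```

The greedy loop is inherently sequential, which is what makes achieving the same guarantee in polylogarithmic parallel time a nontrivial result.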
Parallel Canonical Recoding
Electronics Letters, 1996
Cited by 3 (0 self)
We introduce a parallel algorithm for generating the canonical signed-digit expansion of an $n$-bit number in $O(\log n)$ time using $O(n)$ gates. The algorithm is similar to the computation of the carries in a carry look-ahead circuit. We also prove that if the binary number $x + \lfloor x/2 \rfloor$ is given, then the canonical signed-digit recoding of $x$ can be computed in $O(1)$ time using $O(n)$ gates.
1 Introduction
Recoding techniques (Booth recoding, bit-pair recoding, etc.) for sparse signed-digit representations of binary numbers have been effectively used in multiplication [3, 4] and exponentiation algorithms [2]. For example, the original Booth recoding technique [3, 4] scans the bits of the multiplier one bit at a time, and adds or subtracts the multiplicand to or from the partial product, depending on the value of the current bit and the previous bit. The modified versions of the Booth algorithm scan the bits of the multiplier two bits or three bits at a time [4]. These techniques are equivalent ...
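The constant-time claim can be illustrated in software with a well-known bit trick: once the sum of x and the floor of x/2 is available, each signed digit depends only on a fixed number of bits of that sum and of the shifted input. The sketch below is a word-level Python rendering of that idea under that assumption, not the gate-level circuit from the paper; the function name and the two-mask output format are hypothetical.

```python
def canonical_signed_digits(x):
    """Return (plus, minus) bit masks with x == plus - minus.

    Bit i of `plus` marks a +1 digit at position i and bit i of `minus`
    marks a -1 digit. No two adjacent positions are both nonzero, which
    is the defining property of the canonical (non-adjacent) form.
    """
    if x < 0:
        raise ValueError("x must be non-negative")
    half = x >> 1            # floor(x / 2)
    total = x + half         # the quantity x + floor(x/2) from the abstract
    changed = half ^ total   # positions where a nonzero digit is needed
    plus = total & changed
    minus = half & changed
    return plus, minus


plus, minus = canonical_signed_digits(7)
print(bin(plus), bin(minus))
```

The only step that needs carry propagation is the addition itself; everything after it is bitwise and independent per position, which matches the structure of the result stated in the abstract.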
An Effective Load Balancing Policy for Geometric Decaying Algorithms
Cited by 3 (3 self)
Parallel algorithms are often first designed as a sequence of rounds, where each round includes any number of independent constant-time operations. This so-called work-time presentation is then followed by a processor scheduling implementation on a more concrete computational model. Many parallel algorithms are geometric-decaying in the sense that the sequence of work loads is upper bounded by a decreasing geometric series. A standard scheduling implementation of such algorithms consists of a repeated application of load balancing. We present a more effective, yet equally simple, policy for the utilization of load balancing in geometric-decaying algorithms. By making a more careful choice of when and how often load balancing should be employed, and by using a simple amortization argument, we show that the number of required applications of load balancing should be nearly constant. The policy is not restricted to any particular model of parallel computation, and, up to a constant factor, it is the best possible.

