Results 1 - 10 of 34
Maximizing Parallelism and Minimizing Synchronization with Affine Transforms
Parallel Computing, 1997
Cited by 130 (7 self)
This paper presents the first algorithm to find the optimal affine transform that maximizes the degree of parallelism while minimizing the degree of synchronization in a program with arbitrary loop nestings and affine data accesses. The problem is formulated without the use of imprecise data dependence abstractions such as data dependence vectors. The algorithm presented subsumes previously proposed program transformation algorithms that are based on unimodular transformations, loop fusion, fission, scaling, reindexing and/or statement reordering. 1 Introduction As multiprocessors become popular, it is important to develop compilers that can automatically translate sequential programs into efficient parallel code. Getting high performance on a multiprocessor requires not only finding parallelism in the program but also minimizing the synchronization overhead. Synchronization is expensive on a multiprocessor. The cost of synchronization goes far beyond just the operations that manipul...
The Mapping of Linear Recurrence Equations on Regular Arrays
Journal of VLSI Signal Processing, 1989
Cited by 66 (7 self)
The parallelization of many algorithms can be obtained using space-time transformations which are applied on nested do-loops or on recurrence equations. In this paper, we analyze systems of linear recurrence equations, a generalization of uniform recurrence equations. The first part of the paper describes a method for automatically determining whether such a system can be scheduled by an affine timing function, independent of the size parameter of the algorithm. In the second part, we describe a powerful method that makes it possible to transform linear recurrences into uniform recurrence equations. Both parts rely on results on integral convex polyhedra. Our results are illustrated on the Gauss elimination algorithm and on the Gauss-Jordan diagonalization algorithm. 1 Introduction Designing efficient algorithms for parallel architectures is one of the main difficulties of the current research in computer science. As the architecture of supercomputers evolves towards massive parallelism...
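As a rough illustration of what an affine timing function has to satisfy, the sketch below covers only the simpler uniform case (constant dependence vectors), not the linear recurrences the paper analyzes. The function name `is_valid_affine_schedule`, the timing vector `lam`, and the example dependence vectors are assumptions made for this illustration and do not come from the paper.

```python
from typing import Sequence


def is_valid_affine_schedule(lam: Sequence[int],
                             deps: Sequence[Sequence[int]]) -> bool:
    """Check a candidate timing function t(z) = lam . z for a uniform recurrence.

    A dependence vector d means that point z uses the value computed at z - d,
    so the operand is ready in time only if lam . z - lam . (z - d) = lam . d >= 1.
    """
    return all(sum(l * c for l, c in zip(lam, d)) >= 1 for d in deps)


# Hypothetical uniform recurrence with three unit dependences.
deps = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]

print(is_valid_affine_schedule((1, 1, 1), deps))  # True
print(is_valid_affine_schedule((1, 1, 0), deps))  # False: (0, 0, 1) gets zero delay
```

Because the test depends only on `lam` and the dependence vectors, a timing vector that passes it is valid for every problem size, which is the sense in which a schedule can be independent of the size parameter.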
Programmable Systolic Arrays
Brown University, 1991
Cited by 18 (7 self)
This paper presents the New Systolic Language as a general solution to the problem of systolic programming. The language provides a simple programming interface for systolic algorithms suitable for different hardware platforms and software simulators. The New Systolic Language hides the details and potential systolic data streams. Data flows and systolic cell programs for the coprocessor are integrated with host functions, enabling a single file to specify a complete systolic program.
Extent Analysis of Data Fields
1994
Cited by 12 (4 self)
Data parallelism means operating on distributed tables, data fields, in parallel. An abstract model of data parallelism treats data fields as functions explicitly restricted to a finite set. Data parallel functional languages based on this view will reach a very high level of abstraction. In this report we consider two static analyses that, when successful, give information about the extent of a recursively defined data field. This information can be used to preallocate the data fields and map them efficiently to distributed memory, and to aid the static scheduling of operations. The analyses are cast in the framework of abstract interpretation: a forward analysis propagates restrictions on inputs to restrictions on outputs, and a backward analysis propagates restrictions the other way. Fixpoint iteration can sometimes be used to solve the equations that arise, and we identify some cases where this is possible.
Synthesizing Optimal Lower Dimensional Processor Arrays
Proceedings of International Conference on Parallel Processing, Pennsylvania State, 1992
Cited by 11 (8 self)
Most existing methods for synthesizing systolic architectures can only map n-dimensional recurrences to (n − 1)-dimensional arrays. In this paper, we generalize the parameter-based approach of Li and Wah [1] to map n-dimensional uniform recurrences to any k-dimensional processor array, where k < n. In our approach, operations of the target array are captured by a set of parameters, and constraints are derived to avoid computational conflicts and data collisions. We show that the optimal array for any objective function expressed in terms of these parameters can be found by a systematic enumeration over a polynomial search space. In contrast, previous attempts [2, 3] do not guarantee the optimality of the resulting designs. We illustrate our method with optimal single-pass linear arrays for the reindexed Warshall-Floyd path-finding algorithm. Finally, we show the application of GPM to practical situations characterized by restrictions on resources, such as processors or completion ti...
A Language-Oriented Approach to the Design of Systolic Chips
1991
Cited by 8 (3 self)
The Alpha language results from research on automatic synthesis of systolic algorithms. It is based on the recurrence equation formalism introduced by Karp, Miller and Winograd in 1967. The basic objects of Alpha are variables indexed on integral points of a convex set. It is a functional/equational language, whose definition is particularly well suited to expressing regular algorithms, as well as transformations of these algorithms from their initial mathematical specification to an implementation on a synchronous parallel architecture. In particular, Alpha makes it easy to define, prove and implement basic transformations such as Leiserson and Saxe's retiming, space-time reindexing, localization, and partitioning. We describe Alpha, its use for expressing and deriving systolic arrays, and the development environment Alpha du Centaur for this language. This work is funded by the French Coordinated Research Program C3 and by the ESPRIT BRA project NANA. 1. INTRODUCT...
A Processor-Time-Minimal Systolic Array for Cubical Mesh Algorithms
Cited by 7 (3 self)
Using a directed acyclic graph (dag) model of algorithms, the paper focuses on time-minimal multiprocessor schedules that use as few processors as possible. Such a processor-time-minimal scheduling of an algorithm’s dag is first illustrated using a triangular-shaped 2D directed mesh (representing, for example, an algorithm for solving a triangular system of linear equations). Then, algorithms represented by an n × n × n directed mesh are investigated. This cubical directed mesh is fundamental; it represents the standard algorithm for computing matrix product as well as many other algorithms. Completion of the cubical mesh requires 3n − 2 steps. It is shown that the number of processing elements needed to achieve this time bound is at least ⌈3n²/4⌉. A systolic array for the cubical directed mesh is then presented. It completes the mesh using the minimum number of steps and exactly ⌈3n²/4⌉ processing elements: it is processor-time-minimal. The systolic array’s topology is that of a hexagonally shaped, cylindrically connected 2D directed mesh.
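The step count and processor bound quoted above can be checked numerically for small meshes. The sketch below is not taken from the paper; it assumes that each vertex (i, j, k) takes one time step and has edges to (i+1, j, k), (i, j+1, k) and (i, j, k+1), so that a schedule finishing in 3n − 2 steps has to run every vertex with the same i + j + k in the same step. The function name `wavefront_sizes` is hypothetical.

```python
from collections import Counter
from itertools import product


def wavefront_sizes(n: int) -> list[int]:
    """Count the vertices (i, j, k) of the n x n x n mesh on each level i + j + k."""
    counts = Counter(i + j + k for i, j, k in product(range(n), repeat=3))
    # Levels run from 0 to 3(n - 1), which gives 3n - 2 levels in total.
    return [counts[level] for level in range(3 * n - 2)]


for n in range(1, 7):
    sizes = wavefront_sizes(n)
    bound = -(-3 * n * n // 4)  # integer form of ceil(3 n^2 / 4)
    print(f"n={n}: steps={len(sizes)}, largest level={max(sizes)}, bound={bound}")
```

Under these assumptions the largest level is the number of vertices that must run simultaneously, which is why it acts as a lower bound on the number of processing elements in a time-minimal schedule.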
Converting affine recurrence equations to quasi-uniform recurrence equations
In AWOC 1988: Third International Workshop on Parallel Computation and VLSI Theory, 1988
Cited by 7 (1 self)
Most work on the problem of synthesizing a systolic array from a system of recurrence equations is restricted to systems of uniform recurrence equations. Recently, researchers have begun to relax this restriction to include systems of affine recurrence equations. A system of uniform recurrence equations typically can be embedded in space-time so that the distance between a variable and a dependent variable does not depend on the problem size. Systems of affine recurrence equations that are not uniform do not enjoy this property. A method is presented for converting a system of affine recurrence equations to an equivalent system of recurrence equations that is uniform, except for points near the boundaries of its index sets. Necessary and sufficient conditions are given for an affine system to be amenable to such a conversion, along with an algorithm that checks for these conditions, and a procedure that converts those affine systems which can be converted. The characterization of convertible systems brings together classical ideas in algebraic geometry, number theory, and matrix representations of groups. While the proof of this characterization is complex, the characterization itself is simple, suggesting that the mathematical ideas are well chosen for this difficult problem in array design.
Optimal Synthesis of Algorithm-Specific Lower-Dimensional Processor Arrays
IEEE Trans. Parallel and Distributed Systems, 1993
Cited by 7 (2 self)
Processor arrays are frequently used to deliver high performance in many applications with computationally intensive operations. This paper presents the General Parameter Method (GPM), a systematic parameter-based approach for synthesizing such algorithm-specific architectures. GPM can synthesize processor arrays of any lower dimension from a uniform-recurrence description of the algorithm. The design objective is a general nonlinear and nonmonotonic user-specified function, and depends on attributes such as computation time of the recurrence on the processor array, completion time, load time, and drain time. In addition, bounds on some or all of these attributes can be specified. GPM performs an efficient search of polynomial complexity to find the optimal design satisfying the user-specified design constraints. As an illustration, we show how GPM can be used to find optimal linear processor arrays for computing transitive closures. We consider design objectives that minimize co...