Results 1 - 10 of 388
A data locality optimizing algorithm
, 1991
Cited by 718 (18 self)
1 Introduction As processor speed continues to increase faster than memory speed, optimizations to use the memory hierarchy efficiently become ever more important. Blocking [9] or tiling [18] is a well-known technique that improves the data locality of numerical algorithms [1, 6, 7, 12, 13]. Tiling can be used for different levels of memory hierarchy such as physical memory, caches and registers; multi-level tiling can be used to achieve locality in multiple levels of the memory hierarchy simultaneously. To illustrate the importance of tiling, consider the example of matrix multiplication: for I1 := 1 to n for ...
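The abstract above breaks off at its matrix multiplication example. As a rough illustration of what tiling does to that loop nest, here is a minimal Python sketch. It is not the paper's algorithm: the function names, the `tile` parameter, and the choice to tile only the `k` and `j` loops are assumptions made for this example.

```python
def matmul_naive(a, b, n):
    """Plain triple loop: c = a * b for n x n matrices stored as lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile=2):
    """Same computation with the k and j loops split into blocks of size `tile`.

    Each (kk, jj) pair selects a tile-by-tile block of b that is reused
    for every row i before the loop nest moves on to the next block.
    """
    c = [[0.0] * n for _ in range(n)]
    for kk in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(n):
                for k in range(kk, min(kk + tile, n)):
                    a_ik = a[i][k]
                    for j in range(jj, min(jj + tile, n)):
                        c[i][j] += a_ik * b[k][j]
    return c
```

In pure Python the cache effect is not measurable; the sketch only shows that the reordered loop nest visits the same iterations and produces the same result. The `min(...)` bounds handle matrix sizes that are not a multiple of the tile size.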
Compiler Transformations for High-Performance Computing
 ACM Computing Surveys
, 1994
Cited by 371 (4 self)
In the last three decades a large number of compiler transformations for optimizing programs have been implemented. Most optimizations for uniprocessors reduce the number of instructions executed by the program using transformations based on the analysis of scalar quantities and data-flow techniques. In contrast, optimizations for ...
Global Optimizations for Parallelism and Locality on Scalable Parallel Machines
 In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation
, 1993
Cited by 243 (21 self)
Data locality is critical to achieving high performance on large-scale parallel machines. Non-local data accesses result in communication that can greatly impact performance. Thus the mapping, or decomposition, of the computation and data onto the processors of a scalable parallel machine is a key issue in compiling programs for these architectures.
Some efficient solutions to the affine scheduling problem. Part I: One-dimensional Time
, 1996
Cited by 219 (19 self)
Programs and systems of recurrence equations may be represented as sets of actions which are to be executed subject to precedence constraints. In many cases, actions may be labelled by integral vectors in some iteration domain, and precedence constraints may be described by affine relations. A schedule for such a program is a function which assigns an execution date to each action. Knowledge of such a schedule allows one to estimate the intrinsic degree of parallelism of the program and to compile a parallel version for multiprocessor architectures or systolic arrays. This paper deals with the problem of finding closed form schedules as affine or piecewise affine functions of the iteration vector. An efficient algorithm is presented which reduces the scheduling problem to a parametric linear program of small size, which can be readily solved by an efficient algorithm.
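To make the notion of an affine schedule concrete, the following Python sketch checks a candidate schedule against a set of dependences and groups iterations by execution date. It does not implement the paper's parametric linear programming method; it only verifies a schedule that is supplied by hand. The recurrence (each point depends on its left and upper neighbours), the helper names, and the candidate `theta(i, j) = i + j` are all assumptions chosen for illustration.

```python
from collections import defaultdict


def is_valid_schedule(theta, domain, deps):
    """Return True if every dependence is delayed by at least one time step.

    deps holds distance vectors d: iteration p depends on iteration p - d
    whenever p - d lies inside the iteration domain.
    """
    points = set(domain)
    for p in domain:
        for d in deps:
            src = tuple(x - y for x, y in zip(p, d))
            if src in points and theta(p) < theta(src) + 1:
                return False
    return True


def wavefronts(theta, domain):
    """Group iterations by execution date; each group can run in parallel."""
    fronts = defaultdict(list)
    for p in domain:
        fronts[theta(p)].append(p)
    return [fronts[t] for t in sorted(fronts)]


# Hypothetical example: x[i][j] = f(x[i-1][j], x[i][j-1]) on a 4 x 4 domain.
n = 4
domain = [(i, j) for i in range(n) for j in range(n)]
deps = [(1, 0), (0, 1)]


def theta(p):
    """Candidate affine schedule: execution date i + j."""
    return p[0] + p[1]
```

Under these assumptions the 16 iterations fall into 7 wavefronts (dates 0 through 6), which gives a rough sense of the parallelism the schedule exposes. A schedule such as `theta(i, j) = i` would be rejected, because it assigns the same date to both ends of the `(0, 1)` dependence.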
SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers
 ACM SIGPLAN Notices
, 1994
Cited by 212 (21 self)
Compiler infrastructures that support experimental research are crucial to the advancement of high-performance computing. New compiler technology must be implemented and evaluated in the context of a complete compiler, but developing such an infrastructure requires a huge investment in time and resources. We have spent a number of years building the SUIF compiler into a powerful, flexible system, and we would now like to share the results of our efforts. SUIF consists of a small, clearly documented kernel and a toolkit of compiler passes built on top of the kernel. The kernel defines the intermediate representation, provides functions to access and manipulate the intermediate representation, and structures the interface between compiler passes. The toolkit currently includes C and Fortran front ends, a loop-level parallelism and locality optimizer, an optimizing MIPS back end, a set of compiler development tools, and support for instructional use. Although we do not expect SUIF to be suitable for everyone, we think it may be useful for many other researchers. We thus invite you to use SUIF and welcome your contributions to this infrastructure. Directions for obtaining the SUIF software are included at the end of this paper.
Data and Computation Transformations for Multiprocessors
 In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, 1995
Cited by 168 (14 self)
Effective memory hierarchy utilization is critical to the performance of modern multiprocessor architectures. We have developed the first compiler system that fully automatically parallelizes sequential programs and changes the original array layouts to improve memory system performance. Our optimization algorithm consists of two steps. The first step chooses the parallelization and computation assignment such that synchronization and data sharing are minimized. The second step then restructures the layout of the data in the shared address space with an algorithm that is based on a new data transformation framework. We ran our compiler on a set of application programs and measured their performance on the Stanford DASH multiprocessor. Our results show that the compiler can effectively optimize parallelism in conjunction with memory subsystem performance. 1 Introduction In the last decade, microprocessor speeds have been steadily improving at a rate of 50% to 100% every year [16]. Meanwh...
Maximizing Parallelism and Minimizing Synchronization with Affine Transforms
 Parallel Computing
, 1997
Cited by 130 (7 self)
This paper presents the first algorithm to find the optimal affine transform that maximizes the degree of parallelism while minimizing the degree of synchronization in a program with arbitrary loop nestings and affine data accesses. The problem is formulated without the use of imprecise data dependence abstractions such as data dependence vectors. The algorithm presented subsumes previously proposed program transformation algorithms that are based on unimodular transformations, loop fusion, fission, scaling, reindexing and/or statement reordering. 1 Introduction As multiprocessors become popular, it is important to develop compilers that can automatically translate sequential programs into efficient parallel code. Getting high performance on a multiprocessor requires not only finding parallelism in the program but also minimizing the synchronization overhead. Synchronization is expensive on a multiprocessor. The cost of synchronization goes far beyond just the operations that manipul...
A Singular Loop Transformation Framework Based on Nonsingular Matrices
, 1992
Cited by 124 (8 self)
In this paper, we discuss a loop transformation framework that is based on integer non-singular matrices. The transformations included in this framework are called Λ-transformations and include permutation, skewing and reversal, as well as a transformation called loop scaling. This framework is more general than existing ones; however, it is also more difficult to generate code in our framework. This paper shows how integer lattice theory can be used to generate efficient code. An added advantage of our framework over existing ones is that there is a simple completion algorithm which, given a partial transformation matrix, produces a full transformation matrix that satisfies all dependences. This completion procedure has applications in parallelization and in the generation of code for NUMA machines.
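As a small illustration of matrix-based loop transformations, the Python sketch below applies a 2 x 2 integer matrix to dependence distance vectors and tests the usual legality condition: the matrix must be non-singular and every transformed dependence must remain lexicographically positive. This is a generic textbook check, not the paper's code generation or completion algorithm, and the helper names and example matrices are assumptions for this example.

```python
def mat_vec(t, v):
    """Multiply an integer matrix t (tuple of rows) by a vector v."""
    return tuple(sum(row[k] * v[k] for k in range(len(v))) for row in t)


def det2(t):
    """Determinant of a 2 x 2 matrix."""
    return t[0][0] * t[1][1] - t[0][1] * t[1][0]


def lex_positive(v):
    """True if the first nonzero entry of v is positive."""
    for x in v:
        if x > 0:
            return True
        if x < 0:
            return False
    return False


def is_legal(t, deps):
    """A non-singular t is legal if it keeps every dependence lexicographically positive."""
    return det2(t) != 0 and all(lex_positive(mat_vec(t, d)) for d in deps)


# Hypothetical loop nest with a single dependence of distance (1, -1).
deps = [(1, -1)]
interchange = ((0, 1), (1, 0))   # permutation
skew = ((1, 0), (1, 1))          # skew the inner loop by the outer loop
reversal = ((-1, 0), (0, 1))     # reverse the outer loop
scaling = ((2, 0), (0, 1))       # determinant 2: non-singular but not unimodular
```

With the dependence `(1, -1)`, interchange and reversal are rejected, while skewing and scaling are accepted. The scaling matrix has determinant 2, so it falls outside the unimodular transformations but still passes the non-singularity test, which is the kind of generalization the abstract describes.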
Loop Parallelization in the Polytope Model
 CONCUR '93, Lecture Notes in Computer Science 715
, 1993
Cited by 97 (24 self)
During the course of the last decade, a mathematical model for the parallelization of FOR-loops has become increasingly popular. In this model, a (perfect) nest of r FOR-loops is represented by a convex polytope in Z^r. The boundaries of each loop specify the extent of the polytope in a distinct dimension. Various ways of slicing and segmenting the polytope yield a multitude of guaranteed correct mappings of the loops' operations in space-time. These transformations have a very intuitive interpretation and can be easily quantified and automated due to their mathematical foundation in linear programming and linear algebra. With the recent availability of massively parallel computers, the idea of loop parallelization is gaining significance, since it promises execution speedups of orders of magnitude. The polytope model for loop parallelization has its origin in systolic design, but it applies in more general settings and methods based on it will become a part of futur...
Jade: A High-Level, Machine-Independent Language for Parallel Programming
 IEEE Computer
, 1993
Cited by 89 (9 self)
this memory is called a shared object. Pointers to shared objects are identified in a Jade program using the shared type qualifier. For example: