Improving Register Allocation for Subscripted Variables
, 1990
Cited by 192 (34 self)
INTRODUCTION By the late 1980s, memory system performance and CPU performance had already begun to diverge. This trend made effective use of the register file imperative for excellent performance. Although most compilers at that time allocated scalar variables to registers using graph coloring with marked success [12, 13, 14, 6], allocation of array values to registers only occurred in rare circumstances because standard data-flow analysis techniques could not uncover the available reuse of array memory locations. This deficiency was especially problematic for scientific codes since a majority of the computation involves array references. Our original paper addressed this problem by presenting an algorithm and experiment for a loop transformation, called scalar replacement, that exposed the reuse available in array references in an innermost loop. It also demonstrated experimentally how another loop transformation, called unroll-and-jam [2], could expose more opportunities for scalar…
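The idea of scalar replacement can be illustrated with a small sketch. Python has no register file, so this only shows the shape of the source-level transformation; the recurrence and the function names are hypothetical and are not taken from the paper.

```python
def recurrence_original(a, b):
    """a[i] = a[i-1] + b[i]: every iteration re-reads a[i-1] from the array."""
    for i in range(1, len(a)):
        a[i] = a[i - 1] + b[i]
    return a


def recurrence_scalar_replaced(a, b):
    """Same loop after scalar replacement: the reused array value lives in a scalar."""
    if len(a) < 2:
        return a
    t = a[0]                  # value of a[i-1] on entry to the first iteration
    for i in range(1, len(a)):
        t = t + b[i]          # reuse t instead of loading a[i-1] again
        a[i] = t              # the store stays; the array load is gone
    return a
```

In a compiled language the scalar `t` is a candidate for a register, which is what makes the reuse visible to an ordinary scalar register allocator.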
Practical Dependence Testing
, 1991
Cited by 131 (16 self)
Precise and efficient dependence tests are essential to the effectiveness of a parallelizing compiler. This paper proposes a dependence testing scheme based on classifying pairs of subscripted variable references. Exact yet fast dependence tests are presented for certain classes of array references, as well as empirical results showing that these references dominate scientific Fortran codes. These dependence tests are being implemented at Rice University in both PFC, a parallelizing compiler, and ParaScope, a parallel programming environment.
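For a sense of what a dependence test on a pair of subscripted references looks like, here is a minimal sketch of the classic GCD test. It is a well-known conservative test chosen for illustration and is not claimed to be one of the tests proposed in the paper. The function name and parameters are assumptions, and loop bounds are ignored.

```python
from math import gcd


def gcd_test(c1, k1, c2, k2):
    """Conservative test for references a[c1*i + k1] and a[c2*j + k2].

    Returns False when no integers i, j make the two subscripts equal
    (the references are independent) and True when a dependence may exist.
    """
    g = gcd(c1, c2)
    if g == 0:
        # Both coefficients are zero: the subscripts are the constants k1 and k2.
        return k1 == k2
    # c1*i - c2*j = k2 - k1 has an integer solution iff g divides k2 - k1.
    return (k2 - k1) % g == 0
```

For example, `gcd_test(2, 0, 2, 1)` compares `a[2*i]` with `a[2*j + 1]`; even and odd elements never coincide, so the test reports independence.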
ParaScope: a parallel programming environment
- PROCEEDINGS OF THE IEEE
, 1993
Cited by 120 (33 self)
The ParaScope parallel programming environment, developed to support scientific programming of shared-memory multiprocessors, includes a collection of tools that use global program analysis to help users develop and debug parallel programs. This paper focuses on ParaScope’s compilation system, its parallel program editor, and its parallel debugging system. The compilation system extends the traditional single-procedure compiler by providing a mechanism for managing the compilation of complete programs. Thus, ParaScope can support both traditional single-procedure optimization and optimization across procedure boundaries. The ParaScope editor brings both compiler analysis and user expertise to bear on program parallelization. It assists the knowledgeable user by displaying and managing analysis and by providing a variety of interactive program transformations that are effective in exposing parallelism. The debugging system detects and reports timing-dependent errors, called data races, in execution of parallel programs. The system combines static analysis, program instrumentation, and run-time reporting to provide a mechanical system for isolating errors in parallel program executions. Finally, we describe a new project to extend ParaScope to support programming in Fortran D, a machine-independent parallel programming language intended for use with both distributed-memory and shared-memory parallel computers.
Interactive Parallel Programming Using the ParaScope Editor
- IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
, 1991
Cited by 61 (12 self)
The ParaScope project is developing an integrated collection of tools to help scientific programmers implement correct and efficient parallel programs. The centerpiece of this collection is the ParaScope Editor, an intelligent interactive editor for parallel Fortran programs. The ParaScope Editor reveals to users potential hazards of a proposed parallelization in a program. It also provides a variety of powerful interactive program transformations that have been shown useful in converting programs to parallel form. In addition, the ParaScope Editor supports general user editing through a hybrid text and structure editing facility that incrementally analyzes the modified program for potential hazards. The ParaScope Editor is a new kind of program construction tool -- one that not only manages text, but also presents the user with information about the correctness of the parallel program under development. As such, it can support an exploratory programming style in which users get immediate feedback on their various strategies for parallelization.
Memory-Hierarchy Management
, 1994
Cited by 50 (14 self)
The trend in high-performance microprocessor design is toward increasing computational power on the chip. Microprocessors can now process dramatically more data per machine cycle than previous models. Unfortunately, memory speeds have not kept pace. The result is an imbalance between computation speed and memory speed. This imbalance is leading machine designers to use more complicated memory hierarchies. In turn, programmers are explicitly restructuring codes to perform well on particular memory systems, leading to machine-specific programs. It is our belief that machine-specific programming is a step in the wrong direction. Compilers, not programmers, should handle machine-specific implementation details. To this end, this thesis develops and experiments with compiler algorithms that manage the memory hierarchy of a machine for floating-point intensive numerical codes. Specifically, we address the following issues: Scalar replacement. Lack of information concerning the flow of arra...
Interprocedural Transformations for Parallel Code Generation
- IN PROCEEDINGS OF SUPERCOMPUTING '91
, 1991
Cited by 41 (14 self)
We present a new approach that enables compiler optimization of procedure calls and loop nests containing procedure calls. We introduce two interprocedural transformations that move loops across procedure boundaries, exposing them to traditional optimizations on loop nests. These transformations are incorporated into a code generation algorithm for a shared-memory multiprocessor. The code generator relies on a machine model to estimate the expected benefits of loop parallelization and parallelism-enhancing transformations. Several transformation strategies are explored and one that minimizes total execution time is selected. Efficient support of this strategy is provided by an existing interprocedural compilation system. We demonstrate the potential of these techniques by applying this code generation strategy to two scientific applications programs.
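A rough sketch of what moving a loop across a procedure boundary can look like follows. The abstract does not name its two transformations, so this shows only one plausible direction (moving a caller's loop into the callee). All function names are hypothetical.

```python
def scale_row(matrix, i, factor):
    """Callee before the transformation: it contains only the inner j loop."""
    for j in range(len(matrix[i])):
        matrix[i][j] *= factor


def caller_before(matrix, factor):
    """The i loop sits in the caller, so its body is just an opaque call."""
    for i in range(len(matrix)):
        scale_row(matrix, i, factor)
    return matrix


def scale_all_rows(matrix, factor):
    """Callee after the transformation: the caller's i loop has moved in here,
    so both loops form one visible nest that loop optimizations can work on."""
    for i in range(len(matrix)):
        for j in range(len(matrix[i])):
            matrix[i][j] *= factor


def caller_after(matrix, factor):
    """The caller now makes a single call instead of looping over calls."""
    scale_all_rows(matrix, factor)
    return matrix
```

Both versions compute the same result. The difference is that the transformed callee exposes a complete two-deep loop nest within a single procedure.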
Automatic and Interactive Parallelization
, 1994
Cited by 38 (8 self)
The goal of this dissertation is to give programmers the ability to achieve high performance by focusing on developing parallel algorithms, rather than on architecture-specific details. The advantages of this approach also include program portability and legibility. To achieve high performance, we provide automatic compilation techniques that tailor parallel algorithms to shared-memory multiprocessors with local caches and a common bus. In particular, the compiler maps complete applications onto the specifics of a machine, exploiting both parallelism and memory. To optimize complete applications, we develop novel, general algorithms to transform loops that contain arbitrary conditional control flow. In addition, we provide new interprocedural transformations which enable optimization across procedure boundaries. These techniques provide the basis for a robust automatic parallelizing algorithm that is applicable to complete programs. The algorithm for automatic parallel code generation t...
An Efficient Algorithm for the Run-time Parallelization of Doacross Loops
- In Proceedings of Supercomputing 1994
, 1994
Cited by 32 (4 self)
While automatic parallelization of loops is generally based on compile-time analysis of data dependences, sometimes the data dependences cannot be determined at compile time. This often occurs, for example, when arrays are accessed via subscripted subscripts. In these cases, it is necessary to use run-time parallelization algorithms. In this paper, we present and evaluate a new run-time parallelization algorithm based on an inspector-executor pair. The scheme, called CYT, handles all types of data dependences in the loop without requiring any special architectural support. Furthermore, compared to an older scheme as general as the new algorithm, the latter speeds up execution by reducing the amount of interprocessor communication required. The new algorithm is applied to the Perfect Club codes and to parameterized loops, all running on the 32-CPU Cedar shared-memory multiprocessor. Although most loops with subscripted subscripts in the Perfect Club codes are highly parallel, their sma...
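The inspector-executor idea can be sketched generically as follows. This is not the CYT scheme, whose details the abstract does not give. It is a simple wavefront variant for a hypothetical loop `a[writes[i]] = a[reads[i]] + 1`, and the executor only simulates parallel execution by running each wavefront in an arbitrary order.

```python
def inspector(reads, writes):
    """Assign each iteration to a wavefront using the run-time index arrays.

    An iteration is placed strictly after any earlier iteration that wrote the
    element it reads (flow), wrote the element it writes (output), or read the
    element it writes (anti).
    """
    last_write = {}   # element -> stage of the latest iteration writing it
    last_read = {}    # element -> highest stage of any iteration reading it
    stage_of = []
    for r, w in zip(reads, writes):
        stage = 1 + max(last_write.get(r, -1),
                        last_write.get(w, -1),
                        last_read.get(w, -1))
        stage_of.append(stage)
        last_read[r] = max(last_read.get(r, -1), stage)
        last_write[w] = stage
    wavefronts = [[] for _ in range(max(stage_of, default=-1) + 1)]
    for i, s in enumerate(stage_of):
        wavefronts[s].append(i)
    return wavefronts


def executor(a, reads, writes, wavefronts):
    """Run wavefronts in order; iterations inside one wavefront are independent,
    so any order works (reversed here to make that point)."""
    for front in wavefronts:
        for i in reversed(front):
            a[writes[i]] = a[reads[i]] + 1
    return a


def sequential(a, reads, writes):
    """Reference: the original loop executed in program order."""
    for r, w in zip(reads, writes):
        a[w] = a[r] + 1
    return a
```

On a real multiprocessor each wavefront would be distributed across processors with a barrier between wavefronts. The cost of the inspector pass is the overhead that run-time schemes of this kind have to amortize.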
Scheduling And Behavioral Transformations For Parallel Systems
, 1993
Cited by 27 (3 self)
In a parallel system, either a VLSI architecture in hardware or a parallel program in software, the quality of the final design depends on the ability of a synthesis system to exploit the parallelism hidden in the input description of applications. Since iterative or recursive algorithms are usually the most time-critical parts of an application, the parallelism embedded in the repetitive pattern of an iterative algorithm needs to be explored. This thesis studies techniques and algorithms to expose the parallelism in an iterative algorithm so that the designer can find an implementation achieving a desired execution rate. In particular, the objective is to find an efficient schedule to be executed iteratively. A form of data-flow graphs is used to model the iterative part of an application, e.g. a digital signal filter or the while/for loop of a program. Nodes in the graph represent operations to be performed and edges represent both intra-iteration and inter-iteration precedence relat...
Dependence Uniformization: A Loop Parallelization Technique
, 1991
Cited by 26 (0 self)
In general, any nested loop can be parallelized as long as all dependence constraints among iterations are preserved by applying appropriate synchronizations. However, the performance is significantly affected by the synchronization overhead. In case of irregular and complex dependence constraints, it is very difficult to efficiently and systematically arrange synchronization primitives. In this paper, we propose a new method, data dependence uniformization, to overcome the difficulties in parallelizing a doubly nested loop with irregular dependence constraints. Our approach is based on the concept of vector decomposition. A simple set of basic dependences is developed of which all dependence constraints can be composed. The set of basic dependences then will be added to every iteration to replace all original dependences so that the dependence constraints become uniform. Finally, an efficient synchronization method is presented to obey the uniform dependence constraints in every iter...

