Iterative modulo scheduling: An algorithm for software pipelining loops
In Proceedings of the 27th Annual International Symposium on Microarchitecture, 1994
Cited by 263 (2 self)
Modulo scheduling is a framework within which a wide variety of algorithms and heuristics may be defined for software pipelining innermost loops. This paper presents a practical algorithm, iterative modulo scheduling, that is capable of dealing with realistic machine models. This paper also characterizes the algorithm in terms of the quality of the generated schedules as well as the computational expense incurred.
Practical Dependence Testing
1991
Cited by 131 (16 self)
Precise and efficient dependence tests are essential to the effectiveness of a parallelizing compiler. This paper proposes a dependence testing scheme based on classifying pairs of subscripted variable references. Exact yet fast dependence tests are presented for certain classes of array references, as well as empirical results showing that these references dominate scientific Fortran codes. These dependence tests are being implemented at Rice University in both PFC, a parallelizing compiler, and ParaScope, a parallel programming environment.
Dynamic Dependency Analysis of Ordinary Programs
In Proceedings of the 19th Annual International Symposium on Computer Architecture, 1992
Cited by 83 (9 self)
A quantitative analysis of program execution is essential to the computer architecture design process. With the current trend in architecture of enhancing the performance of uniprocessors by exploiting fine-grain parallelism, first-order metrics of program execution, such as operation frequencies, are not sufficient; characterizing the exact nature of dependencies between operations is essential. This paper presents a methodology for constructing the dynamic execution graph that characterizes the execution of an ordinary program (an application program written in an imperative language such as C or FORTRAN) from a serial execution trace of the program. It then uses the methodology to study parallelism in the SPEC benchmarks. We see that the parallelism can be bursty in nature (periods of lots of parallelism followed by periods of little parallelism), but the average parallelism is quite high, ranging from 13 to 23,302 operations per cycle. Exposing this parallelism requires renaming of...
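The idea of measuring parallelism from a serial trace can be illustrated with a minimal sketch. It is not the paper's methodology: the trace format, the function name `average_parallelism`, unit operation latency, and perfect renaming (only read-after-write dependences are kept) are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical trace format: one (destination, sources) pair per executed operation.
Trace = List[Tuple[str, Tuple[str, ...]]]


def average_parallelism(trace: Trace) -> float:
    """Operations divided by critical-path length of the dependence graph.

    Assumes unit latency and perfect renaming, so only true
    (read-after-write) dependences constrain an operation.
    """
    ready: Dict[str, int] = {}  # level at which the latest value of each name is produced
    depth = 0                   # length of the longest dependence chain seen so far
    for dest, sources in trace:
        # An operation sits one level below its latest-produced operand.
        level = 1 + max((ready.get(s, 0) for s in sources), default=0)
        ready[dest] = level
        depth = max(depth, level)
    return len(trace) / depth if depth else 0.0


# Two independent operations feed a chain of two dependent ones.
example = [("a", ()), ("b", ()), ("c", ("a", "b")), ("d", ("c",))]
print(average_parallelism(example))
```

In this sketch the four operations occupy three dependence levels, so the ratio is four operations over a critical path of three.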
A quantitative analysis of loop nest locality
In Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, 1996
Cited by 66 (8 self)
This paper analyzes and quantifies the locality characteristics of numerical loop nests in cache memories to suggest future directions for architecture and software cache optimizations. The following three observations motivated this study. (1) Since most programs spend the majority of their time in nests, the vast majority of cache optimization techniques target loop nests. However, the locality characteristics that drive these optimizations are often collected across the entire application and not at the nest level. (2) Numerical codes have been studied for so long that a number of popular assertions have emerged on their locality characteristics. (3) Many optimizations are motivated by these assertions without further exploring their accuracy.
Supercomputer Performance Evaluation and the Perfect Benchmarks
In Proceedings of the 1990 ACM International Conference on Supercomputing, 1990
Cited by 64 (0 self)
In the past three years, the Perfect Benchmark™ Suite has evolved from a supercomputer performance evaluation plan, presented by Kuck and Sameh at the 1987 International Conference on Supercomputing, to a vigorous international activity. This paper surveys the current state of supercomputer performance evaluation with particular focus on the methodology adopted by the Perfect effort. While there has been considerable success in achieving the goals of the plan, some issues remain unresolved, and new questions have surfaced.
1 Introduction
During the four decades since the invention of the transistor, performance increases in computers have been attributable, in large part, to increases in hardware speed, averaging an order of magnitude every seven years. In recent years, the progress of hardware technology has begun to slow as certain fundamental limits (i.e., the speed of light and the width of the atom) have been approached. In an effort to sustain increases in the peak speed of ne...
Stage Scheduling: A Technique to Reduce the Register Requirements of a Modulo Schedule
In Proc. of the 28th Annual Int. Symp. on Microarchitecture (MICRO-28), 1995
Cited by 57 (5 self)
Modulo scheduling is an efficient technique for exploiting instruction level parallelism in a variety of loops, resulting in high performance code but increased register requirements. We present a set of low computational complexity stage-scheduling heuristics that reduce the register requirements of a given modulo schedule by shifting operations by multiples of II cycles. Measurements on a benchmark suite of 1289 loops from the Perfect Club, SPEC-89, and the Livermore Fortran Kernels show that our best heuristic achieves on average 99% of the decrease in register requirements obtained by an optimal stage scheduler.
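A small sketch may help show why shifting by multiples of II is a safe move. It is not one of the paper's heuristics: the schedule, the operation names, the helpers `shift_by_stages` and `lifetime`, and the one-cycle load latency are all hypothetical. Moving an operation by a whole number of stages leaves its slot in the modulo reservation pattern (its cycle modulo II) unchanged, while the distance between producer and consumer, and so the time a value occupies a register, can shrink.

```python
from typing import Dict


def shift_by_stages(schedule: Dict[str, int], op: str, stages: int, ii: int) -> Dict[str, int]:
    """Return a copy of the schedule with `op` moved by `stages * ii` cycles."""
    shifted = dict(schedule)
    shifted[op] = schedule[op] + stages * ii
    return shifted


def lifetime(schedule: Dict[str, int], producer: str, consumer: str) -> int:
    """Cycles between issuing the producer and issuing the consumer."""
    return schedule[consumer] - schedule[producer]


ii = 3                              # hypothetical initiation interval
schedule = {"load": 0, "add": 7}    # hypothetical issue cycles; "add" consumes the load
shifted = shift_by_stages(schedule, "load", 2, ii)

# The modulo slot is preserved, so resource usage per II window is unchanged.
assert shifted["load"] % ii == schedule["load"] % ii
# The value now waits 1 cycle instead of 7 (assuming a one-cycle load latency).
print(lifetime(schedule, "load", "add"), lifetime(shifted, "load", "add"))
```

A real stage scheduler would also have to check every dependence of the moved operation; this sketch checks none.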
SUIF Explorer: an interactive and interprocedural parallelizer
1999
Cited by 55 (5 self)
The SUIF Explorer is an interactive parallelization tool that is more effective than previous systems in minimizing the number of lines of code that require programmer assistance. First, the interprocedural analyses in the SUIF system are successful in parallelizing many coarse-grain loops, thus minimizing the number of spurious dependences requiring attention. Second, the system uses dynamic execution analyzers to identify those important loops that are likely to be parallelizable. Third, the SUIF Explorer is the first to apply program slicing to aid programmers in interactive parallelization. The system guides the programmer in the parallelization process using a set of sophisticated visualization techniques. This paper demonstrates the effectiveness of the SUIF Explorer with three case studies. The programmer was able to speed up all three programs by examining only a small fraction of the program and privatizing a few variables.
1. Introduction
Exploiting coarse-grain parallelism i...
Effective Cluster Assignment for Modulo Scheduling
In Proceedings of the 31st International Symposium on Microarchitecture (MICRO-31), 1998
Cited by 53 (0 self)
Clustering is one solution to the demand for wide-issue machines and fast clock cycles because it allows for smaller, less ported register files and simpler bypass logic while remaining scalable. Much of the previous work on scheduling for clustered architectures has focused on acyclic code. While minimizing schedule length of acyclic code is paramount, the primary objective when scheduling cyclic code is to maximize the throughput or steady state performance. This paper investigates a pre-modulo scheduling pass that performs cluster assignment in a way that minimizes performance degradation due to explicit communication required as the loops are split over clusters. The proposed cluster assignment algorithm annotates and adjusts the graph for use by the scheduler so that any traditional modulo scheduling algorithm, having no knowledge of clustering, can produce a valid and efficient schedule for a clustered machine.
Interprocedural Symbolic Analysis
1994
Cited by 48 (1 self)
Compiling for efficient execution on advanced computer architectures requires extensive program analysis and transformation. Most compilers limit their analysis to simple phenomena within single procedures, limiting effective optimization of modular codes and making the programmer's job harder. We present methods for analyzing array side effects and for comparing nonconstant values computed in the same and different procedures. Regular sections, described by rectangular bounds and stride, prove as effective in describing array side effects in Linpack as more complicated summary techniques. On a set of six programs, regular section analysis of array side effects gives 0 to 39 percent reductions in array dependences at call sites, with 10 to 25 percent increases in analysis time. Symbolic analysis is essential to data dependence testing, array section analysis, and other high-level program manipulations. We give methods for building symb...
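To make "rectangular bounds and stride" concrete, here is a minimal one-dimensional sketch. It is not the paper's representation or algorithm: the class `RegularSection1D`, its fields, and the conservative `may_overlap` check are assumptions chosen for this example. A section is treated as the elements `lower, lower + stride, ...` up to `upper`; two sections are reported as possibly overlapping unless their bounds are disjoint or their strides can never line up.

```python
from dataclasses import dataclass
from math import gcd


@dataclass(frozen=True)
class RegularSection1D:
    """Hypothetical summary of the array elements a call may touch in one dimension."""
    lower: int
    upper: int
    stride: int = 1  # assumed positive

    def may_overlap(self, other: "RegularSection1D") -> bool:
        """Conservative test: False only when an overlap is provably impossible."""
        # Disjoint bounding ranges cannot share an element.
        if self.upper < other.lower or other.upper < self.lower:
            return False
        # lower1 + s1*k == lower2 + s2*m has integer solutions only if
        # gcd(s1, s2) divides the difference of the starting points.
        return (other.lower - self.lower) % gcd(self.stride, other.stride) == 0


odds = RegularSection1D(1, 99, 2)
evens = RegularSection1D(2, 100, 2)
print(odds.may_overlap(evens))  # the two sections never touch the same element
```

A `True` result only means an overlap was not ruled out; the check ignores whether the common element actually falls inside both bounds.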
Public International Benchmarks for Parallel Computers
1994
Cited by 31 (0 self)
this report: David Bailey (NASA Ames Research Center), Michael Berry (University of Tennessee), Jack Dongarra (University of Tennessee/Oak Ridge National Laboratory), Vladimir Getov (University of Southampton), Tom Haupt (Syracuse University), Tony Hey (University of Southampton), Roger Hockney (University of Southampton), and David Walker (Oak Ridge National Laboratory). The following PARKBENCH participants were instrumental in defining/promoting the effort, attending meetings, and providing helpful comments and suggestions: Ed Brocklehurst (National Physical Laboratory), Koushik Ghosh (Cray Research), Charles Grassl (Cray Research), Ed Kushner (Intel SSD), Brian LaRose (Hewlett Packard), Todd Letsche (University of Tennessee), David Mackay (Intel SSD), Joanne Martin (IBM), Ramesh Natarajan (IBM, Yorktown Heights), Bodo Parady (Sun Microsystems), Robert Pennington (Pittsburgh Supercomputing Center), Philip Tannenbaum (NEC), Pearl Wang (George Mason University/US Geological Survey), and Patrick Worley (Oak Ridge National Laboratory). Special thanks are also due to Jack Dongarra in his role of host at our meetings in Knoxville, and to Mike Berry who has served valiantly as secretary at our meetings and produced excellent minutes in difficult circumstances. This publication and the earlier report could not have been produced without the dedication of Roger Hockney, Mike Berry and Vladimir Getov, who devoted many hours in turning a collection of individual contributions into a coherent LaTeX document that was fit for publication.

