Abstractions for Recursive Pointer Data Structures: Improving the Analysis and Transformation of Imperative Programs
, 1992
"... Even though impressive progress has been made... ..."
A Practical Data Flow Framework for Array Reference Analysis and its Use in Optimizations
 In ACM SIGPLAN'93 Conf. on Prog. Lang. Design and Implementation
, 1993
Cited by 57 (2 self)
Data flow analysis techniques have traditionally been restricted to the analysis of scalar variables. This restriction, however, imposes a limitation on the kinds of optimizations that can be performed in loops containing array references. We present a data flow framework for array reference analysis that provides the information needed in various optimizations targeted at sequential or fine-grained parallel architectures. The framework extends the traditional scalar framework by incorporating iteration distance values into the analysis to qualify the computed data flow solution during the fixed point iteration. Analyses phrased in this framework are capable of discovering recurrent access patterns among array references that evolve during the execution of a loop. The framework is practical in that the fixed point solution requires at most three passes over the body of structured loops. Applications of our framework are discussed for register allocation, load/store optimizations, and controlled loop unrolling.
A Framework for Resource-Constrained Rate-Optimal Software Pipelining
 IEEE Transactions on Parallel and Distributed Systems
, 1996
Cited by 24 (9 self)
The rapid advances in high-performance computer architecture and compilation techniques provide both challenges and opportunities to exploit the rich solution space of software pipelined loop schedules. In this paper, we develop a framework to construct a software pipelined loop schedule which runs on the given architecture (with a fixed number of processor resources) at the maximum possible iteration rate (à la rate-optimal) while minimizing the number of buffers, a close approximation to minimizing the number of registers. The main contributions of this paper are:
• First, we demonstrate that such a problem can be described by a simple mathematical formulation with precise optimization objectives under a periodic linear scheduling framework. The mathematical formulation provides a clear picture which permits one to visualize the overall solution space (for rate-optimal schedules) under different sets of constraints.
• Secondly, we show that a precise mathematical formulation...
Designing Programming Languages for Analyzability: A Fresh Look at Pointer Data Structures
 In Proceedings of the IEEE 1992 International Conference on Programming Languages
, 1992
Cited by 21 (4 self)
In this paper we propose a programming language mechanism and associated compiler techniques which significantly enhance the analyzability of pointer-based data structures frequently used in non-scientific programs. Our approach is based on exploiting two important properties of pointer data structures: structural inductivity and speculative traversability. Structural inductivity facilitates the application of a static interference analysis method for such pointer data structures based on path matrices, and speculative traversability is utilized by a novel loop unrolling technique for while loops that exploits fine-grain parallelism by speculatively traversing such data structures. The effectiveness of this approach is demonstrated by applying it to a collection of loops found in typical non-scientific C programs. 1 Introduction In the past decade, the dramatic improvement of VLSI technology has led to modern high-performance microprocessors that support some level of fine-grain paralle...
Resource-Bounded Partial Evaluation
 In Proceedings of PEPM’97, the ACM SIGPLAN Symposium on Partial Evaluation and Semantics-Based Program Manipulation
Cited by 14 (0 self)
Most partial evaluators do not take the availability of machine-level resources, such as registers or cache, into consideration when making their specialization decisions. The resulting resource contention can lead to severe performance degradation, causing, in extreme cases, the specialized code to run slower than the unspecialized code. In this paper we consider how resource considerations can be incorporated within a partial evaluator. We develop an abstract formulation of the problem, show that optimal resource-bounded partial evaluation is NP-complete, and discuss simple heuristics that can be used to address the problem in practice. 1 Introduction The field of partial evaluation has matured greatly in recent years, and partial evaluators have been implemented for a wide variety of programming languages [1, 4, 5, 6, 20, 29]. A central concern guiding these implementations has been to ensure that input programs should be specialized as far as possible without compromising ...
Iterative Compilation and Performance Prediction for Numerical Applications
, 2004
Cited by 13 (10 self)
As the current rate of improvement in processor performance far exceeds the rate of memory performance, memory latency is the dominant overhead in many performance-critical applications. In many cases, automatic compiler-based approaches to improving memory performance are limited and programmers frequently resort to manual optimisation techniques. However, this process is tedious and time-consuming. Furthermore, a diverse range of rapidly evolving hardware makes the optimisation process even more complex. It is often hard to predict the potential benefits from different optimisations and there are no simple criteria to stop optimisations, i.e. when optimal memory performance has been achieved or sufficiently approached. This thesis presents a platform-independent optimisation approach for numerical applications based on iterative feedback-directed program restructuring using a new reasonably fast and accurate performance prediction technique for guiding optimisations. New strategies for searching the optimisation space, by means of ...
Efficient algorithms for computing the condition number of a tridiagonal matrix
 SIAM J. Sci. Statist. Comput
, 1986
Cited by 12 (2 self)
Abstract. Let $A$ be a tridiagonal matrix of order $n$. We show that it is possible to compute $\|A^{-1}\|_\infty$, and hence $\mathrm{cond}_\infty(A)$, in $O(n)$ operations. Several algorithms which perform this task are given and their numerical properties are investigated. If $A$ is also positive definite then $\|A^{-1}\|_\infty$ can be computed as the norm of the solution to a positive definite tridiagonal linear system whose coefficient matrix is closely related to $A$. We show how this computation can be carried out in parallel with the solution of a linear system $Ax = b$. In particular we describe some simple modifications to the LINPACK routine SPTSL which enable this routine to compute $\mathrm{cond}_1(A)$, efficiently, in addition to solving $Ax = b$. Key words. matrix condition number, tridiagonal matrix, positive definite matrix, LINPACK
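The positive definite case described in the abstract above can be illustrated with a minimal sketch. It is not the paper's code. It assumes that A is symmetric positive definite and that the "closely related" coefficient matrix is the comparison matrix M(A) (absolute values on the diagonal, negated absolute values off the diagonal), for which the entrywise absolute value of the inverse of A equals the inverse of M(A). The infinity norm of the inverse is then the largest entry of the solution of M(A) x = e, where e is the vector of ones. The function names `solve_tridiagonal` and `cond_inf_spd_tridiagonal` are hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system in O(n) with the Thomas algorithm.

    No pivoting is performed; this sketch assumes a positive definite
    coefficient matrix, for which the elimination does not break down.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0] if n > 1 else 0.0
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i - 1] * c[i - 1]
        if i < n - 1:
            c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i - 1] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def cond_inf_spd_tridiagonal(diag, off):
    """Infinity-norm condition number of a symmetric positive definite
    tridiagonal matrix with diagonal `diag` (n entries) and off-diagonal
    `off` (n - 1 entries), using O(n) operations.

    Assumption (not stated in the listing): |A^{-1}| = M(A)^{-1}, where
    M(A) is the comparison matrix of A.
    """
    n = len(diag)
    m_diag = [abs(v) for v in diag]
    m_off = [-abs(v) for v in off]
    x = solve_tridiagonal(m_off, m_diag, m_off, [1.0] * n)
    inv_norm = max(abs(v) for v in x)  # norm of the inverse of A
    a_norm = max(
        abs(diag[i])
        + (abs(off[i - 1]) if i > 0 else 0.0)
        + (abs(off[i]) if i < n - 1 else 0.0)
        for i in range(n)
    )  # largest absolute row sum of A
    return a_norm * inv_norm
```

For the 3-by-3 matrix with 2 on the diagonal and -1 on the off-diagonals, `cond_inf_spd_tridiagonal([2.0, 2.0, 2.0], [-1.0, -1.0])` evaluates to about 8: the largest row sum of A is 4 and the largest row sum of its inverse is 2.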
An Aggressive Approach to Loop Unrolling
 Proc. Compiler Construction '96
, 1995
Cited by 10 (1 self)
A well-known code transformation for improving the execution performance of a program is loop unrolling. The most obvious benefit of unrolling a loop is that the transformed loop usually, but not always, requires fewer instruction executions than the original loop. The reduction in instruction executions comes from two sources: the number of branch instructions executed is reduced, and the index variable is modified fewer times. In addition, for architectures with features designed to exploit instruction-level parallelism, loop unrolling can expose greater levels of instruction-level parallelism. Loop unrolling is an effective code transformation often improving the execution performance of programs that spend much of their execution time in loops by ten to thirty percent. Possibly because of the effectiveness of a simple application of loop unrolling, it has not been studied as extensively as other code improvements such as register allocation or common subexpression elimination. The r...
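The two sources of savings named in the abstract above (fewer branches, fewer index updates) can be made concrete with a small sketch. It is an illustration only, not the paper's algorithm: the unroll factor of 4, the explicit `control_steps` counter, and both function names are assumptions introduced here. Each `control_steps` increment stands for one loop test plus one index update.

```python
def sum_rolled(a):
    """Original loop: one test-and-branch and one index update per element."""
    total = 0
    i = 0
    control_steps = 0
    while i < len(a):
        total += a[i]
        i += 1
        control_steps += 1
    return total, control_steps


def sum_unrolled4(a):
    """Loop unrolled by a factor of 4, with a remainder loop for leftovers."""
    total = 0
    i = 0
    control_steps = 0
    n = len(a)
    limit = n - n % 4  # largest multiple of 4 not exceeding n
    while i < limit:
        # Four copies of the body share one test and one index update.
        total += a[i] + a[i + 1] + a[i + 2] + a[i + 3]
        i += 4
        control_steps += 1
    while i < n:  # remainder loop handles the last n % 4 elements
        total += a[i]
        i += 1
        control_steps += 1
    return total, control_steps
```

On a 10-element input both functions return the same sum, but the rolled loop performs 10 control steps while the unrolled version performs 4 (two unrolled iterations plus two remainder iterations). The remainder loop is also why unrolling does not always reduce the instruction count on short trip counts.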
Increasing Memory Bandwidth with Wide Buses: Compiler, . . .
, 1997
Cited by 6 (2 self)
Memory latency and lack of bandwidth are the main barriers to achieving high performance from current and future processors, especially in numeric applications. New organizations of the memory subsystem as well as hardware and software mechanisms to effectively exploit them are required. The paper presents a new compilation technique to pack several load/stores (that access consecutive memory locations) into a single wide load/store, so that the number of wide load/stores is maximized. It also evaluates the performance tradeoffs of wide buses and the additional register pressure, showing that it is minimal and has negligible effects. Finally, the paper proposes a hardware mechanism to detect and group memory accesses into wide accesses at run time, so that binary compatibility is preserved. The evaluations are performed using 1180 loops that represent about 78% of the execution time of the Perfect Club. The results reveal that using wide buses is a cost-effective solution to improve the ...
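The idea of packing accesses to consecutive locations into wide accesses can be sketched as a simple greedy grouping. This is a toy illustration under stated assumptions, not the compilation technique of the paper: addresses are word indices, alignment constraints and dependences between accesses are ignored, and the function name `pack_wide` and its `width` parameter are hypothetical.

```python
def pack_wide(addresses, width=2):
    """Greedily group word addresses into accesses of up to `width`
    consecutive words. Returns a list of groups, one per memory access.
    """
    groups = []
    for addr in sorted(set(addresses)):
        last = groups[-1] if groups else None
        if last and len(last) < width and addr == last[-1] + 1:
            last.append(addr)  # extend the current wide access
        else:
            groups.append([addr])  # start a new access
    return groups
```

With a bus two words wide, `pack_wide([0, 1, 2, 3, 8, 10, 11])` groups the seven accesses into four: three wide accesses and one narrow access to word 8.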
Efficient Register Allocation via Coloring Using Clique Separators
Cited by 4 (0 self)
Although graph coloring is widely recognized as an effective technique for register allocation, memory demands can become quite high for large interference graphs that are needed in coloring. In this paper we present an algorithm that uses the notion of clique separators to improve the space overhead of coloring. The algorithm, based on a result by R. Tarjan regarding the colorability of graphs, partitions program code into code segments using clique separators. The interference graphs for the code partitions are constructed one at a time and colored independently. The colorings for the partitions are combined to obtain a register allocation for the entire program. This approach can be used to perform register allocation in a space-efficient manner. For straight-line code (e.g., local register allocation), an optimal allocation can be obtained from optimal allocations for individual code partitions. Experimental results are presented demonstrating memory demand reductions for interference graphs when allocating registers using clique separators. Categories and Subject Descriptors: C.0 [Computer Systems Organization]: General—hardware/software interfaces; D.3.4 [Programming Languages]: Processors—code generation; compilers; optimization