Results 1  10
of
18
Improving Register Allocation for Subscripted Variables
, 1990
"... INTRODUCTION By the late 1980s, memory system performance and CPU performance had already begun to diverge. This trend made effective use of the register file imperative for excellent performance. Although most compilers at that time allocated scalar variables to registers using graph coloring with ..."
Abstract

Cited by 200 (34 self)
 Add to MetaCart
INTRODUCTION By the late 1980s, memory system performance and CPU performance had already begun to diverge. This trend made effective use of the register file imperative for excellent performance. Although most compilers at that time allocated scalar variables to registers using graph coloring with marked success [12, 13, 14, 6], allocation of array values to registers only occurred in rare circumstances because standard dataflow analysis techniques could not uncover the available reuse of array memory locations. This deficiency was especially problematic for scientific codes since a majority of the computation involves array references. Our original paper addressed this problem by presenting an algorithm and experiment for a loop transformation, called scalar replacement, that exposed the reuse available in array references in an innermost loop. It also demonstrated experimentally how another loop transformation, called unrollandjam [2], could expose more opportunities for scalarâ€¦
Scanning Polyhedra with DO Loops
, 1991
"... Supercompilers perform complex program transformations which often result in new loop bounds. This paper shows that, under the usual assumptions in automatic parallelization, most transformations on loop nests can be expressed as affine transformations on integer sets de ned by polyhedra and that th ..."
Abstract

Cited by 195 (5 self)
 Add to MetaCart
Supercompilers perform complex program transformations which often result in new loop bounds. This paper shows that, under the usual assumptions in automatic parallelization, most transformations on loop nests can be expressed as affine transformations on integer sets de ned by polyhedra and that the new loop bounds can be computed with algorithms using Fourier's pairwise elimination method although it is not exact for integer sets. Sufficient conditions to use pairwise elimination on integer sets and to extend it to pseudolinear constraints are also given. A tradeo has to be made between dynamic overhead due to some bound slackness and compilation complexity but the resulting code is always correct. These algorithms can be used to interchange or block loops regardless of the loop bounds or the blocking strategy and to safely exchange array parts between two levels of a memory hierarchy or between neighboring processors in a distributed memory machine.
Improving the Ratio of Memory Operations to FloatingPoint Operations in Loops
 ACM Transactions on Programming Languages and Systems
, 1994
"... this paper we attempt to answer that question. To do so, we develop and evaluate techniques that automatically restructure program loops to achieve high performance on specific target architectures. These methods attempt to balance computation and memory accesses and seek to eliminate or reduce pipe ..."
Abstract

Cited by 97 (18 self)
 Add to MetaCart
this paper we attempt to answer that question. To do so, we develop and evaluate techniques that automatically restructure program loops to achieve high performance on specific target architectures. These methods attempt to balance computation and memory accesses and seek to eliminate or reduce pipeline interlock. To do this, they statically estimate the balance between memory operations and floatingpoint operations for each loop in a particular program and use these estimates to determine whether to apply various loop transformations. Experiments with our automatic techniques show that integerfactor speedups are possible on kernels. Additionally, the estimate of the balance between memory operations and computation, and the application of the estimate are very accurateexperiments reveal little difference between the balance achieved by our automatic system and that possible by hand optimization. Categories and Subject Descriptors: D.3.4 [Programming Languages]: ProcessorsCompilers ;
Perfect Pipelining: A New Loop Parallelization Technique
, 1988
"... Parallelizing compilers do not handle loops in a satisfactory manner. Finegrain transformations capture irregular parallelism inside a loop body not amenable to coarser approaches but have limited ability to exploit parallelism across iterations. Coarse methods sacrifice irregular forms of parall ..."
Abstract

Cited by 60 (9 self)
 Add to MetaCart
Parallelizing compilers do not handle loops in a satisfactory manner. Finegrain transformations capture irregular parallelism inside a loop body not amenable to coarser approaches but have limited ability to exploit parallelism across iterations. Coarse methods sacrifice irregular forms of parallelism in favor of pipelining (overlapping) iterations. In this paper we present a new transformation, Perfect Pipelining, that bridges the gap between these fine and coarsegrain transformations while retaining the desirable features of both. This is accomplished even in the presence of conditional branches and resource constraints. To make our claims rigorous, we develop a formalism for parallelization. The formalism can also be used to compare transformations across computational models. As an illustration, we show that Doacross, a transformation intended for synchronous and asynchronous multiprocessors, can be expressed as a restriction of Perfect Pipelining. 1 Introduction A si...
Scheduling And Behavioral Transformations For Parallel Systems
, 1993
"... In a parallel system, either a VLSI architecture in hardware or a parallel program in software, the quality of the final design depends on the ability of a synthesis system to exploit the parallelism hidden in the input description of applications. Since iterative or recursive algorithms are usually ..."
Abstract

Cited by 39 (3 self)
 Add to MetaCart
In a parallel system, either a VLSI architecture in hardware or a parallel program in software, the quality of the final design depends on the ability of a synthesis system to exploit the parallelism hidden in the input description of applications. Since iterative or recursive algorithms are usually the most timecritical parts of an application, the parallelism embedded in the repetitive pattern of an iterative algorithm needs to be explored. This thesis studies techniques and algorithms to expose the parallelism in an iterative algorithm so that the designer can find an implementation achieving a desired execution rate. In particular, the objective is to find an efficient schedule to be executed iteratively. A form of dataflow graphs is used to model the iterative part of an application, e.g. a digital signal filter or the while/for loop of a program. Nodes in the graph represent operations to be performed and edges represent both intraiteration and interiteration precedence relat...
Scheduling timecritical instructions on risc machines
 ACM Transactions on Programming Languages and Systems
, 1993
"... We present a polynomial time algorithm for constructing a minimum completion time schedule of instructions from a basic block on RISC machines such as the Sun SPARC, the IBM 801, the Berkeley RISC machine, and the HP Precision Architecture. Our algorithm can be used as a heuristic for RISC processor ..."
Abstract

Cited by 37 (5 self)
 Add to MetaCart
We present a polynomial time algorithm for constructing a minimum completion time schedule of instructions from a basic block on RISC machines such as the Sun SPARC, the IBM 801, the Berkeley RISC machine, and the HP Precision Architecture. Our algorithm can be used as a heuristic for RISC processors with longer pipelines, for which there is no known optimal algorithm. Our algorithm can also handle timecritical instructions, which are instructions that have to be completed by a specific time. Timecritical instructions occur in some realtime computations, and can also be used to make shared resources such as registers quickly available for reuse. We also prove that in the absence of timecritical constraints, a greedy scheduling algorithm always produces a schedule for a target machine with multiple identical pipelines that has a length less than twice that of an optimal schedule. The behavior of the heuristic is of interest because, as we show, the instruction scheduling problem becomes NPhard for arbitrary length pipelines, even when the basic block of code being input consists of only several independent streams of straightline code, and there are no timecritical constraints, Finally, we prove that the problem becomes NPhard even for small pipelines, no timecritical constraints, and input of several independent streams of straightline code if either there is only a single register or if no two instructions are allowed to complete simultaneously because of some shared resource such as a bus
UnrollandJam Using Uniformly Generated Sets
 In Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO
, 1997
"... Modern architectural trends in instructionlevel parallelism (ILP) are to increase the computational power of microprocessors significantly. As a result, the demands on memory have increased. Unfortunately, memory systems have not kept pace. Even hierarchical cache structures are ineffective if prog ..."
Abstract

Cited by 20 (2 self)
 Add to MetaCart
Modern architectural trends in instructionlevel parallelism (ILP) are to increase the computational power of microprocessors significantly. As a result, the demands on memory have increased. Unfortunately, memory systems have not kept pace. Even hierarchical cache structures are ineffective if programs do not exhibit cache locality. Because of this compilers need to be concerned not only with finding ILP to utilize machine resources effectively, but also with ensuring that the resulting code has a high degree of cache locality. One compiler transformation that is essential for a compiler to meet the above objectives is unrollandjam, or outerloop unrolling. Previous work either has used a dependencebased model [7] to compute unroll amounts, significantly increasing the size of the dependence graph, or has applied a more brute force technique [16]. In this paper, we present an algorithm that uses a linearalgebrabased technique to compute unroll amounts. This technique results in an...
ScheduleBased MultiDimensional Retiming on Data Flow Graphs
 Proceedings of 8th International Parallel Processing Symposium
, 1994
"... Transformation techniques are usually applied to get optimal execution rates in parallel and/or pipeline systems. The retiming technique is a common and valuable tool in onedimensional problems, represented by Data Flow Graphs (DFGs) such as DSP filters, which can maximize the parallelism of a loop ..."
Abstract

Cited by 18 (15 self)
 Add to MetaCart
Transformation techniques are usually applied to get optimal execution rates in parallel and/or pipeline systems. The retiming technique is a common and valuable tool in onedimensional problems, represented by Data Flow Graphs (DFGs) such as DSP filters, which can maximize the parallelism of a loop body represented by a DFG. Since most scientific or DSP applications are recursive or iterative, to increase the parallelism of the loop body can substantially decrease the overall computation time. Few results on retiming have been obtained for multidimensional problems. The previous result of multidimensional retiming is only applied to a restricted class of Data Flow Graphs in which every total delay vector in a cycle has to be strictly nonnegative. This paper develops a novel retiming technique that considers the final schedule as part of the process. To authors' knowledge, this is the first retiming algorithm for general multidimensional Data Flow Graphs. The description and the correctness of our algorithm are presented in the paper. Through the experiments, results have shown that our algorithm runs efficiently. Some DSP filters are used in the paper as an example of the application of our algorithm. 1
Full Parallelism In Uniform Nested Loops Using MultiDimensional Retiming
, 1994
"... Most scientific and DSP applications are recursive or iterative. Uniform nested loops can be modeled as multidimensional data flow graphs (DFGs). To achieve full parallelism of the loop body, i.e., all the computational nodes executed in parallel, substantially decreases the overall computation ti ..."
Abstract

Cited by 12 (5 self)
 Add to MetaCart
Most scientific and DSP applications are recursive or iterative. Uniform nested loops can be modeled as multidimensional data flow graphs (DFGs). To achieve full parallelism of the loop body, i.e., all the computational nodes executed in parallel, substantially decreases the overall computation time. It is well known that for onedimensional DFGs retiming can not always achieve full parallelism. This paper shows an important and counterintuitive result, which proves that we can always obtain fullparallelism for DFGs with more than one dimension. It also presents two novel multidimensional retiming techniques to obtain full parallelism.
TimeC: A Time Constraint Language for ILP Processor Compilation
, 1998
"... . Enabled by RISC technologies, lowcost commodity microprocessors are performing at ever increasing levels, significantly via instruction level parallelism (ILP). This in turn increases the opportunities for their use in a variety of daytoday applications ranging from the simple control of applia ..."
Abstract

Cited by 7 (0 self)
 Add to MetaCart
. Enabled by RISC technologies, lowcost commodity microprocessors are performing at ever increasing levels, significantly via instruction level parallelism (ILP). This in turn increases the opportunities for their use in a variety of daytoday applications ranging from the simple control of appliances such as microwave ovens, to sophisticated systems for cabin control in modern aircraft. Indeed, "embedded" applications such as these represent segments in the computer industry with great potential for growth. However, this growth is currently impeded by the lack of robust optimizing compiler technologies that support the assured, rapid and inexpensive prototyping of realtime software in the context of microprocessors with ILP. In this paper we describe a novel notation, TimeC, for specifying timing constraints in programs, independent of the base language being used to develop the embedded application; TimeC specifications are language independent and can be instrumented into imperat...