Results 1 - 10 of 17
Improving Register Allocation for Subscripted Variables
, 1990
Cited by 192 (34 self)
By the late 1980s, memory system performance and CPU performance had already begun to diverge. This trend made effective use of the register file imperative for excellent performance. Although most compilers at that time allocated scalar variables to registers using graph coloring with marked success [12, 13, 14, 6], array values were allocated to registers only in rare circumstances, because standard data-flow analysis techniques could not uncover the available reuse of array memory locations. This deficiency was especially problematic for scientific codes, since a majority of their computation involves array references. Our original paper addressed this problem by presenting an algorithm and experiment for a loop transformation, called scalar replacement, that exposed the reuse available in array references in an innermost loop. It also demonstrated experimentally how another loop transformation, called unroll-and-jam [2], could expose more opportunities for scalar…
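The idea of scalar replacement can be illustrated with a minimal sketch. This is not the paper's algorithm, only a hand-applied instance of the transformation on a hypothetical recurrence `a[i] = a[i-1] + b[i]`. Python has no register file, so the local variable `t` merely stands in for a register that a compiler could allocate once the array reuse is made explicit.

```python
def recurrence_naive(a, b):
    """Each iteration re-reads a[i-1], a value stored one iteration earlier."""
    a = list(a)
    for i in range(1, len(a)):
        a[i] = a[i - 1] + b[i]
    return a


def recurrence_scalar_replaced(a, b):
    """Same computation; the reused array value is carried in the scalar t."""
    a = list(a)
    if not a:
        return a
    t = a[0]              # one load before the loop
    for i in range(1, len(a)):
        t = t + b[i]      # reuse comes from the scalar, not from a[i-1]
        a[i] = t          # the store remains; the load of a[i-1] is gone
    return a
```

Both versions return the same list; the second performs one array load per iteration (of `b[i]`) where the first performs two.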
Scanning Polyhedra with DO Loops
, 1991
Cited by 182 (4 self)
Supercompilers perform complex program transformations, which often result in new loop bounds. This paper shows that, under the usual assumptions in automatic parallelization, most transformations on loop nests can be expressed as affine transformations on integer sets defined by polyhedra, and that the new loop bounds can be computed with algorithms using Fourier's pairwise elimination method, although it is not exact for integer sets. Sufficient conditions to use pairwise elimination on integer sets and to extend it to pseudo-linear constraints are also given. A tradeoff has to be made between dynamic overhead due to some bound slackness and compilation complexity, but the resulting code is always correct. These algorithms can be used to interchange or block loops regardless of the loop bounds or the blocking strategy, and to safely exchange array parts between two levels of a memory hierarchy or between neighboring processors in a distributed memory machine.
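As a rough illustration of pairwise elimination, the sketch below projects one variable out of a system of linear inequalities over the rationals. It is a textbook Fourier-Motzkin step, not the paper's algorithm, and the function name `eliminate` and the constraint encoding are assumptions made for this example. As the abstract notes, the rational projection is not exact for integer sets in general.

```python
from fractions import Fraction


def eliminate(constraints, k):
    """Eliminate variable k from constraints of the form
    sum(coeffs[j] * x[j]) <= bound, given as (coeffs, bound) pairs."""
    lower, upper, rest = [], [], []
    for coeffs, bound in constraints:
        if coeffs[k] > 0:
            upper.append((coeffs, bound))   # gives an upper bound on x[k]
        elif coeffs[k] < 0:
            lower.append((coeffs, bound))   # gives a lower bound on x[k]
        else:
            rest.append((coeffs, bound))    # does not mention x[k]
    for lc, lb in lower:
        for uc, ub in upper:
            # Scale so the x[k] coefficients are -1 and +1, then add.
            sl = Fraction(1, -lc[k])
            su = Fraction(1, uc[k])
            combined = tuple(sl * a + su * b for a, b in zip(lc, uc))
            rest.append((combined, sl * lb + su * ub))
    return rest


# Triangular nest over (i, j): 0 <= i <= 4 and 0 <= j <= i.
nest = [((-1, 0), 0), ((1, 0), 4), ((0, -1), 0), ((-1, 1), 0)]
# Projecting out i yields the bounds j would need as the outer loop
# after an interchange.
bounds_on_j = eliminate(nest, 0)
```

For this nest the projection contains `-j <= 0` and `j <= 4`, plus the trivially true `0 <= 4`, which a real implementation would discard.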
Improving the Ratio of Memory Operations to Floating-Point Operations in Loops
- ACM Transactions on Programming Languages and Systems
, 1994
Cited by 91 (16 self)
In this paper we attempt to answer that question. To do so, we develop and evaluate techniques that automatically restructure program loops to achieve high performance on specific target architectures. These methods attempt to balance computation and memory accesses and seek to eliminate or reduce pipeline interlock. To do this, they statically estimate the balance between memory operations and floating-point operations for each loop in a particular program and use these estimates to determine whether to apply various loop transformations. Experiments with our automatic techniques show that integer-factor speedups are possible on kernels. Additionally, the estimate of the balance between memory operations and computation, and the application of the estimate, are very accurate: experiments reveal little difference between the balance achieved by our automatic system and that possible by hand optimization. Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors - Compilers;
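The notion of balance can be made concrete with a small sketch. The helper names `balance` and `is_memory_bound`, the operation counts, and the machine balance of 1/2 are all assumptions chosen for illustration; the paper's actual estimation model is not reproduced here.

```python
from fractions import Fraction


def balance(memory_ops, flops):
    """Ratio of memory operations to floating-point operations."""
    return Fraction(memory_ops, flops)


def is_memory_bound(loop_balance, machine_balance):
    """A loop that needs more memory traffic per flop than the machine
    can supply is limited by memory, so a transformation may pay off."""
    return loop_balance > machine_balance


# Hypothetical inner loop: c += a[i][k] * b[k][j], with c held in a register.
original = balance(memory_ops=2, flops=2)   # 2 loads, 1 multiply + 1 add
# Same loop after unrolling i by 2 and jamming: b[k][j] is loaded once
# and shared by both copies of the body.
unrolled = balance(memory_ops=3, flops=4)
machine = Fraction(1, 2)                    # assumed machine balance
```

Under these assumed counts the balance drops from 1 to 3/4, still above the assumed machine balance, which suggests further unrolling would be worth considering.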
Perfect Pipelining: A New Loop Parallelization Technique
, 1988
Cited by 60 (9 self)
Parallelizing compilers do not handle loops in a satisfactory manner. Fine-grain transformations capture irregular parallelism inside a loop body not amenable to coarser approaches but have limited ability to exploit parallelism across iterations. Coarse methods sacrifice irregular forms of parallelism in favor of pipelining (overlapping) iterations. In this paper we present a new transformation, Perfect Pipelining, that bridges the gap between these fine- and coarse-grain transformations while retaining the desirable features of both. This is accomplished even in the presence of conditional branches and resource constraints. To make our claims rigorous, we develop a formalism for parallelization. The formalism can also be used to compare transformations across computational models. As an illustration, we show that Doacross, a transformation intended for synchronous and asynchronous multiprocessors, can be expressed as a restriction of Perfect Pipelining.
Scheduling Time-Critical Instructions on RISC Machines
- ACM Transactions on Programming Languages and Systems
, 1993
Cited by 36 (5 self)
We present a polynomial time algorithm for constructing a minimum completion time schedule of instructions from a basic block on RISC machines such as the Sun SPARC, the IBM 801, the Berkeley RISC machine, and the HP Precision Architecture. Our algorithm can be used as a heuristic for RISC processors with longer pipelines, for which there is no known optimal algorithm. Our algorithm can also handle time-critical instructions, which are instructions that have to be completed by a specific time. Time-critical instructions occur in some real-time computations, and can also be used to make shared resources such as registers quickly available for reuse. We also prove that in the absence of time-critical constraints, a greedy scheduling algorithm always produces a schedule for a target machine with multiple identical pipelines that has a length less than twice that of an optimal schedule. The behavior of the heuristic is of interest because, as we show, the instruction scheduling problem becomes NP-hard for arbitrary length pipelines, even when the basic block of code being input consists of only several independent streams of straight-line code and there are no time-critical constraints. Finally, we prove that the problem becomes NP-hard even for small pipelines, no time-critical constraints, and input of several independent streams of straight-line code if either there is only a single register or no two instructions are allowed to complete simultaneously because of some shared resource such as a bus.
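A greedy scheduler of the general kind the abstract mentions can be sketched as follows. This is a deliberately simplified model, assuming unit-latency instructions and no time-critical deadlines; it is not the paper's polynomial-time algorithm, and the function name and dependence encoding are hypothetical.

```python
def greedy_schedule(deps, num_pipelines):
    """Greedy scheduling of a basic block on identical pipelines.

    deps maps each instruction to the set of instructions it depends on.
    Every instruction is assumed to take one cycle (a simplification).
    Returns a list of cycles, each a list of issued instructions.
    """
    remaining = {instr: set(preds) for instr, preds in deps.items()}
    done = set()
    schedule = []
    while remaining:
        # An instruction is ready once all of its predecessors completed.
        ready = sorted(i for i, preds in remaining.items() if preds <= done)
        if not ready:
            raise ValueError("dependence cycle in basic block")
        issued = ready[:num_pipelines]   # never idle a pipeline needlessly
        schedule.append(issued)
        done.update(issued)
        for instr in issued:
            del remaining[instr]
    return schedule


block = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}, "e": set()}
cycles = greedy_schedule(block, num_pipelines=2)
```

Here the chain a/b, c, d forces at least three cycles, and the greedy schedule fits the independent instruction `e` alongside `c` to reach that length.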
Scheduling And Behavioral Transformations For Parallel Systems
, 1993
Cited by 27 (3 self)
In a parallel system, either a VLSI architecture in hardware or a parallel program in software, the quality of the final design depends on the ability of a synthesis system to exploit the parallelism hidden in the input description of applications. Since iterative or recursive algorithms are usually the most time-critical parts of an application, the parallelism embedded in the repetitive pattern of an iterative algorithm needs to be explored. This thesis studies techniques and algorithms to expose the parallelism in an iterative algorithm so that the designer can find an implementation achieving a desired execution rate. In particular, the objective is to find an efficient schedule to be executed iteratively. A form of data-flow graphs is used to model the iterative part of an application, e.g. a digital signal filter or the while/for loop of a program. Nodes in the graph represent operations to be performed and edges represent both intra-iteration and inter-iteration precedence relat...
Unroll-and-Jam Using Uniformly Generated Sets
- In Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO
, 1997
Cited by 18 (2 self)
Modern architectural trends in instruction-level parallelism (ILP) are to increase the computational power of microprocessors significantly. As a result, the demands on memory have increased. Unfortunately, memory systems have not kept pace. Even hierarchical cache structures are ineffective if programs do not exhibit cache locality. Because of this, compilers need to be concerned not only with finding ILP to utilize machine resources effectively, but also with ensuring that the resulting code has a high degree of cache locality. One compiler transformation that is essential for a compiler to meet the above objectives is unroll-and-jam, or outer-loop unrolling. Previous work either has used a dependence-based model [7] to compute unroll amounts, significantly increasing the size of the dependence graph, or has applied a more brute force technique [16]. In this paper, we present an algorithm that uses a linear-algebra-based technique to compute unroll amounts. This technique results in an...
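For readers unfamiliar with the transformation, the sketch below applies unroll-and-jam by hand to a matrix-vector product with a fixed unroll amount of 2. It shows only what the transformation does to the loop nest; how to choose the unroll amount, which is the subject of the paper, is not modeled, and the function names are made up for this example.

```python
def matvec(A, x):
    """Reference loop nest: y[i] += A[i][j] * x[j]."""
    n, m = len(A), len(x)
    y = [0] * n
    for i in range(n):
        for j in range(m):
            y[i] += A[i][j] * x[j]
    return y


def matvec_unroll_and_jam(A, x):
    """Outer loop unrolled by 2, with the two inner loops fused (jammed)."""
    n, m = len(A), len(x)
    y = [0] * n
    i = 0
    while i + 1 < n:
        for j in range(m):
            xj = x[j]                    # loaded once, used by both rows
            y[i] += A[i][j] * xj
            y[i + 1] += A[i + 1][j] * xj
        i += 2
    for row in range(i, n):              # cleanup when n is odd
        for j in range(m):
            y[row] += A[row][j] * x[j]
    return y
```

The jammed inner loop touches `x[j]` once for two rows of `A`, which is the kind of reuse the transformation is meant to expose.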
Schedule-Based Multi-Dimensional Retiming on Data Flow Graphs
- Proceedings of 8th International Parallel Processing Symposium
, 1994
Cited by 15 (13 self)
Transformation techniques are usually applied to get optimal execution rates in parallel and/or pipeline systems. The retiming technique is a common and valuable tool in one-dimensional problems, represented by Data Flow Graphs (DFGs) such as DSP filters, which can maximize the parallelism of a loop body represented by a DFG. Since most scientific or DSP applications are recursive or iterative, increasing the parallelism of the loop body can substantially decrease the overall computation time. Few results on retiming have been obtained for multi-dimensional problems. The previous result on multi-dimensional retiming applies only to a restricted class of Data Flow Graphs in which every total delay vector in a cycle has to be strictly non-negative. This paper develops a novel retiming technique that considers the final schedule as part of the process. To the authors' knowledge, this is the first retiming algorithm for general multi-dimensional Data Flow Graphs. The description and the correctness of our algorithm are presented in the paper. Experimental results show that our algorithm runs efficiently. Some DSP filters are used in the paper as an example of the application of our algorithm.
Full Parallelism In Uniform Nested Loops Using Multi-Dimensional Retiming
, 1994
Cited by 9 (4 self)
Most scientific and DSP applications are recursive or iterative. Uniform nested loops can be modeled as multi-dimensional data flow graphs (DFGs). Achieving full parallelism of the loop body, i.e., all the computational nodes executed in parallel, substantially decreases the overall computation time. It is well known that for one-dimensional DFGs retiming cannot always achieve full parallelism. This paper shows an important and counterintuitive result, which proves that we can always obtain full parallelism for DFGs with more than one dimension. It also presents two novel multi-dimensional retiming techniques to obtain full parallelism.
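The effect of a multi-dimensional retiming on edge delays can be sketched as below. The sign convention (new delay = old delay + r(head) - r(tail), in the style of classical retiming) is an assumption of this example and may differ from the paper's, and the two-node graph and retiming vectors are invented for illustration. Checking that a legal schedule exists for the retimed graph is a separate step that is not shown.

```python
def retime(edges, r):
    """Apply retiming r to a DFG with vector-valued edge delays.

    edges maps (u, v) to the delay vector on edge u -> v;
    r maps each node to its retiming vector.
    """
    return {
        (u, v): tuple(d + rv - ru for d, rv, ru in zip(delay, r[v], r[u]))
        for (u, v), delay in edges.items()
    }


def fully_parallel(edges):
    """True when every edge carries a nonzero delay, so no node waits on
    another node within the same iteration of the loop body."""
    return all(any(c != 0 for c in delay) for delay in edges.values())


# Hypothetical two-dimensional DFG with one zero-delay edge.
edges = {("A", "B"): (0, 0), ("B", "A"): (1, 1)}
r = {"A": (0, 0), "B": (0, 1)}
retimed = retime(edges, r)
```

In this example the zero-delay edge A -> B becomes (0, 1) and B -> A becomes (1, 0), while the total delay around the cycle stays (1, 1), as retiming requires.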
TimeC: A Time Constraint Language for ILP Processor Compilation
, 1998
Cited by 7 (0 self)
Enabled by RISC technologies, low-cost commodity microprocessors are performing at ever increasing levels, significantly via instruction level parallelism (ILP). This in turn increases the opportunities for their use in a variety of day-to-day applications ranging from the simple control of appliances such as microwave ovens, to sophisticated systems for cabin control in modern aircraft. Indeed, "embedded" applications such as these represent segments in the computer industry with great potential for growth. However, this growth is currently impeded by the lack of robust optimizing compiler technologies that support the assured, rapid and inexpensive prototyping of real-time software in the context of microprocessors with ILP. In this paper we describe a novel notation, TimeC, for specifying timing constraints in programs, independent of the base language being used to develop the embedded application; TimeC specifications are language independent and can be instrumented into imperat...

