Results 1 - 10
of
184
Meta Optimization: Improving Compiler Heuristics with Machine Learning
- In Proceedings of the ACM SIGPLAN ’03 Conference on Programming Language Design and Implementation
, 2002
"... Compiler writers have crafted many heuristics over the years to (approximately) solve NP-hard problems efficiently. Finding a heuristic that performs well on a broad range of applications is a tedious and difficult process. This paper introduces Meta Optimization, a methodology for automatically fin ..."
Abstract
-
Cited by 63 (4 self)
- Add to MetaCart
Compiler writers have crafted many heuristics over the years to (approximately) solve NP-hard problems efficiently. Finding a heuristic that performs well on a broad range of applications is a tedious and difficult process. This paper introduces Meta Optimization, a methodology for automatically fine-tuning compiler heuristics. Meta Optimization uses machine-learning techniques to automatically search an optimization's solution space. We implemented Meta Optimization on top of Trimaran [20] to test its efficacy. By `evolving' Trimaran's hyperblock selection optimization for a particular benchmark, our system achieves impressive speedups. Application-specific heuristics obtain an average speedup of 23% (up to 43%) for the applications in our suite. Furthermore, by evolving a compiler's heuristic over several benchmarks, we can create effective, general-purpose compilers. The best general-purpose heuristic our system found improved Trimaran's hyperblock selection algorithm by an average of 25% on our training set, and 9% on a completely unrelated test set. We further test the applicability of our system on Trimaran's priority-based coloring register allocator. For this well-studied optimization we were able to specialize the compiler for individual applications, achieving an average speedup of 6%.
Stage Scheduling: A Technique to Reduce the Register Requirements of a Modulo Schedule
- IN PROC. OF THE 28TH ANNUAL INT. SYMP. ON MICROARCHITECTURE (MICRO-28
, 1995
"... Modulo scheduling is an efficient technique for exploiting instruction level parallelism in a variety of loops, resulting in high performance code but increased register requirements. We present a set of low computational complexity stage-scheduling heuristics that reduce the register requirements o ..."
Abstract
-
Cited by 57 (5 self)
- Add to MetaCart
Modulo scheduling is an efficient technique for exploiting instruction level parallelism in a variety of loops, resulting in high performance code but increased register requirements. We present a set of low computational complexity stage-scheduling heuristics that reduce the register requirements of a given modulo schedule by shifting operations by multiples of II cycles. Measurements on a benchmark suite of 1289 loops from the Perfect Club, SPEC-89, and the Livermore Fortran Kernels shows that our best heuristic achieves on average 99% of the decrease in register requirements obtained by an optimal stage scheduler.
Effective Cluster Assignment for Modulo Scheduling
- IN PROCEEDINGS OF THE 31 INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE (MICRO-31
, 1998
"... Clustering is one solution to the demand for wideissue machines and fast clock cycles because it allows for smaller, less ported register files and simpler bypass logic while remaining scaleable. Much of the previous work on scheduling for clustered architectures has focused on acyclic code. While m ..."
Abstract
-
Cited by 53 (0 self)
- Add to MetaCart
Clustering is one solution to the demand for wideissue machines and fast clock cycles because it allows for smaller, less ported register files and simpler bypass logic while remaining scaleable. Much of the previous work on scheduling for clustered architectures has focused on acyclic code. While minimizing schedule length of acyclic code is paramount, the primary objective when scheduling cyclic code is to maximize the throughput or steady state performance. This paper investigates a pre-modulo scheduling pass that performs cluster assignment in a way that minimizes performance degradation do to explicit communication required as the loops are split over clusters. The proposed cluster assignment algorithm annotates and adjusts the graph for use by the scheduler so that any traditional modulo scheduling algorithm, having no knowledge of clustering, can produce a valid and efficient schedule for a clustered machine.
Hypernode Reduction Modulo Scheduling
- IN PROC. OF THE 28TH ANNUAL INT. SYMP. ON MICROARCHITECTURE (MICRO28
, 1995
"... Software Pipelining is a loop scheduling technique that extracts parallelism from loops by overlapping the execution of several consecutive iterations. Most prior scheduling research has focused on achieving minimum execution time, without regarding register requirements. Most strategies tend to str ..."
Abstract
-
Cited by 53 (22 self)
- Add to MetaCart
Software Pipelining is a loop scheduling technique that extracts parallelism from loops by overlapping the execution of several consecutive iterations. Most prior scheduling research has focused on achieving minimum execution time, without regarding register requirements. Most strategies tend to stretch operand lifetimes because they schedule some operations too early or too late. The paper presents a novel strategy that simultaneously schedules some operations late and other operations early, minimizing all the stretchable dependencies and therefore reducing the registers required by the loop. The key of this strategy is a pre-ordering phase that selects the order in which the operations will be scheduled. The results show that the method described in this paper performs better than other heuristic methods and almost as well as a linear programming method but requiring much less time to produce the schedules.
Out-of-Order Vector Architectures
, 1997
"... Register renaming and out-of-order instruction issue are now commonly used in superscalar processors. These techniques can also be used to significant advantage in vector processors, as this paper shows. Performance is improved and available memory bandwidth is used more effectively. Using a trace d ..."
Abstract
-
Cited by 46 (21 self)
- Add to MetaCart
Register renaming and out-of-order instruction issue are now commonly used in superscalar processors. These techniques can also be used to significant advantage in vector processors, as this paper shows. Performance is improved and available memory bandwidth is used more effectively. Using a trace driven simulation we compare a conventional vector implementation, based on the Convex C3400, with an out-of-order, register renaming, vector implementation. When the number of physical registers is above 12, out-of-order execution coupled with register renaming provides a speedup of 1.24--1.72 for realistic memory latencies. Out-of-order techniques also tolerate main memory latencies of 100 cycles with a performance degradation less than 6%. The mechanisms used for register renaming and out-of-order issue can be used to support precise interrupts -- generally a difficult problem in vector machines. When precise interrupts are implemented, there is typically less than a 10% degradation in performance. A new technique based on register renaming is targeted at dynamically eliminating spill code; this technique is shown to provide an extra speedup ranging between 1.10 and 1.20 while reducing total memory traffic by an average of 15--20%.
Two-level Hierarchical Register File Organization for VLIW Processors
- In International Symposium on Microarchitecture
, 2000
"... High-performance microprocessors are currently designed to exploit the inherent instruction level parallelism (ILP) available in most applications. The techniques used in their design and the aggressive scheduling techniques used to exploit this ILP tend to increase the register requirements of the ..."
Abstract
-
Cited by 36 (2 self)
- Add to MetaCart
High-performance microprocessors are currently designed to exploit the inherent instruction level parallelism (ILP) available in most applications. The techniques used in their design and the aggressive scheduling techniques used to exploit this ILP tend to increase the register requirements of the loops. If more registers than those available in the architecture are required, some actions (such as spill code insertion) have to be applied to reduce this pressure, at the expense of some performance degradation. This degradation could be avoided if a high--capacity register file were included without causing a negative impact on the cycle time of the processor. In this paper we propose a two-level hierarchical register file organization for VLIW architectures that combines high capacity and low access time. For the configuration proposed in this paper, the new organization achieves a speed--up of 10--14% over a monolithic organization with 64 registers; it is obtained with a 43% (40%) reduction in area (peak power dissipation). Compared to a monolithic file with 32 registers, the speed--up is as much as 38% with just a 14% (4%) increase in area (peak power dissipation). 1.
Software Pipelining
, 1995
"... Utilizing parallelism at the instruction level is an important way to improve performance. Since the time spent in loop execution dominates total execution time, a large body of optimizations focus on decreasing the time to execute each iteration. Software pipelining is a technique that reforms t ..."
Abstract
-
Cited by 35 (1 self)
- Add to MetaCart
Utilizing parallelism at the instruction level is an important way to improve performance. Since the time spent in loop execution dominates total execution time, a large body of optimizations focus on decreasing the time to execute each iteration. Software pipelining is a technique that reforms the loop so that a faster execution rate is realized. Iterations are executed in overlapped fashion to increase parallelism. 1 Let --ABC n represent a loop containing operations A, B, C that is executed n times. Although the operations of a single iteration can be parallelized, more parallelism may be achievable if the entire loop is considered rather than a single iteration. The software pipelining transformation utilizes the fact that a loop --ABC n is equivalent to A--BCA n\Gamma1 BC. Although the operations contained in the loop do not change, the operations are from different iterations of the original loop. Various algorithms for software pipelining exist. A comparison of ...
Unrolling-Based Optimizations for Modulo Scheduling
- in Proceedings of the 28th International Symposium on Microarchitecture
, 1995
"... Modulo scheduling is a method for overlapping successive iterations of a loop in order to find sufficient instruction-level parallelism to fully utilize high-issue-rate processors. The achieved throughput modulo scheduled loop depends on the resource requirements, the dependence pattern, and the reg ..."
Abstract
-
Cited by 33 (3 self)
- Add to MetaCart
Modulo scheduling is a method for overlapping successive iterations of a loop in order to find sufficient instruction-level parallelism to fully utilize high-issue-rate processors. The achieved throughput modulo scheduled loop depends on the resource requirements, the dependence pattern, and the register requirements of the computation in the loop body. Traditionally, unrolling followed by acyclic scheduling of the unrolled body has been an alternative to modulo scheduling. However, there are benefits to unrolling even if the loop is to be modulo scheduled. Unrolling can improve the throughput by allowing a smaller non-integral effective initiation interval to be achieved. After unrolling, optimizations can be applied to the loop that reduce both the resource requirements and the height of the critical paths. Together, unrolling and unrolling-based optimizations can enable the completion of multiple iterations per cycle in some cases. This paper describes the benefits of unrolling and ...
Orchestrating the execution of stream programs on multicore platforms
- In Proc. of the SIGPLAN ’08 Conference on Programming Language Design and Implementation
, 2008
"... While multicore hardware has become ubiquitous, explicitly parallel programming models and compiler techniques for exploiting parallelism on these systems have noticeably lagged behind. Stream programming is one model that has wide applicability in the multimedia, graphics, and signal processing dom ..."
Abstract
-
Cited by 33 (8 self)
- Add to MetaCart
While multicore hardware has become ubiquitous, explicitly parallel programming models and compiler techniques for exploiting parallelism on these systems have noticeably lagged behind. Stream programming is one model that has wide applicability in the multimedia, graphics, and signal processing domains. Streaming models execute as a set of independent actors that explicitly communicate data through channels. This paper presents a compiler technique for planning and orchestrating the execution of streaming applications on multicore platforms. An integrated unfolding and partitioning step based on integer linear programming is presented that unfolds data parallel actors as needed and maximally packs actors onto cores. Next, the actors are assigned to pipeline stages in such a way that all communication is maximally overlapped with computation on the cores. To facilitate experimentation, a generalized code generation template for mapping the software pipeline onto the Cell architecture is presented. For a range of streaming applications, a geometric mean speedup of 14.7x is achieved on a 16-core Cell platform compared to a single core.

