Results 11 – 20 of 29
Effective Instruction Scheduling with Limited Registers
, 2001
Abstract

Cited by 3 (0 self)
Effective global instruction scheduling techniques have become an important component in modern compilers for exposing more instruction-level parallelism (ILP) and exploiting the ever-increasing number of parallel function units. Effective register allocation has long been an essential component of a good compiler for reducing memory references. While instruction scheduling and register allocation are both essential compiler optimizations for fully exploiting the capability of modern high-performance microprocessors, there is a phase-ordering problem when we perform these two optimizations separately: instruction scheduling before register allocation may create insatiable demands for registers; register allocation before instruction scheduling may reduce the amount of parallelism that instruction scheduling can exploit. In this thesis, we propose to solve this phase-ordering problem by inserting a moderating optimization called code reorganization between prepass instruction scheduling and register allocation. Code reorganization adjusts the prepass scheduling results to make them demand fewer registers (i.e. exhibit lower register pressure) and guides register allocation to insert spill code that has less impact on schedule length. Our new approach avoids the complexity of simultaneous instruction scheduling and register allocation algorithms. In fact, it does not modify either the instruction scheduling or the register allocation algorithm. Therefore instruction scheduling can focus on maximizing instruction-level parallelism, and register allocation can focus on minimizing the cost of spill code. We compare the performance of our approach with a particular successful register-pressure-sensitive scheduling algorithm, and show an average of 18% improvement in speedup for an 8...
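The phase-ordering tension in this abstract hinges on a schedule's register pressure: the peak number of simultaneously live values. A minimal sketch of measuring it, assuming a simplified model in which each instruction defines exactly one value and uses previously defined ones (this is an illustration, not the thesis's code-reorganization algorithm):

```python
def register_pressure(schedule):
    """schedule: list of (dest, uses) pairs in issue order.
    A value is live from its definition to its last use; returns the
    peak number of simultaneously live values."""
    defs = {dest: i for i, (dest, _) in enumerate(schedule)}
    last = dict(defs)                 # a value with no uses dies at its def
    for i, (_, uses) in enumerate(schedule):
        for u in uses:
            last[u] = i               # uses follow defs in a valid schedule
    return max(sum(1 for v in defs if defs[v] <= i <= last[v])
               for i in range(len(schedule)))

# An ILP-greedy order can demand more registers than a reordered one:
wide   = [("a", []), ("b", []), ("c", []), ("d", ["a", "b"]), ("e", ["c", "d"])]
narrow = [("a", []), ("b", []), ("d", ["a", "b"]), ("c", []), ("e", ["c", "d"])]
# register_pressure(wide) == 4, register_pressure(narrow) == 3
```

Code reorganization, as described above, aims exactly at this kind of transformation: moving from `wide` toward `narrow` without changing the instruction scheduler or the register allocator themselves.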
Binary translation to improve energy efficiency through postpass register reallocation
 in Proceedings of the Fourth ACM International Conference on Embedded Software
, 2004
Abstract

Cited by 3 (0 self)
Energy efficiency is rapidly becoming a first-class optimization parameter for modern systems. Caches are critical to overall performance, and thus modern processors (both high- and low-end) tend to deploy caches with large size and a high degree of associativity. Because of the large cache size, cache power takes up a significant percentage of total system power. One important way to reduce cache power consumption is to reduce dynamic activity in the cache by reducing the dynamic load/store counts. In this work, we focus on programs that are available only as binaries and need to be improved for energy efficiency. To adapt these programs for energy-constrained devices, we propose a feedback-directed postpass solution that performs register reallocation to reduce dynamic load/store counts and improve energy efficiency. Our approach assumes no knowledge of the original code generator or compiler and performs a postpass register allocation to obtain a more power-efficient binary. We find the dead as well as unused registers in the binary and then reallocate them on hot paths to reduce dynamic load/store counts. We show that the static code size increase due to our framework is minimal. Our experiments on SPEC2000 and MediaBench show that our technique is effective: we have seen dynamic spill load/store reductions in the data cache ranging from 0% to 26.4%. Overall, our approach improves the energy-delay product of the program.
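A toy illustration of the first step this abstract describes: scanning decoded instructions for registers the binary never touches, which are then candidates for postpass reallocation of spill slots. The instruction encoding below is hypothetical, and a real tool must additionally handle liveness, calling conventions, and indirect control flow:

```python
def unused_registers(instrs, all_regs):
    """instrs: list of (reads, writes) tuples of register names decoded
    from the binary. Registers never read or written anywhere are free
    for reallocation on hot paths."""
    used = set()
    for reads, writes in instrs:
        used |= set(reads) | set(writes)
    return sorted(set(all_regs) - used)

prog = [((), ("r0",)),           # r0 = constant
        (("r0",), ("r1",)),      # r1 = f(r0)
        (("r1",), ())]           # store r1
# unused_registers(prog, ["r0", "r1", "r2", "r3"]) == ["r2", "r3"]
```

The paper also recovers *dead* registers (written but never subsequently read on a path), which requires per-path dataflow analysis rather than this whole-program scan.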
Global Reduction of Spill Code by Live Range Splitting
, 1998
Abstract

Cited by 2 (0 self)
Current state-of-the-art compilers use Briggs' graph coloring heuristic for register allocation. This heuristic provides an efficient mapping of program variables to machine registers. However, if a variable cannot be assigned a register, the variable is spilled and referenced through memory: each use of the variable is preceded by a load from memory, and each definition is followed by a store to memory. The algorithm presented in this thesis is a method to reduce the amount of spill code added by a Briggs allocator. Graph coloring maps the live range of a variable to a machine register. If no machine register is available for a live range, the variable is spilled. There are often areas in the live range where spill code is not needed. Our algorithm identifies these areas, known as low register pressure regions. Once low register pressure regions are found for a spilled live range, it is possible to limit the spill code added to the low register pressure region. This is known...
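For concreteness, a compact sketch of the simplify/select coloring that a Briggs-style allocator builds on, with optimistic "potential spill" handling. Choosing the spill candidate by highest degree is a simplification here; real allocators weigh spill costs:

```python
def color(graph, k):
    """graph: {node: set(neighbors)} interference graph; k registers.
    Returns (coloring, spilled): simplify by repeatedly removing a node
    of degree < k; if none exists, optimistically push a spill candidate
    and hope it still receives a color during select."""
    g = {n: set(nb) for n, nb in graph.items()}
    stack = []
    while g:
        n = next((m for m in g if len(g[m]) < k), None)
        if n is None:                        # potential spill candidate
            n = max(g, key=lambda m: len(g[m]))
        stack.append((n, g.pop(n)))          # remember remaining neighbors
        for nb in g.values():
            nb.discard(n)
    coloring, spilled = {}, []
    for n, neighbors in reversed(stack):     # select phase
        taken = {coloring[m] for m in neighbors if m in coloring}
        free = [c for c in range(k) if c not in taken]
        if free:
            coloring[n] = free[0]
        else:
            spilled.append(n)                # actual spill
    return coloring, spilled
```

In the thesis's terms, the live-range splitting happens precisely for the nodes that land in `spilled`, and its contribution is to confine the resulting loads and stores to low-pressure regions rather than the whole live range.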
An analysis of graph coloring register allocation
, 2006
Abstract

Cited by 2 (1 self)
Graph coloring is the de facto standard technique for register allocation within a compiler. In this paper we examine the importance of the quality of the coloring algorithm and of various extensions of the basic graph coloring technique by replacing the coloring phase of the GNU compiler's register allocator with an optimal coloring algorithm. We then extend this optimal algorithm to incorporate extensions such as coalescing and preferential register assignment. We find that using an optimal coloring algorithm yields surprisingly little benefit, and we empirically demonstrate the value of the various extensions.
Live-Range Unsplitting for Faster Optimal Coalescing
Abstract

Cited by 2 (0 self)
Register allocation is often a two-phase approach: spilling of registers to memory, followed by coalescing of registers. Extreme live-range splitting (i.e. live-range splitting after each statement) enables optimal solutions based on ILP for both spilling and coalescing. However, while solutions are easily found for spilling, they are more elusive for coalescing. This difficulty stems from the huge size of the interference graphs resulting from live-range splitting. This report focuses on optimal coalescing in the context of extreme live-range splitting. We present theoretical properties that give rise to an algorithm for reducing interference graphs while preserving optimality. The reduction consists mainly of finding and removing useless splitting points, followed by a graph decomposition based on clique separators, and finally two preprocessing rules. Any coalescing technique can be applied after these optimizations. Our optimizations have been tested on a standard benchmark, the optimal coalescing challenge. On this benchmark, the cutting-plane algorithm for optimal coalescing (the only optimal algorithm for coalescing) runs 300 times faster when combined with our optimizations. Moreover, we provide all the solutions of the optimal coalescing challenge, including the 3 instances that were previously unsolved.
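The clique-separator decomposition mentioned above can be made concrete with a small sketch. Assuming a candidate separator is already known (the report's contribution includes finding them, which this sketch does not attempt), the check-and-split step is:

```python
def split_on_clique_separator(graph, sep):
    """graph: {node: set(neighbors)} interference graph. If sep is a
    clique whose removal disconnects the graph, return the components
    with sep glued back into each; coalescing can then be solved
    independently per component. Returns None if sep does not qualify."""
    sep = set(sep)
    if any(b not in graph[a] for a in sep for b in sep if a != b):
        return None                          # sep is not a clique
    rest = {n: graph[n] - sep for n in graph if n not in sep}
    comps, seen = [], set()
    for start in rest:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:                         # flood-fill one component
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(rest[n] - comp)
        comps.append(comp)
        seen |= comp
    if len(comps) < 2:
        return None                          # sep does not separate
    return [comp | sep for comp in comps]

# Two triangles sharing the edge {b, c}: {b, c} is a clique separator.
g = {"a": {"b", "c"}, "b": {"a", "c", "d"},
     "c": {"a", "b", "d"}, "d": {"b", "c"}}
```

Because the separator is a clique, any valid colorings of the two components agree on a consistent assignment for the separator nodes, which is what makes solving the pieces independently sound.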
Register Allocation Deconstructed
, 2009
Abstract

Cited by 2 (0 self)
Register allocation is a fundamental part of any optimizing compiler. Effectively managing the limited register resources of the constrained architectures commonly found in embedded systems is essential in order to maximize code quality. In this paper we deconstruct the register allocation problem into distinct components: coalescing, spilling, move insertion, and assignment. Using an optimal register allocation framework, we empirically evaluate the importance of each of the components, the impact of component integration, and the effectiveness of existing heuristics. We evaluate code quality both in terms of code performance and code size, and consider four distinct instruction set architectures: ARM, Thumb, x86, and x86-64. The results of our investigation reveal general principles for register allocation design.
Dual-Issue Scheduling for Binary Trees with Spills and Pipelined Loads
, 2001
Abstract

Cited by 2 (0 self)
We describe an algorithm that finds a minimum-cost schedule, including spill code, for a register-constrained machine that can issue up to one arithmetic operation and one memory access operation at a time, under the restrictions that the dependence graph is a full binary tree, all arithmetic and store operations have unit latency, and all load operations have a latency of 1 or all load operations have a latency of 2. This problem is a generalization of two problems whose efficient solutions are well understood: optimal dual-issue scheduling without spills for binary expression trees, solved by Bernstein, Jaffe, and Rodeh [SIAM J. Comput., 18 (1989), pp. 1098-1127], and optimal single-issue scheduling with spill code and delayed loads, solved by Kurlander, Proebsting, and Fischer [ACM Transactions on Programming Languages and Systems, 17 (1995), pp. 740-776], both assuming a fixed number of registers. We show that the algorithm's complexity is O(nk), where n is the number of operations to be scheduled and k is the number of spills in the schedule. The cost of a "contiguous" schedule (i.e., its length) is shown to be R + 2k + g + A, where R is the number of registers used, A is the number of arithmetic operations, k is the number of spills, and g is the number of empty slots in the associated single-processor schedule. Therefore all contiguous schedules formed from optimal single-processor schedules have minimum cost.
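The closed-form cost transcribes directly into code (the symbol R below stands in for the paper's register-count symbol, which is garbled in this extraction):

```python
def contiguous_cost(R, k, g, A):
    """Length of a 'contiguous' schedule: R registers used, k spills,
    g empty slots in the associated single-processor schedule, A
    arithmetic operations. Each spill contributes both a store and a
    later reload, hence the factor of 2 on k."""
    return R + 2 * k + g + A

# e.g. 4 registers, 2 spills, 1 empty slot, 10 arithmetic operations:
# contiguous_cost(4, 2, 1, 10) == 19
```

Since R, g, and A are fixed by the tree and the single-processor schedule, the formula makes explicit that minimizing cost among contiguous schedules reduces to minimizing the spill count k.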
Scratchpad Memory Allocation for Data Aggregates via Interval Coloring in Superperfect Graphs
Abstract

Cited by 2 (2 self)
Existing methods place data or code in scratchpad memory (SPM) by relying on heuristics, by resorting to integer programming, or by mapping the problem to graph coloring. In this paper, the SPM allocation problem for arrays is formulated as an interval coloring problem. The key observation is that in many embedded C programs, two arrays can be modeled such that either their live ranges do not interfere or one contains the other (with good accuracy). As a result, array interference graphs often form a special class of superperfect graphs (known as comparability graphs), and their optimal interval colorings become efficiently solvable. This insight has led to the development of an SPM allocation algorithm that places the arrays of an interference graph in SPM by examining its maximal cliques. If the SPM is no smaller than the clique number of an interference graph, then all arrays in the graph can be placed in SPM optimally. Otherwise, we rely on containment-motivated heuristics to split or spill array live ranges until the resulting graph is optimally colorable. We have implemented our algorithm in SUIF/machSUIF and evaluated it using a set of embedded C benchmarks from MediaBench and MiBench. Compared to a graph coloring algorithm and an optimal ILP algorithm (when it runs to completion), our algorithm achieves close-to-optimal results and is superior to graph coloring for the benchmarks tested.
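The interval coloring objective can be made concrete: each array gets a contiguous interval of SPM addresses, and arrays whose live ranges do not overlap may share addresses. The sketch below is a generic first-fit placement, not the paper's clique-based algorithm, offered only to illustrate the problem shape:

```python
def spm_assign(arrays, spm_size):
    """arrays: list of (name, live_start, live_end, size).
    First-fit placement: each array receives a contiguous address
    interval disjoint from every array whose live range overlaps its
    own. Returns {name: offset}, or None if capacity is exceeded
    (the paper would then split or spill live ranges)."""
    placed = []                              # (start, end, offset, size)
    offsets = {}
    for name, start, end, size in sorted(arrays, key=lambda a: a[1]):
        busy = sorted((off, off + sz) for s, e, off, sz in placed
                      if s <= end and start <= e)   # overlapping live ranges
        offset = 0
        for lo, hi in busy:                  # scan for the first free gap
            if offset + size <= lo:
                break
            offset = max(offset, hi)
        if offset + size > spm_size:
            return None
        placed.append((start, end, offset, size))
        offsets[name] = offset
    return offsets

arrays = [("A", 0, 5, 4), ("B", 2, 8, 4), ("C", 6, 9, 4)]
# spm_assign(arrays, 8) == {"A": 0, "B": 4, "C": 0}  (C reuses A's space)
```

The paper's clique-number test corresponds to checking whether the peak total size of simultaneously live arrays fits in the SPM; on comparability graphs that bound is achievable, which is what makes the optimal formulation tractable there.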
Register Saturation in Superscalar and VLIW
 In Proceedings of The International Conference on Compiler Construction, Lecture Notes in Computer Science
, 2001
Abstract

Cited by 1 (0 self)
Register constraints can be taken into account during the scheduling phase of an acyclic data dependence graph (DAG): any schedule must minimize the register requirement. In this work, we mathematically study and extend the approach that consists of computing the exact upper bound of the register need over all valid schedules, independently of the functional unit constraints. A previous work (URSA) was presented in [5, 4]. Its aim was to add serial arcs to the original DAG such that the worst-case register need does not exceed the number of available registers. We develop an appropriate mathematical formalism for this problem and extend the DAG model to take into account delayed reads from and writes into registers, with multiple register types.
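Register saturation, the exact upper bound of register need over all valid schedules, can be made concrete by brute force on a tiny DAG. This enumeration is exponential and is an illustration only; the URSA-style approach cited above constrains the bound by adding serial arcs rather than enumerating schedules:

```python
from itertools import permutations

def register_saturation(dag):
    """dag: {node: set(direct predecessors)}. Each node writes one
    value, read by its successors. Returns the maximum register need
    over all topological orders of the DAG."""
    nodes = list(dag)
    consumers = {n: [c for c in dag if n in dag[c]] for n in nodes}
    worst = 0
    for order in permutations(nodes):
        pos = {n: i for i, n in enumerate(order)}
        if any(pos[p] > pos[n] for n in dag for p in dag[n]):
            continue                         # violates a dependence
        last = {n: max([pos[c] for c in consumers[n]] or [pos[n]])
                for n in nodes}
        need = max(sum(1 for n in nodes if pos[n] <= i <= last[n])
                   for i in range(len(nodes)))
        worst = max(worst, need)
    return worst

diamond = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
# register_saturation(diamond) == 3
```

Adding a serial arc b -> c to the diamond would rule out no extra parallelism here but in general prunes exactly the high-pressure schedules, which is how URSA keeps the saturation below the available register count.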
Reducing the Impact of Spill Code
Abstract

Cited by 1 (0 self)
This memory would be as fast as cache memory, but it would be under the control of the compiler rather than the hardware. We use the results from the memory allocation study to show that this memory space could be quite small, and we present an algorithm that the compiler could employ to utilize this space. We also present experimental results suggesting that this method would have a significant impact on a program's runtime.