Results 1–10 of 19
Optimal Spilling for CISC Machines with Few Registers
 In Proceedings of the ACM SIGPLAN 2001 Conference on Programming Language Design and Implementation
, 2001
Abstract

Cited by 80 (1 self)
Register allocation based on graph coloring performs poorly for machines with few registers if each temporary is held either in machine registers or in memory over its entire lifetime. With the exception of short-lived temporaries, most temporaries must spill, including long-lived temporaries that are used within inner loops. Live-range splitting before or during register allocation helps to alleviate the problem, but prior techniques are sometimes complex, make no guarantees about subsequent colorability (and thus require further iterations of splitting), pay no attention to addressing modes, and make no claim to optimality. We formulate the register allocation problem for CISC architectures with few registers in two parts: an integer linear program that determines the optimal locations to break up the implementation of a live range between registers and memory, and a register assignment phase that we guarantee to complete without further spill-code insertion. Our linear programming model ...
An experimental comparison of cache-oblivious and cache-conscious programs
 In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures. 93–104
, 2007
Abstract

Cited by 19 (1 self)
Cache-oblivious algorithms have been advanced as a way of circumventing some of the difficulties of optimizing applications to take advantage of the memory hierarchy of modern microprocessors. These algorithms are based on the divide-and-conquer paradigm: each division step creates subproblems of smaller size, and when the working set of a subproblem fits in some level of the memory hierarchy, the computations in that subproblem can be executed without suffering capacity misses at that level. In this way, divide-and-conquer algorithms adapt automatically to all levels of the memory hierarchy; in fact, for problems like matrix multiplication, matrix transpose, and FFT, these recursive algorithms are optimal to within constant factors for some theoretical models of the memory hierarchy. An important question is the following: how well do carefully tuned cache-oblivious programs perform compared to carefully tuned cache-conscious programs for the same problem? Is there a price for obliviousness, and if so, how much performance do we lose? Somewhat surprisingly, there are few studies in the literature that have addressed this question. This paper reports the results of such a study in the domain of dense linear algebra. Our main finding is that in this domain, even highly optimized cache-oblivious programs perform significantly worse than corresponding cache-conscious programs. We provide insights into why this is so, and suggest research directions for making cache-oblivious algorithms more competitive.
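The recursive blocking the abstract describes can be sketched in a few lines. This is an illustrative divide-and-conquer matrix multiply (assuming square matrices whose size is a power of two), not the tuned kernels the paper benchmarks; once a subproblem's three blocks fit in some cache level, the base-case loops run without capacity misses at that level.

```python
def matmul_rec(A, B, C, ri, ci, ki, n, base=2):
    """C[ri:ri+n, ci:ci+n] += A[ri:ri+n, ki:ki+n] @ B[ki:ki+n, ci:ci+n].

    Matrices are lists of lists; `base` is the cutoff below which a
    plain triple loop runs (a real implementation tunes or omits it).
    """
    if n <= base:
        for i in range(ri, ri + n):
            for k in range(ki, ki + n):
                a = A[i][k]
                for j in range(ci, ci + n):
                    C[i][j] += a * B[k][j]
        return
    h = n // 2
    # Split every dimension in half: eight subproblems of size n/2.
    for dr in (0, h):
        for dc in (0, h):
            for dk in (0, h):
                matmul_rec(A, B, C, ri + dr, ci + dc, ki + dk, h, base)
```

Because the recursion never consults a cache size, the same code adapts to every level of the hierarchy at once, which is precisely the obliviousness whose price the paper measures.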
Data-dependency graph transformations for instruction scheduling
 Journal of Scheduling
, 2006
Abstract

Cited by 12 (1 self)
This paper presents a set of efficient graph transformations for local instruction scheduling. These transformations to the data-dependency graph prune redundant and inferior schedules from the solution space of the problem. Optimally scheduling the transformed problems using an enumerative scheduler is faster, and the number of problems solved to optimality within a bounded time is increased. Furthermore, heuristic scheduling of the transformed problems often yields improved schedules for hard problems. The basic node-based transformation runs in O(ne) time, where n is the number of nodes and e is the number of edges in the graph. A generalized subgraph-based transformation runs in O(n^2 e) time. The transformations are implemented within the GNU Compiler Collection (GCC) and are evaluated experimentally using the SPEC CPU2000 floating-point benchmarks targeted to various processor models. The results show that the transformations are fast and improve the results of both heuristic and optimal scheduling. KEY WORDS: instruction scheduling, graph transformation, optimal scheduling, compiler scheduling
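The flavor of such a pruning transformation can be illustrated with a deliberately simplified node-based test (the paper's actual conditions handle general latencies and include a subgraph-based generalization): when two nodes are independent and one can always safely go first, adding an explicit edge removes the mirror-image schedules from the search space.

```python
def reachable(succ):
    """Map each node to the set of nodes reachable from it in the DDG."""
    reach = {}
    for s in succ:
        seen, stack = set(), [s]
        while stack:
            for v in succ[stack.pop()]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        reach[s] = seen
    return reach

def add_pruning_edges(succ):
    """One pass of a simplified node-based pruning transformation.

    succ maps each instruction to its immediate successors. For two
    independent nodes u, v: if u's predecessors are a subset of v's and
    v's successors a subset of u's, any schedule placing v first can be
    mirrored with u first, so adding the edge u -> v prunes the mirror
    image. Assumes unit latencies and interchangeable instructions;
    this is an illustration, not the paper's transformation.
    """
    added = []
    for u in list(succ):
        for v in list(succ):
            reach = reachable(succ)   # recomputed per pair; fine for a sketch
            if u == v or v in reach[u] or u in reach[v]:
                continue              # same node, or already ordered
            pred = {p: {q for q in succ if p in succ[q]} for p in succ}
            if pred[u] <= pred[v] and succ[v] <= succ[u]:
                succ[u].add(v)        # force u before v
                added.append((u, v))
    return added
```

Each added edge leaves at least one optimal schedule intact while cutting the enumerative scheduler's branching, which is why the transformed problems solve faster.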
Minimum register instruction sequencing to reduce register spills in out-of-order issue superscalar architectures
 IEEE Trans. Comput.
Minimum Register Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs
 Lab., University of Delaware
, 2001
Abstract

Cited by 5 (1 self)
We revisit the optimal code generation, or evaluation order determination, problem: the problem of generating an instruction sequence from a data dependence graph (DDG). In particular, we are interested in generating an instruction sequence S that is optimal in terms of the number of registers used by the sequence S. We call this the MRIS (Minimum Register Instruction Sequence) problem. We ...
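The objective can be made concrete with a small greedy sequencer. This is a generic register-pressure heuristic used as a stand-in for MRIS (the paper's own approach is based on instruction lineages, which this sketch does not model): among ready instructions, prefer the one that frees the most registers.

```python
def schedule_min_regs(succ):
    """Greedy sequencing of a DDG that prefers register-freeing picks.

    succ maps each instruction to the consumers of its result; a value
    occupies a register from its definition until its last use.
    Returns (order, peak_register_pressure).
    """
    pred = {u: {p for p in succ if u in succ[p]} for u in succ}
    uses_left = {u: len(succ[u]) for u in succ}
    preds_left = {u: len(pred[u]) for u in succ}
    ready = {u for u in succ if not pred[u]}
    live, order, peak = set(), [], 0

    def freed(u):
        # operands whose last use this is, minus the result it defines
        return sum(1 for p in pred[u] if uses_left[p] == 1) - bool(succ[u])

    while ready:
        u = max(ready, key=lambda x: (freed(x), x))  # tie-break by name
        ready.remove(u)
        for p in pred[u]:
            uses_left[p] -= 1
            if uses_left[p] == 0:
                live.discard(p)      # last use: release the register
        if succ[u]:
            live.add(u)              # result needs a register
        peak = max(peak, len(live))
        order.append(u)
        for v in succ[u]:
            preds_left[v] -= 1
            if preds_left[v] == 0:
                ready.add(v)
    return order, peak
```

A greedy choice like this gives no optimality guarantee; the point of MRIS is to characterize and approach the true minimum over all legal sequences.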
Integrated Prepass Scheduling for a Java Just-in-Time Compiler on the IA-64 Architecture
 In Proceedings of the International Symposium on Code Generation and Optimization
, 2003
Abstract

Cited by 4 (0 self)
We present a new integrated prepass scheduling (IPS) algorithm for a Java Just-In-Time (JIT) compiler, which integrates register minimization into list scheduling. We use backtracking in the list scheduling when we have used up all the available registers. To reduce the overhead of backtracking, we incrementally maintain a set of candidate instructions for undoing scheduling. To maximize the ILP after undoing scheduling, we select an instruction chain with the smallest increase in the total execution time. We implemented our new algorithm in a production-level Java JIT compiler for the Intel Itanium processor. The experiment showed that, compared to the best known algorithm by Govindarajan et al., our IPS algorithm improved the performance by up to 1.8% while it reduced the compilation time for IPS by 58% on average.
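The core idea of backtracking when registers run out reduces, in its simplest form, to a search for any legal order whose pressure stays within the register file. The sketch below is an illustrative reduction, not the paper's IPS: the paper additionally maintains candidate sets to keep backtracking cheap and chooses what to undo so as to preserve ILP.

```python
def schedule_with_limit(succ, limit):
    """Backtracking list scheduler: find a topological order of the DDG
    whose register pressure never exceeds `limit`, or return None.

    succ maps each instruction to the consumers of its result; a value
    holds a register from definition to last use.
    """
    pred = {u: {p for p in succ if u in succ[p]} for u in succ}
    n = len(succ)

    def search(order, done, uses_left, live):
        if len(order) == n:
            return list(order)
        for u in succ:
            if u in done or not pred[u] <= done:
                continue                       # not ready
            last_uses = {p for p in pred[u] if uses_left[p] == 1}
            new_live = (live - last_uses) | ({u} if succ[u] else set())
            if len(new_live) > limit:
                continue                       # would spill; try another
            for p in pred[u]:
                uses_left[p] -= 1
            order.append(u); done.add(u)
            found = search(order, done, uses_left, new_live)
            if found is not None:
                return found
            order.pop(); done.remove(u)        # undo scheduling
            for p in pred[u]:
                uses_left[p] += 1
        return None

    return search([], set(), {u: len(succ[u]) for u in succ}, set())
```

Exhaustive backtracking like this is exponential in the worst case, which is exactly why the paper invests in limiting how often and how far the scheduler backs up.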
Effective Instruction Scheduling with Limited Registers
, 2001
"... Effective global instruction scheduling techniques have become an important component in modern compilers for exposing more instructionlevel parallelism (ILP) and exploiting the everincreasing number of parallel function units. Effective register allocation has long been an essential component of a ..."
Abstract

Cited by 3 (0 self)
Effective global instruction scheduling techniques have become an important component in modern compilers for exposing more instruction-level parallelism (ILP) and exploiting the ever-increasing number of parallel function units. Effective register allocation has long been an essential component of a good compiler for reducing memory references. While instruction scheduling and register allocation are both essential compiler optimizations for fully exploiting the capability of modern high-performance microprocessors, there is a phase-ordering problem when we perform these two optimizations separately: instruction scheduling before register allocation may create insatiable demands for registers; register allocation before instruction scheduling may reduce the amount of parallelism that instruction scheduling can exploit. In this thesis, we propose to solve this phase-ordering problem by inserting a moderating optimization called code reorganization between prepass instruction scheduling and register allocation. Code reorganization adjusts the prepass scheduling results to make them demand fewer registers (i.e., exhibit lower register pressure) and guides register allocation to insert spill code that has less impact on schedule length. Our new approach avoids the complexity of simultaneous instruction scheduling and register allocation algorithms. In fact, it does not modify either instruction scheduling or register allocation algorithms. Therefore instruction scheduling can focus on maximizing instruction-level parallelism, and register allocation can focus on minimizing the cost of spill code. We compare the performance of our approach with a particular successful register-pressure-sensitive scheduling algorithm, and show an average of 18% improvement in speedup for an 8...
Optimal Global Instruction Scheduling Using Enumeration
 University of California
, 1991
Minimum Register Instruction Scheduling: A New Approach for Dynamic Instruction Issue Processors
 In Proc. of the Twelfth International Workshop on Languages and Compilers for Parallel Computing
, 1999
Abstract

Cited by 2 (2 self)
Modern superscalar architectures with dynamic scheduling and register renaming capabilities have introduced subtle but important changes into the tradeoffs between compile-time register allocation and instruction scheduling. In particular, it is perhaps not wise to increase the degree of parallelism of the static instruction schedule at the expense of excessive register pressure which may result in additional spill code. On the contrary, it may even be beneficial to reduce the register pressure at the expense of constraining the degree of parallelism of the static instruction schedule. This leads to the following interesting problem: given a data dependence graph (DDG) G, can we derive a schedule S for G that uses the least number of registers? In this paper, we present a heuristic approach to compute the near-optimal number of registers required for a DDG G (under all possible legal schedules). We propose an extended list-scheduling algorithm which uses the above number...
Power-Aware Compilation Techniques for High Performance Processors
, 2004