## Code Generation Techniques (1992)

Venue: | In INFOCOM (1 |

Citations: | 5 - 0 self |

### BibTeX

@TECHREPORT{Proebsting92codegeneration,
    author = {Todd Alan Proebsting},
    title = {Code Generation Techniques},
    institution = {In INFOCOM (1},
    year = {1992}
}

### Abstract

Optimal instruction scheduling and register allocation are NP-complete problems that require heuristic solutions. By restricting the problem of register allocation and instruction scheduling for delayed-load architectures to expression trees we are able to find optimal schedules quickly. This thesis presents a fast, optimal code scheduling algorithm for processors with a delayed load of 1 instruction cycle. The algorithm minimizes both execution time and register use and runs in time proportional to the size of the expression tree. In addition, the algorithm is simple
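The trade-off the abstract describes, between register use and stalls after a delayed load, can be illustrated with a small sketch. This is not the thesis's scheduling algorithm, which the truncated abstract does not specify. It only computes the classic Sethi-Ullman register need of an expression tree and counts the interlocks in a given instruction order on a machine with a 1-cycle delayed load. The `Node` class, the `(kind, destination, sources)` instruction format, and the register names are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Binary expression-tree node; a node with no children is a load from memory."""
    name: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def su_label(node: Node) -> int:
    """Sethi-Ullman register need, assuming every leaf is loaded into a register."""
    if node.left is None and node.right is None:
        return 1
    l, r = su_label(node.left), su_label(node.right)
    return l + 1 if l == r else max(l, r)


# Hypothetical instruction format: (kind, destination, sources).
Instr = Tuple[str, str, Tuple[str, ...]]


def count_stalls(schedule: List[Instr]) -> int:
    """Count interlocks: a load whose destination is read by the very next instruction."""
    return sum(
        1
        for prev, cur in zip(schedule, schedule[1:])
        if prev[0] == "load" and prev[1] in cur[2]
    )


# (a + b) * (c + d)
tree = Node("*", Node("+", Node("a"), Node("b")), Node("+", Node("c"), Node("d")))

# Sethi-Ullman style order: 3 registers, but each add directly follows a load it depends on.
su_order: List[Instr] = [
    ("load", "r1", ("a",)),
    ("load", "r2", ("b",)),
    ("op", "r1", ("r1", "r2")),
    ("load", "r2", ("c",)),
    ("load", "r3", ("d",)),
    ("op", "r2", ("r2", "r3")),
    ("op", "r1", ("r1", "r2")),
]

# All loads first: 4 registers, and no operation directly follows the load it needs.
loads_first: List[Instr] = [
    ("load", "r1", ("a",)),
    ("load", "r2", ("b",)),
    ("load", "r3", ("c",)),
    ("load", "r4", ("d",)),
    ("op", "r1", ("r1", "r2")),
    ("op", "r3", ("r3", "r4")),
    ("op", "r1", ("r1", "r3")),
]

print(su_label(tree), count_stalls(su_order), count_stalls(loads_first))
```

In this example the register-minimal order stalls twice, while spending one extra register removes both stalls. That tension is what an integrated scheduler and allocator has to resolve.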

### Citations

11502 |
Computers and Intractability: A Guide to the Theory of NP-Completeness
- Garey, Johnson
- 1979
Citation Context ...and global values, it is important not to overuse them when scheduling instructions. 2.1 Overview The problem of optimally scheduling instructions under arbitrary pipeline constraints is NP-complete ([GJ79], [LLM+87], [HG82], and [PS90]). Many heuristics have been proposed for scheduling pipelined code; all assume, however, that pipeline constraints can occur after any instruction, and that operators ...

4331 |
Computer Architecture: a quantitative approach (3rd ed
- Hennessy, Patterson
- 2003
Citation Context ...ned. Our simple machine's instruction set is given in Figure 2.2. This architecture is an approximation of the integer functional units of many modern RISC processors such as the SPARC and MIPS R3000 [PH90]. A delayed load requires that the destination of a load not be accessed by subsequent instructions for some number of instruction cycles, although other, unrelated instructions may execute. Delay wil...

433 |
Register allocation and spilling via graph coloring
- Chaitin
- 1982
Citation Context ...nt in the program that is over-allocated, but so far no legal assignment has been found. All of the variables that have been allocated registers are assigned registers using graph-coloring techniques [Cha82], [BCKT89], [CH90]. An important difference between using graph-coloring for allocation (as other algorithms do) and for assignment (as we do) is that failure to find a legal coloring (unlikely) does n...

288 | Optimally profiling and tracing programs
- Ball, Larus
- 1992
Citation Context ...a register rather than from memory. This can be determined heuristically from loop and conditional nesting levels, or empirically through profiling information from previous executions of the program [BL92]. Once a particular use has been allocated a register, there is no need to do a load of the variable at that use, thus saving time and space. Allocating a register causes the probabilities for other i...

183 |
Computer and Job-Shop Scheduling Theory
- Coffman
- 1976
Citation Context ...ns. 2.2 Previous Work An adaptation of Hu's algorithm [Hu61] gives an optimal solution to scheduling a tree-structured task system on multiple identical processors if each task has unit execution time [Cof76], but the algorithm does not handle register allocation constraints. For an architecture with 2 functional units, one for loads and one for operations, with identical pipeline constraints, Bernstein e...

181 | Global Register Allocation at Link Time
- Wall
- 1986
Citation Context ...ister allocation attempts to allocate registers among procedures so that procedures may pass parameters in registers, avoid saves and restores around calls, and share global values in registers. Wall [Wal86] built a system that allocated registers interprocedurally, at link-time when the entire program was available. All local and global variables had been previously allocated to memory, and his allocato...

175 |
Parallel Sequencing and Assembly Line Problems
- Hu
- 1961
Citation Context ... nodes are direct memory references. DLS is an attractive, simple, fast and effective alternative to more complicated, slower heuristic solutions. 2.2 Previous Work An adaptation of Hu's algorithm [Hu61] gives an optimal solution to scheduling a tree-structured task system on multiple identical processors if each task has unit execution time [Cof76], but the algorithm does not handle register allocati...

164 |
The priority-based coloring approach to register allocation
- Chow, Hennessy
- 1990
Citation Context ...other candidates whose aggregate benefit would have been greater. Previous global register allocation methods have concentrated on casting register allocation as a graph-coloring problem ([CAC+81], [CH90], [BCKT89], [LH86]). Since no two simultaneously live values may be assigned to the same register, an interference graph can be built where nodes represent register candidate values, and arcs exist be...

143 |
Pattern matching in trees
- Hoffmann, O’Donnell
- 1982
Citation Context ...than 50 VAX instructions per node of an expression tree [FH91c]. BURS code generators are fast for two reasons: they use bottom-up tree pattern matching technology (the theoretically fastest possible [HO82]), and they do all dynamic programming at compile-compile time (i.e., when the patterns are pre-processed to build the code generator). By doing dynamic programming at compile-compile time, a BURS cod...

135 |
Coloring Heuristics for Register Allocation
- Briggs, Cooper, et al.
- 1989
Citation Context ...ndidates whose aggregate benefit would have been greater. Previous global register allocation methods have concentrated on casting register allocation as a graph-coloring problem ([CAC+81], [CH90], [BCKT89], [LH86]). Since no two simultaneously live values may be assigned to the same register, an interference graph can be built where nodes represent register candidate values, and arcs exist between simu...

120 |
Efficient instruction scheduling for a pipelined architecture
- Gibbons, Muchnick
- 1986
Citation Context ...nd stall the processor for the additional cycle. Previously, most instruction schedulers handled delayed-load scheduling by solving a more general problem of arbitrary instruction scheduling ([HG83], [GM86], [War90], [LLM+87], and [PS90]). Arbitrary instruction scheduling considers operations other than loads with delays (such as multiplies/divides) that can take many cycles to complete. This thesis d...

110 | Code scheduling and register allocation in large basic blocks
- Goodman, Hsu
- 1988
Citation Context ... constraints and minimizing registers. Since both optimal scheduling and register allocation on DAGs are NP-complete problems, their solutions to the integrated problem are heuristic. Goodman and Hsu [GH88] describe a system, Integrated Prepass Scheduling (IPS), that combines register allocation and instruction scheduling. IPS is conceptually simple. 2 We will use operations to denote non-load instructi...

108 |
Code generation using tree matching and dynamic programming
- Aho, Ganapathi, et al.
- 1989
Citation Context ... tree patterns that describe different instructions are given weights to describe their relative costs, dynamic programming can be used to select the optimal set of instructions to evaluate the tree ([AGT89], [AJ76], [PLG88], [BDB90], and [AG85]). Dynamic programming is an expensive operation since it finds all optimal subsolutions before finding a solution for the entire tree. Fortunately, Bottom-Up Rew...

108 |
The generation of optimal code for arithmetic expressions
- Sethi, Ullman
- 1970
Citation Context ... of registers in use will never exceed R. 3 There are (L-1) operations in a binary tree with L loads. Figure 2.4 gives an example expression with a standard Sethi-Ullman (SU) instruction schedule [SU70], a canonical order with 3 registers, and a canonical order assuming 4 registers. (The Sethi-Ullman order, which is optimal with respect to register usage, orders instructions by scheduling sub-trees ...

105 |
Integrating register allocation and instruction scheduling for RISCs
- Bradlee, Eggers, et al.
- 1991
Citation Context ...hreshold, IPS switches to CSR to reduce register usage. Once reduced appropriately, IPS reverts to CSP. This oscillation continues until the scheduling process is complete. Bradlee, Eggers, and Henry [BEH91] describe another integrated system, Register Allocation with Schedule Estimates (RASE), and compare it to IPS. RASE works in three sequential passes: PRESCHED, GRA, and FINALSCHED. For each basic blo...

99 |
Code Optimization of Pipeline Constraints
- Gross
- 1972
Citation Context ...erlock and stall the processor for the additional cycle. Previously, most instruction schedulers handled delayed-load scheduling by solving a more general problem of arbitrary instruction scheduling ([HG83], [GM86], [War90], [LLM+87], and [PS90]). Arbitrary instruction scheduling considers operations other than loads with delays (such as multiplies/divides) that can take many cycles to complete. This ...

96 |
Optimal Code Generation for Expression Trees
- Aho, Johnson
- 1976
Citation Context ...terns that describe different instructions are given weights to describe their relative costs, dynamic programming can be used to select the optimal set of instructions to evaluate the tree ([AGT89], [AJ76], [PLG88], [BDB90], and [AG85]). Dynamic programming is an expensive operation since it finds all optimal subsolutions before finding a solution for the entire tree. Fortunately, Bottom-Up Rewrite Sys...

96 | Register allocation via hierarchical graph coloring
- Callahan, Koblenz
- 1991
Citation Context ...que involves creating a register interference graph and then pruning nodes from that graph that can be trivially colored (assigned a physical register) ([CAC+81], [Cha82], [CH90], [LH86], [BCKT89], [CK91]). The nodes of the graph represent the live ranges of the different register candidates (variables and temporaries). The live range of a candidate is the set of all program points where that candidat...

68 |
BEG – A Generator for Efficient Backends
- Emmelmann, Schröer, et al.
- 1989
Citation Context ...e all costs must be compile-compile time constants to precompute dynamic programming decisions. 4.11.2 BEG A code generator generator based on tree pattern matching was developed by Emmelmann et al. [ESL89]. The Back End Generator (BEG) uses naive pattern matching to find pattern matches within the tree IR to do instruction selection. The least-cost cover of the tree is found using dynamic programming t...

63 | A retargetable compiler for ANSI
- Hanson
- 1991
Citation Context ...7 minutes to under 15 seconds on a DECstation 5000. Since its development, BURG has been made publicly available, and is being used at AT&T Bell Labs to develop code generators for an ANSI C compiler [FH91b]. Chapter 2 Delayed-Load Scheduling Modern RISC architectures are characterized by small, simple instruction sets, and general-purpose registers. While simple functionally, many of the instructions ar...

52 |
Code generation for expressions with common subexpressions
- Aho, Johnson
Citation Context ...ix Chapter 1 Introduction The three main problems in code generation are what instructions to use, in what order to do the computations, and what values to keep in registers. Aho, Johnson, and Ullman [AJU77]. 1.1 Overview This thesis describes the following issues in code generation theory and technology: Chapter 2 develops an optimal instruction scheduler and register allocator for delayed-load architect...

52 |
Fast allocation and deallocation of memory based on object lifetimes
- Hanson
- 1990
Citation Context ...a 68000 grammar the program computed over 100,000 redundant itemsets. Fortunately, knowledge of the allocation/deallocation pattern of particular data can lead to very efficient memory management [Han90]. This is the case with itemsets. Itemsets, after allocation, are computed and then either retained forever or immediately released. It can never be the case, therefore, that two itemset deallocations...

52 |
Optimal code generation for expression trees: An application of BURS theory
- Pelegri-Llopart, Graham
- 1988
Citation Context ...at describe different instructions are given weights to describe their relative costs, dynamic programming can be used to select the optimal set of instructions to evaluate the tree ([AGT89], [AJ76], [PLG88], [BDB90], and [AG85]). Dynamic programming is an expensive operation since it finds all optimal subsolutions before finding a solution for the entire tree. Fortunately, Bottom-Up Rewrite System (BURS...

49 | fast optimal instruction selection and tree parsing - Burg - 1992 |

38 | Probabilistic register allocation - Proebsting, Fischer - 1992 |

37 |
CCG: A prototype coagulating code generator
- Morris
- 1991
Citation Context ...reuse within a basic block. Simple, fast and nearly optimal local register allocators are known [HFG89]. Once local register needs are met, the effects of global allocation can be estimated ([Bea74], [Mor91]). In particular, a good global allocation improves upon good local allocation by eliminating unnecessary loads at the entrance to a basic block and by eliminating unnecessary stores at the exit from ...

37 | Scheduling time-critical instructions on RISC machines
- Palem, Simons
- 1993
Citation Context ...ditional cycle. Previously, most instruction schedulers handled delayed-load scheduling by solving a more general problem of arbitrary instruction scheduling ([HG83], [GM86], [War90], [LLM+87], and [PS90]). Arbitrary instruction scheduling considers operations other than loads with delays (such as multiplies/divides) that can take many cycles to complete. This thesis describes the delayed-load schedul...

36 |
An improvement to bottom-up tree pattern matching
- Chase
- 1987
Citation Context ...igure 4.16). 91 of a transition table through a single indirection. Chase demonstrated that these maps can be produced on-the-fly during table generation so that no superfluous work need be performed [Cha87]. Pelegri-Llopart, the originator of BURS theory ([PLG88], [PL88]), incorporated Chase's ideas into a system that added cost information for dynamic programming at table generation time. In addition t...

33 | Efficient retargetable code generation using bottom-up tree pattern matching
- Balachandran, Dhamdhere, et al.
- 1990
Citation Context ...systems, including the techniques described here, do not allow general rewrites, but instead defer that responsibility to another phase of the compilation process. Balachandran, Dhamdhere, and Biswas [BDB90] simplified Pelegri's model by disallowing rewrite rules, and also generalized Chase's ideas to use cost information. Henry [Hen89] developed optimization techniques to limit the number of BURS states...

32 |
Instruction scheduling for the IBM RISC System/6000 processor
- Warren
- 1990
Citation Context ... the processor for the additional cycle. Previously, most instruction schedulers handled delayed-load scheduling by solving a more general problem of arbitrary instruction scheduling ([HG83], [GM86], [War90], [LLM+87], and [PS90]). Arbitrary instruction scheduling considers operations other than loads with delays (such as multiplies/divides) that can take many cycles to complete. This thesis describes ...

27 |
Efficient tree pattern matching: An aid to code generation
- Aho, Ganapathi
- 1985
Citation Context ...instructions are given weights to describe their relative costs, dynamic programming can be used to select the optimal set of instructions to evaluate the tree ([AGT89], [AJ76], [PLG88], [BDB90], and [AG85]). Dynamic programming is an expensive operation since it finds all optimal subsolutions before finding a solution for the entire tree. Fortunately, Bottom-Up Rewrite System (BURS) technology can hid...

27 | Simple and efficient BURS table generation - Proebsting - 1992 |

25 |
On the minimization of loads/stores in local register allocation
- Hsu, Fisher, et al.
- 1989
Citation Context ... operand is actually used it must be in a register, and once a value is in a register, it is easy to reuse within a basic block. Simple, fast and nearly optimal local register allocators are known [HFG89]. Once local register needs are met, the effects of global allocation can be estimated ([Bea74], [Mor91]). In particular, a good global allocation improves upon good local allocation by eliminating un...

24 |
Register allocation via usage counts
- Freiburghouse
- 1974
Citation Context ... decisions must still be made via ad hoc heuristic methods. Few graph coloring techniques do local (basic block level) register allocation as well as established local allocation algorithms ([HFG89], [Fre74], [FL88]). Unlike graph coloring algorithms, local allocation techniques are able to exploit information about the simple sequential nature of register usage in the block to minimize local spill code....

20 |
Crafting a Compiler. Benjamin-Cummings, Menlo Park
- Fischer, LeBlanc
- 1988
Citation Context ...s must still be made via ad hoc heuristic methods. Few graph coloring techniques do local (basic block level) register allocation as well as established local allocation algorithms ([HFG89], [Fre74], [FL88]). Unlike graph coloring algorithms, local allocation techniques are able to exploit information about the simple sequential nature of register usage in the block to minimize local spill code. This in...

19 |
Register Allocation in the SPUR Lisp Compiler
- Larus, Hilfinger
- 1986
Citation Context ...hose aggregate benefit would have been greater. Previous global register allocation methods have concentrated on casting register allocation as a graph-coloring problem ([CAC+81], [CH90], [BCKT89], [LH86]). Since no two simultaneously live values may be assigned to the same register, an interference graph can be built where nodes represent register candidate values, and arcs exist between simultaneous...

17 |
A code generation interface for
- Fraser, Hanson
- 1991
Citation Context ...y 6 integer registers available. 3.5 Implementation Results A prototype probabilistic register allocator has been built as part of an experimental code generator for an ANSI C compiler ("lcc"; [FH91b], [FH91a]). The code generator produces MIPS R2000 assembler. 3.5.1 Stanford Benchmarks The tables in Figures 3.10-3.11 summarize the results of running the compiler on the Stanford benchmarks suite. Each pro...

17 | High-quality code generation via bottom-up tree pattern matching - Hatcher, Christopher - 1986 |

13 |
Scheduling Arithmetic and Load Operations in Parallel with No Spilling
- Bernstein, Jaffe, et al.
- 1989
Citation Context ...with 2 functional units, one for loads and one for operations, with identical pipeline constraints, Bernstein et al. have investigated code scheduling with register allocation for trees ([BPR84] and [BJR89]). Although applicable to a much different machine, Bernstein's results and algorithms are similar to ours: both minimize pipeline interlocks and register usage, and both run in O(n) time (where n...

13 | Hard-coding bottom-up code generation tables to save time and space. Software: Practice and Experience
- Fraser, Henry
- 1991
Citation Context ...costs to build automata that can drive instruction selection very quickly. BURS-generated instruction selectors can be built that execute fewer than 50 VAX instructions per node of an expression tree [FH91c]. BURS code generators are fast for two reasons: they use bottom-up tree pattern matching technology (the theoretically fastest possible [HO82]), and they do all dynamic programming at compile-compile...

12 | Linear-Time, Optimal Code Scheduling for Delayed-Load Architectures - Proebsting, Fischer - 1991 |

10 | Code Generation and Reorganization in the Presence of Pipeline Constraints
- Hennessy
- 1982
Citation Context ...t is important not to overuse them when scheduling instructions. 2.1 Overview The problem of optimally scheduling instructions under arbitrary pipeline constraints is NP-complete ([GJ79], [LLM+87], [HG82], and [PS90]). Many heuristics have been proposed for scheduling pipelined code; all assume, however, that pipeline constraints can occur after any instruction, and that operators may share common sub...

10 | Tree templates and subtree transformational grammars - Kron - 1975 |

10 |
Rewrite Systems, Pattern Matching, and Code Generation
- Pelegrí-Llopart
- 1987
Citation Context .... Chase demonstrated that these maps can be produced on-the-fly during table generation so that no superfluous work need be performed [Cha87]. Pelegri-Llopart, the originator of BURS theory ([PLG88], [PL88]), incorporated Chase's ideas into a system that added cost information for dynamic programming at table generation time. In addition to recognizing that dynamic programming could be done prior to com...

9 | Concise specification of locally optimal code generators - Appel - 1987 |

9 |
Encoding optimal pattern selection in a table-driven bottom-up tree-pattern matcher
- Henry
- 1989
Citation Context ...hase of the compilation process. Balachandran, Dhamdhere, and Biswas [BDB90] simplified Pelegri's model by disallowing rewrite rules, and also generalized Chase's ideas to use cost information. Henry [Hen89] developed optimization techniques to limit the number of BURS states produced during table generation. With fewer states, a smaller automaton is produced more quickly. Henry's techniques are much mor...

8 |
Register assignment algorithm for generation of highly optimized code
- Beatty
- 1974
Citation Context ... easy to reuse within a basic block. Simple, fast and nearly optimal local register allocators are known [HFG89]. Once local register needs are met, the effects of global allocation can be estimated ([Bea74], [Mor91]). In particular, a good global allocation improves upon good local allocation by eliminating unnecessary loads at the entrance to a basic block and by eliminating unnecessary stores at the e...

6 | Table compression for tree automata - Borstler, Moncke, et al. - 1991 |

6 |
Optimal scheduling of arithmetic operations in parallel with memory access
- Bernstein, Jaffe, et al.
- 1985
Citation Context ...rchitecture with 2 functional units, one for loads and one for operations, with identical pipeline constraints, Bernstein et al. have investigated code scheduling with register allocation for trees ([BPR84] and [BJR89]). Although applicable to a much different machine, Bernstein's results and algorithms are similar to ours: both minimize pipeline interlocks and register usage, and both run in O(n) t...

1 |
Extended delayed-load scheduling
- Kurlander, Fischer, et al.
- 1992
Citation Context ...epresenting delayed loads. A more realistic machine model must be able to handle unary nodes, leaf instructions without delays, and delayed loads at internal nodes. Kurlander, Fischer, and Proebsting [KFP92] have extended DLS to optimally handle unary nodes and non-delayed leaf nodes. The improvements are called Extended DLS (EDLS). In addition, they give a simple heuristic for scheduling trees with inte...