## Iterative modulo scheduling: An algorithm for software pipelining loops (1994)

Venue: Proceedings of the 27th Annual International Symposium on Microarchitecture

Citations: 288 (3 self)

### BibTeX

@INPROCEEDINGS{Rau94iterativemodulo,
  author = {B. Ramakrishna Rau},
  title = {Iterative modulo scheduling: An algorithm for software pipelining loops},
  booktitle = {Proceedings of the 27th Annual International Symposium on Microarchitecture},
  year = {1994},
  pages = {63--74}
}

### Abstract

Modulo scheduling is a framework within which a wide variety of algorithms and heuristics may be defined for software pipelining innermost loops. This paper presents a practical algorithm, iterative modulo scheduling, that is capable of dealing with realistic machine models. This paper also characterizes the algorithm in terms of the quality of the generated schedules as well as the computational expense incurred.

### Citations

2510 | The Design and Analysis of Computer Algorithms - Aho, Hopcroft, et al. - 1974 |
Citation Context: ... procedure is to iteratively solve the above implicit set of equations for HeightR(). An algorithm that is based on that for identifying the SCCs of a graph during a depth-first traversal of the graph [2] was employed. This algorithm is described elsewhere [33]. $\mathrm{HeightR}(P) = \begin{cases} 0, & \text{if } P \text{ is the STOP pseudo-op} \\ \max_{Q \in \mathrm{Succ}(P)} \bigl( \mathrm{HeightR}(Q) + \mathrm{Delay}(P,Q) - \mathit{II} \cdot \mathrm{Distance}(P,Q) \bigr), & \text{otherwise} \end{cases}$ Estart(P) = Max...

636 | Trace Scheduling: A Technique for Global Microcode Compaction - Fisher - 1981 |
Citation Context: ...s in a single basic block and that higher levels of parallelism can only result from exploiting the ILP between successive basic blocks. Global acyclic scheduling techniques, such as trace scheduling [13, 23] and superblock scheduling [19], do so by moving operations from their original basic blocks to preceding or succeeding basic blocks. In the case of loops, the successive basic blocks correspond to the...

524 | Software Pipelining: An Effective Scheduling Technique for VLIW Machines - Lam - 1988 |
Citation Context: ...mework was formulated over a decade ago [34], at least two product compilers have incorporated modulo scheduling algorithms [30, 10], and any number of research papers have been written on this topic [16, 21, 41, 39, 44, 45, 18], there exists a vague and unfounded perception that modulo scheduling is computationally expensive, too complicated to implement, and that the resulting schedules are sub-optimal. In large part, this...

352 | Effective compiler support for predicated execution using the hyperblock - Mahlke, Lin, et al. - 1992 |
Citation Context: ...l flow graph. With the use of either profile information or heuristics, only those control flow paths that are expected to be frequently executed can be selected as is done with hyperblock scheduling [25]. This defines the region that is to be modulo scheduled. Within this region, memory reference data flow analysis and optimization are performed in order to eliminate partially redundant loads and s...

283 | Combinatorial Optimization: Networks and - Lawler - 1976 |
Citation Context: ...e inequality for that circuit, and to use the largest such value across all circuits. The second approach, the one used in this study, is to pose the problem as a minimal cost-to-time ratio cycle problem [22] as proposed by Huff [18]. The algorithm ComputeMinDist computes, for a given II, the MinDist matrix whose [i, j] entry specifies the minimum permissible interval between the time at which operation i...

269 | Conversion of control dependence to data dependence - Allen, Kennedy, et al. - 1983 |
Citation Context: ... if the memory ports are the critical (most heavily used) resources. At this point, the selected region is IF-converted, with the result that all branches except for the loop-closing branch disappear [4, 29, 10]. With control flow converted to data dependences involving predicates [37, 5], the region now looks like a single basic block. Anti- and output dependences are minimized by putting the computation in...

268 | The superblock: an effective technique for VLIW and superscalar compilation - Hwu - 1993 |
Citation Context: ...higher levels of parallelism can only result from exploiting the ILP between successive basic blocks. Global acyclic scheduling techniques, such as trace scheduling [13, 23] and superblock scheduling [19], do so by moving operations from their original basic blocks to preceding or succeeding basic blocks. In the case of loops, the successive basic blocks correspond to the successive iterations of the ...

243 | Some Scheduling Techniques and an Easily Schedulable Horizontal Architecture for - Rau, Glaeser - 1981 |

206 | The Perfect Club benchmarks: Effective performance evaluation of supercomputers - Berry - 1989 |

177 | A comparison of list schedules for parallel processing systems - Adam, Chandy, et al. - 1974 |
Citation Context: ...er. One way of performing software pipelining, the “move-then-schedule” approach, is to move instructions, one by one, across the back-edge of the loop, in either the forward or the backward direction [11, 12, 20, 15, 28]. Although such code motion can yield improvements in the schedule, it is not always clear which operations should be moved around the back edge, in which direction and how many times to get the best ...

177 | The Multiflow trace scheduling compiler - Lowney, Freudenberger, et al. - 1993 |
Citation Context: ...s in a single basic block and that higher levels of parallelism can only result from exploiting the ILP between successive basic blocks. Global acyclic scheduling techniques, such as trace scheduling [13, 23] and superblock scheduling [19], do so by moving operations from their original basic blocks to preceding or succeeding basic blocks. In the case of loops, the successive basic blocks correspond to the...

168 | Parallel Sequencing and Assembly Line Problems - Hu - 1961 |
Citation Context: ...schedule such operations since all but the first one scheduled in a SCC are subject to a deadline. Instead, we shall use a priority function that is a direct extension of the height-based priority [17, 31] that is popular in acyclic list scheduling [1]. function FindTimeSlot (Operation, MinTime, MaxTime: integer): integer; var CurrTime, SchedSlot: integer; begin CurrTime := MinTime; SchedSlot := null; ...

147 | The Livermore Fortran Kernels: A Computer Test of the Numerical Performance Range - McMahon - 1986 |
Citation Context: ...ts 4.1 The experimental setup The experimental input to the research scheduler was obtained from the Perfect Club benchmark suite [6], the SPEC benchmarks [43] and the Livermore Fortran Kernels (LFK) [27] using the Fortran77 compiler for the Cydra 5. The Cydra 5 compiler examines every innermost loop as a potential candidate for modulo scheduling. Candidate loops are rejected if they are not DO-loops,...

146 | Compiling for the Cydra 5 - Dehnert, Towle - 1993 |

141 | Highly concurrent scalar processing - Hsu, Davidson - 1986 |
Citation Context: ...mework was formulated over a decade ago [34], at least two product compilers have incorporated modulo scheduling algorithms [30, 10], and any number of research papers have been written on this topic [16, 21, 41, 39, 44, 45, 18], there exists a vague and unfounded perception that modulo scheduling is computationally expensive, too complicated to implement, and that the resulting schedules are sub-optimal. In large part, this...

132 | Lifetime-sensitive modulo scheduling - Huff - 1993 |
Citation Context: ...mework was formulated over a decade ago [34], at least two product compilers have incorporated modulo scheduling algorithms [30, 10], and any number of research papers have been written on this topic [16, 21, 41, 39, 44, 45, 18], there exists a vague and unfounded perception that modulo scheduling is computationally expensive, too complicated to implement, and that the resulting schedules are sub-optimal. In large part, this...

110 | Register allocation for software pipelined loops - Rau, Lee, et al. - 1992 |
Citation Context: ..., the prologue and epilogues can be scheduled along with the rest of the code surrounding the loop while honoring the constraints imposed by the schedule for the kernel.) Rotating register allocation [35] (or traditional register allocation if modulo variable expansion was done) is performed for the kernel. The prologue and epilogues are treated along with the rest of the code surrounding the loop in ...

104 | On predicated execution - Park, Schlansker - 1991 |

86 | Code Generation Schema for Modulo Scheduled Loops - Rau, Schlansker, et al. - 1992 |
Citation Context: ...rmed the initiation interval (II). In contrast to unrolling approaches, the code expansion is quite limited. In fact, with the appropriate hardware support, there need be no code expansion whatsoever [36]. Once the modulo schedule has been ...

85 | Iterative modulo scheduling - Rau |

85 | approach to scientific array processing: The architectural design of the AP-120b/FPS-164 family - Charlesworth - 1981 |
Citation Context: ... scheduling barrier at the back-edge. The resulting performance degradation can be reduced by increasing the extent of the unrolling, but it is at the cost of increased code size. Software pipelining [8] refers to a class of global cyclic scheduling algorithms which impose no such scheduling barrier. One way of performing software pipelining, the "move-then-schedule" approach, is to move instructio...

78 | An efficient resource-constrained global scheduling technique for superscalar and VLIW processors - Moon, Ebcioglu - 1992 |
Citation Context: ...er. One way of performing software pipelining, the “move-then-schedule” approach, is to move instructions, one by one, across the back-edge of the loop, in either the forward or the backward direction [11, 12, 20, 15, 28]. Although such code motion can yield improvements in the schedule, it is not always clear which operations should be moved around the back edge, in which direction and how many times to get the best ...

69 | Parallelization of loops with exits on pipelined architectures - Tirumalai, Lee, et al. - 1990 |
Citation Context: ...putation into the dynamic single assignment form [32]. If control dependences are the limiting factor in schedule performance, they may be selectively ignored thereby enabling speculative code motion [41, 24]. Back-substitution of data and control dependences may be employed to further reduce critical path lengths [38, 10]. Next, the lower bound on the initiation interval is computed. If this is not an in...

66 | The Cydra 5 departmental supercomputer: Design philosophies, decisions, and trade-offs - Rau, Yen, et al. - 1989 |
Citation Context: ..., the selected region is IF-converted, with the result that all branches except for the loop-closing branch disappear [4, 29, 10]. With control flow converted to data dependences involving predicates [37, 5], the region now looks like a single basic block. Anti- and output dependences are minimized by putting the computation into the dynamic single assignment form [32]. If control dependences are the lim...

65 | A compilation technique for software pipelining of loops with conditional jumps - Ebcioglu - 1987 |

64 | Reverse if-conversion - Warter, Mahlke, et al. - 1993 |
Citation Context: ...e loop in such a way as to honor the constraints imposed by the register allocation for the kernel. Finally, if the hardware has no predicated execution capability [37, 5], reverse IF-conversion [46] is employed to regenerate control flow. The subject of this paper is the modulo scheduling algorithm itself, which is at the heart of this entire process. This includes the computation of the lower b...

62 | The Cydra 5 minisupercomputer: Architecture and implementation - Beck, Yen, et al. - 1993 |
Citation Context: ..., the selected region is IF-converted, with the result that all branches except for the loop-closing branch disappear [4, 29, 10]. With control flow converted to data dependences involving predicates [37, 5], the region now looks like a single basic block. Anti- and output dependences are minimized by putting the computation into the dynamic single assignment form [32]. If control dependences are the lim...

52 | A new compilation technique for parallelizing loops with unpredictable branches on a VLIW architecture - Ebcioglu, Nakatani - 1989 |
Citation Context: ... One way of performing software pipelining, the "move-then-schedule" approach, is to move instructions, one by one, across the back-edge of the loop, in either the forward or the backward direction [11, 12, 20, 15, 28]. Although such code motion can yield improvements in the schedule, it is not always clear which operations should be moved around the back edge, in which direction and how many times to get the best ...

47 | SPEC Benchmark Suite: Designed for Today's Advanced Systems - Uniejewski - 1989 |

45 | Sentinel scheduling: A model for compiler-controlled speculative execution - Mahlke - 1993 |
Citation Context: ...putation into the dynamic single assignment form [32]. If control dependences are the limiting factor in schedule performance, they may be selectively ignored thereby enabling speculative code motion [41, 24]. Back-substitution of data and control dependences may be employed to further reduce critical path lengths [38, 10]. Next, the lower bound on the initiation interval is computed. If this is not an in...

43 | Circular scheduling: A new technique to perform software pipelining - Jain - 1991 |
Citation Context: ... One way of performing software pipelining, the "move-then-schedule" approach, is to move instructions, one by one, across the back-edge of the loop, in either the forward or the backward direction [11, 12, 20, 15, 28]. Although such code motion can yield improvements in the schedule, it is not always clear which operations should be moved around the back edge, in which direction and how many times to get the best ...

40 | Resource-Constrained Software Pipelining - Aiken, Nicolau, et al. - 1995 |

38 | Optimal scheduling strategies in a multiprocessor system - Ramamoorthy, Chandy, et al. - 1972 |
Citation Context: ...schedule such operations since all but the first one scheduled in a SCC are subject to a deadline. Instead, we shall use a priority function that is a direct extension of the height-based priority [17, 31] that is popular in acyclic list scheduling [1]. function FindTimeSlot (Operation, MinTime, MaxTime: integer): integer; var CurrTime, SchedSlot: integer; begin CurrTime := MinTime; SchedSlot := null; ...

32 | GURPR: A New Global Software Pipelining Algorithm - Su, Wang - 1991 |
Citation Context: ...mework was formulated over a decade ago [34], at least two product compilers have incorporated modulo scheduling algorithms [30, 10], and any number of research papers have been written on this topic [16, 21, 41, 39, 44, 45, 18], there exists a vague and unfounded perception that modulo scheduling is computationally expensive, too complicated to implement, and that the resulting schedules are sub-optimal. In large part, this...

29 | A polynomial time method for optimal software pipelining - Dongen, Gao, et al. - 1992 |
Citation Context: ...mework was formulated over a decade ago [34], at least two product compilers have incorporated modulo scheduling algorithms [30, 10], and any number of research papers have been written on this topic [16, 21, 41, 39, 44, 45, 18], there exists a vague and unfounded perception that modulo scheduling is computationally expensive, too complicated to implement, and that the resulting schedules are sub-optimal. In large part, this...

28 | On algorithms for enumerating all circuits of a graph - Mateti, Deo - 1976 |
Citation Context: ... II imposed by this one recurrence circuit. The RecMII is determined by considering the worst-case constraint across all circuits. One approach is to enumerate all the elementary circuits in the graph [40, 26] as was done in the Cydra 5 compiler, calculate the smallest value of II that satisfies the above inequality for that circuit, and to use the largest such value across all circuits. The second approac...

27 | An efficient search algorithm to find the elementary circuits of a graph - Tiernan - 1970 |
Citation Context: ... II imposed by this one recurrence circuit. The RecMII is determined by considering the worst-case constraint across all circuits. One approach is to enumerate all the elementary circuits in the graph [40, 26] as was done in the Cydra 5 compiler, calculate the smallest value of II that satisfies the above inequality for that circuit, and to use the largest such value across all circuits. The second approac...

25 | Effective control for pipelined computers - Davidson, Shar, et al. - 1975 |
Citation Context: ... last cycle of execution. Likewise, Figure 1b shows the resource usage pattern of a multiply operation on the multiplier pipeline. This method of modelling resource usage is termed a reservation table [9]. From these two reservation tables, it is evident that an ALU operation (such as an add) and a multiply cannot be scheduled for issue at the same time since they will collide in their usage of the so...

20 | Software pipelining in PA-RISC compilers - Ramakrishnan - 1992 |
Citation Context: ... precede or follow the actual scheduling. Although the modulo scheduling framework was formulated over a decade ago [34], at least two product compilers have incorporated modulo scheduling algorithms [30, 10], and any number of research papers have been written on this topic [16, 21, 41, 39, 44, 45, 18], there exists a vague and unfounded perception that modulo scheduling is computationally expensive, too...

17 | Data flow and dependence analysis for instruction level parallelism - Rau - 1992 |
Citation Context: ...is defines the region that is to be modulo scheduled. Within this region, memory reference data flow analysis and optimization are performed in order to eliminate partially redundant loads and stores [32, 10]. This can improve the schedule if either a load is on a critical path or if the memory ports are the critical (most heavily used) resources. At this point, the selected region is IF-converted, with t...

14 | Scheduling loops on parallel processors: a simple algorithm with close to optimum performance - Gasperoni, Schwiegelshohn - 1992 |
Citation Context: ...er. One way of performing software pipelining, the “move-then-schedule” approach, is to move instructions, one by one, across the back-edge of the loop, in either the forward or the backward direction [11, 12, 20, 15, 28]. Although such code motion can yield improvements in the schedule, it is not always clear which operations should be moved around the back edge, in which direction and how many times to get the best ...

14 | The benefit of predicated execution for software pipelining - Warter, Lavery, et al. - 1993 |

13 | Microcode compaction: Looking backward and looking forward - Fisher, Landskov, et al. - 1981 |
Citation Context: ...l-before-scheduling" schemes which rely on unrolling the body of the original loop prior to scheduling [13, 19, 23] and the "unroll-while-scheduling" schemes which unroll concurrently with scheduling [14, 7, 3]. To be competitive with iterative modulo scheduling, those schemes would need to get within 2.8% of the (possibly unachievable) lower bound on execution time without unrolling the loop body to more t...

12 | Acceleration of first and higher order recurrences on processors with instruction level parallelism - Schlansker, Kathail - 1993 |
Citation Context: ...rformance, they may be selectively ignored thereby enabling speculative code motion [41, 24]. Back-substitution of data and control dependences may be employed to further reduce critical path lengths [38, 10]. Next, the lower bound on the initiation interval is computed. If this is not an integer, and if the percentage degradation in rounding it up to the next larger integer is unacceptably high, the body...

11 | The benefit of predicated execution for software pipelining - Warter, Lavery, et al. - 1993 |

7 | Loop optimization for horizontal microcoded machines - Bodin, Charot - 1990 |

4 | A comparison of list schedules for parallel processing systems - Dickson - 1974 |

4 | A technique of global optimization of microprograms - Tokoro, Takizuka, et al. - 1978 |
Citation Context: ...irection and how many times to get the best results. The process is somewhat arbitrary and reminiscent of early attempts at global acyclic scheduling by the ad hoc motion of code between basic blocks [42]. On the other hand, this currently represents the only approach to software pipelining that at least has the potential to handle loops containing control flow in a near-optimal fashion, and which has...

2 | A compilation technique for software pipelining of loops with conditional jumps - unknown authors - 1987 |

2 | Combinatorial Optimization: Networks and - L - 1976 |