Iterative modulo scheduling: An algorithm for software pipelining loops
- In Proceedings of the 27th Annual International Symposium on Microarchitecture
, 1994
Cited by 263 (2 self)
Modulo scheduling is a framework within which a wide variety of algorithms and heuristics may be defined for software pipelining innermost loops. This paper presents a practical algorithm, iterative modulo scheduling, that is capable of dealing with realistic machine models. This paper also characterizes the algorithm in terms of the quality of the generated schedules as well as the computational expense incurred.
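The abstract above does not spell out the algorithm, so the following is only a minimal sketch of the underlying "modulo" idea, under simplifying assumptions: every operation occupies one unit of one resource for one cycle, and dependences are reduced to an earliest start cycle. An operation placed at cycle t occupies slot t mod II of its resource, where II is the initiation interval. The function names (`res_mii`, `greedy_modulo_place`) and the sample operations are hypothetical, and the placement is a plain greedy search, not the paper's iterative algorithm.

```python
import math
from collections import Counter


def res_mii(ops, units):
    """Resource-constrained lower bound on II for this simplified model:
    the maximum over resources of ceil(uses / available units)."""
    uses = Counter(resource for _, resource, _ in ops)
    return max(math.ceil(count / units[res]) for res, count in uses.items())


def greedy_modulo_place(ops, units, ii):
    """Place each op at the first cycle >= its earliest start whose
    modulo slot (cycle % ii) still has a free unit of its resource.

    Returns {name: cycle}, or None if some op cannot be placed at this II.
    Greedy only: no eviction or backtracking is attempted.
    """
    busy = Counter()  # (resource, slot) -> units already taken
    schedule = {}
    for name, resource, earliest in ops:
        # Only ii consecutive cycles need checking: slots repeat after that.
        for cycle in range(earliest, earliest + ii):
            slot = cycle % ii
            if busy[(resource, slot)] < units[resource]:
                busy[(resource, slot)] += 1
                schedule[name] = cycle
                break
        else:
            return None
    return schedule


# Hypothetical loop body: (name, resource, earliest start cycle).
OPS = [("load_a", "mem", 0), ("load_b", "mem", 0),
       ("mul", "fpu", 2), ("store", "mem", 3)]
UNITS = {"mem": 1, "fpu": 1}

if __name__ == "__main__":
    ii = res_mii(OPS, UNITS)
    print("lower bound on II:", ii)
    print("schedule:", greedy_modulo_place(OPS, UNITS, ii))
```

With these sample inputs, three memory operations share one memory unit, so no schedule can start iterations more often than every three cycles; trying a smaller II makes the greedy placement fail.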
Instruction-Level Parallel Processing: History, Overview and Perspective
, 1992
Cited by 166 (0 self)
Instruction-level Parallelism (ILP) is a family of processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel. Although ILP has appeared in the highest performance uniprocessors for the past 30 years, the 1980s saw it become a much more significant force in computer design. Several systems were built, and sold commercially, which pushed ILP far beyond where it had been before, both in terms of the amount of ILP offered and in the central role ILP played in the design of the system. By the end of the decade, advanced microprocessor design at all major CPU manufacturers had incorporated ILP, and new techniques for ILP have become a popular topic at academic conferences. This article provides an overview and historical perspective of the field of ILP and its development over the past three decades.
Lifetime-Sensitive Modulo Scheduling
- In Proc. of the ACM SIGPLAN '93 Conf. on Programming Language Design and Implementation
, 1993
Cited by 129 (0 self)
This paper shows how to software pipeline a loop for minimal register pressure without sacrificing the loop's minimum execution time. This novel bidirectional slack-scheduling method has been implemented in a FORTRAN compiler and tested on many scientific benchmarks. The empirical results---when measured against an absolute lower bound on execution time, and against a novel schedule-independent absolute lower bound on register pressure---indicate near-optimal performance.
1 Introduction
Software pipelining increases a loop's throughput by overlapping the loop's iterations; that is, by initiating successive iterations before prior iterations complete. With sufficient overlap, a functional unit can be saturated, at which point the loop initiates iterations at the maximum possible rate. To find an overlapped schedule, a compiler must represent the complex resource constraints that can arise. Efficiently representing these constraints is especially difficult when adjacent iterations do n...
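The throughput gain from overlapping iterations can be illustrated with a small cycle count. This is a sketch under stated assumptions, not anything taken from the paper: every iteration takes the same number of cycles (`schedule_length`), a new iteration is initiated every `ii` cycles, and the numbers in the usage lines are hypothetical.

```python
def sequential_cycles(iterations: int, schedule_length: int) -> int:
    """Cycles when each iteration finishes before the next one starts."""
    return iterations * schedule_length


def pipelined_cycles(iterations: int, schedule_length: int, ii: int) -> int:
    """Cycles when a new iteration is initiated every `ii` cycles.

    The last iteration starts at cycle (iterations - 1) * ii and then
    needs schedule_length cycles to complete.
    """
    if iterations <= 0:
        return 0
    return (iterations - 1) * ii + schedule_length


if __name__ == "__main__":
    n, length, ii = 100, 8, 2  # hypothetical loop: 100 iterations of 8 cycles
    print(sequential_cycles(n, length))     # 800
    print(pipelined_cycles(n, length, ii))  # 206
```

For long-running loops the pipelined count is dominated by the `ii` term, which is why the initiation rate, rather than the length of a single iteration, determines steady-state throughput.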
Improving the Ratio of Memory Operations to Floating-Point Operations in Loops
- ACM Transactions on Programming Languages and Systems
, 1994
Cited by 91 (16 self)
this paper we attempt to answer that question. To do so, we develop and evaluate techniques that automatically restructure program loops to achieve high performance on specific target architectures. These methods attempt to balance computation and memory accesses and seek to eliminate or reduce pipeline interlock. To do this, they statically estimate the balance between memory operations and floating-point operations for each loop in a particular program and use these estimates to determine whether to apply various loop transformations. Experiments with our automatic techniques show that integer-factor speedups are possible on kernels. Additionally, the estimate of the balance between memory operations and computation, and the application of the estimate are very accurate---experiments reveal little difference between the balance achieved by our automatic system and that possible by hand optimization.
Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors---Compilers;
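The abstract does not give a formula for balance, so the following is only a sketch of one common way to express the idea, assumed here for illustration: a loop's balance is its memory operations per floating-point operation, a machine's balance is the ratio of the peak rates it can sustain for each, and a loop whose balance exceeds the machine's is limited by memory. All function names and numbers are hypothetical.

```python
def loop_balance(mem_ops: int, flops: int) -> float:
    """Memory operations issued per floating-point operation in the loop body."""
    return mem_ops / flops


def machine_balance(mem_ops_per_cycle: float, flops_per_cycle: float) -> float:
    """Memory operations the machine can sustain per floating-point operation."""
    return mem_ops_per_cycle / flops_per_cycle


def is_memory_bound(mem_ops: int, flops: int,
                    mem_ops_per_cycle: float, flops_per_cycle: float) -> bool:
    """True when the loop demands more memory traffic per flop than the
    machine can supply, so a restructuring transformation may be worthwhile."""
    return loop_balance(mem_ops, flops) > machine_balance(
        mem_ops_per_cycle, flops_per_cycle)


if __name__ == "__main__":
    # Hypothetical loop body: 3 memory operations, 2 floating-point operations.
    # Hypothetical machine: 1 memory operation and 2 flops per cycle.
    print(loop_balance(3, 2))           # 1.5
    print(machine_balance(1, 2))        # 0.5
    print(is_memory_bound(3, 2, 1, 2))  # True
```

A compiler using such an estimate would compare the two ratios per loop and apply a transformation only where the loop's demand exceeds what the target can supply.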
Code Generation Schema for Modulo Scheduled Loops
- In Proceedings of the 25th Annual International Symposium on Microarchitecture
, 1992
Cited by 80 (6 self)
Software pipelining is an important instruction scheduling technique for efficiently overlapping successive iterations of loops and executing them in parallel. Modulo scheduling is one approach for generating such schedules. This paper addresses an issue which has received little attention thus far, but which is non-trivial in its complexity: the task of generating correct, high-performance code once the modulo schedule has been generated, taking into account the nature of the loop and the register allocation strategy that will be used. This issue is studied both with and without hardware features that are specifically aimed at supporting modulo scheduling.
Minimizing Register Requirements under Resource-Constrained Rate-Optimal Software Pipelining
, 1995
Cited by 73 (13 self)
The rapid advances in high-performance computer architecture and compilation techniques provide both challenges and opportunities to exploit the rich solution space of software pipelined loop schedules. In this paper, we develop a framework to construct a software pipelined loop schedule which runs on the given architecture (with a fixed number of processor resources) at the maximum possible iteration rate (à la rate-optimal) while minimizing the number of buffers --- a close approximation to minimizing the number of registers. The main contributions of this paper are:
- First, we demonstrate that such a problem can be described by a simple mathematical formulation with precise optimization objectives under a periodic linear scheduling framework. The mathematical formulation provides a clear picture which permits one to visualize the overall solution space (for rate-optimal schedules) under different sets of constraints.
- Secondly, we show that a precise mathematical formulation...
Enhanced Modulo Scheduling for Loops with Conditional Branches
- In Proceedings of the 25th Annual International Symposium on Microarchitecture
, 1992
Cited by 69 (6 self)
Loops with conditional branches have multiple execution paths which are difficult to software pipeline. The modulo scheduling technique for software pipelining addresses this problem by converting loops with conditional branches into straight-line code before scheduling. In this paper we present an Enhanced Modulo Scheduling (EMS) technique that can achieve a lower minimum Initiation Interval than modulo scheduling techniques that rely on either Hierarchical Reduction or If-conversion with Predicated Execution. These three modulo scheduling techniques have been implemented in a prototype compiler. We show that for existing architectures which support one branch per cycle, EMS performs approximately 18% better than Hierarchical Reduction. We also show that If-conversion with Predicated Execution outperforms EMS assuming one branch per cycle. However, with hardware support for multiple branches per cycle, EMS should perform as well as or better than If-conversion with Predicated Execution.
Effective Cluster Assignment for Modulo Scheduling
- In Proceedings of the 31st International Symposium on Microarchitecture (MICRO-31)
, 1998
Cited by 53 (0 self)
Clustering is one solution to the demand for wide-issue machines and fast clock cycles because it allows for smaller, less ported register files and simpler bypass logic while remaining scalable. Much of the previous work on scheduling for clustered architectures has focused on acyclic code. While minimizing schedule length of acyclic code is paramount, the primary objective when scheduling cyclic code is to maximize the throughput or steady state performance. This paper investigates a pre-modulo scheduling pass that performs cluster assignment in a way that minimizes performance degradation due to explicit communication required as the loops are split over clusters. The proposed cluster assignment algorithm annotates and adjusts the graph for use by the scheduler so that any traditional modulo scheduling algorithm, having no knowledge of clustering, can produce a valid and efficient schedule for a clustered machine.
Hypernode Reduction Modulo Scheduling
- In Proc. of the 28th Annual Int. Symp. on Microarchitecture (MICRO-28)
, 1995
Cited by 53 (22 self)
Software Pipelining is a loop scheduling technique that extracts parallelism from loops by overlapping the execution of several consecutive iterations. Most prior scheduling research has focused on achieving minimum execution time, without regard to register requirements. Most strategies tend to stretch operand lifetimes because they schedule some operations too early or too late. The paper presents a novel strategy that simultaneously schedules some operations late and other operations early, minimizing all the stretchable dependencies and therefore reducing the registers required by the loop. The key to this strategy is a pre-ordering phase that selects the order in which the operations will be scheduled. The results show that the method described in this paper performs better than other heuristic methods and almost as well as a linear programming method while requiring much less time to produce the schedules.
Out-of-Order Vector Architectures
, 1997
Cited by 46 (21 self)
Register renaming and out-of-order instruction issue are now commonly used in superscalar processors. These techniques can also be used to significant advantage in vector processors, as this paper shows. Performance is improved and available memory bandwidth is used more effectively. Using a trace-driven simulation, we compare a conventional vector implementation, based on the Convex C3400, with an out-of-order, register renaming, vector implementation. When the number of physical registers is above 12, out-of-order execution coupled with register renaming provides a speedup of 1.24--1.72 for realistic memory latencies. Out-of-order techniques also tolerate main memory latencies of 100 cycles with a performance degradation of less than 6%. The mechanisms used for register renaming and out-of-order issue can be used to support precise interrupts -- generally a difficult problem in vector machines. When precise interrupts are implemented, there is typically less than a 10% degradation in performance. A new technique based on register renaming is targeted at dynamically eliminating spill code; this technique is shown to provide an extra speedup ranging between 1.10 and 1.20 while reducing total memory traffic by an average of 15--20%.

