Software pipelining: An effective scheduling technique for VLIW machines
1988
Cited by 478 (3 self)
This paper shows that software pipelining is an effective and viable scheduling technique for VLIW processors. In software pipelining, iterations of a loop in the source program are continuously initiated at constant intervals, before the preceding iterations complete. The advantage of software pipelining is that optimal performance can be achieved with compact object code. This paper extends previous results of software pipelining in two ways: First, this paper shows that by using an improved algorithm, near-optimal performance can be obtained without specialized hardware. Second, we propose a hierarchical reduction scheme whereby entire control constructs are reduced to an object similar to an operation in a basic block. With this scheme, all innermost loops, including those containing conditional statements, can be software pipelined. It also diminishes the start-up cost of loops with a small number of iterations. Hierarchical reduction complements the software pipelining technique, permitting a consistent performance improvement to be obtained. The techniques proposed have been validated by an implementation of a compiler for Warp, a systolic array consisting of 10 VLIW processors. This compiler has been used for developing a large number of applications in the areas of image, signal and scientific processing.
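The idea of initiating loop iterations at a constant interval can be illustrated with a small sketch. This is not the paper's scheduling algorithm. It is a toy model that assumes every loop body has `stages` one-cycle stages, that each stage uses its own resource (so overlapping iterations never conflict), and that the function names `pipelined_schedule` and `total_cycles` are hypothetical.

```python
def pipelined_schedule(n_iters, stages, interval):
    """Map each cycle to the (iteration, stage) pairs active in that cycle.

    Toy model (an assumption, not the paper's algorithm): iteration k starts
    at cycle k * interval and runs its stages in consecutive cycles.
    """
    schedule = {}
    for it in range(n_iters):
        start = it * interval
        for stage in range(stages):
            schedule.setdefault(start + stage, []).append((it, stage))
    return schedule


def total_cycles(n_iters, stages, interval):
    """Cycles needed when a new iteration starts every `interval` cycles."""
    if n_iters == 0:
        return 0
    return (n_iters - 1) * interval + stages


# Four iterations of a three-stage body.
overlapped = total_cycles(4, 3, interval=1)   # new iteration every cycle
sequential = total_cycles(4, 3, interval=3)   # wait for the previous one
```

With `interval` equal to the body length the loop runs sequentially. Shrinking the interval overlaps iterations, and the steady-state cost per iteration becomes the interval itself rather than the full body length.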
Estimating Interlock And Improving Balance For Pipelined Architectures
Journal of Parallel and Distributed Computing, 1988
Cited by 80 (14 self)
Pipelining is now a standard technique for increasing the speed of computers, particularly for floating-point arithmetic. Single-chip, pipelined floating-point functional units are available as "off the shelf" components. Addressing arithmetic can be done concurrently with floating-point operations to construct a fast processor that can exploit fine-grain parallelism. This paper describes a metric to estimate the optimal execution time of DO loops on particular processors. This metric is parameterized by the memory bandwidth and peak floating-point rate of the processor, as well as the length of the pipelines used in the functional units. Data dependence analysis provides information about the execution order constraints of the operations in the DO loop and is used to estimate the amount of pipeline interlock required by a loop. Several transformations are investigated to determine their impact on loops under this metric.
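A rough sketch of how memory bandwidth and peak floating-point rate bound a loop's execution time follows. It is a simplified stand-in, not the paper's metric: it deliberately ignores pipeline length and interlock, which the abstract says the real metric includes. The function names and the per-cycle rate parameters are assumptions made for illustration.

```python
def loop_balance(mem_ops, flops):
    """Memory references per floating-point operation in one loop iteration."""
    return mem_ops / flops


def machine_balance(mem_per_cycle, flops_per_cycle):
    """Memory references the machine can sustain per floating-point operation."""
    return mem_per_cycle / flops_per_cycle


def cycles_lower_bound(mem_ops, flops, mem_per_cycle, flops_per_cycle):
    """Lower bound on cycles per iteration: the slower of the two resources.

    Simplification (assumption): pipeline interlock is not modelled.
    """
    return max(mem_ops / mem_per_cycle, flops / flops_per_cycle)


# y[i] = y[i] + a * x[i]: two loads and one store, one multiply and one add.
bound = cycles_lower_bound(mem_ops=3, flops=2, mem_per_cycle=1, flops_per_cycle=1)
memory_bound = loop_balance(3, 2) > machine_balance(1, 1)
```

When the loop's balance exceeds the machine's, memory traffic rather than arithmetic limits the loop, which is the case transformations that reduce memory references aim to improve.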
The Warp Computer: Architecture, Implementation, and Performance
IEEE Transactions on Computers, 1987
Cited by 42 (2 self)
The Warp machine is a systolic array computer of linearly connected cells, each of which is a programmable processor capable of performing 10 million floating-point operations per second (10 MFLOPS). A typical Warp array includes 10 cells, thus having a peak computation rate of 100 MFLOPS. The Warp array can be extended to include more cells to accommodate applications capable of using the increased computational bandwidth. Warp is integrated as an attached processor into a UNIX host system. Programs for Warp are written in a high-level language supported by an optimizing compiler.
Architecture and Compiler Tradeoffs for a Long Instruction Word Microprocessor
Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, 1989
Cited by 20 (1 self)
A very long instruction word (VLIW) processor exploits parallelism by controlling multiple operations in a single instruction word. This paper describes the architecture and compiler tradeoffs in the design of iWarp, a VLIW single-chip microprocessor developed in a joint project with Intel Corp. The iWarp processor is capable of specifying up to nine operations in an instruction word and has a peak performance of 20 million floating-point operations and 20 million integer operations per second. An optimizing compiler has been constructed and used as a tool to evaluate the different architectural proposals in the development of iWarp. We present here the analysis and compiler optimizations for those architectural features that address two key issues in the design of a VLIW microprocessor: code density and a streamlined execution cycle. We support the results of our analysis with performance data for the Livermore Loops and a selection of programs from the LINPACK library.
Comparing Static And Dynamic Code Scheduling for Multiple-Instruction-Issue Processors
In Proc. of the 24th International Symposium on Microarchitecture, 1991
Cited by 18 (2 self)
This paper examines two alternative approaches to supporting code scheduling for multiple-instruction-issue processors. One is to provide a set of non-trapping instructions so that the compiler can perform aggressive static code scheduling. The application of this approach to existing commercial architectures typically requires extending the instruction set. The other approach is to support out-of-order execution in the microarchitecture so that the hardware can perform aggressive dynamic code scheduling. This approach usually does not require modifying the instruction set but requires complex hardware support. In this paper, we analyze the performance of the two alternative approaches using a set of important nonnumerical C benchmark programs. A distinguishing feature of the experiment is that the code for the dynamic approach has been optimized and scheduled as much as allowed by the architecture. The hardware is only responsible for the additional reordering that cannot be performed...
Allocating Registers in Multiple Instruction-Issuing Processors
In Proceedings of the IFIP WG 10.3 Working Conference on Parallel Architectures and Compilation Techniques, PACT'95, 1995
Cited by 6 (1 self)
This work addresses the problem of scheduling a basic block of operations on a multiple instruction-issuing processor. We show that integrating register constraints into operation sequencing algorithms is a complex problem in itself. Indeed, while scheduling a forest of unit time operations on a processor with P parallel instruction slots can be solved in polynomial time, the problem becomes NP-hard when P is unbounded but only R registers are available. As a result we have devised a concise integer linear programming formulation of this scheduling problem that accounts for both register and instruction issuing constraints. This allows the use of off-the-shelf routines to find optimum solutions, which can then be compared with the results obtained by polynomial-time heuristics. Two such heuristics are given, and their combined results are shown to be optimal in 99.5% of the cases for trees of height at most 6. A byproduct of these experiments is to show that our integer programming f...
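For the polynomial-time case mentioned above (a forest of unit-time operations on P issue slots, with no register limit), a highest-level-first list scheduler in the style of Hu's classic level algorithm gives a feel for the problem. This is a swapped-in textbook technique, not one of the paper's heuristics, and it ignores register constraints entirely. The function name `hu_schedule` and the `parent` encoding are assumptions made for this sketch.

```python
def hu_schedule(parent, slots):
    """Schedule unit-time operations of an in-forest on `slots` issue slots.

    parent[v] is the single operation that consumes v's result, or None if v
    is a root. Returns a list of cycles, each a list of issued operations.
    Assumption: no register limit is modelled.
    """
    n = len(parent)

    def level(v):
        # Number of operations on the path from v to its root, inclusive.
        depth = 1
        while parent[v] is not None:
            v = parent[v]
            depth += 1
        return depth

    levels = [level(v) for v in range(n)]

    # pending[v] counts predecessors of v that have not been issued yet.
    pending = [0] * n
    for v in range(n):
        if parent[v] is not None:
            pending[parent[v]] += 1

    ready = [v for v in range(n) if pending[v] == 0]
    cycles = []
    while ready:
        # Issue the operations farthest from a root first.
        ready.sort(key=lambda v: (-levels[v], v))
        issued, ready = ready[:slots], ready[slots:]
        cycles.append(issued)
        for v in issued:
            p = parent[v]
            if p is not None:
                pending[p] -= 1
                if pending[p] == 0:
                    ready.append(p)  # becomes available in the next cycle
    return cycles


# Binary reduction tree: 0,1 -> 4; 2,3 -> 5; 4,5 -> 6.
tree = [4, 4, 5, 5, 6, 6, None]
plan = hu_schedule(tree, slots=2)
```

Adding a bound of R live values to this loop is exactly where the problem stops being easy, which is what motivates the integer programming formulation described in the abstract.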
Speedup of Band Linear Recurrences in the Presence of Resource Constraints
Proc. 1992 6th ACM International Conf. Supercomputing, 1992
Cited by 4 (4 self)
An m-th order linear recurrence system of N equations computes $x_i = c_i + \sum_{j=i-m}^{i-1} a_{ij} x_j$ for $1 \le i \le N$. Linear recurrences have a role of central importance in computer design, numerical analysis, program analysis, digital signal processing and many non-numerical algorithms. However, programs containing band linear recurrences are difficult to significantly parallelize due to loop-carried dependences. We present a new method for systematically approaching the optimal parallel schedules for computing mth-order linear recurrences with a fixed number of processors p independent of problem size N. Using our method, we first derive two kinds of parallel schedules, called the pipelined schedules and the exact schedules, for parallel evaluation of band linear recurrences. Our schedules have better execution times than the fastest previously published parallel schedules for p ? m 1. In particular, the exact schedules achieve an execution time of $(2m^2 + 3m)N/p$ + (m(m+1)(2m+1...
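The recurrence itself, in which each x_i is c_i plus a weighted sum of the previous m values, can be evaluated sequentially as in the sketch below. This is only the baseline computation, not the pipelined or exact parallel schedules the paper derives. The function name, the 0-based indexing, and the convention that terms with an index before the start of the sequence are dropped are all assumptions.

```python
def solve_band_recurrence(c, a, m):
    """Evaluate x[i] = c[i] + sum(a[i][j] * x[j] for j in i-m .. i-1).

    Uses 0-based indices. Assumption: terms whose index j would fall before
    the start of the sequence are simply omitted.
    """
    x = []
    for i in range(len(c)):
        acc = c[i]
        for j in range(max(0, i - m), i):
            acc += a[i][j] * x[j]
        x.append(acc)
    return x


# First order (m = 1): x[i] = 1 + 2 * x[i-1].
n = 4
a1 = [[2 if j == i - 1 else 0 for j in range(n)] for i in range(n)]
first_order = solve_band_recurrence([1] * n, a1, m=1)

# Second order (m = 2): Fibonacci-style, seeded through c.
n = 5
a2 = [[1 if i >= 2 and i - 2 <= j < i else 0 for j in range(n)] for i in range(n)]
second_order = solve_band_recurrence([1, 1, 0, 0, 0], a2, m=2)
```

The inner loop reads values produced by earlier iterations of the outer loop. That loop-carried dependence is what makes the computation hard to parallelize.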
Stream Algorithms and Architecture
2003
Cited by 4 (2 self)
Wire-exposed, programmable microarchitectures including Trips [11], Smart Memories [8], and Raw [13] offer an opportunity to schedule instruction execution and data movement explicitly. This paper proposes stream algorithms, which, along with a decoupled systolic architecture, provide an excellent match for the physical and technological constraints of single-chip tiled architectures. Stream algorithms enable programmed systolic computations for different problem sizes, without incurring the cost of memory accesses. To that end, we decouple memory accesses from computation and move the memory accesses off the critical path. By structuring computations in systolic phases, and deferring memory accesses to dedicated memory processors, stream algorithms can solve many regular problems with varying sizes on a constant-sized tiled array. Contrary to common sense, the compute efficiency of stream algorithms increases as we increase the number of processing elements. In particular, we show that the compute efficiency of stream algorithms can approach 100% asymptotically, that is, for large numbers of processors and an appropriate problem size.
Loop Optimization Techniques On Multi-Issue Architectures
1994
Contents: Acknowledgments; List of Tables; List of Figures; Chapter I, Introduction (1 Scheduling; 2 Methodology; 3 Research Contributions; 4 Thesis Organization); Chapter II, Instruction...

