Results 1 - 10
of
59
Limits of Control Flow on Parallelism
, 1992
"... This paper discusses three techniques useful in relaxing the constraints imposed by control flow on parallelism: control dependence analysis, executing multiple flows of control simultaneously, and speculative execution. We evaluate these techniques by using trace simulations to find the limits of p ..."
Abstract
-
Cited by 218 (2 self)
- Add to MetaCart
This paper discusses three techniques useful in relaxing the constraints imposed by control flow on parallelism: control dependence analysis, executing multiple flows of control simultaneously, and speculative execution. We evaluate these techniques by using trace simulations to find the limits of parallelism for machines that employ different combinations of these techniques. We have three major results. First, local regions of code have limited parallelism, and control dependence analysis is useful in extracting global parallelism from different parts of a program. Second, a superscalar processor is fundamentally limited because it cannot execute independent regions of code concurrently. Higher performance can be obtained with machines, such as multiprocessors and dataflow machines, that can simultaneously follow multiple flows of control. Finally, without speculative execution to allow instructions to execute before their control dependences are resolved, only modest amounts of parallelism can be obtained for programs with complex control flow.
Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines
, 1989
"... Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruc ..."
Abstract
-
Cited by 192 (13 self)
- Add to MetaCart
Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining. This is a preprint of a paper that will be presented at the 3rd International Conference on Architectur...
Instruction-Level Parallel Processing: History, Overview and Perspective
, 1992
"... Instruction-level Parallelism CILP) is a family of processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel. Although ILP has appeared in the highest performance uniprocessors for the past 30 years, the 1980s saw it become a muc ..."
Abstract
-
Cited by 166 (0 self)
- Add to MetaCart
Instruction-level Parallelism CILP) is a family of processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel. Although ILP has appeared in the highest performance uniprocessors for the past 30 years, the 1980s saw it become a much more significant force in computer design. Several systems were built, and sold commercially, which pushed ILP far beyond where it had been before, both in terms of the amount of ILP offered and in the central role ILP played in the design of the system. By the end of the decade, advanced microprocessor design at all major CPU manufacturers had incorporated ILP, and new techniques for ILP have become a popular topic at academic conferences. This article provides an overview and historical perspective of the field of ILP and its development over the past three decades.
The Multiscalar Architecture
, 1993
"... The centerpiece of this thesis is a new processing paradigm for exploiting instruction level parallelism. This paradigm, called the multiscalar paradigm, splits the program into many smaller tasks, and exploits fine-grain parallelism by executing multiple, possibly (control and/or data) depen-dent t ..."
Abstract
-
Cited by 113 (8 self)
- Add to MetaCart
The centerpiece of this thesis is a new processing paradigm for exploiting instruction level parallelism. This paradigm, called the multiscalar paradigm, splits the program into many smaller tasks, and exploits fine-grain parallelism by executing multiple, possibly (control and/or data) depen-dent tasks in parallel using multiple processing elements. Splitting the instruction stream at statically determined boundaries allows the compiler to pass substantial information about the tasks to the hardware. The processing paradigm can be viewed as extensions of the superscalar and multiprocess-ing paradigms, and shares a number of properties of the sequential processing model and the dataflow processing model. The multiscalar paradigm is easily realizable, and we describe an implementation of the multis-calar paradigm, called the multiscalar processor. The central idea here is to connect multiple sequen-tial processors, in a decoupled and decentralized manner, to achieve overall multiple issue. The mul-tiscalar processor supports speculative execution, allows arbitrary dynamic code motion (facilitated by an efficient hardware memory disambiguation mechanism), exploits communication localities, and does all of these with hardware that is fairly straightforward to build. Other desirable aspects of the
The Expandable Split Window Paradigm for Exploiting Fine-Grain Parallelism
- In Proceedings of the 19th Annual International Symposium on Computer Architecture
, 1992
"... We propose a new processing paradigm, called the Expandable Split Window (ESW) paradigm, for exploiting finegrain parallelism. This paradigm considers a window of instructions (possibly having dependencies) as a single unit, and exploits fine-grain parallelism by overlapping the execution of multipl ..."
Abstract
-
Cited by 110 (10 self)
- Add to MetaCart
We propose a new processing paradigm, called the Expandable Split Window (ESW) paradigm, for exploiting finegrain parallelism. This paradigm considers a window of instructions (possibly having dependencies) as a single unit, and exploits fine-grain parallelism by overlapping the execution of multiple windows. The basic idea is to connect multiple sequential processors, in a decoupled and decentralized manner, to achieve overall multiple issue. This processing paradigm shares a number of properties of the restricted dataflow machines, but was derived from the sequential von Neumann architecture. We also present an implementation of the Expandable Split Window execution model, and preliminary performance results. 1. INTRODUCTION The execution of a program, in an abstract form, can be considered to be a dynamic dataflow graph that encapsulates the data dependencies in the program. The nodes of the graph represent computation operations, and the arcs of the graph represent communication o...
Limits on Multiple Instruction Issue
- in Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems
, 1989
"... This paper demonstrates that highly-optimized, non-scientific applications also contain ample instruction-level concurrency to sustain an execution rate of two instructions per clock cycle. However, the cost requirements necessary to provide the instruction bandwidth needed by the instructionexecuti ..."
Abstract
-
Cited by 101 (5 self)
- Add to MetaCart
This paper demonstrates that highly-optimized, non-scientific applications also contain ample instruction-level concurrency to sustain an execution rate of two instructions per clock cycle. However, the cost requirements necessary to provide the instruction bandwidth needed by the instructionexecution unit make this performance difficult to achieve
Automatic Program Parallelization
, 1993
"... This paper presents an overview of automatic program parallelization techniques. It covers dependence analysis techniques, followed by a discussion of program transformations, including straight-line code parallelization, do loop transformations, and parallelization of recursive routines. The last s ..."
Abstract
-
Cited by 97 (8 self)
- Add to MetaCart
This paper presents an overview of automatic program parallelization techniques. It covers dependence analysis techniques, followed by a discussion of program transformations, including straight-line code parallelization, do loop transformations, and parallelization of recursive routines. The last section of the paper surveys several experimental studies on the effectiveness of parallelizing compilers.
Dynamic Dependency Analysis of Ordinary Programs
- In Proceedings of the 19th Annual International Symposium on Computer Architecture
, 1992
"... A quantitative analysis of program execution is essential to the computer architecture design process. With the current trend in architecture of enhancing the performance of uniprocessors by exploiting fine-grain parallelism, first-order metrics of program execution, such as operation frequencies, a ..."
Abstract
-
Cited by 83 (9 self)
- Add to MetaCart
A quantitative analysis of program execution is essential to the computer architecture design process. With the current trend in architecture of enhancing the performance of uniprocessors by exploiting fine-grain parallelism, first-order metrics of program execution, such as operation frequencies, are not sufficient; characterizing the exact nature of dependencies between operations is essential. This paper presents a methodology for constructing the dynamic execution graph that characterizes the execution of an ordinary program (an application program written in an imperative language such as C or FORTRAN) from a serial execution trace of the program. It then uses the methodology to study parallelism in the SPEC benchmarks. We see that the parallelism can be bursty in nature (periods of lots of parallelism followed by periods of little parallelism), but the average parallelism is quite high, ranging from 13 to 23,302 operations per cycle. Exposing this parallelism requires renaming of...
Code Generation Schema for Modulo Scheduled Loops
- in Proceedings of the 25th Annual International Symposium on Microarchitecture
, 1992
"... Software pipelining is an important instruction scheduling technique for efficiently overlapping successive iterations of loops and executing them in parallel. Modulo scheduling is one approach for generating such schedules. This paper addresses an issue which has received little attention thus far, ..."
Abstract
-
Cited by 80 (6 self)
- Add to MetaCart
Software pipelining is an important instruction scheduling technique for efficiently overlapping successive iterations of loops and executing them in parallel. Modulo scheduling is one approach for generating such schedules. This paper addresses an issue which has received little attention thus far, but which is non-trivial in its complexity: the task of generating correct, high-performance code once the modulo schedule has been generated, taking into account the nature of the loop and the register allocation strategy that will be used. This issue is studied both with and without hardware features that are specifically aimed at supporting modulo scheduling.
Load Latency Tolerance In Dynamically Scheduled Processors
- JOURNAL OF INSTRUCTION LEVEL PARALLELISM
, 1998
"... This paper provides a quantitative evaluation of load latency tolerance in a dynamically scheduled processor. To determine the latency tolerance of each memory load operation, our simulations use flexible load completion policies instead of a fixed memory hierarchy that dictates the latency. Alth ..."
Abstract
-
Cited by 67 (2 self)
- Add to MetaCart
This paper provides a quantitative evaluation of load latency tolerance in a dynamically scheduled processor. To determine the latency tolerance of each memory load operation, our simulations use flexible load completion policies instead of a fixed memory hierarchy that dictates the latency. Although our policies delay load completion as long as possible, they produce performance (instructions committed per cycle (IPC)) comparable to a processor with an ideal memory system where all loads complete in one cycle. Our simulations reveal that to produce IPC values within 12% of a processor with an ideal memory system, between 1% and 71% of loads need to be satisfied within a single cycle and that up to 74% can be satisfied in as many as 32 cycles, depending on the benchmark and processor configuration. Load latency

