Complexity-Effective Superscalar Processors
- In Proceedings of the 24th Annual International Symposium on Computer Architecture
, 1997
Cited by 385 (5 self)
The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0.8 µm, 0.35 µm, and 0.18 µm. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that window wakeup and selection logic as well as operand bypass logic are likely to be the most critical in the future. A microarchitecture that simplifies wakeup and selection logic is proposed and discussed. This implementation puts chains of dependent instructions into queues, and issues instructions from multiple queues in parallel. Simulation shows little slowdown as compared with a completely flexible issue window when performance is measured in clock cycles. Furthermore, because only instructions at queue heads need to be awakened and selected, issue logic is simplified and the clock cycle is faster; consequently, overall performance is improved. By grouping dependent instructions together, the proposed microarchitecture will help minimize performance degradation due to slow bypasses in future wide-issue machines.
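The queue-based issue scheme described in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `steer` and `issue`, the tuple encoding of instructions, unit latency for every instruction, and dispatching the whole program before issuing anything are all assumptions made for illustration. An instruction is appended behind its producer when that producer sits at the tail of a queue; otherwise it takes an empty queue. Each cycle, only queue heads are checked for readiness.

```python
from collections import deque


def steer(instrs, num_fifos):
    """Place each (name, sources) instruction into a FIFO.

    Assumed heuristic: follow a producer that is currently a FIFO tail,
    otherwise take an empty FIFO. A real dispatcher would stall when
    neither option exists; this sketch raises instead.
    """
    fifos = [deque() for _ in range(num_fifos)]
    for name, srcs in instrs:
        # Look for a FIFO whose last entry produces one of our sources.
        target = next((f for f in fifos if f and f[-1][0] in srcs), None)
        if target is None:
            # No producer at a tail: start a new dependence chain.
            target = next((f for f in fifos if not f), None)
        if target is None:
            raise RuntimeError("no suitable FIFO (hardware would stall)")
        target.append((name, srcs))
    return fifos


def issue(fifos, done):
    """Issue cycle by cycle, examining only FIFO heads.

    `done` holds the names of values available before the first cycle.
    Returns one list of issued instruction names per cycle.
    """
    done = set(done)
    schedule = []
    while any(fifos):
        # Readiness is decided before anything issues, so a result
        # produced this cycle is visible only from the next cycle on.
        ready = [f for f in fifos if f and all(s in done for s in f[0][1])]
        if not ready:
            raise RuntimeError("a queue head waits on a value never produced")
        group = [f.popleft()[0] for f in ready]
        done.update(group)
        schedule.append(group)
    return schedule


# Hypothetical five-instruction program: name, then names it depends on.
program = [("a", ()), ("b", ("a",)), ("c", ()), ("d", ("b", "c")), ("e", ("c",))]
queues = steer(program, num_fifos=2)
print([[name for name, _ in q] for q in queues])
print(issue(queues, done=()))
```

With two queues, the sample program is steered into the chains `a, b, d` and `c, e`, and issue proceeds in three cycles: `a` and `c`, then `b` and `e`, then `d`. The selection logic never looks past a queue head, which is the simplification the abstract credits for the faster clock.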
The Case for a Single-Chip Multiprocessor
- IEEE Computer
, 1996
Cited by 326 (5 self)
Advances in IC processing allow for more microprocessor design options. The increasing gate density and cost of wires in advanced integrated circuit technologies require that we look for new ways to use their capabilities effectively. This paper shows that in advanced technologies it is possible to implement a single-chip multiprocessor in the same area as a wide-issue superscalar processor. We find that for applications with little parallelism the performance of the two microarchitectures is comparable. For applications with large amounts of parallelism at both the fine- and coarse-grained levels, the multiprocessor microarchitecture outperforms the superscalar architecture by a significant margin. Single-chip multiprocessor architectures have the advantage in that they offer localized implementation of a high-clock-rate processor for inherently sequential applications and low-latency interprocessor communication for parallel applications.
Limits of Control Flow on Parallelism
, 1992
Cited by 218 (2 self)
This paper discusses three techniques useful in relaxing the constraints imposed by control flow on parallelism: control dependence analysis, executing multiple flows of control simultaneously, and speculative execution. We evaluate these techniques by using trace simulations to find the limits of parallelism for machines that employ different combinations of these techniques. We have three major results. First, local regions of code have limited parallelism, and control dependence analysis is useful in extracting global parallelism from different parts of a program. Second, a superscalar processor is fundamentally limited because it cannot execute independent regions of code concurrently. Higher performance can be obtained with machines, such as multiprocessors and dataflow machines, that can simultaneously follow multiple flows of control. Finally, without speculative execution to allow instructions to execute before their control dependences are resolved, only modest amounts of parallelism can be obtained for programs with complex control flow.
Optimization of Instruction Fetch Mechanisms for High Issue Rates
- In 22nd Annual International Symposium on Computer Architecture
, 1995
Cited by 115 (4 self)
Recent superscalar processors issue four instructions per cycle. These processors are also powered by highly-parallel superscalar cores. The potential performance can only be exploited when fed by high instruction bandwidth. This task is the responsibility of the instruction fetch unit. Accurate branch prediction and low I-cache miss ratios are essential for the efficient operation of the fetch unit. Several studies on cache design and branch prediction address this problem. However, these techniques are not sufficient. Even in the presence of efficient cache designs and branch prediction, the fetch unit must continuously extract multiple, non-sequential instructions from the instruction cache, realign these in the proper order, and supply them to the decoder. This paper explores solutions to this problem and presents several schemes with varying degrees of performance and cost. The most-general scheme, the collapsing buffer, achieves near-perfect performance and consistently aligns in...
The Multiscalar Architecture
, 1993
Cited by 113 (8 self)
The centerpiece of this thesis is a new processing paradigm for exploiting instruction level parallelism. This paradigm, called the multiscalar paradigm, splits the program into many smaller tasks, and exploits fine-grain parallelism by executing multiple, possibly (control and/or data) dependent tasks in parallel using multiple processing elements. Splitting the instruction stream at statically determined boundaries allows the compiler to pass substantial information about the tasks to the hardware. The processing paradigm can be viewed as extensions of the superscalar and multiprocessing paradigms, and shares a number of properties of the sequential processing model and the dataflow processing model. The multiscalar paradigm is easily realizable, and we describe an implementation of the multiscalar paradigm, called the multiscalar processor. The central idea here is to connect multiple sequential processors, in a decoupled and decentralized manner, to achieve overall multiple issue. The multiscalar processor supports speculative execution, allows arbitrary dynamic code motion (facilitated by an efficient hardware memory disambiguation mechanism), exploits communication localities, and does all of these with hardware that is fairly straightforward to build. Other desirable aspects of the
A Chip-Multiprocessor Architecture with Speculative Multithreading
- IEEE Transactions on Computers
, 1999
Cited by 112 (13 self)
Keywords: Chip-multiprocessor, speculative multithreading, data-dependence speculation, control speculation. 1 Introduction. The superscalar approach [12], which allows more than one instruction to be issued in a single cycle, has become the norm for today's high-performance microprocessors. The issue rate of these microprocessors has continued to increase over the past few years, with today's high-performance superscalar processors such as the Compaq Alpha 21264 [4], IBM PowerPC [16], Intel Pentium-Pro [3] or MIPS R10000 [19] able to issue up to four instructions per cycle.
The Superthreaded Architecture: Thread Pipelining with Run-time Data Dependence Checking and Control Speculation
, 1996
Cited by 111 (11 self)
This paper presents a new concurrent multiple-threaded architectural model, called superthreading, for exploiting thread-level parallelism on a processor. This architectural model adopts a thread pipelining execution model that allows threads with data dependences and control dependences to be executed in parallel. The basic idea of thread pipelining is to compute and forward recurrence data and possible dependent store addresses to the next thread as soon as possible, so the next thread can start execution and perform run-time data dependence checking. Thread pipelining also forces contiguous threads to perform their memory write-backs in order, which enables the compiler to fork threads with control speculation. With run-time support for data dependence checking and control speculation, the superthreaded architectural model can exploit loop-level parallelism from a broad range of applications. 1 Introduction. As the rapid progress of VLSI technology allows microprocessors to have more...
The M-Machine Multicomputer
, 1995
Cited by 100 (13 self)
The M-Machine is an experimental multicomputer being developed to test architectural concepts motivated by the constraints of modern semiconductor technology and the demands of programming systems. The M-Machine computing nodes are connected with a 3-D mesh network; each node is a multithreaded processor incorporating 12 function units, on-chip cache, and local memory. The multiple function units are used to exploit both instruction-level and thread-level parallelism. A user-accessible message passing system yields fast communication and synchronization between nodes. Rapid access to remote memory is provided transparently to the user with a combination of hardware and software mechanisms. This paper presents the architecture of the M-Machine and describes how its mechanisms attempt to maximize both single-thread performance and overall system throughput. The architecture is complete and the MAP chip, which will serve as the M-Machine processing node, is currently being implemented.
Energy Efficient CMOS Microprocessor Design
- In Proc. of The HICSS Conference
, 1995
Cited by 99 (3 self)
Reduction of power dissipation in microprocessor design is becoming a key design constraint. This is motivated not only by portable electronics, in which battery weight and size is critical, but by heat dissipation issues in larger desktop and parallel machines as well. By identifying the major modes of computation of these processors and by proposing figures of merit for each of these modes, a power analysis methodology is developed. It allows the energy efficiency of various architectures to be quantified, and provides techniques for either individually optimizing or trading off throughput and energy consumption. The methodology is then used to qualify three important design principles for energy efficient microprocessor design. 1: Introduction Throughput and area have been the main forces driving microprocessor design, but recently the explosive growth in portable electronics has forced a shift in these design optimizations toward more power conscious solutions. Even for desktop un...

