Results 1 - 10
of
17
Complexity-Effective Superscalar Processors
- In Proceedings of the 24th Annual International Symposium on Computer Architecture
, 1997
"... The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for ..."
Abstract
-
Cited by 385 (5 self)
- Add to MetaCart
The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0:8 m, 0:35 m, and0:18 m. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that window wakeup and selection logic as well as operand bypass logic are likely to be the most critical in the future. A microarchitecture that simplifies wakeup and selection logic is proposed and discussed. This implementation puts chains of dependent instructions into queues, and issues instructions from multiple queues in parallel. Simulation shows little slowdown as compared with a completely flexible issue window when performance is measured in clock cycles. Furthermore, because only instructions at queue heads need to be awakened and selected, issue logic is simplified and the clock cycle is faster – consequently overall performance is improved. By grouping dependent instructions together, the proposed microarchitecture will help minimize performance degradation due to slow bypasses in future wide-issue machines. 1
Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures
, 2000
"... The doubling of microprocessor performance every three years has been the result of two factors: more transistors per chip and superlinear scaling of the processor clock with technology generation. Our results show that, due to both diminishing improvements in clock rates and poor wire scaling as se ..."
Abstract
-
Cited by 264 (22 self)
- Add to MetaCart
The doubling of microprocessor performance every three years has been the result of two factors: more transistors per chip and superlinear scaling of the processor clock with technology generation. Our results show that, due to both diminishing improvements in clock rates and poor wire scaling as semiconductor devices shrink, the achievable performance growth of conventional microarchitectures will slow substantially. In this paper, we describe technology-driven models for wire capacitance, wire delay, and microarchitectural component delay. Using the results of these models, we measure the simulated performance---estimating both clock rate and IPC--- of an aggressive out-of-order microarchitecture as it is scaled from a 250nm technology to a 35nm technology. We perform this analysis for three clock scaling targets and two microarchitecture scaling strategies: pipeline scaling and capacity scaling. We find that no scaling strategy permits annual performance improvements of better than 12.5%, which is far worse than the annual 50-60% to which we have grown accustomed. 1
Instruction-Level Parallel Processing: History, Overview and Perspective
, 1992
"... Instruction-level Parallelism CILP) is a family of processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel. Although ILP has appeared in the highest performance uniprocessors for the past 30 years, the 1980s saw it become a muc ..."
Abstract
-
Cited by 166 (0 self)
- Add to MetaCart
Instruction-level Parallelism CILP) is a family of processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel. Although ILP has appeared in the highest performance uniprocessors for the past 30 years, the 1980s saw it become a much more significant force in computer design. Several systems were built, and sold commercially, which pushed ILP far beyond where it had been before, both in terms of the amount of ILP offered and in the central role ILP played in the design of the system. By the end of the decade, advanced microprocessor design at all major CPU manufacturers had incorporated ILP, and new techniques for ILP have become a popular topic at academic conferences. This article provides an overview and historical perspective of the field of ILP and its development over the past three decades.
The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays
- in Proceedings of the 29th Annual International Symposium on Computer Architecture
, 2002
"... Microprocessor clock frequency has improved by nearly 40% annually over the past decade. This improvement has been provided, in equal measure, by smaller technologies and deeper pipelines. From our study of the SPEC 2000 bench-marks, we find that for a high-performance architecture imple-mented in l ..."
Abstract
-
Cited by 100 (14 self)
- Add to MetaCart
Microprocessor clock frequency has improved by nearly 40% annually over the past decade. This improvement has been provided, in equal measure, by smaller technologies and deeper pipelines. From our study of the SPEC 2000 bench-marks, we find that for a high-performance architecture imple-mented in lOOnm technology, the optimal clock period is ap-proximately 8fan-out-of-four ( F04) inverter delays for integer benchmarks, comprised of 6 F04 of useful work and an over-head of about 2 F04. The optimal clock period for floating-point benchmarks is 6F04. We find these optimal points to be insensitive to latch and clock skew overheads. Our study indi-cates that further pipelining can at best improve performance of integer programs by a factor of 2 over current designs. At these high clock frequencies it will be difficult to design the instruction issue window to operate in a single cycle. Con-sequently, we propose and evaluate a high-frequency design called a segmented instruction window. 1
The Microarchitecture of Superscalar Processors
, 1995
"... Superscalar processing is the latest in a long series of innovations aimed at producing ever-faster microprocessors. By exploiting instruction-level parallelism, superscalar processors are capable of executing more than one instruction in a clock cycle. This paper discusses the microarchitecture of ..."
Abstract
-
Cited by 76 (0 self)
- Add to MetaCart
Superscalar processing is the latest in a long series of innovations aimed at producing ever-faster microprocessors. By exploiting instruction-level parallelism, superscalar processors are capable of executing more than one instruction in a clock cycle. This paper discusses the microarchitecture of superscalar processors. We begin with a discussion of the general problem solved by superscalar processors: converting an ostensibly sequential program into a more parallel one. The principles underlying this process, and the constraints that must be met, are discussed. The paper then provides a description of the specific implementation techniques used in the important phases of superscalar processing. The major phases include: i) instruction fetching and conditional branch processing, ii) the determination of data dependences involving register values, iii) the initiation, or issuing, of instructions for parallel execution, iv) the communication of data values through memory via loads and stores, and v) committing the process state in correct order so that precise interrupts can be supported. Examples of recent superscalar microprocessors, the MIPS R10000, the DEC 21164, and the AMD K5 are used to illustrate a variety of superscalar methods.
Quantifying the Complexity of Superscalar Processors
, 1996
"... The delay of pipeline structures in superscalar processors are studied to determine their potential for limiting clock cycle times in future designs. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and o ..."
Abstract
-
Cited by 72 (0 self)
- Add to MetaCart
The delay of pipeline structures in superscalar processors are studied to determine their potential for limiting clock cycle times in future designs. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0:8 m, 0:35 m, and 0:18 m.
Optimizing pipelines for power and performance
- in International Symposium on Microarchitecture (MICRO35), Nov. 2002. Selected as one of the four Best IBM Research Papers in Computer Science, Electrical Engineering and Math published in
, 2002
"... During the concept phase and definition of next generation high-end processors, power and performance will need to be weighted appropriately to deliver competitive cost/performance. It is not enough to adopt a CPI-centric view alone in early-stage definition studies. One of the fundamental issues co ..."
Abstract
-
Cited by 36 (3 self)
- Add to MetaCart
During the concept phase and definition of next generation high-end processors, power and performance will need to be weighted appropriately to deliver competitive cost/performance. It is not enough to adopt a CPI-centric view alone in early-stage definition studies. One of the fundamental issues confronting the architect at this stage is the choice of pipeline depth and target frequency. In this paper we present an optimization methodology that starts with an analytical power-performance model to derive optimal pipeline depth for a superscalar processor. The results are validated and further refined using detailed simulation based analysis. As part of the power-modeling methodology, we have developed equations that model the variation of energy as a function of pipeline depth. Our results using a set of SPEC2000 applications show that when both power and performance are considered for optimization, the optimal clock period is around 18 FO4. We also provide a detailed sensitivity analysis of the optimal pipeline depth against key assumptions of these energy models. 1
The Effect of Technology Scaling on Microarchitectural Structures
, 2000
"... In this report, we describe technology-driven models for wire capacitance, wire delay, and microar- chitectural component delay. We used a 3D-field solver (Space3D) to generate our capacitance model based on technology parameters derived from the International Technology Roadmap for Semiconductors ( ..."
Abstract
-
Cited by 24 (5 self)
- Add to MetaCart
In this report, we describe technology-driven models for wire capacitance, wire delay, and microar- chitectural component delay. We used a 3D-field solver (Space3D) to generate our capacitance model based on technology parameters derived from the International Technology Roadmap for Semiconductors (ITRS). This reports shows how we used the ITRS parameters and our capaci- tance model produce a reasonable set of scaling rules that we then used to modify an existing cache model (ECACTI). Finally the report then shows how this modified model can be used to generate access times for various microarchitectural structures like caches, register files, TLBs and other array like structures. Based on the wire model, it is predicted that it will take 12-32 cycles to traverse the chip in top-level metal under optimistic assumptions about adjoining wires. Based on our component model it is predicted that access a 64KB cache in a 35nm technology with a clock rate of 13.5GHz (as projected by ITRS) will take as many as 7 cycles. To obtain lower latency accesses, designers will have to either reduce the clock rate or use smaller storage structures in such a senario.
Integrated analysis of power and performance for pipelined microprocessors
- IEEE Transactions on Computers
, 2004
"... been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be ..."
Abstract
-
Cited by 18 (8 self)
- Add to MetaCart
been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). Copies may be requested from IBM T. J. Watson Research Center, P.
Power-Optimal Pipelining in Deep Submicron Technology
- In International Symposium on Low Power Electronics and Design
, 2004
"... This paper explores the effectiveness of pipelining as a power saving tool, where the reduction in logic depth per stage is used to reduce supply voltage at a fixed clock frequency. We examine poweroptimal pipelining in deep submicron technology, both analytically and by simulation. Simulation uses ..."
Abstract
-
Cited by 8 (4 self)
- Add to MetaCart
This paper explores the effectiveness of pipelining as a power saving tool, where the reduction in logic depth per stage is used to reduce supply voltage at a fixed clock frequency. We examine poweroptimal pipelining in deep submicron technology, both analytically and by simulation. Simulation uses a 70 nm predictive process with a fanout-of-four inverter chain model including input/output flipflops, and results are shown to match theory well. The simulation results show that power-optimal logic depth is 6 to 8 FO4 and optimal power saving varies from 55 to 80 % compared to a 24 FO4 logic depth, depending on threshold voltage, activity factor, and presence of clock-gating. We decompose the power consumption of a circuit into three components, switching power, leakage power, and idle power, and present the following insights into power-optimal pipelining. First, power-optimal logic depth decreases and optimal power savings increase for larger activity factors, where switching power dominates over leakage and idle power. Second, pipelining is more effective with lower threshold voltages at high activity factors, but higher threshold voltages give better results at lower activity factors where leakage current dominates. Lastly, clock-gating enables deeper pipelining and more power saving because it reduces timing element overhead when the activity factor is low.

