Results 1 - 10
of
23
Complexity-Effective Superscalar Processors
- In Proceedings of the 24th Annual International Symposium on Computer Architecture
, 1997
"... The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for ..."
Abstract
-
Cited by 385 (5 self)
- Add to MetaCart
The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0:8 m, 0:35 m, and0:18 m. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that window wakeup and selection logic as well as operand bypass logic are likely to be the most critical in the future. A microarchitecture that simplifies wakeup and selection logic is proposed and discussed. This implementation puts chains of dependent instructions into queues, and issues instructions from multiple queues in parallel. Simulation shows little slowdown as compared with a completely flexible issue window when performance is measured in clock cycles. Furthermore, because only instructions at queue heads need to be awakened and selected, issue logic is simplified and the clock cycle is faster – consequently overall performance is improved. By grouping dependent instructions together, the proposed microarchitecture will help minimize performance degradation due to slow bypasses in future wide-issue machines. 1
Quantifying the Complexity of Superscalar Processors
, 1996
"... The delay of pipeline structures in superscalar processors are studied to determine their potential for limiting clock cycle times in future designs. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and o ..."
Abstract
-
Cited by 72 (0 self)
- Add to MetaCart
The delay of pipeline structures in superscalar processors are studied to determine their potential for limiting clock cycle times in future designs. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0:8 m, 0:35 m, and 0:18 m.
A Scalable Register File Architecture for Dynamically Scheduled Processors
- In International Conference on Parallel Architectures and Compilation Techniques
, 1996
"... A major obstacle in designing dynamically scheduled processors is the size and port requirement of the register file. By using a multiple banked register file and performing dynamic result renaming, a scalable register file architecture can be implemented without performance degradation. In addition ..."
Abstract
-
Cited by 57 (1 self)
- Add to MetaCart
A major obstacle in designing dynamically scheduled processors is the size and port requirement of the register file. By using a multiple banked register file and performing dynamic result renaming, a scalable register file architecture can be implemented without performance degradation. In addition, a new hybrid register renaming technique to efficiently map the logical to physical registers and reduce the branch misprediction penalty is introduced. The performance was simulated using the SPEC95 benchmark suite. 1 Introduction The instruction level parallelism in programs allows a superscalar microprocessor to execute multiple instructions per cycle. It can be exploited by hardware providing resources to perform dynamic instruction scheduling. Independent instructions can be discovered at run-time and scheduled to functional units out-of-order. To avoid anti and outputdependencies, registers can be renamed by using a reorder buffer [5] or a mapping table [10]. The Power PC [7], Penti...
Hierarchical scheduling windows
- In Proceedings of the 35th International Symposium on Microarchitecture
, 2002
"... Large scheduling windows are an effective mechanism for increasing microprocessor performance through the extraction of instruction level parallelism. Current techniques do not scale effectively for very large windows, leading to slow wakeup and select logic as well as large complicated bypass netwo ..."
Abstract
-
Cited by 38 (1 self)
- Add to MetaCart
Large scheduling windows are an effective mechanism for increasing microprocessor performance through the extraction of instruction level parallelism. Current techniques do not scale effectively for very large windows, leading to slow wakeup and select logic as well as large complicated bypass networks. This paper introduces a new instruction scheduler implementation, referred to as Hierarchical Scheduling Windows or HSW, which exploits latency tolerant instructions in order to reduce implementation complexity. HSW yields a very large instruction window that tolerates wakeup, select, and bypass latency, while extracting significant far-flung ILP. Results: It is shown that HSW loses <0.5 % performance per additional cycle of bypass/select/wakeup latency as compared to a monolithic window that loses ~5 % per additional cycle. Also, HSW achieves the performance of traditional implementations with only 1/3 to 1/2 the number of entries in the critical timing path. 1.
Use-Based Register Caching with Decoupled Indexing
- In Proceedings of the 31st annual international symposium on Computer architecture
, 2004
"... Wide, deep pipelines need many physical registers to hold the results of in-flight instructions. Simultaneously, high clock frequencies prohibit using large register files and bypass networks without a significant performance penalty. Previously proposed techniques using register caching to reduce t ..."
Abstract
-
Cited by 22 (1 self)
- Add to MetaCart
Wide, deep pipelines need many physical registers to hold the results of in-flight instructions. Simultaneously, high clock frequencies prohibit using large register files and bypass networks without a significant performance penalty. Previously proposed techniques using register caching to reduce this penalty suffer from several problems including poor insertion and replacement decisions and the need for a fully-associative cache for good performance. We present novel mechanisms for managing and indexing register caches that address these problems using knowledge of the number of consumers of each register value. The insertion policy reduces pollution by not caching a register value when all of its predicted consumers are satisfied by the bypass network. The replacement policy selects register cache entries with the fewest remaining uses (often zero), lowering the miss rate. We also introduce a new, general method of mapping physical registers to register cache sets that improves the performance of set-associative cache organizations by reducing conflicts. Our results indicate that a 64-entry, two-way set associative cache using these techniques outperforms multi-cycle monolithic register files and previously proposed hierarchical register files. 1.
A complexity-effective approach to alu bandwidth enhancement for instruction-level temporal redundancy
- In Proceedings of the International Symposium on Computer Architecture (ISCA
, 2004
"... Previous proposals for implementing instruction-level temporal redundancy in out-of-order cores have reported a performance degradation of upto 45 % in certain applications compared to an execution which does not have any temporal redundancy. An important contributor to this problem is the insuffici ..."
Abstract
-
Cited by 22 (4 self)
- Add to MetaCart
Previous proposals for implementing instruction-level temporal redundancy in out-of-order cores have reported a performance degradation of upto 45 % in certain applications compared to an execution which does not have any temporal redundancy. An important contributor to this problem is the insufficient number of ALUs for handling the amplified load injected into the core. At the same time, increasing the number of ALUs can increase the complexity of the issue logic, which has been pointed out to be one of the most timing critical components of the processor. This paper proposes a novel extension of a prior idea on instruction reuse to ease ALU bandwidth requirements in a complexity-effective way by exploiting certain interesting properties of a dual (temporally redundant) instruction stream. We present microarchitectural extensions necessary for implementing an instruction reuse buffer (IRB) and integrating this with the issue logic of a dual instruction stream superscalar core, and conduct extensive evaluations to demonstrate how well it can alleviate the ALU bandwidth problem. We show that on the average we can gain back nearly 50% of the IPC loss that occurred due to ALU bandwidth limitations for an instruction-level temporally redundant superscalar execution, and 23 % of the overall IPC loss. Keywords: Complexity-effective design, Instruction Reuse, Temporal Redundancy.
Routed Inter-ALU Networks for ILP Scalability and Performance
, 2003
"... Modern processors rely heavily on broadcast networks to bypass instruction results to dependent instructions in the pipeline. However, as clock rates increase, architectures get wider, and pipelines get deeper, broadcasting becomes more complex, slower, and more difficult to implement. This complexi ..."
Abstract
-
Cited by 10 (1 self)
- Add to MetaCart
Modern processors rely heavily on broadcast networks to bypass instruction results to dependent instructions in the pipeline. However, as clock rates increase, architectures get wider, and pipelines get deeper, broadcasting becomes more complex, slower, and more difficult to implement. This complexity is compounded by shrinking feature size, as the communication speed decreases relative to transistor switching speeds. This paper examines the fundamental needs of bypassing networks and proposes a method for classifying these Inter-ALU Networks based on how operands are routed from producers to consumers. We then propose and evaluate at both the circuit and architectural level a fine grain point-to-point Routed Inter-ALU Network (RIAN) that delivers the same or higher instruction throughput as a full bypass network but at higher speeds while using fewer wires.
Using Internal Redundant Representations and Limited Bypass to Support Pipelined Adders and Register Files
- In Proceedings of the 8th Annual International Symposium on High-Performance Computer Architecture
, 2001
"... This paper evaluates the use of redundant binary and pipelined 2's complement adders in out-of-order execution cores. Redundant binary adders reduce the ADD latency to less than half that of traditional 2's complement adders, allowing higher core clock frequencies and greater execution bandwidth (in ..."
Abstract
-
Cited by 8 (1 self)
- Add to MetaCart
This paper evaluates the use of redundant binary and pipelined 2's complement adders in out-of-order execution cores. Redundant binary adders reduce the ADD latency to less than half that of traditional 2's complement adders, allowing higher core clock frequencies and greater execution bandwidth (in instructions per second). Pipelined 2's complement adders allow a higher clock frequency, but do not reduce the ADD latency. Machines with redundant binary adders are compared to machines with 2's complement adders and the same execution bandwidth and bypass network complexity. Results show that on the SPECint95 benchmarks, the average IPC of an 8-wide machine with 1cycle redundant binary adders is 9% higher than a machine using 2-cycle pipelined adders.
PBExplore: A framework for compiler-in-the-loop exploration of partial bypassing in embedded processors
- In DATE ’05: Proceedings of the conference on Design, Automation and Test in Europe
, 2005
"... Varying partial bypassing in pipelined processors is an effective way to make performance, area and energy tradeoffs in embedded processors. However, performance evaluation of partial bypassing in processors has been inaccurate, largely due to the absence of bypass-sensitive retargetable compilation ..."
Abstract
-
Cited by 8 (4 self)
- Add to MetaCart
Varying partial bypassing in pipelined processors is an effective way to make performance, area and energy tradeoffs in embedded processors. However, performance evaluation of partial bypassing in processors has been inaccurate, largely due to the absence of bypass-sensitive retargetable compilation techniques. Furthermore no existing partial bypass exploration framework estimates the power and cost overhead of partial bypassing. In this paper we present PBExplore: A framework for Compiler-in-the-Loop exploration of partial bypassing in processors. PBExplore accurately evaluates the performance of a partially bypassed processor using a generic bypass-sensitive compilation technique. It synthesizes the bypass control logic and estimates the area and energy overhead of each bypass configuration. PBExplore is thus able to effectively perform multi-dimensional exploration of the partial bypass design space. We present experimental results on the Intel XScale architecture on MiBench benchmarks and demonstrate the need, utility and exploration capabilities of PBExplore. 1
Heterogeneous Architecture Models for Interconnect-Motivated System Design
- IEEE Trans. on VLSI Systems, Special Issue on System-Level Interconnect Prediction
, 2000
"... Abstract—On-chip interconnect demand is becoming the dominant factor in modern processor performance and must be estimated early in the design process. This paper presents a set of heterogeneous architectural models that combines architecture description and Rent’s Rule-based wiring models. These ar ..."
Abstract
-
Cited by 7 (2 self)
- Add to MetaCart
Abstract—On-chip interconnect demand is becoming the dominant factor in modern processor performance and must be estimated early in the design process. This paper presents a set of heterogeneous architectural models that combines architecture description and Rent’s Rule-based wiring models. These architecture models allow flexible heterogeneous system specifications, enabling investigations of prospective designs in different technology scenarios. Comparisons against actual data demonstrate the models ’ effectiveness for architecture explorations with highly accurate estimations of local and global wiring demand, as well as chip area and cycle time. Simulation of two candidate system designs reveal trends in interconnect delay with increasing architectural complexity, and confirm the need for high computational locality and short global wires for future architectures. Index Terms—Architecture modeling, interconnect prediction, VLSI modeling, wire-demand prediction. I.

