Complexity-Effective Superscalar Processors
- In Proceedings of the 24th Annual International Symposium on Computer Architecture
, 1997
Cited by 385 (5 self)
The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0.8 µm, 0.35 µm, and 0.18 µm. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that window wakeup and selection logic as well as operand bypass logic are likely to be the most critical in the future. A microarchitecture that simplifies wakeup and selection logic is proposed and discussed. This implementation puts chains of dependent instructions into queues, and issues instructions from multiple queues in parallel. Simulation shows little slowdown as compared with a completely flexible issue window when performance is measured in clock cycles. Furthermore, because only instructions at queue heads need to be awakened and selected, issue logic is simplified and the clock cycle is faster; consequently, overall performance is improved. By grouping dependent instructions together, the proposed microarchitecture will help minimize performance degradation due to slow bypasses in future wide-issue machines.
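The queue-based idea in this abstract can be illustrated with a minimal sketch. The steering rule used here (append to a queue whose tail produces one of the instruction's sources, otherwise take an empty queue, otherwise stall) is an assumption made for illustration, not necessarily the paper's exact heuristic; the function name `steer` and the `(dest, sources)` instruction encoding are hypothetical.

```python
def steer(instructions, num_fifos):
    """Steer instructions into FIFOs so each FIFO holds a dependence chain.

    instructions: list of (dest_register, [source_registers]) in program order.
    Returns a list of FIFOs, each a list of instruction indices.
    Assumed rule (illustrative only): follow a producer sitting at a FIFO tail,
    else start a new chain in an empty FIFO, else report a dispatch stall.
    """
    fifos = [[] for _ in range(num_fifos)]
    tail_dest = [None] * num_fifos  # register written by each FIFO's tail instruction

    for idx, (dest, srcs) in enumerate(instructions):
        target = None
        # Prefer a FIFO whose tail instruction produces one of our sources.
        for q in range(num_fifos):
            if fifos[q] and tail_dest[q] in srcs:
                target = q
                break
        # Otherwise start a new chain in the first empty FIFO.
        if target is None:
            for q in range(num_fifos):
                if not fifos[q]:
                    target = q
                    break
        if target is None:
            # This sketch never drains FIFOs, so it simply reports the stall.
            raise RuntimeError(f"no FIFO available for instruction {idx}")
        fifos[target].append(idx)
        tail_dest[target] = dest
    return fifos


program = [("r1", []), ("r2", ["r1"]), ("r3", []), ("r4", ["r2", "r3"])]
print(steer(program, num_fifos=2))  # → [[0, 1, 3], [2]]
```

Only the instruction at the head of each FIFO would then need to be considered for wakeup and selection, which is the simplification the abstract describes.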
Quantifying the Complexity of Superscalar Processors
, 1996
Cited by 72 (0 self)
The delays of pipeline structures in superscalar processors are studied to determine their potential for limiting clock cycle times in future designs. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0.8 µm, 0.35 µm, and 0.18 µm.
A Scalable Register File Architecture for Dynamically Scheduled Processors
- In International Conference on Parallel Architectures and Compilation Techniques
, 1996
Cited by 57 (1 self)
A major obstacle in designing dynamically scheduled processors is the size and port requirement of the register file. By using a multiple banked register file and performing dynamic result renaming, a scalable register file architecture can be implemented without performance degradation. In addition, a new hybrid register renaming technique to efficiently map the logical to physical registers and reduce the branch misprediction penalty is introduced. The performance was simulated using the SPEC95 benchmark suite. 1 Introduction The instruction level parallelism in programs allows a superscalar microprocessor to execute multiple instructions per cycle. It can be exploited by hardware providing resources to perform dynamic instruction scheduling. Independent instructions can be discovered at run-time and scheduled to functional units out-of-order. To avoid anti- and output dependencies, registers can be renamed by using a reorder buffer [5] or a mapping table [10]. The Power PC [7], Penti...
The Energy Complexity of Register Files
- In ISLPED
, 1997
Cited by 54 (2 self)
Register files represent a substantial portion of the energy budget in modern processors, and are growing rapidly with the trend towards larger Instruction Level Parallelism (ILP). The energy cost of a register file access depends greatly on the register file circuitry used. This paper compares various register file circuitry techniques for their energy efficiencies, as a function of the architectural parameters such as the number of registers and the number of ports. The Port Priority Selection technique combined with differential reads and low-swing writes was found to be the most energy efficient and provided significant energy savings compared to traditional approaches in the case of large register files. The dependence of register file access energy upon technology scaling is also studied. However, as this paper shows, it appears that none of these will be enough to prevent centralized register files from becoming the dominant power component of next-generation superscalar compute...
Exploiting Dead Value Information
- In 30th International Symposium on Microarchitecture
, 1997
Cited by 47 (0 self)
We describe Dead Value Information (DVI) and introduce three new optimizations which exploit it. DVI provides assertions that certain register values are dead, meaning they will not be read before being overwritten. The processor can use DVI to track dead registers and dynamically eliminate unnecessary save and restore instructions from the execution stream at procedure calls and context switches. Our results indicate that dynamic save and restore instances can be reduced by 46% for procedure calls and by 51% for context switches. In addition, save/restore elimination for procedure calls can improve overall performance by up to 5%. DVI also allows the processor to manage physical registers efficiently, reducing the size requirements of the physical register file. When the system clock rate is proportional to the register file cycle time, this optimization can improve performance. All of these optimizations can be supported with only a few new instructions and minimal additional hardware structures.
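As a rough illustration of save/restore elimination, the sketch below filters a toy instruction stream. Everything in it is an assumption made for illustration: the `(op, register)` encoding, the `kill` operation standing in for a DVI assertion, and the name `filter_saves_restores` are hypothetical, and nested calls are not modeled.

```python
def filter_saves_restores(stream):
    """Drop saves of registers asserted dead, plus their matching restores.

    stream: list of (op, register) pairs, where op is one of
    "kill" (hypothetical DVI assertion), "write", "save", "restore",
    or any other opcode, which is passed through unchanged.
    Returns the operations that would still execute.
    """
    dead = set()     # registers currently asserted dead
    skipped = set()  # registers whose save was eliminated
    executed = []

    for op, reg in stream:
        if op == "kill":
            dead.add(reg)            # value will not be read before overwrite
        elif op == "write":
            dead.discard(reg)        # register holds a live value again
            executed.append((op, reg))
        elif op == "save":
            if reg in dead:
                skipped.add(reg)     # nothing worth saving
            else:
                executed.append((op, reg))
        elif op == "restore":
            if reg in skipped:
                skipped.discard(reg)  # no saved value to bring back
            else:
                executed.append((op, reg))
        else:
            executed.append((op, reg))
    return executed


stream = [("kill", "r5"), ("save", "r4"), ("save", "r5"),
          ("write", "r5"), ("restore", "r5"), ("restore", "r4")]
print(filter_saves_restores(stream))
# → [('save', 'r4'), ('write', 'r5'), ('restore', 'r4')]
```

In this toy stream the save and restore of `r5` disappear because `r5` was asserted dead before the call, while `r4` is saved and restored as usual.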
Half-Price Architecture
- In Proceedings of the International Symposium on Computer Architecture
, 2003
Cited by 32 (2 self)
Current-generation microprocessors are designed to process instructions with one and two source operands at equal cost. Handling two source operands requires multiple ports for each instruction in structures--such as the register file and wakeup logic--which are often in the processor critical timing paths. We argue that these structures are overdesigned since only a small fraction of instructions require two source operands to be processed simultaneously. In this paper, we propose the half-price architecture that judiciously removes this overdesign by restricting the processor capability to handle two source operands in certain timing-critical cases. Two techniques are proposed and evaluated: one for the wakeup logic is sequential wakeup, which decouples half of the tag matching logic from the wakeup bus to reduce the load capacitance of the bus. The other technique for the register file is sequential register access, which halves the register read ports by sequentially accessing two values using a single port when needed. We show that a pipeline that optimizes scheduling and register access for a single operand achieves nearly the same performance as an ideal base machine that fully handles two operands, with 2.2% (worst case 4.8%) IPC degradation.
Distributed Modulo Scheduling
, 1999
Cited by 29 (5 self)
Wide-issue ILP machines can be built using the VLIW approach as many of the hardware complexities found in superscalar processors can be transferred to the compiler. However, the scalability of VLIW architectures is still constrained by the size and number of ports of the register file required by a large number of functional units. Organizations composed of clusters of a few functional units and small private register files have been proposed to deal with this problem, an approach highly dependent on scheduling and partitioning strategies. This paper presents DMS, an algorithm that integrates modulo scheduling and code partitioning in a single procedure. Experimental results have shown the algorithm is effective for configurations up to 8 clusters, or even more when targeting vectorizable loops. Keywords: ILP, VLIW, Clustering, Software Pipelining 1. Introduction Current microprocessor technology relies on two basic approaches to improve performance. One is to increase clock rates...
Software-directed register deallocation for simultaneous multithreaded processors
- IEEE Transactions on Parallel and Distributed Systems
, 1999
Cited by 27 (5 self)
This paper proposes and evaluates software techniques that increase register file utilization for simultaneous multithreading (SMT) processors. SMT processors require large register files to hold multiple thread contexts that can issue instructions, out of order, every cycle. By supporting better inter-thread sharing and management of physical registers, an SMT processor can reduce the number of registers required and can improve performance for a given register file size. Our techniques specifically target register deallocation. While out-of-order processors with register renaming are effective at knowing when a new physical register must be allocated, they are limited in knowing when physical registers can be deallocated. We propose architectural extensions that permit the compiler and operating system to (1) free registers immediately upon their last use, and (2) free registers allocated to idle thread contexts. Our results, based on detailed instruction-level simulations of an SMT processor, show that these techniques can increase performance significantly for register-intensive, multithreaded programs.
Cherry-MP: correctly integrating checkpointed early resource recycling in chip multiprocessors
- In International Symposium on Microarchitecture
, 2005
Cited by 11 (0 self)
Checkpointed Early Resource Recycling (Cherry) is a recently-proposed micro-architectural technique that aims at improving critical resource utilization by performing aggressive resource recycling decoupled from instruction retirement, using a checkpoint/rollback mechanism to recover from occasional incorrect execution. In this paper, we explore correctness and performance issues that arise when Cherry-enabled processors are used in chip multiprocessor architectures. We propose mechanisms to address cache coherence, memory consistency, and forward progress issues in such environments. We also provide quantitative insight on the performance impact of the Cherry mechanism on parallel processing.
Heterogeneous Architecture Models for Interconnect-Motivated System Design
- IEEE Trans. on VLSI Systems, Special Issue on System-Level Interconnect Prediction
, 2000
Cited by 7 (2 self)
On-chip interconnect demand is becoming the dominant factor in modern processor performance and must be estimated early in the design process. This paper presents a set of heterogeneous architectural models that combines architecture description and Rent’s Rule-based wiring models. These architecture models allow flexible heterogeneous system specifications, enabling investigations of prospective designs in different technology scenarios. Comparisons against actual data demonstrate the models’ effectiveness for architecture explorations with highly accurate estimations of local and global wiring demand, as well as chip area and cycle time. Simulations of two candidate system designs reveal trends in interconnect delay with increasing architectural complexity, and confirm the need for high computational locality and short global wires for future architectures. Index Terms: Architecture modeling, interconnect prediction, VLSI modeling, wire-demand prediction.

