Reducing the Complexity of the Register File in Dynamic Superscalar Processors
2001. Cited by 75 (1 self)
Dynamic superscalar processors execute multiple instructions out-of-order by looking for independent operations within a large window. The number of physical registers within the processor has a direct impact on the size of this window as most in-flight instructions require a new physical register at dispatch. A large multiported register file helps improve the instruction-level parallelism (ILP), but may have a detrimental effect on clock speed, especially in future wire-limited technologies. In this paper, we propose a register file organization that reduces register file size and port requirements for a given amount of ILP. We use a two-level register file organization to reduce register file size requirements, and a banked organization to reduce port requirements. We demonstrate empirically that the resulting register file organizations have reduced latency and (in the case of the banked organization) energy requirements for similar instructions per cycle (IPC) performance and improved instructions per second (IPS) performance in comparison to a conventional monolithic register file. The choice of organization is dependent on design goals.
Instruction distribution heuristics for quad-cluster, dynamically-scheduled, superscalar processors
In Proceedings of MICRO-33, 2000. Cited by 61 (0 self)
We investigate instruction distribution methods for quad-clustered, dynamically-scheduled superscalar processors. We study a variety of methods with different cost, performance and complexity characteristics. We investigate both non-adaptive and adaptive methods and their sensitivity both to inter-cluster communication latencies and pipeline depth. Furthermore, we develop a set of models that allow us to identify how well each method attacks issue-bandwidth and inter-cluster communication restrictions. We find that a relatively simple method that changes clusters every other three instructions offers only a 17% performance slowdown compared to a non-clustered configuration operating at the same frequency. Moreover, we show that by utilizing adaptive methods it is possible to further reduce this gap down to about 14%. Furthermore, performance appears to be more sensitive to inter-cluster communication latencies than to pipeline depth. The best performing method offers a slowdown of about 24% when inter-cluster communication latency is two cycles. This gap is only 20% when two additional stages are introduced in the front-end pipeline.
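The "simple method" in this abstract can be illustrated with a minimal sketch. It assumes the phrase "changes clusters every other three instructions" means that consecutive groups of three instructions are sent to the same cluster and the target cluster then advances round-robin. The function name `steer_mod_n` and its parameters are hypothetical, not taken from the paper.

```python
def steer_mod_n(num_instructions, num_clusters=4, group_size=3):
    """Return the cluster index chosen for each instruction in program order.

    Assumption: the steering logic keeps sending instructions to the same
    cluster until group_size instructions have been dispatched, then moves
    on to the next cluster in round-robin order.
    """
    return [(i // group_size) % num_clusters for i in range(num_instructions)]


# Fourteen instructions on a quad-cluster machine: three per cluster, then wrap.
assignment = steer_mod_n(14)
print(assignment)
```

A policy like this needs no dependence information at all, which is what makes it cheap; the adaptive methods the abstract mentions would replace the fixed rule with one driven by run-time feedback.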
Dynamically managing the communication-parallelism trade-off in future clustered processors
In Proceedings of International Symposium on Computer Architecture, 2003. Cited by 47 (10 self)
Clustered microarchitectures are an attractive alternative to large monolithic superscalar designs due to their potential for higher clock rates in the face of increasingly wire-delay-constrained process technologies. As increasing transistor counts allow an increase in the number of clusters, thereby allowing more aggressive use of instruction-level parallelism (ILP), the inter-cluster communication increases as data values get spread across a wider area. As a result of the emergence of this trade-off between communication and parallelism, a subset of the total on-chip clusters is optimal for performance. To match the hardware to the application's needs, we use a robust algorithm to dynamically tune the clustered architecture. The algorithm, which is based on program metrics gathered at periodic intervals, achieves an 11% performance improvement on average over the best statically defined architecture. We also show that the use of additional hardware and reconfiguration at basic block boundaries can achieve average improvements of 15%. Our results demonstrate that reconfiguration provides an effective solution to the communication and parallelism trade-off inherent in the communication-bound processors of the future.
CARS: A new code generation framework for clustered ILP processors
In HPCA, 2001. Cited by 40 (1 self)
Clustered ILP processors are characterized by a large number of non-centralized on-chip resources grouped into clusters. Traditional code generation schemes for these processors consist of multiple phases for cluster assignment, register allocation and instruction scheduling. Most of these approaches need additional re-scheduling phases because they often do not impose finite resource constraints in all phases of code generation. These phase-ordered solutions have several drawbacks, resulting in the generation of poor performance code. Moreover, the iterative/back-tracking algorithms used in some of these schemes have large running times. In this paper we present CARS, a code generation framework for Clustered ILP processors, which combines the cluster assignment, register allocation, and instruction scheduling phases into a single code generation phase, thereby eliminating the problems associated with phase-ordered solutions. The CARS algorithm explicitly takes into account all the resource constraints at each cluster scheduling step to reduce spilling and to avoid iterative re-scheduling steps. We also present a new on-the-fly register allocation scheme developed for CARS. We describe an implementation of the proposed code generation framework and the results of a performance evaluation study using the SPEC95/2000 and MediaBench benchmarks.
Efficient interconnects for clustered microarchitectures
In International Conference on Parallel Architectures and Compilation Techniques, 2002. Cited by 31 (4 self)
Clustering is an effective microarchitectural technique for reducing the impact of wire delays, the complexity, and the power requirements of microprocessors. In this work, we investigate the design of on-chip interconnection networks for clustered microarchitectures. This new class of interconnects has different demands and characteristics than traditional multiprocessor networks. In a clustered microarchitecture, a low inter-cluster communication latency is essential for high performance. We propose point-to-point interconnects together with an effective latency-aware instruction steering scheme and show that they achieve much better performance than bus-based interconnects. The results show that the connectivity of the network together with latency-aware steering schemes are key for high performance. We also show that
Improving Dynamic Cluster Assignment for Clustered Trace Cache Processors
2003. Cited by 26 (3 self)
This work examines dynamic cluster assignment for a clustered trace cache processor (CTCP). Previously proposed cluster assignment techniques run into unique problems as issue width and cluster count increase. Realistic design conditions, such as variable data forwarding latencies between clusters and a heavily partitioned instruction window, increase the degree of difficulty for effective cluster assignment.
A complexity-effective approach to ALU bandwidth enhancement for instruction-level temporal redundancy
In Proceedings of the International Symposium on Computer Architecture (ISCA), 2004. Cited by 22 (4 self)
Previous proposals for implementing instruction-level temporal redundancy in out-of-order cores have reported a performance degradation of up to 45% in certain applications compared to an execution which does not have any temporal redundancy. An important contributor to this problem is the insufficient number of ALUs for handling the amplified load injected into the core. At the same time, increasing the number of ALUs can increase the complexity of the issue logic, which has been pointed out to be one of the most timing-critical components of the processor. This paper proposes a novel extension of a prior idea on instruction reuse to ease ALU bandwidth requirements in a complexity-effective way by exploiting certain interesting properties of a dual (temporally redundant) instruction stream. We present microarchitectural extensions necessary for implementing an instruction reuse buffer (IRB) and integrating this with the issue logic of a dual instruction stream superscalar core, and conduct extensive evaluations to demonstrate how well it can alleviate the ALU bandwidth problem. We show that on average we can gain back nearly 50% of the IPC loss that occurred due to ALU bandwidth limitations for an instruction-level temporally redundant superscalar execution, and 23% of the overall IPC loss. Keywords: Complexity-effective design, Instruction Reuse, Temporal Redundancy.
A high-speed dynamic instruction scheduling scheme for superscalar processors
In Proceedings of the 34th International Symposium on Microarchitecture, 2001. Cited by 18 (0 self)
The wakeup logic is a part of the issue window and is responsible for managing the ready flags of the operands for dynamic instruction scheduling. The conventional wakeup logic is based on association, and composed of a RAM and a CAM. Since the logic is not pipelinable and the delays of these memories are dominated by the wire delays, the logic will become more critical with deeper pipelines and smaller feature sizes. This paper describes a new scheduling scheme based not on association but on matrices which represent the dependences between instructions. Since the update logic of the matrices detects the dependences between instructions as the register renaming logic does, the wakeup operation is realized by just reading the matrices. This paper also describes a technique to reduce the effective size of the matrices for small IPC penalties. We designed the layouts of the logic guided by a […] µm CMOS design rule provided by Fujitsu Limited, and calculated the delays. We also evaluated the penalties by cycle-level simulation. The results show that our scheme achieves a 2.7 GHz clock speed with an IPC degradation of about 1%.
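The matrix-based wakeup idea in this abstract can be sketched in software as follows. This is a generic dependence-matrix model under stated assumptions, not the paper's circuit: each window entry owns one row, bit j of row i means "entry i still waits on entry j", and an entry is ready when its row is all zeros. The class and method names (`DependencyMatrix`, `insert`, `issue`, `ready`) are hypothetical.

```python
class DependencyMatrix:
    """Toy model of dependence-matrix wakeup for a small issue window."""

    def __init__(self, size):
        self.size = size
        self.rows = [0] * size        # rows[i]: bit j set => entry i waits on entry j
        self.valid = [False] * size   # which window entries are occupied

    def insert(self, slot, producers):
        """Dispatch an instruction into `slot`, waiting on the given producer slots."""
        row = 0
        for p in producers:
            row |= 1 << p
        self.rows[slot] = row
        self.valid[slot] = True

    def ready(self):
        """Wakeup is just a read: an occupied entry with an all-zero row is ready."""
        return [i for i in range(self.size) if self.valid[i] and self.rows[i] == 0]

    def issue(self, slot):
        """Issue the instruction in `slot` and clear its column in every row."""
        mask = ~(1 << slot)
        self.rows = [r & mask for r in self.rows]
        self.valid[slot] = False


m = DependencyMatrix(4)
m.insert(0, [])        # no source dependences
m.insert(1, [0])       # depends on entry 0
m.insert(2, [0, 1])    # depends on entries 0 and 1
first = m.ready()      # only entry 0 can issue
m.issue(0)
second = m.ready()     # entry 1 wakes up
m.issue(1)
third = m.ready()      # entry 2 wakes up
print(first, second, third)
```

The sketch clears a column at issue time for simplicity; a real design would time that update to the producer's latency.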
Register packing: Exploiting narrow-width operands for reducing register file pressure
In Proc. of the 37th Annual International Symposium on Microarchitecture, 2004. Cited by 18 (4 self)
A large percentage of computed results have fewer significant bits compared to the full width of a register. We exploit this fact to pack multiple results into a single physical register to reduce the pressure on the register file in a superscalar processor. Two schemes for dynamically packing multiple "narrow-width" results into partitions within a single register are evaluated. The first scheme is conservative and allocates a full-width register for a computed result. If the computed result turns out to be narrow, the result is reallocated to partitions within a common register, freeing up the full-width register. The second scheme allocates register partitions based on a prediction of the width of the result and reallocates register partitions when the actual result width is higher than what was predicted. If the actual width is narrower than what was predicted, allocated partitions are freed up. A detailed evaluation of our schemes shows that average IPC gains of up to 15% can be realized across the SPEC 2000 benchmarks on a somewhat register-constrained datapath.
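The packing idea can be illustrated with a small sketch. The 64-bit register width, the 16-bit partition size, and the first-fit placement policy are all assumptions chosen for illustration; the abstract does not state them. The names `partitions_needed` and `pack` are hypothetical.

```python
REG_BITS = 64                           # assumed full register width
PART_BITS = 16                          # assumed partition width
PARTS_PER_REG = REG_BITS // PART_BITS   # four partitions per register


def partitions_needed(value):
    """Smallest number of partitions that holds `value` as a signed integer."""
    for parts in range(1, PARTS_PER_REG + 1):
        bits = parts * PART_BITS
        if -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1:
            return parts
    raise ValueError("value does not fit in a full-width register")


def pack(values):
    """First-fit packing of results into physical registers.

    Returns (placement, register_count), where placement[i] is the index of
    the physical register that holds values[i].
    """
    free = []        # free[r]: unused partitions left in register r
    placement = []
    for v in values:
        need = partitions_needed(v)
        for r, avail in enumerate(free):
            if avail >= need:
                free[r] -= need
                placement.append(r)
                break
        else:
            # No existing register has room: allocate a new one.
            free.append(PARTS_PER_REG - need)
            placement.append(len(free) - 1)
    return placement, len(free)


placement, used = pack([5, -1, 32768, 1 << 40, 7])
print(placement, used)
```

Here five results occupy two physical registers instead of five, which is the kind of pressure reduction the abstract describes.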
Microarchitectural wire management for performance and power in partitioned architectures
In Proceedings of HPCA-11, 2005. Cited by 18 (6 self)
Future high-performance billion-transistor processors are likely to employ partitioned architectures to achieve high clock speeds, high parallelism, low design complexity, and low power. In such architectures, inter-partition communication over global wires has a significant impact on overall processor performance and power consumption. VLSI techniques allow a variety of wire implementations, but these wire properties have never previously been exposed to the microarchitecture. This paper advocates global wire management at the microarchitecture level and proposes a heterogeneous interconnect that is composed of wires with varying latency, bandwidth, and energy characteristics. We propose and evaluate microarchitectural techniques that can exploit such a heterogeneous interconnect to improve performance and reduce energy consumption. These techniques include a novel cache pipeline design, the identification of narrow bit-width operands, the classification of non-critical data, and the detection of interconnect load imbalance. For a dynamically scheduled partitioned architecture, our results demonstrate that the proposed innovations result in up to 11% reductions in overall processor ED², compared to a baseline processor that employs a homogeneous interconnect.

