Dynamically managing the communication-parallelism trade-off in future clustered processors
In Proceedings of International Symposium on Computer Architecture, 2003
Cited by 47 (10 self)
Clustered microarchitectures are an attractive alternative to large monolithic superscalar designs due to their potential for higher clock rates in the face of increasingly wire-delay-constrained process technologies. As increasing transistor counts allow an increase in the number of clusters, thereby allowing more aggressive use of instruction-level parallelism (ILP), the inter-cluster communication increases as data values get spread across a wider area. As a result of the emergence of this trade-off between communication and parallelism, a subset of the total on-chip clusters is optimal for performance. To match the hardware to the application’s needs, we use a robust algorithm to dynamically tune the clustered architecture. The algorithm, which is based on program metrics gathered at periodic intervals, achieves an 11% performance improvement on average over the best statically defined architecture. We also show that the use of additional hardware and reconfiguration at basic block boundaries can achieve average improvements of 15%. Our results demonstrate that reconfiguration provides an effective solution to the communication and parallelism trade-off inherent in the communication-bound processors of the future.
Improving Dynamic Cluster Assignment for Clustered Trace Cache Processors
2003
Cited by 26 (3 self)
This work examines dynamic cluster assignment for a clustered trace cache processor (CTCP). Previously proposed cluster assignment techniques run into unique problems as issue width and cluster count increase. Realistic design conditions, such as variable data forwarding latencies between clusters and a heavily partitioned instruction window, increase the degree of difficulty for effective cluster assignment.
Dynamic Strands: Collapsing Speculative Dependence Chains for Reducing Pipeline Communication
2004
Cited by 20 (3 self)
In the modern era of wire-dominated architectures, specific effort must be made to reduce needless communication within out-of-order pipelines while still maintaining binary compatibility. To ease pressure on highly connected elements such as the issue logic and bypass network, we propose the dynamic detection and speculative execution of instruction strands: linear chains of dependent instructions without intermediate fan-out. The hardware required for detecting these chains is simple and resides off the critical path of the pipeline, and the execution targets are the normal ALUs with a self-bypass mode. By collapsing these strings of dependencies into atomic macro-instructions, the efficiency of the issue queue and reorder buffer can be increased. Our results show that over 25% of all dynamic ALU instructions can be grouped, decreasing both the average reorder buffer occupancy and issue queue occupancy by over a third. Additionally, these strands have several properties that make them amenable to simple performance optimizations. Our experiments show average IPC increases of 17% on a four-wide machine and 20% on an eight-wide machine in SPEC2000int and MediaBench applications. Finally, strands ease the IPC penalties of multicycle issue and bypass by reducing dependency pressures, providing opportunity for clock frequency gains as well.
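As an illustration only, the strand definition in this abstract (a linear chain of dependent instructions with no intermediate fan-out) can be sketched in software. The function name `find_strands`, the `(dest, srcs)` instruction encoding, and the rule that only consumers inside the examined window count toward fan-out are all assumptions made for this sketch; the paper's hardware detection mechanism is not described in the abstract and may differ.

```python
from collections import defaultdict

def find_strands(instrs):
    """Hypothetical sketch: instrs is a list of (dest, srcs) tuples in program order.

    Returns lists of instruction indices forming chains in which each
    producer has exactly one consumer inside the window (no fan-out).
    """
    last_writer = {}                # register -> index of its most recent producer
    consumers = defaultdict(list)   # producer index -> indices of its consumers
    for i, (dest, srcs) in enumerate(instrs):
        for src in set(srcs):
            if src in last_writer:
                consumers[last_writer[src]].append(i)
        if dest is not None:
            last_writer[dest] = i

    next_in_chain = {}              # producer -> its single consumer
    has_pred = set()                # instructions already linked to a producer
    for p in range(len(instrs)):
        cons = consumers[p]
        # Link only when the value has one consumer and that consumer is
        # not already the tail of another link, which keeps chains linear.
        if len(cons) == 1 and cons[0] not in has_pred:
            next_in_chain[p] = cons[0]
            has_pred.add(cons[0])

    strands = []
    for i in range(len(instrs)):
        if i in next_in_chain and i not in has_pred:   # chain head
            chain = [i]
            while chain[-1] in next_in_chain:
                chain.append(next_in_chain[chain[-1]])
            strands.append(chain)
    return strands

# Made-up window: r1 -> r2 -> r3 is a chain; r4 fans out to two consumers.
window = [
    ("r1", ["r0"]),
    ("r2", ["r1"]),
    ("r3", ["r2"]),
    ("r4", ["r0"]),
    ("r5", ["r4"]),
    ("r6", ["r4"]),
]
strands = find_strands(window)
```

In this made-up window, the first three instructions form one strand, while the instruction writing `r4` is excluded because its result feeds two consumers.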
Microarchitectural wire management for performance and power in partitioned architectures
In Proceedings of HPCA-11, 2005
Cited by 18 (6 self)
Future high-performance billion-transistor processors are likely to employ partitioned architectures to achieve high clock speeds, high parallelism, low design complexity, and low power. In such architectures, inter-partition communication over global wires has a significant impact on overall processor performance and power consumption. VLSI techniques allow a variety of wire implementations, but these wire properties have previously never been exposed to the microarchitecture. This paper advocates global wire management at the microarchitecture level and proposes a heterogeneous interconnect that is composed of wires with varying latency, bandwidth, and energy characteristics. We propose and evaluate microarchitectural techniques that can exploit such a heterogeneous interconnect to improve performance and reduce energy consumption. These techniques include a novel cache pipeline design, the identification of narrow bit-width operands, the classification of non-critical data, and the detection of interconnect load imbalance. For a dynamically scheduled partitioned architecture, our results demonstrate that the proposed innovations result in reductions of up to 11% in overall processor ED², compared to a baseline processor that employs a homogeneous interconnect.
Routed Inter-ALU Networks for ILP Scalability and Performance
2003
Cited by 10 (1 self)
Modern processors rely heavily on broadcast networks to bypass instruction results to dependent instructions in the pipeline. However, as clock rates increase, architectures get wider, and pipelines get deeper, broadcasting becomes more complex, slower, and more difficult to implement. This complexity is compounded by shrinking feature size, as the communication speed decreases relative to transistor switching speeds. This paper examines the fundamental needs of bypassing networks and proposes a method for classifying these Inter-ALU Networks based on how operands are routed from producers to consumers. We then propose and evaluate at both the circuit and architectural level a fine grain point-to-point Routed Inter-ALU Network (RIAN) that delivers the same or higher instruction throughput as a full bypass network but at higher speeds while using fewer wires.
Independent front-end and back-end dynamic voltage scaling for a GALS microarchitecture
In ISLPED ’06: Proceedings of the 2006 International Symposium on Low Power Electronics and Design, 2006
Cited by 6 (1 self)
In recent years, Globally Asynchronous Locally Synchronous (GALS) designs and dynamic voltage scaling (DVS) have emerged as some of the most popular approaches to address the ever-increasing microprocessor energy consumption. In this work, we propose two on-line algorithms for adjusting dynamically, and independently, the voltage and frequency of the front-end and back-end domains of a novel two-domain microprocessor. We evaluate our mechanisms for both internal and external voltage regulators, and we present optimal dynamic voltage scaling results for the proposed microarchitecture. Our schemes achieve an average improvement of 12% in the energy-delay² metric when using internal voltage regulators.
Frontend Frequency-Voltage Adaptation for Optimal Energy-Delay²
In International Conference on Computer Design, 2004
Cited by 2 (2 self)
In this paper we present a clustered, multiple-clock domain (CMCD) microarchitecture that combines the benefits of both clustering and Globally Asynchronous Locally Synchronous (GALS) designs. We also present a mechanism for dynamically adapting the frequency and voltage of the frontend of the CMCD with the goal of optimizing the energy-delay² product (ED2P). Our mechanism has minimal hardware cost, is entirely self-adjustable, does not depend on any thresholds, and achieves results close to optimal. We evaluate it on 16 SPEC 2000 applications and report a 17.5% ED2P reduction on average (80% of the upper bound).
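For readers unfamiliar with the metric, the energy-delay² product multiplies energy by the square of execution time, so a slowdown is penalized quadratically. The short sketch below shows the arithmetic; the function names and all numeric values are hypothetical and are not taken from the paper.

```python
def ed2p(energy_joules, delay_seconds):
    """Energy-delay-squared product: E * D^2."""
    return energy_joules * delay_seconds ** 2

def percent_reduction(baseline, improved):
    """Relative reduction of `improved` versus `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline

# Hypothetical numbers: lowering frontend voltage and frequency saves
# 20% energy but lengthens execution time by 5%.
baseline = ed2p(10.0, 2.0)   # 10 J over 2.0 s
scaled = ed2p(8.0, 2.1)      # 8 J over 2.1 s
reduction = percent_reduction(baseline, scaled)
```

With these made-up inputs the ED2P still improves, because the energy saving outweighs the squared delay penalty; a larger slowdown would quickly reverse that.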
Cluster Assignment Strategies for a Clustered Trace Cache Processor
2003
Cited by 2 (2 self)
This report examines dynamic cluster assignment for a clustered trace cache processor (CTCP).
Exploring Energy-Performance Trade-offs for Heterogeneous Interconnect Clustered VLIW Processors
In Proc. of Intl. Conf. on High Performance Computing, 2005
Cited by 2 (1 self)
Clustered architecture processors are preferred for embedded systems because centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption. Although clustering helps by improving clock speed, reducing energy consumption of the logic, and making design simpler, it introduces extra overheads by way of inter-cluster communication. This communication happens over long global wires with high load capacitance, which leads to execution delays and significantly higher energy consumption. Technological advancements permit the design of a variety of clustered architectures by varying the degree of clustering and the type of interconnects. In this paper, we focus on exploring energy-performance trade-offs in going from a unified VLIW architecture to different types of clustered VLIW architectures. We propose a new instruction scheduling algorithm that exploits scheduling slacks of instructions and communication slacks of data values together to achieve better energy-performance trade-offs for clustered architectures. Our instruction scheduling algorithm for clustered architectures with heterogeneous interconnect achieves 35% and 40% reductions in communication energy, whereas the overall energy-delay product improves by 4.5% and 6.5%, respectively, for 2-cluster and 4-cluster machines, with marginal 1.6% and 1.1% increases in execution time. Our test bed uses the Trimaran compiler infrastructure.
[Title and venue garbled in source]
2002
Cited by 1 (0 self)
Clustering is an effective microarchitectural technique for reducing the impact of wire delays, the complexity, and the power requirements of microprocessors. In this work, we investigate the design of on-chip interconnection networks for clustered superscalar microarchitectures. This new class of interconnects has demands and characteristics different from traditional multiprocessor networks. In particular, in a clustered microarchitecture, a low inter-cluster communication latency is essential for high performance. We propose some point-to-point cluster interconnects and new improved instruction steering schemes. The results show that these point-to-point interconnects achieve much better performance than bus-based ones, and that the connectivity of the network, together with effective steering schemes, is key for high performance. We also show that these interconnects can be built with

