Dynamically managing the communication-parallelism trade-off in future clustered processors
- In Proceedings of International Symposium on Computer Architecture, 2003
- Cited by 47 (10 self)
Clustered microarchitectures are an attractive alternative to large monolithic superscalar designs due to their potential for higher clock rates in the face of increasingly wire-delay-constrained process technologies. As increasing transistor counts allow an increase in the number of clusters, thereby allowing more aggressive use of instruction-level parallelism (ILP), the inter-cluster communication increases as data values get spread across a wider area. As a result of the emergence of this trade-off between communication and parallelism, a subset of the total on-chip clusters is optimal for performance. To match the hardware to the application’s needs, we use a robust algorithm to dynamically tune the clustered architecture. The algorithm, which is based on program metrics gathered at periodic intervals, achieves an 11% performance improvement on average over the best statically defined architecture. We also show that the use of additional hardware and reconfiguration at basic block boundaries can achieve average improvements of 15%. Our results demonstrate that reconfiguration provides an effective solution to the communication and parallelism trade-off inherent in the communication-bound processors of the future.
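The trade-off described in this abstract can be illustrated with a small sketch. This is not the paper's algorithm: it assumes a simple exploration scheme in which each candidate cluster count is run for one interval and the one with the highest measured IPC is kept. The names `pick_cluster_count` and `toy_ipc`, and the toy performance model itself, are hypothetical.

```python
from typing import Callable, Sequence


def pick_cluster_count(candidates: Sequence[int],
                       measure_ipc: Callable[[int], float]) -> int:
    """Try each candidate cluster count for one interval; keep the best.

    measure_ipc is a stand-in for whatever metric the hardware gathers
    during an interval (an assumption, not the paper's mechanism).
    """
    best, best_ipc = candidates[0], float("-inf")
    for n in candidates:
        ipc = measure_ipc(n)
        if ipc > best_ipc:
            best, best_ipc = n, ipc
    return best


def toy_ipc(n_clusters: int, ilp: float = 6.0, comm_cost: float = 0.15) -> float:
    """Hypothetical model: parallelism gains saturate at the program's ILP,
    while every extra cluster adds a fixed communication penalty."""
    return min(2.0 * n_clusters, ilp) - comm_cost * (n_clusters - 1)


if __name__ == "__main__":
    # A low-ILP program is best served by a subset of the clusters.
    print(pick_cluster_count([1, 2, 4, 8, 16], toy_ipc))
```

Under this toy model, a program with more available parallelism (a larger `ilp` value) favors more clusters, which is the kind of application-dependent optimum the abstract refers to.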
Decoupling Local Variable Accesses in a Wide-Issue Superscalar Processor
- In Proc. 26th Ann. Int’l Symp. on Computer Architecture, 1998
- Cited by 28 (4 self)
Providing adequate data bandwidth is extremely important for a wide-issue superscalar processor to achieve its full performance potential. Adding a large number of ports to a data cache, however, becomes increasingly inefficient and can add significantly to the hardware complexity. This paper takes an alternative or complementary approach for providing more data bandwidth, called the data-decoupled architecture. The approach, with support from the compiler and/or hardware, partitions the memory stream into two independent streams early in the processor pipeline, and feeds each stream to a separate memory access queue and cache. Under this model, the paper studies the potential of decoupling memory accesses to a program's local variables that are allocated on the run-time stack. Using a set of integer and floating-point programs from the SPEC95 benchmark suite, it is shown that local variable accesses constitute a large portion of all the memory references, while their reference space is ...
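The stream partitioning described above can be sketched roughly as follows. The abstract does not say how accesses are classified, so this sketch assumes a simple address-range check against the stack region. The class name `DecoupledMemoryFrontEnd` and its fields are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DecoupledMemoryFrontEnd:
    """Routes each memory access to one of two independent queues."""
    stack_low: int    # assumed lowest address of the run-time stack region
    stack_high: int   # assumed upper bound of the stack region (exclusive)
    local_queue: deque = field(default_factory=deque)   # local-variable stream
    global_queue: deque = field(default_factory=deque)  # all other accesses

    def dispatch(self, address: int) -> str:
        # Assumption: an access is "local" if it falls inside the stack region.
        if self.stack_low <= address < self.stack_high:
            self.local_queue.append(address)
            return "local"
        self.global_queue.append(address)
        return "global"


if __name__ == "__main__":
    fe = DecoupledMemoryFrontEnd(stack_low=0x7000, stack_high=0x8000)
    for addr in (0x7FF0, 0x1000, 0x7F00):
        print(hex(addr), fe.dispatch(addr))
```

Each queue would then feed its own cache, so the two streams no longer compete for the same cache ports.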
L1 Data Cache Decomposition for Energy Efficiency
- In Proceedings of the International Symposium on Low Power Electronics and Design, 2001
- Cited by 22 (1 self)
The L1 data cache is a time-critical module and, at the same time, a major consumer of energy. To reduce its energy-delay product, we apply two principles of low-power design: specialize part of the cache structure and break the cache down into smaller caches. To this end, we propose a new L1 data cache structure that combines a Specialized Stack Cache (SSC) and a Pseudo Set-Associative Cache (PSAC). Individually, our SSC and PSAC designs have a lower energy-delay product than previously-proposed related designs. In addition, their combined operation is very effective. Relative to a conventional 2-way 32 KB data cache, a design containing a 4-way 32 KB PSAC and a 512 B SSC reduces the energy-delay product of several applications by an average of 44%.
Partitioned First-Level Cache Design for Clustered Microarchitectures
- ICS'03, 2003
- Cited by 17 (0 self)
The high clock frequencies of modern superscalar processors make the wire delay incurred in moving data across the processor chip a significant concern. As frequencies continue to increase, it will become more difficult for a centralized first level data cache to supply the timely data bandwidth required by superscalar processors. This paper
Effective Instruction Scheduling with Limited Registers
- 2001
- Cited by 2 (0 self)
Effective global instruction scheduling techniques have become an important component in modern compilers for exposing more instruction-level parallelism (ILP) and exploiting the ever-increasing number of parallel function units. Effective register allocation has long been an essential component of a good compiler for reducing memory references. While instruction scheduling and register allocation are both essential compiler optimizations for fully exploiting the capability of modern high-performance microprocessors, there is a phase-ordering problem when we perform these two optimizations separately: instruction scheduling before register allocation may create insatiable demands for registers; register allocation before instruction scheduling may reduce the amount of parallelism that instruction scheduling can exploit. In this thesis, we propose to solve this phase-ordering problem by inserting a moderating optimization called code reorganization between prepass instruction scheduling and register allocation. Code reorganization adjusts the prepass scheduling results to make them demand fewer registers (i.e., exhibit lower register pressure) and guides register allocation to insert spill code that has less impact on schedule length. Our new approach avoids the complexity of simultaneous instruction scheduling and register allocation algorithms. In fact, it does not modify either instruction scheduling or register allocation algorithms. Therefore, instruction scheduling can focus on maximizing instruction-level parallelism, and register allocation can focus on minimizing the cost of spill code. We compare the performance of our approach with a particular successful register-pressure-sensitive scheduling algorithm, and show an average of 18% improvement in speedup for an 8...
Predictive Precharging for Bitline Leakage Energy Reduction
- The 15th IEEE ASIC/SOC Conf., pp. 36-40, 2002
- Cited by 1 (0 self)
As technology scales down into the deep-submicron regime, leakage energy is becoming a dominant source of energy consumption. Leakage energy is generally proportional to the area of a circuit, and caches constitute a large portion of the die area. Therefore, there has been much effort to reduce leakage energy in caches. Most techniques have targeted cell leakage energy optimization, but bitline leakage energy is also critical. Thus, we propose a predictive precharging scheme to reduce bitline leakage energy. Results show that energy savings are significant with little performance degradation. Also, our predictive precharging is more beneficial in more aggressively scaled technologies.
Microarchitectural Trade-offs in the Design of a Scalable Clustered Microprocessor
- Cited by 1 (1 self)
Clustered microarchitectures are an attractive alternative to large monolithic superscalar designs due to their potential for higher clock rates in the face of increasingly wire-delay-constrained process technologies. In such a microarchitecture, the distribution of functional units, the register files, and the issue queues across multiple clusters reduces the latency of various cycle time critical paths, thereby enabling a faster clock. However, a penalty in terms of instructions per cycle is incurred if instructions frequently communicate values among clusters because of dependences.
L1 Data Cache Decomposition for Energy Efficiency
- In Proceedings of the International Symposium on Low Power Electronics and Design, 2001
The L1 data cache is a time-critical module and, at the same time, a major consumer of energy. To reduce its energy-delay product, we apply two principles of low-power design: specialize part of the cache structure and break the cache down into smaller caches. To this end, we propose a new L1 data cache structure that combines a Specialized Stack Cache (SSC) and a Pseudo Set-Associative Cache (PSAC). Individually, our SSC and PSAC designs have a lower energy-delay product than previously-proposed related designs. In addition, their combined operation is very effective. Relative to a conventional 2-way 32 KB data cache, a design containing a 4-way 32 KB PSAC and a 512 B SSC reduces the energy-delay product of several applications by an average of 44%.
Cache Characterization and Performance Studies Using Locality Surfaces
- 2003
Introduction Moore's Law states that processor speeds double every 18 months. Memory density is increasing at a similar rate, but memory speeds increase at the much slower rate of about 7% per year [1]. This means that the time needed to access memory is an increasing bottleneck. To help overcome this mismatch in speed, computer architecture designers take advantage of the fact that smaller memories placed close to the processor are significantly faster than main memory [2]. Small, fast memories between the processor and main memory are called caches. Caches contain a subset of the data in main memory. If a significant portion of the data the processor requires is found in the cache, the processor is relieved from waiting for slower main memory. Cache access speeds are about 1 ns and main memory access speeds are about 100 ns [1]. When the contents of an accessed memory location are in the cache, it is termed a hit. When the contents are not in the cache it is termed a miss. Large ca
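The hit and miss terminology above can be made concrete with a small sketch. It assumes a direct-mapped cache and a simple cost model in which a hit costs the cache access time and a miss costs the cache access time plus the main memory access time. The 1 ns and 100 ns figures come from the text; the function names, cache geometry, and address trace are hypothetical.

```python
def simulate_direct_mapped(addresses, num_lines=4, line_size=16):
    """Count hits and misses for a direct-mapped cache (illustrative geometry)."""
    tags = [None] * num_lines          # one stored tag per cache line
    hits = misses = 0
    for addr in addresses:
        block = addr // line_size      # memory block containing the address
        index = block % num_lines      # cache line the block maps to
        tag = block // num_lines       # identifies which block occupies the line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag          # fill the line on a miss
    return hits, misses


def average_access_ns(hits, misses, cache_ns=1.0, memory_ns=100.0):
    """Average access time under the assumed cost model."""
    total = hits + misses
    if total == 0:
        raise ValueError("no accesses recorded")
    return (hits * cache_ns + misses * (cache_ns + memory_ns)) / total


if __name__ == "__main__":
    trace = [0, 4, 8, 16, 0, 64, 0]    # hypothetical address trace
    h, m = simulate_direct_mapped(trace)
    print(h, m, round(average_access_ns(h, m), 2))
```

In this trace, addresses 0 and 64 map to the same cache line and evict each other, which shows how conflicting accesses push the average access time toward the main memory latency.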

