An adaptive, nonuniform cache structure for wire-delay dominated on-chip caches
In International Conference on Architectural Support for Programming Languages and Operating Systems, 2002
Cited by 199 (34 self)
Growing wire delays will force substantive changes in the designs of large caches. Traditional cache architectures assume that each level in the cache hierarchy has a single, uniform access time. Increases in on-chip communication delays will make the hit time of large on-chip caches a function of a line's physical location within the cache. Consequently, cache access times will become a continuum of latencies rather than a single discrete latency. This nonuniformity can be exploited to provide faster access to cache lines in the portions of the cache that reside closer to the processor. In this paper, we evaluate a series of cache designs that provides fast hits to multi-megabyte cache memories. We first propose physical designs for these Non-Uniform Cache Architectures (NUCAs). We extend these physical designs with logical policies that allow important data to migrate toward the processor within the same level of the cache. We show that, for multi-megabyte level-two caches, an adaptive, dynamic NUCA design achieves 1.5 times the IPC of a Uniform Cache Architecture of any size, outperforms the best static NUCA scheme by 11%, outperforms the best three-level hierarchy (while using less silicon area) by 13%, and comes within 13% of an ideal minimal hit latency solution.
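To make the idea of position-dependent hit latency and migration concrete, here is a minimal toy sketch. It is not the paper's design: the class name `ToyDNUCA`, the per-bank latencies, the one-bank promotion on a hit, and the insertion into the farthest bank on a miss are all assumptions chosen only to illustrate how frequently used lines could drift toward the processor.

```python
class ToyDNUCA:
    """Toy model of one bank set; index 0 is the bank nearest the processor."""

    def __init__(self, bank_latencies):
        # Hypothetical per-bank hit latencies in cycles, nearest bank first.
        self.bank_latencies = list(bank_latencies)
        # One resident line tag (or None for an empty bank) per bank.
        self.banks = [None] * len(self.bank_latencies)

    def access(self, tag):
        """Return the hit latency in cycles, or None on a miss."""
        if tag in self.banks:
            i = self.banks.index(tag)
            latency = self.bank_latencies[i]
            if i > 0:
                # Assumed policy: on a hit, swap with the line one bank closer.
                self.banks[i - 1], self.banks[i] = self.banks[i], self.banks[i - 1]
            return latency
        # Assumed policy: on a miss, install in the farthest bank (evicting it).
        self.banks[-1] = tag
        return None


cache = ToyDNUCA([4, 8, 12, 16])
cache.access("A")                      # miss: "A" lands in the farthest bank
latencies = [cache.access("A") for _ in range(5)]
print(latencies)                       # [16, 12, 8, 4, 4]
```

Each repeated hit costs less than the previous one until the line reaches the nearest bank, which is the behavior the abstract describes as data migrating toward the processor.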
Managing Multi-Configurable Hardware via Dynamic Working Set Analysis
In 29th Annual International Symposium on Computer Architecture, 2002
Cited by 140 (3 self)
Microprocessors are designed to provide good average performance over a variety of workloads. This can lead to inefficiencies both in power and performance for individual programs and during individual phases within the same program. Microarchitectures with multi-configuration units (e.g. caches, predictors, instruction windows) are able to adapt dynamically to program behavior and enable/disable resources as needed. A key element of existing configuration algorithms is adjusting to program phase changes. This is typically done by "tuning" when a phase change is detected -- i.e. sequencing through a series of trial configurations and selecting the best. We study algorithms that dynamically collect and analyze program working set information. To make this practical, we propose working set signatures -- highly compressed working set representations (e.g. 32-128 bytes total). We describe algorithms that use working set signatures to 1) detect working set changes and trigger re-tuning; 2) identify recurring working sets and re-install saved optimal reconfigurations, thus avoiding the time-consuming tuning process; 3) estimate working set sizes to configure caches directly to the proper size, also avoiding the tuning process. We use reconfigurable instruction caches to demonstrate the performance of the proposed algorithms. When applied to reconfigurable instruction caches, an algorithm that identifies recurring phases achieves power savings and performance similar to the best algorithm reported to date, but with orders-of-magnitude savings in retunings.
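One way to picture a compressed working set signature is as a fixed-size bit vector into which touched cache lines are hashed, with two signatures compared by the fraction of bits on which they differ. The sketch below is an illustration under assumptions, not the paper's implementation: the function names, the 1024-bit (128-byte) size, the 64-byte line size, the use of `hashlib.blake2b` as the hash, and the distance measure are all choices made here for clarity.

```python
import hashlib


def signature(addresses, n_bits=1024, line_bytes=64):
    """Lossy bit-vector summary of the cache lines touched in one interval."""
    bits = 0
    for addr in addresses:
        line = addr // line_bytes
        # Hash the line number to one bit position (hash choice is an assumption).
        digest = hashlib.blake2b(line.to_bytes(8, "little"), digest_size=8).digest()
        bits |= 1 << (int.from_bytes(digest, "little") % n_bits)
    return bits


def relative_distance(sig_a, sig_b):
    """Differing bits divided by bits set in either signature (0.0 to 1.0)."""
    union = bin(sig_a | sig_b).count("1")
    if union == 0:
        return 0.0
    return bin(sig_a ^ sig_b).count("1") / union


loop_a = signature(range(0, 64 * 16, 64))                # 16 lines
loop_b = signature(range(64 * 1000, 64 * 1016, 64))      # 16 different lines
same_phase = relative_distance(loop_a, loop_a)           # identical working sets
new_phase = relative_distance(loop_a, loop_b)            # mostly disjoint bits
```

A controller could treat a distance above some threshold as a working set change, and could look up a table of previously seen signatures to reuse a saved configuration; the threshold and the table are further assumptions not taken from the abstract.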
Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures
2000
Cited by 130 (17 self)
Conventional microarchitectures choose a single memory hierarchy design point targeted at the average application. In this paper, we propose a cache and TLB layout and design that leverages repeater insertion to provide dynamic low-cost configurability trading off size and speed on a per application phase basis. A novel configuration management algorithm dynamically detects phase changes and reacts to an application’s hit and miss intolerance in order to improve memory hierarchy performance while taking energy consumption into consideration. When applied to a two-level cache and TLB hierarchy at 0.1 µm technology, the result is an average 15% reduction in cycles per instruction (CPI), corresponding to an average 27% reduction in memory-CPI, across a broad class of applications compared to the best conventional two-level hierarchy of comparable size. Projecting to sub-0.1 µm technology design considerations that call for a three-level conventional cache hierarchy for performance reasons, we demonstrate that a configurable L2/L3 cache hierarchy coupled with a conventional L1 results in an average 43% reduction in memory hierarchy energy in addition to improved performance.
A static power model for architects
In Proceedings of the 33rd International Symposium on Microarchitecture (MICRO-33), 2000
Cited by 112 (1 self)
Static power dissipation due to transistor leakage constitutes an increasing fraction of the total power in modern semiconductor technologies. Current technology trends indicate that the contribution will increase rapidly, reaching one half of total power dissipation within three process generations. Developing power efficient products will require consideration of static power in the earliest phases of design, including architecture and microarchitecture definition. We propose a simple equation for estimating static power consumption at the architectural level: $P_{static} = V_{CC} \cdot N \cdot k_{design} \cdot \hat{I}_{leak}$, where $V_{CC}$ is the supply voltage, $N$ is the number of transistors, $k_{design}$ is a design dependent parameter, and $\hat{I}_{leak}$ is a technology dependent parameter. This model enables high-level reasoning about the likely static power demands of alternative microarchitectures. Reasonably accurate values for the factors within the equation may be obtained directly from the high-level designs or by straightforward scaling arguments. The factors within the equation also suggest opportunities for static power optimization, including reducing the total number of devices, partitioning the design to allow for lower supply voltages or slower, less leaky transistors, turning off unused devices, favoring certain design styles, and favoring high bandwidth over low latency. Speculation is also examined as a means to employ slower transistors without a significant performance penalty.
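The static power equation in the abstract above is a plain product of four factors, so it is easy to evaluate by hand or in a few lines of code. The sketch below only restates that product; the function name and every numeric value are hypothetical placeholders chosen for round arithmetic, not figures from the paper.

```python
def static_power_watts(v_cc, n_transistors, k_design, i_leak_amps):
    """Product form from the abstract: V_CC * N * k_design * I_leak."""
    return v_cc * n_transistors * k_design * i_leak_amps


# Hypothetical inputs: 1.0 V supply, 100 million transistors,
# design factor 2.0, and 10 nA of leakage per device.
baseline = static_power_watts(1.0, 100e6, 2.0, 10e-9)    # about 2 W

# Holding the other factors fixed, the product scales linearly with N,
# so halving the device count halves the estimate.
half_devices = static_power_watts(1.0, 50e6, 2.0, 10e-9)  # about 1 W
```

In this form each optimization named in the abstract maps to one factor: fewer devices lowers N, a lower supply voltage lowers V_CC, and slower, less leaky transistors lower the leakage term.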
An Integrated Circuit/Architecture Approach to Reducing Leakage in Deep-Submicron High-Performance I-Caches
2001
Cited by 103 (6 self)
Deep-submicron CMOS designs maintain high transistor switching speeds by scaling down the supply voltage and proportionately reducing the transistor threshold voltage. Lowering the threshold voltage increases leakage energy dissipation due to subthreshold leakage current even when the transistor is not switching. Estimates suggest a five-fold increase in leakage energy in every future generation. In modern microarchitectures, much of the leakage energy is dissipated in large on-chip cache memory structures with high transistor densities. While cache utilization varies both within and across applications, modern cache designs are fixed in size, resulting in transistor leakage inefficiencies. This paper
Reconfigurable Caches and their Application to Media Processing
In Proceedings of the 27th Annual International Symposium on Computer Architecture, 2000
Cited by 91 (4 self)
High performance general-purpose processors are increasingly being used for a variety of application domains -- scientific, engineering, databases, and more recently, media processing. It is therefore important to ensure that architectural features that use a significant fraction of the on-chip transistors are applicable across these different domains. For example, current processor designs often devote the largest fraction of on-chip transistors (up to 80%) to caches. Many workloads, however, do not make effective use of large caches; e.g., media processing workloads which often have streaming data access patterns and large working sets. This paper proposes a new reconfigurable cache design. This design enables the cache SRAM arrays to be dynamically divided into multiple partitions that can be used for different processor activities. These activities can benefit applications that would otherwise not use the storage allocated to large conventional caches. Our design involves relativ...
Power and Energy Reduction Via Pipeline Balancing
In International Symposium on Computer Architecture, 2001
Cited by 86 (4 self)
Minimizing power dissipation is an important design requirement for both portable and non-portable systems. In this work, we propose an architectural solution to the power problem that retains performance while reducing power. The technique, known as Pipeline Balancing (PLB), dynamically tunes the resources of a general purpose processor to the needs of the program by monitoring performance within each program. We analyze metrics for triggering PLB, and detail instruction queue design and energy savings based on an extension of the Alpha 21264 processor. Using a detailed simulator, we present component and full chip power and energy savings for single and multi-threaded execution. Results show an issue queue and execution unit power reduction of up to 23% and 13%, respectively, with an average performance loss of 1% to 2%.
Reducing Power Requirements of Instruction Scheduling Through Dynamic Allocation of Multiple Datapath Resources
In Proc. of MICRO-34, 2001
Cited by 81 (12 self)
The “one-size-fits-all” philosophy used for permanently allocating datapath resources in today’s superscalar CPUs to maximize performance across a wide range of applications results in the overcommitment of resources in general. To reduce power dissipation in the datapath, the resource allocations can be dynamically adjusted based on the demands of applications. We propose a mechanism to dynamically, simultaneously and independently adjust the sizes of the issue queue (IQ), the reorder buffer (ROB) and the load/store queue (LSQ) based on the periodic sampling of their occupancies to achieve significant power savings with minimal impact on performance. Resource upsizing is done more aggressively (compared to downsizing) using the relative rate of blocked dispatches to limit the performance penalty. Our results are validated by the execution of the SPEC 95 benchmark suite on a substantially modified version of the SimpleScalar simulator, where the IQ, the ROB, the LSQ and the register files are implemented as separate structures, as is the case with most practical implementations. For the SPEC 95 benchmarks, the use of our technique in a 4-way superscalar processor results in power savings in excess of 70% within individual components and an average power savings of 53% for the IQ, LSQ and ROB combined for the entire benchmark suite, with an average performance penalty of only 5%.
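The resizing mechanism described above can be pictured as a small per-structure decision made at the end of each sampling period. The following sketch is an assumption-laden illustration rather than the paper's algorithm: the function name `resize_step`, the step size, the 5% blocked-dispatch threshold, and the rule that upsizing is checked before downsizing are all hypothetical stand-ins for "upsizing is done more aggressively".

```python
def resize_step(size, avg_occupancy, blocked_rate, *, step, min_size, max_size,
                blocked_threshold=0.05):
    """Return the new size of one structure (IQ, ROB or LSQ) after a sample."""
    # Assumed rule: frequent blocked dispatches trigger upsizing first,
    # so growth always takes priority over shrinking.
    if blocked_rate > blocked_threshold:
        return min(size + step, max_size)
    # Assumed rule: shrink only if average occupancy fits in one step less.
    if avg_occupancy <= size - step:
        return max(size - step, min_size)
    return size


# Each structure is adjusted independently with its own (hypothetical) samples.
limits = dict(step=8, min_size=8, max_size=64)
iq_size = resize_step(32, avg_occupancy=10.0, blocked_rate=0.0, **limits)
rob_size = resize_step(32, avg_occupancy=30.0, blocked_rate=0.20, **limits)
lsq_size = resize_step(32, avg_occupancy=28.0, blocked_rate=0.0, **limits)
```

With these sample values the lightly used queue shrinks by one step, the queue that is blocking dispatch grows by one step, and the nearly full queue keeps its size.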
Positional Adaptation of Processors: Application to Energy Reduction
In International Symposium on Computer Architecture, 2003
Cited by 69 (2 self)
Although adaptive processors can exploit application variability to improve performance or save energy, effectively managing their adaptivity is challenging. To address this problem, we introduce a new approach to adaptivity: the Positional approach. In this approach, both the testing of configurations and the application of the chosen configurations are associated with particular code sections. This is in contrast to the currently-used Temporal approach to adaptation, where both the testing and application of configurations are tied to successive intervals in time.
Adaptive Mode Control: A Static-Power-Efficient Cache Design
2001
Cited by 57 (0 self)
Lower threshold voltages in deep sub-micron technologies cause more leakage current, increasing static power dissipation. This trend, combined with the trend of larger/more cache memories dominating die area, has prompted circuit designers to develop SRAM cells with low-leakage operating modes (e.g., sleep mode). Sleep mode reduces static power dissipation but data stored in a sleeping cell is unreliable or lost. So, at the architecture level, there is interest in exploiting sleep mode to reduce static power dissipation while maintaining high performance. Current

