Results 1 - 10
of
246
Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures
, 2000
"... The doubling of microprocessor performance every three years has been the result of two factors: more transistors per chip and superlinear scaling of the processor clock with technology generation. Our results show that, due to both diminishing improvements in clock rates and poor wire scaling as se ..."
Abstract
-
Cited by 264 (22 self)
- Add to MetaCart
The doubling of microprocessor performance every three years has been the result of two factors: more transistors per chip and superlinear scaling of the processor clock with technology generation. Our results show that, due to both diminishing improvements in clock rates and poor wire scaling as semiconductor devices shrink, the achievable performance growth of conventional microarchitectures will slow substantially. In this paper, we describe technology-driven models for wire capacitance, wire delay, and microarchitectural component delay. Using the results of these models, we measure the simulated performance---estimating both clock rate and IPC--- of an aggressive out-of-order microarchitecture as it is scaled from a 250nm technology to a 35nm technology. We perform this analysis for three clock scaling targets and two microarchitecture scaling strategies: pipeline scaling and capacity scaling. We find that no scaling strategy permits annual performance improvements of better than 12.5%, which is far worse than the annual 50-60% to which we have grown accustomed. 1
Selective Cache Ways: On-Demand Cache Resource Allocation
, 2000
"... Increasing levels of microprocessor power dissipation call for new approaches at the architectural level that save energy by better matching of on-chip resources to application requirements. Selective cache ways provides the ability to disable a subset of the ways in a set associative cache durin ..."
Abstract
-
Cited by 228 (7 self)
- Add to MetaCart
Increasing levels of microprocessor power dissipation call for new approaches at the architectural level that save energy by better matching of on-chip resources to application requirements. Selective cache ways provides the ability to disable a subset of the ways in a set associative cache during periods of modest cache activity, while the full cache may remain operational for more cache-intensive periods. Because this approach leverages the subarray partitioning that is already present for performance reasons, only minor changes to a conventional cache are required, and therefore, full-speed cache operation can be maintained. Furthermore, the tradeoff between performance and energy is flexible, and can be dynamically tailored to meet changing application and machine environmental conditions. We show that trading off a small performance degradation for energy savings can produce a significant reduction in cache energy dissipation using this approach. 1. Introduction Contin...
Modeling the effect of technology trends on the soft error rate of combinational logic
, 2002
"... This paper examines the effect of technology scaling and microarchitectural trends on the rate of soft errors in CMOS memory and logic circuits. We describe and validate an end-to-end model that enables us to compute the soft error rates (SER) for existing and future microprocessor-style designs. Th ..."
Abstract
-
Cited by 216 (5 self)
- Add to MetaCart
This paper examines the effect of technology scaling and microarchitectural trends on the rate of soft errors in CMOS memory and logic circuits. We describe and validate an end-to-end model that enables us to compute the soft error rates (SER) for existing and future microprocessor-style designs. The model captures the effects of two important masking phenomena, electrical masking and latchingwindow masking, which inhibit soft errors in combinational logic. We quantify the SER in combinational logic and latches for feature sizes from 600nm to 50nm and clock rates from 16 to 6 fan-out-of-4 delays. Our model predicts that the SER per chip of logic circuits will increase eight orders of magnitude by the year 2011 and at that point will be comparable to the SER per chip of unprotected memory elements. Our result emphasizes the need for computer system designers to address the risks of SER in logic circuits in future designs. 1
An adaptive, nonuniform cache structure for wire-delay dominated on-chip caches
- In International Conference on Architectural Support for Programming Languages and Operating Systems
, 2002
"... Growing wire delays will force substantive changes in the designs of large caches. Traditional cache architectures assume that each level in the cache hierarchy has a single, uniform access time. Increases in on-chip communication delays will make the hit time of large on-chip caches a function of a ..."
Abstract
-
Cited by 199 (34 self)
- Add to MetaCart
Growing wire delays will force substantive changes in the designs of large caches. Traditional cache architectures assume that each level in the cache hierarchy has a single, uniform access time. Increases in on-chip communication delays will make the hit time of large on-chip caches a function of a line's physical location within the cache. Consequently, cache access times will become a continuum of latencies rather than a single discrete latency. This nonuniformity can be exploited to provide faster access to cache lines in the portions of the cache that reside closer to the processor. In this paper, we evaluate a series of cache designs that provides fast hits to multi-megabyte cache memories. We first propose physical designs for these Non-Uniform Cache Architectures (NUCAs). We extend these physical designs with logical policies that allow important data to migrate toward the processor within the same level of the cache. We show that, for multi-megabyte level-two caches, an adaptive, dynamic NUCA design achieves 1.5 times the IPC of a Uniform Cache Architecture of any size, outperforms the best static NUCA scheme by 11%, outperforms the best three-level hierarchywhile using less silicon area-by 13%, and comes within 13 % of an ideal minimal hit latency solution. 1.
Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures
, 2000
"... Conventional microarchitectures choose a single memory hierarchy design point targeted at the average application. In this paper, we propose a cache and TLB layout and design that leverages repeater insertion to provide dynamic low-cost configurability trading off size and speed on a per application ..."
Abstract
-
Cited by 130 (17 self)
- Add to MetaCart
Conventional microarchitectures choose a single memory hierarchy design point targeted at the average application. In this paper, we propose a cache and TLB layout and design that leverages repeater insertion to provide dynamic low-cost configurability trading off size and speed on a per application phase basis. A novel configuration management algorithm dynamically detects phase changes and reacts to an application’s hit and miss intolerance in order to improve memory hierarchy performance while taking energy consumption into consideration. When applied to a two-level cache and TLB hierarchy at 0.1 £ m technology, the result is an average 15 % reduction in cycles per instruction (CPI), corresponding to an average 27 % reduction in memory-CPI, across a broad class of applications compared to the best conventional two-level hierarchy of comparable size. Projecting to sub-.1 £ m technology design considerations that call for a three-level conventional cache hierarchy for performance reasons, we demonstrate that a configurable L2/L3 cache hierarchy coupled with a conventional L1 results in an average 43 % reduction in memory hierarchy energy in addition to improved performance.
Runahead execution: An alternative to very large instruction windows for out-of-order processors
- In HPCA-9
, 2003
"... Today’s high performance processors tolerate long latency operations by means of out-of-order execution. However, as latencies increase, the size of the instruction window must increase even faster if we are to continue to tolerate these latencies. We have already reached the point where the size of ..."
Abstract
-
Cited by 123 (19 self)
- Add to MetaCart
Today’s high performance processors tolerate long latency operations by means of out-of-order execution. However, as latencies increase, the size of the instruction window must increase even faster if we are to continue to tolerate these latencies. We have already reached the point where the size of an instruction window that can handle these latencies is prohibitively large, in terms of both design complexity and power consumption. And, the problem is getting worse. This paper proposes runahead execution as an effective way to increase memory latency tolerance in an out-of-order processor, without requiring an unreasonably large instruction window. Runahead execution unblocks the instruction window blocked by long latency operations allowing the processor to execute far ahead in the program path. This results in data being prefetched into caches long before it is needed. On a machine model based on the Intel R ○ Pentium R ○ 4 processor, having a 128-entry instruction window, adding runahead execution improves the IPC (Instructions Per Cycle) by 22 % across a wide range of memory intensive applications. Also, for the same machine model, runahead execution combined with a 128-entry window performs within 1 % of a machine with no runahead execution and a 384-entry instruction window. 1.
Increasing processor performance by implementing deeper pipelines
- In Proceedings of the 29th International Symposium on Computer Architecture (ISCA-29
, 2002
"... One architectural method for increasing processor performance involves increasing the frequency by implementing deeper pipelines. This paper will explore the relationship between performance and pipeline depth using a Pentium ® 4 processor like architecture as a baseline and will show that deeper pi ..."
Abstract
-
Cited by 122 (0 self)
- Add to MetaCart
One architectural method for increasing processor performance involves increasing the frequency by implementing deeper pipelines. This paper will explore the relationship between performance and pipeline depth using a Pentium ® 4 processor like architecture as a baseline and will show that deeper pipelines can continue to increase performance. This paper will show that the branch misprediction latency is the single largest contributor to performance degradation as pipelines are stretched, and therefore branch prediction and fast branch recovery will continue to increase in importance. We will also show that higher performance cores, implemented with longer pipelines for example, will put more pressure on the memory system, and therefore require larger on-chip caches. Finally, we will show that in the same process technology, designing deeper pipelines can increase the processor frequency by 100%, which, when combined with larger on-chip caches can yield performance improvements of 35 % to 90 % over a Pentium ® 4 like processor. 1.
Measuring Experimental Error in Microprocessor Simulation
, 2001
"... We measure the ex4fl840E468 error that arises from the use of non-validated simulators in computer architecture research, with the goal of increasing the rigor of simulation -based studies. We describe the methodology that we used to validate a microprocessor simulator against a Compaq DS-10L work ..."
Abstract
-
Cited by 105 (28 self)
- Add to MetaCart
We measure the ex4fl840E468 error that arises from the use of non-validated simulators in computer architecture research, with the goal of increasing the rigor of simulation -based studies. We describe the methodology that we used to validate a microprocessor simulator against a Compaq DS-10L workstation, which contains an Alpha 21264 processor. Our evaluation suite consists of a set of 21 microbenchmarks that stress different aspects of the 21264 microarchitecture. Using the microbenchmark suite as the set of workloads, we describe how we reduced our simulator error to an arithmetic mean of 2%, and include details about the specific aspects of the pipeline that required ex8 a care to reduce the error. We show how these low-level optimizations reduce average error from 40% to less than 20% on macrobenchmarks drawn from the SPEC2000 suite. Finally, we exI837 the degree to which performance optimizations are stable across different simulators, showing that researchers would draw differ...
A Design Space Evaluation of Grid Processor Architectures
, 2001
"... In this paper, we survey the design space of a new class of architec-tures called Grid Processor Architectures (GPAs). These architectures are designed to scale with technology, allowing faster clock rates than conventional architectures while providing superior instruction-level parallelism on trad ..."
Abstract
-
Cited by 100 (31 self)
- Add to MetaCart
In this paper, we survey the design space of a new class of architec-tures called Grid Processor Architectures (GPAs). These architectures are designed to scale with technology, allowing faster clock rates than conventional architectures while providing superior instruction-level parallelism on traditional workloads and high performance across a range of application classes. A GPA consists of an array of ALUs, each with limited control, connected by a thin operand network. Pro-grams are executed by mapping blocks of statically scheduled instruc-tions to the ALU array and executing them dynamically in dataflow or-der This organization enables the critical paths of instruction blocks to be executed on chains of ALUs without transmitting temporary val-ues back to the register file, avoiding most of the large, unscalable structures that limit the scalability of conventional architectures. Fi-nally, we present simulation results of a preliminary design, the GPA-1. With a half-cycle routing delay, we obtain performance roughly equal to an ideal 8-way, 512-entry window superscalar core. With no inter-ALU delay, perfect memory, and perfect branch prediction, the 1PC of the GPA-1 is more than twice that of the ideal superscalar core, achieving an average of 11 IPC across nine SPEC CPU2000 and Mediabench benchmarks.
Register Organization for Media Processing
- HPCA6
, 2000
"... Processor architectures with tens to hundreds of arithmetic units are emerging to handle media processing applications. These applications, such as image coding, image synthesis, and image understanding, require arithmetic rates of up to 10 11 operations per second. As the number of arithmetic uni ..."
Abstract
-
Cited by 94 (10 self)
- Add to MetaCart
Processor architectures with tens to hundreds of arithmetic units are emerging to handle media processing applications. These applications, such as image coding, image synthesis, and image understanding, require arithmetic rates of up to 10 11 operations per second. As the number of arithmetic units in a processor increases to meet these demands, register storage and communication between the arithmetic units dominate the area, delay, and power of the arithmetic units. In this paper we show that partitioning the register file along three axes reduces the cost of register storage and communication without significantly impacting performance. We develop a taxonomy of register architectures by partitioning across the data-parallel, instruction-level parallel, and memory hierarchy axes, and by optimizing the hierarchical register organization to operate on streams of data. Compared to a centralized global register file, the most compact of these organizations reduces the register file ar...

