A Large, Fast Instruction Window for Tolerating Cache Misses
Cited by 109 (1 self)
Instruction window size is an important design parameter for many modern processors. Large instruction windows offer the potential advantage of exposing large amounts of instruction level parallelism. Unfortunately, naively scaling conventional window designs can significantly degrade clock cycle time, undermining the benefits of increased parallelism. This paper presents a new instruction window design targeted at achieving the latency tolerance of large windows with the clock cycle time of small windows. The key observation is that instructions dependent on a long latency operation (e.g., cache miss) cannot execute until that source operation completes. These instructions are moved out of the conventional, small, issue queue to a much larger waiting instruction buffer (WIB). When the long latency operation completes, the instructions are reinserted into the issue queue. In this paper, we focus specifically on load cache misses and their dependent instructions. Simulations reveal that, for an 8-way processor, a 2K-entry WIB with a 32-entry issue queue can achieve speedups of 20%, 84%, and 50% over a conventional 32-entry issue queue for a subset of the SPEC CINT2000, SPEC CFP2000, and Olden benchmarks, respectively.
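The park-and-reinsert behavior described in this abstract can be sketched as a toy model. This is a minimal illustration under stated assumptions, not the paper's design: the class `WibModel`, its method names, and the rule that reinsertion stops when the issue queue is full are all hypothetical, and the missing load is assumed to have already issued and left the queue.

```python
from collections import defaultdict


class WibModel:
    """Toy model: a small issue queue backed by a large waiting instruction buffer."""

    def __init__(self, iq_capacity=32):
        self.iq_capacity = iq_capacity
        self.issue_queue = []          # instructions eligible for wakeup/select
        self.wib = defaultdict(list)   # missing load -> parked dependent instructions

    def dispatch(self, instr):
        """Insert an instruction into the issue queue; fail if the queue is full."""
        if len(self.issue_queue) >= self.iq_capacity:
            return False
        self.issue_queue.append(instr)
        return True

    def on_load_miss(self, load, dependents):
        """Move queued instructions that depend on the missing load into the WIB."""
        waiting = set(dependents)
        moved = [i for i in self.issue_queue if i in waiting]
        self.issue_queue = [i for i in self.issue_queue if i not in waiting]
        self.wib[load].extend(moved)
        return moved

    def on_miss_complete(self, load):
        """Reinsert parked instructions while the issue queue has room (assumed policy)."""
        parked = self.wib.pop(load, [])
        room = self.iq_capacity - len(self.issue_queue)
        reinserted, remaining = parked[:room], parked[room:]
        self.issue_queue.extend(reinserted)
        if remaining:
            self.wib[load] = remaining   # still waiting for a free slot
        return reinserted


# Hypothetical usage with a 4-entry issue queue.
model = WibModel(iq_capacity=4)
for name in ["a", "b", "c", "d"]:
    model.dispatch(name)
full = model.dispatch("e")                  # rejected: the queue is full
moved = model.on_load_miss("ld", ["a", "b"])
accepted = model.dispatch("e")              # parking a and b freed two slots
reinserted = model.on_miss_complete("ld")   # only one slot is free at this point
```

Parking the dependents of a miss frees issue-queue slots for independent instructions, which is the effect the abstract attributes to the WIB.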
Energy-Efficient Processor Design Using Multiple Clock Domains with Dynamic Voltage and Frequency Scaling
In Proceedings of the 8th International Symposium on High-Performance Computer Architecture, 2002
Cited by 108 (13 self)
As clock frequency increases and feature size decreases, clock distribution and wire delays present a growing challenge to the designers of singly-clocked, globally synchronous systems. We describe an alternative approach, which we call a Multiple Clock Domain (MCD) processor, in which the chip is divided into several (coarse-grained) clock domains, within which independent voltage and frequency scaling can be performed. Boundaries between domains are chosen to exploit existing queues, thereby minimizing inter-domain synchronization costs. We propose four clock domains, corresponding to the front end (including L1 instruction cache), integer units, floating point units, and load-store units (including L1 data cache and L2 cache). We evaluate this design using a simulation infrastructure based on SimpleScalar and Wattch. In an attempt to quantify potential energy savings independent of any particular on-line control strategy, we use off-line analysis of traces from a single-speed run of each of our benchmark applications to identify profitable reconfiguration points for a subsequent dynamic scaling run. Dynamic runs incorporate a detailed model of inter-domain synchronization delays, with latencies for intra-domain scaling similar to the whole-chip scaling latencies of Intel XScale and Transmeta LongRun technologies. Using applications from the MediaBench, Olden, and SPEC2000 benchmark suites, we obtain an average energy-delay product improvement of 20% with MCD compared to a modest 3% savings from voltage scaling a single clock and voltage system.
Profile-based dynamic voltage and frequency scaling for a multiple clock domain microprocessor
In Proceedings of the International Symposium on Computer Architecture, 2003
Cited by 72 (9 self)
A Multiple Clock Domain (MCD) processor addresses the challenges of clock distribution and power dissipation by dividing a chip into several (coarse-grained) clock domains, allowing frequency and voltage to be reduced in domains that are not currently on the application’s critical path. Given a reconfiguration mechanism capable of choosing appropriate times and values for voltage/frequency scaling, an MCD processor has the potential to achieve significant energy savings with low performance degradation. Early work on MCD processors evaluated the potential for energy savings by manually inserting reconfiguration instructions into applications, or by employing an oracle driven by off-line analysis of (identical) prior program runs. Subsequent work developed a hardware-based on-line mechanism that averages 75–85% of the energy-delay improvement achieved via off-line analysis. In this paper we consider the automatic insertion of reconfiguration instructions into applications, using profile-driven binary rewriting. Profile-based reconfiguration introduces the need for “training runs” prior to production use of a given application, but avoids the hardware complexity of on-line reconfiguration. It also has the potential to yield significantly greater energy savings. Experimental results (training on small data sets and then running on larger, alternative data sets) indicate that the profile-driven approach is more stable than hardware-based reconfiguration, and yields virtually all of the energy-delay improvement achieved via off-line analysis.
Dynamically managing the communication-parallelism trade-off in future clustered processors
In Proceedings of International Symposium on Computer Architecture, 2003
Cited by 57 (15 self)
Clustered microarchitectures are an attractive alternative to large monolithic superscalar designs due to their potential for higher clock rates in the face of increasingly wire-delay-constrained process technologies. As increasing transistor counts allow an increase in the number of clusters, thereby allowing more aggressive use of instruction-level parallelism (ILP), the inter-cluster communication increases as data values get spread across a wider area. As a result of the emergence of this trade-off between communication and parallelism, a subset of the total on-chip clusters is optimal for performance. To match the hardware to the application’s needs, we use a robust algorithm to dynamically tune the clustered architecture. The algorithm, which is based on program metrics gathered at periodic intervals, achieves an 11% performance improvement on average over the best statically defined architecture. We also show that the use of additional hardware and reconfiguration at basic block boundaries can achieve average improvements of 15%. Our results demonstrate that reconfiguration provides an effective solution to the communication and parallelism trade-off inherent in the communication-bound processors of the future.
Slack: Maximizing Performance Under Technological Constraints
2002
Cited by 55 (3 self)
Many emerging processor microarchitectures seek to manage technological constraints (e.g., wire delay, power, and circuit complexity) by resorting to non-uniform designs that provide resources at multiple quality levels (e.g., fast/slow bypass paths, multi-speed functional units, and grid architectures). In such designs, the constraint problem becomes a control problem, and the challenge becomes designing a control policy that mitigates the performance penalty of the non-uniformity. Given the increasing importance of non-uniform control policies, we believe it is appropriate to examine them in their own right. To this end, we develop slack for use in creating control policies that match program execution behavior to machine design. Intuitively, the slack of a dynamic instruction i is the number of cycles i can be delayed with no effect on execution time. This property makes slack a natural candidate for hiding non-uniform latencies. We make three contributions in our exploration of slack. First, we formally define slack, distinguish three variants (local, global and apportioned), and perform a limit study to show that slack is prevalent in our SPEC2000 workload. Second, we show how to predict slack in hardware. Third, we illustrate how to create a control policy based on slack for steering instructions among fast (high power) and slow (lower power) pipelines.
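The intuitive definition of slack given in this abstract can be illustrated on a small dependence graph. This is a minimal sketch assuming a dataflow-only model (fixed latencies, unlimited resources); the function `global_slack` and the example graph are hypothetical and are not the paper's formal definition of the local, global, or apportioned variants.

```python
from collections import defaultdict


def global_slack(latency, deps):
    """latency: {instr: cycles}; deps: {instr: [producer instrs]}.

    Returns {instr: cycles the instruction can be delayed without
    lengthening total execution time}, under a dataflow-only model.
    """
    # Topological order: every producer appears before its consumers.
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for producer in deps.get(node, []):
            visit(producer)
        order.append(node)

    for node in latency:
        visit(node)

    # Earliest finish time of each instruction.
    finish = {}
    for node in order:
        start = max((finish[p] for p in deps.get(node, [])), default=0)
        finish[node] = start + latency[node]
    total = max(finish.values())

    # Latest finish time that still lets every consumer finish on time.
    consumers = defaultdict(list)
    for node, producers in deps.items():
        for producer in producers:
            consumers[producer].append(node)
    latest = {}
    for node in reversed(order):
        latest[node] = min(
            (latest[c] - latency[c] for c in consumers[node]), default=total
        )

    return {node: latest[node] - finish[node] for node in latency}


# Hypothetical example: a long-latency load feeds mul; add is off the critical path.
latency = {"load": 10, "add": 1, "mul": 3, "store": 1}
deps = {"mul": ["load"], "store": ["mul", "add"]}
slack = global_slack(latency, deps)
```

In this example only `add` has non-zero slack, so a slack-based policy could send it to a slow, lower-power pipeline without lengthening execution.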
Reducing Power with Dynamic Critical Path Information
2001
Cited by 51 (2 self)
Recent research has shown that dynamic information regarding instruction criticality can be used to increase microprocessor performance. Critical path information can also be used in processors to achieve a better balance of power and performance. This paper uses the output of a dynamic critical path predictor to decrease the power consumption of key portions of the processor without incurring a corresponding decrease in performance. The optimizations include effective use of functional units with different power and latency characteristics and decreased issue logic power.
Joint Local and Global Hardware Adaptations for Energy
ACM SIGARCH Computer Architecture News, 2002
Cited by 50 (8 self)
This work concerns algorithms to control energy-driven architecture adaptations for multimedia applications, without and with dynamic voltage scaling (DVS). We identify a broad design space for adaptation control algorithms based on two attributes: (1) when to adapt, or temporal granularity, and (2) what structures to adapt, or spatial granularity. For each attribute, adaptation may be global or local. Our previous work developed a temporally and spatially global algorithm. It invokes adaptation at the granularity of a full frame of a multimedia application (temporally global) and considers the entire hardware configuration at a time (spatially global). It exploits inter-frame execution time variability, slowing computation just enough to eliminate idle time before the real-time deadline. This paper explores temporally and spatially local algorithms and their integration with the previous global algorithm. The local algorithms invoke architectural adaptation within an application frame to exploit intra-frame execution variability, and attempt to save energy without affecting execution time. We consider local algorithms previously studied for non-real-time applications as well as propose new algorithms. We find that, for systems without and with DVS, the local algorithms are effective in saving energy for multimedia applications, but the new integrated global and local algorithm is best for the systems and applications studied.
Hierarchical scheduling windows
In Proceedings of the International Symposium on Microarchitecture, 2002
Using Interaction Costs for Microarchitectural Bottleneck Analysis
Abstract appears in 36th International Symposium on Microarchitecture (MICRO ’03), 2003
Cited by 36 (2 self)
Attacking bottlenecks in modern processors is difficult because many microarchitectural events overlap with each other. This parallelism makes it difficult to both (a) assign a cost to an event (e.g., to one of two overlapping cache misses) and (b) assign blame for each cycle (e.g., for a cycle where many, overlapping resources are active). This paper introduces a new model for understanding event costs to facilitate processor design and optimization. First, we observe that everything in a machine (instructions, hardware structures, events) can interact in only one of two ways (in parallel or serially). We quantify these interactions by defining interaction cost, which can be zero (independent, no interaction), positive (parallel), or negative (serial). Second, we illustrate the value of using interaction costs in processor design and optimization. Finally, we propose performance-monitoring hardware for measuring interaction costs that is suitable for modern processors.
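The sign convention in this abstract can be checked with a small calculation. The abstract does not give a formula, so this sketch assumes the interaction cost of two events is cost({a, b}) - cost(a) - cost(b), where the cost of a set of events is the execution time saved by idealizing them (making them take zero cycles). The function names and the two toy timing models are hypothetical.

```python
def interaction_cost(exec_time, a, b):
    """exec_time(idealized) -> cycles when the events in `idealized` take zero time."""
    base = exec_time(frozenset())

    def cost(events):
        return base - exec_time(frozenset(events))

    return cost({a, b}) - cost({a}) - cost({b})


def overlapping_misses(idealized):
    """Two 100-cycle cache misses that fully overlap: time is the longer of the two."""
    latencies = {"m1": 100, "m2": 100}
    return max(0 if e in idealized else cycles for e, cycles in latencies.items())


def chain_with_side_path(idealized):
    """Two 100-cycle events in series, racing an independent 150-cycle path."""
    latencies = {"e1": 100, "e2": 100}
    chain = sum(0 if e in idealized else cycles for e, cycles in latencies.items())
    return max(chain, 150)


parallel_icost = interaction_cost(overlapping_misses, "m1", "m2")
serial_icost = interaction_cost(chain_with_side_path, "e1", "e2")
```

Removing either overlapping miss alone saves nothing while removing both saves 100 cycles, giving a positive (parallel) interaction cost. In the serial chain, each event alone saves 50 cycles but both together also save only 50, giving a negative (serial) interaction cost.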
Application-Aware Prioritization Mechanisms for On-Chip Networks
Cited by 31 (14 self)
Network-on-Chips (NoCs) are likely to become a critical shared resource in future many-core processors. The challenge is to develop policies and mechanisms that enable multiple applications to efficiently and fairly share the network, to improve system performance. Existing local packet scheduling policies in the routers fail to fully achieve this goal, because they treat every packet equally, regardless of which application issued the packet. This paper proposes prioritization policies and architectural extensions to NoC routers that improve the overall application-level throughput, while ensuring fairness in the network. Our prioritization policies are application-aware, distinguishing applications based on the stall-time criticality of their packets. The idea is to divide processor execution time into phases, rank applications within a phase based on stall-time criticality, and have all routers in the network prioritize packets based on their applications’ ranks. Our scheme also includes techniques that ensure starvation freedom and enable the enforcement of system-level application priorities. We evaluate the proposed prioritization policies on a 64-core CMP with an 8x8 mesh NoC, using a suite of 35 diverse applications. For a representative set of case studies, our proposed policy increases average system throughput by 25.6% over age-based arbitration and 18.4% over round-robin arbitration. Averaged over 96 randomly generated multiprogrammed workload mixes, the proposed policy improves system throughput by 9.1% over the best existing prioritization policy, while also reducing application-level unfairness.
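The rank-then-prioritize idea in this abstract can be sketched in a few lines. This is a minimal sketch under stated assumptions: the per-phase criticality scores, the application names, the packet fields, and the age-based tie-break are all hypothetical, and the starvation-freedom techniques the abstract mentions are not modeled.

```python
def rank_applications(criticality):
    """Return {app: rank} for one phase; rank 0 is the most stall-time-critical app.

    `criticality` maps each application to an assumed per-phase score,
    where a higher score means the application's packets are more critical.
    """
    ordered = sorted(criticality, key=lambda app: (-criticality[app], app))
    return {app: rank for rank, app in enumerate(ordered)}


def arbitrate(packets, ranks):
    """Pick the packet whose application ranks best; break ties by oldest arrival."""
    return min(packets, key=lambda p: (ranks[p["app"]], p["arrival"]))


# Hypothetical phase: application B is the most critical, then C, then A.
phase_criticality = {"A": 0.2, "B": 0.9, "C": 0.5}
ranks = rank_applications(phase_criticality)

# Packets contending for one router output port in the same cycle.
contenders = [
    {"app": "A", "arrival": 1},
    {"app": "C", "arrival": 3},
    {"app": "B", "arrival": 7},
    {"app": "B", "arrival": 5},
]
winner = arbitrate(contenders, ranks)
```

Because every router applies the same ranks for the whole phase, a packet from the top-ranked application wins arbitration at each hop, whereas a purely age-based arbiter would have chosen the packet from application A here.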