Results 1 - 10 of 278
Limits of instruction-level parallelism
, 1991
"... research relevant to the design and application of high performance scientific computers. We test our ideas by designing, building, and using real systems. The systems we build are research prototypes; they are not intended to become products. There two other research laboratories located in Palo Al ..."
Abstract
-
Cited by 403 (7 self)
… research relevant to the design and application of high-performance scientific computers. We test our ideas by designing, building, and using real systems. The systems we build are research prototypes; they are not intended to become products. There are two other research laboratories located in Palo Alto, the Network Systems …
Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power
- in Proceedings of the 28th International Symposium on Computer Architecture
, 2001
"... Power dissipation is increasingly important in CPUs ranging from those intended for mobile use, all the way up to highperformance processors for high-end servers. While the bulk of the power dissipated is dynamic switching power, leakage power is also beginning to be a concern. Chipmakers expect tha ..."
Abstract
-
Cited by 280 (26 self)
Power dissipation is increasingly important in CPUs ranging from those intended for mobile use, all the way up to high-performance processors for high-end servers. While the bulk of the power dissipated is dynamic switching power, leakage power is also beginning to be a concern. Chipmakers expect that in future chip generations, leakage’s proportion of total chip power will increase significantly. This paper examines methods for reducing leakage power within the cache memories of the CPU. Because caches comprise much of a CPU chip’s area and transistor count, they are reasonable targets for attacking leakage. We discuss policies and implementations for reducing cache leakage by invalidating and “turning off” cache lines when they hold data not likely to be reused. In particular, our approach is targeted at the generational nature of cache line usage: cache lines typically have a flurry of frequent use when first brought into the cache, and then have a period of “dead time” before they are evicted. By devising effective, low-power ways of deducing dead time, our results show that in many cases we can reduce L1 cache leakage energy by 4x in SPEC2000 applications without impacting performance. Because our decay-based techniques have notions of competitive on-line algorithms at their roots, their energy usage can be theoretically bounded to within a factor of two of the optimal oracle-based policy. We also examine adaptive decay-based policies that make energy-minimizing policy choices on a per-application basis by choosing appropriate decay intervals individually for each cache line. Our proposed adaptive policies effectively reduce L1 cache leakage energy by 5x for the SPEC2000 applications with only negligible degradation in performance.
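The decay idea is easy to picture in software: give each cache line an idle counter, reset it on every access, and invalidate ("turn off") any line whose counter passes a chosen decay interval. The sketch below is a minimal illustration of that policy under assumed parameters (a tiny direct-mapped cache, a fixed decay interval, and a made-up access pattern); it is not the paper's hardware design, which uses low-cost hierarchical counters and adaptive per-line intervals.

```cpp
// Minimal software sketch of cache decay (illustrative only; the paper
// describes a hardware mechanism, not this code).
#include <cstdint>
#include <cstdio>
#include <vector>

struct Line {
    bool     valid = false;
    uint64_t tag = 0;
    unsigned idle = 0;     // ticks since last access (estimate of "dead time")
};

class DecayCache {
public:
    DecayCache(size_t lines, unsigned decay_interval)
        : lines_(lines), interval_(decay_interval) {}

    // Returns true on hit. A miss (re)fills the line; any access resets idle time.
    bool access(uint64_t addr) {
        Line& l = lines_[addr % lines_.size()];
        bool hit = l.valid && l.tag == addr / lines_.size();
        l.valid = true;
        l.tag = addr / lines_.size();
        l.idle = 0;
        return hit;
    }

    // Called once per "tick"; lines idle longer than the decay interval are
    // invalidated so they no longer need to be kept powered.
    void tick() {
        for (Line& l : lines_) {
            if (l.valid && ++l.idle >= interval_) {
                l.valid = false;
                ++decayed_;
            }
        }
    }

    unsigned decayed() const { return decayed_; }

private:
    std::vector<Line> lines_;
    unsigned interval_;
    unsigned decayed_ = 0;
};

int main() {
    DecayCache cache(/*lines=*/64, /*decay_interval=*/64);
    unsigned hits = 0;
    for (int t = 0; t < 2000; ++t) {
        // Interleave a small hot set (reused often enough not to decay) with a
        // streaming pattern (each line goes dead and eventually decays).
        uint64_t addr = (t % 2 == 0) ? (t / 2) % 16 : 100000 + t;
        hits += cache.access(addr);
        cache.tick();
    }
    std::printf("hits=%u decayed=%u\n", hits, cache.decayed());
    return 0;
}
```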
Selective value prediction
- In 26th Annual International Symposium on Computer Architecture
, 1999
"... Value Prediction is a relatively new technique to increase instruction-level parallelism by breaking true data dependence chains. A value prediction architecture produces values, which may be later consumed by instructions that execute speculatively using the predicted value. This paper examines sel ..."
Abstract
-
Cited by 138 (13 self)
Value prediction is a relatively new technique to increase instruction-level parallelism by breaking true data dependence chains. A value prediction architecture produces values that may later be consumed by instructions that execute speculatively using the predicted value. This paper examines selective techniques for using value prediction in the presence of predictor capacity constraints and reasonable misprediction penalties. We examine prediction and confidence mechanisms in light of these constraints, and we minimize capacity conflicts through instruction filtering. The latter technique filters which instructions put values into the value prediction table. We examine filtering techniques based on instruction type, as well as giving priority to instructions belonging to the longest data dependence path in the processor’s active instruction window. We apply filtering both to the producers of predicted values and to their consumers. In addition, we examine the benefit of using different confidence levels for instructions using predicted values on the longest dependence path.
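As a rough illustration of these ideas, the sketch below combines a last-value table, a saturating confidence counter consulted before a prediction is used, and a filter flag that decides which instructions may allocate table entries. The class name, table organization, and thresholds are assumptions for the example, not the predictor evaluated in the paper.

```cpp
// Illustrative last-value predictor with confidence and simple filtering;
// not the paper's exact design.
#include <cstdint>
#include <cstdio>
#include <unordered_map>

struct Entry {
    uint64_t last_value = 0;
    int      confidence = 0;   // saturating counter
};

class ValuePredictor {
public:
    // Returns true and fills `value` only when confidence is high enough.
    bool predict(uint64_t pc, uint64_t& value) const {
        auto it = table_.find(pc);
        if (it == table_.end() || it->second.confidence < kThreshold) return false;
        value = it->second.last_value;
        return true;
    }

    // Filtering: only instructions the caller marks as worth predicting
    // (e.g. selected types, or those on the longest dependence path) may
    // allocate a new entry in the table.
    void update(uint64_t pc, uint64_t actual, bool allowed_by_filter) {
        auto it = table_.find(pc);
        if (it == table_.end()) {
            if (allowed_by_filter) table_[pc] = Entry{actual, 0};
            return;
        }
        Entry& e = it->second;
        if (e.last_value == actual) {
            if (e.confidence < kMax) ++e.confidence;   // value repeated: gain confidence
        } else {
            e.confidence = 0;                          // value changed: reset confidence
            e.last_value = actual;
        }
    }

private:
    static constexpr int kThreshold = 3;
    static constexpr int kMax = 7;
    std::unordered_map<uint64_t, Entry> table_;
};

int main() {
    ValuePredictor vp;
    uint64_t pc = 0x400100, v = 0;
    for (int i = 0; i < 10; ++i) {
        bool used = vp.predict(pc, v);
        vp.update(pc, /*actual=*/42, /*allowed_by_filter=*/true);
        std::printf("iter %d: %s\n", i, used ? "predicted 42" : "no prediction");
    }
    return 0;
}
```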
A Modified Approach to Data Cache Management
- In Proceedings of the 28th Annual International Symposium on Microarchitecture
, 1995
"... As processor performance continues to improve, more emphasis must be placed on the performance of the memory system. In this paper, a detailed characterization of data cache behavior for individual load instructions is given. We show that by selectively applying cache line allocation according the c ..."
Abstract
-
Cited by 127 (2 self)
As processor performance continues to improve, more emphasis must be placed on the performance of the memory system. In this paper, a detailed characterization of data cache behavior for individual load instructions is given. We show that by selectively applying cache line allocation according to the characteristics of individual load instructions, overall performance can be improved for both the data cache and the memory system. This approach can improve some aspects of memory performance by as much as 60 percent on existing executables. The average data access time is a measure of the time it takes to read a data item from memory. Since most programs need to access data, minimizing this term is crucial to achieving high performance. Unfortunately, access time to off-chip memory (measured in processor clock cycles) has increased dramatically as the disparity between main memory access times and processor clock speeds widens. Since there is no indication that dynamic memory …
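The average data access time minimized here is commonly modeled as hit time plus miss rate times miss penalty. The toy calculation below uses made-up numbers (not figures from the paper) to show how a selective-allocation policy that lowers the effective miss rate would shift that average.

```cpp
// Back-of-the-envelope model of average data access time (AMAT):
//   AMAT = hit_time + miss_rate * miss_penalty
// All numbers below are invented for illustration.
#include <cstdio>

int main() {
    const double hit_time = 1.0;       // cycles for an L1 hit
    const double miss_penalty = 50.0;  // cycles to reach main memory

    // Baseline: every load allocates a cache line.
    const double baseline_miss_rate = 0.10;

    // Hypothetical selective policy: poor-locality loads bypass allocation,
    // so they stop evicting useful lines and the miss rate drops (assumed).
    const double selective_miss_rate = 0.07;

    auto amat = [&](double miss_rate) { return hit_time + miss_rate * miss_penalty; };
    std::printf("baseline  AMAT = %.1f cycles\n", amat(baseline_miss_rate));
    std::printf("selective AMAT = %.1f cycles\n", amat(selective_miss_rate));
    return 0;
}
```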
Reducing indirect function call overhead in C++ programs
- In POPL ’94: Proceedings of the 21st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages
, 1994
"... Modern computer architectures increasingly depend on mechanisms that estimate fhture control flow decisions to increase performance. Mechanisms such as speculative execution and prefetching are becoming standard architectural mechanisms that rely on control flow prediction to prefetch and speculativ ..."
Abstract
-
Cited by 120 (5 self)
Modern computer architectures increasingly depend on mechanisms that estimate future control flow decisions to increase performance. Mechanisms such as speculative execution and prefetching are becoming standard architectural features that rely on control flow prediction to prefetch and speculatively execute future instructions. At the same time, computer programmers are increasingly turning to object-oriented languages to increase their productivity. These languages commonly use run-time dispatching to implement object polymorphism. Dispatching is usually implemented using an indirect function call, which presents challenges to existing control flow prediction techniques. We have measured the occurrence of indirect function calls in a collection of C++ programs. We show that, although it is more important to predict branches accurately, indirect call prediction is also an important factor in some programs and will grow in importance with the growth of object-oriented programming. We examine the improvement offered by compile-time optimization and static and dynamic prediction techniques, and demonstrate how compilers can use existing branch prediction mechanisms to improve performance in C++ programs. Using these methods with the programs we examined, the number of instructions between mispredicted breaks in control can be doubled on existing computers.
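The run-time dispatch being measured is ordinary C++ virtual dispatch: the call site loads a function pointer from the object's vtable and performs an indirect call whose target depends on the dynamic type. The tiny example below (with illustrative class names) shows the kind of call site whose targets such prediction techniques have to handle.

```cpp
// The call through s->area() compiles to an indirect call via the vtable, so
// the branch target varies with the dynamic type of *s.
#include <cstdio>
#include <memory>
#include <vector>

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;   // dispatched via an indirect call
};

struct Square : Shape {
    double s;
    explicit Square(double s) : s(s) {}
    double area() const override { return s * s; }
};

struct Circle : Shape {
    double r;
    explicit Circle(double r) : r(r) {}
    double area() const override { return 3.14159265 * r * r; }
};

int main() {
    std::vector<std::unique_ptr<Shape>> shapes;
    shapes.emplace_back(std::make_unique<Square>(2.0));
    shapes.emplace_back(std::make_unique<Circle>(1.0));

    double total = 0.0;
    for (const auto& s : shapes)
        total += s->area();   // indirect function call; target changes per object
    std::printf("total area = %f\n", total);
    return 0;
}
```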
The YAGS branch prediction scheme
- In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture
, 1998
"... The importance of an accurate branch prediction mechanism has been well documented. Since the introduction of gshare [1] and the observation that aliasing in the PHT is a major factor in reducing prediction accuracy [2,3,4,5], several schemes have been proposed to reduce aliasing in the PHT [6, 7, 8 ..."
Abstract
-
Cited by 113 (0 self)
The importance of an accurate branch prediction mechanism has been well documented. Since the introduction of gshare [1] and the observation that aliasing in the PHT is a major factor in reducing prediction accuracy [2, 3, 4, 5], several schemes have been proposed to reduce aliasing in the PHT [6, 7, 8, 9]. All these schemes aim to maximize prediction accuracy with the fewest resources. In this paper we introduce Yet Another Global Scheme (YAGS), a new scheme that reduces aliasing in the PHT by combining the strong points of several previous schemes. YAGS introduces tags into the PHT, which allow the PHT to be made smaller without sacrificing key branch outcome information; the size reduction more than offsets the cost of the tags. Our experimental results show that YAGS gives better …
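For context, the sketch below implements a gshare-style predictor, the baseline named above: a single PHT of 2-bit counters indexed by the branch PC XORed with a global history register. Two branches whose hashes collide share a counter, which is exactly the PHT aliasing that YAGS attacks. This is not an implementation of YAGS itself, and the sizes and history length are arbitrary.

```cpp
// gshare-style predictor sketch: PC xor global history indexes a PHT of
// 2-bit saturating counters. Colliding indices alias onto one counter.
#include <cstdint>
#include <cstdio>
#include <vector>

class Gshare {
public:
    explicit Gshare(unsigned index_bits)
        : pht_(1u << index_bits, 2 /* weakly taken */), mask_((1u << index_bits) - 1) {}

    bool predict(uint64_t pc) const { return pht_[index(pc)] >= 2; }

    void update(uint64_t pc, bool taken) {
        uint8_t& ctr = pht_[index(pc)];
        if (taken)  { if (ctr < 3) ++ctr; }
        else        { if (ctr > 0) --ctr; }
        history_ = ((history_ << 1) | (taken ? 1 : 0)) & mask_;
    }

private:
    size_t index(uint64_t pc) const { return (pc ^ history_) & mask_; }

    std::vector<uint8_t> pht_;   // pattern history table of 2-bit counters
    uint64_t mask_;
    uint64_t history_ = 0;       // global branch history register
};

int main() {
    Gshare bp(12);
    const uint64_t pc = 0x401000;
    int correct = 0;
    for (int i = 0; i < 100; ++i) {
        bool taken = (i % 4 != 3);            // branch taken 3 out of every 4 times
        correct += (bp.predict(pc) == taken);
        bp.update(pc, taken);
    }
    std::printf("correct: %d/100\n", correct);
    return 0;
}
```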
Using Hybrid Branch Predictors to Improve Branch Prediction Accuracy in the Presence of Context Switches
- in Proceedings of ISCA
, 1996
"... Pipeline stalls due to conditional branches represent one of the most significant impediments to realizing the performance potential of deeply pipelined, supers calar processors. Many branch predictors have been proposed to help aileviate this problem, including the Two-Level Adaptive Branch Predict ..."
Abstract
-
Cited by 109 (2 self)
Pipeline stalls due to conditional branches represent one of the most significant impediments to realizing the performance potential of deeply pipelined, superscalar processors. Many branch predictors have been proposed to help alleviate this problem, including the Two-Level Adaptive Branch Predictor and, more recently, two-component hybrid branch predictors. In a less idealized environment, such as a time-shared system, the code of interest is subject to context switches. Context switches, even at fairly large intervals, can seriously degrade the performance of many of the most accurate branch prediction schemes. In this paper, we introduce a new hybrid branch predictor and show that it is more accurate (for a given cost) than any previously published scheme, especially if the branch histories are periodically flushed due to the presence of context switches.
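A two-component hybrid of the general kind discussed here pairs two predictors with a chooser table that learns, per branch, which component to trust. The sketch below is a generic tournament-style arrangement with assumed table sizes and update rules, not the specific hybrid proposed in the paper.

```cpp
// Generic two-component hybrid: a bimodal table, a gshare-style table, and a
// per-branch chooser of 2-bit counters that selects between them.
#include <cstdint>
#include <cstdio>
#include <vector>

struct TwoBit {
    uint8_t c = 2;
    bool predict() const { return c >= 2; }
    void update(bool up) { if (up) { if (c < 3) ++c; } else { if (c > 0) --c; } }
};

class Hybrid {
public:
    explicit Hybrid(unsigned bits)
        : bimodal_(1u << bits), gshare_(1u << bits), chooser_(1u << bits),
          mask_((1u << bits) - 1) {}

    bool predict(uint64_t pc) const {
        bool p1 = bimodal_[pc & mask_].predict();
        bool p2 = gshare_[(pc ^ hist_) & mask_].predict();
        return chooser_[pc & mask_].predict() ? p2 : p1;   // chooser picks a component
    }

    void update(uint64_t pc, bool taken) {
        TwoBit& b = bimodal_[pc & mask_];
        TwoBit& g = gshare_[(pc ^ hist_) & mask_];
        bool pb = b.predict(), pg = g.predict();
        // Train the chooser only when the components disagree.
        if (pb != pg) chooser_[pc & mask_].update(/*toward gshare=*/pg == taken);
        b.update(taken);
        g.update(taken);
        hist_ = ((hist_ << 1) | (taken ? 1 : 0)) & mask_;
    }

private:
    std::vector<TwoBit> bimodal_, gshare_, chooser_;
    uint64_t mask_;
    uint64_t hist_ = 0;
};

int main() {
    Hybrid bp(12);
    int correct = 0;
    for (int i = 0; i < 1000; ++i) {
        bool taken = (i % 2 == 0);             // alternating branch: history helps
        correct += (bp.predict(0x400500) == taken);
        bp.update(0x400500, taken);
    }
    std::printf("correct: %d/1000\n", correct);
    return 0;
}
```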
Increasing the Instruction Fetch Rate via Multiple Branch Prediction and a Branch Address Cache
, 1993
"... High performance computer implementation today is increasingly directed toward parallelism in the hardware. Superscalar machines, where the hardware can issue more than one instruction each cycle, are being adopted by more implementations. As the trend toward wider issue rates continues, so too must ..."
Abstract
-
Cited by 109 (5 self)
High performance computer implementation today is increasingly directed toward parallelism in the hardware. Superscalar machines, where the hardware can issue more than one instruction each cycle, are being adopted by more implementations. As the trend toward wider issue rates continues, so too must the ability to fetch more instructions each cycle. Although compilers can improve the situation by increasing the size of basic blocks, hardware mechanisms to fetch multiple possibly non-consecutive basic blocks are also needed. Viable mechanisms for fetching multiple non-consecutive basic blocks have not been previously investigated. We present a mechanism for predicting multiple branches and fetching multiple non-consecutive basic blocks each cycle which is both viable and effective. We measured the effectiveness of the mechanism in terms of IPC_f, the number of instructions fetched per clock for a machine front-end. For one, two, and three basic blocks, the IPC_f of integer benchmark …
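IPC_f, as defined above, is simply instructions fetched divided by front-end cycles. The short example below computes it for a made-up per-cycle fetch trace.

```cpp
// Trivial worked example of IPC_f with invented per-cycle fetch counts.
#include <cstdio>
#include <vector>

int main() {
    // Instructions delivered by the fetch unit in each of 8 cycles
    // (e.g. limited by taken branches and basic-block boundaries).
    std::vector<int> fetched = {4, 2, 4, 1, 4, 3, 4, 2};
    int total = 0;
    for (int n : fetched) total += n;
    std::printf("IPC_f = %.2f\n", static_cast<double>(total) / fetched.size());
    return 0;
}
```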
Trading Conflict and Capacity Aliasing in Conditional Branch Predictors
- In Proceedings of the 24th International Symposium on Computer Architecture
, 1997
"... As modern microprocessors employ deeper pipelines and issue multiple instructions per cycle, they are becoming increasingly dependent on accurate branch prediction. Because hardware resources for branch-predictor tables are invariably limited, it is not possible to hold all relevant branch history f ..."
Abstract
-
Cited by 95 (8 self)
As modern microprocessors employ deeper pipelines and issue multiple instructions per cycle, they are becoming increasingly dependent on accurate branch prediction. Because hardware resources for branch-predictor tables are invariably limited, it is not possible to hold all relevant branch history for all active branches at the same time, especially for large workloads consisting of multiple processes and operating-system code. The problem that results, commonly referred to as aliasing in the branch-predictor tables, is in many ways similar to the misses that occur in finite-sized hardware caches. In this paper we propose a new classification for branch aliasing based on the three-Cs model for caches, and show that conflict aliasing is a significant source of mispredictions. Unfortunately, the obvious method for removing conflicts -- adding tags and associativity to the predictor tables -- is not a cost-effective solution. To address this problem, we propose the skewed branch predictor …
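The skewing idea can be rendered in a few lines: several banks of counters, each indexed by a different hash of the branch address and history, with a majority vote so that a conflict in any single bank is usually outvoted. The sketch below uses arbitrary stand-in hash functions and a simple update rule rather than the paper's skewing functions and update policy.

```cpp
// Simplified skewed-predictor sketch: three banks of 2-bit counters indexed
// by different hashes of (PC, history), combined by majority vote.
#include <cstdint>
#include <cstdio>
#include <vector>

class SkewedPredictor {
public:
    explicit SkewedPredictor(unsigned bits)
        : banks_(3, std::vector<uint8_t>(1u << bits, 2)), mask_((1u << bits) - 1) {}

    bool predict(uint64_t pc) const {
        int votes = 0;
        for (int b = 0; b < 3; ++b)
            votes += banks_[b][hash(b, pc)] >= 2;
        return votes >= 2;   // a conflict in one bank is masked by the other two
    }

    void update(uint64_t pc, bool taken) {
        for (int b = 0; b < 3; ++b) {
            uint8_t& c = banks_[b][hash(b, pc)];
            if (taken) { if (c < 3) ++c; } else { if (c > 0) --c; }
        }
        hist_ = ((hist_ << 1) | (taken ? 1 : 0)) & mask_;
    }

private:
    size_t hash(int bank, uint64_t pc) const {
        uint64_t x = pc ^ hist_;
        switch (bank) {                       // a different index function per bank
            case 0:  return x & mask_;
            case 1:  return ((x >> 3) ^ x) & mask_;
            default: return ((x >> 7) ^ (x << 1)) & mask_;
        }
    }

    std::vector<std::vector<uint8_t>> banks_;
    uint64_t mask_;
    uint64_t hist_ = 0;
};

int main() {
    SkewedPredictor bp(10);
    int correct = 0;
    for (int i = 0; i < 1000; ++i) {
        bool taken = (i % 3 != 0);
        correct += (bp.predict(0x400a00) == taken);
        bp.update(0x400a00, taken);
    }
    std::printf("correct: %d/1000\n", correct);
    return 0;
}
```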
Target prediction for indirect jumps
- In Proc. ISCA-24
, 1997
"... As the issue rate and pipeline depth of high perfor-mance superscalar processors increase, the amount of spec-ulative work issued also increases. Because speculative work must be thrown away in the event of a branch mispredic-tion, wide-issue, deeply pipelined processors must employ accurate branch ..."
Abstract
-
Cited by 93 (3 self)
As the issue rate and pipeline depth of high-performance superscalar processors increase, the amount of speculative work issued also increases. Because speculative work must be thrown away in the event of a branch misprediction, wide-issue, deeply pipelined processors must employ accurate branch predictors to effectively exploit their performance potential. Many existing branch prediction schemes are capable of accurately predicting the direction of conditional branches. However, these schemes are ineffective in predicting the targets of indirect jumps, achieving, on average, a prediction accuracy rate of 51.8% for the SPECint95 benchmarks. In this paper, we propose a new prediction mechanism, the target cache, for predicting indirect jump targets. For the perl and gcc benchmarks, this mechanism reduces the indirect jump misprediction rate by 93.4% and 63.3%, respectively, and the overall execution time by 14% and 5%.
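A target cache differs from a direction predictor in that each entry stores a full predicted target address, and the index mixes in recent history so the same indirect jump can map to different entries in different contexts. The sketch below is a simplified, software-only rendering with an illustrative indexing function, not the exact mechanism evaluated in the paper.

```cpp
// Simplified target-cache sketch: a table of predicted targets for indirect
// jumps, indexed by a hash of the jump PC and recent target history.
#include <cstdint>
#include <cstdio>
#include <vector>

class TargetCache {
public:
    explicit TargetCache(unsigned bits)
        : targets_(1u << bits, 0), mask_((1u << bits) - 1) {}

    // Returns the predicted target (0 means "no prediction yet").
    uint64_t predict(uint64_t pc) const { return targets_[index(pc)]; }

    // After the jump resolves, store the actual target and fold it into the
    // history used for future indexing.
    void update(uint64_t pc, uint64_t actual_target) {
        targets_[index(pc)] = actual_target;
        hist_ = ((hist_ << 4) ^ (actual_target & 0xf)) & mask_;
    }

private:
    size_t index(uint64_t pc) const { return (pc ^ hist_) & mask_; }

    std::vector<uint64_t> targets_;
    uint64_t mask_;
    uint64_t hist_ = 0;
};

int main() {
    TargetCache tc(10);
    const uint64_t jump_pc = 0x400c00;
    const uint64_t targets[] = {0x401000, 0x402000, 0x403000};
    int correct = 0;
    for (int i = 0; i < 300; ++i) {
        uint64_t actual = targets[i % 3];          // the jump cycles through 3 targets
        correct += (tc.predict(jump_pc) == actual);
        tc.update(jump_pc, actual);
    }
    std::printf("correct: %d/300\n", correct);
    return 0;
}
```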