Using Hybrid Branch Predictors to Improve Branch Prediction Accuracy in the Presence of Context Switches
In Proceedings of the 23rd Annual International Symposium on Computer Architecture, 1996.
Cited by 93 (2 self)
Pipeline stalls due to conditional branches represent one of the most significant impediments to realizing the performance potential of deeply pipelined, superscalar processors. Many branch predictors have been proposed to help alleviate this problem, including the Two-Level Adaptive Branch Predictor and, more recently, two-component hybrid branch predictors. In a less idealized environment, such as a time-shared system, the code of interest involves context switches. Context switches, even at fairly large intervals, can seriously degrade the performance of many of the most accurate branch prediction schemes. In this paper, we introduce a new hybrid branch predictor and show that it is more accurate (for a given cost) than any previously published scheme, especially if the branch histories are periodically flushed due to the presence of context switches.
Keywords: branch prediction, context switch, superscalar, speculative execution
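The abstract does not describe the structure of the new predictor. The sketch below therefore shows a generic two-component hybrid instead: a PC-indexed bimodal table and a gshare table, with a per-branch chooser in the style of McFarling's combining predictor. It is not the paper's scheme. Modeling a context switch as a reset of all predictor state is an assumption about what "flushed" means. The class and method names are hypothetical.

```python
def sat_update(counter: int, taken: bool, max_val: int = 3) -> int:
    """Move a saturating counter toward max_val if taken, toward 0 otherwise."""
    return min(counter + 1, max_val) if taken else max(counter - 1, 0)


class HybridPredictor:
    """Hypothetical two-component hybrid: bimodal + gshare + PC-indexed chooser.
    A chooser value >= 2 means "trust gshare"."""

    def __init__(self, index_bits: int = 10):
        self.index_bits = index_bits
        self.mask = (1 << index_bits) - 1
        self.reset()

    def reset(self) -> None:
        """Return every table and the global history to its initial state."""
        size = 1 << self.index_bits
        self.bimodal = [1] * size  # 2-bit counters, weakly not-taken
        self.gshare = [1] * size
        self.chooser = [2] * size  # weakly prefer gshare
        self.ghr = 0               # global history register

    def _indices(self, pc: int) -> tuple[int, int]:
        return pc & self.mask, (pc ^ self.ghr) & self.mask

    def predict(self, pc: int) -> bool:
        b, g = self._indices(pc)
        counter = self.gshare[g] if self.chooser[b] >= 2 else self.bimodal[b]
        return counter >= 2

    def update(self, pc: int, taken: bool) -> None:
        b, g = self._indices(pc)
        bim_ok = (self.bimodal[b] >= 2) == taken
        gsh_ok = (self.gshare[g] >= 2) == taken
        if bim_ok != gsh_ok:  # train the chooser only when exactly one component was right
            self.chooser[b] = sat_update(self.chooser[b], gsh_ok)
        self.bimodal[b] = sat_update(self.bimodal[b], taken)
        self.gshare[g] = sat_update(self.gshare[g], taken)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask

    def context_switch(self) -> None:
        """Crude model of a context switch (assumption): all learned state is lost."""
        self.reset()
```

A hybrid with this structure is often used because the bimodal component warms up after a few executions of a branch. The gshare component needs its global history refilled first, so the chooser can lean on bimodal right after a flush.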
Trading Conflict and Capacity Aliasing in Conditional Branch Predictors
In Proceedings of the 24th International Symposium on Computer Architecture, 1997.
Cited by 81 (7 self)
As modern microprocessors employ deeper pipelines and issue multiple instructions per cycle, they are becoming increasingly dependent on accurate branch prediction. Because hardware resources for branch-predictor tables are invariably limited, it is not possible to hold all relevant branch history for all active branches at the same time, especially for large workloads consisting of multiple processes and operating-system code. The problem that results, commonly referred to as aliasing in the branch-predictor tables, is in many ways similar to the misses that occur in finite-sized hardware caches. In this paper we propose a new classification for branch aliasing based on the three-Cs model for caches, and show that conflict aliasing is a significant source of mispredictions. Unfortunately, the obvious method for removing conflicts (adding tags and associativity to the predictor tables) is not a cost-effective solution. To address this problem, we propose the skewed branch predict...
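The abstract is cut off at the name of the proposed predictor. The sketch below follows the usual description of a skewed predictor:

- three banks of 2-bit counters, each indexed by a different hash of (PC, global history);
- a majority vote across the banks;
- a partial update on correct predictions.

The multiplicative hash constants are hypothetical stand-ins, not the paper's skewing functions. The point they illustrate is that two branches colliding in one bank will usually not collide in the other two.

```python
# Hypothetical odd multipliers standing in for the paper's skewing functions.
SKEW_CONSTANTS = (0x9E3779B1, 0x85EBCA6B, 0xC2B2AE35)


def sat_update(counter: int, taken: bool) -> int:
    return min(counter + 1, 3) if taken else max(counter - 1, 0)


class SkewedPredictor:
    """Three banks of 2-bit counters, each indexed by a different hash of
    (PC, global history); the prediction is a majority vote."""

    def __init__(self, index_bits: int = 8, history_bits: int = 8):
        self.index_bits = index_bits
        self.history_bits = history_bits
        self.banks = [[1] * (1 << index_bits) for _ in SKEW_CONSTANTS]
        self.ghr = 0

    def _indices(self, pc: int) -> list[int]:
        key = (pc << self.history_bits) | self.ghr
        # Multiplicative hashing: keep the top index_bits of a 32-bit product.
        return [((key * k) & 0xFFFFFFFF) >> (32 - self.index_bits)
                for k in SKEW_CONSTANTS]

    def _votes(self, idxs: list[int]) -> list[bool]:
        return [bank[i] >= 2 for bank, i in zip(self.banks, idxs)]

    def predict(self, pc: int) -> bool:
        return sum(self._votes(self._indices(pc))) >= 2

    def update(self, pc: int, taken: bool) -> None:
        idxs = self._indices(pc)
        votes = self._votes(idxs)
        correct = (sum(votes) >= 2) == taken
        for bank, i, vote in zip(self.banks, idxs, votes):
            # Partial update: on a correct prediction, leave dissenting banks
            # alone so they can keep serving the branch they alias with.
            if not correct or vote == taken:
                bank[i] = sat_update(bank[i], taken)
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.history_bits) - 1)
```

The partial update is what lets the redundancy pay off. A bank that disagrees with a correct majority is probably holding state for some other, aliased branch, so it is not overwritten.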
Branch prediction, instruction-window size, and cache size: Performance tradeoffs and simulation techniques
IEEE Transactions on Computers, 1999.
Cited by 57 (18 self)
Design parameters interact in complex ways in modern processors, especially because out-of-order issue and decoupling buffers allow latencies to be overlapped. Tradeoffs among instruction-window size, branch-prediction accuracy, and instruction- and data-cache size can change as these parameters move through different domains. For example, modeling unrealistic caches can under- or over-state the benefits of better prediction or a larger instruction window. Avoiding such pitfalls requires understanding how all these parameters interact. Because such methodological mistakes are common, this paper provides a comprehensive set of SimpleScalar simulation results from SPECint95 programs, showing the interactions among these major structures. In addition to presenting this database of simulation results, major mechanisms driving the observed tradeoffs are described. The paper also considers appropriate simulation techniques when sampling full-length runs with the SPEC reference inputs. In particular, the results show that branch mispredictions limit the benefits of larger instruction windows, that better branch prediction and better instruction cache behavior have synergistic effects, and that the benefits of larger instruction windows and larger data caches trade off and have overlapping effects. In addition, simulations of only 50 million instructions can yield representative results if these short windows are carefully selected.
Power Issues Related to Branch Prediction
2001.
Cited by 54 (12 self)
This paper explores the role of branch predictor organization in power/energy/performance tradeoffs for processor design. We find that, as a general rule, to reduce overall energy consumption in the processor it is worthwhile to spend more power in the branch predictor if this results in more accurate predictions that improve running time. Two techniques, however, provide substantial reductions in power dissipation without harming accuracy. Banking reduces the portion of the branch predictor that is active at any one time. And a new on-chip structure, the prediction probe detector (PPD), can use pre-decode bits to entirely eliminate unnecessary predictor and BTB accesses. Despite the extra power that must be spent accessing the PPD, it reduces local predictor power and energy dissipation by about 45% and overall processor power and energy dissipation by 5-6%.
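The abstract says the PPD uses pre-decode bits to skip predictor and BTB accesses that cannot matter. The sketch below assumes one hypothetical summary per instruction-cache line with two flags:

- whether the line holds a conditional branch (so the direction predictor is needed);
- whether it holds any control-flow instruction (so the BTB is needed).

The instruction classes, field names and the counting helper are all illustrative, not the paper's design. In hardware the summary would be read from the PPD rather than recomputed on every fetch.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence

# Hypothetical instruction classes produced by pre-decode.
COND = "cond"  # conditional branch: needs a direction prediction and a target
JUMP = "jump"  # unconditional jump/call/return: needs a target only
ALU = "alu"    # anything that cannot redirect fetch


@dataclass(frozen=True)
class PPDEntry:
    """Per-cache-line pre-decode summary (hypothetical two-flag layout)."""
    has_cond_branch: bool   # consult the direction predictor?
    has_control_flow: bool  # consult the BTB?


def predecode(line: Sequence[str]) -> PPDEntry:
    has_cond = COND in line
    return PPDEntry(has_cond, has_cond or JUMP in line)


def count_lookups(fetch_stream: Iterable[Sequence[str]], use_ppd: bool) -> tuple[int, int]:
    """Return (direction-predictor lookups, BTB lookups) for a stream of fetched lines."""
    predictor = btb = 0
    for line in fetch_stream:
        if use_ppd:
            entry = predecode(line)  # hardware would read this from the PPD instead
            predictor += int(entry.has_cond_branch)
            btb += int(entry.has_control_flow)
        else:
            predictor += 1  # without a PPD, every fetch probes both structures
            btb += 1
    return predictor, btb
```

For example, in a stream of four fetched lines where only one holds a conditional branch and one holds a jump, the sketch performs one predictor lookup and two BTB lookups instead of four of each.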
Improving Prediction for Procedure Returns with Return-Address-Stack Repair Mechanisms
In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, 1998.
Cited by 50 (11 self)
This paper evaluates several mechanisms for repairing the return-address stack after branch mispredictions. The return-address stack is a small but important structure for achieving better control-flow prediction accuracy and therefore better performance. But wrong-path execution after mispredictions frequently corrupts the return-address stack, making repair mechanisms necessary. If the processor implements multipath execution (simultaneously executing both sides of a branch), the contention among different paths makes the problem more severe. For conventional, single-path processors, this paper proposes saving both the top-of-stack pointer and the top-of-stack contents for later restoration in case of a misprediction. This simple technique achieves nearly 100% hit rates and improves performance by up to 8.7% compared to a stack with no repair mechanism. For multipath processors, providing each path with its own return-address stack completely eliminates contention, improving perform...
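The single-path repair described above saves both the top-of-stack pointer and the top-of-stack contents at each predicted branch. The sketch below shows that mechanism on a small circular stack. The circular organization, the stack size and the method names are assumptions made for illustration.

```python
class ReturnAddressStack:
    """Circular return-address stack with pointer-plus-contents checkpointing."""

    def __init__(self, size: int = 8):
        self.size = size
        self.entries = [0] * size
        self.tos = 0  # index of the current top entry

    def push(self, addr: int) -> None:
        """On a predicted call: advance the pointer and write the return address."""
        self.tos = (self.tos + 1) % self.size
        self.entries[self.tos] = addr

    def pop(self) -> int:
        """On a predicted return: read the top entry and retreat the pointer."""
        addr = self.entries[self.tos]
        self.tos = (self.tos - 1) % self.size
        return addr

    def checkpoint(self) -> tuple[int, int]:
        """Taken at each predicted branch: save the pointer AND the top entry."""
        return self.tos, self.entries[self.tos]

    def restore(self, cp: tuple[int, int]) -> None:
        """On a misprediction: undo wrong-path pops and the overwrite of the top."""
        tos, top = cp
        self.tos = tos
        self.entries[tos] = top
```

A wrong path that pops a return and then pushes a new call overwrites exactly the slot the checkpoint saved. Restoring only the pointer would leave the wrong-path address on top; restoring the saved contents as well recovers the correct stack.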
Dataflow Analysis of Branch Mispredictions and Its Application to Early Resolution of Branch Outcomes
1998.
Cited by 42 (1 self)
The goal of this study is twofold: to analyze in detail the nature of conditional branch mispredictions in correlation-based branch predictors, and, based on this analysis, to reduce the impact of branch mispredictions on processor performance by decreasing the branch resolution delay instead of improving the branch prediction accuracy. We classify the conditional branches with the highest number of mispredictions according to the analytical expression of their branch condition. Based on these expressions, we can analyze, and in many cases precisely explain, the origin of mispredictions. Moreover, we find that many such branches belong to small sets of blocks inside loops, and within such sets we find that some of the branch expressions have regularity properties. We show how to exploit this regularity by anticipating the branch outcome, where anticipation is a combination of value prediction and normal dataflow execution. We investigate a hardware mechanism to implement...
Speculative Updates of Local and Global Branch History: A Quantitative Analysis
Journal of Instruction-Level Parallelism, 1998.
Cited by 34 (8 self)
In today's wide-issue processors, even small branch-misprediction rates introduce substantial performance penalties. Worse yet, inadequate branch prediction creates a bottleneck at the fetch stage, restricting other opportunities for improving performance. The choice of how to predict conditional-branch outcomes is the primary lever on prediction accuracy. But the choice of when to update the predictor with branch outcomes is a second powerful lever, and the subject of this paper. In history-based predictors like gshare, many mispredictions result from commit-time update of the history: typical pipelined processors predict branches in the fetch stage but update the predictor in the commit stage, making the predictor's state temporarily out-of-date. As pipelines grow longer (in particular, when branches can spend many cycles in the instruction window waiting to issue), this problem becomes worse. Prior work on this subject has discussed the need for speculative update in a gl...
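The abstract contrasts commit-time history update with speculative update in gshare. The sketch below shows one common way to do the speculative global-history part:

- at fetch, shift the predicted outcome into the history register and keep a checkpoint of the old history;
- on a misprediction, rebuild the history from that checkpoint with the actual outcome;
- train the counters only at commit.

This is a generic scheme, not necessarily the paper's mechanism, and the local-history case is not shown. The class and method names are hypothetical.

```python
def sat_update(counter: int, taken: bool) -> int:
    return min(counter + 1, 3) if taken else max(counter - 1, 0)


class SpeculativeGshare:
    """gshare whose global history is updated speculatively at fetch time."""

    def __init__(self, bits: int = 10):
        self.mask = (1 << bits) - 1
        self.table = [1] * (1 << bits)  # 2-bit counters, trained at commit
        self.spec_ghr = 0               # history including in-flight predictions

    def predict(self, pc: int) -> tuple[bool, int, int]:
        """Predict at fetch; return (prediction, table index, history checkpoint)."""
        index = (pc ^ self.spec_ghr) & self.mask
        taken = self.table[index] >= 2
        checkpoint = self.spec_ghr
        # Shift in the predicted outcome so the next fetched branch sees it.
        self.spec_ghr = ((self.spec_ghr << 1) | int(taken)) & self.mask
        return taken, index, checkpoint

    def resolve(self, predicted: bool, checkpoint: int, actual: bool) -> None:
        """At execute: on a misprediction, rebuild history from the checkpoint.
        This also discards the history bits of younger, squashed branches."""
        if predicted != actual:
            self.spec_ghr = ((checkpoint << 1) | int(actual)) & self.mask

    def commit(self, index: int, actual: bool) -> None:
        """Train the counter only when the branch retires."""
        self.table[index] = sat_update(self.table[index], actual)
```

With commit-time history update, a second branch fetched right behind the first would be indexed with stale history. Here it sees the first branch's predicted outcome immediately. The checkpoint makes the repair a single register write.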
Improving Branch Predictors by Correlating on Data Values
1999.
Cited by 28 (0 self)
Branch predictors typically use combinations of branch PC bits and branch histories to make predictions. Recent improvements in branch predictors have come from reducing the effect of interference, i.e., multiple branches mapping to the same table entries. In contrast, the branch difference predictor (BDP) uses data values as additional information to improve the accuracy of conditional branch predictors. The BDP maintains a history of differences between branch source register operands, and feeds these into the prediction process. An important component of the BDP is a rare event predictor (REP), which reduces learning time and table interference. An REP is a cache-like structure designed to store patterns whose predictions differ from the norm. Initially, ideal interference-free predictors are evaluated to determine how data values improve correlation. Next, execution-driven simulations of complete designs realize this potential. The BDP reduces the misprediction rate of five SPEC95 ...
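The abstract gives only the outline of the BDP: a per-branch history of operand differences feeding the prediction, plus an REP holding patterns that differ from the norm. The sketch fills in the details with assumptions that are not taken from the paper:

- differences are quantized to their sign;
- the difference history is hashed with the PC into a table of 2-bit counters;
- the REP is a small dictionary with FIFO replacement, allocated when the backing table mispredicts.

All names are hypothetical.

```python
from collections import deque


def sign(x: int) -> int:
    """Hypothetical quantization of an operand difference to -1, 0 or +1."""
    return (x > 0) - (x < 0)


class DifferencePredictor:
    """Sketch of a difference-history predictor with a small rare-event cache."""

    def __init__(self, depth: int = 2, table_bits: int = 10, rep_size: int = 64):
        self.depth = depth
        self.mask = (1 << table_bits) - 1
        self.history: dict[int, deque] = {}   # pc -> recent quantized differences
        self.table = [1] * (1 << table_bits)  # backing 2-bit counters
        self.rep: dict[tuple, bool] = {}      # (pc, pattern) -> outcome
        self.rep_size = rep_size

    def _pattern(self, pc: int) -> tuple:
        return tuple(self.history.get(pc, ()))

    def _index(self, pc: int, pattern: tuple) -> int:
        return hash((pc, pattern)) & self.mask

    def predict(self, pc: int) -> bool:
        pattern = self._pattern(pc)
        if (pc, pattern) in self.rep:  # rare event: overrides the backing table
            return self.rep[(pc, pattern)]
        return self.table[self._index(pc, pattern)] >= 2

    def update(self, pc: int, diff: int, taken: bool) -> None:
        """diff is src1 - src2 of the resolved branch instance."""
        pattern = self._pattern(pc)
        key = (pc, pattern)
        idx = self._index(pc, pattern)
        if key in self.rep:
            self.rep[key] = taken
        elif (self.table[idx] >= 2) != taken:
            # The backing table got this pattern wrong: record it as a rare event.
            if len(self.rep) >= self.rep_size:
                self.rep.pop(next(iter(self.rep)))  # FIFO replacement (assumption)
            self.rep[key] = taken
        c = self.table[idx]
        self.table[idx] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.history.setdefault(pc, deque(maxlen=self.depth)).append(sign(diff))
```

Because REP entries are tagged with the full (PC, pattern) pair, two patterns that happen to share a backing counter can still be predicted differently once one of them has been recorded as a rare event.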
Control-Flow Speculation through Value Prediction for Superscalar Processors
1998.
Cited by 26 (3 self)
In this paper, we introduce a new branch predictor that predicts the outcomes of branches by predicting the value of their inputs and performing an early computation of their results according to the predicted values. The design of a hybrid predictor comprising our branch predictor and a correlating branch predictor is presented. We also propose a new selector that chooses the most reliable prediction for each branch. This selector is based on the path followed to reach the branch. Results for immediate updates show that for a processor that already has a value prediction unit, our hybrid predictor, with a size of 4KB, achieves the same miss ratio as a conventional hybrid predictor of 64KB. The reduction in misprediction penalty is about 40% for all predictor sizes. Furthermore, if the cost of the value predictor is considered as an additional cost of the hybrid predictor, our proposal still reduces the miss ratio with respect to a conventional hybrid predictor for all different predic...
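The core idea above is to predict a branch's input values and evaluate its condition early on those predicted values. The sketch covers only that value-based component. The hybrid with a correlating predictor and the path-based selector are omitted. A simple last-value-plus-stride predictor for each operand is an assumption, since the abstract does not say which value predictor is used. The condition names and classes are hypothetical.

```python
import operator

# Hypothetical set of branch conditions the sketch can evaluate.
CONDITIONS = {"lt": operator.lt, "le": operator.le, "eq": operator.eq,
              "ne": operator.ne, "ge": operator.ge, "gt": operator.gt}


class StrideValuePredictor:
    """Predicts the next value of an operand as last value + last stride."""

    def __init__(self):
        self.last: dict[tuple, int] = {}
        self.stride: dict[tuple, int] = {}

    def predict(self, key: tuple):
        if key not in self.last:
            return None  # no prediction until the operand has been seen
        return self.last[key] + self.stride.get(key, 0)

    def update(self, key: tuple, value: int) -> None:
        if key in self.last:
            self.stride[key] = value - self.last[key]
        self.last[key] = value


class ValueBasedBranchPredictor:
    """Predicts a branch by computing its condition on predicted operand values."""

    def __init__(self, fallback: bool = False):
        self.vp = StrideValuePredictor()
        self.fallback = fallback  # used until both operands have been seen

    def predict(self, pc: int, cond: str) -> bool:
        a = self.vp.predict((pc, 0))
        b = self.vp.predict((pc, 1))
        if a is None or b is None:
            return self.fallback
        return CONDITIONS[cond](a, b)

    def update(self, pc: int, a: int, b: int) -> None:
        """Train on the actual operand values once the branch executes."""
        self.vp.update((pc, 0), a)
        self.vp.update((pc, 1), b)
```

For a loop-closing branch such as `i < 10`, the stride predictor anticipates `i == 10` on the final iteration. The sketch therefore predicts the loop exit, which a plain counter-based predictor would typically miss.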
The Relative Importance of Memory Latency, Bandwidth, and Branch Limits to Performance
In the Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, 1997.
Cited by 19 (0 self)
This study investigates the relative importance of memory latency, memory bandwidth, and branch predictability in determining limits to processor performance. We use an aggressive simulation model with few other limits to study the performance of SPEC92 benchmarks. Our basic machine model assumes a dynamically scheduled processor with a 16536-entry instruction window. Up to 16536 instructions of any type can be issued each cycle, subject to data dependencies. In systems with unlimited memory bandwidth and perfect branch predictability, we find that memory latency is not a significant limit to performance until it exceeds 100 to 200 cycles. Memory bandwidth is not usually a significant limit either. In systems with a memory latency of 16 cycles and perfect branch predictability, many applications require less than 6 bytes per cycle, while all but one perform well if 100 bytes per cycle are available. Based on current trends in the semiconductor industry and current research in packaging ...

