Results 11 - 20
of
60
Branch prediction, instruction-window size, and cache size: Performance tradeoffs and simulation techniques
- IEEE Transactions on Computers
, 1999
"... Design parameters interact in complex ways in modern processors, especially because out-of-order issue and decoupling buffers allow latencies to be overlapped. Tradeoffs among instruction-window size, branch-prediction accuracy, and instruction- and datacache size can change as these parameters move ..."
Abstract
-
Cited by 57 (18 self)
- Add to MetaCart
Design parameters interact in complex ways in modern processors, especially because out-of-order issue and decoupling buffers allow latencies to be overlapped. Tradeoffs among instruction-window size, branch-prediction accuracy, and instruction- and datacache size can change as these parameters move through different domains. For example, modeling unrealistic caches can under- or over-state the benefits of better prediction or a larger instruction window. Avoiding such pitfalls requires understanding how all these parameters interact. Because such methodological mistakes are common, this paper provides a comprehensive set of SimpleScalar simulation results from SPECint95 programs, showing the interactions among these major structures. In addition to presenting this database of simulation results, major mechanisms driving the observed tradeoffs are described. The paper also considers appropriate simulation techniques when sampling full-length runs with the SPEC reference inputs. In particular, the results show that branch mispredictions limit the benefits of larger instruction windows, that better branch prediction and better instruction cache behavior have synergistic effects, and that the benefits of larger instruction windows and larger data caches trade off and have overlapping effects. In addition, simulations of only 50 million instructions can yield representative results if these short windows are carefully selected.
Improving Trace Cache Effectiveness with Branch Promotion and Trace Packing
, 1998
"... The increasing widths of superscalar processors are placing greater demands upon the fetch mechanism. The trace cache meets these demands by placing logically contiguous instructions in physically contiguous storage. As a result, the trace cache delivers instructions at a high rate by supplying mult ..."
Abstract
-
Cited by 55 (6 self)
- Add to MetaCart
The increasing widths of superscalar processors are placing greater demands upon the fetch mechanism. The trace cache meets these demands by placing logically contiguous instructions in physically contiguous storage. As a result, the trace cache delivers instructions at a high rate by supplying multiple fetch blocks each cycle. In this paper, we examine two techniques to improve the number of instructions delivered each cycle by the trace cache. The first technique, branch promotion, dynamically converts strongly biased branches into branches with static predictions. Because these promoted branches require no dynamic prediction, the branch predictor suffers less from the negative effects of interference. Branch promotion unlocks the potential of the second technique: trace packing. With trace packing, trace segments are packed with as many instructions as will fit, without regard to naturally occurring fetch block boundaries. With both techniques, the effective fetch rate of the trace ...
Execution Characteristics of Desktop Applications on Windows NT
, 1998
"... This paper examines the performance of desktop applications running on the Microsoft Windows NT operating system on Intel x86 processors, and contrasts these applications to the programs in the integer SPEC95 benchmark suite. We present measurements of basic instruction set and program characteristi ..."
Abstract
-
Cited by 53 (2 self)
- Add to MetaCart
This paper examines the performance of desktop applications running on the Microsoft Windows NT operating system on Intel x86 processors, and contrasts these applications to the programs in the integer SPEC95 benchmark suite. We present measurements of basic instruction set and program characteristics, and detailed simulation results of the way these programs use the memory system and processor branch architecture. We show that the desktop applications have similar characteristics to the integer SPEC95 benchmarks for many of these metrics. However, compared to the integer SPEC95 applications, desktop applications have larger instruction working sets, execute instructions in a greater number of unique functions, cross DLL boundaries frequently, and execute a greater number of indirect calls.
An analysis of correlation and predictability: What makes two-level branch predictors work
- In Proceedings of the 25th Annual International Symposium on Computer Architecture
, 1998
"... Pipeline flushes due to branch mispredictions is one of the most serious problems facing the designer of a deeply pipelined, superscalar processor. Many branch predictors have been proposed to help alleviate this problem, including two-level adaptive branch predictors and hybrid branch predictors. N ..."
Abstract
-
Cited by 52 (4 self)
- Add to MetaCart
Pipeline flushes due to branch mispredictions is one of the most serious problems facing the designer of a deeply pipelined, superscalar processor. Many branch predictors have been proposed to help alleviate this problem, including two-level adaptive branch predictors and hybrid branch predictors. Numerous studies have shown which predictors and configurations best predict the branches in a given set of benchmarks. Some studies have also investigated effects, such as pattern history table interference, that can be detrimental to the performance of these predictors. However, little research has been done on which characteristics of branch behavior make predictors perform well. In this paper, we investigate and quantify reasons why branches are predictable. We show that some of this predictability is not captured by the two-level adaptive branch predictors. An understanding of the predictability of branches may lead to insights ultimately resulting in better or less complex predictors. We also investigate and quantify what fraction of the branches in each benchmark is predictable using each of the methods described in this paper. 1.
Value Locality And Speculative Execution
, 1997
"... This thesis introduces a program attribute called value locality and proposes speculative execution under the weak dependence model. The weak dependence model lays a theoretical foundation for exploiting value locality and other program attributes by speculatively relaxing and deferring the detectio ..."
Abstract
-
Cited by 51 (1 self)
- Add to MetaCart
This thesis introduces a program attribute called value locality and proposes speculative execution under the weak dependence model. The weak dependence model lays a theoretical foundation for exploiting value locality and other program attributes by speculatively relaxing and deferring the detection and enforcement of control- and data-flow dependences between instructions to expose more instruction-level parallelism without violating program correctness. Value locality is a program attribute that describes the likelihood of the recurrence of a previously-seen value within a storage location inside a computer system. Most modern processors already exploit value locality through the use of control speculation (i.e. branch prediction), which seeks to predict the future values of condition code bits and branch-target addresses based on previously-seen values. Experimental results indicate that value locality exists for condition codes and branch target addresses, and for general-purpose ...
Improving Prediction for Procedure Returns with Return-Address-Stack Repair Mechanisms
- IN PROCEEDINGS OF THE 31ST ANNUAL ACM/IEEE INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE
, 1998
"... This paper evaluates several mechanisms for repairing the return-address stack after branch mispredictions. The return-address stack is a small but important structure for achieving better control-flow prediction accuracy and therefore better performance. But wrong-path execution after misprediction ..."
Abstract
-
Cited by 50 (11 self)
- Add to MetaCart
This paper evaluates several mechanisms for repairing the return-address stack after branch mispredictions. The return-address stack is a small but important structure for achieving better control-flow prediction accuracy and therefore better performance. But wrong-path execution after mispredictions frequently corrupts the return-address stack, making repair mechanisms necessary. If the processor implements multipath execution---simultaneously executing both sides of a branch---the contention among different paths makes the problem more severe. For conventional, single-path processors, this paper proposes saving both the top-of-stack pointer and the top-ofstack contents for later restoration in case of a misprediction. This simple technique achieves nearly 100% hit rates and improves performance by up to 8.7% compared to a stack with no repair mechanism. For multipath processors, providing each path with its own return-address stack completely eliminates contention, improving perform...
A Language for Describing Predictors and its Application to Automatic Synthesis
- In Proceedings of the 24th International Symposium on Computer Architecture
, 1997
"... As processor architectures have increased their reliance on speculative execution to improve performance, the importance of accurate prediction of what to execute speculatively has increased. Furthermore, the types of values predicted have expanded from the ubiquitous branch and call/return targets ..."
Abstract
-
Cited by 44 (0 self)
- Add to MetaCart
As processor architectures have increased their reliance on speculative execution to improve performance, the importance of accurate prediction of what to execute speculatively has increased. Furthermore, the types of values predicted have expanded from the ubiquitous branch and call/return targets to the prediction of indirect jump targets, cache ways and data values. In general, the prediction process is one of identifying the current state of the system, and making a prediction for some as yet uncomputed value based on that state. Prediction accuracy is improved by learning what is a good prediction for that state using a feedback process at the time the predicted value is actually computed. While there have been a number of efforts to formally characterize this process, we have taken the approach of providing a simple algebraic-style notation that allows one to express this state identification and feedback process. This notation allows one to describe a wide variety of predictors ...
Variable Length Path Branch Prediction
- In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems
, 1998
"... ing with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@acm.org. Variable Length Path Branch Predictio ..."
Abstract
-
Cited by 41 (2 self)
- Add to MetaCart
ing with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@acm.org. Variable Length Path Branch Prediction Jared Stark Marius Evers Yale N. Patt Department of Electrical Engineering and Computer Science The University of Michigan Ann Arbor, Michigan 48109-2122 fstarkj,olaf,pattg@eecs.umich.edu Abstract Accurate branch prediction is required to achieve high performance in deeply pipelined, wide-issue processors. Recent studies have shown that conditional and indirect (or computed) branch targets can be accurately predicted by recording the path, which consists of the target addresses of recent branches, leading up to the branch. In current path based branch predictors, the N most recent target addresses are hashed together to form an index into a table, where N is some fixed integer. The inde...
Examination of a Memory Access Classification Scheme for Pointer-Intensive and Numeric Programs
, 1996
"... In recent work, we described a data prefetch mechanism for pointer-intensive and numeric computations, and presented some aggregate measurements on a suite of benchmarks to quantify its performance potential [MH95]. The basis for this device is a simple classification of memory access patterns in pr ..."
Abstract
-
Cited by 40 (0 self)
- Add to MetaCart
In recent work, we described a data prefetch mechanism for pointer-intensive and numeric computations, and presented some aggregate measurements on a suite of benchmarks to quantify its performance potential [MH95]. The basis for this device is a simple classification of memory access patterns in programs that we introduced earlier [HM94]. In this paper we take a close look at two codes from our suite, an English parser called Link-Gram, and the circuit simulation program spice2g6, and present a detailed analysis of them in the context of our model. Focusing on just two programs allows us to display a wider range of data, and discuss relevant code fragments extracted from their source distributions. Results from this study provide a deeper understanding of our memory access classification scheme, and suggest additional optimizations for future data prefetch mechanisms. Keywords: CPU architecture, data cache, memory access pattern classification, instruction profiling, memory latency t...
Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters
- PLDI'03
, 2003
"... Interpreters designed for efficiency execute a huge number of indirect branches and can spend more than half of the execution time in indirect branch mispredictions. Branch target buffers are the best widely available form of indirect branch prediction; however, their prediction accuracy for existin ..."
Abstract
-
Cited by 38 (7 self)
- Add to MetaCart
Interpreters designed for efficiency execute a huge number of indirect branches and can spend more than half of the execution time in indirect branch mispredictions. Branch target buffers are the best widely available form of indirect branch prediction; however, their prediction accuracy for existing interpreters is only 2%–50%. In this paper we investigate two methods for improving the prediction accuracy of BTBs for interpreters: replicating virtual machine (VM) instructions and combining sequences of VM instructions into superinstructions. We investigate static (interpreter buildtime) and dynamic (interpreter run-time) variants of these techniques and compare them and several combinations of these techniques. These techniques can eliminate nearly all of the dispatch branch mispredictions, and have other benefits, resulting in speedups by a factor of up to 3.17 over efficient threaded-code interpreters, and speedups by a factor of up to 1.3 over techniques relying on superinstructions alone.

