Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters
PLDI'03, 2003
Cited by 38 (7 self)
Interpreters designed for efficiency execute a huge number of indirect branches and can spend more than half of the execution time in indirect branch mispredictions. Branch target buffers (BTBs) are the best widely available form of indirect branch prediction; however, their prediction accuracy for existing interpreters is only 2%–50%. In this paper we investigate two methods for improving the prediction accuracy of BTBs for interpreters: replicating virtual machine (VM) instructions and combining sequences of VM instructions into superinstructions. We investigate static (interpreter build-time) and dynamic (interpreter run-time) variants of these techniques and compare them and several combinations of these techniques. These techniques can eliminate nearly all of the dispatch branch mispredictions, and have other benefits, resulting in speedups by a factor of up to 3.17 over efficient threaded-code interpreters, and speedups by a factor of up to 1.3 over techniques relying on superinstructions alone.
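The effect of replication on BTB accuracy can be illustrated with a toy simulation. The sketch below is not taken from the paper: it assumes an idealised BTB that remembers only the last target of each indirect branch, and the function name `count_mispredictions`, the scheme names, and the four-instruction VM loop are all hypothetical. It compares a single shared dispatch branch (switch dispatch), one dispatch branch per VM opcode (threaded code), and one dispatch branch per VM instruction instance (full replication).

```python
def count_mispredictions(program, iterations, scheme):
    """Count misses of an idealised last-target BTB (an assumption of this sketch).

    `program` is a list of VM opcode names executed repeatedly as a loop.
    """
    trace = [(pc, op) for _ in range(iterations) for pc, op in enumerate(program)]
    btb = {}      # indirect branch id -> last target seen
    misses = 0
    for (pc, op), (next_pc, next_op) in zip(trace, trace[1:]):
        if scheme == "switch":        # one shared dispatch branch
            branch, target = "dispatch", next_op
        elif scheme == "threaded":    # one dispatch branch per VM opcode
            branch, target = op, next_op
        elif scheme == "replicated":  # one copy (and branch) per instruction instance
            branch, target = pc, next_pc
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        if btb.get(branch) != target:
            misses += 1
        btb[branch] = target
    return misses


# Hypothetical VM loop in which opcode A is followed by two different successors.
loop = ["A", "B", "A", "GOTO"]
for scheme in ("switch", "threaded", "replicated"):
    print(scheme, count_mispredictions(loop, 100, scheme))
```

In this model the shared branch mispredicts on every dispatch, the per-opcode branch of `A` mispredicts every time because its target alternates, and the replicated version misses only while the BTB warms up.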
Predicting Data Cache Misses in Non-Numeric Applications Through Correlation Profiling
In MICRO-30, 1997
Cited by 34 (3 self)
To maximize the benefit and minimize the overhead of software-based latency tolerance techniques, we would like to apply them precisely to the set of dynamic references that suffer cache misses. Unfortunately, the information provided by the state-of-the-art cache miss profiling technique (summary profiling) is inadequate for references with intermediate miss ratios: it results in either failing to hide latency or inserting unnecessary overhead. To overcome this problem, we propose and evaluate a new technique, correlation profiling, which improves predictability by correlating the caching behavior with the associated dynamic context. Our experimental results demonstrate that roughly half of the 22 non-numeric applications we study can potentially enjoy significant reductions in memory stall time by exploiting at least one of the three forms of correlation profiling we consider.
The Predictability of Branches in Libraries
In 28th International Symposium on Microarchitecture, 1995
Cited by 33 (6 self)
Profile-based optimizations are being used with increasing frequency. Profile
Increasing the size of atomic instruction blocks using control flow assertions
In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, 2000
Cited by 33 (12 self)
For a variety of reasons, branch-less regions of instructions are desirable for high-performance execution. In this paper, we propose a means for increasing the dynamic length of branch-less regions of instructions for the purposes of dynamic program optimization. We call these atomic regions frames and we construct them by replacing original branch instructions with assertions. Assertion instructions check if the original branching conditions still hold. If they hold, no action is taken. If they do not, then the entire region is undone. In this manner, an assertion has no explicit control flow. We demonstrate that using branch correlation to decide when a branch should be converted into an assertion results in atomic regions that average over 100 instructions in length, with a probability of completion of 97%, and that constitute over 80% of the dynamic instruction stream. We demonstrate both static and dynamic means for constructing frames. When frames are built dynamically using finite-sized hardware, they average 80 instructions in length and have good caching properties.
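The all-or-nothing behaviour of a frame can be sketched in software. The following is only an illustration of the semantics described in the abstract, not the hardware mechanism: the function `run_frame`, the `("op", ...)` / `("assert", ...)` encoding, and the dictionary used as machine state are assumptions made for this example.

```python
def run_frame(state, frame):
    """Execute a frame atomically on a copy of `state` (hypothetical model).

    `frame` is a list of ("op", fn) and ("assert", predicate) pairs.
    Returns (new_state, True) if every assertion held, otherwise
    (original_state, False), i.e. the whole region is undone.
    """
    working = dict(state)             # speculative copy of the machine state
    for kind, fn in frame:
        if kind == "op":
            fn(working)               # ordinary instruction: update the copy
        elif kind == "assert":
            if not fn(working):       # original branch condition no longer holds
                return state, False   # discard all speculative updates
        else:
            raise ValueError(f"unknown frame entry: {kind}")
    return working, True              # commit the frame as a unit


def increment(s):
    s["x"] += 1


def double(s):
    s["x"] *= 2


frame = [
    ("op", increment),
    ("assert", lambda s: s["x"] < 10),  # stands in for a converted branch
    ("op", double),
]

print(run_frame({"x": 1}, frame))
print(run_frame({"x": 9}, frame))
```

The assertion itself never redirects control flow; it only decides whether the frame's results are committed or dropped.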
Path-based Compilation
1998
Cited by 21 (2 self)
Many compilers use profiles of programs to direct the focus and degree of performance optimizations. Profiles are statistics from program runs, usually collected at individual points in the program text, e.g., branches, call sites, or memory accesses. But optimizations based on individual sample points in the program miss an important detail of program behavior: how pieces of the program relate to each other dynamically. A path profile collects statistics over paths (sequences of points) in the program, linking the statistics to the dynamic behavior. By instrumenting and collecting path profiles through a program, we can exploit this dynamic behavior, improving performance more than point profiling techniques have allowed. This thesis shows how to collect path profiles efficiently, then applies the path profiles to two optimizations, static correlated branch prediction and path-based superblock scheduling. These two optimizations address different performance aspects of modern machine...
Achieving High Levels of Instruction-Level Parallelism With Reduced Hardware Complexity
1997
Cited by 17 (3 self)
Keywords: instruction-level parallelism, VLIW processors, superscalar processors, overlapped execution, out-of-order execution, speculative execution, branch prediction, instruction scheduling, compile-time speculation, predicated execution, data speculation, HPL PlayDoh
Instruction-level parallel processing (ILP) has established itself as the only viable approach for achieving the goal of providing continuously increasing performance without having to fundamentally re-write the application. ILP processors differ in their strategies for deciding exactly when, and on which functional unit, an operation should be executed. The alternatives lie somewhere on a spectrum depending on the extent to which these decisions are made by the compiler rather than by the hardware and on the manner in which information regarding parallelism is communicated by the compiler to the hardware via the program. HPL PlayDoh is a research architecture that has been defined to support research in ILP, with a bias towards VLIW processing. The overall objective of this research effort is to develop a suite of architectural features and compiler techniques that will enable a second generation of VLIW processors to achieve high levels of ILP, across both scientific and non-scientific computations, but with hardware that is simple compared to out-of-order superscalar processors. The basic approach is to provide the program (compiler) more control over capabilities that, in superscalar processors, are typically microarchitectural (i.e., controlled by the hardware) by raising them to the architectural level.
Static correlated branch prediction
ACM Transactions on Programming Languages and Systems, 1999
Cited by 15 (1 self)
Recent work in history-based branch prediction uses novel hardware structures to capture branch correlation and increase branch prediction accuracy. Branch correlation occurs when the outcome of a conditional branch can be accurately predicted by observing the outcomes of previously executed branches in the dynamic instruction stream. In this article, we show how to instrument a program so that it is practical to collect run-time statistics that indicate where branch correlation occurs, and we then show how to use these statistics to transform the program so that its static branch prediction accuracy is improved. The run-time information that we gather is called a path profile, and it summarizes how often each executed sequence of program points occurs in the program trace. Our path profiles are more general than those previously proposed. The code transformation that we present is called static correlated branch prediction (SCBP). It exhibits better branch prediction accuracy than previously thought possible for static prediction techniques. Furthermore, through the use of an overpruning heuristic, we show that it is possible to determine automatically an appropriate trade-off between code expansion and branch predictability so that our transformation improves the performance of multiple-issue, deeply pipelined microprocessors.
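Why correlation helps static prediction can be shown with a small experiment. The sketch below is a simplification and not the article's SCBP transformation: it fixes one predicted direction per (branch, recent outcome history) pair and measures accuracy on the same trace it was profiled on. The function `static_accuracy`, the branch names `X` and `Y`, and the synthetic trace are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def static_accuracy(trace, history_len):
    """Best accuracy of one fixed prediction per (branch, history) key.

    `trace` is a list of (branch_id, taken) pairs in execution order.
    With history_len == 0 this is ordinary per-branch static prediction.
    """
    counts = defaultdict(Counter)
    for i, (branch, taken) in enumerate(trace):
        # Outcomes of the previous `history_len` branches in the dynamic stream.
        history = tuple(t for _, t in trace[max(0, i - history_len):i])
        counts[(branch, history)][taken] += 1
    # For each key, statically predict its most frequent outcome.
    correct = sum(max(outcomes.values()) for outcomes in counts.values())
    return correct / len(trace)


# Synthetic trace: branch X alternates, and branch Y always matches the preceding X.
trace = []
for i in range(100):
    x_taken = (i % 2 == 0)
    trace += [("X", x_taken), ("Y", x_taken)]

print(static_accuracy(trace, 0))  # per-branch prediction
print(static_accuracy(trace, 1))  # prediction keyed by one prior outcome
```

On this trace neither branch is biased, so per-branch prediction is right half the time, while keying on a single prior outcome makes every branch predictable. In a real compiler each history key would correspond to a duplicated code copy, which is where the code-expansion trade-off mentioned in the abstract arises.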
Value Profiling for Instructions and Memory Locations
1998
Cited by 14 (1 self)
Table of contents:
I Introduction
  A. Thesis Overview
II Related Work
  A. Value Prediction
    1. Load Speculation
  B. Compiler Analysis for Dynamic Compilation
  C. Code Specialization
III Value Profiling Methodology
  A. TNV Table
    1. Replacement Policy for Top N Value Table
  B. Methodology ...
rePLay: A Hardware Framework for Dynamic Program Optimization
IEEE Transactions on Computers, 1999
Cited by 12 (1 self)
In this paper, we propose a new framework for enhancing application performance through execution-guided optimization. The rePLay Framework uses information gathered at run-time to optimize an application's instruction stream. Some of these optimizations persist temporarily for only a single execution; others persist between runs. The heart of the rePLay Framework is a trace-cache-like device called the frame cache, used to store optimized regions of the original executable. These regions, called frames, are large, single-entry, single-exit regions spanning many basic blocks in the program's dynamic instruction stream. Optimizations are performed on these frames by a flexible optimizer contained within the processor. A rePLay configuration with a 256-entry frame cache, using a realistically sized frame constructor and frame sequencer, achieves an average frame size of 88 instructions with 68% coverage of the dynamic instruction stream, an average frame completion rate of 97.81%, and a frame predict...
The Structure and Performance of Efficient Interpreters
Journal of Instruction-Level Parallelism, 2003
Cited by 12 (3 self)
Interpreters designed for high general-purpose performance typically perform a large number of indirect branches (3.2%-13% of all executed instructions in our benchmarks). These branches consume...

