Dynamo: A Transparent Dynamic Optimization System
- ACM SIGPLAN Notices
, 2000
Cited by 347 (1 self)
We describe the design and implementation of Dynamo, a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor. The input native instruction stream to Dynamo can be dynamically generated (by a JIT for example), or it can come from the execution of a statically compiled native binary. This paper evaluates the Dynamo system in the latter, more challenging situation, in order to emphasize the limits, rather than the potential, of the system. Our experiments demonstrate that even statically optimized native binaries can be accelerated by Dynamo, and often by a significant degree. For example, the average performance of -O optimized SpecInt95 benchmark binaries created by the HP product C compiler is improved to a level comparable to their -O4 optimized version running without Dynamo. Dynamo achieves this by focusing its efforts on optimization opportunities that tend to manifest only at runtime, and hence opportunities that might be difficult for a static compiler to exploit. Dynamo's operation is transparent in the sense that it does not depend on any user annotations or binary instrumentation, and does not require multiple runs, or any special compiler, operating system or hardware support. The Dynamo prototype presented here is a realistic implementation running on an HP PA-8000 workstation under the HPUX 10.20 operating system.
A Comparison of Dynamic Branch Predictors that use Two Levels of Branch History
- in Proceedings of the 20th Annual International Symposium on Computer Architecture
, 1993
Cited by 234 (8 self)
Recent attention to speculative execution as a mechanism for increasing performance of single instruction streams has demanded substantially better branch prediction than what has been previously available. We [1, 2] and Pan, So, and Rahmeh [4] have both proposed variations of the same aggressive dynamic branch predictor for handling those needs. We call the basic model Two-Level Adaptive Branch Prediction; Pan, So, and Rahmeh call it Correlation Branch Prediction. In this paper, we adopt the terminology of [2] and show that there are really nine variations of the same basic model. We compare the nine variations with respect to the amount of history information kept. We study the effects of different branch history lengths and pattern history table configurations. Finally, we evaluate the cost effectiveness of the nine variations. 1 Introduction With the current movement toward deeper pipelines and wider issue rates, extremely high branch prediction accuracy becomes critical because a...
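As an illustration of the two-level idea named in the abstract above, here is a minimal Python sketch. It is not the authors' implementation. It assumes a single global history register (first level) that indexes one pattern history table of 2-bit saturating counters (second level), which is only one of the variations the paper compares. The class name, the `run` helper, and the `history_bits` parameter are hypothetical.

```python
class TwoLevelPredictor:
    """Assumed variant: one global history register, one shared PHT."""

    def __init__(self, history_bits=4):
        self.mask = (1 << history_bits) - 1
        self.history = 0                       # first level: last k branch outcomes
        self.pht = [1] * (1 << history_bits)   # second level: 2-bit counters, weakly not-taken

    def predict(self):
        # Counter values 2 and 3 mean "predict taken".
        return self.pht[self.history] >= 2

    def update(self, taken):
        counter = self.pht[self.history]
        self.pht[self.history] = min(3, counter + 1) if taken else max(0, counter - 1)
        # Shift the newest outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


def run(pattern, repeats, history_bits=4):
    """Return prediction accuracy on a repeating outcome pattern."""
    predictor = TwoLevelPredictor(history_bits)
    correct = total = 0
    for _ in range(repeats):
        for taken in pattern:
            correct += (predictor.predict() == taken)
            predictor.update(taken)
            total += 1
    return correct / total


if __name__ == "__main__":
    loop_like = [True, True, False]
    print(run(loop_like, 200, history_bits=4))
    print(run(loop_like, 200, history_bits=1))
```

On a repeating taken, taken, not-taken pattern, a longer history lets each table entry see a single outcome, while a 1-bit history forces one counter to serve two conflicting cases. This is the kind of history-length effect the abstract says the paper studies.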
Branch Prediction For Free
, 1993
Cited by 144 (8 self)
Many compilers rely on branch prediction to improve program performance by identifying frequently executed regions and by aiding in scheduling instructions. Profile-based predictors require a time-consuming and inconvenient compile-profile-compile cycle in order to make predictions. We present a program-based branch predictor that performs well for a large and diverse set of programs written in C and Fortran. In addition to using natural loop analysis to predict branches that control the iteration of loops, we focus on heuristics for predicting non-loop branches, which dominate the dynamic branch count of many programs. The heuristics are simple and require little program analysis, yet they are effective in terms of coverage and miss rate. Although program-based prediction does not equal the accuracy of profile-based prediction, we believe it reaches a sufficiently high level to be useful. Additional type and semantic information available to a compiler would enhance our heuristics. #...
Reducing indirect function call overhead in C++ programs
- In POPL ’94: Proceedings of the 21st ACM SIGPLAN-SIGACT symposium on Principles of programming languages
, 1994
Cited by 112 (5 self)
Modern computer architectures increasingly depend on mechanisms that estimate future control flow decisions to increase performance. Mechanisms such as speculative execution and prefetching are becoming standard architectural mechanisms that rely on control flow prediction to prefetch and speculatively execute future instructions. At the same time, computer programmers are increasingly turning to object-oriented languages to increase their productivity. These languages commonly use run time dispatching to implement object polymorphism. Dispatching is usually implemented using an indirect function call, which presents challenges to existing control flow prediction techniques. We have measured the occurrence of indirect function calls in a collection of C++ programs. We show that, although it is more important to predict branches accurately, indirect call prediction is also an important factor in some programs and will grow in importance with the growth of object-oriented programming. We examine the improvement offered by compile-time optimization and static and dynamic prediction techniques, and demonstrate how compilers can use existing branch prediction mechanisms to improve performance in C++ programs. Using these methods with the programs we examined, the number of instructions between mispredicted breaks in control can be doubled on existing computers.
A Modified Approach to Data Cache Management
- In Proceedings of the 28th Annual International Symposium on Microarchitecture
, 1995
Cited by 96 (2 self)
As processor performance continues to improve, more emphasis must be placed on the performance of the memory system. In this paper, a detailed characterization of data cache behavior for individual load instructions is given. We show that by selectively applying cache line allocation according to the characteristics of individual load instructions, overall performance can be improved for both the data cache and the memory system. This approach can improve some aspects of memory performance by as much as 60 percent on existing executables. 1. Introduction The average data access time is a measure of the time it takes to read a data item from memory. Since most programs need to access data, minimizing this term is crucial to achieving high performance. Unfortunately, access time to off-chip memory (measured in processor clock cycles) has increased dramatically as the disparity between main memory access times and processor clock speeds widens. Since there is no indication that dynamic memor...
Using Hybrid Branch Predictors to Improve Branch Prediction Accuracy in the Presence of Context Switches
- In Proceedings of the 23rd Annual International Symposium on Computer Architecture
, 1996
Cited by 93 (2 self)
Pipeline stalls due to conditional branches represent one of the most significant impediments to realizing the performance potential of deeply pipelined, superscalar processors. Many branch predictors have been proposed to help alleviate this problem, including the Two-Level Adaptive Branch Predictor, and more recently, two-component hybrid branch predictors. In a less idealized environment, such as a time-shared system, code of interest involves context switches. Context switches, even at fairly large intervals, can seriously degrade the performance of many of the most accurate branch prediction schemes. In this paper, we introduce a new hybrid branch predictor and show that it is more accurate (for a given cost) than any previously published scheme, especially if the branch histories are periodically flushed due to the presence of context switches. Keywords: branch prediction, context switch, superscalar, speculative execution 1 Introduction Branch prediction accuracy is a major pe...
A Comparative Analysis of Schemes for Correlated Branch Prediction
, 1995
Cited by 90 (4 self)
Modern high-performance architectures require extremely accurate branch prediction to overcome the performance limitations of conditional branches. We present a framework that categorizes branch prediction schemes by the way in which they partition dynamic branches and by the kind of predictor that they use. The framework allows us to compare and contrast branch prediction schemes, and to analyze why they work. We use the framework to show how a static correlated branch prediction scheme increases branch bias and thus improves overall branch prediction accuracy. We also use the framework to identify the fundamental differences between static and dynamic correlated branch prediction schemes. This study shows that there is room to improve the prediction accuracy of existing branch prediction schemes.
The Agree Predictor: A Mechanism for Reducing Negative Branch History Interference
, 1997
Cited by 83 (1 self)
Deeply pipelined, superscalar processors require accurate branch prediction to achieve high performance. Two-level branch predictors have been shown to achieve high prediction accuracy. It has also been shown that branch interference is a major contributor to the number of branches mispredicted by two-level predictors. This paper presents a new method to reduce the interference problem called agree prediction, which reduces the chance that two branches aliasing the same PHT entry will interfere negatively. We evaluate the performance of this scheme using full traces (both user and supervisor) of the SPECint95 benchmarks. The result is a reduction in the misprediction rate of gcc ranging from 8.62% with a 64K-entry PHT up to 33.3% with a 1K-entry PHT. Keywords: branch prediction, superscalar, speculative execution, two-level branch prediction. 1 Introduction The link between changes in branch misprediction rate and changes in performance has been well documented [1-4, 6]. Yeh and...
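The aliasing problem described in the abstract above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the paper's evaluated design: the whole PHT is collapsed to one shared 2-bit counter so that every branch aliases, each branch's biasing bit is assumed to be set from its first observed outcome, and a branch with no bias bit yet falls back to a static "taken" guess. The function names and the trace are hypothetical.

```python
def conventional(trace):
    """Shared 2-bit counter that directly predicts taken / not-taken."""
    counter, correct = 2, 0
    for _pc, taken in trace:
        correct += ((counter >= 2) == taken)
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return correct / len(trace)


def agree(trace):
    """Same shared counter, but it predicts agree / disagree with a per-branch bias bit."""
    counter, correct = 2, 0
    bias = {}  # assumed: biasing bit per branch, taken from its first outcome
    for pc, taken in trace:
        if pc not in bias:
            prediction = True      # no bias bit yet: static "taken" guess (assumption)
            bias[pc] = taken
        else:
            agrees = counter >= 2  # counter values 2 and 3 mean "agree with the bias bit"
            prediction = bias[pc] if agrees else not bias[pc]
            if taken == bias[pc]:
                counter = min(3, counter + 1)
            else:
                counter = max(0, counter - 1)
        correct += (prediction == taken)
    return correct / len(trace)


if __name__ == "__main__":
    # Branch "A" is always taken, branch "B" never is, and both hit the same counter.
    trace = [("A", True), ("B", False)] * 100
    print(conventional(trace), agree(trace))
```

With opposite-direction branches sharing a counter, the conventional scheme keeps pulling the counter both ways, while in the agree scheme both branches push it toward "agree", so the aliasing becomes harmless in this toy case.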
Reducing Branch Costs via Branch Alignment
- In Sixth International Conference on Architectural Support for Programming Languages and Operating Systems
, 1994
Cited by 80 (13 self)
Several researchers have proposed algorithms for basic block reordering. We call these branch alignment algorithms. The primary emphasis of these algorithms has been on improving instruction cache locality, and the few studies concerned with branch prediction reported small or minimal improvements. As wide-issue architectures become increasingly popular the importance of reducing branch costs will increase, and branch alignment is one mechanism which can effectively reduce these costs. In this paper, we propose an improved branch alignment algorithm that takes into consideration the architectural cost model and the branch prediction architecture when performing the basic block reordering. We show that branch alignment algorithms can improve a broad range of static and dynamic branch prediction architectures. We also show that a program's performance can be improved by approximately 5% even when using recently proposed, highly accurate branch prediction architectures. The programs are compi...
The Microarchitecture of Superscalar Processors
, 1995
Cited by 76 (0 self)
Superscalar processing is the latest in a long series of innovations aimed at producing ever-faster microprocessors. By exploiting instruction-level parallelism, superscalar processors are capable of executing more than one instruction in a clock cycle. This paper discusses the microarchitecture of superscalar processors. We begin with a discussion of the general problem solved by superscalar processors: converting an ostensibly sequential program into a more parallel one. The principles underlying this process, and the constraints that must be met, are discussed. The paper then provides a description of the specific implementation techniques used in the important phases of superscalar processing. The major phases include: i) instruction fetching and conditional branch processing, ii) the determination of data dependences involving register values, iii) the initiation, or issuing, of instructions for parallel execution, iv) the communication of data values through memory via loads and stores, and v) committing the process state in correct order so that precise interrupts can be supported. Examples of recent superscalar microprocessors, the MIPS R10000, the DEC 21164, and the AMD K5 are used to illustrate a variety of superscalar methods.

