Predicting Indirect Branches via Data Compression
, 1998
Cited by 22 (2 self)
Branch prediction is a key mechanism used to achieve high performance on multiple issue, deeply pipelined processors. By predicting the branch outcome at the instruction fetch stage of the pipeline, superscalar processors are better able to exploit Instruction Level Parallelism (ILP) by providing a larger window of instructions. However, when a branch is mispredicted, instructions from the mispredicted path must be discarded. Therefore, branch prediction accuracy is critical to achieve high performance. Existing branch prediction schemes can accurately predict the direction of conditional branches, but they have difficulty predicting the correct targets of indirect branches. Indirect branches occur frequently in Object-Oriented Languages (OOL), as well as in Dynamically-Linked Libraries (DLLs), two programming environments rapidly increasing in popularity. In addition, certain language constructs such as multi-way control transfers (e.g., switches), and architectural features such as 6...
The Structure and Performance of Efficient Interpreters
- Journal of Instruction-Level Parallelism
, 2003
Cited by 12 (3 self)
Interpreters designed for high general-purpose performance typically perform a large number of indirect branches (3.2%-13% of all executed instructions in our benchmarks). These branches consume...
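To make the dispatch cost concrete, here is a minimal sketch of an interpreter loop. The opcode set (`PUSH`, `ADD`, `MUL`, `HALT`) and the `run` function are hypothetical and are not taken from the paper. Python only models the control flow: in an interpreter written in C, the table lookup and call below would typically compile to one indirect branch per executed virtual-machine instruction.

```python
# Hypothetical four-opcode stack machine, for illustration only.
PUSH, ADD, MUL, HALT = range(4)


def run(code):
    """Execute bytecode; return (final stack, number of dispatches)."""
    stack, pc, dispatches = [], 0, 0

    def push():
        nonlocal pc
        stack.append(code[pc])  # the operand follows the opcode
        pc += 1

    def add():
        b, a = stack.pop(), stack.pop()
        stack.append(a + b)

    def mul():
        b, a = stack.pop(), stack.pop()
        stack.append(a * b)

    handlers = {PUSH: push, ADD: add, MUL: mul}

    while True:
        opcode = code[pc]
        pc += 1
        if opcode == HALT:
            return stack, dispatches
        # Dispatch: the call target depends on the run-time opcode,
        # which is what makes this an indirect control transfer.
        handlers[opcode]()
        dispatches += 1
```

Running `run([PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT])` evaluates (2 + 3) * 4 and performs one dispatch per non-halting instruction.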
Indirect Branch Prediction using Data Compression Techniques
- Journal of Instruction Level Parallelism
, 1999
Cited by 6 (0 self)
Branch prediction is a key mechanism used to achieve high performance on multiple issue, deeply pipelined processors. By predicting the branch outcome at the instruction fetch stage of a pipeline, superscalar processors become able to exploit Instruction Level Parallelism (ILP) by providing a larger window of instructions. However, when a branch is mispredicted, instructions from the mispredicted path must be discarded. Therefore, branch prediction accuracy is critical to achieve high performance. Existing branch prediction schemes can accurately predict the direction of conditional branches, but have difficulties predicting the correct targets of indirect branches. Indirect branches occur frequently in Object-Oriented Languages (OOL), as well as in Dynamically-Linked Libraries (DLLs), two programming environments rapidly increasing in popularity. In addition, certain language constructs such as multi-way control transfers (e.g., switches), and architectural features such as ...
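The difficulty with indirect branch targets can be illustrated with a simple baseline. The sketch below is not the compression-based predictor the paper proposes; it models a plain last-target scheme (as in a branch target buffer) that predicts each indirect branch will jump to the same target it used last time. The function name and the trace format of `(branch_address, target_address)` pairs are assumptions made for this example.

```python
def last_target_mispredictions(trace):
    """Count mispredictions of a last-target predictor.

    trace: iterable of (branch_address, target_address) pairs.
    A branch with no recorded target counts as a misprediction.
    """
    last_target = {}  # branch address -> most recent target
    misses = 0
    for pc, target in trace:
        if last_target.get(pc) != target:
            misses += 1
        last_target[pc] = target
    return misses


# A call site that always reaches the same target misses only once (cold).
monomorphic = [(0x100, 0x2000)] * 8
# A call site alternating between two targets defeats the scheme entirely.
polymorphic = [(0x100, 0x2000 if i % 2 == 0 else 0x3000) for i in range(8)]
```

The alternating trace is the kind of pattern a virtual call or a switch statement can produce, and it is perfectly regular, which is why history-based schemes can do better than remembering only the last target.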
Dynamic Branch Decoupled Architecture
- 1999 IEEE International Conference on Computer Design
, 1999
Cited by 4 (1 self)
We propose an alternative approach to branch resolution based on the earlier work on decoupled memory architectures. Branch decoupling is a technique to decouple a single instruction stream program into two streams. One stream is solely dedicated to resolving branches as early as possible (both the branch condition and the branch target). The resolved branch targets are consumed by the other computing stream through a queue. We have previously proposed a compiler-based, static branch decoupling methodology. In this paper, we propose a dynamic branch decoupled (DBD) architecture. Simulations show a speedup of 25.6% for SPEC95 integer benchmarks and 6.1% for SPEC95 FP benchmarks over a 2-level adaptive branch predictor. The average number of branch penalty cycles per instruction for DBD reduces to 0.0475 compared to 0.0835 for the 2-level branch predictor. 1 Introduction The instruction-level-parallel processors rely upon dynamic branch prediction techniques [11] to improve performance. Se...
IPC in the 10s via Resource Flow Computing with Levo
, 2001
Cited by 3 (2 self)
High ILP (Instruction Level Parallelism) exists in typical integer codes. Our goal is to create a machine called Levo that will realize this potential ILP in the face of real hardware limitations. Levo is modeled and evaluated at three different levels of abstraction. FastLevo models the potential of Levo via high-level trace-driven simulation embodying key hardware constraints. LevoSim is a detailed cycle-accurate execution-driven simulator; it gives us precise performance data on the Levo architecture. Lastly, HDLevo is a synthesizable VHDL model of the key elements of Levo; it gives us detailed Levo hardware cost estimates. Levo has many novel architectural features, including the use of time tags with instructions to implicitly enforce data and control dependencies. Levo provides a registerless data path, in that there is no central register file or bottleneck. A resource-flow model of computation is used in which instructions execute whenever resources are available, regardless of true data or control dependencies. Levo obtains this high IPC with scalability in our hardware implementation. FastLevo shows IPCs in the 10s over a wide range of Levo configurations on a subset of the SPECint2000 benchmarks allowing for relaxed memory and control dependencies, though real machine hardware and data dependencies. With HDLevo we obtain a hardware cost for the Levo core (no PEs, memory or I-fetch) of less than 59 million transistors.
Improving the Accuracy of Indirect Branch Prediction via Branch Classification
, 1998
Cited by 2 (0 self)
Providing accurate branch prediction is critical to effectively exploit superscalar execution. While most modern processors employ speculative execution to overcome the branch hazard problem, some number of the instructions will have to be discarded when a branch misprediction occurs. Even though existing branch prediction schemes can accurately predict the direction of conditional branches, they still have difficulty predicting the correct targets of indirect branches. This type of branch occurs more frequently in languages used in Object-Oriented Programming (OOP), as well as in Dynamically-Linked Libraries (DLLs), two programming environments rapidly increasing in popularity. In this paper, we investigate the performance of several predictors used to predict the targets of indirect branches. We present indirect branch classification as a mechanism to characterize the behavior of indirect branches. We then propose hybrid predictors utilizing static and profile-guided branch c...
Verification of ILP Speedups in the 10's for Disjoint Eager Execution
, 1997
Cited by 2 (1 self)
Disjoint Eager Execution (DEE) has demonstrated Instruction Level Parallelism (ILP) speedups of a factor of 30 in simulations [15]. Theoretical and practical arguments verifying these gains are presented. This includes the use of Amdahl's Law, analysis of the partial trace of a detailed simulation of a SPECint92 benchmark, and analysis of the effect on cycle time of the scheduling hardware of the proposed Levo DEE-realization prototype. The static tree heuristic used to economically realize DEE in Levo is also examined. In particular, the shape of the tree is verified via formal proofs, and the variations in the dimensions of the tree with respect to available resources and branch predictor accuracy are modeled and studied. This work was supported in part by the Intel Corporation. This paper has been submitted for publication to the 30th International Symposium on Microarchitecture (Dec. 1997). Versions of Figures 1 and 2 have previously appeared in [15] and [16]. 1 Introduction ...
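As a rough illustration of how Amdahl's Law constrains a speedup of this size, the following uses the textbook form of the law, not the paper's own analysis. The symbols are assumptions introduced here: f is the fraction of execution time that is accelerated, s is the speedup of that fraction, and S is the overall speedup.

```latex
S = \frac{1}{(1 - f) + \dfrac{f}{s}}
\quad\Longrightarrow\quad
S < \frac{1}{1 - f},
\qquad
S = 30 \;\Rightarrow\; f > 1 - \frac{1}{30} = \frac{29}{30} \approx 0.967
```

Under these assumptions, an overall speedup of 30 is reachable only if more than about 96.7% of the original execution time benefits from the accelerated execution, no matter how large s is.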
Overview of the Levo High-ILP Computer
, 1997
Cited by 1 (1 self)
This document gives an overview of the microarchitecture of the Levo prototype computer. Both the data and control paths of Levo are described, with the function of all major subsystems given. Levo's operation is outlined. Anticipated absolute dimensions are given, assuming the 32 Processing Element (PE) single-chip version, called Levo-32-chip. 1 Introduction In order to significantly improve performance of standard general-purpose processors, such as Intel x86 microprocessors, Instruction Level Parallelism (ILP) must be exploited. ILP involves the execution of machine instructions in parallel. It is fine-grain, occurring within an iteration of a loop, for example. In order to enhance ILP, many methods are commonly used[12], such as data dependency reduction and branch prediction. In the latter, when a CPU encounters a branch, it uses special hardware that predicts whether the branch will be taken or not taken (its sign), based on the branch's prior execution history; execution the...
Optimization Strategies for a Java Virtual Machine Interpreter on the Cell Broadband Engine
- In CF ’08: Proceedings of the 5th international conference on Computing frontiers
, 2008
Cited by 1 (0 self)
Virtual machines (VMs) such as the Java VM are a popular format for running architecture-neutral code in a managed runtime. Such VMs are typically implemented using a combination of interpretation and just-in-time compilation (JIT). A significant challenge for the portability of VM code is the growing popularity of multi-core architectures with specialized processing cores aimed at computation-intensive applications such as media processing. Such cores differ greatly in architectural design compared to traditional desktop processors. One such processing core is the Cell Broadband Engine’s (Cell BE) Synergistic Processing Element (SPE). An SPE is a lightweight VLIW processor core with a SIMD vector instruction set. In this paper we investigate some popular interpreter optimizations and introduce new optimizations exploiting the special hardware properties offered by the Cell BE’s SPE.
An Experimental Study of Sorting and Branch Prediction
Cited by 1 (0 self)
Sorting is one of the most important and well-studied problems in Computer Science. Many good algorithms are known which offer various trade-offs in efficiency, simplicity, memory use, and other factors. However, these algorithms do not take into account features of modern computer architectures that significantly influence performance. Caches and branch predictors are two such features, and while there has been a significant amount of research into the cache performance of general-purpose sorting algorithms, there has been little research on their branch prediction properties. In this paper we empirically examine the behaviour of the branches in all the most common sorting algorithms. We also consider the interaction of cache optimization on the predictability of the branches in these algorithms. We find insertion sort to have the fewest branch mispredictions of any comparison-based sorting algorithm, that bubble and shaker sort operate in a fashion which makes their branches highly unpredictable, that the unpredictability of shellsort’s branches improves its caching behaviour and that several cache optimizations have little effect on mergesort’s branch mispredictions. We find also that optimizations to quicksort – for example the choice of pivot – have a strong influence on the predictability of its branches. We point out a simple way of removing branch instructions from a classic heapsort implementation, and show also that unrolling a loop in a cache-optimized heapsort implementation improves the predictability of its branches. Finally, we note that when sorting random data two-level adaptive branch predictors are usually no better than simpler bimodal predictors. This is despite the fact that two-level adaptive predictors are almost always superior to bimodal predictors in general.
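For readers unfamiliar with the bimodal predictor mentioned above, the following is a minimal sketch of the usual textbook design: a table of 2-bit saturating counters indexed by branch address. It is not the simulator used in the paper; the class name, the table size, the initial counter state and the `(branch_address, taken)` trace format are all assumptions made for this example.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters (0..3); predict taken when >= 2."""

    def __init__(self, table_size=1024):
        self.table_size = table_size
        # Assumed initial state: 1, i.e. weakly not-taken.
        self.counters = [1] * table_size

    def predict(self, pc):
        return self.counters[pc % self.table_size] >= 2

    def update(self, pc, taken):
        i = pc % self.table_size
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(trace, predictor=None):
    """Replay a trace of (branch_address, taken) pairs; return the miss count."""
    predictor = predictor or BimodalPredictor()
    misses = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


always_taken = [(0x40, True)] * 10
alternating = [(0x40, i % 2 == 0) for i in range(10)]
```

With these assumptions, a strongly biased branch such as `always_taken` is mispredicted only while the counter warms up, whereas the strictly `alternating` branch is mispredicted every time. A two-level adaptive predictor, which also records recent outcomes, is designed to capture patterns like the second one.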

