Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters
- PLDI'03, 2003
Cited by 38 (7 self)
Interpreters designed for efficiency execute a huge number of indirect branches and can spend more than half of the execution time in indirect branch mispredictions. Branch target buffers (BTBs) are the best widely available form of indirect branch prediction; however, their prediction accuracy for existing interpreters is only 2%–50%. In this paper we investigate two methods for improving the prediction accuracy of BTBs for interpreters: replicating virtual machine (VM) instructions and combining sequences of VM instructions into superinstructions. We investigate static (interpreter build-time) and dynamic (interpreter run-time) variants of these techniques and compare them and several combinations of these techniques. These techniques can eliminate nearly all of the dispatch branch mispredictions, and have other benefits, resulting in speedups by a factor of up to 3.17 over efficient threaded-code interpreters, and speedups by a factor of up to 1.3 over techniques relying on superinstructions alone.
The Structure and Performance of Efficient Interpreters
- Journal of Instruction-Level Parallelism, 2003
Cited by 12 (3 self)
Interpreters designed for high general-purpose performance typically perform a large number of indirect branches (3.2%-13% of all executed instructions in our benchmarks). These branches consume...
MICRO-PROCESSOR BENCHMARKS BY
Cited by 1 (0 self)
This work examines non-traditional techniques for workload characterization and classification analysis of different benchmark suites used in computer system performance analysis. In current computer systems research, benchmarks are often used to evaluate performance to support micro-architecture design. In this work, two different benchmark suites, SPEC CPU2000 and Berkeley Multimedia, are simulated using the SimpleScalar 3.0 processor simulator tool set and their performance is characterized. We examine IPC behavior and instruction mix composition, perform feature correlation using a large set of performance metrics, and extract the principal components of the metrics dataset. Additionally, C5.0 is used to construct a classifier to classify the benchmark suite. The objective of this work is to find characteristics that distinguish different classes of workloads. If a processor can automatically recognize the type of workload according to its distinguishing characteristics,
Automatic Code-Generation Techniques for Micro-Threaded RISC Architectures
- 2006
There has been an ever-widening gap between processor and memory speeds, resulting in a "memory wall" where the time for memory accesses dominates performance. To counter this, architectures that use many very small threads, allowing multiple memory accesses to occur in parallel, have been under investigation. Examples of these architectures are the CARE (Compiler Aided Reorder Engine) architecture, micro-threading architectures and cellular architectures, such as the IBM Cyclops family, implemented using processors-in-memory (PIM), which is the main architecture discussed in this thesis. PIM architectures achieve high performance by increasing the bandwidth of processor-to-memory communication and reducing its latency, through the use of many processors physically close to the main memory. These massively parallel architectures may have sophisticated memory models, and the [...] I contend that there is an open question regarding what may be the ideal approach to implementing parallelism, via using many threads, from the programmer's

