Reducing indirect function call overhead in C++ programs
- In POPL ’94: Proceedings of the 21st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 1994
Cited by 112 (5 self)
Modern computer architectures increasingly depend on mechanisms that estimate future control flow decisions to increase performance. Mechanisms such as speculative execution and prefetching are becoming standard architectural mechanisms that rely on control flow prediction to prefetch and speculatively execute future instructions. At the same time, computer programmers are increasingly turning to object-oriented languages to increase their productivity. These languages commonly use run-time dispatching to implement object polymorphism. Dispatching is usually implemented using an indirect function call, which presents challenges to existing control flow prediction techniques. We have measured the occurrence of indirect function calls in a collection of C++ programs. We show that, although it is more important to predict branches accurately, indirect call prediction is also an important factor in some programs and will grow in importance with the growth of object-oriented programming. We examine the improvement offered by compile-time optimization and static and dynamic prediction techniques, and demonstrate how compilers can use existing branch prediction mechanisms to improve performance in C++ programs. Using these methods with the programs we examined, the number of instructions between mispredicted breaks in control can be doubled on existing computers.
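The abstract notes that C++ dispatch is usually compiled as an indirect function call. As a hedged illustration, not necessarily the transformation evaluated in the paper, the sketch below contrasts a plain virtual call with a guarded direct call: when one receiver type is expected, the code tests for it with an ordinary conditional branch (which existing branch predictors handle) and falls back to the indirect call otherwise. The `Shape`, `Square`, and `Circle` classes and both functions are hypothetical. A compiler would typically compare the object's vtable pointer; portable C++ can only approximate that test with `typeid`.

```cpp
#include <cassert>
#include <cmath>
#include <memory>
#include <typeinfo>
#include <vector>

// Hypothetical class hierarchy, for illustration only.
struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Square final : Shape {
    double side;
    explicit Square(double s) : side(s) { assert(s >= 0.0); }
    double area() const override { return side * side; }
};

struct Circle final : Shape {
    double radius;
    explicit Circle(double r) : radius(r) { assert(r >= 0.0); }
    double area() const override { return std::acos(-1.0) * radius * radius; }
};

using Shapes = std::vector<std::unique_ptr<Shape>>;

// Plain dynamic dispatch: each s->area() is an indirect call through the
// vtable, so the hardware must predict the call target.
double total_area_virtual(const Shapes& shapes) {
    double sum = 0.0;
    for (const auto& s : shapes) sum += s->area();
    return sum;
}

// Guarded direct call: test for the expected type with a conditional branch.
// On the fast path the qualified call Square::area() is statically bound
// (and inlinable); any other type falls back to the indirect virtual call.
double total_area_guarded(const Shapes& shapes) {
    double sum = 0.0;
    for (const auto& s : shapes) {
        const Shape& ref = *s;
        if (typeid(ref) == typeid(Square)) {
            sum += static_cast<const Square&>(ref).Square::area();
        } else {
            sum += ref.area();
        }
    }
    return sum;
}
```

Both functions compute the same result; whether the guard pays off depends on how skewed the distribution of receiver types is at that call site.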
Quantifying behavioral differences between C and C++ programs
- Journal of Programming Languages, 1994
Cited by 83 (15 self)
Improving the performance of C programs has been a topic of great interest for many years. Both hardware technology and compiler optimization research have been applied in an effort to make C programs execute faster. In many application domains, the C++ language is replacing C as the programming language of choice. In this paper, we measure the empirical behavior of a group of significant C and C++ programs and attempt to identify and quantify behavioral differences between them. Our goal is to determine whether optimization technology that has been successful for C programs will also be successful in C++ programs. We furthermore identify behavioral characteristics of C++ programs that suggest optimizations that should be applied in those programs. Our results show that C++ programs exhibit behavior that is significantly different from that of C programs. These results should be of interest to compiler writers and architecture designers who are designing systems to execute object-oriented programs.
Fast Accurate Instruction Fetch and Branch Prediction
- In 21st Annual International Symposium on Computer Architecture, 1994
Cited by 65 (12 self)
Accurate branch prediction is critical to performance; mispredicted branches mean that tens of cycles may be wasted in superscalar architectures. Architectures combining very effective branch prediction mechanisms with modified branch target buffers (BTBs) have been proposed for wide-issue processors. These mechanisms require considerable processor resources. Concurrently, the larger address space of 64-bit architectures introduces new obstacles and opportunities. A larger address space means branch target buffers become more expensive. In this paper, we show how a combination of less expensive mechanisms can achieve better performance than BTBs. This combination relies on a number of design choices described in the paper. We used trace-driven simulation to show that our proposed design, which uses fewer resources, offers better performance than previously proposed alternatives for most programs, and we indicate how to further improve this design.
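For readers unfamiliar with the kind of baseline these designs are compared against, here is a minimal trace-driven simulation of a classic bimodal predictor (a table of 2-bit saturating counters indexed by branch address). It is a textbook scheme chosen for illustration, not the combination of mechanisms proposed in the paper; `BranchEvent`, `BimodalPredictor`, and `count_mispredictions` are hypothetical names, and 4-byte instruction alignment is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One conditional-branch execution from a (hypothetical) trace.
struct BranchEvent {
    std::uint64_t pc;  // branch instruction address
    bool taken;        // actual outcome
};

// Bimodal predictor: 2-bit saturating counters indexed by low PC bits.
// Counter values 0-1 predict not-taken; 2-3 predict taken.
class BimodalPredictor {
public:
    explicit BimodalPredictor(unsigned log2_entries) {
        assert(log2_entries < 32);
        mask_ = (std::size_t{1} << log2_entries) - 1;
        counters_.assign(std::size_t{1} << log2_entries, 1);  // weakly not-taken
    }

    bool predict(std::uint64_t pc) const { return counters_[index(pc)] >= 2; }

    void update(std::uint64_t pc, bool taken) {
        std::uint8_t& c = counters_[index(pc)];
        if (taken && c < 3) ++c;
        if (!taken && c > 0) --c;
    }

private:
    // Drop the low 2 bits, assuming 4-byte aligned instructions.
    std::size_t index(std::uint64_t pc) const {
        return static_cast<std::size_t>(pc >> 2) & mask_;
    }

    std::size_t mask_ = 0;
    std::vector<std::uint8_t> counters_;
};

// Replay a trace, predicting each branch before learning its outcome.
std::size_t count_mispredictions(const std::vector<BranchEvent>& trace,
                                 unsigned log2_entries) {
    BimodalPredictor predictor(log2_entries);
    std::size_t misses = 0;
    for (const BranchEvent& e : trace) {
        if (predictor.predict(e.pc) != e.taken) ++misses;
        predictor.update(e.pc, e.taken);
    }
    return misses;
}
```

For a loop branch taken nine times and then not taken, repeated three times, this predictor misses only once during warm-up plus once at each loop exit: 4 mispredictions in 30 executions.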
A comprehensive instruction fetch mechanism for a processor supporting speculative execution
- In Proceedings of the 25th Annual International Symposium on Microarchitecture, 1992
Cited by 63 (5 self)
A superscalar processor supporting speculative execution requires an instruction fetch mechanism that can provide instruction fetch addresses as nearly correct as possible and as soon as possible in order to reduce the likelihood of throwing away speculative work. In this paper we propose a comprehensive instruction fetch mechanism to satisfy that need. Implementation issues are identified, possible solutions and designs for resolving those issues are simulated, and the results of these simulations are presented. A metric for measuring the average penalty of executing a branch instruction is introduced and used to evaluate the performance of our instruction fetch mechanism. Using the proposed mechanism, we achieve an average performance of 4.19 IPC on the original SPEC benchmarks on a machine that can ideally execute five instructions per cycle.
Comparing Software and Hardware Schemes for Reducing the Cost of Branches
- In Proceedings of the 16th International Symposium on Computer Architecture, 1989
Cited by 48 (12 self)
Pipelining has become a common technique to increase the throughput of the instruction fetch, instruction decode, and instruction execution portions of modern computers. Branch instructions disrupt the flow of instructions through the pipeline, increasing the overall execution cost of branch instructions. Three schemes to reduce the cost of branches are presented in the context of a general pipeline model. Ten realistic Unix domain programs are used to directly compare the cost and performance of the three schemes, and the results favor the software-based scheme. For example, the software-based scheme has a cost of 1.65 cycles/branch vs. a cost of 1.68 cycles/branch for the best hardware scheme on a highly pipelined processor (11-stage pipeline). The results are 1.19 (software scheme) vs. 1.23 cycles/branch (best hardware scheme) for a moderately pipelined processor (5-stage pipeline).
1 Introduction
The pipelining of modern computer designs causes problems for the execution o...
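The cycles-per-branch figures above can be read through a deliberately simple cost model, which is our assumption and not the general pipeline model used in the paper: every branch occupies one cycle, and each misprediction adds a fixed refill penalty of roughly the number of stages before the branch resolves. `cycles_per_branch` and `accuracy_for_cost` are hypothetical helpers.

```cpp
#include <cassert>

// Assumed model (not the paper's): one cycle per branch plus a fixed
// refill penalty, in cycles, for each misprediction.
double cycles_per_branch(double prediction_accuracy, double penalty_cycles) {
    assert(prediction_accuracy >= 0.0 && prediction_accuracy <= 1.0);
    assert(penalty_cycles >= 0.0);
    return 1.0 + (1.0 - prediction_accuracy) * penalty_cycles;
}

// Inverse of the same model: the accuracy needed to reach a target cost.
double accuracy_for_cost(double target_cycles, double penalty_cycles) {
    assert(penalty_cycles > 0.0 && target_cycles >= 1.0);
    return 1.0 - (target_cycles - 1.0) / penalty_cycles;
}
```

Under this model a deeper pipeline (larger penalty) inflates the cost of every residual misprediction, which is consistent with the 11-stage figures exceeding the 5-stage ones; any penalty values plugged in are placeholders, not numbers from the paper.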
A comprehensive instruction fetch mechanism for a processor supporting speculative execution
- In Proceedings of the 25th Annual International Symposium on Microarchitecture, 1992
Cited by 3 (0 self)
A superscalar processor supporting speculative execution requires an instruction fetch mechanism that can provide instruction fetch addresses as nearly correct as possible and as soon as possible in order to reduce the likelihood of throwing away speculative work. In this paper we propose a comprehensive instruction fetch mechanism to satisfy that need. Implementation issues are identified, possible solutions and designs for resolving those issues are simulated, and the results of these simulations are presented. A metric for measuring the average penalty of executing a branch instruction is introduced and used to evaluate the performance of our instruction fetch mechanism. Using the proposed mechanism, we achieve an average performance of 4.19 IPC on the original SPEC benchmarks on a machine that can ideally execute five instructions per cycle.
Dynamic Schemes for Speculative Execution of Code
- Proc. of the 6th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 1998
Cited by 2 (1 self)
Speculative execution of code is becoming a key technique for enhancing the performance of pipelined processors. We study schemes that predict the execution path of a program based on the history of branch executions. Building on previous work, we present a model for analyzing the effective speedup from pipelining using various schemes for speculative execution. We follow this with stochastic analyses of various speculative execution schemes. Finally, we conclude with simulations covering several of the settings we study.
1. Introduction
We consider the problem of on-line instruction fetch on pipelined processors. Pipelining is commonly used on modern processors to speed up the execution of programs. Typically, the execution of a single instruction in the code can be partitioned into steps, starting with instruction fetch. While traditional processors allow the execution of one instruction at a time, a pipelined processor starts the execution of an instruction as soon as the previous ...
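As one concrete instance of predicting from the history of branch executions, the sketch below implements a gshare-style predictor, which XORs a global register of recent outcomes with the branch address to select a 2-bit counter. This well-known history-based scheme is chosen for illustration only; it is not the stochastic model analysed in the paper, and `GsharePredictor` and `count_mispredictions` are hypothetical names. The 4-byte instruction alignment is likewise an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// gshare-style predictor: global outcome history XOR branch address
// selects a 2-bit saturating counter (0-1 predict not-taken, 2-3 taken).
class GsharePredictor {
public:
    explicit GsharePredictor(unsigned history_bits) {
        assert(history_bits > 0 && history_bits < 32);
        mask_ = (std::uint32_t{1} << history_bits) - 1;
        counters_.assign(std::size_t{1} << history_bits, 1);
    }

    bool predict(std::uint64_t pc) const { return counters_[index(pc)] >= 2; }

    void update(std::uint64_t pc, bool taken) {
        std::uint8_t& c = counters_[index(pc)];
        if (taken && c < 3) ++c;
        if (!taken && c > 0) --c;
        // Shift the newest outcome into the global history register.
        history_ = ((history_ << 1) | (taken ? 1u : 0u)) & mask_;
    }

private:
    std::size_t index(std::uint64_t pc) const {
        // Drop 2 alignment bits (assumed 4-byte instructions), then hash.
        return static_cast<std::size_t>(((pc >> 2) ^ history_) & mask_);
    }

    std::uint32_t mask_ = 0;
    std::uint32_t history_ = 0;
    std::vector<std::uint8_t> counters_;
};

// Count mispredictions of one static branch given its outcome sequence.
std::size_t count_mispredictions(std::uint64_t pc, const std::vector<bool>& outcomes,
                                 unsigned history_bits) {
    GsharePredictor predictor(history_bits);
    std::size_t misses = 0;
    for (bool taken : outcomes) {
        if (predictor.predict(pc) != taken) ++misses;
        predictor.update(pc, taken);
    }
    return misses;
}
```

A branch that alternates taken/not-taken defeats a single per-address 2-bit counter, but here the history separates the two contexts: with 4 history bits it mispredicts 3 times in 100 executions and then predicts perfectly.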
Performance Optimization of Pipelined Caches
Cited by 2 (1 self)
This paper formulates and shows how to solve the problem of selecting the cache size and depth of cache pipelining that maximize the performance of a given instruction-set architecture. The solution combines trace-driven architectural simulations and the timing analysis of the physical implementation of the cache. Increasing cache size tends to improve performance, but this improvement is limited because cache access time increases with its size. This trade-off results in an optimization problem we refer to as multilevel optimization, because it requires the simultaneous consideration of two levels of machine abstraction: the architectural level and the physical implementation level. The introduction of pipelining permits the use of larger caches without increasing their apparent access time; however, the bubbles caused by load and branch delays limit this technique. In this paper we also show how multilevel optimization can be applied to pipelined systems if software- and hardware-b...
Branch Prediction Architectures for 64-bit Address Space
- 1993
Cited by 1 (0 self)
Processor architectures will increasingly rely on issuing multiple instructions to make full use of available processor resources. When issuing multiple instructions on conventional processors, accurate branch prediction is critical to performance; mispredicted branches may mean that tens of cycles are wasted. Architectures combining very effective branch prediction mechanisms with modified branch target buffers (BTBs) have been proposed for wide-issue processors. These mechanisms require considerable processor resources; proposals commonly suggest that 16 kilobytes of cache be devoted to branch history and prediction information. Concurrently, the larger address space of 64-bit architectures introduces new obstacles and opportunities. A larger address space means branch target buffers become more expensive, but other branch prediction techniques become more applicable. In this paper, we show how a combination of less expensive mechanisms can achieve better performance than...
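To see why a larger address space makes branch target buffers more expensive, the sketch below estimates the storage of a direct-mapped BTB under a simplified layout that we assume: each entry holds a valid bit, a tag, and a full target address, and instructions are 4-byte aligned so the low 2 address bits are never stored. `btb_storage_bits` is a hypothetical helper; real BTBs often store partial tags or partial targets, which changes the totals.

```cpp
#include <cassert>
#include <cstdint>

// Storage, in bits, of a direct-mapped BTB under the assumed layout:
// valid bit + tag + full target per entry, low 2 address bits omitted.
std::uint64_t btb_storage_bits(std::uint64_t entries, unsigned address_bits) {
    assert(entries > 0 && (entries & (entries - 1)) == 0);  // power of two
    unsigned index_bits = 0;
    while ((std::uint64_t{1} << index_bits) < entries) ++index_bits;
    assert(address_bits > index_bits + 2);
    const std::uint64_t tag_bits = address_bits - index_bits - 2;
    const std::uint64_t target_bits = address_bits - 2;
    return entries * (1 + tag_bits + target_bits);
}
```

With 512 entries this layout needs 26,624 bits (3,328 bytes) for 32-bit addresses but 59,392 bits (7,424 bytes) for 64-bit addresses, more than twice the storage for the same number of branches.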
O.A. Olukotun
This paper formulates and shows how to solve the problem of selecting the cache size and depth of cache pipelining that maximize the performance of a given instruction-set architecture. The solution combines trace-driven architectural simulations and the timing analysis of the physical implementation of the cache. Increasing cache size tends to improve performance, but this improvement is limited because cache access time increases with its size. This trade-off results in an optimization problem we refer to as multilevel optimization, because it requires the simultaneous consideration of two levels of machine abstraction: the architectural level and the physical implementation level. The introduction of pipelining permits the use of larger caches without increasing their apparent access time; however, the bubbles caused by load and branch delays limit this technique. In this paper we also show how multilevel optimization can be applied to pipelined systems if software- and hardware-b...

