Results 1 - 10
of
79
Instruction-Level Parallel Processing: History, Overview and Perspective
, 1992
"... Instruction-level Parallelism CILP) is a family of processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel. Although ILP has appeared in the highest performance uniprocessors for the past 30 years, the 1980s saw it become a muc ..."
Abstract
-
Cited by 166 (0 self)
- Add to MetaCart
Instruction-level Parallelism CILP) is a family of processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel. Although ILP has appeared in the highest performance uniprocessors for the past 30 years, the 1980s saw it become a much more significant force in computer design. Several systems were built, and sold commercially, which pushed ILP far beyond where it had been before, both in terms of the amount of ILP offered and in the central role ILP played in the design of the system. By the end of the decade, advanced microprocessor design at all major CPU manufacturers had incorporated ILP, and new techniques for ILP have become a popular topic at academic conferences. This article provides an overview and historical perspective of the field of ILP and its development over the past three decades.
Adaptive Optimization in the Jalapeno JVM
- In ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA
, 2000
"... (*58()9$"2#$:0/,;58(03<10/2,>=?33@">"29 #A:0*/,B58(*C2"258/052,D3*>#$,,6-*0'/ 58@F,058*,+HG?!"*0"I"252J58K0/ ,6-*0'/ 030"6N*IO40"58DP)"58QF,058SRUT6252,D<0!2T6252,V52!8("9 "W5X3,06*9E,'Y58(*03C:0'/ X3,06*9E,'Y58(*03C 1622 *'\,20/2XD3Q#$,U-0/269EU,/52,X"58QF,0'58,+ I,2/2-K58X^528-3L2T6252,_0/252/,5 ..."
Abstract
-
Cited by 149 (10 self)
- Add to MetaCart
(*58()9$"2#$:0/,;58(03<10/2,>=?33@">"29 #A:0*/,B58(*C2"258/052,D3*>#$,,6-*0'/ 58@F,058*,+HG?!"*0"I"252J58K0/ ,6-*0'/ 030"6N*IO40"58DP)"58QF,058SRUT6252,D<0!2T6252,V52!8("9 "W5X3,06*9E,'Y58(*03C:0'/ X3,06*9E,'Y58(*03C 1622 *'\,20/2XD3Q#$,U-0/269EU,/52,X"58QF,0'58,+ I,2/2-K58X^528-3L2T6252,_0/252/,58('4-*0'2,Y 0C#$,058Z#>58,0@=`58a02T/2*(*C/,':b(/,058c+ \",25C0d@"3,152058[#;58!*03e0/252,/58( 5805f8(""52<00"58>b(3589$3,3*"*58QF058C-02,;"(3T Y2520'58258/,03@20'Q"3+ ] D,Q"...
SPIRAL: Code Generation for DSP Transforms
- PROCEEDINGS OF THE IEEE SPECIAL ISSUE ON PROGRAM GENERATION, OPTIMIZATION, AND ADAPTATION
, 2005
"... Abstract — Fast changing, increasingly complex, and diverse computing platforms pose central problems in scientific computing: How to achieve, with reasonable effort, portable optimal performance? We present SPIRAL that considers this problem for the performance-critical domain of linear digital sig ..."
Abstract
-
Cited by 96 (25 self)
- Add to MetaCart
Abstract — Fast changing, increasingly complex, and diverse computing platforms pose central problems in scientific computing: How to achieve, with reasonable effort, portable optimal performance? We present SPIRAL that considers this problem for the performance-critical domain of linear digital signal processing (DSP) transforms. For a specified transform, SPIRAL automatically generates high performance code that is tuned to the given platform. SPIRAL formulates the tuning as an optimization problem, and exploits the domain-specific mathematical structure of transform algorithms to implement a feedback-driven optimizer. Similar to a human expert, for a specified transform, SPIRAL “intelligently ” generates and explores algorithmic and implementation choices to find the best match to the computer’s microarchitecture. The “intelligence” is provided by search and learning techniques that exploit the structure of the algorithm and implementation space to guide the exploration and optimization. SPIRAL generates high performance code for a broad set of DSP transforms including the discrete Fourier transform, other trigonometric transforms, filter transforms, and discrete wavelet transforms. Experimental results show that the code generated by SPIRAL competes with, and sometimes outperforms, the best available human tuned transform library code. Index Terms — library generation, code optimization, adaptation, automatic performance tuning, high performance computing, linear signal transform, discrete Fourier transform, FFT, discrete cosine transform, wavelet, filter, search, learning, genetic and evolutionary algorithm, Markov decision process I.
ADAPTIVE OPTIMIZATION FOR SELF: RECONCILING HIGH PERFORMANCE WITH EXPLORATORY PROGRAMMING
, 1994
"... Object-oriented programming languages confer many benefits, including abstraction, which lets the programmer hide
the details of an object’s implementation from the object’s clients. Unfortunately, crossing abstraction boundaries
often incurs a substantial run-time overhead in the form of frequent p ..."
Abstract
-
Cited by 95 (6 self)
- Add to MetaCart
Object-oriented programming languages confer many benefits, including abstraction, which lets the programmer hide
the details of an object’s implementation from the object’s clients. Unfortunately, crossing abstraction boundaries
often incurs a substantial run-time overhead in the form of frequent procedure calls. Thus, pervasive use of abstraction,
while desirable from a design standpoint, may be impractical when it leads to inefficient programs.
Aggressive compiler optimizations can reduce the overhead of abstraction. However, the long compilation times
introduced by optimizing compilers delay the programming environment‘s responses to changes in the program.
Furthermore, optimization also conflicts with source-level debugging. Thus, programmers are caught on the horns of
two dilemmas: they have to choose between abstraction and efficiency, and between responsive programming environments
and efficiency. This dissertation shows how to reconcile these seemingly contradictory goals by performing
optimizations lazily.
Four new techniques work together to achieve high performance and high responsiveness:
• Type feedback achieves high performance by allowing the compiler to inline message sends based on information
extracted from the runtime system. On average, programs run 1.5 times faster than the previous SELF system;
compared to a commercial Smalltalk implementation, two medium-sized benchmarks run about three times faster.
This level of performance is obtained with a compiler that is both simpler and faster than previous SELF compilers.
• Adaptive optimization achieves high responsiveness without sacrificing performance by using a fast nonoptimizing
compiler to generate initial code while automatically recompiling heavily used parts of the program
with an optimizing compiler. On a previous-generation workstation like the SPARCstation-2, fewer than 200
pauses exceeded 200 ms during a 50-minute interaction, and 21 pauses exceeded one second. On a currentgeneration
workstation, only 13 pauses exceed 400 ms.
• Dynamic deoptimization shields the programmer from the complexity of debugging optimized code by
transparently recreating non-optimized code as needed. No matter whether a program is optimized or not, it can
always be stopped, inspected, and single-stepped. Compared to previous approaches, deoptimization allows more
debugging while placing fewer restrictions on the optimizations that can be performed.
• Polymorphic inline caching generates type-case sequences on-the-fly to speed up messages sent from the same
call site to several different types of object. More significantly, they collect concrete type information for the
optimizing compiler.
With better performance yet good interactive behavior, these techniques make exploratory programming possible
both for pure object-oriented languages and for application domains requiring higher ultimate performance, reconciling
exploratory programming, ubiquitous abstraction, and high performance.
Marmot: An Optimizing Compiler for Java
, 1998
"... The Marmot system is a research platform for studying the implementation of high level programming languages. It currently comprises an optimizing native-code compiler, runtime system, and libraries for a large subset of Java. Marmot integrates well-known representation, optimization, code generat ..."
Abstract
-
Cited by 63 (6 self)
- Add to MetaCart
The Marmot system is a research platform for studying the implementation of high level programming languages. It currently comprises an optimizing native-code compiler, runtime system, and libraries for a large subset of Java. Marmot integrates well-known representation, optimization, code generation, and runtime techniques with a few Java-specific features to achieve competitive performance. This paper contains a description of the Marmot system design, along with highlights of our experience applying and adapting traditional implementation techniques to Java. A detailed performance evaluation assesses both Marmot's overall performance relative to other Java and C++ implementations and the relative costs of various Java language features in Marmot-compiled code. Our experience with Marmot has demonstrated that well-known compilation techniques can produce very good performance for static Java applications---comparable or superior to other Java systems, and approaching that o...
Evidence-based Static Branch Prediction using Machine Learning
- ACM Transactions on Programming Languages and Systems
, 1996
"... Correctly predicting the direction that branches will take is increasingly important in today's wideissue computer architectures. The name program-based branch prediction is given to static branch prediction techniques that base their prediction on a program's structure. In this paper, we investigat ..."
Abstract
-
Cited by 56 (6 self)
- Add to MetaCart
Correctly predicting the direction that branches will take is increasingly important in today's wideissue computer architectures. The name program-based branch prediction is given to static branch prediction techniques that base their prediction on a program's structure. In this paper, we investigate a new approach to program-based branch prediction that uses a body of existing programs to predict the branch behavior in a new program. We call this approach to program-based branch prediction evidence-based static prediction, or ESP. The main idea of ESP is that the behavior of a corpus of programs can be used to infer the behavior of new programs. In this paper, we use neural networks and decision trees to map static features associated with each branch to the probability that the branch will be taken. ESP shows significant advantages over other prediction mechanisms. Specifically, it is a program-based technique, it is effective across a range of programming languages and programming s...
Reconciling responsiveness with performance in pure object-oriented languages
- ACM TRANSACTIONS ON PROGRAMMING LANGUAGES AND SYSTEMS
, 1996
"... Dynamically-dispatched calls often limit the performance of object-oriented programs since object-oriented programming encourages factoring code into small, reusable units, thereby increasing the frequency of these expensive operations. Frequent calls not only slow down execution with the dispatch o ..."
Abstract
-
Cited by 55 (0 self)
- Add to MetaCart
Dynamically-dispatched calls often limit the performance of object-oriented programs since object-oriented programming encourages factoring code into small, reusable units, thereby increasing the frequency of these expensive operations. Frequent calls not only slow down execution with the dispatch overhead per se, but more importantly they hinder optimization by limiting the range and effectiveness of standard global optimizations. In particular, dynamicallydispatched calls prevent standard interprocedural optimizations that depend on the availability of a static call graph. The SELF implementation described here offers two novel approaches to optimization. Type feedback speculatively inlines dynamically-dispatched calls based on profile information that predicts likely receiver classes. Adaptive optimization reconciles optimizing compilation with interactive performance by incrementally optimizing only the frequently-executed parts of a program. When combined, these two techniques result in a system that can execute programs significantly faster than previous systems while retaining much of the interactiveness of an interpreted system.
Eliminating virtual function calls in C++ programs
, 1996
"... We have designed and implemented an optimizing source-to-source C++ compiler that reduces the frequency of virtual function calls. This technical report describes our preliminary experience with this system. The prototype implementation demonstrates the value of OO-specific optimization of C++. Desp ..."
Abstract
-
Cited by 54 (0 self)
- Add to MetaCart
We have designed and implemented an optimizing source-to-source C++ compiler that reduces the frequency of virtual function calls. This technical report describes our preliminary experience with this system. The prototype implementation demonstrates the value of OO-specific optimization of C++. Despite some limitations of our system, and despite the low frequency of virtual function calls in some of the programs, optimization improves the performance of a suite of two small and six large C++ applications totalling over 90,000 lines of code by a median of 20% over the original programs and reduces the number of virtual function calls by a median factor of 5. For more call-intensive versions of the same programs, performance improved by a median of 40 % and the number of virtual calls dropped by a factor of 21. Our measurements indicate that inlining does not necessarily lead to large increases in code size, and that for most programs, the instruction cache miss ratio does not increase significantly.
Towards Better Inlining Decisions Using Inlining Trials
- In Proceedings of the 1994 ACM Conference on LISP and Functional Programming
, 1994
"... Inlining trials are a general mechanism for making better automatic decisions about whether a routine is profitable to inline. Unlike standard source-level inlining heuristics, an inlining trial captures the effects of optimizations applied to the body of the inlined routine when calculating the cos ..."
Abstract
-
Cited by 53 (6 self)
- Add to MetaCart
Inlining trials are a general mechanism for making better automatic decisions about whether a routine is profitable to inline. Unlike standard source-level inlining heuristics, an inlining trial captures the effects of optimizations applied to the body of the inlined routine when calculating the costs and benefits of inlining. The results of inlining trials are stored in a persistent database to be reused when making future inlining decisions at similar call sites. Type group analysis can determine the amount of available static information exploited during compilation, and the results of analyzing the compilation of an inlined routine help decide when a future call site would lead to substantially the same generated code as a given inlining trial. We have implemented inlining trials and type group analysis in an optimizing compiler for SELF, and by making wiser inlining decisions we were able to cut compilation time and compiled code space with virtually no loss of execution speed. We...

