Exploiting hardware performance counters with flow and context sensitive profiling
- ACM SIGPLAN Notices, 1997
Cited by 189 (9 self)
A program profile attributes run-time costs to portions of a program's execution. Most profiling systems suffer from two major deficiencies: first, they only apportion simple metrics, such as execution frequency or elapsed time, to static, syntactic units, such as procedures or statements; second, they aggressively reduce the volume of information collected and reported, although aggregation can hide striking differences in program behavior. This paper addresses both concerns by exploiting the hardware counters available in most modern processors and by incorporating two concepts from data flow analysis (flow and context sensitivity) to report more context for measurements. This paper extends our previous work on efficient path profiling to flow sensitive profiling, which associates hardware performance metrics with a path through a procedure. In addition, it describes a data structure, the calling context tree, that efficiently captures calling contexts for procedure-level measurements. Our measurements show that the SPEC95 benchmarks execute a small number (3-28) of hot paths that account for 9-98% of their L1 data cache misses. Moreover, these hot paths are concentrated in a few routines, which have complex dynamic behavior.
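As a rough illustration of the calling context tree named in the abstract above, the following Python sketch keeps one node per distinct chain of callers and charges a metric to the current node. The class names, the `enter`/`charge`/`leave` interface, and the miss counts are assumptions made for this example, not the paper's implementation; recursion is not treated specially here.

```python
from dataclasses import dataclass, field


@dataclass
class CCTNode:
    """One calling context: a procedure reached through a specific chain of callers."""
    procedure: str
    metric: int = 0                               # e.g. cache misses charged to this context
    children: dict = field(default_factory=dict)  # callee name -> CCTNode

    def child(self, procedure: str) -> "CCTNode":
        # Reuse the existing child so repeated calls from the same context share a node.
        if procedure not in self.children:
            self.children[procedure] = CCTNode(procedure)
        return self.children[procedure]


class CallingContextTree:
    """Hypothetical interface: call enter/leave around each procedure activation."""

    def __init__(self) -> None:
        self.root = CCTNode("<root>")
        self._stack = [self.root]

    def enter(self, procedure: str) -> None:
        self._stack.append(self._stack[-1].child(procedure))

    def charge(self, amount: int) -> None:
        self._stack[-1].metric += amount

    def leave(self) -> None:
        self._stack.pop()


# Made-up measurements: `lookup` is called from two different contexts.
cct = CallingContextTree()
for caller, misses in (("parse", 5), ("render", 40), ("parse", 7)):
    cct.enter("main")
    cct.enter(caller)
    cct.enter("lookup")
    cct.charge(misses)
    for _ in range(3):
        cct.leave()

main = cct.root.children["main"]
print(main.children["parse"].children["lookup"].metric)   # prints 12
print(main.children["render"].children["lookup"].metric)  # prints 40
```

A flat per-procedure profile would report 52 misses for `lookup`; keeping the contexts apart shows that most of them arise under `render`.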
Optimization of Instruction Fetch Mechanisms for High Issue Rates
- In 22nd Annual International Symposium on Computer Architecture, 1995
Cited by 115 (4 self)
Recent superscalar processors issue four instructions per cycle. These processors are also powered by highly-parallel superscalar cores. The potential performance can only be exploited when fed by high instruction bandwidth. This task is the responsibility of the instruction fetch unit. Accurate branch prediction and low I-cache miss ratios are essential for the efficient operation of the fetch unit. Several studies on cache design and branch prediction address this problem. However, these techniques are not sufficient. Even in the presence of efficient cache designs and branch prediction, the fetch unit must continuously extract multiple, non-sequential instructions from the instruction cache, realign these in the proper order, and supply them to the decoder. This paper explores solutions to this problem and presents several schemes with varying degrees of performance and cost. The most-general scheme, the collapsing buffer, achieves near-perfect performance and consistently aligns in...
Adaptive Optimization for SELF: Reconciling High Performance with Exploratory Programming
- 1994
Cited by 95 (6 self)
Object-oriented programming languages confer many benefits, including abstraction, which lets the programmer hide the details of an object’s implementation from the object’s clients. Unfortunately, crossing abstraction boundaries often incurs a substantial run-time overhead in the form of frequent procedure calls. Thus, pervasive use of abstraction, while desirable from a design standpoint, may be impractical when it leads to inefficient programs.
Aggressive compiler optimizations can reduce the overhead of abstraction. However, the long compilation times introduced by optimizing compilers delay the programming environment’s responses to changes in the program. Furthermore, optimization also conflicts with source-level debugging. Thus, programmers are caught on the horns of two dilemmas: they have to choose between abstraction and efficiency, and between responsive programming environments and efficiency. This dissertation shows how to reconcile these seemingly contradictory goals by performing optimizations lazily.
Four new techniques work together to achieve high performance and high responsiveness:
• Type feedback achieves high performance by allowing the compiler to inline message sends based on information extracted from the runtime system. On average, programs run 1.5 times faster than the previous SELF system; compared to a commercial Smalltalk implementation, two medium-sized benchmarks run about three times faster. This level of performance is obtained with a compiler that is both simpler and faster than previous SELF compilers.
• Adaptive optimization achieves high responsiveness without sacrificing performance by using a fast nonoptimizing compiler to generate initial code while automatically recompiling heavily used parts of the program with an optimizing compiler. On a previous-generation workstation like the SPARCstation-2, fewer than 200 pauses exceeded 200 ms during a 50-minute interaction, and 21 pauses exceeded one second. On a current-generation workstation, only 13 pauses exceed 400 ms.
• Dynamic deoptimization shields the programmer from the complexity of debugging optimized code by transparently recreating non-optimized code as needed. No matter whether a program is optimized or not, it can always be stopped, inspected, and single-stepped. Compared to previous approaches, deoptimization allows more debugging while placing fewer restrictions on the optimizations that can be performed.
• Polymorphic inline caching generates type-case sequences on-the-fly to speed up messages sent from the same call site to several different types of object. More significantly, the inline caches collect concrete type information for the optimizing compiler.
With better performance yet good interactive behavior, these techniques make exploratory programming possible both for pure object-oriented languages and for application domains requiring higher ultimate performance, reconciling exploratory programming, ubiquitous abstraction, and high performance.
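To make the polymorphic inline caching idea from the abstract above more concrete, here is a small Python sketch of a per-call-site cache. It is only an analogy: a dictionary stands in for the generated type-case sequence, and the class `PolymorphicInlineCache`, its methods, and the `Circle`/`Square` receivers are hypothetical names invented for this example rather than anything taken from the SELF system.

```python
class PolymorphicInlineCache:
    """Per-call-site cache mapping receiver type -> resolved method (illustrative only)."""

    def __init__(self, selector: str) -> None:
        self.selector = selector
        self.entries = {}   # receiver type -> function; stands in for the type-case sequence
        self.misses = 0

    def send(self, receiver, *args):
        receiver_type = type(receiver)
        method = self.entries.get(receiver_type)
        if method is None:
            # Miss: perform the full lookup once, then extend the cache for this type.
            self.misses += 1
            method = getattr(receiver_type, self.selector)
            self.entries[receiver_type] = method
        return method(receiver, *args)

    def observed_types(self):
        # The concrete type profile an optimizing compiler could consult when inlining.
        return [t.__name__ for t in self.entries]


class Circle:
    def __init__(self, r):
        self.r = r

    def area(self):
        return 3 * self.r * self.r   # integer approximation keeps the output exact


class Square:
    def __init__(self, s):
        self.s = s

    def area(self):
        return self.s * self.s


site = PolymorphicInlineCache("area")
areas = [site.send(shape) for shape in (Circle(1), Square(2), Circle(2), Square(3))]
print(areas, site.misses, site.observed_types())
# prints [3, 4, 12, 9] 2 ['Circle', 'Square']
```

Only the first send for each receiver type pays for a lookup, and the recorded types are the kind of information the abstract says is passed on to the optimizing compiler.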
Static Branch Frequency and Program Profile Analysis
- In 27th International Symposium on Microarchitecture, 1994
Cited by 62 (1 self)
Program profiles identify frequently executed portions of a program, which are the places at which optimizations offer programmers and compilers the greatest benefit. Compilers, however, infrequently exploit program profiles, because profiling a program requires a programmer to instrument and run the program. An attractive alternative is for the compiler to statically estimate program profiles. This paper presents several new techniques for static branch prediction and profiling. The first technique combines multiple predictions of a branch's outcome into a prediction of the probability that the branch is taken. Another technique uses these predictions to estimate the relative execution frequency (i.e., profile) of basic blocks and control-flow edges within a procedure. A third algorithm uses local frequency estimates to predict the global frequency of calls, procedure invocations, and basic block and control-flow edge executions. Experiments on the SPEC92 integer benchmarks and Uni...
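The first two techniques in the abstract above can be sketched as follows. The abstract does not name the rule used to combine predictions, so this example assumes a Dempster-Shafer style combination; the heuristic probabilities (4/5 and 3/4), the function names, and the tiny acyclic control-flow graph are all made up for illustration.

```python
from fractions import Fraction
from functools import reduce


def combine(p1, p2):
    """Assumed Dempster-Shafer combination of two taken-probabilities for one branch.

    Undefined (division by zero) when one input is 0 and the other is 1.
    """
    agree_taken = p1 * p2
    agree_not_taken = (1 - p1) * (1 - p2)
    return agree_taken / (agree_taken + agree_not_taken)


def branch_probability(predictions):
    # Start from 1/2 (no evidence); combining with 1/2 leaves a prediction unchanged.
    return reduce(combine, predictions, Fraction(1, 2))


def block_frequencies(edges, entry, order):
    """Propagate relative frequencies through an acyclic CFG.

    edges maps (src, dst) -> edge probability; order is a topological order of blocks.
    """
    freq = {block: Fraction(0) for block in order}
    freq[entry] = Fraction(1)
    for block in order:
        for (src, dst), prob in edges.items():
            if src == block:
                freq[dst] += freq[block] * prob
    return freq


# Two hypothetical heuristics both say "taken", with different confidence.
p_taken = branch_probability([Fraction(4, 5), Fraction(3, 4)])
edges = {
    ("entry", "then"): p_taken,
    ("entry", "else"): 1 - p_taken,
    ("then", "exit"): Fraction(1),
    ("else", "exit"): Fraction(1),
}
freq = block_frequencies(edges, "entry", ["entry", "then", "else", "exit"])
print(p_taken, freq["then"], freq["else"], freq["exit"])   # prints 12/13 12/13 1/13 1
```

Two agreeing predictions reinforce each other (12/13 is higher than either input), and the resulting edge probabilities give relative block frequencies without running the program. Loops would need extra handling that this sketch omits.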
Software profiling for hot path prediction: less is more
- SIGPLAN Not
Cited by 61 (0 self)
Recently, there has been a growing interest in exploiting profile information in adaptive systems such as just-in-time compilers, dynamic optimizers, and binary translators. In this paper, we show that sophisticated software profiling schemes that provide highly accurate information in an offline setting are ill-suited for these dynamic code generation systems. We experimentally demonstrate that hot path predictions must be made early in order to control the rising cost of missed opportunities that results from the prediction delay. We also show that existing sophisticated path profiling schemes, if used in an online setting, offer no prediction advantages over simpler schemes that exhibit much lower runtime overheads. Based on these observations, we developed a new low-overhead software profiling scheme for hot path prediction. Using an abstract metric, we compare our scheme to path profile based prediction and show that our scheme achieves comparable prediction quality. In our second set of experiments, we include runtime overhead and evaluate the performance of our scheme in a realistic application: Dynamo, a dynamic optimization system. The results show that our prediction scheme clearly outperforms path profile based prediction and thus confirm that less profiling, as exhibited in our scheme, will actually lead to more effective hot path prediction.
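The abstract above does not spell out how the low-overhead scheme works, so the following Python sketch is only a hypothetical illustration of the "less is more" idea: keep a single counter per path head instead of one per path, and once a head becomes hot, predict whichever path is executing at that moment. The class name, the threshold of 5, and the trace are assumptions made for this example.

```python
from collections import Counter


class HeadCounterPredictor:
    """Counts path heads only; predicts the currently executing path once a head is hot."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.head_counts = Counter()
        self.predicted = {}   # head -> path predicted to be hot

    def observe(self, path) -> None:
        head = path[0]
        if head in self.predicted:
            return   # a prediction was already made; no further profiling for this head
        self.head_counts[head] += 1
        if self.head_counts[head] >= self.threshold:
            # Speculate that the path executing right now is the hot one.
            self.predicted[head] = tuple(path)


# Made-up trace: the path A-B-D dominates, A-C-D is rare.
trace = [("A", "B", "D")] * 3 + [("A", "C", "D")] + [("A", "B", "D")] * 2
predictor = HeadCounterPredictor(threshold=5)
for executed_path in trace:
    predictor.observe(executed_path)
print(predictor.predicted)   # prints {'A': ('A', 'B', 'D')}
```

The sketch stores one counter per head regardless of how many distinct paths start there, which is where the lower overhead would come from. The price is that a rare path could be selected if it happens to run when the threshold is reached.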
Accurate Static Branch Prediction by Value Range Propagation
- 1995
Cited by 61 (0 self)
The ability to predict at compile time the likelihood of a particular branch being taken provides valuable information for several optimizations, including global instruction scheduling, code layout, function inlining, interprocedural register allocation and many high level optimizations. Previous attempts at static branch prediction have either used simple heuristics, which can be quite inaccurate, or put the burden onto the programmer by using execution profiling data or source code hints. This paper presents a new approach to static branch prediction called value range propagation. This method tracks the weighted value ranges of variables through a program, much like constant propagation. These value ranges may be either numeric or symbolic in nature. Branch prediction is then performed by simply consulting the value range of the appropriate variable. Heuristics are used as a fallback for cases where the value range of the variable cannot be determined statically. In the process, va...
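As a hedged illustration of predicting a branch by consulting a weighted value range, as the abstract above describes, the Python sketch below represents a variable as a list of `(weight, lo, hi)` integer ranges. The representation, the function names, the example ranges, and the assumption that values are uniformly distributed within each range are all choices made for this example, not details taken from the paper.

```python
from fractions import Fraction


def prob_less_than(weighted_ranges, bound):
    """Estimate P(x < bound) for x given as [(weight, lo, hi)] with inclusive integer ranges.

    Assumes values are uniformly distributed inside each range.
    """
    total = Fraction(0)
    for weight, lo, hi in weighted_ranges:
        size = hi - lo + 1
        below = min(max(bound - lo, 0), size)   # how many values in [lo, hi] are < bound
        total += weight * Fraction(below, size)
    return total


def add_constant(weighted_ranges, k):
    # Propagating through `y = x + k` shifts every range and keeps its weight.
    return [(w, lo + k, hi + k) for w, lo, hi in weighted_ranges]


# Hypothetical facts: i lies in [0, 9] with weight 3/4 and in [10, 19] with weight 1/4.
i_ranges = [(Fraction(3, 4), 0, 9), (Fraction(1, 4), 10, 19)]
j_ranges = add_constant(i_ranges, 5)     # j = i + 5
print(prob_less_than(j_ranges, 12))      # estimated probability that `j < 12`; prints 21/40
```

When a range is unknown, a compiler following the abstract's description would fall back to heuristics; this sketch leaves that case out.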
SPAID: Software Prefetching in Pointer- and Call-Intensive Environments
- In Proceedings of the 28th Annual International Symposium on Microarchitecture, 1995
Cited by 58 (3 self)
Software prefetching, typically in the context of numeric or loop-intensive benchmarks, has been proposed as one remedy for the performance bottleneck imposed on computer systems by the cost of servicing cache misses. This paper proposes a new heuristic--SPAID--for utilizing prefetch instructions in pointer- and call-intensive environments. We use trace-driven cache simulation of a number of pointer- and call-intensive benchmarks to evaluate the benefits and implementation trade-offs of SPAID. Our results indicate that a significant proportion of the cost of data cache misses can be eliminated or reduced with SPAID without unduly increasing memory traffic. 1. Introduction It is well known that processor clock speeds are increasing exponentially over time, while memory speeds are not increasing nearly as rapidly [RD94]. The computing industry has reached the point where system performance is dominated by the cost of servicing cache misses. To address this problem, several instruction s...
Superblock formation using static program analysis
- In Proceedings of the 26th Annual IEEE/ACM International Symposium on Microarchitecture (Micro-26), 1993
Cited by 46 (9 self)
Compile-time code transformations which expose instruction-level parallelism (ILP) typically take into account the constraints imposed by all execution scenarios in the program. However, there are additional opportunities to increase ILP along some execution sequences if the constraints from alternative execution sequences can be ignored. Traditionally, profile information has been used to identify important execution sequences for aggressive compiler optimization and scheduling. This paper presents a set of static program analysis heuristics used in the IMPACT compiler to identify execution sequences for aggressive optimization. We show that the static program analysis heuristics identify execution sequences without hazardous conditions that tend to prohibit compiler optimizations. As a result, the static program analysis approach often achieves optimization results comparable to profile information in spite of its inferior branch prediction accuracies. This observation makes a strong case for using static program analysis with or without profile information to facilitate aggressive compiler optimization and scheduling.
Online Feedback-Directed Optimization of Java
- 2002
Cited by 45 (3 self)
This paper describes the implementation of an online feedback-directed optimization system. The system is fully automatic; it requires no prior...
The Importance of Prepass Code Scheduling for Superscalar and Superpipelined Processors
- IEEE Transactions on Computers, 1994
Cited by 39 (6 self)
Superscalar and superpipelined processors utilize parallelism to achieve peak performance that can be several times higher than that of conventional scalar processors. In order for this potential to be translated into the speedup of real programs, the compiler must be able to schedule instructions so that the parallel hardware is effectively utilized. Previous work has shown that prepass code scheduling helps to produce a better schedule for scientific programs. But the importance of prescheduling has never been demonstrated for control-intensive non-numeric programs. These programs are significantly different from the scientific programs because they contain frequent branches. The compiler must do global scheduling in order to find enough independent instructions. In this paper, the code optimizer and scheduler of the IMPACT-I C compiler is described. Within this framework, we study the importance of prepass code scheduling for a set of production C programs. It is shown that, in cont...

