Results 1 - 10
of
44
A Hardware-Driven Profiling Scheme for Identifying Program Hot Spots to Support Runtime Optimization
- In Proceedings of the 26th Annual International Symposium on Computer Architecture
, 1999
"... This paper presents a novel hardware-based approach for identifying, profiling, and monitoring hot spots in order to support runtime optimization of generalpurpose programs. The proposed approach consists of a set of tightly coupled hardware tables and control logic modules that are placed in the re ..."
Abstract
-
Cited by 68 (4 self)
- Add to MetaCart
This paper presents a novel hardware-based approach for identifying, profiling, and monitoring hot spots in order to support runtime optimization of generalpurpose programs. The proposed approach consists of a set of tightly coupled hardware tables and control logic modules that are placed in the retirement stage of a processor pipeline removed from the critical path. The features of the proposed design include rapid detection of program hot spots after changes in execution behavior, runtime-tunable selection criteria for hot spot detection, and negligible overhead during application execution. Experiments using several SPEC95 benchmarks, as well as several large WindowsNT applications, demonstrate the promise of the proposed design. 1 Introduction Optimizing compilers can gain significant performance benefits by performing code transformations based on a program's runtime profile. Traditionally, profiles are collected by running an instrumented version of the executable. However, bec...
FX!32 - A Profile-Directed Binary Translator
- IEEE Micro
, 1998
"... Because Digital s Alpha architecture provides the world's fastest processors, many applications, especially those requiring high processor performance, have been ported to it. However, many other applications are available only under the x86 architecture. We designed Digital FX!32 to make the comple ..."
Abstract
-
Cited by 66 (0 self)
- Add to MetaCart
Because Digital s Alpha architecture provides the world's fastest processors, many applications, especially those requiring high processor performance, have been ported to it. However, many other applications are available only under the x86 architecture. We designed Digital FX!32 to make the complete set of applications, both native and x86, available to Alpha. The goal for the software is to provide fast and transparent execution of x86 Win32 applications on Alpha systems. FX!32 achieves its goal by transparently running those applications at speeds comparable to high-performance x86 platforms. Digital FX!32 is a software utility that enables x86 Win32 applications to be run on Windows NT/Alpha platforms. Once FX!32 has been installed, almost all x86 applications can be run on Alpha without special commands and with excellent performance.
Interprocedural Conditional Branch Elimination
, 1997
"... The existence of statically detectable correlation among conditional branches enables their elimination, an optimization that has a number of benefits. This paper presents techniques to determine whether an interprocedural execution path leading to a conditional branch exists along which the branch ..."
Abstract
-
Cited by 66 (15 self)
- Add to MetaCart
The existence of statically detectable correlation among conditional branches enables their elimination, an optimization that has a number of benefits. This paper presents techniques to determine whether an interprocedural execution path leading to a conditional branch exists along which the branch outcome is known at compile time, and then to eliminate the branch along this path through code restructuring. The technique consists of a demand driven interprocedural analysis that determines whether a specific branch outcome is correlated with prior statements or branch outcomes. The optimization is performed using a code restructuring algorithm that replicates code to separate out the paths with correlation. When the correlated path is affected by a procedure call, the restructuring is based on procedure entry splitting and exit splitting. The entry splitting transformation creates multiple entries to a procedure, and the exit splitting transformation allows a procedure to return control...
Complete Removal of Redundant Expressions
, 1998
"... Partial redundancy elimination (PRE), the most important component of global optimizers, generalizes the removal of common subexpressions and loop-invariant computations. Because existing PRE implementations are based on code motion, they fail to completely remove the redundancies. In fact, we obser ..."
Abstract
-
Cited by 64 (13 self)
- Add to MetaCart
Partial redundancy elimination (PRE), the most important component of global optimizers, generalizes the removal of common subexpressions and loop-invariant computations. Because existing PRE implementations are based on code motion, they fail to completely remove the redundancies. In fact, we observed that 73% of loop-invariant statements cannot be eliminated from loops by code motion alone. In dynamic terms, traditional PRE eliminates only half of redundancies that are strictly partial. To achieve a complete PRE, control flow restructuring must be applied. However, the resulting code duplication may cause code size explosion. This paper focuses on achieving a complete PRE while incurring an acceptable code growth. First, we present an algorithm for complete removal of partial redundancies, based on the integration of code motion and control flow restructuring. In contrast to existing complete techniques, we resort to restructuring merely to remove obstacles to code motion, rather th...
Estimating Cache Misses and Locality Using Stack Distances
, 2003
"... Cache behavior modeling is an important part of modern optimizing compilers. In this paper we present a method to estimate the number of cache misses, at compile time, using a machine independent model based on stack algorithms. Our algorithm computes the stack histograms symbolically, using data de ..."
Abstract
-
Cited by 35 (0 self)
- Add to MetaCart
Cache behavior modeling is an important part of modern optimizing compilers. In this paper we present a method to estimate the number of cache misses, at compile time, using a machine independent model based on stack algorithms. Our algorithm computes the stack histograms symbolically, using data dependence distance vectors and is totally accurate when dependence distances are uniformly generated. The stack histogram models accurately fully associative caches with LRU replacement policy, and provides a very good approximation for set-associative caches and programs with non-constant dependence distances.
Instruction Path Coprocessors
- In Proceedings of the 27th Annual International Symposium on Computer Architecture
, 2000
"... This paper presents the concept of an Instruction Path Coprocessor (I-COP), which is a programmable on-chip coprocessor, with its own instruction set, that operates on the core processor's instructions to transform them into an internal format that can be more efficiently executed. It is located ..."
Abstract
-
Cited by 34 (2 self)
- Add to MetaCart
This paper presents the concept of an Instruction Path Coprocessor (I-COP), which is a programmable on-chip coprocessor, with its own instruction set, that operates on the core processor's instructions to transform them into an internal format that can be more efficiently executed. It is located off the critical path of the core processor to ensure that it does not negatively impact the core processor's cycle time. An I-COP is highly versatile and can be used to implement different types of instruction transformations to enhance the IPC of the core processor. We study four potential applications of the I-COP to demonstrate the feasibility of this concept and investigate the design issues of such a coprocessor. A prototype instruction set for the I-COP is presented along with an implementation framework that facilitates achieving high I-COP performance. Initial results indicate that the I-COP is able to efficiently implement the trace cache fill unit, register move optimizatio...
A framework for unrestricted whole-program optimization
- In ACM SIGPLAN 2006 Conference on Programming Language Design and Implementation
, 2006
"... Procedures have long been the basic units of compilation in conventional optimization frameworks. However, procedures are typically formed to serve software engineering rather than optimization goals, arbitrarily constraining code transformations. Techniques, such as aggressive inlining and interpro ..."
Abstract
-
Cited by 24 (8 self)
- Add to MetaCart
Procedures have long been the basic units of compilation in conventional optimization frameworks. However, procedures are typically formed to serve software engineering rather than optimization goals, arbitrarily constraining code transformations. Techniques, such as aggressive inlining and interprocedural optimization, have been developed to alleviate this problem, but, due to code growth and compile time issues, these can be applied only sparingly. This paper introduces the Procedure Boundary Elimination (PBE) compilation framework, which allows unrestricted whole-program optimization. PBE allows all intra-procedural optimizations and analyses to operate on arbitrary subgraphs of the program, regardless of the original procedure boundaries and without resorting to inlining. In order to control compilation time, PBE also introduces novel extensions of region formation and encapsulation. PBE enables targeted code specialization, which recovers the specialization benefits of inlining while keeping code growth in check. This paper shows that PBE attains better performance than inlining with half the code growth.
Path-based Compilation
, 1998
"... Many compilers use profiles of programs to direct the focus and degree of performance optimizations. Profiles are statistics from program runs, usually collected at individual points in the program text, e.g., branches, call sites, or memory accesses. But optimizations based on individual sample poi ..."
Abstract
-
Cited by 21 (2 self)
- Add to MetaCart
Many compilers use profiles of programs to direct the focus and degree of performance optimizations. Profiles are statistics from program runs, usually collected at individual points in the program text, e.g., branches, call sites, or memory accesses. But optimizations based on individual sample points in the program miss an important detail of program behavior: how pieces of the program relate to each other dynamically. A path profile collects statistics over paths (sequences of points) in the program, linking the statistics to the dynamic behavior. By instrumenting and collecting path profiles through a program, we can exploit this dynamic behavior, improving performance more than point profiling techniques have allowed. This thesis shows how to collect path profiles efficiently, then applies the path profiles to two optimizations, static correlated branch prediction and path-based superblock scheduling. These two optimizations address different performance aspects of modern machine...
A region-based compilation technique for a Java just-in-time compiler
- In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation
, 2003
"... Method inlining and data flow analysis are two major optimization components for effective program transformations, however they often suffer from the existence of rarely or never executed code contained in the target method. One major problem lies in the assumption that the compilation unit is part ..."
Abstract
-
Cited by 15 (1 self)
- Add to MetaCart
Method inlining and data flow analysis are two major optimization components for effective program transformations, however they often suffer from the existence of rarely or never executed code contained in the target method. One major problem lies in the assumption that the compilation unit is partitioned at method boundaries. This paper describes the design and implementation of a region-based compilation technique in our dynamic compilation system, in which the compiled regions are selected as code portions without rarely executed code. The key part of this technique is the region selection, partial inlining, and region exit handling. For region selection, we employ both static heuristics and dynamic profiles to identify rare sections of code. The region selection process and method inlining decision are interwoven, so that method inlining exposes other targets for region selection, while the region selection in the inline target conserves the inlining budget, leading to more method inlining. Thus the inlining process can be performed for parts of a method, not for the entire body of the method. When the program attempts to exit from a region boundary, we trigger recompilation and then rely on on-stack replacement to continue the execution from the corresponding entry point in the recompiled code. We have implemented these techniques in our Java JIT compiler, and conducted a comprehensive evaluation. The experimental results show that the approach of region-based compilation achieves approximately 5 % performance improvement on average, while reducing the compilation overhead by 20 to 30%, in comparison to the traditional functionbased compilation techniques.
A Study of Exception Handling and Its Dynamic Optimization in Java
- In Proceedings of ACM SIGPLAN Conference on Object-oriented Programing Systems, Languages and Applications (OOPSLA’01
, 2001
"... Optimizing exception handling is critical for programs that frequently throw exceptions. We observed that there are many such exception-intensive programs in various categories of Java programs. There are two commonly used exception handling techniques, stack unwinding and stack cutting. Stack unwi ..."
Abstract
-
Cited by 13 (3 self)
- Add to MetaCart
Optimizing exception handling is critical for programs that frequently throw exceptions. We observed that there are many such exception-intensive programs in various categories of Java programs. There are two commonly used exception handling techniques, stack unwinding and stack cutting. Stack unwinding optimizes the normal path, while stack cutting optimizes the exception handling path. However, there has been no single exception handling technique to optimize both paths. We propose a new technique called Exception-Directed Optimization (Edo), which optimizes exception-intensive programs without slowing down exception-minimal programs. Edo, a feedbackdirected dynamic optimization, consists of three steps, exception path profiling, exception path inlining, and throw elimination. Exception path profiling attempts to detect hot exception paths. Exception path inlining compiles the catching method in a hot exception path, inlining the rest of methods in the path. Throw elimination replaces a throw with the explicit control flow to the corresponding catch. We implemented Edo in IBM's production Justin -Time compiler, and obtained the experimental results, which show that, in SPECjvm98, it improved performance of exceptionintensive programs by up to 18.3% without a#ecting performance of exception-minimal programs at all. Categories and Subject Descriptors D.3 [Software]: Programming Languages; D.3.4 [Programming Languages]: Processors---incremental compilers, optimization, runtime environment General Terms Performance, Experimentation, Languages Keywords Feedback-directed dynamic optimization, dynamic compilers, exception handling, inlining Copyright c # 2001 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or a...

