Results 1 - 10 of 43
Fine-grained dynamic instrumentation of commodity operating system kernels
, 1999
Cited by 107 (5 self)
We have developed a technology, fine-grained dynamic instrumentation of commodity kernels, which can splice (insert) dynamically generated code before almost any machine code instruction of a completely unmodified running commodity operating system kernel. This technology is well-suited to performance profiling, debugging, code coverage, security auditing, runtime code optimizations, and kernel extensions. We have designed and implemented a tool called KernInst that performs dynamic instrumentation on a stock production Solaris kernel running on an UltraSPARC. On top of KernInst, we have implemented a kernel performance profiling tool, and used it to understand kernel and application performance under a Web proxy server workload. We used this information to make two changes (one to the kernel, one to the proxy) that cumulatively reduce the percentage of elapsed time that the proxy spends opening disk cache files from 40% to 7%.
Evaluating Iterative Compilation
, 2002
Cited by 43 (10 self)
This paper describes a platform independent optimisation approach based on feedback-directed program restructuring. We have developed two strategies that search the optimisation space by means of profiling to find the best possible program variant. These strategies have no a priori knowledge of the target machine and can be run on any platform. In this paper our approach is evaluated on three full SPEC benchmarks, rather than the kernels evaluated in earlier studies where the optimisation space is relatively small. This approach was evaluated on six different platforms, where it is shown that we obtain on average a 20.5% reduction in execution time compared to the native compiler with full optimisation. By using training data instead of reference data for the search procedure, we can reduce compilation time and still give on average a 16.5% reduction in time when running on reference data. We show that our approach is able to give similar significant reductions in execution time over a state of the art high level restructurer based on static analysis and a platform specific profile feedback directed compiler that employs the same transformations as our iterative system.
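The idea of searching an optimisation space by profiling can be illustrated with a minimal sketch. It is not either of the two strategies the paper develops (the abstract does not describe them); it is a plain random search under assumed names. The parameter space (`tile`, `unroll`), the function `iterative_search`, and the synthetic cost function are all hypothetical stand-ins; a real system would compile each variant and time an actual run.

```python
import itertools
import random


def iterative_search(space, measure, budget, seed=0):
    """Evaluate up to `budget` randomly chosen variants; return (best, cost).

    `space` maps a transformation parameter name to its candidate values.
    `measure` returns the observed cost of one variant (lower is better).
    """
    rng = random.Random(seed)
    candidates = list(itertools.product(*space.values()))
    rng.shuffle(candidates)
    best, best_cost = None, float("inf")
    for values in candidates[:budget]:
        variant = dict(zip(space.keys(), values))
        cost = measure(variant)  # stand-in for "compile, run, and time"
        if cost < best_cost:
            best, best_cost = variant, cost
    return best, best_cost


# Hypothetical search space: loop tile sizes and unroll factors.
space = {"tile": [8, 16, 32, 64], "unroll": [1, 2, 4, 8]}


def synthetic_cost(variant):
    """Synthetic cost with its minimum at tile=32, unroll=4 (an assumption)."""
    return abs(variant["tile"] - 32) + abs(variant["unroll"] - 4)


best, cost = iterative_search(space, synthetic_cost, budget=16)
```

The search needs no model of the target machine, only the measured cost, which is what makes this style of approach portable across platforms. The `budget` parameter reflects the trade-off the abstract mentions between search time and the quality of the variant found.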
A dynamic optimization framework for a Java just-in-time compiler
, 2001
Cited by 42 (7 self)
The high performance implementation of Java Virtual Machines (JVM) and Just-In-Time (JIT) compilers is directed toward adaptive compilation optimizations on the basis of online runtime profile information. This paper describes the design and implementation of a dynamic optimization framework in a production-level Java JIT compiler. Our approach is to employ a mixed mode interpreter and a three level optimizing compiler, supporting quick, full, and special optimization, each of which has a different set of tradeoffs between compilation overhead and execution speed. A lightweight sampling profiler operates continuously during the entire program's execution. When necessary, detailed information on runtime behavior is collected by dynamically generating instrumentation code which can be installed to and uninstalled from the specified recompilation target code. Value profiling with this instrumentation mechanism allows fully automatic code specialization to be performed on the basis of specific parameter values or global data at the highest optimization level. The experimental results show that our approach offers high performance and a low code expansion ratio in both program startup and steady state measurements in comparison to the compile-only approach, and that the code specialization can also contribute modest performance improvements.
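A rough sketch of how sampling can drive promotion through the levels named in the abstract (interpreted, quick, full, special) is given below. The abstract does not say how promotion decisions are made, so the sample-count thresholds and the `TieredController` class are assumptions for illustration only; the sketch also ignores the value-profiling instrumentation that the paper uses for specialization.

```python
from collections import Counter

# Execution levels named in the abstract, lowest to highest.
LEVELS = ["interpreted", "quick", "full", "special"]
# Hypothetical sample counts needed to reach levels 1, 2, and 3.
THRESHOLDS = [2, 5, 10]


class TieredController:
    """Tracks profiler samples per method and decides when to promote."""

    def __init__(self):
        self.samples = Counter()
        self.level = {}  # method name -> index into LEVELS

    def on_sample(self, method):
        """Record one profiler sample and return the method's current level."""
        self.samples[method] += 1
        current = self.level.get(method, 0)
        if current < len(THRESHOLDS) and self.samples[method] >= THRESHOLDS[current]:
            # A real JIT would queue the method for recompilation here.
            self.level[method] = current + 1
        return LEVELS[self.level.get(method, 0)]


controller = TieredController()
history = [controller.on_sample("hot_method") for _ in range(10)]
```

Methods that are rarely sampled stay in the interpreter and never pay compilation overhead, while frequently sampled methods climb one level at a time.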
Quantifying the Impact of Input Data Sets on Program Behavior and its Applications
- Journal of Instruction-Level Parallelism
, 2003
Cited by 38 (15 self)
Having a representative workload of the target domain of a microprocessor is extremely important throughout its design. The composition of a workload involves two issues: (i) which benchmarks to select and (ii) which input data sets to select per benchmark. Unfortunately, it is impossible to select a huge number of benchmarks and respective input sets due to the large instruction counts per benchmark and due to limitations on the available simulation time. In this paper, we use statistical data analysis techniques such as principal components analysis (PCA) and cluster analysis to efficiently explore the workload space. Within this workload space, different input data sets for a given benchmark can be displayed, a distance can be measured between program-input pairs that gives us an idea about their mutual behavioral differences and representative input data sets can be selected for the given benchmark. This methodology is validated by showing that program-input pairs that are close to each other in this workload space indeed exhibit similar behavior. The final goal is to select a limited set of representative benchmark-input pairs that span the complete workload space. Next to workload composition, we discuss two other possible applications, namely getting insight in the impact of input data sets on program behavior and evaluating the representativeness of sampled traces.
LLVM: An Infrastructure for Multi-Stage Optimization
, 2002
Cited by 31 (6 self)
Modern programming languages and software engineering principles are causing increasing problems for compiler systems. Traditional approaches, which use a simple compile-link-execute model, are unable to provide adequate application performance under the demands of the new conditions. Traditional approaches to interprocedural and profile-driven compilation can provide the application performance needed, but require infeasible amounts of compilation time to build the application. This thesis presents LLVM, a design and implementation of a compiler infrastructure which supports a unique multi-stage optimization system. This system is designed to support extensive interprocedural and profile-driven optimizations, while being efficient enough for use in commercial compiler systems. The LLVM virtual instruction set is the glue that holds the system together. It is a low-level representation, but with high-level type information. This provides the benefits of a low-level representation (compact representation, wide variety of available transformations, etc.) as well as providing high-level information to support aggressive interprocedural optimizations at link- and post-link time. In particular, this system is designed to support optimization in the field, both at run-time and during otherwise unused idle time on the machine. This thesis also describes an implementation of this compiler design, the LLVM compiler infrastructure, proving that the design is feasible. The LLVM compiler infrastructure is a maturing and efficient system, which we show is a good host for a variety of research. More information about LLVM can be found on its web site at: http://llvm.cs.uiuc.edu/
A Survey of Adaptive Optimization in Virtual Machines
- PROCEEDINGS OF THE IEEE, 93(2), 2005. SPECIAL ISSUE ON PROGRAM GENERATION, OPTIMIZATION, AND ADAPTATION
, 2004
Cited by 26 (5 self)
Virtual machines face significant performance challenges beyond those confronted by traditional static optimizers. First, portable program representations and dynamic language features, such as dynamic class loading, force the deferral of most optimizations until runtime, inducing runtime optimization overhead. Second, modular
Code Cache Management Schemes for Dynamic Optimizers
, 2002
Cited by 25 (4 self)
A dynamic optimizer is a software-based system that performs code modifications at runtime, and several such systems have been proposed over the past several years. These systems typically perform optimization on the level of an instruction trace, and most use caching mechanisms to store recently optimized portions of code. Since the dynamic optimizers produce variable-length code traces that are modified copies of portions of the original executable, a code cache management scheme must deal with the difficult problem of caching objects that vary in size and cannot be subdivided without adding extra jump instructions. Because of these constraints, many dynamic optimizers have chosen unsophisticated schemes, such as flushing the entire cache when it becomes full. Flushing minimizes the overhead of cache management but tends to discard many useful traces. This paper evaluates several alternative cache management schemes that identify and remove only enough traces to make room for a new trace. We find that by treating the code cache as a circular buffer, we can reduce the code cache miss rate by half of that achieved by flushing. Furthermore, this approach adds very little bookkeeping overhead and avoids the problems associated with code cache fragmentation. These characteristics are extremely important in a dynamic system since more complex strategies will do more harm than good if the overhead is too high.
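The difference between flushing and circular-buffer management can be sketched with a small simulation. This is an illustrative model under stated assumptions, not the paper's implementation: the function name `count_misses`, the trace sizes, and the access sequence are hypothetical, and the model ignores jump patching and fragmentation details.

```python
from collections import OrderedDict


def count_misses(accesses, capacity, policy):
    """Replay (trace_id, size) accesses against a code cache; return misses.

    policy "flush" empties the whole cache when a new trace does not fit.
    policy "circular" evicts the oldest traces, in insertion order, until
    the new trace fits.
    """
    cache = OrderedDict()  # trace_id -> size, oldest insertion first
    used = 0
    misses = 0
    for trace_id, size in accesses:
        if size > capacity:
            raise ValueError("trace larger than the cache")
        if trace_id in cache:
            continue  # hit: the trace is already cached
        misses += 1
        if policy == "flush":
            if used + size > capacity:
                cache.clear()
                used = 0
        elif policy == "circular":
            while used + size > capacity:
                _, evicted_size = cache.popitem(last=False)
                used -= evicted_size
        else:
            raise ValueError("unknown policy: " + policy)
        cache[trace_id] = size
        used += size
    return misses


# Hypothetical workload: 25-byte traces in a 75-byte cache.
accesses = [(name, 25) for name in "ABCDCB"]
flush_misses = count_misses(accesses, 75, "flush")
circular_misses = count_misses(accesses, 75, "circular")
```

In this toy sequence, inserting trace D forces the flush policy to discard B and C, which are reused immediately afterwards, while the circular policy evicts only A. The bookkeeping for the circular policy is a single ordered queue, in line with the low overhead the abstract reports.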
NWSLite: A Light-Weight Prediction Utility for Mobile Devices
, 2004
Cited by 16 (4 self)
Computation off-loading, i.e., remote execution, has been shown to be effective for extending the computational power and battery life of resource-restricted devices, e.g., hand-held, wearable, and pervasive computers. Remote execution systems must predict the cost of executing both locally and remotely to determine when off-loading will be most beneficial. These costs, however, are dependent upon the execution behavior of the task being considered and the highly variable performance of the underlying resources, e.g., CPU (local and remote), bandwidth, and network latency. As such, remote execution systems must employ sophisticated prediction techniques that accurately guide computation off-loading. Moreover, these techniques must be efficient, i.e., they cannot consume significant resources, e.g., energy, execution time, etc., since they are performed on the mobile device.
Dynamic Binary Translation for Accumulator-Oriented Architectures
, 2003
Cited by 16 (3 self)
A dynamic binary translation system for a co-designed virtual machine is described and evaluated. The underlying hardware directly executes an accumulator-oriented instruction set that exposes instruction dependence chains (strands) to a distributed microarchitecture containing a simple instruction pipeline and issue logic. To support conventional program binaries, a source instruction set (Compaq Alpha in our study) is dynamically translated to the target accumulator instruction set. The binary translator identifies chains of inter-instruction dependences and assigns them to dependence-carrying accumulators. Because the underlying superscalar microarchitecture is capable of dynamic instruction scheduling, the binary translation system does not perform aggressive optimizations or re-schedule code; this significantly reduces binary translation overhead.
Workload Design: Selecting Representative Program-Input Pairs
- Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
, 2002
Cited by 14 (1 self)
Having a representative workload of the target domain of a microprocessor is extremely important throughout its design. The composition of a workload involves two issues: (i) which benchmarks to select and (ii) which input data sets to select per benchmark. Unfortunately, it is impossible to select a huge number of benchmarks and respective input sets due to the large instruction counts per benchmark and due to limitations on the available simulation time. In this paper, we use statistical data analysis techniques such as principal components analysis (PCA) and cluster analysis to efficiently explore the workload space. Within this workload space, different input data sets for a given benchmark can be displayed, a distance can be measured between program-input pairs that gives us an idea about their mutual behavioral differences and representative input data sets can be selected for the given benchmark. This methodology is validated by showing that program-input pairs that are close to each other in this workload space indeed exhibit similar behavior. The final goal is to select a limited set of representative benchmark-input pairs that span the complete workload space. Next to workload composition, there are a number of other possible applications, namely getting insight in the impact of input data sets on program behavior and profile-guided compiler optimizations.

