ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors
, 1997
Cited by 118 (2 self)
Profile data is valuable for identifying performance bottlenecks and guiding optimizations. Periodic sampling of a processor's performance monitoring hardware is an effective, unobtrusive way to obtain detailed profiles. Unfortunately, existing hardware simply counts events, such as cache misses and branch mispredictions, and cannot accurately attribute these events to instructions, especially on out-of-order machines. We propose an alternative approach, called ProfileMe, that samples instructions. As a sampled instruction moves through the processor pipeline, a detailed record of all interesting events and pipeline stage latencies is collected. ProfileMe also supports paired sampling, which captures information about the interactions between concurrent instructions, revealing information about useful concurrency and the utilization of various pipeline stages while an instruction is in flight. We describe an inexpensive hardware implementation of ProfileMe, outline a variety of software...
A Hardware-Driven Profiling Scheme for Identifying Program Hot Spots to Support Runtime Optimization
- In Proceedings of the 26th Annual International Symposium on Computer Architecture
, 1999
Cited by 68 (4 self)
This paper presents a novel hardware-based approach for identifying, profiling, and monitoring hot spots in order to support runtime optimization of general-purpose programs. The proposed approach consists of a set of tightly coupled hardware tables and control logic modules that are placed in the retirement stage of a processor pipeline removed from the critical path. The features of the proposed design include rapid detection of program hot spots after changes in execution behavior, runtime-tunable selection criteria for hot spot detection, and negligible overhead during application execution. Experiments using several SPEC95 benchmarks, as well as several large Windows NT applications, demonstrate the promise of the proposed design.
1 Introduction
Optimizing compilers can gain significant performance benefits by performing code transformations based on a program's runtime profile. Traditionally, profiles are collected by running an instrumented version of the executable. However, bec...
System Support for Automatic Profiling and Optimization
Cited by 59 (6 self)
The Morph system provides a framework for automatic collection and management of profile information and application of profile-driven optimizations. In this paper, we focus on the operating system support that is required to collect and manage profile information on an end-user’s workstation in an automatic, continuous, and transparent manner. Our implementation for a Digital Alpha machine running Digital UNIX 4.0 achieves run-time overheads of less than 0.3% during profile collection. Through the application of three code layout optimizations, we further show that Morph can use statistical profiles to improve application performance. With appropriate system support, automatic profiling and optimization is both possible and effective.
A Programmable Co-processor for Profiling
- IN PROCEEDINGS OF THE 7TH INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTER ARCHITECTURE (HPCA-7
, 2001
Cited by 42 (1 self)
Aggressive program optimization requires accurate profile information, but such accuracy requires many samples to be collected. We explore a novel profiling architecture that reduces the overhead of collecting each sample by including a programmable co-processor that analyzes a stream of profile samples generated by a microprocessor. From this stream of samples, the co-processor can detect correlations between instructions (e.g., memory dependence profiling) as well as those between different dynamic instances of the same instruction (e.g., value profiling). The profiler's programmable nature allows a broad range of data to be extracted, post-processed, and formatted, and provides the flexibility to tailor the profiling application to the program under test. Because the co-processor is specialized for profiling, it can execute profiling applications more efficiently than a general-purpose processor. The co-processor should not significantly impact the cost or performance of the ...
Continuous Program Optimization: A Case Study
- ACM Transactions on Programming Languages and Systems
, 2003
Cited by 38 (7 self)
This paper presents a system that provides code generation at load-time and continuous program optimization at run-time. First, the architecture of the system is presented. Then, two optimization techniques are discussed that were developed specifically in the context of continuous optimization. The first of these optimizations continually adjusts the storage layouts of dynamic data structures to maximize data cache locality, while the second performs profile-driven instruction re-scheduling to increase instruction-level parallelism. These two optimizations have very different cost/benefit ratios, presented in a series of benchmarks. The paper concludes with an outlook to future research directions and an enumeration of some remaining research problems. The empirical results presented in this paper make a case in favor of continuous optimization, but indicate that it needs to be applied judiciously. In many situations, the costs of dynamic optimizations outweigh their benefit, so that no break-even point is ever reached. In favorable circumstances, on the other hand, speed-ups of over 120% have been observed. It appears as if the main beneficiaries of continuous optimization are shared libraries, which at different times can be optimized in the context of the currently dominant client application.
An Architectural Framework for Run-Time Optimization
- IEEE Transactions on Computers
, 2001
Cited by 30 (2 self)
Wide-issue processors continue to achieve higher performance by exploiting greater instruction-level parallelism. Dynamic techniques such as out-of-order execution and hardware speculation have proven effective at increasing instruction throughput. Run-time optimization promises to provide an even higher level of performance by adaptively applying aggressive code transformations on a larger scope. This paper presents a new hardware mechanism for generating and deploying run-time optimized code. The mechanism can be viewed as a filtering system that resides in the retirement stage of the processor pipeline, accepts an instruction execution stream as input, and produces instruction profiles and sets of linked, optimized traces as output. The code deployment mechanism uses an extension to the branch prediction mechanism to migrate execution into the new code without modifying the original code. These new components do not add delay to the execution of the program except during short bursts of reoptimization. This technique provides a strong platform for run-time optimization because the hot execution regions are extracted, optimized, and written to main memory for execution and because these regions persist across context switches. The current design of the framework supports a suite of optimizations including partial function inlining (even into shared libraries), code straightening optimizations, loop unrolling, and peephole optimizations.
Relational Profiling: Enabling Thread-Level Parallelism in Virtual Machines
, 2000
Cited by 28 (1 self)
Virtual machine service threads can perform many tasks in parallel with program execution such as garbage collection, dynamic compilation, and profile collection and analysis. Hardware-assisted profiling is essential for providing service threads with needed information in a flexible and efficient way. A relational profiling architecture (RPA) is proposed for meeting this goal.
Rapid Profiling via Stratified Sampling
, 2001
Cited by 28 (3 self)
Sophisticated binary translators and dynamic optimizers demand a program profiler with low overhead, high accuracy, and the ability to collect a variety of profile types. A profiling scheme that achieves these goals is proposed. Conceptually, the hardware compresses a stream of profile data by counting identical events; the compressed profile data is passed to software for analysis. Compressing the high-bandwidth event stream greatly reduces software overhead. Because optimizations can tolerate some profiling errors, we allow the stream compressor to be lossy, thereby enabling a low-cost sampling-based hardware design. Because the hardware compressor is insensitive to the event content, it supports various profile types and can process multiple types simultaneously. Basic components of our framework are periodic and random samplers, counters, and hash functions. These components are composed to form a variety of stream compressors. One design is both simple and very effective: the input stream is hash-split into multiple substreams, each of which is fed into a simple periodic sampler that selects every kth event. This stratified periodic sampler performs better than conventional random sampling because it biases each substream towards a small number of unique events, thereby reducing sampling error, and allowing faster convergence to an accurate profile. For example, convergence to a given level of accuracy is about twice as fast for gcc. When sampling overhead is considered, the stratified periodic profiler achieves less than 3% error while incurring an overhead of only 3.5% for gcc.
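The hash-split design described in this abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware: the function names, the use of CRC-32 as the hash, and the default values of `num_substreams` and `k` are all hypothetical choices made for illustration.

```python
from collections import Counter
import zlib


def stratified_periodic_sample(events, num_substreams=4, k=8):
    """Hash-split the event stream, then keep every k-th event per substream.

    CRC-32 is an assumed stand-in for the hash function; any stable hash
    that sends identical events to the same substream would do.
    """
    seen = [0] * num_substreams   # events observed so far in each substream
    sampled = Counter()           # events selected by the periodic samplers
    for event in events:
        substream = zlib.crc32(repr(event).encode()) % num_substreams
        seen[substream] += 1
        if seen[substream] % k == 0:
            sampled[event] += 1
    return sampled


def estimate_counts(sampled, k):
    """Scale each sampled count by the period k to estimate true counts."""
    return {event: count * k for event, count in sampled.items()}


if __name__ == "__main__":
    stream = ["a"] * 80 + ["b"] * 16
    samples = stratified_periodic_sample(stream, num_substreams=1, k=8)
    print(estimate_counts(samples, k=8))
```

Because identical events always hash to the same substream, each periodic sampler sees only a few distinct events, which is the bias toward "a small number of unique events" that the abstract credits for the reduced sampling error.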
Achieving High Performance via Co-Designed Virtual Machines
- In International Workshop on Innovative Architecture
, 1999
Cited by 25 (6 self)
Introduction Today's virtual machines use a layer of software that allows programs compiled in one instruction set to be executed on a processor executing a (different) native instruction set. Virtual machines have become popular in recent years for providing platform independence; however, virtual machines also open many new opportunities for enhancing performance. The co-design of virtual machine software and the underlying hardware microarchitecture will enable enhanced instruction level parallelism and more adaptable performance mechanisms than are possible when hardware and application software are separated by instruction set architectures as is traditionally done. In future high performance computers, a virtual instruction set architecture (V-ISA) will be the level for maintaining architectural compatibility. The V-ISA will be implemented with a virtual machine that blends software and hardware in a symbiotic manner via co-design. The hardware will support an implementationdep
Targeted Path Profiling: Lower Overhead Path Profiling for Staged Dynamic Optimization Systems
- In International Symposium on Code Generation and Optimization (CGO
, 2004
Cited by 14 (2 self)
In this paper, we present a technique for reducing the overhead of collecting path profiles in the context of a dynamic optimizer. The key idea of our approach, called Targeted Path Profiling (TPP), is to use an edge profile to simplify the collection of a path profile. This notion of profile-guided profiling is a natural fit for dynamic optimizers, which typically optimize the code in a series of stages. TPP is an extension to the Ball-Larus Efficient Path Profiling algorithm. Its increased efficiency comes from two sources: (i) reducing the number of potential paths by not enumerating paths with cold edges, allowing array accesses to be substituted for more expensive hash table lookups, and (ii) not instrumenting regions where paths can be unambiguously derived from an edge profile. Our results suggest that on average the overhead of profile collection can be reduced by half (SPEC95) to almost two-thirds (SPEC2000) relative to the Ball-Larus algorithm with minimal impact on the information collected.
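To make point (i) concrete, the sketch below applies the standard Ball-Larus edge numbering to an acyclic control-flow graph and then repeats it with one edge marked cold. It is a simplified illustration, not the TPP implementation: the function names, the example graph, and the choice of which edge is cold are hypothetical, and how executions that actually take a cold edge are handled is not covered here.

```python
def ball_larus_numbering(succ, entry, cold_edges=frozenset()):
    """Assign edge values so each hot entry-to-exit path sums to a unique ID.

    succ maps each node of a DAG to its list of successors. Edges listed
    in cold_edges are simply left out of the numbering (an assumption made
    for this sketch).
    """
    order, visited = [], set()

    def visit(node):
        # DFS post-order places every successor before its predecessor.
        if node in visited:
            return
        visited.add(node)
        for nxt in succ.get(node, []):
            visit(nxt)
        order.append(node)

    visit(entry)
    num_paths, edge_val = {}, {}
    for node in order:
        if not succ.get(node):          # exit node: exactly one (empty) path
            num_paths[node] = 1
            continue
        num_paths[node] = 0
        for nxt in succ[node]:
            if (node, nxt) in cold_edges:
                continue                # cold edges are not enumerated
            edge_val[(node, nxt)] = num_paths[node]
            num_paths[node] += num_paths[nxt]
    return edge_val, num_paths


def path_id(path, edge_val):
    """Sum the edge values along a path given as a list of nodes."""
    return sum(edge_val[(a, b)] for a, b in zip(path, path[1:]))


if __name__ == "__main__":
    cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"],
           "D": ["E", "F"], "E": ["G"], "F": ["G"], "G": []}
    vals, counts = ball_larus_numbering(cfg, "A")
    print(counts["A"], path_id(["A", "C", "D", "F", "G"], vals))
    _, hot_counts = ball_larus_numbering(cfg, "A", cold_edges={("D", "F")})
    print(hot_counts["A"])
```

In this example the full numbering yields four path IDs, while excluding the single cold edge halves that to two. A smaller, dense ID range is what makes a plain array of counters feasible in place of a hash table.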

