Results 1 - 10 of 28
Bursty Tracing: A Framework for Low-Overhead Temporal Profiling
In 4th ACM Workshop on Feedback-Directed and Dynamic Optimization, 2001
Cited by 46 (9 self)
With processor speed increasing much more rapidly than memory access speed, memory system optimizations have the potential to significantly improve program performance. Unfortunately, cache-level optimizations often require detailed temporal information about a program's references to be effective. Traditional techniques for obtaining this information are too expensive to be practical in an on-line setting. We address this problem by describing and evaluating a framework for low-overhead temporal profiling. Our framework extends the Arnold-Ryder framework that uses instrumentation and counter-based sampling to collect frequency profiles with low overhead. Our framework samples bursts (sub-sequences) of the trace of all runtime events to construct a temporal program profile. Our bursty tracing profiler is built using Vulcan, an executable-editing tool for x86, and we evaluate it on optimized x86 binaries. Like the Arnold-Ryder framework, we have the advantages of not requiring operating system or hardware support and being deterministic. Unlike them, we are not limited to capturing temporal relationships on intraprocedural acyclic paths since our trace bursts can span procedure boundaries. In addition, our framework does not require access to program source or recompilation. A direct implementation of our extensions to the Arnold-Ryder framework results in profiling overhead of 6-35%. We describe techniques that reduce this overhead to 3-18%, making it suitable for use in an on-line setting.
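The burst-sampling idea in this abstract can be illustrated with a minimal sketch. It is not the paper's Vulcan-based implementation: the function name `bursty_trace` and the parameters `n_check` (events skipped between bursts) and `n_instr` (events recorded per burst) are assumptions made for illustration, and the event stream is an ordinary Python iterable rather than instrumented x86 code.

```python
def bursty_trace(events, n_check=5, n_instr=2):
    """Return a list of bursts (sub-sequences) sampled from `events`.

    Hypothetical sketch: skip `n_check` events, then record the next
    `n_instr` events as one burst, and repeat. Both counts are
    illustrative parameters, not values taken from the paper.
    """
    if n_check < 1 or n_instr < 1:
        raise ValueError("n_check and n_instr must be at least 1")

    bursts = []
    current = []          # burst being recorded, if any
    tracing = False       # False: skipping events, True: recording a burst
    countdown = n_check   # events left in the current mode

    for ev in events:
        if tracing:
            current.append(ev)
        countdown -= 1
        if countdown == 0:
            if tracing:
                # Burst complete: store it and go back to skipping.
                bursts.append(current)
                current = []
                tracing = False
                countdown = n_check
            else:
                # Skipping phase over: start recording a burst.
                tracing = True
                countdown = n_instr

    if current:
        bursts.append(current)  # keep a partial burst at end of stream
    return bursts


print(bursty_trace(range(14)))  # [[5, 6], [12, 13]]
```

Because the sampling is driven only by counters, the same input stream always yields the same bursts, which matches the determinism the abstract claims for counter-based sampling.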
Rapid Profiling via Stratified Sampling
2001
Cited by 28 (3 self)
Sophisticated binary translators and dynamic optimizers demand a program profiler with low overhead, high accuracy, and the ability to collect a variety of profile types. A profiling scheme that achieves these goals is proposed. Conceptually, the hardware compresses a stream of profile data by counting identical events; the compressed profile data is passed to software for analysis. Compressing the high-bandwidth event stream greatly reduces software overhead. Because optimizations can tolerate some profiling errors, we allow the stream compressor to be lossy, thereby enabling a low-cost sampling-based hardware design. Because the hardware compressor is insensitive to the event content, it supports various profile types and can process multiple types simultaneously. Basic components of our framework are periodic and random samplers, counters, and hash functions. These components are composed to form a variety of stream compressors. One design is both simple and very effective: the input stream is hash-split into multiple substreams, each of which is fed into a simple periodic sampler that selects every kth event. This stratified periodic sampler performs better than conventional random sampling because it biases each substream towards a small number of unique events, thereby reducing sampling error, and allowing faster convergence to an accurate profile. For example, convergence to a given level of accuracy is about twice as fast for gcc. When sampling overhead is considered, the stratified periodic profiler achieves less than 3% error while incurring an overhead of only 3.5% for gcc.
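The stratified periodic sampler described above (hash-split the stream, then keep every kth event of each substream) can be sketched in software as follows. The abstract proposes a hardware design; this is only an illustration, and the function name, the use of CRC32 as the hash, and the scaling of each sample by k are assumptions, not details from the paper.

```python
import zlib
from collections import Counter


def stratified_periodic_profile(events, num_strata=4, k=3):
    """Estimate event counts with a stratified periodic sampler.

    Hypothetical sketch: CRC32 stands in for the hardware hash function,
    and each sampled event is credited with k occurrences.
    """
    seen = [0] * num_strata   # events observed so far in each substream
    profile = Counter()       # estimated occurrence count per event

    for ev in events:
        # Hash-split: identical events always land in the same substream.
        stratum = zlib.crc32(str(ev).encode()) % num_strata
        seen[stratum] += 1
        # Periodic sampler: select every k-th event of this substream.
        if seen[stratum] % k == 0:
            profile[ev] += k
    return profile


profile = stratified_periodic_profile(["A"] * 10, num_strata=4, k=3)
print(profile["A"])  # 9: samples taken at the 3rd, 6th and 9th event
```

Since each substream loses at most k - 1 events to rounding, the total estimated count falls short of the true stream length by at most num_strata * (k - 1).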
Phase-Aware Remote Profiling
In Conference on Code Generation and Optimization (CGO), 2005
Cited by 21 (6 self)
Recent advances in networking and embedded device technology have made the vision of ubiquitous computing a reality; users can access the Internet's vast offerings anytime and anywhere. Moreover, battery-powered devices such as personal digital assistants and web-enabled mobile phones have successfully emerged as new access points to the world's digital infrastructure. This ubiquity offers a new opportunity for software developers: users can now participate in the software development, optimization, and evolution process while they use their software. Such participation
LLVA: A Low-level Virtual Instruction Set Architecture
In MICRO-36, 2003
Cited by 20 (3 self)
A virtual instruction set architecture (V-ISA) implemented via a processor-specific software translation layer can provide great flexibility to processor designers. Recent examples such as Crusoe and DAISY, however, have used existing hardware instruction sets as virtual ISAs, which complicates translation and optimization. In fact, there has been little research on specific designs for a virtual ISA for processors. This paper proposes a novel virtual ISA (LLVA) and a translation strategy for implementing it on arbitrary hardware. The instruction set is typed, uses an infinite virtual register set in Static Single Assignment form, and provides explicit control-flow and dataflow information, and yet uses low-level operations closely matched to traditional hardware. It includes novel mechanisms to allow more flexible optimization of native code, including a flexible exception model and minor constraints on self-modifying code. We propose a translation strategy that enables offline translation and transparent offline caching of native code and profile information, while remaining completely OS-independent. It also supports optimizations directly on the representation at install-time, runtime, and offline between executions. We show experimentally that the virtual ISA is compact, it is closely matched to ordinary hardware instruction sets, and permits very fast code generation, yet has enough high-level information to permit sophisticated program analyses and optimizations.
Warp processors
ACM Transactions on Design Automation of Electronic Systems (TODAES), 2006
Cited by 13 (2 self)
We describe a new processing architecture, known as a warp processor, that utilizes a field-programmable gate array (FPGA) to improve the speed and energy consumption of a software binary executing on a microprocessor. Unlike previous approaches that also improve software using an FPGA but do so using a special compiler, a warp processor achieves those improvements completely transparently and operates from a standard binary. A warp processor dynamically detects the binary’s critical regions, re-implements those regions as a custom hardware circuit in the FPGA, and replaces the software region by a call to the new hardware implementation of that region. While not all benchmarks can be improved using warp processing, many can, and the improvements are dramatically better than achievable by more traditional architecture improvements. The hardest part of warp processing is that of dynamically re-implementing code regions on an FPGA, requiring partitioning, decompilation, synthesis, placement, and routing tools, all having to execute with minimal computation time and data memory so as to coexist on-chip with the main processor. We describe our results of developing a warp processor. We developed a custom FPGA fabric specifically designed to enable lean place and route tools, and we developed extremely fast and efficient versions of partitioning, decompilation, synthesis, technology mapping, placement, and routing. Warp processors achieve overall application speedups of 6.3X with energy savings of 66% across a set of embedded benchmark applications.
DISE: Dynamic instruction stream editing
2002
Cited by 10 (2 self)
Many people deserve thanks for helping me navigate through my PhD. First and foremost, I must thank my wife, Stephanie, for her loving support without which I certainly would not have succeeded. She is a wonderful companion, and I feel like the luckiest man on the planet to be married to her. I thank her for her patience through my many long work days, and for helping me stay sane through my many deadlines. My parents, Art and Nancy, were also extremely supportive throughout my six years in graduate school. I greatly appreciated their loving phone calls, emails, and visits. They have always been there for me. I also must thank my brother, Ryan, my grandmother, Barbara, as well as Stephanie’s family. Their encouragement and loving support certainly helped me through my PhD. My advisor, E Christopher Lewis, is chiefly responsible for my academic and professional development. I have benefitted profusely from his guidance and support. I learned from E what it means to deeply understand a research problem, and to always consider the broader impact of my research. E is also an incredible teacher, breaking the most complicated concepts down into simple manageable pieces. I will try to emulate these skills
Runtime specialization with optimistic heap analysis
In Proceedings of the ACM Conference on Object-Oriented Programming Systems, Languages and Applications, 2005
Cited by 9 (0 self)
We describe a highly practical program specializer for Java programs. The specializer is powerful, because it specializes optimistically, using (potentially transient) constants in the heap; it is precise, because it specializes using data structures that are only partially invariant; it is deployable, because it is hidden in a JIT compiler and does not require any user annotations or offline preprocessing; it is simple, because it uses existing JIT compiler ingredients; and it is fast, because it specializes programs in under 1s. These properties are the result of (1) a new algorithm for selecting specializable code fragments, based on a notion of influence; (2) a precise store profile for identifying constant heap locations; and (3) an efficient invalidation mechanism for monitoring optimistic assumptions about heap constants. Our implementation of the specializer in the Jikes RVM has low overhead, selects specialization points that would be chosen manually, and produces speedups ranging from a factor of 1.2 to 6.4, comparable with annotation-guided specializers.
Ubiquitous memory introspection
In CGO ’07: Proceedings of the International Symposium on Code Generation and Optimization, 2007
Cited by 8 (3 self)
Modern memory systems play a critical role in the performance of applications, but a detailed understanding of the application behavior in the memory system is not trivial to attain. It requires time-consuming simulations and detailed modeling of the memory hierarchy, often using long address traces. It is increasingly possible to access hardware performance counters to count relevant events in the memory system, but the measurements are coarse-grained and better suited for performance summaries than providing instruction-level feedback. The availability of a low-cost, online, and accurate methodology for deriving fine-grained memory behavior profiles can prove extremely useful for runtime analysis and optimization of programs. This paper presents a new methodology for Ubiquitous Memory Introspection (UMI). It is an online and lightweight methodology that uses fast mini-simulations to analyze short memory access traces recorded from frequently executed code regions. The simulations provide profiling results at varying granularities, down to that of a single instruction or address. UMI naturally complements runtime optimizations and enables new opportunities for online memory-specific optimizations. We present a prototype runtime system implementing UMI. The prototype has an average runtime overhead of 14%. This overhead is only 1% more than a state-of-the-art binary instrumentation tool. We used 32 benchmarks, including the full suite of SPEC CPU2000 benchmarks, for evaluation. We show that the mini-simulations accurately reflect the cache performance of two existing memory systems, an Intel Pentium 4 and an AMD Athlon MP (K7). We also demonstrate that UMI predicts delinquent load instructions with an 88% rate of accuracy for applications with a relatively high number of cache misses, and 61% overall. The online profiling results are used at runtime to implement a simple software prefetching strategy that achieves an overall speedup of 64% in the best case.
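As a rough illustration of what a mini-simulation over a short memory access trace might look like, the sketch below replays (instruction address, data address) pairs through a tiny direct-mapped cache and attributes misses to individual instructions. The abstract does not describe UMI's simulator, so the cache organization, the function name `mini_simulate`, and the parameters `num_sets` and `line_size` are all assumptions chosen for brevity.

```python
from collections import Counter


def mini_simulate(trace, num_sets=4, line_size=64):
    """Count cache misses per instruction for a short access trace.

    Hypothetical sketch: `trace` is an iterable of (pc, address) pairs
    and the cache model is direct-mapped, which is an assumption rather
    than the configuration used by UMI.
    """
    tags = [None] * num_sets   # cache line currently held by each set
    misses = Counter()         # miss count per instruction address (pc)

    for pc, addr in trace:
        line = addr // line_size   # cache line containing this address
        index = line % num_sets    # direct-mapped set index
        if tags[index] != line:
            misses[pc] += 1        # miss: charge it to the instruction
            tags[index] = line     # and fill the line
    return misses


trace = [(0x10, 0), (0x10, 8), (0x20, 256), (0x10, 0)]
print(mini_simulate(trace))  # Counter({16: 2, 32: 1})
```

Instructions with the highest miss counts in such a profile would be the candidates for what the abstract calls delinquent loads.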
HPS: Hybrid Profiling Support
In Conference on Parallel Architectures and Compilation Techniques (PACT), 2005
Cited by 5 (2 self)
Key to understanding and optimizing complex applications is our ability to dynamically monitor executing programs with low overhead and high accuracy. Toward this end, we present HPS, a Hybrid Profiling Support system. HPS employs a hardware/software approach to program sampling that transparently, efficiently, and dynamically samples an executing instruction stream. Our system is an extension and application of Dynamic Instruction Stream Editing (DISE), a hardware technique that macro-expands instructions in the pipeline decode stage at runtime. HPS toggles profiling to sample the executing program as required by the profile consumer, e.g. a dynamic optimizer. Our system requires few hardware resources and introduces no “basic” overhead – the execution of instructions that triggers profiling. We use HPS to investigate the tradeoffs between overhead and accuracy of different profile types as well as different profiling schemes. In particular, we empirically evaluate hot data stream, hot call pair, and hot method identification using a number of parameterizations of bursty tracing, a popular sampling scheme used in dynamic optimization systems.
A Programmable Hardware Path Profiler
In International Symposium on Code Generation and Optimization, 2005
Cited by 4 (0 self)
For aggressive path-based program optimizations to be profitable in cost-sensitive environments, accurate path profiles must be available at low overheads. In this paper, we propose a low-overhead, non-intrusive hardware path profiling scheme that can be programmed to detect several types of paths including acyclic, intra-procedural paths, extended paths and sub-paths for the Whole Program Path. The profiler consists of a path stack, which detects paths and generates a sequence of path descriptors using branch information from the processor pipeline, and a hot path table that collects a profile of hot paths for later use by a program optimizer. With assistance from the processor’s event detection logic, our profiler can track a host of architectural metrics along paths, enabling context-sensitive performance monitoring and bottleneck analysis. We illustrate the utility of our scheme by associating paths with a power metric that estimates power consumption in the cache hierarchy caused by instructions along the path. Experiments using programs from the SPEC CPU2000 benchmark suite show that our path profiler, occupying 7KB of hardware real-estate, collects accurate path profiles (average overlap of 88% with a perfect profile) at negligible execution time overheads (0.6% on average).

