Continuous Path and Edge Profiling
- In IEEE/ACM International Symposium on Microarchitecture, 2005
Cited by 10 (2 self)
Microarchitectures increasingly rely on dynamic optimization to improve performance in ways that are difficult or impossible for ahead-of-time compilers. Dynamic optimizers in turn require continuous, portable, low cost, and accurate control-flow profiles to inform their decisions, but prior approaches have struggled to meet these goals simultaneously. This paper presents PEP, a
A Concurrent Dynamic Analysis Framework for Multicore Hardware
- 2009
Cited by 9 (4 self)
Software has spent the bounty of Moore’s law by solving harder problems and exploiting abstractions, such as high-level languages, virtual machine technology, binary rewriting, and dynamic analysis. Abstractions make programmers more productive and programs more portable, but usually slow them down. Since Moore’s law is now delivering multiple cores instead of faster processors, future systems must either bear a relatively higher cost for abstractions or use some cores to help tolerate abstraction costs. This paper presents the design, implementation, and evaluation of a novel concurrent, configurable dynamic analysis framework that efficiently utilizes multicore cache architectures. It introduces Cache-friendly Asymmetric Buffering (CAB), a lock-free ring buffer that implements efficient communication between application and analysis threads. We guide the design and implementation of our framework with a model of dynamic analysis overheads. The framework implements exhaustive and sampling event processing and is analysis-neutral. We evaluate the framework with five popular and diverse analyses, and show performance improvements even for lightweight, low-overhead analyses. Efficient inter-core communication is central to high-performance parallel systems, and we believe the CAB design gives insight into the subtleties and difficulties of attaining it for dynamic analysis and other parallel software.
Cache-Aware Cross-Profiling for Java Processors
- CASES'08, 2008
Cited by 5 (5 self)
Performance evaluation of embedded software is essential in an early development phase so as to ensure that the software will run on the embedded device’s limited computing resources. Prevailing approaches either require the deployment of the software on the embedded target, which can be tedious and may be impossible in an early development phase, or rely on simulation, which can be very slow. In this paper, we introduce a customizable cross-profiling framework for embedded Java processors, including processors featuring a method cache. The developer profiles the embedded software in the host environment, completely decoupled from the target system, on any standard Java Virtual Machine, but the generated profiles represent the execution time metric of the target system. Our cross-profiling framework is based on bytecode instrumentation. We identify several pointcuts in the execution of bytecode that need to be instrumented in order to estimate the CPU cycle consumption on the target system. An evaluation using the JOP embedded Java processor as target confirms that our approach reconciles high profile accuracy with moderate overhead. Our cross-profiling framework also enables the rapid evaluation of the performance impact of possible optimizations, such as different caching strategies.
Efficient remote profiling for resource-constrained devices
- ACM Trans. Archit. Code Optim. (TACO)
Cited by 3 (2 self)
The widespread use of ubiquitous, mobile, and continuously connected computing agents has inspired software developers to change the way they test, debug, and optimize software. Users now play an active role in the software evolution cycle by dynamically providing valuable feedback about the execution of a program to developers. Software developers can use this information to isolate bugs in, maintain, and improve the performance of a wide range of diverse and complex embedded device applications. The collection of such feedback poses a major challenge to systems researchers since it must be performed without degrading a user’s experience with, or consuming the severely restricted resources of, the mobile device. At the same time, the resource constraints of embedded devices prohibit the use of extant software profiling solutions. To achieve efficient remote profiling of embedded devices, we couple two efficient hardware/software program monitoring techniques: Hybrid Profiling Support (HPS) and Phase-Aware Sampling. HPS efficiently inserts profiling instructions into an executing program using a novel extension to Dynamic Instruction Stream Editing (DISE). Phase-aware sampling exploits the recurring behavior of programs to identify key opportunities during execution in order to collect profile information (i.e., to sample). Our prior work on phase-aware sampling required code duplication to toggle
Accurate, Efficient, and Adaptive Calling Context Profiling
- PLDI '06, 2006
Cited by 3 (1 self)
Calling context profiles are used in many inter-procedural code optimizations and in overall program understanding. Unfortunately, the collection of profile information is highly intrusive due to the high frequency of method calls in most applications. Previously proposed calling-context profiling mechanisms consequently suffer from either low accuracy, high overhead, or both. We have developed a new approach for building the calling context tree at runtime, called adaptive bursting. By selectively inhibiting redundant profiling, this approach dramatically reduces overhead while preserving profile accuracy. We first demonstrate the drawbacks of previously proposed calling context profiling mechanisms. We show that a low-overhead solution using sampled stack-walking alone is less than 50% accurate, based on degree of overlap with a complete calling-context tree. We also show that a static bursting approach collects a highly accurate profile, but causes an unacceptable application slowdown. Our adaptive solution achieves an 85% degree of overlap and provides 88% hot-edge coverage when using a 0.1 hot-edge threshold, while dramatically reducing overhead compared to the static bursting approach.
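To make the accuracy metric concrete, the following Python sketch computes a degree of overlap between two profiles. It assumes a common definition in which each edge contributes the smaller of its two normalized weights; the abstract does not spell out the formula. The function name `degree_of_overlap`, the flat call-edge representation (rather than a full calling context tree), and the sample counts are all hypothetical.

```python
def degree_of_overlap(profile_a, profile_b):
    """Return the overlap (0 to 100) between two edge-weight profiles.

    Assumed definition: normalize each profile so its weights sum to 1,
    then sum the minimum normalized weight over edges present in both.
    """
    total_a = sum(profile_a.values())
    total_b = sum(profile_b.values())
    if total_a == 0 or total_b == 0:
        return 0.0
    shared_edges = profile_a.keys() & profile_b.keys()
    overlap = sum(
        min(profile_a[e] / total_a, profile_b[e] / total_b)
        for e in shared_edges
    )
    return 100.0 * overlap


# Hypothetical data: a complete profile and a sampled one that misses an edge.
complete = {("main", "a"): 60, ("main", "b"): 30, ("a", "c"): 10}
sampled = {("main", "a"): 8, ("main", "b"): 2}

print(degree_of_overlap(complete, sampled))
```

With these made-up counts the sampled profile overlaps the complete one by roughly 80 percent: the edge it never observed and the edge it over-weights both reduce the score.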
Automatic Detection of Performance Design and Deployment Antipatterns in Component Based Enterprise Systems
"... The thesis is submitted to ..."
Correcting the Dynamic Call Graph Using Control Flow Constraints
- In International Conference on Compiler Construction, 2007
Cited by 2 (2 self)
To reason about programs, dynamic optimizers and analysis tools use sampling to collect a dynamic call graph (DCG). However, sampling has not achieved high accuracy with low runtime overhead. As object-oriented programmers compose increasingly complex programs, inaccurate call graphs will inhibit analysis and optimizations. This paper demonstrates how to use static and dynamic control flow graph (CFG) constraints to improve the accuracy of the DCG. We introduce the frequency dominator (FDOM), a novel CFG relation that extends the dominator relation to expose static relative execution frequencies of basic blocks. We combine conservation of flow and dynamic CFG basic block profiles to further improve the accuracy of the DCG. Together these approaches add minimal overhead (1%) and achieve 85% accuracy compared to a perfect call graph for SPEC JVM98 and DaCapo benchmarks. Compared to sampling alone, accuracy improves by 12 to 36%. These results demonstrate that static and dynamic control-flow information offer accurate information for efficiently improving the DCG.
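The conservation-of-flow idea mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the paper's FDOM algorithm; it only shows the underlying constraint that a basic block's execution count equals the sum of its outgoing edge counts, so one unobserved edge can be inferred from the rest. The function name `infer_missing_out_edge` and the counts are hypothetical.

```python
def infer_missing_out_edge(block_count, known_out_edges):
    """Infer the count of the single unobserved outgoing edge of a block.

    Flow conservation: block_count == sum of all outgoing edge counts.
    known_out_edges maps successor names to their observed counts.
    """
    missing = block_count - sum(known_out_edges.values())
    if missing < 0:
        raise ValueError("observed edge counts exceed the block count")
    return missing


# Hypothetical branch block executed 100 times; the edge to "else" was
# observed 30 times, so the edge to "then" must account for the rest.
then_count = infer_missing_out_edge(100, {"else": 30})
print(then_count)
```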
Evaluating the Accuracy of Java Profilers
Cited by 2 (0 self)
Performance analysts profile their programs to find methods that are worth optimizing: the “hot” methods. This paper shows that four commonly used Java profilers (xprof, hprof, jprofile, and yourkit) often disagree on the identity of the hot methods. If two profilers disagree, at least one must be incorrect. Thus, there is a good chance that a profiler will mislead a performance analyst into wasting time optimizing a cold method with little or no performance improvement. This paper uses causality analysis to evaluate profilers and to gain insight into the source of their incorrectness. It shows that these profilers all violate a fundamental requirement for sampling-based profilers: to be correct, a sampling-based profiler must collect samples randomly. We show that a proof-of-concept profiler, which collects samples randomly, does not suffer from the above problems. Specifically, we show, using a number of case studies, that our profiler correctly identifies methods that are important to optimize; in some cases other profilers report that these methods are cold and thus not worth optimizing.
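The requirement that samples be collected randomly can be illustrated with a small Python simulation. It is a hypothetical toy, not the paper's experiment: the timeline, the method names, and the periodic sampler are assumptions chosen to show one way a non-random sampler can systematically misreport which method is hot.

```python
import random


def build_timeline(ticks=1000, period=10):
    # Hypothetical program: "cold" runs on the last tick of each period,
    # "hot" runs on every other tick (90% of the time).
    return ["cold" if t % period == period - 1 else "hot" for t in range(ticks)]


def periodic_samples(timeline, period=10, offset=9):
    # Non-random sampler: always fires at the same phase of the period.
    return [timeline[t] for t in range(offset, len(timeline), period)]


def random_samples(timeline, n, seed=0):
    # Random sampler: every tick is equally likely to be sampled.
    rng = random.Random(seed)
    return [timeline[rng.randrange(len(timeline))] for _ in range(n)]


def hot_fraction(samples):
    return samples.count("hot") / len(samples)


timeline = build_timeline()
periodic = periodic_samples(timeline)
randomized = random_samples(timeline, n=1000)

print("true hot fraction:    ", hot_fraction(timeline))
print("periodic hot fraction:", hot_fraction(periodic))
print("random hot fraction:  ", hot_fraction(randomized))
```

In this toy setup the periodic sampler is aligned with the program's own period and never observes the hot method, while the random sampler's estimate stays close to the true 90 percent.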
Inferred Call Path Profiling
- 2009
Cited by 2 (0 self)
Prior work has found call path profiles to be useful for optimizers and programmer-productivity tools. Unfortunately, previous approaches for collecting path profiles are expensive: they need to either execute additional instructions (to track calls and returns) or they need to walk the stack. The state-of-the-art techniques for call path profiling slow down the program by 7% (for C programs) and 20% (for Java programs). This paper describes an innovative technique that collects minimal information from the running program and later (offline) infers the full call paths from this information. The key insight behind our approach is that two pieces of information readily available during program execution, the height of the call stack and the identity of the currently executing function, are good indicators of calling context. We call this pair a context identifier. Because more than one call path may have the same context identifier, we show how to disambiguate context identifiers by changing the sizes of function activation records. This disambiguation has no overhead in terms of executed instructions. We evaluate our approach on the SPEC CPU 2006 C++ and C benchmarks. We show that collecting context identifiers slows down programs by 0.17% (geometric mean). We can map these context identifiers to the correct unique call path 80% of the time for C++ programs and 95% of the time for C programs.
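The context-identifier idea can be sketched in a few lines of Python. This is an illustrative model only, assuming the stack height is simply the sum of the frame sizes along the call path; the function names, frame sizes, and call paths below are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def context_ids(call_paths, frame_sizes):
    """Group call paths by context identifier.

    A context identifier is modeled as (stack height, current function),
    where stack height is the sum of frame sizes along the path.
    """
    groups = defaultdict(list)
    for path in call_paths:
        height = sum(frame_sizes[f] for f in path)
        groups[(height, path[-1])].append(path)
    return dict(groups)


def ambiguous(groups):
    # Identifiers shared by more than one call path cannot be mapped back
    # to a unique path offline.
    return {cid: paths for cid, paths in groups.items() if len(paths) > 1}


# Two hypothetical paths reach c(); a() and b() have equal frame sizes,
# so both paths yield the same (height, function) pair.
paths = [("main", "a", "c"), ("main", "b", "c")]
sizes = {"main": 16, "a": 32, "b": 32, "c": 16}
before = ambiguous(context_ids(paths, sizes))

# Padding b()'s activation record changes the stack height on one path,
# which separates the two identifiers without adding any instructions.
padded = dict(sizes, b=48)
after = ambiguous(context_ids(paths, padded))

print(before)
print(after)
```

Before padding, both paths collapse onto the identifier `(64, "c")`; after enlarging the frame of `b`, the heights differ (64 and 80) and no identifier is ambiguous.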

