Exploiting hardware performance counters with flow and context sensitive profiling
- ACM SIGPLAN Notices, 1997
- Cited by 189 (9 self)
A program profile attributes run-time costs to portions of a program's execution. Most profiling systems suffer from two major deficiencies: first, they only apportion simple metrics, such as execution frequency or elapsed time, to static, syntactic units, such as procedures or statements; second, they aggressively reduce the volume of information collected and reported, although aggregation can hide striking differences in program behavior. This paper addresses both concerns by exploiting the hardware counters available in most modern processors and by incorporating two concepts from data-flow analysis, flow and context sensitivity, to report more context for measurements. This paper extends our previous work on efficient path profiling to flow-sensitive profiling, which associates hardware performance metrics with a path through a procedure. In addition, it describes a data structure, the calling context tree, that efficiently captures calling contexts for procedure-level measurements. Our measurements show that the SPEC95 benchmarks execute a small number (3–28) of hot paths that account for 9–98% of their L1 data cache misses. Moreover, these hot paths are concentrated in a few routines, which have complex dynamic behavior.
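To make the calling context tree idea concrete, here is a minimal sketch in Python. It is not the paper's implementation: the class names, the `enter`/`leave`/`record` methods, and the single integer metric are all assumptions made for illustration, and the sketch does nothing special for recursive calls. It only shows how the same procedure reached through different call chains gets separate nodes, so a metric is attributed per context and not per procedure.

```python
from dataclasses import dataclass, field


@dataclass
class CCTNode:
    """One calling context: a procedure reached through a specific caller chain."""
    name: str
    parent: "CCTNode | None" = None
    children: dict = field(default_factory=dict)
    metric: int = 0  # hypothetical metric, e.g. cache misses seen in this context


class CallingContextTree:
    """Hypothetical helper that tracks the current context during execution."""

    def __init__(self):
        self.root = CCTNode("<root>")
        self.current = self.root

    def enter(self, proc):
        # Reuse the child node if this caller already called proc; else create it.
        child = self.current.children.get(proc)
        if child is None:
            child = CCTNode(proc, parent=self.current)
            self.current.children[proc] = child
        self.current = child

    def record(self, amount):
        # Attribute a measurement to the current calling context only.
        self.current.metric += amount

    def leave(self):
        if self.current.parent is not None:
            self.current = self.current.parent

    def contexts(self):
        """Yield (call chain, metric) for every context in the tree."""
        stack = [(self.root, ())]
        while stack:
            node, chain = stack.pop()
            for name, child in node.children.items():
                child_chain = chain + (name,)
                yield child_chain, child.metric
                stack.append((child, child_chain))
```

In this sketch, calling `enter("main")`, `enter("a")`, `enter("c")` and then `record(5)` charges the 5 units to the chain `main -> a -> c`, leaving any `main -> b -> c` context untouched.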
The use of program profiling for software maintenance with applications to the year 2000 problem
- ACM Software Engineering Notes, 1997
- Cited by 86 (5 self)
This paper describes new techniques to help with testing and debugging, using information obtained from path profiling. A path profiler instruments a program so that the number of times each different loop-free path executes is accumulated during an execution run. With such an instrumented program, each run of the program generates a path spectrum for the execution: a distribution of the paths that were executed during that run. A path spectrum is a finite, easily obtainable characterization of a program's execution on a dataset, and provides a behavior signature for a run of the program. Our techniques are based on the idea of comparing path spectra from different runs of the program. When different runs produce different spectra, the spectral differences can be used to identify paths in the program along which control diverges in the two runs. By choosing input datasets to hold all factors constant except one, the divergence can be attributed to this factor. The point of divergence itself may not be the cause of the underlying problem, but provides a starting place for a programmer to begin his exploration. One application of this technique is in the "Year 2000 Problem" (i.e., the problem of fixing computer systems that use only 2-digit year fields in date-valued data). In this context, path-spectrum comparison provides a heuristic for identifying paths in a program that are good candidates for being date-dependent computations. The application of path-spectrum comparison to a number of other software-maintenance issues is also discussed.
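The comparison of path spectra can be sketched as follows. This is an illustrative reading of the abstract, not the authors' tool: the function names are hypothetical, paths are represented by arbitrary integer IDs, and a spectrum is modeled simply as a count per path ID.

```python
from collections import Counter


def path_spectrum(executed_paths):
    """Build a spectrum: how many times each path ID ran in one execution."""
    return Counter(executed_paths)


def spectrum_difference(spectrum_a, spectrum_b):
    """Return the path IDs executed in only one of the two runs.

    These are candidate places where control diverged between the runs.
    """
    ran_a = {p for p, n in spectrum_a.items() if n > 0}
    ran_b = {p for p, n in spectrum_b.items() if n > 0}
    return sorted(ran_a - ran_b), sorted(ran_b - ran_a)
```

For example, if two runs differ only in whether the input dates fall before or after the year 2000, the paths reported by `spectrum_difference` are starting points for finding date-dependent code.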
Wisconsin Wind Tunnel II: A Fast and Portable Parallel Architecture Simulator
- In Workshop on Performance Analysis and Its Impact on Design (PAID), 1997
- Cited by 39 (7 self)
The design of future parallel computers requires rapid simulation of target designs running realistic workloads. These simulations have been accelerated using two techniques: direct execution and the use of a parallel host. Historically, these techniques have been considered to have poor portability. This paper identifies and describes the implementation of four key operations necessary to make such simulation portable across a variety of parallel computers. These four operations are: calculation of target execution time, simulation of features of interest, communication of target messages, and synchronization of host processors. Portable implementations of these four operations have allowed us to easily run the Wisconsin Wind Tunnel II (WWT II), a parallel, discrete-event, direct-execution simulator, across a wide range of platforms, such as desktop workstations, a SUN Enterprise server, a cluster of workstations, and a cluster of symmetric multiprocessing nodes. We plan to release ...
Targeted Path Profiling: Lower Overhead Path Profiling for Staged Dynamic Optimization Systems
- In International Symposium on Code Generation and Optimization (CGO), 2004
- Cited by 14 (2 self)
In this paper, we present a technique for reducing the overhead of collecting path profiles in the context of a dynamic optimizer. The key idea of our approach, called Targeted Path Profiling (TPP), is to use an edge profile to simplify the collection of a path profile. This notion of profile-guided profiling is a natural fit for dynamic optimizers, which typically optimize the code in a series of stages. TPP is an extension to the Ball-Larus Efficient Path Profiling algorithm. Its increased efficiency comes from two sources: (i) reducing the number of potential paths by not enumerating paths with cold edges, allowing array accesses to be substituted for more expensive hash table lookups, and (ii) not instrumenting regions where paths can be unambiguously derived from an edge profile. Our results suggest that on average the overhead of profile collection can be reduced by half (SPEC95) to almost two-thirds (SPEC2000) relative to the Ball-Larus algorithm with minimal impact on the information collected.
Selective Path Profiling
- In Workshop on Program Analysis for Software Tools and Engineering (PASTE), 2002
- Cited by 5 (1 self)
Recording dynamic information for only a subset of program entities can reduce monitoring overhead and can facilitate efficient monitoring of deployed software. Program entities, such as statements, can be monitored using probes that track the execution of those entities. Monitoring more complicated entities, such as paths or definition-use associations, requires more sophisticated techniques that track not only the execution of the desired entities but also the execution of other entities with which they interact. This paper presents an approach for monitoring subsets of one such program entity: acyclic paths in procedures. Our selective path profiling algorithm computes values for probes that guarantee that the sum of the assigned values along each acyclic path (path sum) in the subset is unique; acyclic paths not in the subset may or may not have unique path sums. The paper also presents the results of studies that compare the number of probes required for subsets of various sizes with the number of probes required for profiling all paths, computed using Ball and Larus' path profiling algorithm. Our results indicate that the algorithm performs well on many procedures by requiring only a small percentage of probes for monitoring the subset.
Efficient Path Profiling
- In Proceedings of the 29th Annual International Symposium on Microarchitecture, 1996
A path profile determines how many times each acyclic path in a routine executes. This type of profiling subsumes the more common basic block and edge profiling, which only approximate path frequencies. Path profiles have many potential uses in program performance tuning, profile-directed compilation, and software test coverage. This paper describes a new algorithm for path profiling. This simple, fast algorithm selects and places profile instrumentation to minimize run-time overhead. Instrumented programs run with overhead comparable to the best previous profiling techniques. On the SPEC95 benchmarks, path profiling overhead averaged 31%, as compared to 16% for efficient edge profiling. Path profiling also identifies longer paths than a previous technique, which predicted paths from edge profiles (average of 88, versus 34 instructions). Moreover, profiling shows that the SPEC95 train input datasets covered most of the paths executed in the ref datasets.
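The abstract does not spell out how paths are counted, so the following is only a sketch of one well-known ingredient of this style of path profiling: assigning each edge of an acyclic control-flow graph an integer so that the edge values along every entry-to-exit path sum to a distinct ID in the range 0 to N-1, where N is the number of paths. The function names and the example graph are assumptions for illustration; the sketch assumes a DAG with no duplicate edges in which every node reaches the exit, and it omits the instrumentation-placement step that the abstract credits with minimizing overhead.

```python
def number_paths(succ, entry, exit_node):
    """Assign edge values so each entry-to-exit path has a unique sum.

    succ maps each node to an ordered list of its successors (a DAG).
    Returns (number of paths from entry, dict of edge -> value).
    """
    num_paths = {}
    edge_val = {}

    def visit(v):
        if v in num_paths:
            return num_paths[v]
        if v == exit_node:
            num_paths[v] = 1
            return 1
        total = 0
        for w in succ.get(v, []):
            # Paths through this edge are numbered after those through earlier edges.
            edge_val[(v, w)] = total
            total += visit(w)
        num_paths[v] = total
        return total

    return visit(entry), edge_val


def path_id(path, edge_val):
    """Sum the edge values along a path given as a list of nodes."""
    return sum(edge_val[(a, b)] for a, b in zip(path, path[1:]))
```

With such a numbering, a profiler can keep one running sum per routine activation and use the final sum as an index into a counter array.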
An Empirical Study of Tracing Techniques
Tracing is a dynamic analysis technique to continuously capture events of interest in a running program. The occurrence of a statement, the invocation of a function, and the trigger of a signal are examples of traced events. Software engineers employ traces to accomplish various tasks, ranging from performance monitoring to failure analysis. Despite its capabilities, tracing can negatively impact the performance and general behavior of an application. In order to minimize that impact, traces are normally buffered and transferred to (slower) permanent storage at specific intervals. This scenario presents a delicate balance. Increased buffering can minimize the impact on the target program, but it increases the risk of losing valuable collected data in the event of a failure. Frequent disk transfers can ensure traced data integrity, but they risk a high impact on the target program. We conducted an experiment involving six tracing schemes and various buffer sizes to address these trade-offs. Our results highlight opportunities for tailored tracing schemes that would benefit failure analysis.
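The buffering trade-off described above can be illustrated with a small sketch. It does not reproduce any of the six schemes in the study; the `BufferedTracer` class, its `capacity` parameter, and the use of a plain list as a stand-in for permanent storage are all assumptions made for the example.

```python
class BufferedTracer:
    """Hypothetical tracer that flushes to a sink once the buffer is full."""

    def __init__(self, capacity, sink):
        self.capacity = capacity  # events held in memory before a flush
        self.sink = sink          # list standing in for slower permanent storage
        self.buffer = []
        self.flushes = 0          # proxy for the overhead imposed on the program

    def trace(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink.extend(self.buffer)
            self.buffer.clear()
            self.flushes += 1

    def events_at_risk(self):
        # Events that would be lost if the program failed right now.
        return len(self.buffer)
```

A larger `capacity` lowers the number of flushes but raises the number of events that can be lost on a crash; a smaller one does the opposite.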
Static Analysis for Fast and Accurate Design Space Exploration of Caches
Application-specific system-on-chip platforms create the opportunity to customize the cache configuration for optimal performance with minimal chip estate. Simulation, in particular trace-driven simulation, is widely used to estimate cache hit rates. However, simulation is too slow to be deployed in design space exploration, especially when it involves hundreds of design points and huge traces or long program execution. In this paper, we propose a novel static analysis technique for rapid and accurate design space exploration of instruction caches. Given the program control flow graph (CFG) annotated only with basic block and control flow edge execution counts, our analysis estimates the hit rates for multiple cache configurations in one pass. We achieve this by modeling the cache states at each node of the CFG in a probabilistic manner and exploiting the structural similarities among related cache configurations. Experimental results indicate that our analysis is 24–3,855 times faster than the fastest known cache simulator while maintaining high accuracy (0.7% average error) in predicting hit rates for popular embedded benchmarks.

