Exploiting hardware performance counters with flow and context sensitive profiling
- ACM Sigplan Notices
, 1997
Cited by 189 (9 self)
A program profile attributes run-time costs to portions of a program's execution. Most profiling systems suffer from two major deficiencies: first, they only apportion simple metrics, such as execution frequency or elapsed time, to static, syntactic units, such as procedures or statements; second, they aggressively reduce the volume of information collected and reported, although aggregation can hide striking differences in program behavior. This paper addresses both concerns by exploiting the hardware counters available in most modern processors and by incorporating two concepts from data flow analysis (flow and context sensitivity) to report more context for measurements. This paper extends our previous work on efficient path profiling to flow sensitive profiling, which associates hardware performance metrics with a path through a procedure. In addition, it describes a data structure, the calling context tree, that efficiently captures calling contexts for procedure-level measurements. Our measurements show that the SPEC95 benchmarks execute a small number (3-28) of hot paths that account for 9-98% of their L1 data cache misses. Moreover, these hot paths are concentrated in a few routines, which have complex dynamic behavior.
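The calling context tree named in this abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the class and method names (`CCTNode`, `CallingContextTree`, `enter`, `leave`, `charge`) are hypothetical, recursion is not treated specially, and the metric is a plain integer standing in for a hardware counter value.

```python
class CCTNode:
    """One calling context: a procedure reached through a specific chain of callers."""

    def __init__(self, procedure, parent=None):
        self.procedure = procedure
        self.parent = parent
        self.metric = 0      # stand-in for a hardware counter total
        self.children = {}   # callee name -> CCTNode

    def child(self, procedure):
        # Reuse the node when the same callee is reached from the same context.
        if procedure not in self.children:
            self.children[procedure] = CCTNode(procedure, parent=self)
        return self.children[procedure]


class CallingContextTree:
    def __init__(self):
        self.root = CCTNode("<root>")
        self.current = self.root

    def enter(self, procedure):
        self.current = self.current.child(procedure)

    def leave(self):
        if self.current.parent is not None:
            self.current = self.current.parent

    def charge(self, amount):
        # Attribute a measurement to the context that is currently executing.
        self.current.metric += amount

    def contexts(self, node=None, prefix=()):
        """Yield (call path, metric) for every context in the tree."""
        node = node or self.root
        for child in node.children.values():
            path = prefix + (child.procedure,)
            yield path, child.metric
            yield from self.contexts(child, path)


# Usage: the same procedure, alloc, is charged separately per calling context.
cct = CallingContextTree()
cct.enter("main")
cct.enter("parse"); cct.enter("alloc"); cct.charge(5); cct.leave(); cct.leave()
cct.enter("eval"); cct.enter("alloc"); cct.charge(40); cct.leave(); cct.leave()
profile = dict(cct.contexts())
```

A flat profile would report a single total of 45 for `alloc`; the tree keeps the two contexts apart, which is the kind of difference that aggregation hides.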
Whodunit: Transactional profiling for multi-tier applications
- In Proc. of the 2nd European Conference on Computer Systems (EuroSys’07)
, 2007
Cited by 15 (0 self)
This paper is concerned with performance debugging of multitier applications, such as commonly found in servers and dynamic-content web sites. Existing tools and techniques for profiling such applications are not general enough to track and profile transactions in a generic multi-tier application. We propose transactional profiling that provides a general solution to this problem. We provide novel algorithms and techniques to track and profile transactions that flow through shared memory, events, stages or via interprocess communication using messages. We also measure interference among concurrent transactions. We describe the design and implementation of Whodunit, our prototype transactional profiler. We demonstrate the correctness of our proposed algorithm for tracking transaction flow through shared memory using Apache and MySQL. Using Whodunit we are able to track and profile transactions that flow through shared memory, events, stages or via message passing, and measure the interference among concurrent transactions. We illustrate the use of Whodunit in obtaining the transactional profile of web servers, a web proxy cache and a bookstore application.
Effective Performance Measurement and Analysis of Multithreaded Applications
Cited by 8 (3 self)
Understanding why the performance of a multithreaded program does not improve linearly with the number of cores in a shared-memory node populated with one or more multicore processors is a problem of growing practical importance. This paper makes three contributions to performance analysis of multithreaded programs. First, we describe how to measure and attribute parallel idleness, namely, where threads are stalled and unable to work. This technique applies broadly to programming models ranging from explicit threading (e.g., Pthreads) to higher-level models such as Cilk and OpenMP. Second, we describe how to measure and attribute parallel overhead: when a thread is performing miscellaneous work other than executing the user’s computation. By employing a combination of compiler support and post-mortem analysis, we incur no measurement cost beyond normal profiling to glean this information. Using idleness and overhead metrics enables one to pinpoint areas of an application where concurrency should be increased (to reduce idleness), decreased (to reduce overhead), or where the present parallelization is hopeless (where idleness and overhead are both high). Third, we describe how to measure and attribute arbitrary performance metrics for high-level multithreaded programming models, such as Cilk. This requires bridging the gap between the expression of logical concurrency in programs and its realization at run-time as it is adaptively partitioned and scheduled onto a pool of threads. We have prototyped these ideas in the context of Rice University’s HPCToolkit performance tools. We describe our approach, implementation, and experiences applying this approach to measure and attribute work, idleness, and overhead in executions of Cilk programs.
Scalability Analysis of SPMD Codes Using Expectations
- ICS'07
, 2007
Cited by 5 (4 self)
We present a new technique for identifying scalability bottlenecks in executions of single-program, multiple-data (SPMD) parallel programs, quantifying their impact on performance, and associating this information with the program source code. Our performance analysis strategy involves three steps. First, we collect call path profiles for two or more executions on different numbers of processors. Second, we use our expectations about how the performance of executions should differ, e.g., linear speedup for strong scaling or constant execution time for weak scaling, to automatically compute the scalability of costs incurred at each point in a program’s execution. Third, with the aid of an interactive browser, an application developer can explore a program’s performance in a top-down fashion, see the contexts in which poor scaling behavior arises, and understand exactly how much each scalability bottleneck dilates execution time. Our analysis technique is independent of the parallel programming model. We describe our experiences applying our technique to analyze parallel programs written in Co-array Fortran and Unified Parallel C, as well as message-passing programs based on MPI.
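The second step described in this abstract, comparing measured costs against a scaling expectation, might look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's formulation: the function name `excess_cost` is hypothetical, costs are assumed to be per-process times keyed by calling context, and the excess is normalized by total time in the larger run.

```python
def excess_cost(costs_p, costs_q, p, q, scaling="strong"):
    """Per-context cost on q processors beyond what the expectation predicts.

    costs_p, costs_q: dicts mapping a calling context to per-process time
    measured on p and q processors (assumed input format).
    Returns each context's excess as a fraction of total time on q processors.
    """
    if scaling == "strong":
        factor = p / q   # linear speedup: time should shrink by p/q
    elif scaling == "weak":
        factor = 1.0     # constant execution time is expected
    else:
        raise ValueError("scaling must be 'strong' or 'weak'")

    total_q = sum(costs_q.values())
    excess = {}
    for context, measured in costs_q.items():
        expected = costs_p.get(context, 0.0) * factor
        excess[context] = (measured - expected) / total_q
    return excess


# Usage: going from 2 to 4 processors, solve scales ideally but exchange does not.
run_2 = {("main", "solve"): 80.0, ("main", "exchange"): 20.0}
run_4 = {("main", "solve"): 40.0, ("main", "exchange"): 20.0}
result = excess_cost(run_2, run_4, p=2, q=4, scaling="strong")
```

Here `exchange` was expected to take 10 time units on 4 processors but took 20, so it accounts for 10/60 of the larger run's time as scaling loss, while `solve` shows none.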
Binary Analysis for Measurement and Attribution of Program Performance
Cited by 3 (1 self)
Modern programs frequently employ sophisticated modular designs. As a result, performance problems cannot be identified from costs attributed to routines in isolation; understanding code performance requires information about a routine’s calling context. Existing performance tools fall short in this respect. Prior strategies for attributing context-sensitive performance at the source level either compromise measurement accuracy, remain too close to the binary, or require custom compilers. To understand the performance of fully optimized modular code, we developed two novel binary analysis techniques: 1) on-the-fly analysis of optimized machine code to enable minimally intrusive and accurate attribution of costs to dynamic calling contexts; and 2) post-mortem analysis of optimized machine code and its debugging sections to recover its program structure and reconstruct a mapping back to its source code. By combining the recovered static program structure with dynamic calling context information, we can accurately attribute performance metrics to calling contexts, procedures, loops, and inlined instances of procedures. We demonstrate that the fusion of this information provides unique insight into the performance of complex modular codes. This work is implemented in the HPCToolkit performance tools.
Accurate, Efficient, and Adaptive Calling Context Profiling
- PLDI '06
, 2006
Cited by 3 (1 self)
Calling context profiles are used in many inter-procedural code optimizations and in overall program understanding. Unfortunately, the collection of profile information is highly intrusive due to the high frequency of method calls in most applications. Previously proposed calling-context profiling mechanisms consequently suffer from either low accuracy, high overhead, or both. We have developed a new approach for building the calling context tree at runtime, called adaptive bursting. By selectively inhibiting redundant profiling, this approach dramatically reduces overhead while preserving profile accuracy. We first demonstrate the drawbacks of previously proposed calling context profiling mechanisms. We show that a low-overhead solution using sampled stack-walking alone is less than 50% accurate, based on degree of overlap with a complete calling-context tree. We also show that a static bursting approach collects a highly accurate profile, but causes an unacceptable application slowdown. Our adaptive solution achieves an 85% degree of overlap and provides 88% hot-edge coverage when using a 0.1 hot-edge threshold, while dramatically reducing overhead compared to the static bursting approach.
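The abstract reports accuracy as a degree of overlap with a complete calling-context tree but does not define the metric. The sketch below assumes one common definition, the sum over edges of the smaller of the two normalized edge weights; the function names and the edge-keyed dictionary format are hypothetical.

```python
def normalize(profile):
    """Convert raw edge counts to fractions of the profile's total weight."""
    total = sum(profile.values())
    if total == 0:
        return {}
    return {edge: count / total for edge, count in profile.items()}


def degree_of_overlap(sampled, complete):
    """Assumed definition: sum of min(normalized weight) over shared edges.

    Returns a value in [0, 1]; 1 means both profiles distribute weight
    identically over the same edges.
    """
    a = normalize(sampled)
    b = normalize(complete)
    return sum(min(weight, b.get(edge, 0.0)) for edge, weight in a.items())


# Usage: the sampled profile misses the edge f -> h and over-weights main -> f.
complete_profile = {("main", "f"): 60, ("main", "g"): 30, ("f", "h"): 10}
sampled_profile = {("main", "f"): 8, ("main", "g"): 2}
overlap = degree_of_overlap(sampled_profile, complete_profile)
```

In this example the overlap is min(0.8, 0.6) + min(0.2, 0.3), so a sampled profile can score well below 100% even when every edge it records is real.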
Analyzing Lock Contention in Multithreaded Applications
Cited by 2 (0 self)
Many programs exploit shared-memory parallelism using multithreading. Threaded codes typically use locks to coordinate access to shared data. In many cases, contention for locks reduces parallel efficiency and hurts scalability. Being able to quantify and attribute lock contention is important for understanding where a multithreaded program needs improvement. This paper proposes and evaluates three strategies for gaining insight into performance losses due to lock contention. First, we consider using a straightforward strategy based on call stack profiling to attribute idle time and show that it fails to yield insight into lock contention. Second, we consider an approach that builds on a strategy previously used for analyzing idleness in work-stealing computations; we show that this strategy does not yield insight into lock contention. Finally, we propose a new technique for measurement and analysis of lock contention that uses data associated with locks to blame lock holders for the idleness of spinning threads. Our approach incurs less than 5% overhead on a quantum chemistry application that makes extensive use of locking (65M distinct locks, a maximum of 340K live locks, and an average of 30K lock acquisitions per second per thread) and attributes lock contention to its full static and dynamic calling contexts. Our strategy is fully distributed and should scale well to systems with large core counts.
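The third strategy, using data associated with each lock to blame the lock holder for waiters' idleness, can be sketched as a small bookkeeping model. This is a single-threaded illustration of the idea only; the class `BlameTracker`, its methods, and the lock identifiers are hypothetical, and a real profiler would update these counters atomically from sampling handlers.

```python
from collections import defaultdict


class BlameTracker:
    def __init__(self):
        # lock id -> idle samples recorded by threads spinning on that lock
        self.pending = defaultdict(int)
        # calling context of a lock holder -> idle samples it was blamed for
        self.blame = defaultdict(int)

    def sample_waiter(self, lock_id, samples=1):
        """A profiling sample found a thread spinning on lock_id."""
        self.pending[lock_id] += samples

    def release(self, lock_id, holder_context):
        """At unlock, the holder's context accepts the idleness recorded so far."""
        waited = self.pending.pop(lock_id, 0)
        if waited:
            self.blame[holder_context] += waited


# Usage: three samples of spinning are charged to the context that held the lock.
tracker = BlameTracker()
for _ in range(3):
    tracker.sample_waiter("table_lock")
tracker.release("table_lock", ("main", "update_table"))
tracker.release("table_lock", ("main", "read_table"))  # nobody waited this time
```

Because each lock carries its own counter, no global view of all threads is needed, which is consistent with the abstract's claim that the strategy is fully distributed.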
Profile-guided specialization of an operating system kernel
- In Proc. Workshop on Binary Instrumentation and Applications
, 2006
Cited by 1 (1 self)
General-purpose operating systems such as Linux are increasingly replacing custom embedded counterparts on a wide variety of devices. Despite their convenience and flexibility, however, such operating systems may be overly general and thus incur unnecessary performance overheads in these contexts. This paper describes a new approach to mitigating these overheads by automatically specializing the OS kernel for particular execution environments. We use value profiling to identify targets for specialization such as frequent system call parameters. A novel profiling technique is used to identify frequently invoked procedure call sequences within the kernel. This information is used to sidestep the problems arising from indirect function calls when carrying out interprocedural compiler optimization. It drives a variety of compiler optimizations such as function inlining and code specialization that reduce the execution overheads along frequent paths. A prototype implementation that uses the PLTO binary rewriting system to specialize the Linux kernel is described. While overall performance data are mixed, the improvements we see argue for the potential of this approach.
A Survey of Tools for the Development and Maintenance of Programs
, 1994
This report surveys the tools required for the implementation and maintenance of software. The need to develop code of high quality means that the programmer must have a range of development and maintenance tools for the generation of code and its static and dynamic analysis. Individual tools as well as integrated environments supplying a range of coordinated tools are covered.
1 Introduction
This report provides an overview of the nature of various tools that are available to the programmer to implement and maintain programs. Such tools are classified into a taxonomy and examples of each type of tool are given. Due to the number of tools available, examples are restricted to the UNIX/C environment, except where a tool for another language contains facets that are not found in UNIX/C tools. This discussion is preceded by a short introduction to the software development process.
2 The Software Development Process...
Dominant Variance Characterization
There is a whole range of program analysis techniques that characterize different aspects of an application’s performance: hot-spots, distinct phases of behavior, code segments that could potentially run in parallel, etc. For a growing class of applications, there is a need to add another analysis technique to the repertoire that can characterize the locations and underlying causes of execution time variance in repetitive parts of the application. In this paper we introduce the notion of dominant variance analysis of an application. We illustrate the unique performance optimization benefits of performing such an analysis. We argue that traditional program analysis and profiling techniques are not sufficient to analyze the variant execution time behavior of the application. We introduce a new program representation called Variance Characterization Graph that is used both as the intermediate representation to enable the dominant variance analysis and as the final representation that provides concise and actionable information to programmers. We identify the unique challenges associated with characterizing the dominant behavior of an application and develop a methodology based on statistical pattern matching to efficiently recognize dominant patterns of behavior.

