Results 1 - 10
of
15
The Tau Parallel Performance System
- The International Journal of High Performance Computing Applications
, 2006
"... The ability of performance technology to keep pace with the growing complexity of parallel and distributed systems depends on robust performance frameworks that can at once provide system-specific performance capabilities and support high-level performance problem solving. Flexibility and portabilit ..."
Abstract
-
Cited by 97 (14 self)
- Add to MetaCart
The ability of performance technology to keep pace with the growing complexity of parallel and distributed systems depends on robust performance frameworks that can at once provide system-specific performance capabilities and support high-level performance problem solving. Flexibility and portability in empirical methods and processes are influenced primarily by the strategies available for instrumentation and measurement, and how effectively they are integrated and composed. This paper presents the TAU (Tuning and Analysis Utilities) parallel performance system and describe how it addresses diverse requirements for performance observation and analysis.
Vertical Profiling: Understanding the Behavior of Object-Oriented Applications
"... Object-oriented programming languages provide a rich set of features that provide significant software engineering benefits. The increased productivity provided by these features comes at a justifiable cost in a more sophisticated runtime system whose responsibility is to implement these features e# ..."
Abstract
-
Cited by 47 (14 self)
- Add to MetaCart
Object-oriented programming languages provide a rich set of features that provide significant software engineering benefits. The increased productivity provided by these features comes at a justifiable cost in a more sophisticated runtime system whose responsibility is to implement these features e#ciently. However, the virtualization introduced by this sophistication provides a significant challenge to understanding complete system performance, not found in traditionally compiled languages, such as C or C++. Thus, understanding system performance of such a system requires profiling that spans all levels of the execution stack, such as the hardware, operating system, virtual machine, and application.
Scalable parallel trace-based performance analysis
- In Proc. 13th European PVM/MPI Conference
, 2006
"... Abstract. Automatic trace analysis is an effective method for identifying complex performance phenomena in parallel applications. However, as the size of parallel systems and the number of processors used by individual applications is continuously raised, the traditional approach of analyzing a sing ..."
Abstract
-
Cited by 38 (22 self)
- Add to MetaCart
Abstract. Automatic trace analysis is an effective method for identifying complex performance phenomena in parallel applications. However, as the size of parallel systems and the number of processors used by individual applications is continuously raised, the traditional approach of analyzing a single global trace file, as done by KOJAK’s EXPERT trace analyzer, becomes increasingly constrained by the large number of events. In this article, we present a scalable version of the EXPERT analysis based on analyzing separate local trace files with a parallel tool which ‘replays ’ the target application’s communication behavior. We describe the new parallel analyzer architecture and discuss first empirical results. 1
Using Hardware Performance Monitors to Understand the Behavior of Java Applications
- IN PROC. OF THE THIRD USENIX VIRTUAL MACHINE RESEARCH AND TECHNOLOGY SYMP
, 2004
"... Modern Java programs, such as middleware and application servers, include many complex software components. Improving the performance of these Java applications requires a better understanding of the interactions between the application, virtual machine, operating system, and architecture. Hardware ..."
Abstract
-
Cited by 33 (8 self)
- Add to MetaCart
Modern Java programs, such as middleware and application servers, include many complex software components. Improving the performance of these Java applications requires a better understanding of the interactions between the application, virtual machine, operating system, and architecture. Hardware performance monitors, which are available on most modern processors, provide facilities to obtain detailed performance measurements of long-running applications in real time. However, interpreting the data collected using hardware performance monitors is difficult because of the low-level nature of the data. We have
Automating Vertical Profiling
, 2005
"... Last year at OOPSLA we presented a methodology, vertical profiling, for understanding the performance of objectoriented programs. The key insight behind this methodology is that modern programs run on top of many layers (virtual machine, middleware, etc) and thus we need to collect and combine infor ..."
Abstract
-
Cited by 16 (4 self)
- Add to MetaCart
Last year at OOPSLA we presented a methodology, vertical profiling, for understanding the performance of objectoriented programs. The key insight behind this methodology is that modern programs run on top of many layers (virtual machine, middleware, etc) and thus we need to collect and combine information from all layers in order to understand system performance. Although our methodology was able to explain previously unexplained performance phenomena, it was extremely labor intensive. In this paper we describe and evaluate techniques for automating two significant activities of vertical profiling: trace alignment and correlation. Trace alignment aligns traces obtained from separate runs so that one can reason across the traces. We are not aware of any prior approach that effectively and automatically aligns traces. Correlation sifts through hundreds of metrics to find ones that have a bearing on a performance anomaly of interest. In prior work we found that statistical correlation was only sometimes effective. We have identified highly-effective approaches for both activities. For aligning traces we explore dynamic time warping, and for correlation we explore eight correlators based on statistical correlation, distance measures, and piecewise linear segmentation. Although we explore these activities in the context of vertical profiling, both activities are widely applicable in the performance analysis area.
Large event traces in parallel performance analysis
- In 8th Workshop Parallel Systems and Algorithms (PASA), Lecture Notes in Informatics, Frankfurt/Main
, 2006
"... A powerful and widely-used method for analyzing the performance behavior of parallel programs is event tracing. When an application is traced, performancerelevant events, such as entering functions or sending messages, are recorded at runtime and analyzed post-mortem to identify and potentially remo ..."
Abstract
-
Cited by 7 (6 self)
- Add to MetaCart
A powerful and widely-used method for analyzing the performance behavior of parallel programs is event tracing. When an application is traced, performancerelevant events, such as entering functions or sending messages, are recorded at runtime and analyzed post-mortem to identify and potentially remove performance problems. While event tracing enables the detection of performance problems at a high level of detail, growing trace-file size often constrains its scalability on large-scale systems and complicates management, analysis, and visualization of trace data. In this article, we survey current approaches to handle large traces and classify them according to the primary issues they address and the primary benefits they offer.
A Parallel Trace-Data Interface for Scalable Performance Analysis
"... Abstract Automatic trace analysis is an effective method of identifying complex performance phenomena in parallel applications. To simplify the development of complex trace-analysis algorithms, the EARL library interface offers high-level access to individual events contained in a global trace file. ..."
Abstract
-
Cited by 6 (6 self)
- Add to MetaCart
Abstract Automatic trace analysis is an effective method of identifying complex performance phenomena in parallel applications. To simplify the development of complex trace-analysis algorithms, the EARL library interface offers high-level access to individual events contained in a global trace file. However, as the size of parallel systems grows further and the number of processors used by individual applications is continuously raised, the traditional approach of analyzing a single global trace file becomes increasingly constrained by the large number of events. To enable scalable trace analysis, we present a new design of the aforementioned EARL interface that accesses multiple local trace files in parallel while offering means to conveniently exchange events between processes. This article describes the modified view of the trace data as well as related programming abstractions provided by the new PEARL library interface and discusses its application in performance analysis. 1
An Efficient Format for Nearly Constant-Time Access to Arbitrary Time Intervals in Large Trace
, 2007
"... A powerful method to aid in understanding the performance of parallel applications uses log or trace files containing time-stamped events and states (pairs of events). These trace files can be very large, often hundreds or even thousands of megabytes. Because of the cost of accessing and displaying ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
A powerful method to aid in understanding the performance of parallel applications uses log or trace files containing time-stamped events and states (pairs of events). These trace files can be very large, often hundreds or even thousands of megabytes. Because of the cost of accessing and displaying such files, other methods are often used that reduce the size of the tracefiles at the cost of sacrificing detail or other information. This paper describes a hierarchical trace file format that provides for display of an arbitrary time window in a time independent of the total size of the file and roughly proportional to the number of events within the time window. This format eliminates the need to sacrifice data to achieve a smaller trace file size (since storage is inexpensive, it is necessary only to make efficient use of bandwidth to that storage). The format can be used to organize a trace file or to create a separate file of annotations that may be used with conventional trace files. We present an analysis of the time to access all of the events relevant to an interval of time and we describe experiments demonstrating the performance of this file format. 1
Portable High Performance and Scalability for Partitioned Global Address Space Languages
, 2007
"... Large scale parallel simulations are fundamental tools for engineers and scientists. Con-sequently, it is critical to develop both programming models and tools that enhance devel-opment time productivity, enable harnessing of massively-parallel systems, and to guide the diagnosis of poorly scaling p ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Large scale parallel simulations are fundamental tools for engineers and scientists. Con-sequently, it is critical to develop both programming models and tools that enhance devel-opment time productivity, enable harnessing of massively-parallel systems, and to guide the diagnosis of poorly scaling programs. This thesis addresses this challenge in two ways. First, we show that Co-array Fortran (CAF), a shared-memory parallel program-ming model, can be used to write scientific codes that exhibit high performance on modern parallel systems. Second, we describe a novel technique for analyzing parallel program performance and identifying scalability bottlenecks, and apply it across multiple program-ming models. Although the message passing parallel programming model provides both portability and high performance, it is cumbersome to program. CAF eases this burden by providing a partitioned global address space, but has before now only been implemented on shared-memory machines. To significantly broaden CAF’s appeal, we show that CAF programs can deliver high-performance on commodity cluster platforms. We designed and imple-

