Parallel Performance Prediction using Lost Cycles Analysis
- IN PROCEEDINGS OF SUPERCOMPUTING '94
, 1994
Cited by 62 (1 self)
Most performance debugging and tuning of parallel programs is based on the "measure-modify" approach, which is heavily dependent on detailed measurements of programs during execution. This approach is extremely time-consuming and does not lend itself to predicting performance under varying conditions. Analytic modeling and scalability analysis provide predictive power, but are not widely used in practice, due primarily to their emphasis on asymptotic behavior and the difficulty of developing accurate models that work for real-world programs. In this paper we describe a set of tools for performance tuning of parallel programs that bridges this gap between measurement and modeling. Our approach is based on lost cycles analysis, which involves measurement and modeling of all sources of overhead in a parallel program. We first describe a tool for measuring overheads in parallel programs that we have incorporated into the runtime environment for Fortran programs on the Kendall Square KSR1. We then describe a tool that fits these overhead measurements to analytic forms. We illustrate the use of these tools by analyzing the performance tradeoffs among parallel implementations of 2D FFT. These examples show how our tools enable programmers to develop accurate performance models of parallel applications without requiring extensive performance modeling expertise.
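As a rough illustration of what "fitting overhead measurements to analytic forms" can look like, the sketch below tries a few candidate forms against measured overhead and keeps the one with the smallest residual. The candidate set, the single scale coefficient, the function names, and the synthetic data are all assumptions made for this example; the abstract does not say which forms or fitting procedure the authors' tool uses.

```python
import math

# Hypothetical candidate forms for overhead as a function of processor count p.
# The abstract does not list the forms the actual tool considers.
CANDIDATE_FORMS = {
    "constant": lambda p: 1.0,
    "log": lambda p: math.log2(p),
    "linear": lambda p: float(p),
    "quadratic": lambda p: float(p * p),
}


def fit_scale(form, procs, overheads):
    """Return (c, rss): the least-squares scale c for overhead ~ c * form(p)
    and the residual sum of squares of that fit."""
    fx = [form(p) for p in procs]
    denom = sum(v * v for v in fx)
    c = sum(v * y for v, y in zip(fx, overheads)) / denom if denom else 0.0
    rss = sum((y - c * v) ** 2 for v, y in zip(fx, overheads))
    return c, rss


def best_form(procs, overheads):
    """Pick the candidate form with the smallest residual sum of squares."""
    fits = {name: fit_scale(f, procs, overheads)
            for name, f in CANDIDATE_FORMS.items()}
    name = min(fits, key=lambda n: fits[n][1])
    return name, fits[name][0]


# Synthetic measurements (not from the paper): overhead grows as 3 * log2(p).
procs = [2, 4, 8, 16, 32]
overheads = [3.0 * math.log2(p) for p in procs]
name, c = best_form(procs, overheads)
```

In practice one such fit would be made per overhead category, and the fitted forms summed to predict total overhead at processor counts that were never measured.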
Improving effective bandwidth through compiler enhancement of global cache reuse
- In Proceedings of International Parallel and Distributed Processing Symposium
, 2001
Cited by 62 (17 self)
While CPU speed has been improved by a factor of 6400 over the past twenty years, memory bandwidth has increased by a factor of only 139 during the same period. Consequently, on modern machines the limited data supply simply cannot keep a CPU busy, and applications often utilize only a few percent of peak CPU performance. The hardware solution, which provides layers of high-bandwidth data cache, is not effective for large and complex applications primarily for two reasons: far-separated data reuse and large-stride data access. The first repeats unnecessary transfer and the second communicates useless data. Both waste memory bandwidth. This dissertation pursues a software remedy. It investigates the potential for compiler optimizations to alter program behavior and reduce its memory bandwidth consumption. To this end, this research has studied a two-step transformation strategy: first fuse computations on the same data and then group data used by the same computation. Existing techniques such as loop blocking can be viewed as an application of this strategy within a single loop nest. In order to carry out this strategy
Using Regression Techniques to Predict Large Data Transfers
- International Journal of High Performance Computing Applications
, 2003
Multivariate Statistical Techniques for Parallel Performance Prediction
- IN PROC. 28TH HAWAII INT. CONF. ON SYSTEM SCIENCES, VOL. II, IEEE
, 1995
Cited by 23 (4 self)
Performance prediction can play an important role in improving the efficiency of multicomputers in executing scalable parallel applications. An accurate model of program execution time must include detailed algorithmic and architectural characterizations. The exact values for critical model parameters such as message latency and cache miss penalty can often be difficult to determine. This research uses multivariate data analysis to estimate the values of these coefficients in an analytical model. Representing the coefficients as random variables with a specified mean and variance improves the utility of a performance model. Confidence intervals for predicted execution time can be generated using the standard error values for model parameters. Improvements in the model can also be made by investigating the cause of large variance values for a particular architecture.
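The idea of estimating model coefficients together with their standard errors can be sketched with ordinary least squares on a one-variable model. This is a simplification: the paper uses multivariate analysis, and the model `time = latency + per_byte * size`, the function name, and the sample data are all assumptions made for this illustration.

```python
import math


def fit_line_with_errors(xs, ys):
    """Ordinary least squares for y = alpha + beta * x.

    Returns (alpha, beta, se_alpha, se_beta), where the standard errors are
    derived from the residual variance with n - 2 degrees of freedom.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    rss = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))
    s2 = rss / (n - 2)  # unbiased estimate of the residual variance
    se_beta = math.sqrt(s2 / sxx)
    se_alpha = math.sqrt(s2 * (1.0 / n + x_bar ** 2 / sxx))
    return alpha, beta, se_alpha, se_beta


# Made-up measurements: message size in bytes, transfer time in microseconds.
sizes = [0, 1000, 2000, 3000, 4000]
times = [51.0, 59.0, 70.0, 81.0, 89.0]
latency, per_byte, se_latency, se_per_byte = fit_line_with_errors(sizes, times)
```

A confidence interval for a coefficient follows by scaling its standard error with the appropriate Student t quantile for n - 2 degrees of freedom. A large standard error points to a parameter that the measurements constrain poorly.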
Interpreting the Performance of HPF/Fortran 90D
- In Proceedings of Supercomputing '94
, 1994
Cited by 14 (1 self)
In this paper we present a novel interpretive approach for accurate and cost-effective performance prediction in a high performance computing environment, and describe the design of a source-driven HPF/Fortran 90D performance prediction framework based on this approach. The performance prediction framework has been implemented as part of an HPF/Fortran 90D application development environment. A set of benchmarking kernels and application codes is used to validate the accuracy, utility, usability, and cost-effectiveness of the performance prediction framework. The use of the framework for selecting appropriate compiler directives and for application performance debugging is demonstrated. Keywords: Performance prediction, HPF/Fortran 90D application development, System & Application characterization. 1 Introduction Although currently available High Performance Computing (HPC) systems possess large computing capabilities, few existing applications are able to fully exploit this potenti...
Performance Prediction and Scheduling for Parallel Applications on Multi-User Clusters
, 1998
Performance Stability and Prediction
- In IEEE International Workshop on High Performance Computing (WHPC’94)
, 1994
Cited by 11 (0 self)
This paper presents experimental data from our research on stability of parallel programs and cross-machine performance prediction on multicomputers. We characterize program behavior by an execution graph, obtained from running an instrumented version of the program. We assess program stability using time perturbations, and analyze the resulting execution graphs with an approximation of a graph comparison metric based on subgraph isomorphism. On programs with stable behavior, we predict performance across different systems by transforming the observed execution trace; this trace transformation adjusts timestamps of events according to the architectural parameters of the systems under study, assuming the same partial event order on both systems. This technique allows us to predict performance also for future, hypothetical systems. We provide results showing that the technique is viable. 1 Introduction Performance stability is a key requirement for the widespread adoption of multicompute...
An adaptive performance modeling tool for GPU architectures
- In PPoPP
, 2010
Cited by 9 (1 self)
This paper presents an analytical model to predict the performance of general-purpose applications on a GPU architecture. The model is designed to provide performance information to an auto-tuning compiler and assist it in narrowing down the search to the more promising implementations. It can also be incorporated into a tool to help programmers better assess the performance bottlenecks in their code. We analyze each GPU kernel and identify how the kernel exercises major GPU microarchitecture features. To identify the performance bottlenecks accurately, we introduce an abstract interpretation of a GPU kernel, the work flow graph, based on which we estimate the execution time of a GPU kernel. We validated our performance model on NVIDIA GPUs using CUDA (Compute Unified Device Architecture). For this purpose, we used data parallel benchmarks that stress different GPU microarchitecture events such as uncoalesced memory accesses, scratch-pad memory bank conflicts, and control flow divergence, which must be accurately modeled but represent challenges to analytical performance models. The proposed model captures full system complexity and shows high accuracy in predicting the performance trends of different optimized kernel implementations. We also describe our approach to extracting the performance model automatically from a kernel code.
Performance Modeling for the Panda Array I/O Library
- In Proceedings of Supercomputing '96
, 1996
Cited by 8 (3 self)
We present an analytical performance model for Panda, a library for synchronized i/o of large multidimensional arrays on parallel and sequential platforms, and show how the Panda developers use this model to evaluate Panda's parallel i/o performance and guide future Panda development. The model validation shows that system developers can simplify performance analysis, identify potential performance bottlenecks, and study the design trade-offs for Panda on massively parallel platforms more easily than by conducting empirical experiments. More importantly, we show that the outputs of the performance model can be used to help make optimal plans for handling application i/o requests, the first step toward our long-term goal of automatically optimizing i/o request handling in Panda. This research was supported by an ARPA Fellowship in High Performance Computing administered by the Institute for Advanced Computer Studies, University of Maryland, by NSF under PYI grant IRI 89 58582, and by N...

