Results 1 - 7 of 7
Generating Performance Bounds from Source Code
Abstract

Cited by 3 (0 self)
Understanding and tuning the performance of complex applications on modern hardware are challenging tasks, requiring understanding of the algorithms, implementation, compiler optimizations, and underlying architecture. Many tools exist for measuring and analyzing the runtime performance of applications. Obtaining sufficiently detailed performance data and comparing it with the peak performance of an architecture are one path to understanding the behavior of a particular algorithm implementation. A complementary approach relies on the analysis of the source code itself, coupling it with a simplified architecture description to arrive at performance estimates that can provide a more meaningful upper bound than the peak hardware performance. We present a tool for estimating upper performance bounds of C/C++ applications through static compiler analysis. It generates parameterized expressions for different types of memory accesses and integer and floating-point computations. We then incorporate architectural parameters to estimate upper bounds on the performance of an application on a particular system. We present validation results for several codes on two architectures.
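The core idea in this abstract — combining statically derived operation counts with a simplified machine description to obtain a bound tighter than raw peak — can be sketched roughly as follows. The counts, machine numbers, and the linear cost model are illustrative assumptions for exposition, not the tool's actual analysis.

```python
# A minimal, hedged sketch of a static performance bound: a kernel can run
# no faster than its compute peak allows, nor faster than its memory
# traffic allows. All parameters below are hypothetical.

def perf_bound(flops, bytes_moved, peak_flops, peak_bw):
    """Upper bound on achievable FLOP/s for a kernel with the given
    statically counted work (flops) and memory traffic (bytes_moved)."""
    compute_time = flops / peak_flops      # seconds if compute-bound
    memory_time = bytes_moved / peak_bw    # seconds if bandwidth-bound
    return flops / max(compute_time, memory_time)

# Example: a streaming kernel doing 2*N flops and moving 24*N bytes
# (three 8-byte doubles per iteration), on an assumed 8 GFLOP/s,
# 16 GB/s machine -- the bound lands well below compute peak.
N = 1_000_000
bound = perf_bound(2 * N, 24 * N, peak_flops=8e9, peak_bw=16e9)
```

For memory-bound kernels like this one, the bound reduces to `flops * peak_bw / bytes_moved`, which is the arithmetic intensity argument the abstract's "more meaningful upper bound" refers to.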
Memory Hierarchy Optimizations and Performance Bounds for Sparse AᵀAx
Abstract
This paper presents uniprocessor performance optimizations, automatic tuning techniques, and an experimental analysis of the sparse matrix operation, y = AᵀAx, where A is a sparse matrix and x, y are dense vectors. We describe an implementation of this computational kernel which brings A through the memory hierarchy only once, and which can be combined naturally with the register blocking optimization previously proposed in the Sparsity tuning system for sparse matrix-vector multiply. We evaluate these optimizations on a benchmark set of 44 matrices and 4 platforms, showing speedups of up to 4.2×. We also develop platform-specific upper bounds on the performance of these implementations. We analyze how closely we can approach these bounds, and show when low-level tuning techniques (e.g., better instruction scheduling) are likely to yield a significant payoff. Finally, we propose a hybrid offline/runtime heuristic which in practice automatically selects near-optimal values of the key tuning parameters, the register block sizes.
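The "brings A through the memory hierarchy only once" idea can be illustrated with a small sketch: instead of computing t = Ax and then y = Aᵀt in two sweeps over A, each sparse row is used twice while it is still cache-resident. The CSR layout below is standard; the 3×3 example matrix is made up for illustration and is not from the paper.

```python
# Single-pass y = A^T (A x) over a CSR matrix (indptr/indices/data).
# For each row a_i of A:  t_i = a_i . x ;  y += t_i * a_i
# so the row is read once and reused immediately.

def atax_single_pass(indptr, indices, data, x, n_cols):
    """Compute y = A^T A x, streaming each row of A through cache once."""
    y = [0.0] * n_cols
    n_rows = len(indptr) - 1
    for i in range(n_rows):
        start, end = indptr[i], indptr[i + 1]
        t = 0.0
        for k in range(start, end):          # t_i = dot(a_i, x)
            t += data[k] * x[indices[k]]
        for k in range(start, end):          # y += t_i * a_i (row reused)
            y[indices[k]] += t * data[k]
    return y

# Hypothetical example: A = [[1,0,2],[0,3,0],[4,0,5]] in CSR form
y = atax_single_pass([0, 2, 3, 5], [0, 2, 1, 0, 2],
                     [1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0], 3)
```

A two-pass version would traverse A's nonzeros twice; for matrices larger than cache, that doubles the memory traffic, which is exactly what the paper's single-pass kernel avoids.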
Performance Modeling for 3D Visualization in a Heterogeneous Computing Environment
Abstract
The visualization of large, remotely located data sets necessitates the development of a distributed computing pipeline in order to reduce the data, in stages, to a manageable size. The required baseline infrastructure for launching such a distributed pipeline is becoming available, but few services support even marginally optimal resource selection and partitioning of the data analysis workflow. We explore a methodology for building a model of overall application performance using a composition of the analytic models of individual components that comprise the pipeline. The analytic models are shown to be accurate on a testbed of distributed heterogeneous systems. The prediction methodology will form the foundation of a more robust resource management service for future Grid-based visualization applications.
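Composing per-component analytic models into a pipeline-level prediction, as this abstract describes, can be sketched in a few lines. The linear per-stage cost model and the stage parameters below are illustrative assumptions, not values from the paper.

```python
# Hedged sketch: a serial visualization pipeline where each stage has a
# simple analytic model (fixed cost + per-byte cost) and emits a reduced
# fraction of its input data to the next stage.

def stage_time(a, b, size):
    """Linear analytic model for one component: a + b * data_size."""
    return a + b * size

def pipeline_time(stages, initial_size):
    """Compose stage models into an end-to-end time estimate.
    stages: list of (fixed_cost_s, per_byte_cost_s, output_fraction)."""
    total, size = 0.0, initial_size
    for a, b, reduction in stages:
        total += stage_time(a, b, size)
        size *= reduction   # downstream stages see the reduced data
    return total

# Hypothetical read -> extract -> render pipeline over 1 GB of input
stages = [(0.5, 1e-9, 0.1), (0.2, 5e-9, 0.05), (0.1, 2e-8, 1.0)]
estimate = pipeline_time(stages, 1e9)
```

In a heterogeneous setting, a resource selector could evaluate such a composed model once per candidate placement and pick the assignment with the smallest estimate, which is the kind of service the abstract anticipates.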
Automating the Generation of Composed Linear Algebra Kernels
Abstract
Memory bandwidth limits the performance of important kernels in many scientific applications. Such applications often use sequences of Basic Linear Algebra Subprograms (BLAS), and highly efficient implementations of those routines enable scientists to achieve high performance at little cost. However, because the BLAS are tuned in isolation, they do not take advantage of opportunities for memory optimization that result from composing multiple subprograms. Because it is not practical to create a library of all BLAS combinations, we have developed a domain-specific compiler that generates them on demand. In this paper, we describe the novel algorithm underlying the compiler that searches for the best combination of optimization choices, and we present a new hybrid analytic/empirical method for quickly evaluating the profitability of each optimization. We report experimental results showing speedups of 5% to 140% relative to GotoBLAS and vendor-tuned BLAS on the Intel Core 2 and the AMD Opteron.
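The memory optimization such a compiler exploits is loop fusion across BLAS call boundaries. The sketch below shows the flavor of it on a deliberately tiny pair of kernels (an AXPY feeding a dot product); these specific kernels are an assumption for illustration, not an example from the paper.

```python
# Unfused: two BLAS-style calls, the intermediate vector r is written to
# memory by the first loop and read back by the second.

def axpy_then_dot_unfused(alpha, x, y):
    r = [alpha * xi + yi for xi, yi in zip(x, y)]   # AXPY: r = alpha*x + y
    s = sum(ri * ri for ri in r)                     # DOT:  s = r . r
    return r, s

# Fused: one loop; each r_i is consumed the moment it is produced, so r
# streams through the memory hierarchy once instead of twice.

def axpy_then_dot_fused(alpha, x, y):
    r, s = [], 0.0
    for xi, yi in zip(x, y):
        ri = alpha * xi + yi
        r.append(ri)
        s += ri * ri
    return r, s
```

Tuned BLAS libraries cannot perform this fusion because each routine is compiled in isolation; generating the composed kernel on demand, as the abstract describes, recovers the lost bandwidth.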
Generating Empirically Optimized Numerical Software from MATLAB Prototypes
2008
Abstract
The growing demand for higher levels of detail and accuracy in results means that the size and complexity of scientific computations is increasing at least as fast as the improvements in processor technology. Programming scientific applications is hard, and optimizing them for high performance is even harder. The development of optimized codes requires extensive knowledge, not only of the costs of floating-point arithmetic but also of memory access issues and compiler optimizations. Experiments show that the complexity of this hardware-software system means that performance is difficult to predict fully. Therefore, computational scientists are often forced to choose between investing too much time in tuning code and accepting performance that is significantly lower than the best achievable performance on a given architecture. In this paper, we describe the first steps toward a fully automated system for the optimization of the matrix algebra kernels that are a foundational part of many scientific applications. To generate highly optimized code from a high-level MATLAB prototype, we define a three-step approach. To begin, we have developed a compiler that converts a MATLAB script into simple C code. We then use the polyhedral optimization system PLuTo to optimize that code for coarse-grained parallelism and locality simultaneously. Finally, we annotate the resulting code with performance tuning directives and ...