Results 1–10 of 19
OSKI: A library of automatically tuned sparse matrix kernels
Institute of Physics Publishing, 2005
Statistical models for empirical search-based performance tuning
International Journal of High Performance Computing Applications, 2004
Cited by 35 (2 self)

Abstract
Achieving peak performance from the computational kernels that dominate application performance often requires extensive machine-dependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate by (1) generating a large number of possible, reasonable implementations of a kernel, and (2) selecting the fastest implementation by a combination of heuristic modeling, heuristic pruning, and empirical search (i.e., actually running the code). This paper presents quantitative data that motivates the development of such a search-based system, using dense matrix multiply as a case study. The statistical distributions of performance within spaces of reasonable implementations, when observed on a variety of hardware platforms, lead us to pose and address two general problems which arise during the search process. First, we develop a heuristic for stopping an exhaustive compile-time search early if a near-optimal implementation is found. Second, we show how to construct…
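The search setup the abstract describes can be caricatured in a few lines of Python: benchmark candidate implementations one by one and stop early once improvement stalls. The fixed `patience` threshold here is a stand-in for the paper's statistical stopping criterion, not the heuristic it actually develops.

```python
def empirical_search(candidates, benchmark, patience=20):
    """Time each candidate implementation and keep the fastest,
    stopping early after `patience` consecutive non-improving trials.
    `benchmark` returns a (lower-is-better) running time for a candidate."""
    best, best_time = None, float("inf")
    since_improvement = 0
    for cand in candidates:
        t = benchmark(cand)
        if t < best_time:
            best, best_time = cand, t
            since_improvement = 0
        else:
            since_improvement += 1
            if since_improvement >= patience:
                break  # near-optimal found; abandon the exhaustive search
    return best, best_time
```

In a real tuner the candidates would be generated kernel variants and `benchmark` an actual timing run; here any cost function illustrates the control flow.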
Synthesizing objects
In: Proceedings of the 13th European Conference on Object-Oriented Programming, 1999
Cited by 27 (3 self)

Abstract
This paper argues that current OO technology does not support reuse and configurability in an effective way. This problem can be addressed by augmenting OO analysis and design with feature modeling and by applying generative implementation techniques. Feature modeling allows capturing the variability of domain concepts. Concrete concept instances can then be synthesized from abstract specifications. Using a simple example of a configurable list component, we demonstrate the application of feature modeling and how to implement a feature model as a generator. We introduce the concepts of configuration repositories and configuration generators and show how to implement them using object-oriented, generic, and generative language mechanisms. The configuration generator utilizes C++ template metaprogramming, which enables its execution at compile time.
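The generator idea can be sketched in Python as a runtime analogue of the paper's compile-time C++ template-metaprogramming generator: a feature specification is mapped to a synthesized concrete class. The feature names (`ownership`, `tracing`) are illustrative choices for a configurable list, not the paper's actual feature model.

```python
def make_list_class(spec):
    """Synthesize a concrete list class from an abstract feature
    specification (a dict of feature choices). A runtime sketch of a
    configuration generator; the paper does this at compile time in C++."""
    trace = []

    class ConfiguredList:
        def __init__(self):
            self._items = []

        def add(self, item):
            if spec.get("tracing"):
                trace.append(repr(item))  # optional tracing feature
            # 'copy' ownership stores a defensive copy of mutable items.
            if spec.get("ownership") == "copy" and isinstance(item, list):
                item = list(item)
            self._items.append(item)

        def items(self):
            return list(self._items)

    ConfiguredList.trace = trace
    return ConfiguredList
```

Each distinct specification yields a distinct concrete class, so the combinatorial space of configurations never has to be written out by hand.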
Annotation-Based Empirical Performance Tuning Using Orio
Cited by 25 (6 self)

Abstract
In many scientific applications, significant time is spent tuning codes for a particular high-performance architecture. Tuning approaches range from the relatively non-intrusive (e.g., using compiler options) to extensive code modifications that attempt to exploit specific architecture features. Intrusive techniques often result in code changes that are not easily reversible, which can negatively impact readability, maintainability, and performance on different architectures. We introduce an extensible annotation-based empirical tuning system called Orio, aimed at improving both performance and productivity by enabling software developers to insert annotations, in the form of structured comments in their source code, that trigger a number of low-level performance optimizations on a specified code fragment. To maximize the performance tuning opportunities, we have designed the annotation processing infrastructure to support both architecture-independent and architecture-specific code optimizations. Given the annotated code as input, Orio generates many tuned versions of the same operation and empirically evaluates them to select the best-performing one for production use. We have also enabled the use of the PLuTo automatic parallelization tool in conjunction with Orio to generate efficient OpenMP-based parallel code. We describe experimental results involving a number of computational kernels, including dense array and sparse matrix operations.
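The generate-and-evaluate loop that the abstract describes can be sketched in Python: generate several variants of a loop (here, unrolled by different factors), time each, and keep the fastest. The code-generation scheme and the choice of unroll factors are illustrative stand-ins, not Orio's actual annotation grammar or code generator; parsing the structured comment that would request this is elided.

```python
import time

def make_unrolled_sum(factor):
    """Generate the source of a sum loop unrolled by `factor` and
    compile it with exec() -- a toy stand-in for annotation-driven
    code generation."""
    body = "\n        ".join(f"acc += xs[i + {k}]" for k in range(factor))
    src = f"""
def unrolled_sum(xs):
    acc = 0.0
    n = len(xs) - len(xs) % {factor}
    for i in range(0, n, {factor}):
        {body}
    for i in range(n, len(xs)):   # remainder loop
        acc += xs[i]
    return acc
"""
    ns = {}
    exec(src, ns)
    return ns["unrolled_sum"]

def tune(xs, factors=(1, 2, 4, 8)):
    """Empirically evaluate each generated variant and keep the fastest
    (a single timing run per variant; real tuners repeat and take the best)."""
    best, best_t = None, float("inf")
    for f in factors:
        fn = make_unrolled_sum(f)
        t0 = time.perf_counter()
        fn(xs)
        t = time.perf_counter() - t0
        if t < best_t:
            best, best_t = fn, t
    return best
```

All variants compute the same result; only the selected one would be used in production, which is the essential contract of an empirical tuner.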
The Matrix Template Library: A Unifying Framework for Numerical Linear Algebra
In Parallel Object Oriented Scientific Computing, ECOOP, 1998
Cited by 11 (2 self)

Abstract
We present a unified approach for expressing high performance numerical linear algebra routines for a class of dense and sparse matrix formats and shapes. As with the Standard Template Library [7], we explicitly separate algorithms from data structures through the use of generic programming techniques. We conclude that such an approach does not hinder high performance. On the contrary, writing portable high performance codes is actually enabled with such an approach, because the performance-critical code sections can be isolated from the algorithms and the data structures.

1 Introduction

The traditional approach to writing basic linear algebra routines is a combinatorial affair. There are typically four precision types that need to be handled (single and double precision real, single and double precision complex), several dense storage types (general, banded, packed), a multitude of sparse storage types (the Sparse BLAS Standard Proposal includes 13 [1]), as well as row and co…
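The separation of algorithms from data structures that the abstract describes can be sketched in Python, standing in for MTL's C++ generic programming: one format-agnostic algorithm consumes any storage format through a shared iteration protocol. The `(row, col, value)` protocol and the iterator names are illustrative, not MTL's actual interface.

```python
def generic_nnz_sum(matrix_iter):
    """Sum all stored entries of a matrix. Written once against an
    iteration protocol, so it works for any storage format whose
    iterator yields (row, col, value) triples."""
    return sum(v for _, _, v in matrix_iter)

def dense_iter(A):
    """Adapter: iterate nonzeros of a dense row-major list of lists."""
    for i, row in enumerate(A):
        for j, v in enumerate(row):
            if v != 0:
                yield (i, j, v)

def csr_iter(vals, cols, ptr):
    """Adapter: iterate nonzeros of a compressed-sparse-row matrix."""
    for i in range(len(ptr) - 1):
        for k in range(ptr[i], ptr[i + 1]):
            yield (i, cols[k], vals[k])
```

Adding a new storage format means writing one adapter, not rewriting every algorithm, which is the combinatorial saving the abstract argues for.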
Automatic Performance Tuning and Analysis of Sparse Triangular Solve
In ICS 2002: Workshop on Performance Optimization via High-Level Languages and Libraries, 2002
Cited by 9 (5 self)

Abstract
In this paper, we consider the solution of the sparse lower triangular system Lx = y for a single dense vector x, given the lower triangular sparse matrix L and dense vector y.
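A minimal Python sketch of the baseline kernel: forward substitution for Lx = y with L in compressed sparse row (CSR) form. The assumption that each row's last stored entry is the diagonal is ours for the sketch, and this untuned loop omits the blocking and tuning the paper studies.

```python
def sparse_lower_trisolve(L_vals, L_cols, L_ptr, y):
    """Solve L x = y by forward substitution, with the lower triangular
    sparse matrix L in CSR form (vals, cols, row pointers).
    Assumes each row's last stored entry is the diagonal (nonzero)."""
    n = len(y)
    x = [0.0] * n
    for i in range(n):
        start, end = L_ptr[i], L_ptr[i + 1]
        s = y[i]
        for k in range(start, end - 1):      # off-diagonal entries of row i
            s -= L_vals[k] * x[L_cols[k]]
        x[i] = s / L_vals[end - 1]           # divide by the diagonal
    return x
```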
Memory Hierarchy Optimizations and Performance Bounds for Sparse A^T Ax
In ICCS 2003: Workshop on Parallel Linear Algebra, 2003
Cited by 7 (1 self)

Abstract
This paper presents uniprocessor performance optimizations, automatic tuning techniques, and an experimental analysis of the sparse matrix operation y = A^T Ax, where A is a sparse matrix and x, y are dense vectors. We describe an implementation of this computational kernel which brings A through the memory hierarchy only once, and which can be combined naturally with the register blocking optimization previously proposed in the Sparsity tuning system for sparse matrix-vector multiply. We evaluate these optimizations on a benchmark set of 44 matrices and 4 platforms, showing speedups of up to 4.2×. We also develop platform-specific upper bounds on the performance of these implementations.
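The "one pass over A" idea follows from writing A^T Ax as a sum over the rows a_i of A: A^T Ax = Σ_i a_i^T (a_i · x), so each row can serve both the dot product and the accumulation while it is still in cache. A hedged, untuned Python illustration (no register blocking), with A in CSR form:

```python
def ata_x(vals, cols, ptr, x, n_cols):
    """Compute y = A^T (A x) in a single pass over the CSR matrix A:
    for each row a_i, form the scalar t = a_i . x, then accumulate
    y += t * a_i, so each stored row of A is read only once."""
    y = [0.0] * n_cols
    n_rows = len(ptr) - 1
    for i in range(n_rows):
        t = 0.0
        for k in range(ptr[i], ptr[i + 1]):   # t = a_i . x
            t += vals[k] * x[cols[k]]
        for k in range(ptr[i], ptr[i + 1]):   # y += t * a_i
            y[cols[k]] += t * vals[k]
    return y
```

The naive alternative computes z = Ax fully, then y = A^T z, streaming A through the memory hierarchy twice.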
New Data Structures for Matrices and Specialized Inner Kernels: Low Overhead For High Performance
2007
Cited by 6 (2 self)

Abstract
Dense linear algebra codes are often expressed and coded in terms of BLAS calls. This approach, however, achieves suboptimal performance due to the overheads associated with such calls. Taking as an example the dense Cholesky factorization of a symmetric positive definite matrix, we show that the potential of non-canonical data structures for dense linear algebra can be better exploited with the use of specialized inner kernels. The use of non-canonical data structures together with specialized inner kernels has low overhead and can produce excellent performance.
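The pairing of a non-canonical layout with a small fixed-size inner kernel can be illustrated in Python, purely as a sketch: the paper works with tuned native code and Cholesky factorization, whereas this toy uses matrix multiply and a hypothetical block size. The point is only the structure: repack into contiguous square blocks, then let a specialized BS×BS kernel stream each block without large strides.

```python
BS = 2  # illustrative block size

def to_blocked(A):
    """Repack a row-major n x n matrix (n a multiple of BS) into
    contiguous BS x BS blocks -- a non-canonical storage layout."""
    n = len(A)
    return {
        (bi, bj): [A[bi + i][bj + j] for i in range(BS) for j in range(BS)]
        for bi in range(0, n, BS)
        for bj in range(0, n, BS)
    }

def blocked_matmul(Ab, Bb, n):
    """Multiply two blocked matrices using a fixed-size inner kernel."""
    Cb = {(bi, bj): [0.0] * (BS * BS)
          for bi in range(0, n, BS) for bj in range(0, n, BS)}
    for bi in range(0, n, BS):
        for bj in range(0, n, BS):
            for bk in range(0, n, BS):
                a, b, c = Ab[(bi, bk)], Bb[(bk, bj)], Cb[(bi, bj)]
                for i in range(BS):          # the specialized BS x BS kernel:
                    for j in range(BS):      # fixed trip counts, unit stride
                        s = c[i * BS + j]
                        for k in range(BS):
                            s += a[i * BS + k] * b[k * BS + j]
                        c[i * BS + j] = s
    return Cb
```

In native code the fixed trip counts let the compiler fully unroll and register-allocate the kernel, which is where the "low overhead" of the title comes from.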
DESOLA: An Active Linear Algebra Library Using Delayed Evaluation and Runtime Code Generation
Cited by 2 (1 self)

Abstract
Active libraries can be defined as libraries that play an active part in the compilation, and in particular the optimisation, of their client code. This paper explores the implementation of an active dense linear algebra library by delaying the evaluation of expressions built using library calls, then generating code at runtime for the compositions that occur. The key optimisations in this context are loop fusion and array contraction. Our prototype C++ implementation, DESOLA, automatically fuses loops arising from different client calls, identifies unnecessary intermediate temporaries, and contracts temporary arrays to scalars. Performance is evaluated using a benchmark suite of linear solvers from the ITL (Iterative Template Library) and compared with the MTL (Matrix Template Library), ATLAS (Automatically Tuned Linear Algebra Software), and the IMKL (Intel Math Kernel Library). Excluding runtime compilation overheads (caching means they occur only on the first iteration), performance for larger matrix sizes matches or exceeds MTL; when fusion of matrix operations occurs, performance exceeds that of ATLAS and IMKL. Key words: runtime code generation, delayed evaluation, active libraries, numerical libraries
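The delayed-evaluation and fusion idea can be sketched in a few lines of Python, as a toy analogue of what DESOLA does with runtime C++ code generation: arithmetic on vectors builds an expression tree instead of computing, and forcing the result evaluates the whole tree in one fused loop with no intermediate temporaries. The class and method names are illustrative, not DESOLA's API.

```python
class Vec:
    """A tiny delayed-evaluation vector: '+' and '*' build an expression
    tree; force() walks it element by element in a single fused loop."""

    def __init__(self, data=None, op=None, args=None):
        self.data, self.op, self.args = data, op, args

    def __add__(self, other):
        return Vec(op="+", args=(self, other))   # record, don't compute

    def __mul__(self, other):
        return Vec(op="*", args=(self, other))   # record, don't compute

    def _at(self, i):
        """Evaluate element i of the whole expression tree."""
        if self.data is not None:                # leaf: stored vector
            return self.data[i]
        a, b = self.args
        if self.op == "+":
            return a._at(i) + b._at(i)
        return a._at(i) * b._at(i)

    def force(self, n):
        """One fused loop over all n elements; no temporary arrays
        are materialized for the intermediate sub-expressions."""
        return [self._at(i) for i in range(n)]
```

An eager library would materialize `a + b` as a full temporary before multiplying by `c`; here the addition and multiplication are fused into the single loop inside `force`, which is the array-contraction payoff the abstract describes.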