Results 1–9 of 9
Automating the Generation of Composed Linear Algebra Kernels
Abstract

Cited by 5 (1 self)
Memory bandwidth limits the performance of important kernels in many scientific applications. Such applications often use sequences of Basic Linear Algebra Subprograms (BLAS), and highly efficient implementations of those routines enable scientists to achieve high performance at little cost. However, tuning the BLAS in isolation misses opportunities for memory optimization that result from composing multiple subprograms. Because it is not practical to create a library of all BLAS combinations, we have developed a domain-specific compiler that generates them on demand. In this paper, we describe a novel algorithm for compiling linear algebra kernels and searching for the best combination of optimization choices. We also present a new hybrid analytic/empirical method for quickly evaluating the profitability of each optimization. We report experimental results showing speedups of up to 130% relative to the GotoBLAS on an AMD Opteron and up to 137% relative to MKL on an Intel Core 2.
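The cross-routine memory optimization this abstract describes can be illustrated with a small sketch: two consecutive matrix-vector products (q = Ap and s = Aᵀr, as in a BiCG-style kernel) fused into a single pass over A so the matrix streams through memory once instead of twice. This is an illustrative hand-written example, not output of the compiler described in the paper; the function name and plain-list matrices are assumptions made for the sketch.

```python
def fused_bicg_step(A, p, r):
    # Unfused: q = A @ p, then s = A.T @ r, reads A twice.
    # Fused: one pass over the rows of A computes q[i] and
    # accumulates s, so each element of A is loaded once.
    n = len(A)
    q = [0.0] * n
    s = [0.0] * n
    for i in range(n):
        row = A[i]
        acc = 0.0
        for j in range(n):
            acc += row[j] * p[j]   # contributes to q = A p
            s[j] += row[j] * r[i]  # contributes to s = A^T r
        q[i] = acc
    return q, s
```

With a cold cache, the fused loop moves roughly half the matrix traffic of the two separate BLAS calls, which is exactly the opportunity that tuning each routine in isolation misses.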
Families of Algorithms for Reducing a Matrix to Condensed Form
Abstract

Cited by 3 (2 self)
In a recent paper it was shown how memory traffic can be diminished by reformulating the classic algorithm for reducing a matrix to bidiagonal form, a preprocess when computing the singular values of a dense matrix. The key is a reordering of the computation so that the most compute- and memory-intensive operations can be “fused”. In this paper, we show that other operations that reduce matrices to condensed form (reduction to upper Hessenberg and reduction to tridiagonal form) can be similarly reorganized, yielding different sets of operations that can be fused. By developing the algorithms with a common framework and notation, we facilitate the comparing and contrasting of the different algorithms and opportunities for optimization. We discuss the algorithms and showcase the performance improvements that they facilitate.
Algorithms for Reducing a Matrix to Condensed Form FLAME
, 2010
Abstract

Cited by 2 (2 self)
In a recent paper it was shown how memory traffic can be diminished by reformulating the classic algorithm for reducing a matrix to bidiagonal form, a preprocess when computing the singular values of a dense matrix. The key is a reordering of the computation so that the most compute- and memory-intensive operations can be “fused”. In this paper, we show that other operations that reduce matrices to condensed form (reduction to upper Hessenberg and reduction to tridiagonal form) can be similarly reorganized, yielding different sets of operations that can be fused. By developing the algorithms with a common framework and notation, we facilitate the comparing and contrasting of the different algorithms and opportunities for optimization. We discuss the algorithms and showcase the performance improvements that they facilitate.
Exploring the Optimization Space for Build to Order Matrix Algebra
Abstract

Cited by 1 (0 self)
The Build to Order (BTO) system compiles a sequence of matrix and vector operations into a high-performance C program for a given architecture. We focus on optimizing programs where memory traffic is the bottleneck. Loop fusion and data parallelism play an important role in this context, but applying them at every opportunity does not necessarily lead to the best performance. We present an empirical and exhaustive characterization of the optimization space for these two optimizations, reporting its size and how many points in the space are close to the fastest option. We show how optimizations of different parts of the program affect one another and how the best choices depend on the computer system. We also evaluate the suitability of several algorithms for searching the space. We leverage these findings to ensure that the BTO compiler produces kernels that outperform vendor-tuned BLAS on a variety of modern computer architectures.
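To see why this optimization space calls for search rather than a fixed heuristic, consider a simplified model in which each of the n−1 gaps between n fusible loops is independently fused or left alone, giving 2^(n−1) candidate programs before data-parallelism choices multiply the space further. The enumerator below is a hypothetical sketch of that model, not the BTO compiler's actual search representation.

```python
def fusion_choices(loops):
    # Enumerate all ways of grouping a sequence of fusible loops
    # into contiguous fused groups: each of the n-1 gaps is either
    # fused or not, so there are 2**(n-1) points in this space.
    n = len(loops)
    out = []
    for mask in range(1 << (n - 1)):
        groups, cur = [], [loops[0]]
        for i in range(1, n):
            if (mask >> (i - 1)) & 1:
                cur.append(loops[i])   # fuse with the previous group
            else:
                groups.append(cur)     # start a new group
                cur = [loops[i]]
        groups.append(cur)
        out.append(groups)
    return out
```

Even for a ten-operation kernel this toy model already yields 512 fusion configurations, which is why the paper's exhaustive characterization and search-algorithm comparison are needed.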
Restructuring the QR Algorithm for Performance
Abstract

Cited by 1 (0 self)
We show how the QR algorithm can be restructured so that it becomes rich in operations that can achieve near-peak performance on a modern processor. The key is a novel algorithm for applying multiple sets of Givens rotations. We demonstrate the merits of this new QR algorithm for computing the Hermitian (symmetric) eigenvalue decomposition and singular value decomposition of dense matrices when all eigenvectors/singular vectors are computed. The approach yields vastly improved performance relative to the traditional QR algorithm and is competitive with two commonly used alternatives—Cuppen’s Divide and Conquer algorithm and the Method of Multiple Relatively Robust Representations—while inheriting the more modest O(n) workspace requirements of the original QR algorithm. Since the computations performed by the restructured algorithm remain essentially identical to those performed by the original method, robust numerical properties are preserved.
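The building block this abstract refers to is the Givens rotation, a 2×2 orthogonal transform that zeroes one entry of a 2-vector; the paper's contribution is applying many such rotations at once. What follows is a minimal sketch of the textbook single-rotation formula only, not the paper's blocked scheme, and the function names are invented for illustration.

```python
import math

def givens(a, b):
    # Compute c, s so that [[c, s], [-s, c]] @ [a, b]^T = [r, 0]^T.
    # (Textbook formulation; robust library versions guard against
    # overflow more carefully.)
    if b == 0.0:
        return 1.0, 0.0
    r = math.hypot(a, b)
    return a / r, b / r

def apply_givens(c, s, a, b):
    # Apply the rotation to the pair (a, b).
    return c * a + s * b, -s * a + c * b
```

Applied down the subdiagonal of a tridiagonal matrix, sequences of these rotations drive one QR iteration; accumulating many of them into the eigenvector matrix is the memory-bound step the restructured algorithm turns into cache-friendly, near-peak operations.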
A Reliable Generation of High-Performance Matrix Algebra
Abstract
Scientific programmers often turn to vendor-tuned Basic Linear Algebra Subprograms (BLAS) to obtain portable high performance. However, many numerical algorithms require several BLAS calls in sequence, and those successive calls result in suboptimal performance. The entire sequence needs to be optimized in concert. Instead of vendor-tuned BLAS, a programmer could start with source code in Fortran or C (e.g., based on the Netlib BLAS) and use a state-of-the-art optimizing compiler. However, our experiments show that optimizing compilers often attain only one-quarter the performance of hand-optimized code. In this paper we present a domain-specific compiler for matrix algebra, the Build to Order BLAS (BTO), that reliably achieves high performance using a scalable search algorithm for choosing the best combination of loop fusion, array contraction, and multithreading for data parallelism. The BTO compiler generates code that is between 16% slower and 39% faster than hand-optimized code.
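Array contraction, one of the optimizations this abstract names, replaces a temporary array with a scalar once the loop that produces it and the loop that consumes it have been fused. A minimal hand-written sketch of the idea, with invented function names and a made-up two-statement kernel (r = αx + y followed by rᵀr), not code from the BTO compiler:

```python
def dot_after_axpy_unfused(alpha, x, y):
    # Two loops: the intermediate r is materialized as a full
    # n-element temporary, then reduced in a second pass.
    r = [alpha * xi + yi for xi, yi in zip(x, y)]
    return sum(ri * ri for ri in r)

def dot_after_axpy_contracted(alpha, x, y):
    # After fusing the two loops, the temporary array contracts
    # to one scalar per iteration: no n-element buffer exists,
    # removing both its memory traffic and its allocation.
    acc = 0.0
    for xi, yi in zip(x, y):
        ri = alpha * xi + yi
        acc += ri * ri
    return acc
```

Fusion enables the contraction, and the contraction in turn removes the memory traffic of the temporary, which is why the two optimizations are searched over jointly.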
Communication Avoiding Symmetric Band Reduction
Abstract
The running time of an algorithm depends on both arithmetic and communication (i.e., data movement) costs, and the relative costs of communication are growing over time. In this work, we present both theoretical and practical results for tridiagonalizing a symmetric band matrix: we present an algorithm that asymptotically reduces communication, and we show that it indeed performs well in practice. The tridiagonalization of a symmetric band matrix is a key kernel in solving the symmetric eigenvalue problem for both full and band matrices. In order to preserve sparsity, tridiagonalization routines use annihilate-and-chase procedures that previously have suffered from poor data locality. We improve data locality by reorganizing the computation, asymptotically reducing communication costs compared to existing algorithms. Our sequential implementation demonstrates that avoiding communication improves runtime even at the expense of extra arithmetic: we observe a 2× speedup over Intel MKL while doing 43% more floating-point operations. Our parallel implementation targets shared-memory multicore platforms. It uses pipelined parallelism and a static scheduler while retaining the locality properties of the sequential algorithm. Due to lightweight synchronization and effective data reuse, we see 9.5× scaling over our serial code and up to 6× speedup over the PLASMA library, comparing parallel performance on a ten-core processor.
BLIS: A Framework for Rapidly Instantiating BLAS Functionality (submitted to ACM TOMS)
, 2012
Abstract
We propose the portable BLAS-like Interface Software (BLIS) framework which addresses a number of shortcomings in both the original BLAS interface and present-day BLAS implementations. The framework allows developers to rapidly instantiate high-performance BLAS-like libraries on existing and new architectures with relatively little effort. The key to this achievement is the observation that virtually all computation within level-2 and level-3 BLAS operations may be expressed in terms of very simple kernels. Higher-level framework code is generalized so that it can be reused and/or reparameterized for different operations (as well as different architectures) with little to no modification. Inserting high-performance kernels into the framework facilitates the immediate optimization of any and all BLAS-like operations which are cast in terms of these kernels, and thus the framework acts as a productivity multiplier. Users of BLAS-dependent applications are supported through a straightforward compatibility layer, though calling sequences must be updated for those who wish to access new functionality. Experimental performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL).
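The structural idea behind both BLIS abstracts, that level-3 operations reduce to one tiny kernel plus portable partitioning code, can be shown in miniature. The code below is a toy model of that layering, not BLIS itself: dimensions are assumed divisible by the tile sizes, the cache-blocking layers are omitted, and the "micro-kernel" is plain loops rather than architecture-specific code.

```python
def microkernel(mr, nr, kc, a, b, c):
    # The one architecture-specific piece: a tiny mr x nr update
    # c += a @ b over a kc-deep slice. In a real framework this
    # would be hand-optimized assembly or intrinsics.
    for i in range(mr):
        for j in range(nr):
            acc = 0.0
            for p in range(kc):
                acc += a[i][p] * b[p][j]
            c[i][j] += acc

def gemm(A, B, C, mr=2, nr=2):
    # Portable framework layer: partition C into mr x nr tiles and
    # hand every tile to the micro-kernel. Only this layer changes
    # to express other level-3 operations; the kernel is reused.
    m, k, n = len(A), len(B), len(B[0])
    for i0 in range(0, m, mr):
        for j0 in range(0, n, nr):
            a = [A[i0 + i] for i in range(mr)]
            b = [[B[p][j0 + j] for j in range(nr)] for p in range(k)]
            ctile = [[0.0] * nr for _ in range(mr)]
            microkernel(mr, nr, k, a, b, ctile)
            for i in range(mr):
                for j in range(nr):
                    C[i0 + i][j0 + j] += ctile[i][j]
```

Because only `microkernel` touches machine-specific detail, porting the whole level-3 stack to a new architecture means optimizing that one small function, which is the "productivity multiplier" claim in the abstract.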
BLIS: A Framework for Rapidly Instantiating BLAS Functionality
Abstract
The BLAS-like Library Instantiation Software (BLIS) framework is a new infrastructure for rapidly instantiating Basic Linear Algebra Subprograms (BLAS) functionality. Its fundamental innovation is that virtually all computation within level-2 (matrix-vector) and level-3 (matrix-matrix) BLAS operations can be expressed and optimized in terms of very simple kernels. While others have had similar insights, BLIS reduces the necessary kernels to what we believe is the simplest set that still supports the high performance that the computational science community demands. Higher-level framework code is generalized and implemented in ISO C99 so that it can be reused and/or reparameterized for different operations (and different architectures) with little to no modification. Inserting high-performance kernels into the framework facilitates the immediate optimization of any BLAS-like operations which are cast in terms of these kernels, and thus the framework acts as a productivity multiplier. Users of BLAS-dependent applications are given a choice of using the traditional Fortran77 BLAS interface, a generalized C interface, or any other higher-level interface that builds upon this latter API. Preliminary performance of level-2 and level-3 operations is observed to be competitive with two mature open source libraries (OpenBLAS and ATLAS) as well as an established commercial product (Intel MKL).