Results 1–10 of 16
Automating the Generation of Composed Linear Algebra Kernels
Abstract
Cited by 12 (2 self)
Memory bandwidth limits the performance of important kernels in many scientific applications. Such applications often use sequences of Basic Linear Algebra Subprograms (BLAS), and highly efficient implementations of those routines enable scientists to achieve high performance at little cost. However, tuning the BLAS in isolation misses opportunities for memory optimization that result from composing multiple subprograms. Because it is not practical to create a library of all BLAS combinations, we have developed a domain-specific compiler that generates them on demand. In this paper, we describe a novel algorithm for compiling linear algebra kernels and searching for the best combination of optimization choices. We also present a new hybrid analytic/empirical method for quickly evaluating the profitability of each optimization. We report experimental results showing speedups of up to 130% relative to the GotoBLAS on an AMD Opteron and up to 137% relative to MKL on an Intel Core 2.
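The composition this abstract describes can be pictured with a toy fusion of two BLAS-like loops. The function names and shapes below are hypothetical, not the compiler's output; the point is only that the fused version never materializes the intermediate vector in memory.

```python
def unfused(A, x, y, z):
    # Two separate BLAS-like calls: GEMV (t = A x + y), then DOT (t . z).
    # The intermediate vector t is written to memory and read back.
    t = [sum(A[i][j] * x[j] for j in range(len(x))) + y[i]
         for i in range(len(A))]
    return sum(t[i] * z[i] for i in range(len(A)))


def fused(A, x, y, z):
    # Composed kernel: each element t_i feeds the dot product as soon as
    # it is produced, so t never exists as an array in memory.
    s = 0.0
    for i in range(len(A)):
        ti = sum(A[i][j] * x[j] for j in range(len(x))) + y[i]
        s += ti * z[i]
    return s
```

On a memory-bound machine the fused loop performs the same arithmetic with roughly half the traffic for the intermediate, which is the opportunity a per-routine BLAS library cannot see.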
Restructuring the QR Algorithm for High-Performance Applications of Givens Rotations
2011
Abstract
Cited by 3 (0 self)
We show how the QR algorithm can be restructured so that it becomes rich in operations that can achieve near-peak performance on a modern processor. The key is a novel, cache-friendly algorithm for applying multiple sets of Givens rotations, which is then implemented with optimizations that (1) leverage vector instruction units to increase floating-point throughput, and (2) fuse multiple rotations to decrease the total number of memory operations. We demonstrate the merits of this new QR algorithm for computing the Hermitian (symmetric) eigenvalue decomposition and singular value decomposition of dense matrices when all eigenvectors/singular vectors are computed. The approach yields vastly improved performance relative to the traditional QR algorithm and is competitive with two commonly used alternatives—Cuppen’s Divide and Conquer algorithm and the Method of Multiple Relatively Robust Representations—while inheriting the more modest O(n) workspace requirements of the original QR algorithm. Since the computations performed by the restructured algorithm remain essentially identical to those performed by the original method, robust numerical properties are preserved.
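As a rough illustration of the building block involved, here is a minimal sketch of generating and applying a single Givens rotation to two matrix rows. The paper's contribution is applying many such rotations in a cache-friendly, fused way; this simplified `givens` also ignores the scaling and overflow care that a routine like LAPACK's dlartg takes.

```python
import math


def givens(a, b):
    # Return c, s such that [[c, s], [-s, c]] @ [a, b]^T = [r, 0]^T.
    # Simplified sketch: no protection against overflow in a*a + b*b.
    if b == 0.0:
        return 1.0, 0.0
    r = math.hypot(a, b)
    return a / r, b / r


def apply_givens_rows(M, i, j, c, s):
    # Rotate rows i and j of M in place, one pass over both rows.
    for k in range(len(M[0])):
        t = c * M[i][k] + s * M[j][k]
        M[j][k] = -s * M[i][k] + c * M[j][k]
        M[i][k] = t
```

Applying one rotation touches two full rows; applying many rotations one at a time re-reads those rows repeatedly, which is why fusing several rotations per memory sweep pays off.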
Algorithms for Reducing a Matrix to Condensed Form FLAME
2010
Abstract
Cited by 2 (2 self)
In a recent paper it was shown how memory traffic can be diminished by reformulating the classic algorithm for reducing a matrix to bidiagonal form, a preprocessing step when computing the singular values of a dense matrix. The key is a reordering of the computation so that the most compute- and memory-intensive operations can be “fused”. In this paper, we show that other operations that reduce matrices to condensed form (reduction to upper Hessenberg form and reduction to tridiagonal form) can be similarly reorganized, yielding different sets of operations that can be fused. By developing the algorithms with a common framework and notation, we facilitate the comparing and contrasting of the different algorithms and opportunities for optimization. We discuss the algorithms and showcase the performance improvements that they facilitate.
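One concrete instance of this kind of fusion is computing u = A v and w = Aᵀ x in a single sweep over A, so each matrix element is loaded once instead of twice. A minimal sketch of the idea (illustrative only, not the paper's actual fused kernels):

```python
def fused_gemv_gemvt(A, v, x):
    # One sweep over A computes both u = A @ v and w = A.T @ x; each
    # A[i][j] is read once and contributes to both results.
    m, n = len(A), len(A[0])
    u = [0.0] * m
    w = [0.0] * n
    for i in range(m):
        xi = x[i]
        for j in range(n):
            a = A[i][j]
            u[i] += a * v[j]
            w[j] += a * xi
    return u, w
```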
Families of Algorithms for Reducing a Matrix to Condensed Form
Abstract
Cited by 2 (1 self)
In a recent paper it was shown how memory traffic can be diminished by reformulating the classic algorithm for reducing a matrix to bidiagonal form, a preprocessing step when computing the singular values of a dense matrix. The key is a reordering of the computation so that the most compute- and memory-intensive operations can be “fused”. In this paper, we show that other operations that reduce matrices to condensed form (reduction to upper Hessenberg form and reduction to tridiagonal form) can be similarly reorganized, yielding different sets of operations that can be fused. By developing the algorithms with a common framework and notation, we facilitate the comparing and contrasting of the different algorithms and opportunities for optimization. We discuss the algorithms and showcase the performance improvements that they facilitate.
Restructuring the QR Algorithm for Performance
Abstract
Cited by 1 (0 self)
We show how the QR algorithm can be restructured so that it becomes rich in operations that can achieve near-peak performance on a modern processor. The key is a novel algorithm for applying multiple sets of Givens rotations. We demonstrate the merits of this new QR algorithm for computing the Hermitian (symmetric) eigenvalue decomposition and singular value decomposition of dense matrices when all eigenvectors/singular vectors are computed. The approach yields vastly improved performance relative to the traditional QR algorithm and is competitive with two commonly used alternatives—Cuppen’s Divide and Conquer algorithm and the Method of Multiple Relatively Robust Representations—while inheriting the more modest O(n) workspace requirements of the original QR algorithm. Since the computations performed by the restructured algorithm remain essentially identical to those performed by the original method, robust numerical properties are preserved.
Exploring the Optimization Space for Build to Order Matrix Algebra
Abstract
Cited by 1 (0 self)
The Build to Order (BTO) system compiles a sequence of matrix and vector operations into a high-performance C program for a given architecture. We focus on optimizing programs where memory traffic is the bottleneck. Loop fusion and data parallelism play an important role in this context, but applying them at every opportunity does not necessarily lead to the best performance. We present an empirical and exhaustive characterization of the optimization space for these two optimizations, reporting its size and how many points in the space are close to the fastest option. We show how optimizations of different parts of the program affect one another and how the best choices depend on the computer system. We also evaluate the suitability of several algorithms for searching the space. We leverage these findings to ensure that the BTO compiler produces kernels that outperform vendor-tuned BLAS on a variety of modern computer architectures.
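A simple way to picture the space being characterized: if each fusable loop pair is an independent fuse/don't-fuse decision, the space has 2^k points, and an exhaustive search evaluates them all. This sketch uses a hypothetical cost function in place of actually compiling and timing generated code:

```python
import itertools


def fusion_variants(num_fusable_pairs):
    # One boolean per fusable loop pair: 2^k candidate kernels.
    return list(itertools.product((False, True), repeat=num_fusable_pairs))


def exhaustive_search(variants, cost):
    # Evaluate every point and keep the cheapest; a real system would
    # compile and time each variant rather than query a model.
    return min(variants, key=cost)
```

The exponential growth of the variant list is why the abstract's finding matters: knowing how many points lie near the optimum tells you whether cheaper, non-exhaustive search strategies are viable.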
Generating Empirically Optimized Numerical Software from MATLAB Prototypes
2008
Abstract
The growing demand for higher levels of detail and accuracy in results means that the size and complexity of scientific computations is increasing at least as fast as the improvements in processor technology. Programming scientific applications is hard, and optimizing them for high performance is even harder. The development of optimized codes requires extensive knowledge, not only of the costs of floating-point arithmetic but also of memory access issues and compiler optimizations. Experiments show that the complexity of this hardware-software system means that performance is difficult to predict fully. Therefore, computational scientists are often forced to choose between investing too much time in tuning code or accepting performance that is significantly lower than the best achievable performance on a given architecture. In this paper, we describe the first steps toward a fully automated system for the optimization of the matrix algebra kernels that are a foundational part of many scientific applications. To generate highly optimized code from a high-level MATLAB prototype, we define a three-step approach. To begin, we have developed a compiler that converts a MATLAB script into simple C code. We then use the polyhedral optimization system PLuTo to optimize that code for coarse-grained parallelism and locality simultaneously. Finally, we annotate the resulting code with performance tuning directives and
Automating the Generation of Composed Linear Algebra Kernels
Abstract
Memory bandwidth limits the performance of important kernels in many scientific applications. Such applications often use sequences of Basic Linear Algebra Subprograms (BLAS), and highly efficient implementations of those routines enable scientists to achieve high performance at little cost. However, because the BLAS are tuned in isolation, they do not take advantage of opportunities for memory optimization that result from composing multiple subprograms. Because it is not practical to create a library of all BLAS combinations, we have developed a domain-specific compiler that generates them on demand. In this paper, we describe the novel algorithm underlying the compiler that searches for the best combination of optimization choices, and we present a new hybrid analytic/empirical method for quickly evaluating the profitability of each optimization. We report experimental results showing speedups of 5% to 140% relative to GotoBLAS and vendor-tuned BLAS on the Intel Core 2 and the AMD Opteron.
Optimizing CUDA Code by Kernel Fusion—Application on BLAS
Abstract
Modern GPUs are able to perform significantly more arithmetic operations than transfers of a single word to or from global memory. Hence, many GPU kernels are limited by memory bandwidth and cannot exploit the arithmetic power of GPUs. However, memory locality can often be improved by kernel fusion when a sequence of kernels is executed and some kernels in this sequence share data. In this paper, we show how kernels performing map, reduce, or their nested combinations can be fused automatically by our source-to-source compiler. To demonstrate the usability of the compiler, we have implemented several BLAS-1 and BLAS-2 routines and show how the performance of their sequences can be improved by fusions. Compared to similar sequences using CUBLAS, our compiler is able to generate code that is up to 2.61× faster for the examples tested.
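The map/reduce fusion this abstract automates can be pictured serially: the unfused version materializes the mapped array in memory, while the fused version feeds each mapped element straight into the reduction. This is a CPU-side sketch of the transformation, not the compiler's CUDA output:

```python
def map_then_reduce(xs):
    # Unfused: the map result is materialized, then reduced (two passes,
    # analogous to two GPU kernels exchanging data via global memory).
    squares = [x * x for x in xs]
    return sum(squares)


def fused_map_reduce(xs):
    # Fused: each mapped element feeds the reduction immediately, so the
    # intermediate array never touches memory.
    acc = 0.0
    for x in xs:
        acc += x * x
    return acc
```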
Reliable Generation of High-Performance Matrix Algebra
Abstract
Scientific programmers often turn to vendor-tuned Basic Linear Algebra Subprograms (BLAS) to obtain portable high performance. However, many numerical algorithms require several BLAS calls in sequence, and those successive calls result in suboptimal performance. The entire sequence needs to be optimized in concert. Instead of vendor-tuned BLAS, a programmer could start with source code in Fortran or C (e.g., based on the Netlib BLAS) and use a state-of-the-art optimizing compiler. However, our experiments show that optimizing compilers often attain only one-quarter the performance of hand-optimized code. In this paper we present a domain-specific compiler for matrix algebra, the Build to Order BLAS (BTO), that reliably achieves high performance using a scalable search algorithm for choosing the best combination of loop fusion, array contraction, and multithreading for data parallelism. The BTO compiler generates code that is between 16% slower and 39% faster than hand-optimized code.
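Loop fusion and array contraction, two of the transformations named here, compose naturally: once two loops are fused, a temporary array whose elements are each used only within one iteration contracts to a scalar. A hand-written sketch of the before/after shapes (illustrative only, not BTO's generated code):

```python
def before_contraction(x, y):
    # Two loops with a full temporary array t between them.
    n = len(x)
    t = [0.0] * n
    for i in range(n):
        t[i] = x[i] + y[i]
    s = 0.0
    for i in range(n):
        s += t[i] * t[i]
    return s


def after_fusion_and_contraction(x, y):
    # After fusing the loops, t[i] is only needed in iteration i, so the
    # array contracts to the scalar ti and its memory traffic disappears.
    s = 0.0
    for i in range(len(x)):
        ti = x[i] + y[i]
        s += ti * ti
    return s
```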