Results 1 - 10
of
13
Minimizing Communication in Linear Algebra
, 2009
"... In 1981 Hong and Kung [HK81] proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and large, slow memory) needed to perform dense, n-by-n matrix-multiplication using the conventional O(n 3) algorithm, where the input matrices were too large to fit in ..."
Abstract
-
Cited by 16 (9 self)
- Add to MetaCart
In 1981 Hong and Kung [HK81] proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and large, slow memory) needed to perform dense, n-by-n matrix-multiplication using the conventional O(n 3) algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin [ITT04] gave a new proof of this result and extended it to the parallel case (where communication means the amount of data moved between processors). In both cases the lower bound may be expressed as Ω(#arithmetic operations / √ M), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDL T factorization, QR factorization, algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth) we get lower bounds on the number of messages required to move it (latency). We illustrate how to extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or if we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph theoretic problems. We point out recently designed algorithms for dense LU, Cholesky, QR, eigenvalue and the SVD problems that attain these lower bounds; implementations of LU and QR show large speedups over conventional linear algebra algorithms in standard libraries like LAPACK and ScaLAPACK. Many open problems remain. 1
Parallel and fully recursive multifrontal supernodal sparse cholesky
- Future Generation Computer Systems
, 2004
"... We describe the design, implementation, and performance of a new parallel sparse Cholesky factorization code. The code uses a multifrontal factorization strategy. Operations on small dense submatrices are performed using new dense-matrix subroutines that are part of the code, although the code can a ..."
Abstract
-
Cited by 13 (3 self)
- Add to MetaCart
We describe the design, implementation, and performance of a new parallel sparse Cholesky factorization code. The code uses a multifrontal factorization strategy. Operations on small dense submatrices are performed using new dense-matrix subroutines that are part of the code, although the code can also use the BLAS and LAPACK. The new code is recursive at both the sparse and the dense levels, it uses a novel recursive data layout for dense submatrices, and it is parallelized using Cilk, an extension of C specifically designed to parallelize recursive codes. We demonstrate that the new code performs well and scales well on SMP’s. In particular, on up to 16 processors, the code outperforms two state-of-the-art message-passing codes. The scalability and high performance that the code achieves imply that recursive schedules, blocked data layouts, and dynamic scheduling are effective in the implementation of sparse factorization codes.
permission. MINIMIZING COMMUNICATION IN NUMERICAL LINEAR ALGEBRA
"... personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires pri ..."
Abstract
-
Cited by 11 (8 self)
- Add to MetaCart
personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific
SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures
- N, Month 20YY. 24 · G. Quintana-Ortí et al. SPAA ’07: Proceedings of the Nineteenth ACM Symposium on Parallelism in Algorithms and Architectures
"... We discuss the high-performance parallel implementation and execution of dense linear algebra matrix operations on SMP architectures, with an eye towards multi-core processors with many cores. We argue that traditional implementations, as those incorporated in LAPACK, cannot be easily modified to re ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
We discuss the high-performance parallel implementation and execution of dense linear algebra matrix operations on SMP architectures, with an eye towards multi-core processors with many cores. We argue that traditional implementations, as those incorporated in LAPACK, cannot be easily modified to render high performance as well as scalability on these architectures. The solution we propose is to arrange the data structures and algorithms so that matrix blocks become the fundamental units of data, and operations on these blocks become the fundamental units of computation, resulting in algorithms-byblocks as opposed to the more traditional blocked algorithms. We show that this facilitates the adoption of techniques akin to dynamic scheduling and out-of-order execution usual in superscalar processors, which we name SuperMatrix Out-of-Order scheduling. Performance results on a 16 CPU Itanium2-based server are used to highlight opportunities and issues related to this new approach. 1
A faster and simpler recursive algorithm for the LAPACK routine DGELS
- BIT
, 2001
"... We present new algorithms for computing the linear least squares solution to overdetermined linear systems and the minimum norm solution to underdetermined linear systems. For both problems, we consider the standard formulation min �AX − B�F and the transposed formulation min �A T X − B�F, i.e, four ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
We present new algorithms for computing the linear least squares solution to overdetermined linear systems and the minimum norm solution to underdetermined linear systems. For both problems, we consider the standard formulation min �AX − B�F and the transposed formulation min �A T X − B�F, i.e, four different problems in all. The functionality of our implementation corresponds to that of the LAPACK routine DGELS. The new implementation is significantly faster and simpler. It outperforms the LAPACK DGELS for all matrix sizes tested. The improvement is usually 50–100 % and it is as high as 400%. The four different problems of DGELS are essentially reduced to two, by use of explicit transposition of A. By explicit transposition we avoid computing Householder transformations on vectors with large stride. The QR factorization of block columns of A is performed using a recursive level-3 algorithm. By interleaving updates of B with the factorization of A, we reduce the number of floating point operations performed for the linear least squares problem. By avoiding redundant computations in the update of B we reduce the work needed to compute the minimum norm solution. Finally, we outline fully recursive algorithms for the four problems of DGELS as well as for QR factorization.
The Opie Compiler: from Row-major Source to Morton-ordered Matrices
, 2004
"... The Opie Project aims to develop a compiler to transform C codes written for row-major matrix representation into equivalentcodes for Morton-order matrix representation, and to apply its techniques to other languages. Accepting a possible reduction in performance weseek to compile libraries of u ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
The Opie Project aims to develop a compiler to transform C codes written for row-major matrix representation into equivalentcodes for Morton-order matrix representation, and to apply its techniques to other languages. Accepting a possible reduction in performance weseek to compile libraries of usable code to support future developmentofnew algorithms better suited to Morton-ordered matrices.
Prospectus for the Next LAPACK and ScaLAPACK Libraries
"... Dense linear algebra (DLA) forms the core of many scientific computing applications. Consequently, there is continuous interest and demand for the development of increasingly better algorithms in the field. Here ’better ’ has a broad meaning, and includes improved reliability, accuracy, robustness, ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Dense linear algebra (DLA) forms the core of many scientific computing applications. Consequently, there is continuous interest and demand for the development of increasingly better algorithms in the field. Here ’better ’ has a broad meaning, and includes improved reliability, accuracy, robustness, ease of use, and
ATLAS Installation Guide
, 2007
"... This note provides a brief overview of ATLAS, and describes how to install it. It includes extensive discussion of common configure options, and describes why they might be employed on various platforms. In addition to discussing how to configure and build the ATLAS package, this note also describes ..."
Abstract
- Add to MetaCart
This note provides a brief overview of ATLAS, and describes how to install it. It includes extensive discussion of common configure options, and describes why they might be employed on various platforms. In addition to discussing how to configure and build the ATLAS package, this note also describes how an installer can confirm that the resulting libraries are producing correct answers and running efficiently. Extensive examples are provided, including a full-length example showing the installation of both ATLAS and LAPACK on an example architecture.
Oolala - From Numerical Linear Algebra To Compiler Technology For Design Patterns
, 2002
"... Declaration 19 Copyright 20 Acknowledgements 22 1 ..."

