Results 1–10 of 16
Minimizing Communication in Linear Algebra, 2009
Cited by 16 (8 self)
Abstract:
In 1981 Hong and Kung [HK81] proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and a large, slow memory) needed to perform dense n-by-n matrix multiplication using the conventional O(n^3) algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin [ITT04] gave a new proof of this result and extended it to the parallel case (where communication means the amount of data moved between processors). In both cases the lower bound may be expressed as Ω(#arithmetic operations / √M), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDL^T factorization, QR factorization, and algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth) we get lower bounds on the number of messages required to move it (latency). We illustrate how to extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or whether we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph-theoretic problems. We point out recently designed algorithms for the dense LU, Cholesky, QR, eigenvalue and SVD problems that attain these lower bounds; implementations of LU and QR show large speedups over conventional linear algebra algorithms in standard libraries like LAPACK and ScaLAPACK. Many open problems remain.
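The √M in the bound comes from blocking: with block edge b chosen so three b-by-b blocks fit in fast memory, the conventional algorithm moves Θ(n^3/√M) words, matching the lower bound up to constants. A minimal counting sketch of this idea (the function name and the traffic model are illustrative, not from the paper):

```python
import math

def blocked_matmul_traffic(n: int, M: int) -> int:
    """Words moved by a blocked O(n^3) multiply with fast-memory size M.

    Block edge b is chosen so three b-by-b blocks (one each of A, B, C)
    fit in fast memory at once; this toy model requires b to divide n.
    """
    b = math.isqrt(M // 3)
    assert b > 0 and n % b == 0, "toy model: require b | n"
    blocks = n // b
    words = 0
    for i in range(blocks):
        for j in range(blocks):
            words += b * b              # load C(i, j)
            for k in range(blocks):
                words += 2 * b * b      # load A(i, k) and B(k, j)
            words += b * b              # store C(i, j)
    return words

# Total traffic is 2*n**3/b + 2*n**2, i.e. Theta(n^3 / sqrt(M)):
# quadrupling M halves the dominant n^3/b term.
```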
Parallel and fully recursive multifrontal supernodal sparse Cholesky
Future Generation Computer Systems, 2004
Cited by 15 (4 self)
Abstract:
We describe the design, implementation, and performance of a new parallel sparse Cholesky factorization code. The code uses a multifrontal factorization strategy. Operations on small dense submatrices are performed using new dense-matrix subroutines that are part of the code, although the code can also use the BLAS and LAPACK. The new code is recursive at both the sparse and the dense levels; it uses a novel recursive data layout for dense submatrices, and it is parallelized using Cilk, an extension of C specifically designed to parallelize recursive codes. We demonstrate that the new code performs well and scales well on SMPs. In particular, on up to 16 processors, the code outperforms two state-of-the-art message-passing codes. The scalability and high performance that the code achieves imply that recursive schedules, blocked data layouts, and dynamic scheduling are effective in the implementation of sparse factorization codes.
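The dense-level recursion described above can be sketched as a divide-and-conquer Cholesky: factor the leading block, solve a triangular system for the off-diagonal block, and recurse on the Schur complement. This is an illustrative pure-Python sketch, not the paper's code:

```python
def trsm(L, B):
    """Solve X * L^T = B for X, with L lower triangular (forward sweep)."""
    h = len(L)
    X = [[0.0] * h for _ in range(len(B))]
    for i in range(len(B)):
        for j in range(h):
            X[i][j] = (B[i][j] - sum(X[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return X

def cholesky(A):
    """Recursive Cholesky A = L * L^T for a dense SPD matrix (list of lists).

    Split A into 2x2 blocks, factor the leading block, solve for the
    off-diagonal block, and recurse on the Schur complement.
    """
    n = len(A)
    if n == 1:
        return [[A[0][0] ** 0.5]]
    h = n // 2
    A11 = [row[:h] for row in A[:h]]
    A21 = [row[:h] for row in A[h:]]
    A22 = [row[h:] for row in A[h:]]
    L11 = cholesky(A11)                      # factor leading block
    L21 = trsm(L11, A21)                     # solve L21 * L11^T = A21
    S = [[A22[i][j] - sum(L21[i][k] * L21[j][k] for k in range(h))
          for j in range(n - h)] for i in range(n - h)]
    L22 = cholesky(S)                        # factor Schur complement
    L = [[0.0] * n for _ in range(n)]        # assemble lower-triangular L
    for i in range(h):
        L[i][:h] = L11[i]
    for i in range(n - h):
        L[h + i][:h] = L21[i]
        L[h + i][h:] = L22[i]
    return L
```

In the paper's setting the same recursion is applied to the dense frontal matrices, with a matching recursive data layout so each recursive block is contiguous in memory.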
SuperMatrix out-of-order scheduling of matrix operations for SMP and multicore architectures
G. Quintana-Ortí et al., SPAA ’07: Proceedings of the Nineteenth ACM Symposium on Parallelism in Algorithms and Architectures
Cited by 7 (3 self)
Abstract:
We discuss the high-performance parallel implementation and execution of dense linear algebra matrix operations on SMP architectures, with an eye towards multicore processors with many cores. We argue that traditional implementations, such as those incorporated in LAPACK, cannot be easily modified to render high performance as well as scalability on these architectures. The solution we propose is to arrange the data structures and algorithms so that matrix blocks become the fundamental units of data and operations on these blocks become the fundamental units of computation, resulting in algorithms-by-blocks as opposed to the more traditional blocked algorithms. We show that this facilitates the adoption of techniques akin to the dynamic scheduling and out-of-order execution usual in superscalar processors, which we name SuperMatrix out-of-order scheduling. Performance results on a 16-CPU Itanium2-based server are used to highlight opportunities and issues related to this new approach.
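The idea of blocks as units of data and block operations as units of computation can be illustrated with a toy dependence-driven scheduler: a task issues as soon as no earlier unissued task conflicts with it on any block. All names here (Task, run, the block labels) are hypothetical, not the SuperMatrix API:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)

def hazard(t, u):
    """True if t must wait for earlier task u (RAW, WAR or WAW on a block)."""
    return bool(t.reads & u.writes or t.writes & (u.reads | u.writes))

def run(tasks):
    """Group tasks into issue 'waves': a task issues once no earlier
    still-unissued task conflicts with it (dependences must be acyclic)."""
    waves, pending = [], list(tasks)
    while pending:
        ready = [t for t in pending
                 if not any(hazard(t, u) for u in pending[:pending.index(t)])]
        waves.append([t.name for t in ready])
        pending = [t for t in pending if t not in ready]
    return waves

# One step of a blocked Cholesky: both trsm tasks, and then all three
# trailing updates, can issue together, out of strict program order.
waves = run([
    Task("chol(A00)", writes={"A00"}),
    Task("trsm(A10)", reads={"A00"}, writes={"A10"}),
    Task("trsm(A20)", reads={"A00"}, writes={"A20"}),
    Task("syrk(A11)", reads={"A10"}, writes={"A11"}),
    Task("gemm(A21)", reads={"A10", "A20"}, writes={"A21"}),
    Task("syrk(A22)", reads={"A20"}, writes={"A22"}),
])
```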
The Relevance of New Data Structure Approaches for Dense Linear Algebra in the New Multicore / Manycore Environments
IBM RC Report 24599 (July), IBM Research
Cited by 6 (2 self)
Abstract:
been applying recursion and New Data Structures (NDS) to increase the performance of Dense Linear Algebra (DLA) factorization algorithms. Later, John Gunnels, and later still, Jim Sexton, both now at IBM Research, also began working in this area. For about three years now almost all computer manufacturers have dramatically changed their computer architectures, which they call Multi-Core (MC). It turns out that these new designs give poor performance for the traditional designs of DLA libraries such as LAPACK and ScaLAPACK. Recent results of Jack Dongarra’s group at the Innovative Computing Laboratory in Knoxville, Tennessee have shown how to obtain high performance for DLA factorization algorithms on the Cell architecture, an example of an MC processor, but only when they used NDS. In this talk we will give some reasons why this is so.
A faster and simpler recursive algorithm for the LAPACK routine DGELS
BIT, 2001
Cited by 3 (1 self)
Abstract:
We present new algorithms for computing the linear least squares solution to overdetermined linear systems and the minimum-norm solution to underdetermined linear systems. For both problems, we consider the standard formulation min ‖AX − B‖_F and the transposed formulation min ‖A^T X − B‖_F, i.e., four different problems in all. The functionality of our implementation corresponds to that of the LAPACK routine DGELS. The new implementation is significantly faster and simpler. It outperforms LAPACK's DGELS for all matrix sizes tested. The improvement is usually 50–100% and is as high as 400%. The four different problems of DGELS are essentially reduced to two by use of explicit transposition of A. By explicit transposition we avoid computing Householder transformations on vectors with large stride. The QR factorization of block columns of A is performed using a recursive level-3 algorithm. By interleaving updates of B with the factorization of A, we reduce the number of floating-point operations performed for the linear least squares problem. By avoiding redundant computations in the update of B we reduce the work needed to compute the minimum-norm solution. Finally, we outline fully recursive algorithms for the four problems of DGELS as well as for QR factorization.
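The interleaving trick described above, applying each Householder reflector to B as soon as it is formed rather than factoring A first and reapplying the reflectors later, can be sketched for a single right-hand side. This is an illustrative pure-Python sketch, not the paper's DGELS implementation:

```python
import math

def lstsq_qr(A, b):
    """Solve min ||A x - b||_2 (m >= n, full column rank) by Householder QR,
    applying each reflector to b immediately after it is formed."""
    m, n = len(A), len(A[0])
    A = [row[:] for row in A]
    b = b[:]
    for k in range(n):
        col = [A[i][k] for i in range(k, m)]
        alpha = -math.copysign(math.hypot(*col), col[0])
        v = col[:]
        v[0] -= alpha                        # Householder vector
        vv = sum(vi * vi for vi in v)
        if vv == 0.0:
            continue                         # column already zero below
        for j in range(k, n):                # update trailing columns of A
            s = 2.0 * sum(v[i] * A[k + i][j] for i in range(m - k)) / vv
            for i in range(m - k):
                A[k + i][j] -= s * v[i]
        s = 2.0 * sum(v[i] * b[k + i] for i in range(m - k)) / vv
        for i in range(m - k):               # interleaved update of b
            b[k + i] -= s * v[i]
    x = [0.0] * n                            # back-substitute R x = b[:n]
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
    return x
```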
The Opie Compiler: from Row-major Source to Morton-ordered Matrices, 2004
Cited by 3 (1 self)
Abstract:
The Opie Project aims to develop a compiler that transforms C codes written for row-major matrix representation into equivalent codes for Morton-order matrix representation, and to apply its techniques to other languages. Accepting a possible reduction in performance, we seek to compile libraries of usable code to support future development of new algorithms better suited to Morton-ordered matrices.
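Morton (Z-order) layout stores a matrix so that each power-of-two quadrant, recursively, is contiguous in memory; the flat index is obtained by interleaving the bits of the row and column indices. A minimal sketch (the function name is illustrative):

```python
def morton(i: int, j: int, bits: int = 16) -> int:
    """Flat Morton (Z-order) index: interleave the bits of row i and
    column j, so each power-of-two quadrant is stored contiguously."""
    z = 0
    for b in range(bits):
        z |= ((j >> b) & 1) << (2 * b)        # column bits at even positions
        z |= ((i >> b) & 1) << (2 * b + 1)    # row bits at odd positions
    return z

# A row-major access A[i][j] becomes buf[morton(i, j)] in the transformed
# code; the top-left 2x2 block of a matrix maps to flat indices 0..3.
```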
Prospectus for the Next LAPACK and ScaLAPACK Libraries
Cited by 2 (0 self)
Abstract:
Dense linear algebra (DLA) forms the core of many scientific computing applications. Consequently, there is continuous interest and demand for the development of increasingly better algorithms in the field. Here 'better' has a broad meaning, and includes improved reliability, accuracy, robustness, ease of use, and ...
ATLAS Installation Guide, 2007
Abstract:
This note provides a brief overview of ATLAS, and describes how to install it. It includes extensive discussion of common configure options, and describes why they might be employed on various platforms. In addition to discussing how to configure and build the ATLAS package, this note also describes how an installer can confirm that the resulting libraries are producing correct answers and running efficiently. Extensive examples are provided, including a full-length example showing the installation of both ATLAS and LAPACK on an example architecture.