Results 1–10 of 23
Minimizing Communication in Linear Algebra
, 2009
"... In 1981 Hong and Kung [HK81] proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and large, slow memory) needed to perform dense, nbyn matrixmultiplication using the conventional O(n 3) algorithm, where the input matrices were too large to fit in ..."
Abstract

Cited by 17 (9 self)
 Add to MetaCart
In 1981 Hong and Kung [HK81] proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and a large, slow memory) needed to perform dense, n-by-n matrix multiplication using the conventional O(n³) algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin [ITT04] gave a new proof of this result and extended it to the parallel case (where communication means the amount of data moved between processors). In both cases the lower bound may be expressed as Ω(#arithmetic operations / √M), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDLᵀ factorization, QR factorization, and algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth) we get lower bounds on the number of messages required to move it (latency). We illustrate how to extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or if we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph-theoretic problems. We point out recently designed algorithms for dense LU, Cholesky, QR, eigenvalue and SVD problems that attain these lower bounds; implementations of LU and QR show large speedups over conventional linear algebra algorithms in standard libraries like LAPACK and ScaLAPACK. Many open problems remain.
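The Ω(#arithmetic operations / √M) bound is concrete enough to check with back-of-the-envelope arithmetic. The sketch below (illustrative only, not from the paper; function names are ours and constant factors are dropped) compares the lower bound with the approximate data traffic of a square-blocked matrix multiply tiled for a fast memory of M words:

```python
import math

def comm_lower_bound(n, M):
    """Order-of-magnitude communication lower bound for n-by-n matmul
    with a fast memory of M words (constant factors omitted)."""
    return n**3 / math.sqrt(M)

def blocked_matmul_traffic(n, M):
    """Approximate words moved by a square-blocked matmul that keeps
    three b-by-b tiles resident, with b chosen so 3*b*b <= M."""
    b = math.floor(math.sqrt(M / 3))
    # (n/b)^3 tile multiplies, each touching O(b^2) words of each tile
    return (n / b)**3 * 3 * b * b

n, M = 4096, 1 << 20          # 4096x4096 matrices, 1 Mi-word fast memory
lb = comm_lower_bound(n, M)
traffic = blocked_matmul_traffic(n, M)
print(f"lower bound ~ {lb:.3e} words, blocked traffic ~ {traffic:.3e} words")
```

The blocked algorithm's traffic lands within a small constant factor of the bound, which is the sense in which tiling is communication-optimal for matmul.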
Parallel and fully recursive multifrontal supernodal sparse Cholesky
 Future Generation Computer Systems
, 2004
"... We describe the design, implementation, and performance of a new parallel sparse Cholesky factorization code. The code uses a multifrontal factorization strategy. Operations on small dense submatrices are performed using new densematrix subroutines that are part of the code, although the code can a ..."
Abstract

Cited by 15 (3 self)
 Add to MetaCart
We describe the design, implementation, and performance of a new parallel sparse Cholesky factorization code. The code uses a multifrontal factorization strategy. Operations on small dense submatrices are performed using new dense-matrix subroutines that are part of the code, although the code can also use the BLAS and LAPACK. The new code is recursive at both the sparse and the dense levels, it uses a novel recursive data layout for dense submatrices, and it is parallelized using Cilk, an extension of C specifically designed to parallelize recursive codes. We demonstrate that the new code performs well and scales well on SMPs. In particular, on up to 16 processors, the code outperforms two state-of-the-art message-passing codes. The scalability and high performance that the code achieves imply that recursive schedules, blocked data layouts, and dynamic scheduling are effective in the implementation of sparse factorization codes.
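To illustrate the recursive-at-the-dense-level idea, here is a toy recursive Cholesky factorization in pure Python. This is a hypothetical simplification: a real code recurses down to blocked level-3 BLAS kernels rather than to scalars, and stores submatrices in the recursive layouts described above.

```python
def cholesky_recursive(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite
    matrix A (list of lists), computed by recursive 2x2 blocking."""
    n = len(A)
    if n == 1:
        return [[A[0][0] ** 0.5]]
    k = n // 2
    A11 = [row[:k] for row in A[:k]]
    A21 = [row[:k] for row in A[k:]]
    A22 = [row[k:] for row in A[k:]]
    L11 = cholesky_recursive(A11)
    # L21 = A21 * L11^{-T}: forward substitution against L11, row by row
    L21 = [[0.0] * k for _ in range(n - k)]
    for i in range(n - k):
        for j in range(k):
            s = A21[i][j] - sum(L21[i][p] * L11[j][p] for p in range(j))
            L21[i][j] = s / L11[j][j]
    # Schur complement A22 - L21 * L21^T, then recurse on it
    S = [[A22[i][j] - sum(L21[i][p] * L21[j][p] for p in range(k))
          for j in range(n - k)] for i in range(n - k)]
    L22 = cholesky_recursive(S)
    # Assemble [[L11, 0], [L21, L22]]
    L = [[0.0] * n for _ in range(n)]
    for i in range(k):
        for j in range(k):
            L[i][j] = L11[i][j]
    for i in range(n - k):
        for j in range(k):
            L[k + i][j] = L21[i][j]
        for j in range(n - k):
            L[k + i][k + j] = L22[i][j]
    return L
```

The recursion itself defines a schedule: each half is factored before the Schur complement is touched, which is the property the paper exploits when handing the recursion tree to Cilk.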
SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures
 G. Quintana-Ortí et al. In SPAA ’07: Proceedings of the Nineteenth ACM Symposium on Parallelism in Algorithms and Architectures
"... We discuss the highperformance parallel implementation and execution of dense linear algebra matrix operations on SMP architectures, with an eye towards multicore processors with many cores. We argue that traditional implementations, as those incorporated in LAPACK, cannot be easily modified to re ..."
Abstract

Cited by 8 (3 self)
 Add to MetaCart
(Show Context)
We discuss the high-performance parallel implementation and execution of dense linear algebra matrix operations on SMP architectures, with an eye towards multi-core processors with many cores. We argue that traditional implementations, such as those incorporated in LAPACK, cannot be easily modified to render high performance as well as scalability on these architectures. The solution we propose is to arrange the data structures and algorithms so that matrix blocks become the fundamental units of data, and operations on these blocks become the fundamental units of computation, resulting in algorithms-by-blocks as opposed to the more traditional blocked algorithms. We show that this facilitates the adoption of techniques akin to the dynamic scheduling and out-of-order execution usual in superscalar processors, which we name SuperMatrix out-of-order scheduling. Performance results on a 16-CPU Itanium2-based server are used to highlight opportunities and issues related to this new approach.
The relevance of New Data Structure Approaches for Dense Linear Algebra in the new Multicore/Manycore Environments
 IBM RC Report 24599 (July), IBM Research
"... been applying recursion and New Data Structures (NDS) to increase the performance of Dense Linear Algebra (DLA) factorization algorithms. Later, John Gunnels, and later still, Jim Sexton, both now at IBM Research also began working in this area. For about three years now almost all computer manufact ..."
Abstract

Cited by 7 (2 self)
 Add to MetaCart
(Show Context)
... been applying recursion and New Data Structures (NDS) to increase the performance of Dense Linear Algebra (DLA) factorization algorithms. Later, John Gunnels, and later still, Jim Sexton, both now at IBM Research, also began working in this area. For about three years now almost all computer manufacturers have dramatically changed their computer architectures, which they call multi-core (MC). It turns out that these new designs give poor performance for the traditional designs of DLA libraries such as LAPACK and ScaLAPACK. Recent results of Jack Dongarra’s group at the Innovative Computing Laboratory in Knoxville, Tennessee have shown how to obtain high performance for DLA factorization algorithms on the Cell architecture, an example of an MC processor, but only when they used NDS. In this talk we will give some reasons why this is so.
The Opie Compiler: from Row-major Source to Morton-ordered Matrices
, 2004
"... The Opie Project aims to develop a compiler to transform C codes written for rowmajor matrix representation into equivalentcodes for Mortonorder matrix representation, and to apply its techniques to other languages. Accepting a possible reduction in performance weseek to compile libraries of u ..."
Abstract

Cited by 3 (1 self)
 Add to MetaCart
(Show Context)
The Opie Project aims to develop a compiler to transform C codes written for row-major matrix representation into equivalent codes for Morton-order matrix representation, and to apply its techniques to other languages. Accepting a possible reduction in performance, we seek to compile libraries of usable code to support future development of new algorithms better suited to Morton-ordered matrices.
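The Morton (Z-order) layout such a compiler targets maps element (i, j) to a flat offset by interleaving the bits of the two indices, so that each quadrant of the matrix is stored contiguously at every scale. A minimal sketch (our own illustration, assuming a power-of-two matrix dimension):

```python
def morton_index(i, j, bits=16):
    """Interleave the bits of row index i and column index j to get the
    Morton (Z-order) offset of element (i, j) in a 2^bits x 2^bits matrix."""
    z = 0
    for b in range(bits):
        z |= ((i >> b) & 1) << (2 * b + 1)   # row bits go to odd positions
        z |= ((j >> b) & 1) << (2 * b)       # column bits to even positions
    return z

# Rewriting a row-major 4x4 matrix into Morton order, the kind of layout
# change Opie performs on array accesses at compile time
n = 4
rowmajor = [[i * n + j for j in range(n)] for i in range(n)]
flat = [0] * (n * n)
for i in range(n):
    for j in range(n):
        flat[morton_index(i, j)] = rowmajor[i][j]
```

After the rewrite, the first four entries of `flat` are exactly the top-left 2×2 quadrant, which is why recursive blocked algorithms get contiguous tiles for free.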
A faster and simpler recursive algorithm for the LAPACK routine DGELS
 BIT
, 2001
"... We present new algorithms for computing the linear least squares solution to overdetermined linear systems and the minimum norm solution to underdetermined linear systems. For both problems, we consider the standard formulation min �AX − B�F and the transposed formulation min �A T X − B�F, i.e, four ..."
Abstract

Cited by 3 (1 self)
 Add to MetaCart
(Show Context)
We present new algorithms for computing the linear least squares solution to overdetermined linear systems and the minimum norm solution to underdetermined linear systems. For both problems, we consider the standard formulation min ‖AX − B‖_F and the transposed formulation min ‖AᵀX − B‖_F, i.e., four different problems in all. The functionality of our implementation corresponds to that of the LAPACK routine DGELS. The new implementation is significantly faster and simpler. It outperforms LAPACK DGELS for all matrix sizes tested. The improvement is usually 50–100% and it is as high as 400%. The four different problems of DGELS are essentially reduced to two by use of explicit transposition of A. By explicit transposition we avoid computing Householder transformations on vectors with large stride. The QR factorization of block columns of A is performed using a recursive level-3 algorithm. By interleaving updates of B with the factorization of A, we reduce the number of floating point operations performed for the linear least squares problem. By avoiding redundant computations in the update of B we reduce the work needed to compute the minimum norm solution. Finally, we outline fully recursive algorithms for the four problems of DGELS as well as for QR factorization.
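The core of such a solver, with the update of the right-hand side interleaved into the factorization as the abstract describes, can be sketched in pure Python. This is an unblocked, non-recursive simplification of the actual level-3 algorithm, with a single right-hand side and a hypothetical function name:

```python
def householder_qr_solve(A, b):
    """Least-squares solution of min ||A x - b||_2 via Householder QR,
    interleaving the update of b with the factorization of A.
    A is m x n (lists of lists) with m >= n and full column rank."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    y = b[:]
    for k in range(n):
        # Householder vector annihilating the subdiagonal of column k
        norm = sum(R[i][k] ** 2 for i in range(k, m)) ** 0.5
        alpha = -norm if R[k][k] >= 0 else norm
        v = [0.0] * m
        v[k] = R[k][k] - alpha
        for i in range(k + 1, m):
            v[i] = R[i][k]
        vtv = sum(vi * vi for vi in v)
        if vtv == 0.0:
            continue
        # Apply H = I - 2 v v^T / (v^T v) to the trailing columns of R ...
        for j in range(k, n):
            s = 2.0 * sum(v[i] * R[i][j] for i in range(k, m)) / vtv
            for i in range(k, m):
                R[i][j] -= s * v[i]
        # ... and, interleaved, to the right-hand side
        s = 2.0 * sum(v[i] * y[i] for i in range(k, m)) / vtv
        for i in range(k, m):
            y[i] -= s * v[i]
    # Back substitution on the n x n upper-triangular factor
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))) / R[i][i]
    return x
```

Because b is transformed inside the same loop that reduces A, no explicit Q is ever formed, which is the flop saving the abstract refers to.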
Rectangular full packed format for LAPACK algorithms timings on several computers
 In Applied Parallel Computing, State of the Art in Scientific Computing, PARA 2006, Volume LNCS 4699 (Springer-Verlag)
, 2007
"... Abstract. We describe a new data format for storing triangular and symmetric matrices called RFP (Rectangular Full Packed). The standard two dimensional arrays of Fortran and C (also known as full format) that are used to store triangular and symmetric matrices waste nearly half the storage space bu ..."
Abstract

Cited by 3 (3 self)
 Add to MetaCart
(Show Context)
We describe a new data format for storing triangular and symmetric matrices called RFP (Rectangular Full Packed). The standard two-dimensional arrays of Fortran and C (also known as full format) that are used to store triangular and symmetric matrices waste nearly half the storage space but provide high performance via the use of level-3 BLAS. Standard packed format arrays fully utilize storage (array space) but provide low performance as there are no level-3 packed BLAS. We combine the good features of packed and full storage using RFP format to obtain high performance using L3 (level-3) BLAS, as RFP is full format. Also, RFP format requires exactly the same minimal storage as packed format. Each full and/or packed symmetric/triangular routine becomes a single new RFP routine. We present LAPACK routines for Cholesky factorization, inverse and solution computation in RFP format to illustrate this new work and to describe its performance on the IBM, Itanium, NEC, and SUN platforms. Performance of RFP versus LAPACK full routines for both serial and SMP parallel processing is about the same while using half the storage. Compared with LAPACK packed routines, performance ranges from parity up to a factor of 33 faster for serial processing and up to a factor of 100 faster for SMP parallel processing. Existing LAPACK routines and vendor LAPACK routines were used in the serial and the SMP parallel study, respectively. In both studies vendor L3 BLAS were used.
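The storage arithmetic behind RFP is easy to verify: a rectangular array of the right shape holds exactly the n(n+1)/2 entries of an n × n triangular matrix, yet is a full-format array that level-3 BLAS can operate on. The sketch below checks only the shapes; the exact RFP index mapping is more involved and is not reproduced here.

```python
def rfp_shape(n):
    """Dimensions of a rectangular full-format array holding exactly the
    n(n+1)/2 entries of an n x n triangular matrix (the storage RFP uses)."""
    if n % 2 == 0:
        return (n + 1, n // 2)
    return (n, (n + 1) // 2)

for n in (4, 5, 1000):
    r, c = rfp_shape(n)
    packed = n * (n + 1) // 2   # minimal (packed-format) storage
    full = n * n                # full-format storage, ~half wasted
    assert r * c == packed
    print(f"n={n}: full={full}, packed={packed}, rfp={r}x{c}={r*c}")
```

So RFP pays the storage cost of packed format while keeping the rectangular, stride-friendly layout that full-format BLAS kernels require.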
Prospectus for the Next LAPACK and ScaLAPACK Libraries
"... Dense linear algebra (DLA) forms the core of many scientific computing applications. Consequently, there is continuous interest and demand for the development of increasingly better algorithms in the field. Here ’better ’ has a broad meaning, and includes improved reliability, accuracy, robustness, ..."
Abstract

Cited by 2 (0 self)
 Add to MetaCart
Dense linear algebra (DLA) forms the core of many scientific computing applications. Consequently, there is continuous interest and demand for the development of increasingly better algorithms in the field. Here ‘better’ has a broad meaning, and includes improved reliability, accuracy, robustness, ease of use, and ...