## Communication-optimal parallel and sequential QR and LU factorizations (2008)


Citations: 59 (27 self)

### BibTeX

@MISC{Demmel08communication-optimalparallel,
  author = {James Demmel and Laura Grigori and Mark Frederick Hoemmen and Julien Langou},
  title = {Communication-optimal parallel and sequential QR and LU factorizations},
  year = {2008}
}


### Citations

8795 | Introduction to Algorithms - Cormen, Leiserson, et al. - 1990 |

1071 | Using MPI: Portable Parallel Programming with the Message Passing Interface - Gropp, Lusk, et al. - 1999 |
Citation Context: ...across some number P of processors. A reduction leaves the final result on exactly one of the P processors; an all-reduction leaves a copy of the final result on all the processors. See, for example, [31]. In the sequential case, there is an analogous operation. Imagine that there are P “virtual processors.” To each one is assigned a certain amount of fast memory. Virtual processors communicate by sen... |
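The reduction/all-reduction distinction quoted in this context can be illustrated with a toy sketch (our own code, not the paper's): a binary-tree reduction that leaves the result on one "processor", and an all-reduction that leaves a copy on all of them.

```python
# Hypothetical sketch of reduce vs. all-reduce over P "processors",
# each holding one partial value; function names are ours.
def tree_reduce(values):
    """Combine values pairwise up a binary tree; the final result
    ends up in slot 0 only (a reduction)."""
    vals = list(values)
    stride = 1
    while stride < len(vals):
        for i in range(0, len(vals), 2 * stride):
            if i + stride < len(vals):
                vals[i] = vals[i] + vals[i + stride]  # combine with partner
        stride *= 2
    return vals[0]

def all_reduce(values):
    """An all-reduction leaves a copy of the result on every processor
    (here by broadcasting back; a butterfly pattern would also work)."""
    total = tree_reduce(values)
    return [total] * len(values)

x = tree_reduce([1, 2, 3, 4])   # 10 — result on "processor 0" only
y = all_reduce([1, 2, 3, 4])    # [10, 10, 10, 10] — a copy everywhere
```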

875 | Accuracy and stability of numerical algorithms - Higham - 1996 |

781 | The Symmetric Eigenvalue Problem - Parlett - 1980 |
Citation Context: ...s suffices to make the Q factor orthogonal to machine precision, as long as the input matrix is numerically full rank, i.e., if O(εκ(A)) < 1. This is the source of Kahan’s maxim, “Twice is enough” [48]: the accuracy reaches its theoretical best after one reorthogonalization pass (see also [1]), and further reorthogonalizations do not improve orthogonality. However, TSQR needs only half as many mess... |
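Kahan’s “Twice is enough” maxim quoted here can be illustrated with a minimal CGS2 sketch (classical Gram-Schmidt with one reorthogonalization pass). This is our own toy code, not the paper’s algorithm:

```python
import numpy as np

def cgs2(A):
    """Classical Gram-Schmidt, projecting each column against the
    previous Q columns twice ("twice is enough") before normalizing."""
    m, n = A.shape
    Q = np.zeros((m, n))
    for j in range(n):
        v = A[:, j].copy()
        for _ in range(2):  # the single reorthogonalization pass
            v -= Q[:, :j] @ (Q[:, :j].T @ v)
        Q[:, j] = v / np.linalg.norm(v)
    return Q
```

For a numerically full-rank input, the resulting ‖I − QᵀQ‖ is at machine-precision level, matching the quoted claim.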

431 | IBHVG2 — a user’s guide - Anderson, Fickie - 1987 |

386 | Gaussian elimination is not optimal - Strassen - 1968 |
Citation Context: ...mutative and associative reorderings of the conventional O(n³) matrix multiplication algorithm, and so excludes faster algorithms using distributivity or special constants, such as those of Strassen [59] or Coppersmith and Winograd [11], and their use in asymptotically fast versions of LU and QR [18]. Extending communication lower bounds to these asymptotically faster algorithms is an open problem. ... |

362 | ScaLAPACK Users’ Guide - Blackford, Choi, et al. - 1997 |
Citation Context: ...ms. First, we decompose the m × n matrix A into four m/4 × n block rows: A = [A0; A1; A2; A3].¹ Then, we independently compute the QR factorization of each block row... (¹The ScaLAPACK Users’ Guide has a good explanation of 1-D and 2-D block and block cyclic layouts of dense matrices [7]. In particular, refer to the section entitled “Details of Example Program #1.”) |
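The block-row decomposition described in this snippet is the first step of TSQR. A minimal two-level NumPy sketch (our own naming, not the paper’s code): factor each of the four block rows independently, then factor the stacked R factors.

```python
import numpy as np

def tsqr_two_level(A):
    """Two-level TSQR sketch: local QRs of block rows A0..A3,
    then one combining QR of the stacked R factors."""
    m, n = A.shape
    blocks = np.array_split(A, 4, axis=0)        # A0, A1, A2, A3
    qrs = [np.linalg.qr(Ai) for Ai in blocks]    # independent local QRs
    R_stack = np.vstack([R for _, R in qrs])     # 4n-by-n stacked R factors
    Q2, R = np.linalg.qr(R_stack)                # combine (root of the tree)
    # Recover the implicit Q of A: each block of Q is Qi times a slice of Q2.
    Q = np.vstack([Qi @ Q2[i*n:(i+1)*n, :] for i, (Qi, _) in enumerate(qrs)])
    return Q, R
```

A flat (one-level) tree is shown; the paper’s point is that the combining step can be arranged on any reduction tree.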

279 | Space-Filling Curves - Sagan - 1994 |
Citation Context: ...restrictions on the function h we are allowed to use, it is easy to see that we can always choose n = 1, i.e. send the least possible amount of information: We do this by using a space-filling curve [53] to represent each y^(n) ∈ R^n by one of several preimages ỹ ∈ R. In other words, h^(1)(y^(n)) maps y^(n) to a scalar ỹ that P1 can map back to y^(n) by a space-filling curve. This is obviously unr... |

173 | I/O complexity: The red-blue pebble game - Hong, Kung - 1981 |
Citation Context: ...plete [33]. The only known QR factorization algorithm in arithmetic NC (see [11]) is numerically highly unstable [14], and no work suggests that a stable arithmetic NC algorithm exists. Hong and Kung [28] and Irony, Toledo, and Tiskin [27] proved lower bounds on communication for sequential and parallel matrix multiplication. We are, as far as we know, the first to attempt extending these bounds to... |

136 | A cellular computer to implement the Kalman Filter Algorithm - Cannon - 1969 |
Citation Context: ...lower bound for sequential matrix multiplication is attained by “tiling” or “blocking” the matrices into square blocks of dimension √(W/3), and for parallel matrix multiplication by Cannon’s algorithm [8]. 5.2. Lower Bounds for CAQR. Now we need to extend our analysis of matrix multiplication. We assume all variables are real; extensions to the complex case are straightforward. Suppose A = QR is m-by-... |
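The tiling that the snippet says attains the sequential lower bound can be sketched directly: choose block size b ≈ √(W/3) so that one tile each of A, B, and C fit in a fast memory of W words. This illustration (our own code) only models the loop structure, not an actual memory hierarchy:

```python
import numpy as np

def blocked_matmul(A, B, W):
    """Multiply square matrices in b-by-b tiles with b = floor(sqrt(W/3)),
    so the three tiles touched by each update fit in W words of fast memory."""
    n = A.shape[0]
    b = max(1, int((W / 3) ** 0.5))
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # one tile of A, one of B, one of C "resident" at a time
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C
```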

121 | A class of parallel tiled linear algebra algorithms for multicore architectures - Buttari, Langou, et al. |
Citation Context: ...s of reduction trees will let us optimize TSQR for more general architectures. Now we briefly describe related work and our contributions. The tree-based QR idea itself is not novel (see, for example, [8, 15, 27, 32, 39, 49, 51, 52]), but we have a number of optimizations and generalizations: • Our algorithm can perform almost all its floating-point operations using any fast sequential QR factorization routine. In particular, we... |

102 | The block conjugate gradient algorithm and related methods - O’Leary - 1980 |
Citation Context: ...tly compute the QR factorization of a tall and skinny dense matrix. This includes algorithms for solving linear systems Ax = B with multiple right-hand sides (such as variants of GMRES, QMR, or CG [62, 24, 47]), as well as block iterative eigensolvers (for a summary of such methods, see [4, 40]). Many of these methods have widely used implementations, on which a large community of scientists and engineers ... |

97 | Locality of reference in LU decomposition with partial pivoting - Toledo - 1997 |
Citation Context: ...lternative LU algorithms in the literature that attain at least some of these communication lower bounds: [30] describes a parallel LU algorithm attaining both bandwidth and latency lower bounds, and [60] describes a sequential LU algorithm that at least attains the bandwidth lower bound. • We outline how to extend both algorithms and optimality results to certain kinds of hierarchical architectures, ... |

89 | Fast parallel matrix inversion algorithms - Csanky - 1976 |
Citation Context: ...od (“CholeskyQR”) in detail in Section 9. Csanky shows arithmetic NC algorithms for inverting matrices and solving linear systems, and matrix-matrix multiplication also has an arithmetic NC algorithm [14]. Thus, we could construct a version of CholeskyQR that is in arithmetic NC. However, this method is highly inaccurate in floating-point arithmetic. Not only is CholeskyQR itself inaccurate (see Secti... |

78 | Recursive blocked algorithms and hybrid data structures for dense matrix library software - Elmroth, Gustavson, et al. |
Citation Context: ...roblems, where the overhead would dominate.) Second, it suggests that the data structure with which the matrix is stored should be hierarchical as well, with matrices stored as subblocks of subblocks [23]. This is certainly possible, but it differs significantly from the usual data structures to which users are accustomed. It also suggests that recent approaches based on decomposing dense linear algeb... |

66 | A block QMR algorithm for non-Hermitian linear systems with multiple right-hand sides, Linear Algebra Appl. 254 - Freund, Malhotra - 1997 |
Citation Context: ...tly compute the QR factorization of a tall and skinny dense matrix. This includes algorithms for solving linear systems Ax = B with multiple right-hand sides (such as variants of GMRES, QMR, or CG [62, 24, 47]), as well as block iterative eigensolvers (for a summary of such methods, see [4, 40]). Many of these methods have widely used implementations, on which a large community of scientists and engineers ... |

51 | Communication lower bounds for distributed-memory matrix multiplication - Irony, Toledo, et al. |
Citation Context: ...are superior in both theory and practice. We have extended known lower bounds on communication for sequential and parallel matrix multiplication (see Hong and Kung [35] and Irony, Toledo, and Tiskin [34]) to QR decomposition, and shown both that the new algorithms attain these lower bounds (sometimes only up to polylogarithmic factors), whereas existing LAPACK and ScaLAPACK algorithms perform asympto... |

49 | Parallel out-of-core computation and updating the QR factorization - Gunter, van de Geijn |
Citation Context: ...s of reduction trees will let us optimize TSQR for more general architectures. Now we briefly describe related work and our contributions. The tree-based QR idea itself is not novel (see, for example, [8, 15, 27, 32, 39, 49, 51, 52]), but we have a number of optimizations and generalizations: • Our algorithm can perform almost all its floating-point operations using any fast sequential QR factorization routine. In particular, we... |

46 | Etude de quelques méthodes de résolution de problèmes linéaires de grande taille sur multiprocesseur - Vital - 1990 |
Citation Context: ...tly compute the QR factorization of a tall and skinny dense matrix. This includes algorithms for solving linear systems Ax = B with multiple right-hand sides (such as variants of GMRES, QMR, or CG [62, 24, 47]), as well as block iterative eigensolvers (for a summary of such methods, see [4, 40]). Many of these methods have widely used implementations, on which a large community of scientists and engineers ... |

45 | On the asymptotic complexity of matrix multiplication - Coppersmith, Winograd - 1982 |
Citation Context: ...ngs of the conventional O(n³) matrix multiplication algorithm, and so excludes faster algorithms using distributivity or special constants, such as those of Strassen [59] or Coppersmith and Winograd [11], and their use in asymptotically fast versions of LU and QR [18]. Extending communication lower bounds to these asymptotically faster algorithms is an open problem. 17.1 Matrix Multiplication Lower B... |

45 | An inequality related to the isoperimetric inequality - Loomis, Whitney - 1949 |
Citation Context: ...nt is that the multiplication at (i, j, k) cannot occur unless Q̃(j,i) and Ã(j,k) are together in memory. Finally, we need the Loomis–Whitney inequality [42]: Suppose V is a set of lattice points in 3D, Vi is the projection of V along i onto the (j, k) plane, and similarly for Vj and Vk. Let |V| denote the cardinality of V, i.e., counting lattice points. The... |
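The inequality this snippet truncates can be stated cleanly: for a finite set V of lattice points in 3D with orthogonal projections Vi, Vj, Vk onto the three coordinate planes,

```latex
|V| \;\le\; \sqrt{\,|V_i|\cdot|V_j|\cdot|V_k|\,}
```

This is the bound that converts "which operands are together in memory" into a limit on how many (i, j, k) multiplications one segment of the computation can perform, which is how the communication lower bounds are derived.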

39 | Extending the Hong-Kung model to memory hierarchies - Savage - 1995 |

39 | Solving linear least squares problems by Gram-Schmidt orthogonalization - Björck - 1967 |
Citation Context: ...erations by a hardware-dependent factor.

| Algorithm | ‖I − QᵀQ‖₂ bound | Assumption on κ(A) | Reference(s) |
| --- | --- | --- | --- |
| Householder QR | O(ε) | None | [26] |
| TSQR | O(ε) | None | [26] |
| CGS2 | O(ε) | O(εκ(A)) < 1 | [1, 36] |
| MGS | O(εκ(A)) | None | [6] |
| CholeskyQR | O(εκ(A)²) | None | [58] |
| CGS | O(εκ(A)^(n−1)) | None | [36, 56] |

Table 12: Upper bounds on deviation from orthogonality of the Q factor from various QR algorithms. Machine precision is ε. “Assumption... |

36 | On stable parallel linear systems solvers - Sameh, Kuck - 1978 |
Citation Context: ...to the amount of parallelism that can be extracted, regardless of the cost of communication. A number of authors have developed parallel QR factorization methods based on Givens rotations (see, e.g., [54, 45, 13]). Givens rotations are a good model for a large class of QR factorizations, including those based on Householder reflections. This is because all such algorithms have a similar dataflow graph (see e.... |
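The Givens-rotation QR factorizations this context discusses can be sketched sequentially; the parallel methods in [54, 45, 13] reorder these same rotations. A minimal illustration (our own code, not any paper’s implementation):

```python
import numpy as np

def givens(a, b):
    """Return (c, s) with [[c, s], [-s, c]] @ [a, b] = [r, 0]."""
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0 else (a / r, b / r)

def givens_qr(A):
    """QR of A, zeroing subdiagonal entries one Givens rotation at a time,
    bottom-up within each column."""
    R = A.astype(float).copy()
    m, n = R.shape
    Q = np.eye(m)
    for j in range(n):
        for i in range(m - 1, j, -1):      # zero R[i, j] against R[i-1, j]
            c, s = givens(R[i-1, j], R[i, j])
            G = np.array([[c, s], [-s, c]])
            R[i-1:i+1, :] = G @ R[i-1:i+1, :]   # rotate the two rows
            Q[:, i-1:i+1] = Q[:, i-1:i+1] @ G.T  # accumulate Q
    return Q, R
```

Rotations acting on disjoint row pairs commute, which is exactly the freedom the parallel orderings exploit.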

31 | The design and implementation - Choi, Dongarra, et al. - 1995 |
Citation Context: ...quential Householder QR, so it pays to model a block column version. 9.3.1 Parallel Householder QR: ScaLAPACK’s parallel QR factorization routine, PDGEQRF, uses a right-looking Householder QR approach [10]. The cost of PDGEQRF depends on how the original matrix A is distributed across the processors. For comparison with TSQR, we assume the same block row layout on P processors. PDGEQRF computes an expl... |

31 | New serial and parallel recursive QR factorization algorithms for SMP systems - Elmroth, Gustavson |
Citation Context: ...m almost all its floating-point operations using any fast sequential QR factorization routine. In particular, we can achieve significant speedups by invoking Elmroth and Gustavson’s recursive QR (see [21, 22]). • We apply TSQR to the parallel factorization of arbitrary rectangular matrices in a two-dimensional block cyclic layout. • We adapt TSQR to work on general reduction trees. This flexibility lets s... |

31 | Predicting structure in nonsymmetric sparse matrix factorizations - Gilbert, Ng - 1993 |
Citation Context: ...is true for all the matrix structures encountered in the local QR factorizations in this report. A number of authors discuss how to predict fill in general sparse QR factorizations; see, for example, [25]. We do not need this level of generality, since the structures we exploit do not cause fill. The factorization proceeds column-by-column, starting from the left. For each column, two operations a... |

30 | Communication avoiding Gaussian elimination - Grigori, Demmel, et al. - 2008 |
Citation Context: ...ns minimum bandwidth and flop count, though not minimum latency. • We observe that there are alternative LU algorithms in the literature that attain at least some of these communication lower bounds: [30] describes a parallel LU algorithm attaining both bandwidth and latency lower bounds, and [60] describes a sequential LU algorithm that at least attains the bandwidth lower bound. • We outline how to ... |

28 | The design and implementation of the parallel out-of-core ScaLAPACK LU, QR, and Cholesky factorization routines - Dongarra, D’Azevedo - 1997 |
Citation Context: ...CGS2 because it is no slower than applying CGS twice, but the number of orthogonalization steps may vary based on the numerical properties of the input, so it is hard to predict performance a priori [16]. It uses ScaLAPACK’s parallel QR factorization PDGEQRF to perform the current panel factorization in DRAM. Thus, it is able to exploit parallelism. We assume here, though, that it is running sequenti... |

28 | On the complexity of matrix product - Raz |
Citation Context: ...based on Strassen’s algorithm, or indeed of any matrix multiplication algorithm, based on Raz’s theorem converting any matrix multiplication algorithm to be “Strassen-like” (bilinear noncommutative) [42]. But the following question is of more practical importance. Our TSQR and CAQR algorithms have been described and analyzed in most detail for simple machine models: either sequential with two levels ... |

27 | Distributed orthogonal factorization: Givens and Householder algorithms - Pothen, Raghavan - 1989 |
Citation Context: ...s of reduction trees will let us optimize TSQR for more general architectures. Now we briefly describe related work and our contributions. The tree-based QR idea itself is not novel (see, for example, [8, 15, 27, 32, 39, 49, 51, 52]), but we have a number of optimizations and generalizations: • Our algorithm can perform almost all its floating-point operations using any fast sequential QR factorization routine. In particular, we... |

26 | Fast linear algebra is stable - Demmel, Dumitriu, et al. |
Citation Context: ...nd so excludes faster algorithms using distributivity or special constants, such as those of Strassen [59] or Coppersmith and Winograd [11], and their use in asymptotically fast versions of LU and QR [18]. Extending communication lower bounds to these asymptotically faster algorithms is an open problem. 17.1 Matrix Multiplication Lower Bounds: Hong and Kung [35], and later Irony, Toledo and Tiskin [34]... |

22 | A recursive formulation of Cholesky factorization of a matrix in packed storage - Andersen, Gustavson, et al. - 2001 |

21 | Automatic generation of block-recursive codes - Ahmed, Pingali - 2000 |

21 | Communication-avoiding Krylov subspace methods - Hoemmen - 2010 |
Citation Context: ...methods, in which some number s of steps of the algorithm are performed all at once, in order to reduce communication. Demmel et al. review the existing literature and discuss new advances in this area [19]. Such a method begins with an n × n matrix A and a starting vector v, and generates some basis for the Krylov subspace span{v, Av, A²v, ..., Aˢv}, using a small number of communication steps tha... |
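The Krylov basis this context describes is easy to state in code. The sketch below (our own, illustrative only) generates the monomial basis [v, Av, ..., Aˢv]; a communication-avoiding "matrix powers" kernel would block this computation by rows, which we do not show:

```python
import numpy as np

def krylov_basis(A, v, s):
    """Return the n-by-(s+1) matrix whose columns are v, Av, ..., A^s v
    (the monomial Krylov basis; real kernels use better-conditioned bases)."""
    V = np.empty((len(v), s + 1))
    V[:, 0] = v
    for k in range(s):
        V[:, k + 1] = A @ V[:, k]
    return V
```

An s-step method would then orthogonalize all s+1 columns at once, e.g. with TSQR, instead of one vector per iteration.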

19 | Std 754-2008, Standard for Floating-Point Arithmetic - IEEE - 2008 |

19 | Communication-avoiding parallel and sequential QR and LU factorizations: theory and practice - Demmel, Grigori, et al. - 2008 |
Citation Context: ...mizations, performance models, comparisons to LAPACK and ScaLAPACK, and how it can be adapted to other architectures. Section 3 presents CAQR analogously. (This paper is based on the technical report [16], to which we leave many of the detailed derivations of the performance models.) Section 4 presents our lower bounds for TSQR, and Section 5 for CAQR (as well as LU). Section 6 describes related work.... |

17 | QR factorization for the CELL processor - Kurzak, Dongarra - 2009 |

16 | Scheduling of QR factorization algorithms on SMP and multi-core architectures - Quintana-Ortí, Quintana-Ortí, van de Geijn, et al. - 2008 |

15 | Some issues in dense linear algebra for multicore and special purpose architectures. LAPACK Working - Baboulin, Dongarra, et al. |
Citation Context: ...differs significantly from the usual data structures to which users are accustomed. It also suggests that recent approaches based on decomposing dense linear algebra operations into DAGs of subtasks [8, 2, 39, 51, 50] may need to be hierarchical, rather than have a single layer of tasks. A single layer is a good match for the single-socket multicore architectures that motivate these systems, but may not scale ... |

12 | Trading off parallelism and numerical stability - Demmel - 1993 |
Citation Context: ...see Section 10), Demmel observes that Csanky’s arithmetic NC linear solver is so unstable that it loses all significant digits when inverting 3I_{n×n} in IEEE 754 double-precision arithmetic, for n ≥ 60 [17]. As far as we know, there exists no stable, practical QR factorization that is in arithmetic NC. 19 Extending algorithms and optimality proofs to general architectures: Our TSQR and CAQR algorithms ha... |

11 | Minimal-storage high-performance Cholesky factorization via blocking and recursion - Gustavson, Jonsson |

11 | Basis selection in LOBPCG - Hetmaniuk, Lehoucq |
Citation Context: ...5, 57]. Eigenvalue computation is particularly sensitive to the accuracy of the orthogonalization; two recent papers suggest that large-scale eigenvalue applications require a stable QR factorization [33, 38]. 3.2 s-step Krylov methods: Recent research has reawakened an interest in alternate formulations of Krylov subspace methods, called s-step Krylov methods, in which some number s of steps of the algorithm... |

11 | An Alternative Givens Ordering - Modi, Clarke - 1984 |
Citation Context: ...to the amount of parallelism that can be extracted, regardless of the cost of communication. A number of authors have developed parallel QR factorization methods based on Givens rotations (see, e.g., [54, 45, 13]). Givens rotations are a good model for a large class of QR factorizations, including those based on Householder reflections. This is because all such algorithms have a similar dataflow graph (see e.... |

9 | Algorithm 827: irbleigs: A MATLAB program for computing a few eigenpairs of a large sparse Hermitian matrix, ACM Trans. Math. - Baglama, Calvetti, et al. |
Citation Context: ...d engineers depend for their computational tasks. Examples include TRLAN (Thick Restart Lanczos), BLZPACK (Block Lanczos), Anasazi (various block methods), and PRIMME (block Jacobi-Davidson methods) [64, 44, 37, 3, 5, 57]. Eigenvalue computation is particularly sensitive to the accuracy of the orthogonalization; two recent papers suggest that large-scale eigenvalue applications require a stable QR factorization [33, 3... |

9 | Design of scalable dense linear algebra libraries for multithreaded architectures: the LU factorization - Quintana-Ortí, Quintana-Ortí, et al. |
Citation Context: ...differs significantly from the usual data structures to which users are accustomed. It also suggests that recent approaches based on decomposing dense linear algebra operations into DAGs of subtasks [8, 2, 39, 51, 50] may need to be hierarchical, rather than have a single layer of tasks. A single layer is a good match for the single-socket multicore architectures that motivate these systems, but may not scale ... |

9 | A block orthogonalization procedure with constant synchronization requirements - Stathopoulos, Wu |
Citation Context: ...different ways that favor certain applications over others. In this section, we model the performance of the following competitors to TSQR: four different Gram-Schmidt variants; CholeskyQR (see [58]); and Householder QR, with a block row layout. Each includes parallel and sequential versions. For Householder QR, we base our parallel model on the ScaLAPACK routine PDGEQRF, and the sequential model on... |
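CholeskyQR, mentioned here as a competitor to TSQR, is short enough to state in full. A minimal sketch (our own code, following the standard description): one Gram-matrix product, one small Cholesky factorization, one triangular solve.

```python
import numpy as np

def cholesky_qr(A):
    """CholeskyQR: Q = A R^{-1} where R is the Cholesky factor of A^T A.
    Cheap and communication-friendly, but the orthogonality error grows
    like O(eps * kappa(A)^2), so it is only safe for well-conditioned A."""
    G = A.T @ A                       # n-by-n Gram matrix (one reduction)
    R = np.linalg.cholesky(G).T       # upper-triangular factor: G = R^T R
    Q = np.linalg.solve(R.T, A.T).T   # Q = A R^{-1} via a triangular solve
    return Q, R
```

The squared condition number in the error bound is exactly why the surrounding text calls CholeskyQR inaccurate despite its low communication cost.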

8 | Round-off error analysis for Gram–Schmidt method and solution of linear least squares problems - Abdelmalek - 1971 |
Citation Context: ...the cost of arithmetic operations by a hardware-dependent factor. Algorithm / ‖I − QᵀQ‖₂ bound / Assumption on κ(A) / Reference(s): Householder QR, O(ε), None, [26]; TSQR, O(ε), None, [26]; CGS2, O(ε), O(εκ(A)) < 1, [1, 36]; MGS, O(εκ(A)), None, [6]; CholeskyQR, O(εκ(A)²), None, [58]; CGS, O(εκ(A)^(n−1)), None, [36, 56]. Table 12: Upper bounds on deviation from orthogonality of the Q factor from various QR algorithms. Machine precision is ε. “Assumption... |

8 | Parallel QR decomposition of a rectangular matrix, Numerische Mathematik 48 - Cosnard, Muller, et al. - 1986 |
Citation Context: ...trix. 18.1 Minimum critical path length: Cosnard, Muller, and Robert proved lower bounds on the critical path length Opt(m, n) of any parallel QR algorithm on an m × n matrix based on Givens rotations [12]. They assume any number of processors, any communication network, and any initial data distribution; in the extreme case, there may be mn processors, each with one element of the matrix. In their cl... |

8 | Block locally optimal preconditioned eigenvalue xolvers (BLOPEX) in hypre and PETSc - Knyazev, Argentati, et al. - 2007 |
Citation Context: ...5, 57]. Eigenvalue computation is particularly sensitive to the accuracy of the orthogonalization; two recent papers suggest that large-scale eigenvalue applications require a stable QR factorization [33, 38]. 3.2 s-step Krylov methods: Recent research has reawakened an interest in alternate formulations of Krylov subspace methods, called s-step Krylov methods, in which some number s of steps of the algorithm... |