## SuperLU DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems (2003)

Venue: ACM Trans. Mathematical Software

Citations: 94 (18 self)

### BibTeX

@ARTICLE{Li03superludist,

author = {Xiaoye S. Li and James W. Demmel},

title = {SuperLU DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems},

journal = {ACM Trans. Mathematical Software},

year = {2003},

volume = {29},

pages = {110--140}

}

### Abstract

We present the main algorithmic features in the software package SuperLU DIST, a distributed-memory sparse direct solver for large sets of linear equations. We give in detail our parallelization strategies, with a focus on scalability issues, and demonstrate the software's parallel performance and scalability on current machines. The solver is based on sparse Gaussian elimination, with an innovative static pivoting strategy proposed earlier by the authors. The main advantage of static pivoting over classical partial pivoting is that it permits a priori determination of data structures and communication patterns, which lets us exploit techniques used in parallel sparse Cholesky algorithms to better parallelize both LU decomposition and triangular solution on large-scale distributed machines.
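The abstract's central idea, factoring with a pivot order fixed a priori and recovering accuracy afterward, can be illustrated with a dense toy analogue. This is an illustrative sketch with invented function names, not the SuperLU_DIST implementation, which operates on distributed sparse data structures:

```python
import numpy as np

def lu_static(A):
    """LU factorization with NO row interchanges: the elimination order is
    fixed a priori, which is what lets a parallel code precompute its data
    structures and communication pattern (toy dense analogue of GESP)."""
    LU = np.array(A, dtype=float)
    n = LU.shape[0]
    for k in range(n - 1):
        LU[k+1:, k] /= LU[k, k]                       # multipliers form L
        LU[k+1:, k+1:] -= np.outer(LU[k+1:, k], LU[k, k+1:])
    return LU

def lu_solve(LU, b):
    """Forward/backward substitution with the packed L and U factors."""
    n = LU.shape[0]
    y = np.array(b, dtype=float)
    for i in range(n):                                # Ly = b (unit diagonal)
        y[i] -= LU[i, :i] @ y[:i]
    for i in range(n - 1, -1, -1):                    # Ux = y
        y[i] = (y[i] - LU[i, i+1:] @ y[i+1:]) / LU[i, i]
    return y

def gesp_like_solve(A, b, refine_steps=2):
    """Static-pivoting factorization followed by iterative refinement,
    which compensates for any pivot growth the fixed pivot order allowed."""
    LU = lu_static(A)
    x = lu_solve(LU, b)
    for _ in range(refine_steps):
        x += lu_solve(LU, b - A @ x)                  # correct via residual
    return x
```

In the real solver the residual-based correction (iterative refinement or GMRES preconditioned by the computed factors) is what makes the fixed pivot order numerically safe.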

### Citations

1381 | GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems
- Saad, Schultz
- 1986
Citation Context: ...e, when large pivot growth still occurs, there are inexpensive methods to tolerate and compensate for the growth, such as iterative methods preconditioned by the computed LU factors, of which GMRES [55] and iterative refinement are two examples. This observation led us to design a static pivoting factorization algorithm, called GESP [46]. We demonstrated that GESP works well for practical matrices. I...

550 | Applied Numerical Linear Algebra
- Demmel
- 1997
Citation Context: ...gnitude than the sum of magnitudes of the off-diagonal entries in its row (∑_{j≠i} |a_ij|) or column (∑_{j≠i} |a_ji|). It is known that choosing diagonal pivots ensures stability for such matrices [18, 33]. We therefore expect that if each diagonal entry can somehow be made larger relative to the off-diagonals in its row or column, then diagonal pivoting will be more stable. The purpose of step (1) is t...
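The dominance condition in this excerpt is simple to state in code; a minimal illustrative check (not from the paper):

```python
import numpy as np

def row_diagonally_dominant(A):
    """True if |a_ii| >= sum_{j != i} |a_ij| for every row i -- the
    condition under which diagonal pivoting is stable without interchanges."""
    d = np.abs(np.diag(A))
    off = np.abs(A).sum(axis=1) - d   # row sums of off-diagonal magnitudes
    return bool(np.all(d >= off))
```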

548 | Direct methods for sparse matrices
- Duff, Erisman, et al.
- 1986
Citation Context: ...ted than Cholesky for at least two reasons. First and foremost, some kind of numerical pivoting is necessary for stability. Classical partial pivoting [33] or the sparse variant of threshold pivoting [23] typically causes the fill-ins and workload to be generated dynamically during factorization. Therefore, we must either design dynamic data structures and algorithms to accommodate these fill-ins [3], or e...

512 | Computer solution of large sparse positive definite systems - George, Liu - 1981

361 | ScaLAPACK User’s Guide
- Blackford, Choi, et al.
- 1997
Citation Context: ...were demonstrated to be more scalable in the implementations for dense matrices [13] and sparse Cholesky factorization [37, 54]. We now describe the distributed data structures to store local submatrices. In the 2D blocking, each block column of L resides on more than one process, na...

311 | University of Florida Sparse Matrix Collection
- Davis
- 1997
Citation Context: ...variety of applications. The application domains of the matrices are given in Table 1. Most of them, except for wu, can be obtained from the Harwell-Boeing Collection [24] and the collection of Davis [16]. Matrix wu was provided by Yushu Wu from the Earth Sciences Division of Lawrence Berkeley National Laboratory. Figure 2 plots the dimension, nnz(A), and nnz(L+U) (i.e., the fill-ins, after the minimum ...

295 | METIS: A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices, version 4.0
- Karypis, Kumar
- 1998
Citation Context: ...processors. For all these matrices, the algorithm can efficiently use 128 processors. Beyond 128 processors, not all matrices can benefit from the additional processor power. Only bbmat with ND ordering [43] and ecl32 with AMD [2] can benefit from using 512 processors. Our lack of other large unsymmetric systems gives us few data points in this regime. To further analyse the scalability of our solvers, we...

266 | A Basic Tool Kit for Sparse Matrix Computations
- Saad
- 1990
Citation Context: ...), (2), (6), and the diagonal perturbation in step (4)). Now we turn to the second configuration of our algorithm, in which restarted GMRES [55] was used in step (6) (we used the version from SPARSKIT [56]). The restart value is 50. Here, our LU factorization is used in preconditioning for GMRES. The convergence test is based on residual norm: ||r_i||_2 ≤ rtol·||r_0||_2 + atol, where the relative t...
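The stopping rule quoted above, ||r_i||_2 ≤ rtol·||r_0||_2 + atol, mixes a relative and an absolute tolerance; a direct transcription (the default tolerance values below are arbitrary placeholders, not the ones used in the paper):

```python
import numpy as np

def gmres_converged(r_i, r_0, rtol=1e-8, atol=0.0):
    """Mixed stopping test: accept when the current residual is small
    relative to the initial residual (rtol) or small absolutely (atol)."""
    return np.linalg.norm(r_i) <= rtol * np.linalg.norm(r_0) + atol
```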

263 | A column approximate minimum degree ordering algorithm
- Davis, Gilbert, et al.
- 2004
Citation Context: ...f. The scheme is similar to the Markowitz scheme [49] but limits the pivot search to the entries on the main diagonal. The efficient implementation is similar to that of approximate minimum degree (AMD) [2], but it generalizes the (symmetric) quotient graph to the bipartite quotient graph to model the unsymmetric node elimination. The preliminary results show that the new ordering method reduces the amo...

235 | Users’ Guide for the Harwell-Boeing Sparse Matrix Collection, Research and Technology Division, Boeing Computer Services
- Duff, Grimes, et al.
- 1992
Citation Context: ...etric matrices drawn from a wide variety of applications. The application domains of the matrices are given in Table 1. Most of them, except for wu, can be obtained from the Harwell-Boeing Collection [24] and the collection of Davis [16]. Matrix wu was provided by Yushu Wu from the Earth Sciences Division of Lawrence Berkeley National Laboratory. Figure 2 plots the dimension, nnz(A), and nnz(L+U) (i.e...

196 | A supernodal approach to sparse partial pivoting
- Demmel, Eisenstat, et al.
- 1995
Citation Context: ...Despite these difficulties, researchers have been addressing these issues successfully for sequential and shared memory machines; available codes include MA41 [6, 5], PARDISO [57], SPOOLES [9], SuperLU [19], SuperLU MT [20], UMFPACK/MA38 [15], and WSMP [34]. In our earlier codes SuperLU (serial) and SuperLU MT (shared-memory), we devised efficient "symbolic" factorization algorithms to accommodate the dy...

181 | Nested dissection of a regular finite element mesh - George - 1973

165 | The Role of Elimination Trees in Sparse Factorization
- Liu
- 1990
Citation Context: ...structurally different yet closely related to each other in the filled pattern. Unlike the Cholesky factor whose minimum graph representation is a tree (called the elimination tree, or etree for short) [48], the minimum graph representations of the L and U factors are directed acyclic graphs (called elimination DAGs, or edags for short) [31, 32]. Despite these difficulties, researchers have been addressing...

157 | A fully asynchronous multifrontal solver using distributed dynamic scheduling
- Amestoy, Duff, et al.
- 2001
Citation Context: ...data structures and communication pattern. Researchers have been quite successful in achieving "scalable" performance for sparse Cholesky factorization; available codes include CAPSS [38], MUMPS-SYM [3], PaStix [40], PSLDLT [54], and PSPACES [36]. In contrast, for nonsymmetric or indefinite systems, few distributed-memory codes exist. They are more complicated than Cholesky for at least two reasons. ...

142 | Modification of the minimum degree algorithm by multiple elimination
- Liu
- 1985
Citation Context: ...A in step (1). Step (2) is standard in sparse direct solvers. The column permutation P_c can be obtained from any fill-reducing heuristic. In our code, we provide the minimum degree ordering algorithm [47] on the structure of A^T + A. The code can also take as input an ordering based on some other algorithm, such as the nested dissection on A^T + A [27, 39, ... Figure 1: The outline of the GESP algorithm...

119 | Highly scalable parallel algorithms for sparse matrix factorization
- Gupta, Karypis, et al.
- 1997
Citation Context: ...Researchers have been quite successful in achieving "scalable" performance for sparse Cholesky factorization; available codes include CAPSS [38], MUMPS-SYM [3], PaStix [40], PSLDLT [54], and PSPACES [36]. In contrast, for nonsymmetric or indefinite systems, few distributed-memory codes exist. They are more complicated than Cholesky for at least two reasons. First and foremost, some kind of numerical p...

88 | The elimination form of the inverse and its application to linear programming, Management Science
- Markowitz
- 1957
Citation Context: ...roposed a new symmetric ordering scheme that does not require any symmetrization of the underlying matrix; that is, it works directly on matrix A itself. The scheme is similar to the Markowitz scheme [49] but limits the pivot search to the entries on the main diagonal. The efficient implementation is similar to that of approximate minimum degree (AMD) [2], but it generalizes the (symmetric) quotient grap...

85 | Solving sparse linear systems with sparse backward error
- Arioli, Demmel, et al.
- 1989
Citation Context: ...of an iterative method like iterative refinement (shown) or GMRES [55] if the solution from step (5) is not accurate enough. The termination criterion is based on the componentwise backward error berr [8, 18]. The condition berr ≤ ε means that the computed solution is the exact solution of a slightly different sparse linear system (A + δA)x = b + δb, where δA changes only each nonzero entry a_ij by at most one...

77 | A Combined Unifrontal/Multifrontal Method for Unsymmetric Sparse Matrices
- Davis, Duff
- 1995
Citation Context: ...s have been addressing these issues successfully for sequential and shared memory machines; available codes include MA41 [6, 5], PARDISO [57], SPOOLES [9], SuperLU [19], SuperLU MT [20], UMFPACK/MA38 [15], and WSMP [34]. In our earlier codes SuperLU (serial) and SuperLU MT (shared-memory), we devised efficient "symbolic" factorization algorithms to accommodate the dynamically generated fill-ins due to pa...

74 | Introduction to Parallel Computing (Benjamin/Cummings) - Kumar, Grama, et al. - 1994

73 | The design and use of algorithms for permuting large entries to the diagonal of sparse matrices - Duff, Koster - 1999

71 | The Chaco user’s guide - version 1.0 - Leland - 1993

65 | An asynchronous parallel supernodal algorithm for sparse gaussian elimination
- Demmel, Gilbert, et al.
- 1997
Citation Context: ...culties, researchers have been addressing these issues successfully for sequential and shared memory machines; available codes include MA41 [6, 5], PARDISO [57], SPOOLES [9], SuperLU [19], SuperLU MT [20], UMFPACK/MA38 [15], and WSMP [34]. In our earlier codes SuperLU (serial) and SuperLU MT (shared-memory), we devised efficient "symbolic" factorization algorithms to accommodate the dynamically generat...

52 | Design, implementation and testing of extended and mixed precision blas
- Li, Demmel, et al.
Citation Context: ...gorithms will probably depend on the number of right-hand sides. Improve numerical robustness: more techniques can be used; these include performing iterative refinement with extra precise residuals [45] and using dynamic precision during the factorization; see Appendix A. Acknowledgments: We would like to thank Patrick Amestoy, Iain Duff, Jean-Yves L'Excellent and Rich Vuduc for very helpful discussio...

49 | A Collection of Fortran Codes for Large Scale Scientific Computation, 2002. Available from: http://www.cse.clrc.ac.uk/nag/hsl
- HSL
Citation Context: ...g matrices D_r and D_c, and the permutation P_r to make each a_ii larger in this sense. We have experimented with a number of heuristic algorithms implemented in the routine MC64 (available from HSL [41]) [22]. All depend on the following graph representation of an n × n sparse matrix A: it is represented as an undirected weighted bipartite graph with one vertex for each row, one vertex for each colu...

47 | Robust ordering of sparse matrices using multisection
- Ashcraft, Liu
- 1998
Citation Context: ...e been developed. Comparing SuperLU DIST with those solvers remains future work. SPOOLES is a supernodal, left-up-looking solver [9]. The fill-reducing ordering is a hybrid approach called multisection [10], which is applied to the structure of A^T + A. It performs threshold rook pivoting with both row and column interchanges. The task dependency graph is the elimination tree of A^T + A. S+ is a supernod...

45 | Computer Solution of Linear Algebraic Systems - Forsythe, Moler - 1967

41 | Predicting structure in sparse matrix computations
- Gilbert
- 1986
Citation Context: ...on is a tree (called the elimination tree, or etree for short) [48], the minimum graph representations of the L and U factors are directed acyclic graphs (called elimination DAGs, or edags for short) [31, 32]. Despite these difficulties, researchers have been addressing these issues successfully for sequential and shared memory machines; available codes include MA41 [6, 5], PARDISO [57], SPOOLES [9], SuperLU...

41 | Preconditioning highly indefinite and nonsymmetric matrices - Benzi, Haws, et al. - 2000

40 | Performance of panel and block approaches to sparse Cholesky factorization on the iPSC/860 and Paragon multicomputers
- Rothberg
- 1996
Citation Context: ...unication pattern. Researchers have been quite successful in achieving "scalable" performance for sparse Cholesky factorization; available codes include CAPSS [38], MUMPS-SYM [3], PaStix [40], PSLDLT [54], and PSPACES [36]. In contrast, for nonsymmetric or indefinite systems, few distributed-memory codes exist. They are more complicated than Cholesky for at least two reasons. First and foremost, some k...

39 | Symbolic factorization for sparse Gaussian elimination with partial pivoting
- George, Ng
- 1987
Citation Context: ...row interchanges during the factorization depend on the numerical values. However, for any row interchanges, the structures of L and U are subsets of the structures of H (or R^T) and R, respectively [28, 30]. Therefore, a good symmetric ordering P_c on A^T A (either based on minimum degree or nested dissection) that preserves the sparsity of R can be applied to the columns of A, forming A P_c^T, so that ...

37 | Elimination structures for unsymmetric sparse LU factors
- Gilbert, Liu
- 1993
Citation Context: ...on is a tree (called the elimination tree, or etree for short) [48], the minimum graph representations of the L and U factors are directed acyclic graphs (called elimination DAGs, or edags for short) [31, 32]. Despite these difficulties, researchers have been addressing these issues successfully for sequential and shared memory machines; available codes include MA41 [6, 5], PARDISO [57], SPOOLES [9], SuperLU...

37 | A new pivoting strategy for gaussian elimination
- OLSCHOWKA, NEUMAIER
- 1996
Citation Context: ...D_r and D_c simultaneously so that each diagonal entry of P_r D_r A D_c is ±1 and each off-diagonal entry is bounded by 1 in magnitude. The implementation is based on the algorithm by Olschowka and Neumaier [50]. We report results for this algorithm only. The worst-case serial complexity of this algorithm is O(n · nnz(A) · log n), where nnz(A) is the number of nonzeros in A. In practice it is much faster; th...
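The preprocessing goal described here, permute rows so the diagonal becomes as large as possible and then scale so diagonal entries have magnitude 1, can be demonstrated by brute force on a tiny matrix. MC64 optimizes the same product-of-diagonals objective with a weighted bipartite matching instead of enumeration; the helper names below are invented, and the one-sided row scaling is a simplification of the two-sided D_r, D_c scaling:

```python
import itertools
import numpy as np

def best_row_permutation(A):
    """Brute-force the row permutation p maximizing prod_i |a_{p(i), i}| --
    the objective MC64 optimizes via weighted bipartite matching."""
    n = A.shape[0]
    best, best_p = -1.0, None
    for p in itertools.permutations(range(n)):
        prod = np.prod([abs(A[p[i], i]) for i in range(n)])
        if prod > best:
            best, best_p = prod, p
    return list(best_p)

def scale_to_unit_diagonal(A, p):
    """Row-scale the permuted matrix so each diagonal entry has magnitude 1
    (a simplified one-sided stand-in for the D_r, D_c scaling pair)."""
    P = A[np.asarray(p), :]
    d = np.abs(np.diag(P))
    return P / d[:, None]
```

Enumeration is factorial in n and only viable for toy sizes; the point is the objective, not the algorithm.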

36 | Scalable iterative solution of sparse linear systems
- Jones, Plassmann
- 1994
Citation Context: ...the orderings that give wide and bushy elimination DAGs, such as nested dissection. To speed up the triangular solve, we may apply some graph coloring heuristic to reduce the number of parallel steps [42]. There are also alternative algorithms other than substitutions, such as those based on partitioned inversion [1] or selective inversion [51]. However, these algorithms usually require preprocessing ...

34 | Making sparse Gaussian elimination scalable by static pivoting
- Li, Demmel
- 1998
Citation Context: ...SuperLU DIST, which is targeted for large-scale distributed-memory machines, we use a static pivoting approach, called GESP (Gaussian Elimination with Static Pivoting), proposed earlier by the authors [46]. We parallelized the GESP algorithm using MPI. Our parallelization strategies center around the scalability concern. We use a 2D block-cyclic mapping of a sparse matrix to the processors, and designe...

29 | Efficient sparse LU factorization with partial pivoting on distributed memory architectures
- Fu, Jiao, et al.
- 1998
Citation Context: ...factorization. Therefore, we must either design dynamic data structures and algorithms to accommodate these fill-ins [3], or else use static data structures which can grossly overestimate the true fill-in [26, 35]. The second complication is the need to handle two factored matrices L and U, which are structurally different yet closely related to each other in the filled pattern. Unlike the Cholesky factor whose ...

28 | Highly parallel sparse triangular solution
- Alvarado, Pothen, et al.
- 1993
Citation Context: ...e, we may apply some graph coloring heuristic to reduce the number of parallel steps [42]. There are also alternative algorithms other than substitutions, such as those based on partitioned inversion [1] or selective inversion [51]. However, these algorithms usually require preprocessing or different matrix distributions than the one used in our factorization. Whether the preprocessing and redistribut...

27 | Memory Management Issues in Sparse Multifrontal Methods on Multiprocessors
- Amestoy, Duff
- 1993
Citation Context: ...ination DAGs, or edags for short) [31, 32]. Despite these difficulties, researchers have been addressing these issues successfully for sequential and shared memory machines; available codes include MA41 [6, 5], PARDISO [57], SPOOLES [9], SuperLU [19], SuperLU MT [20], UMFPACK/MA38 [15], and WSMP [34]. In our earlier codes SuperLU (serial) and SuperLU MT (shared-memory), we devised efficient "symbolic" facto...

27 | Efficient sparse LU factorization with left-right looking strategy on shared memory multiprocessors
- Schenk, Gärtner, et al.
Citation Context: ...edags for short) [31, 32]. Despite these difficulties, researchers have been addressing these issues successfully for sequential and shared memory machines; available codes include MA41 [6, 5], PARDISO [57], SPOOLES [9], SuperLU [19], SuperLU MT [20], UMFPACK/MA38 [15], and WSMP [34]. In our earlier codes SuperLU (serial) and SuperLU MT (shared-memory), we devised efficient "symbolic" factorization algor...

21 | Analysis and Comparison of two General Sparse Solvers for Distributed Memory Computers
- Amestoy, Duff, et al.
- 2001
Citation Context: ...process P_{K+1} on the critical path. This could happen if the sender and receiver are required to handshake before proceeding, as is the case with large messages that exceed the MPI internal buffer size [7]. That is, process P_{K+1} posts mpi_send long before processes PROC_r(K) post the matching mpi_recv, and the sender must be blocked to wait for mpi_recv. To avoid this synchronization cost, we introduc...

19 | An unsymmetrized multifrontal LU factorization
- Amestoy, Puglisi
Citation Context: ...ination DAGs, or edags for short) [31, 32]. Despite these difficulties, researchers have been addressing these issues successfully for sequential and shared memory machines; available codes include MA41 [6, 5], PARDISO [57], SPOOLES [9], SuperLU [19], SuperLU MT [20], UMFPACK/MA38 [15], and WSMP [34]. In our earlier codes SuperLU (serial) and SuperLU MT (shared-memory), we devised efficient "symbolic" facto...

19 | Computer Solution of Large Sparse Positive Definite Systems
- George, Liu
- 1981
Citation Context: ...nsider the 3D cubic grid problem using the standard nested dissection ordering, the fill in the factored matrix is O(N^{4/3}) and the number of floating-point operations to factorize the matrix is O(N^2) [29]. Let the P processors be arranged as a square process grid. In our parallel algorithm (Figure 12), each nonzero element is sent to at most √P processors. The total communication overhead is O(N^{4/3}...

18 | Performance of a fully parallel sparse solver
- Heath, Raghavan
- 1994
Citation Context: ...(2D) distributed data structures and communication pattern. Researchers have been quite successful in achieving "scalable" performance for sparse Cholesky factorization; available codes include CAPSS [38], MUMPS-SYM [3], PaStix [40], PSLDLT [54], and PSPACES [36]. In contrast, for nonsymmetric or indefinite systems, few distributed-memory codes exist. They are more complicated than Cholesky for at leas...

17 | Improved symbolic and numerical factorization algorithms for unsymmetric sparse matrices
- Gupta
Citation Context: ...factorization. Therefore, we must either design dynamic data structures and algorithms to accommodate these fill-ins [3], or else use static data structures which can grossly overestimate the true fill-in [26, 35]. The second complication is the need to handle two factored matrices L and U, which are structurally different yet closely related to each other in the filled pattern. Unlike the Cholesky factor whose ...

15 | The design and use of algorithms for permuting large entries to the diagonal of sparse matrices
- Duff, Koster
- 1997
Citation Context: ...column permutation algorithms in steps (1) and (2) (computing P_r and P_c) are not easy to parallelize (their parallelization is future work). Fortunately, their memory requirement is just O(nnz(A)) [17, 21], as opposed to the superlinear memory requirement for the L and U factors, so in the meantime we can run the ordering algorithms on a single processor. Figure 7 shows the times spent in the other steps o...

14 | Optimally scalable parallel sparse Cholesky factorization
- Gupta, Kumar
- 1995
Citation Context: ...were demonstrated to be more scalable in the implementations for dense matrices [13] and sparse Cholesky factorization [37, 54]. We now describe the distributed data structures to store local submatrices. In the 2D blocking, each block column of L resides on more than one process, namely, a column of processes. For example, i...

13 | A data structure for sparse QR and LU factorizations
- George, Liu, et al.
- 1988
Citation Context: ...row interchanges during the factorization depend on the numerical values. However, for any row interchanges, the structures of L and U are subsets of the structures of H (or R^T) and R, respectively [28, 30]. Therefore, a good symmetric ordering P_c on A^T A (either based on minimum degree or nested dissection) that preserves the sparsity of R can be applied to the columns of A, forming A P_c^T, so that ...

11 | A Mapping and Scheduling Algorithm for Parallel Sparse Fan-In Numerical Factorization
- Hénon, Ramet, et al.
- 1999
Citation Context: ...ures and communication pattern. Researchers have been quite successful in achieving "scalable" performance for sparse Cholesky factorization; available codes include CAPSS [38], MUMPS-SYM [3], PaStix [40], PSLDLT [54], and PSPACES [36]. In contrast, for nonsymmetric or indefinite systems, few distributed-memory codes exist. They are more complicated than Cholesky for at least two reasons. First and for...

9 | WSMP: Watson Sparse Matrix Package
- GUPTA
- 2000
Citation Context: ...ressing these issues successfully for sequential and shared memory machines; available codes include MA41 [6, 5], PARDISO [57], SPOOLES [9], SuperLU [19], SuperLU MT [20], UMFPACK/MA38 [15], and WSMP [34]. In our earlier codes SuperLU (serial) and SuperLU MT (shared-memory), we devised efficient "symbolic" factorization algorithms to accommodate the dynamically generated fill-ins due to partial pivoting...

8 | Diagonal Markowitz scheme with local symmetrization
- Amestoy, Li, et al.
Citation Context: ...n A^T + A may destroy the sparsity of matrix A, particularly when A is highly unsymmetric. Recently, motivated by the GESP algorithm and an unsymmetrized multifrontal method [5], Amestoy, Li and Ng [4] proposed a new symmetric ordering scheme that does not require any symmetrization of the underlying matrix; that is, it works directly on matrix A itself. The scheme is similar to the Markowitz schem...