Highly scalable parallel algorithms for sparse matrix factorization
IEEE Transactions on Parallel and Distributed Systems, 1994
Cited by 100 (29 self)
In this paper, we describe a scalable parallel algorithm for sparse matrix factorization, analyze its performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithm substantially improves the state of the art in parallel direct solution of sparse linear systems, both in terms of scalability and overall performance. It is a well-known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithm to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that is asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithm incurs less communication overhead and is more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithm can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of our sparse Cholesky factorization algorithm delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge,
Multifrontal Parallel Distributed Symmetric and Unsymmetric Solvers
1998
Cited by 83 (25 self)
We consider the solution of both symmetric and unsymmetric systems of sparse linear equations. A new parallel distributed memory multifrontal approach is described. To handle numerical pivoting efficiently, a parallel asynchronous algorithm with dynamic scheduling of the computing tasks has been developed. We discuss some of the main algorithmic choices and compare both implementation issues and the performance of the LDL^T and LU factorizations. Performance analysis on an IBM SP2 shows the efficiency and the potential of the method. The test problems used are from the Rutherford-Boeing collection and from the PARASOL end users.
Hybrid scheduling for the parallel solution of linear systems
Parallel Computing, 2006
Cited by 42 (6 self)
In this paper, we consider the problem of designing a dynamic scheduling strategy that takes into account both workload and memory information in the context of the parallel multifrontal factorization. The originality of our approach is that we base our estimations (work and memory) on a static optimistic scenario during the analysis phase. This scenario is then used during the factorization phase to constrain the dynamic decisions. The task scheduler has been redesigned to take into account these new features. Moreover, performance has been improved because the new constraints allow the new scheduler to make optimal decisions that were forbidden or too dangerous in unconstrained formulations. Performance analysis shows that the memory estimation becomes much closer to the memory actually used and that even in a constrained memory environment we decrease the factorization time with respect to the initial approach.
Improved load distribution in parallel sparse Cholesky factorization
In Proc. of Supercomputing'94, 1994
Cited by 38 (1 self)
Compared to the customary column-oriented approaches, block-oriented, distributed-memory sparse Cholesky factorization benefits from an asymptotic reduction in interprocessor communication volume and an asymptotic increase in the amount of concurrency that is exposed in the problem. Unfortunately, block-oriented approaches (specifically, the block fan-out method) have suffered from poor balance of the computational load. As a result, achieved performance can be quite low. This paper investigates the reasons for this load imbalance and proposes simple block mapping heuristics that dramatically improve it. The result is a roughly 20% increase in realized parallel factorization performance, as demonstrated by performance results from an Intel Paragon™ system. We have achieved performance of nearly 3.2 billion floating point operations per second with this technique on a 196-node Paragon system.
Sparse Gaussian Elimination on High Performance Computers
1996
Cited by 33 (5 self)
This dissertation presents new techniques for solving large sparse unsymmetric linear systems on high performance computers, using Gaussian elimination with partial pivoting. The efficiencies of the new algorithms are demonstrated for matrices from various fields and for a variety of high performance machines. In the first part we discuss optimizations of a sequential algorithm to exploit the memory hierarchies that exist in most RISC-based superscalar computers. We begin with the left-looking supernode-column algorithm by Eisenstat, Gilbert and Liu, which includes Eisenstat and Liu's symmetric structural reduction for fast symbolic factorization. Our key contribution is to develop both numeric and symbolic schemes to perform supernode-panel updates to achieve better data reuse in cache and floating-point register...
Hybridizing Nested Dissection and Halo Approximate Minimum Degree for Efficient Sparse Matrix Ordering
In Proceedings of IRREGULAR'99, LNCS 1586, 1999
Cited by 24 (14 self)
Minimum degree and nested dissection are the two most popular reordering schemes used to reduce fill-in and operation count when factoring and solving sparse matrices. Most of the state-of-the-art ordering packages hybridize these methods by performing incomplete nested dissection and ordering by minimum degree the subgraphs associated with the leaves of the separation tree, but most often only loose couplings have been achieved, resulting in poorer performance than could have been expected. This paper presents a tight coupling of the nested dissection and halo approximate minimum degree algorithms, which allows the minimum degree algorithm to use exact degrees on the boundaries of the subgraphs passed to it, and to yield back not only the ordering of the nodes of the subgraph, but also the amalgamated assembly subtrees, for efficient block computations. Experimental results show the performance improvement of this hybridization, both in terms of fill-in reduction and increa...
Efficient Parallel Solutions Of Large Sparse SPD Systems On Distributed-Memory Multiprocessors
Advanced Computing Research Institute, Center for Theory and Simulation in Science and Engineering, Cornell
Cited by 16 (2 self)
We consider several issues involved in the solution of sparse symmetric positive definite systems by the multifrontal method on distributed-memory multiprocessors. First, we present a new algorithm for computing the partial factorization of a frontal matrix on a subset of processors which significantly improves the performance of a distributed multifrontal algorithm previously designed. Second, new parallel algorithms for computing sparse forward elimination and sparse backward substitution are described. The new algorithms solve the sparse triangular systems in a multifrontal fashion. Numerical experiments run on an Intel iPSC/860 and an Intel iPSC/2 for a set of problems with regular and irregular sparsity structure are reported. More than 180 million flops per second during the numerical factorization are achieved for a three-dimensional grid problem on an iPSC/860 machine with 32 processors. Key words. Cholesky factorization, clique tree, distributed-memory multiprocessors, multifro...
A Parallel Formulation of Interior Point Algorithms
Department of Computer Science, University of Minnesota, 1994
Cited by 16 (9 self)
In recent years, interior point algorithms have been used successfully for solving medium to large-size linear programming (LP) problems. In this paper we describe a highly parallel formulation of the interior point algorithm. A key component of the interior point algorithm is the solution of a sparse system of linear equations using Cholesky factorization. The performance of parallel Cholesky factorization is determined by (a) the communication overhead incurred by the algorithm, and (b) the load imbalance among the processors. In our parallel interior point algorithm, we use our recently developed parallel multifrontal algorithm that has the smallest communication overhead of all parallel algorithms for Cholesky factorization developed to date. The computation imbalance depends on the shape of the elimination tree associated with the sparse system reordered for factorization. To balance the computation, we implemented and evaluated four different ordering algorithms. Among these algorithms, Kernighan-Lin and spectral nested dissection yield the most balanced elimination trees and greatly increase the amount of parallelism that can be exploited. Our preliminary implementation achieves a speedup as high as 108 on a 256-processor nCUBE 2 on moderate-size problems.
Sparse Numerical Linear Algebra: Direct Methods and Preconditioning
1996
Cited by 15 (2 self)
Most of the current techniques for the direct solution of linear equations are based on supernodal or multifrontal approaches. An important feature of these methods is that arithmetic is performed on dense submatrices and Level 2 and Level 3 BLAS (matrix-vector and matrix-matrix kernels) can be used. Both sparse LU and QR factorizations can be implemented within this framework. Partitioning and ordering techniques have seen major activity in recent years. We discuss bisection and multisection techniques, extensions to orderings to block triangular form, and recent improvements and modifications to standard orderings such as minimum degree. We also study advances in the solution of indefinite systems and sparse least-squares problems. The desire to exploit parallelism has been responsible for many of the developments in direct methods for sparse matrices over the last ten years. We examine this aspect in some detail, illustrating how current techniques have been developed or ...
MUMPS MUltifrontal Massively Parallel Solver Version 2.0
1998
Cited by 13 (0 self)
We describe aspects of the interface and design of Version 2.0 of the MUltifrontal Massively Parallel Solver MUMPS. This code solves sets of sparse linear equations Ax = b, where the matrix A is unsymmetric. It is written in Fortran 90 and uses MPI for message passing. It also calls the ScaLAPACK code which in turn uses the BLACS. Level 3 BLAS are also used by the code. MUMPS is the direct solver in the PARASOL project, an EU LTR Project with twelve partners from five countries. The main aim of PARASOL is to develop a public domain library of sparse codes for distributed memory parallel computers. This report describes the interface to the MUMPS code and the message passing mechanisms that are used in the package. Keywords: Multifrontal, sparse solver, distributed memory parallelism, MPI, BLAS, BLACS, ScaLAPACK, PARASOL. AMS(MOS) subject classifications: 65F05, 65F50. 1 Current reports available at http://www.cerfacs.fr/algor/algo reports.html. 2 amestoy@enseeiht.fr. ENSEEIHT-IRIT...

