Applied Numerical Linear Algebra
Society for Industrial and Applied Mathematics, 1997
Cited by 531 (26 self)
We survey general techniques and open problems in numerical linear algebra on parallel architectures. We first discuss basic principles of parallel processing, describing the costs of basic operations on parallel machines, including general principles for constructing efficient algorithms. We illustrate these principles using current architectures and software systems, and by showing how one would implement matrix multiplication. Then, we present direct and iterative algorithms for solving linear systems of equations, linear least squares problems, the symmetric eigenvalue problem, the nonsymmetric eigenvalue problem, and the singular value decomposition. We consider dense, band and sparse matrices.
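The survey's running example is matrix multiplication. A minimal serial sketch of the blocking that parallel implementations distribute across processors (block size and names here are illustrative, not from the book):

```python
import numpy as np

def blocked_matmul(A, B, bs=2):
    """Compute A @ B by accumulating bs-by-bs block products.

    Each (i, k) output block is built from small dense products, which is
    the unit of work a parallel implementation would assign to a processor.
    """
    n, m = A.shape
    m2, p = B.shape
    assert m == m2, "inner dimensions must match"
    C = np.zeros((n, p))
    for i in range(0, n, bs):
        for k in range(0, p, bs):
            for j in range(0, m, bs):
                C[i:i+bs, k:k+bs] += A[i:i+bs, j:j+bs] @ B[j:j+bs, k:k+bs]
    return C

rng = np.random.default_rng(0)
A, B = rng.random((4, 4)), rng.random((4, 6))
C = blocked_matmul(A, B)
```

Blocking improves data reuse in cache and, on a distributed-memory machine, determines which processor owns which piece of the computation.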
Highly scalable parallel algorithms for sparse matrix factorization
IEEE Transactions on Parallel and Distributed Systems, 1994
Cited by 117 (29 self)
In this paper, we describe a scalable parallel algorithm for sparse matrix factorization, analyze its performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithm substantially improves the state of the art in parallel direct solution of sparse linear systems, both in terms of scalability and overall performance. It is a well-known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithm to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that is asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithm incurs less communication overhead and is more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of our sparse Cholesky factorization algorithm delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge, ...
Distributed memory compiler design for sparse problems
IEEE Transactions on Computers, 1995
Cited by 67 (10 self)
This paper addresses the issue of compiling concurrent loop nests in the presence of complicated array references and irregularly distributed arrays. Arrays accessed within loops may contain accesses that make it impossible to precisely determine the reference pattern at compile time. This paper proposes a run-time support mechanism that is used effectively by a compiler to generate efficient code in these situations. The compiler accepts as input a Fortran 77 program enhanced with specifications for distributing data, and outputs a message-passing program that runs on the nodes of a distributed memory machine. The run-time support for the compiler consists of a library of primitives designed to support irregular patterns of distributed array accesses and irregularly distributed array partitions. A variety of performance results on the Intel iPSC/860 are presented.
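The run-time support described here follows the inspector/executor idea: examine the index array once at run time, record which accesses are off-process, and reuse that schedule every iteration. A toy single-process sketch (the function name and two-"process" setup are hypothetical, not the paper's API):

```python
import numpy as np

def inspector(index_array, owner, my_rank):
    """Record which irregular accesses reference off-process elements.

    Returns the positions in index_array whose target element is owned by
    another process; a real runtime would turn this into a communication
    schedule, exchanged once and reused on every loop iteration.
    """
    return [k for k, i in enumerate(index_array) if owner[i] != my_rank]

# toy setup: 8 elements block-distributed over 2 "processes"
owner = np.array([0, 0, 0, 0, 1, 1, 1, 1])
idx = np.array([1, 5, 2, 7])   # irregular accesses, unknown until run time
remote = inspector(idx, owner, my_rank=0)  # [1, 3]
```

Positions 1 and 3 of `idx` reference data owned by process 1, so only those elements need to be communicated before the loop body executes.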
SPOOLES: An Object-Oriented Sparse Matrix Library
In Proceedings of the 9th SIAM Conference on Parallel Processing for Scientific Computing, 1999
Cited by 35 (0 self)
...dissection and multisection. The latter two orderings depend on a domain/separator tree that is constructed using a graph partitioning method. Domain decomposition is used to find an initial separator, and a sequence of network flow problems is solved to smooth the separator. The quality of our nested dissection and multisection orderings is comparable to other state-of-the-art packages. Factorizations of square matrices have the form A = PLDUQ and A = PLDL^T P^T, where P and Q are permutation matrices. Square systems of the form A + σB may also be factored and solved (as found in shift-and-invert eigensolvers), as well as full-rank overdetermined linear systems, where a QR factorization is computed and the solution found by solving the seminormal equations.
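The symmetric factorization form A = PLDL^T P^T can be tried on a small dense symmetric matrix with SciPy's `ldl` (a dense illustration only; SPOOLES itself operates on sparse matrices):

```python
import numpy as np
from scipy.linalg import ldl

# a small symmetric matrix (illustrative)
A = np.array([[4., 2., 1.],
              [2., 3., 0.],
              [1., 0., 2.]])

# L is the (possibly permuted) unit lower triangular factor, D is block
# diagonal, and perm records the row permutation that triangularizes L
L, D, perm = ldl(A)
```

The product `L @ D @ L.T` reconstructs A, with the permutation already folded into the returned factor.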
PARTI Primitives for Unstructured and Block Structured Problems
1992
Cited by 18 (5 self)
This paper describes a set of primitives (PARTI) developed to efficiently execute unstructured and block structured problems on distributed memory parallel machines. We present experimental data from a 3D unstructured Euler solver run on the Intel Touchstone Delta to demonstrate the usefulness of our methods.
Efficient Parallel Solutions Of Large Sparse SPD Systems On Distributed-Memory Multiprocessors
 Advanced Computing Research Institute, Center for Theory and Simulation in Science and Engineering, Cornell
Cited by 17 (2 self)
We consider several issues involved in the solution of sparse symmetric positive definite systems by the multifrontal method on distributed-memory multiprocessors. First, we present a new algorithm for computing the partial factorization of a frontal matrix on a subset of processors which significantly improves the performance of a previously designed distributed multifrontal algorithm. Second, new parallel algorithms for computing sparse forward elimination and sparse backward substitution are described. The new algorithms solve the sparse triangular systems in a multifrontal fashion. Numerical experiments run on an Intel iPSC/860 and an Intel iPSC/2 for a set of problems with regular and irregular sparsity structure are reported. More than 180 million flops per second during the numerical factorization are achieved for a three-dimensional grid problem on an iPSC/860 machine with 32 processors. Key words. Cholesky factorization, clique tree, distributed-memory multiprocessors, multifro...
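The forward elimination and backward substitution steps this paper parallelizes can be sketched serially: given a lower triangular Cholesky factor L, solving L L^T x = b is two sparse triangular solves (the matrix below is an illustrative factor, not produced by the paper's algorithm):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve_triangular

# a small lower-triangular sparse factor (illustrative)
L = sparse.csr_matrix(np.array([[2., 0., 0.],
                                [1., 3., 0.],
                                [0., 1., 4.]]))
b = np.array([2., 5., 9.])

y = spsolve_triangular(L, b, lower=True)              # forward elimination
x = spsolve_triangular(L.T.tocsr(), y, lower=False)   # backward substitution
```

In the multifrontal setting, both solves are organized along the elimination tree so that independent subtrees proceed in parallel.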
Sparse Numerical Linear Algebra: Direct Methods and Preconditioning
1996
Cited by 17 (2 self)
Most of the current techniques for the direct solution of linear equations are based on supernodal or multifrontal approaches. An important feature of these methods is that arithmetic is performed on dense submatrices and Level 2 and Level 3 BLAS (matrix-vector and matrix-matrix kernels) can be used. Both sparse LU and QR factorizations can be implemented within this framework. Partitioning and ordering techniques have seen major activity in recent years. We discuss bisection and multisection techniques, extensions of orderings to block triangular form, and recent improvements and modifications to standard orderings such as minimum degree. We also study advances in the solution of indefinite systems and sparse least-squares problems. The desire to exploit parallelism has been responsible for many of the developments in direct methods for sparse matrices over the last ten years. We examine this aspect in some detail, illustrating how current techniques have been developed or ...
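The dense-submatrix arithmetic at the heart of supernodal and multifrontal methods boils down to three Level 3 kernels: a dense factorization of the pivot block, a triangular solve, and a GEMM-style Schur complement update. A minimal sketch on a hypothetical 3x3 frontal matrix:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

# a small frontal matrix, partitioned into a 2x2 pivot block and a trailing block
F = np.array([[4., 1., 2.],
              [1., 5., 0.],
              [2., 0., 6.]])
F11, F21, F22 = F[:2, :2], F[2:, :2], F[2:, 2:]

L11 = cholesky(F11, lower=True)                    # dense pivot-block factor (POTRF)
L21 = solve_triangular(L11, F21.T, lower=True).T   # triangular solve (TRSM)
S = F22 - L21 @ L21.T                              # Schur-complement update (GEMM/SYRK)
```

Because all three steps run on dense blocks, they inherit the high flop rates of tuned BLAS even though the overall matrix is sparse.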
PaStiX: A Parallel Sparse Direct Solver Based on a Static Scheduling for Mixed 1D/2D Block Distributions
In Proceedings of Irregular'2000, Cancun, Mexico, number 1800 in Lecture Notes in Computer Science, 2000
Cited by 12 (4 self)
We present and analyze a general algorithm which computes an efficient static scheduling of block computations for a parallel LDL^T factorization of sparse symmetric positive definite systems based on a combination of 1D and 2D block distributions. Our solver uses a supernodal fan-in approach and is fully driven by this scheduling. We give an overview of the algorithm and present performance results and comparisons with PSPASES on an IBM SP2 with 120 MHz Power2SC nodes for a collection of irregular problems. This work is supported by the Commissariat à l'Énergie Atomique CEA/CESTA under contract No. 7V1555AC, and by the GDR ARP (iHPerf group) of the CNRS.
1 Introduction
Solving large sparse symmetric positive definite systems Ax = b of linear equations is a crucial and time-consuming step arising in many scientific and engineering applications. Consequently, many parallel formulations for sparse matrix factorization have been studied and implemented; one can refer t...
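The 1D/2D block distributions combined by the solver can be illustrated by the owner computation of a block-cyclic mapping (a generic sketch, not PaStiX's actual mapping):

```python
def owner_1d(block_row, p):
    """Process owning a block row under a 1-D block-cyclic distribution."""
    return block_row % p

def owner_2d(block_row, block_col, pr, pc):
    """Process grid coordinates under a 2-D block-cyclic distribution."""
    return (block_row % pr, block_col % pc)

# 6 block rows over 4 processes (1-D) vs. a 2x2 process grid (2-D)
owners_1d = [owner_1d(i, 4) for i in range(6)]        # [0, 1, 2, 3, 0, 1]
diag_2d = [owner_2d(i, i, 2, 2) for i in range(3)]    # [(0, 0), (1, 1), (0, 0)]
```

A 1-D mapping keeps whole block columns on one process (cheap for small supernodes), while a 2-D mapping spreads large supernodes over a process grid to expose more parallelism; the paper's scheduler chooses between them statically per block.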
Distributed Solution Of Sparse Linear Systems
Cited by 11 (1 self)
We consider the solution of a linear system Ax = b on a distributed memory machine when the matrix A is large, sparse and symmetric positive definite. In a previous paper we developed an algorithm to compute a fill-reducing nested dissection ordering of A on a distributed memory machine. We now develop algorithms for the remaining steps of the solution process. The large-grain task parallelism resulting from sparsity is identified by a tree of separators available from nested dissection. Our parallel algorithms use this separator tree to estimate the structure of the Cholesky factor L and to organize numeric computations as a sequence of dense matrix operations. We present results of an implementation on an Intel iPSC/860 parallel computer. As an alternative to estimating the structure of L using the separator tree, we develop an algorithm to compute the elimination tree on a distributed memory machine. Our algorithm uses the separator tree to achieve better time and space complexity than earlier work.
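The elimination tree mentioned above records, for each column, the first row below the diagonal where fill appears; a standard serial construction (Liu's path-compression algorithm, shown here as a stand-in for the paper's distributed version) is short:

```python
import numpy as np
from scipy import sparse

def elimination_tree(A):
    """Elimination tree of a symmetric sparse matrix (serial sketch).

    parent[j] is the parent of column j, or -1 for a root. Uses the
    standard ancestor array with path compression.
    """
    A = sparse.csc_matrix(A)
    n = A.shape[0]
    parent = np.full(n, -1)
    ancestor = np.full(n, -1)
    for j in range(n):
        for i in A.indices[A.indptr[j]:A.indptr[j + 1]]:
            # walk up from each subdiagonal nonzero, compressing the path to j
            while i < j:
                nxt = ancestor[i]
                ancestor[i] = j
                if nxt == -1:
                    parent[i] = j
                    break
                i = nxt
    return parent

# a tridiagonal matrix yields a simple chain: each column's parent is the next
T = sparse.diags([-1., 2., -1.], [-1, 0, 1], shape=(4, 4))
etree = list(elimination_tree(T))   # [1, 2, 3, -1]
```

Independent subtrees of this tree are exactly the columns that can be factored concurrently, which is why both the separator tree and the elimination tree drive the parallel organization.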
A Mapping and Scheduling Algorithm for Parallel Sparse Fan-In Numerical Factorization
In Euro-Par'99 Parallel Processing, Lecture Notes in Computer Science, 2000
Cited by 11 (5 self)
We present and analyze a general algorithm which computes efficient static schedulings of block computations for parallel sparse linear factorization. Our solver, based on a supernodal fan-in approach, is fully driven by this scheduling. We give an overview of the algorithms and present performance results on a 16-node IBM SP2 with 66 MHz Power2 thin nodes for a collection of grid and irregular problems. This work is supported by the Commissariat à l'Énergie Atomique CEA/CESTA under contract No. 7V1555AC, and by the GDR ARP (iHPerf group) of the CNRS.
1 Introduction
Solving large sparse symmetric positive definite systems Ax = b of linear equations is a crucial and time-consuming step arising in many scientific and engineering applications. Consequently, many parallel formulations for sparse matrix factorization have been studied and implemented; one can refer to [6] for a complete survey on high-performance sparse factorization. In this paper, we focus on the block par...
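Static scheduling of weighted block computations can be sketched with the classic longest-processing-time greedy heuristic (a generic stand-in; the paper's scheduler also models communication and precedence, which this toy ignores):

```python
import heapq

def static_schedule(costs, nproc):
    """Greedy LPT mapping: assign each task, heaviest first, to the
    currently least-loaded process. Returns {task: process}."""
    loads = [(0.0, p) for p in range(nproc)]
    heapq.heapify(loads)
    mapping = {}
    for task, c in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(loads)   # least-loaded process so far
        mapping[task] = p
        heapq.heappush(loads, (load + c, p))
    return mapping

# hypothetical per-block cost estimates
costs = {"t0": 7.0, "t1": 5.0, "t2": 4.0, "t3": 4.0}
assign = static_schedule(costs, 2)   # heaviest two tasks land on different processes
```

Computing such a mapping once, before factorization starts, is what makes the schedule "static": at run time every process already knows which blocks it owns and in what order to process them.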