Results 1 - 10 of 23
A column approximate minimum degree ordering algorithm (2000)
Cited by 202 (40 self)
Sparse Gaussian elimination with partial pivoting computes the factorization PAQ = LU of a sparse matrix A, where the row ordering P is selected during factorization using standard partial pivoting with row interchanges. The goal is to select a column preordering, Q, based solely on the nonzero pattern of A such that the factorization remains as sparse as possible, regardless of the subsequent choice of P. The choice of Q can have a dramatic impact on the number of nonzeros in L and U. One scheme for determining a good column ordering for A is to compute a symmetric ordering that reduces fill-in in the Cholesky factorization of $A^TA$. This approach, which requires the sparsity structure of $A^TA$ to be computed, can be expensive both in ...
Fast Sparse Matrix Multiplication (2004)
Cited by 31 (2 self)
Let A and B be two $n \times n$ matrices over a ring R (e.g., the reals or the integers), each containing at most m nonzero elements. We present a new algorithm that multiplies A and B using $O(m^{0.7}n^{1.2} + n^{2+o(1)})$ algebraic operations (i.e., multiplications, additions, and subtractions) over R. The naive matrix multiplication algorithm, on the other hand, may need to perform $\Omega(mn)$ operations to accomplish the same task. For $m \le n^{1.14}$, the new algorithm performs an almost optimal number of only $n^{2+o(1)}$ operations. For $m \le n^{1.68}$, the new algorithm is also faster than the best known matrix multiplication algorithm for dense matrices, which uses $O(n^{2.38})$ algebraic operations. The new algorithm is obtained using a surprisingly straightforward combination of a simple combinatorial idea and existing fast rectangular matrix multiplication algorithms. We also obtain improved algorithms for the multiplication of more than two sparse matrices.
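The abstract does not spell out the combinatorial idea. As we understand it, the indices k of the inner dimension are split into a few "heavy" ones, whose columns of A and rows of B are dense enough to hand to fast rectangular multiplication, and many "light" ones, handled as sparse outer products. The sketch below shows only that split. Ranking indices by $a_k b_k$ (nonzeros in column k of A times nonzeros in row k of B), the dict representation, and the name `split_sparse_multiply` are our assumptions, and a plain triple loop stands in for the fast rectangular algorithm, so it gives no asymptotic speedup.

```python
from collections import defaultdict


def split_sparse_multiply(A, B, n, num_heavy):
    """Multiply n x n sparse matrices given as {(row, col): value} dicts.

    Index k contributes the outer product (column k of A) x (row k of B),
    costing a_k * b_k multiplications when done sparsely. The num_heavy
    indices with the largest a_k * b_k are gathered into a dense
    n x h by h x n product instead.
    """
    col_A = defaultdict(list)  # k -> [(i, A[i, k])]
    row_B = defaultdict(list)  # k -> [(j, B[k, j])]
    for (i, k), v in A.items():
        col_A[k].append((i, v))
    for (k, j), v in B.items():
        row_B[k].append((j, v))

    ranked = sorted(range(n), key=lambda k: len(col_A[k]) * len(row_B[k]), reverse=True)
    heavy, light = ranked[:num_heavy], ranked[num_heavy:]
    C = defaultdict(int)

    # Light indices: sparse outer products.
    for k in light:
        for i, a in col_A[k]:
            for j, b in row_B[k]:
                C[(i, j)] += a * b

    # Heavy indices: dense rectangular product (placeholder for a fast algorithm).
    h = len(heavy)
    A_h = [[0] * h for _ in range(n)]
    B_h = [[0] * n for _ in range(h)]
    for t, k in enumerate(heavy):
        for i, a in col_A[k]:
            A_h[i][t] = a
        for j, b in row_B[k]:
            B_h[t][j] = b
    for i in range(n):
        for t in range(h):
            if A_h[i][t] != 0:
                for j in range(n):
                    if B_h[t][j] != 0:
                        C[(i, j)] += A_h[i][t] * B_h[t][j]

    return {key: v for key, v in C.items() if v != 0}
```

Every choice of `num_heavy` gives the same product; only the division of work between the two parts changes.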
A Revised Proposal for a Sparse BLAS Toolkit (1996)
Cited by 18 (7 self)
This paper describes a proposal for a "toolkit" of kernel routines for some of the basic operations in (iterative) sparse numerical methods. In particular, we describe an interface for routines that perform (i) the product of a sparse matrix and a dense matrix, (ii) the solution of a sparse triangular system with multiple right-hand sides, (iii) the right permutation of a sparse matrix, and (iv) a check for the integrity of a sparse matrix representation. The interfaces for these four operations are defined for a variety of common data structures, and a set of guidelines is given for defining interfaces for new data structures. The primary purpose of this toolkit is to provide a set of basic routines upon which the "User Level Sparse BLAS," as described in [9], can be built. This paper is a revision of the original proposal found in [14].
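Operation (i), a sparse matrix times a dense matrix, is the simplest of the four kernels. The sketch below uses compressed sparse row (CSR) storage; the paper's actual interface (routine names, argument order, and the set of supported formats) is not reproduced here, and `csr_times_dense` is a hypothetical name.

```python
def csr_times_dense(indptr, indices, data, B):
    """Return C = A @ B.

    A is m x k in CSR form: the nonzeros of row i are data[indptr[i]:indptr[i+1]],
    in columns indices[indptr[i]:indptr[i+1]]. B is a dense k x p list of rows.
    Each stored a_ij adds a_ij * (row j of B) to row i of C.
    """
    m = len(indptr) - 1
    p = len(B[0]) if B else 0
    C = [[0.0] * p for _ in range(m)]
    for i in range(m):
        c_row = C[i]
        for idx in range(indptr[i], indptr[i + 1]):
            a = data[idx]
            b_row = B[indices[idx]]
            for j in range(p):
                c_row[j] += a * b_row[j]
    return C


# A = [[1, 0, 2], [0, 3, 0]] in CSR form
print(csr_times_dense([0, 2, 3], [0, 2, 1], [1, 2, 3], [[1, 2], [3, 4], [5, 6]]))
# [[11.0, 14.0], [9.0, 12.0]]
```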
A Proposal for a Sparse BLAS Toolkit (in preparation, 1992)
Cited by 14 (2 self)
This paper describes a proposal for a "toolkit" of kernel routines for some of the basic operations in (iterative) sparse numerical methods. In particular, we describe an interface for routines that perform (i) the product of a sparse matrix and a dense matrix, (ii) the solution of a sparse triangular system with multiple right-hand sides, (iii) the right permutation of a sparse matrix, and (iv) a check for the integrity of a sparse matrix representation. The interfaces for these four operations are defined for a variety of common data structures, and a set of guidelines is given for defining interfaces for new data structures. The primary purpose of this toolkit is to provide a set of basic routines upon which the "User Level Sparse BLAS," as described in [6], can be built.
Keywords: Sparse matrices, sparse data structures, programming standards, sparse BLAS.
1 Introduction
Standard interfaces for numerical linear algebra software have been shown to be very useful if the interface is simple yet ...
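Operation (ii), a sparse triangular solve with multiple right-hand sides, amounts to forward substitution in which each row of the triangular matrix is read once and applied to every right-hand side. A minimal sketch, assuming a lower-triangular matrix in CSR form with its diagonal stored explicitly; the name `csr_lower_solve` and the error handling are our own choices, not the toolkit's interface.

```python
def csr_lower_solve(indptr, indices, data, B):
    """Solve L X = B by forward substitution.

    L is n x n lower triangular in CSR form (indptr, indices, data) with the
    diagonal stored; B is a dense n x p list of rows, one column per
    right-hand side.
    """
    n = len(indptr) - 1
    X = [[float(v) for v in row] for row in B]
    for i in range(n):
        diag = 0.0
        xi = X[i]
        for idx in range(indptr[i], indptr[i + 1]):
            j, lij = indices[idx], data[idx]
            if j < i:
                xj = X[j]  # row j of X is already solved
                for c in range(len(xi)):
                    xi[c] -= lij * xj[c]
            elif j == i:
                diag = lij
            else:
                raise ValueError(f"entry ({i}, {j}) lies above the diagonal")
        if diag == 0:
            raise ZeroDivisionError(f"zero or missing diagonal in row {i}")
        for c in range(len(xi)):
            xi[c] /= diag
    return X


# L = [[2, 0], [1, 4]], two right-hand sides
print(csr_lower_solve([0, 1, 3], [0, 0, 1], [2, 1, 4], [[2, 4], [9, 10]]))
# [[1.0, 2.0], [2.0, 2.0]]
```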
On Automatic Data Structure Selection and Code Generation for Sparse Computations (Lecture Notes in Computer Science, 1993)
Cited by 10 (5 self)
Traditionally, restructuring compilers could only apply program transformations to exploit certain characteristics of the target architecture. Adaptation of data structures was limited to, e.g., linearization or transposition of arrays. However, as more complex data structures are required to exploit characteristics of the data operated on, current compiler support appears to be inadequate. In this paper we discuss the implementation issues of a restructuring compiler that automatically converts programs operating on dense matrices into sparse code; that is, after a suitable data structure has been selected for every dense matrix that is in fact sparse, the original code is adapted to operate on these data structures. This simplifies the task of the programmer and, in general, enables the compiler to apply more optimizations.
Index Terms: Restructuring Compilers, Sparse Computations, Sparse Matrices.
1 Introduction
Development and maintenance of sparse codes is a complex tas...
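To make the dense-to-sparse conversion concrete, here is a hand-written illustration of the kind of rewrite the abstract describes, applied to a matrix-vector product. It is not output of the authors' compiler; CSR is our choice of storage, and the function names are hypothetical.

```python
def dense_matvec(A, x):
    """Original dense code: visits every A[i][j], zeros included."""
    y = [0.0] * len(A)
    for i in range(len(A)):
        for j in range(len(x)):
            y[i] += A[i][j] * x[j]
    return y


def to_csr(A):
    """Data-structure selection step: keep only the nonzeros of A, row by row."""
    indptr, indices, data = [0], [], []
    for row in A:
        for j, v in enumerate(row):
            if v != 0:
                indices.append(j)
                data.append(v)
        indptr.append(len(indices))
    return indptr, indices, data


def csr_matvec(indptr, indices, data, x):
    """Code-generation step: the same i/j loop nest, with the j loop
    restricted to the stored entries of row i."""
    y = [0.0] * (len(indptr) - 1)
    for i in range(len(y)):
        for idx in range(indptr[i], indptr[i + 1]):
            y[i] += data[idx] * x[indices[idx]]
    return y


A = [[0, 2, 0], [1, 0, 3]]
x = [1, 2, 3]
print(dense_matvec(A, x), csr_matvec(*to_csr(A), x))  # [4.0, 10.0] [4.0, 10.0]
```

Both versions compute the same result; the converted loop does work proportional to the number of nonzeros rather than to the full matrix size.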
On the Representation and Multiplication of Hypersparse Matrices (2008)
Cited by 4 (4 self)
Multicore processors are marking the beginning of a new era of computing where massive parallelism is available and necessary. Slightly slower but easy to parallelize kernels are becoming more valuable than sequentially faster kernels that are unscalable when parallelized. In this paper, we focus on the multiplication of sparse matrices (SpGEMM). We first present the issues with existing sparse matrix representations and multiplication algorithms that make them unscalable to thousands of processors. Then, we develop and analyze two new algorithms that overcome these limitations. We consider our algorithms first as the sequential kernel of a scalable parallel sparse matrix multiplication algorithm and second as part of a polyalgorithm for SpGEMM that would execute different kernels depending on the sparsity of the input matrices. Such a sequential kernel requires a new data structure that exploits the hypersparsity of the individual submatrices owned by a single processor after the 2D partitioning. We experimentally evaluate the performance and characteristics of our algorithms and show that they scale significantly better than existing kernels.
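The abstract does not name the new data structure; to our knowledge it is the doubly compressed sparse column (DCSC) format, which keeps column pointers only for nonempty columns. After a 2D partitioning, a processor's block can have far fewer nonzeros than columns, so a CSC pointer array of length n+1 would dominate its memory. The sketch below is our own simplified DCSC-style layout; the field names follow common usage but the functions are hypothetical, and column access by binary search is our choice.

```python
from bisect import bisect_left


def to_dcsc(triples):
    """Build a DCSC-style layout from (row, col, value) triples.

    Returns (jc, cp, ir, num):
      jc  - indices of the nonempty columns, ascending
      cp  - cp[t]:cp[t+1] is the slice of ir/num for column jc[t]
      ir  - row indices, sorted within each column
      num - values
    Storage is O(nnz + nzc), with nzc the number of nonempty columns,
    instead of the O(nnz + n) needed by CSC.
    """
    jc, cp, ir, num = [], [], [], []
    for r, c, v in sorted(triples, key=lambda t: (t[1], t[0])):
        if not jc or jc[-1] != c:
            jc.append(c)
            cp.append(len(ir))
        ir.append(r)
        num.append(v)
    cp.append(len(ir))
    return jc, cp, ir, num


def column(dcsc, j):
    """Return [(row, value), ...] for column j via binary search over jc."""
    jc, cp, ir, num = dcsc
    t = bisect_left(jc, j)
    if t == len(jc) or jc[t] != j:
        return []
    return list(zip(ir[cp[t]:cp[t + 1]], num[cp[t]:cp[t + 1]]))


d = to_dcsc([(5, 2, 1.0), (0, 7, 2.0), (3, 2, 3.0)])
print(d)             # ([2, 7], [0, 2, 3], [3, 5, 0], [3.0, 1.0, 2.0])
print(column(d, 2))  # [(3, 3.0), (5, 1.0)]
```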
Caching-Efficient Multithreaded Fast Multiplication of Sparse Matrices (Proceedings of the 12th International Parallel Processing Symposium, 1998)
Cited by 3 (0 self)
Several fast sequential algorithms have been proposed in the past to multiply sparse matrices. These algorithms do not explicitly address the impact of caching on performance. We show that a rather simple sequential cache-efficient algorithm provides significantly better performance than existing algorithms for sparse matrix multiplication. We then describe a multithreaded implementation of this simple algorithm and show that its performance scales well with the number of threads and CPUs. For 10% sparse, 500 × 500 matrices, the multithreaded version running on 4-CPU systems provides more than a 41.1-fold speed increase over the well-known BLAS routine and a 14.6-fold and 44.6-fold speed increase over two other recent techniques for fast sparse matrix multiplication, both of which are relatively difficult to parallelize efficiently.
Keywords: sparse matrix multiplication, caching, loop interchanging
1. Introduction
The need to efficiently multiply two sparse matrices is critica...
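The abstract does not give the algorithm, but its keywords point to loop interchanging. A common cache-friendly choice is the i-k-j loop order: the innermost loop walks a row of B and a row of C contiguously, and zero entries of A can be skipped outright. The sketch below shows that order plus a row-block split across threads; it is our illustration, not the paper's code, and the function names are hypothetical. In CPython the global interpreter lock prevents a real speedup, so the threading here only demonstrates the partitioning.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_rows(A, B, C, rows):
    """Compute the given rows of C = A @ B in i-k-j loop order.

    The inner j loop streams through row p of B and row i of C, and
    zero entries of A are skipped, so sparse A costs proportionally less.
    """
    m = len(B[0])
    for i in rows:
        c_row = C[i]
        for p, a in enumerate(A[i]):
            if a == 0:
                continue
            b_row = B[p]
            for j in range(m):
                c_row[j] += a * b_row[j]


def multiply_threaded(A, B, num_threads=4):
    """Give each thread a contiguous block of rows of C. The blocks never
    overlap, so the threads need no locking."""
    n, m = len(A), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    block = max(1, -(-n // num_threads))  # ceiling division
    chunks = [range(s, min(s + block, n)) for s in range(0, n, block)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(lambda rows: multiply_rows(A, B, C, rows), chunks))
    return C


print(multiply_threaded([[1, 0], [0, 2]], [[3, 4], [5, 6]]))  # [[3.0, 4.0], [10.0, 12.0]]
```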
High Performance Parallel Implementations of the NAS Kernel Benchmarks on the IBM SP2 (IBM Systems Journal, 1995)
Cited by 1 (0 self)
Recently, researchers at NASA Ames have defined a set of computational benchmarks designed to measure the performance of parallel supercomputers. In this paper, we describe the parallel implementation of the five kernel benchmarks from this suite on the IBM SP2, a scalable, distributed-memory parallel computer. High-performance implementations of these kernels have been obtained by mapping the computation of these kernels to the underlying architecture of the SP2 machine. Performance results for the SP2 are compared with publicly available results for other high-performance computers.
1 Introduction
Recently, researchers at NASA Ames have defined a set of computational benchmarks for the performance evaluation of parallel supercomputers for large scientific applications [3, 4]. Known as the NAS parallel benchmarks, this set has become an increasingly recognized means of quantifying performance of high-performance computers on a range of algorithms of interest to many users of such mac...
Highly Parallel Sparse Matrix-Matrix Multiplication
Cited by 1 (1 self)
Generalized sparse matrix-matrix multiplication is a key primitive for many high-performance graph algorithms, as well as for some linear solvers such as multigrid. We present the first parallel algorithms that achieve increasing speedups for an unbounded number of processors. Our algorithms are based on two-dimensional block distribution of sparse matrices where serial sections use a novel hypersparse kernel for scalability. We give a state-of-the-art MPI implementation of one of our algorithms. Our experiments show scaling up to thousands of processors on a variety of test scenarios.
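A two-dimensional block distribution can be illustrated with a SUMMA-style stage loop: on a q × q process grid, at stage s every process (r, c) receives block A[r, s] along its process row and block B[s, c] along its process column and accumulates their product into its own block of C. The serial simulation below is our assumption about the general pattern, not the paper's MPI implementation; the loops over r and c stand in for processes, no messages are sent, the local kernel is a simple hash-based one rather than the paper's hypersparse kernel, and all names are hypothetical.

```python
from collections import defaultdict


def block_index(i, n, q):
    """Block row/column (0..q-1) holding global index i; blocks hold ceil(n/q) indices."""
    return i // -(-n // q)


def partition(M, n, q):
    """Split a {(row, col): value} matrix into a q x q grid of sparse blocks."""
    blocks = defaultdict(dict)
    for (i, j), v in M.items():
        blocks[(block_index(i, n, q), block_index(j, n, q))][(i, j)] = v
    return blocks


def local_spgemm(A_blk, B_blk):
    """Serial sparse kernel for one pair of blocks (global indices kept)."""
    rows_B = defaultdict(list)
    for (k, j), v in B_blk.items():
        rows_B[k].append((j, v))
    out = defaultdict(float)
    for (i, k), a in A_blk.items():
        for j, b in rows_B.get(k, []):
            out[(i, j)] += a * b
    return out


def summa_2d(A, B, n, q):
    """Simulate a SUMMA-style product of n x n sparse matrices on a q x q grid."""
    A_blocks, B_blocks = partition(A, n, q), partition(B, n, q)
    C = defaultdict(float)
    for s in range(q):          # stage s: broadcast A[:, s] and B[s, :]
        for r in range(q):      # process row
            for c in range(q):  # process column
                partial = local_spgemm(A_blocks.get((r, s), {}), B_blocks.get((s, c), {}))
                for key, v in partial.items():
                    C[key] += v
    return {key: v for key, v in C.items() if v != 0}
```

The result does not depend on the grid size q; only the assignment of work to (simulated) processes changes.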
Algorithmic Design On The Cedar Multiprocessor (1989)
Cited by 1 (0 self)
The CEDAR system under development in the Center for Supercomputing Research and Development (CSRD) at the University of Illinois at Urbana-Champaign is a clustered shared-memory multiprocessor system that can be used to support parallel processing for a wide range of numerical and non-numerical applications. In this paper, we survey selected algorithms and applications for CEDAR that are under investigation by members of the CSRD Applications Group. Topics include the design of sparse basic linear algebra kernels, the Conjugate Gradient method for linear systems of equations, direct block tridiagonal linear system solvers, a domain decomposition technique for structural mechanics, boundary integral domain decomposition for partial differential equations, parallel algorithms for circuit simulation, and parallelization of a computer graphics technique (ray tracing). Performance results on the 2-cluster CEDAR system, Alliant FX/8, and Cray X-MP/48 are presented.
1 Introduction
Exploit...

