Results 1  10
of
258
Scan Primitives for GPU Computing
 GRAPHICS HARDWARE 2007
, 2007
"... The scan primitives are powerful, generalpurpose dataparallel primitives that are building blocks for a broad range of applications. We describe GPU implementations of these primitives, specifically an efficient formulation and implementation of segmented scan, on NVIDIA GPUs using the CUDA API.Us ..."
Abstract

Cited by 111 (8 self)
 Add to MetaCart
The scan primitives are powerful, generalpurpose dataparallel primitives that are building blocks for a broad range of applications. We describe GPU implementations of these primitives, specifically an efficient formulation and implementation of segmented scan, on NVIDIA GPUs using the CUDA API.Using the scan primitives, we show novel GPU implementations of quicksort and sparse matrixvector multiply, and analyze the performance of the scan primitives, several sort algorithms that use the scan primitives, and a graphical shallowwater fluid simulation using the scan framework for a tridiagonal matrix solver.
Preconditioning techniques for large linear systems: A survey
 J. COMPUT. PHYS
, 2002
"... This article surveys preconditioning techniques for the iterative solution of large linear systems, with a focus on algebraic methods suitable for general sparse matrices. Covered topics include progress in incomplete factorization methods, sparse approximate inverses, reorderings, parallelization i ..."
Abstract

Cited by 105 (5 self)
 Add to MetaCart
This article surveys preconditioning techniques for the iterative solution of large linear systems, with a focus on algebraic methods suitable for general sparse matrices. Covered topics include progress in incomplete factorization methods, sparse approximate inverses, reorderings, parallelization issues, and block and multilevel extensions. Some of the challenges ahead are also discussed. An extensive bibliography completes the paper.
SuperLU DIST: A scalable distributedmemory sparse direct solver for unsymmetric linear systems
 ACM Trans. Mathematical Software
, 2003
"... We present the main algorithmic features in the software package SuperLU DIST, a distributedmemory sparse direct solver for large sets of linear equations. We give in detail our parallelization strategies, with a focus on scalability issues, and demonstrate the software’s parallel performance and sc ..."
Abstract

Cited by 87 (17 self)
 Add to MetaCart
We present the main algorithmic features in the software package SuperLU DIST, a distributedmemory sparse direct solver for large sets of linear equations. We give in detail our parallelization strategies, with a focus on scalability issues, and demonstrate the software’s parallel performance and scalability on current machines. The solver is based on sparse Gaussian elimination, with an innovative static pivoting strategy proposed earlier by the authors. The main advantage of static pivoting over classical partial pivoting is that it permits a priori determination of data structures and communication patterns, which lets us exploit techniques used in parallel sparse Cholesky algorithms to better parallelize both LU decomposition and triangular solution on largescale distributed machines.
Weighted graph cuts without eigenvectors: A multilevel approach
 IEEE Trans. Pattern Anal. Mach. Intell
, 2007
"... Abstract—A variety of clustering algorithms have recently been proposed to handle data that is not linearly separable; spectral clustering and kernel kmeans are two of the main methods. In this paper, we discuss an equivalence between the objective functions used in these seemingly different method ..."
Abstract

Cited by 78 (12 self)
 Add to MetaCart
Abstract—A variety of clustering algorithms have recently been proposed to handle data that is not linearly separable; spectral clustering and kernel kmeans are two of the main methods. In this paper, we discuss an equivalence between the objective functions used in these seemingly different methods—in particular, a general weighted kernel kmeans objective is mathematically equivalent to a weighted graph clustering objective. We exploit this equivalence to develop a fast highquality multilevel algorithm that directly optimizes various weighted graph clustering objectives, such as the popular ratio cut, normalized cut, and ratio association criteria. This eliminates the need for any eigenvector computation for graph clustering problems, which can be prohibitive for very large graphs. Previous multilevel graph partitioning methods such as Metis have suffered from the restriction of equalsized clusters; our multilevel algorithm removes this restriction by using kernel kmeans to optimize weighted graph cuts. Experimental results show that our multilevel algorithm outperforms a stateoftheart spectral clustering algorithm in terms of speed, memory usage, and quality. We demonstrate that our algorithm is applicable to largescale clustering tasks such as image segmentation, social network analysis, and gene network analysis. Index Terms—Clustering, data mining, segmentation, kernel kmeans, spectral clustering, graph partitioning. 1
The design and use of algorithms for permuting large entries to the diagonal of sparse matrices
 SIAM J. MATRIX ANAL. APPL
, 1999
"... ..."
A TwoDimensional Data Distribution Method For Parallel Sparse MatrixVector Multiplication
 SIAM REVIEW
"... A new method is presented for distributing data in sparse matrixvector multiplication. The method is twodimensional, tries to minimise the true communication volume, and also tries to spread the computation and communication work evenly over the processors. The method starts with a recursive bipar ..."
Abstract

Cited by 68 (9 self)
 Add to MetaCart
A new method is presented for distributing data in sparse matrixvector multiplication. The method is twodimensional, tries to minimise the true communication volume, and also tries to spread the computation and communication work evenly over the processors. The method starts with a recursive bipartitioning of the sparse matrix, each time splitting a rectangular matrix into two parts with a nearly equal number of nonzeros. The communication volume caused by the split is minimised. After the matrix partitioning, the input and output vectors are partitioned with the objective of minimising the maximum communication volume per processor. Experimental results of our implementation, Mondriaan, for a set of sparse test matrices show a reduction in communication compared to onedimensional methods, and in general a good balance in the communication work.
HypergraphPartitioning Based Decomposition for Parallel SparseMatrix Vector Multiplication
 IEEE Trans. on Parallel and Distributed Computing
"... In this work, we show that the standard graphpartitioning based decomposition of sparse matrices does not reflect the actual communication volume requirement for parallel matrixvector multiplication. We propose two computational hypergraph models which avoid this crucial deficiency of the graph mo ..."
Abstract

Cited by 61 (34 self)
 Add to MetaCart
In this work, we show that the standard graphpartitioning based decomposition of sparse matrices does not reflect the actual communication volume requirement for parallel matrixvector multiplication. We propose two computational hypergraph models which avoid this crucial deficiency of the graph model. The proposed models reduce the decomposition problem to the wellknown hypergraph partitioning problem. The recently proposed successful multilevel framework is exploited to develop a multilevel hypergraph partitioning tool PaToH for the experimental verification of our proposed hypergraph models. Experimental results on a wide range of realistic sparse test matrices confirm the validity of the proposed hypergraph models. In the decomposition of the test matrices, the hypergraph models using PaToH and hMeTiS result in up to 63% less communication volume (30%38% less on the average) than the graph model using MeTiS, while PaToH is only 1.32.3 times slower than MeTiS on the average. ...
Optimizing the performance of sparse matrixvector multiplication
, 2000
"... Copyright 2000 by EunJin Im ..."
Algorithm 887: Cholmod, supernodal sparse cholesky factorization and update/downdate
 ACM Transactions on Mathematical Software
, 2008
"... CHOLMOD is a set of routines for factorizing sparse symmetric positive definite matrices of the form A or A A T, updating/downdating a sparse Cholesky factorization, solving linear systems, updating/downdating the solution to the triangular system Lx = b, and many other sparse matrix functions for b ..."
Abstract

Cited by 53 (7 self)
 Add to MetaCart
CHOLMOD is a set of routines for factorizing sparse symmetric positive definite matrices of the form A or A A T, updating/downdating a sparse Cholesky factorization, solving linear systems, updating/downdating the solution to the triangular system Lx = b, and many other sparse matrix functions for both symmetric and unsymmetric matrices. Its supernodal Cholesky factorization relies on LAPACK and the Level3 BLAS, and obtains a substantial fraction of the peak performance of the BLAS. Both real and complex matrices are supported. CHOLMOD is written in ANSI/ISO C, with both C and MATLAB TM interfaces. It appears in MATLAB 7.2 as x=A\b when A is sparse symmetric positive definite, as well as in several other sparse matrix functions.
Robust approximate inverse preconditioning for the conjugate gradient method
 SIAM J. SCI. COMPUT
, 2000
"... We present a variant of the AINV factorized sparse approximate inverse algorithm which is applicable to any symmetric positive definite matrix. The new preconditioner is breakdownfree and, when used in conjunction with the conjugate gradient method, results in a reliable solver for highly illcondit ..."
Abstract

Cited by 48 (11 self)
 Add to MetaCart
We present a variant of the AINV factorized sparse approximate inverse algorithm which is applicable to any symmetric positive definite matrix. The new preconditioner is breakdownfree and, when used in conjunction with the conjugate gradient method, results in a reliable solver for highly illconditioned linear systems. We also investigate an alternative approach to a stable approximate inverse algorithm, based on the idea of diagonally compensated reduction of matrix entries. The results of numerical tests on challenging linear systems arising from finite element modeling of elasticity and diffusion problems are presented.