Scan Primitives for GPU Computing
 GRAPHICS HARDWARE 2007
, 2007
Abstract

Cited by 131 (8 self)
The scan primitives are powerful, general-purpose data-parallel primitives that are building blocks for a broad range of applications. We describe GPU implementations of these primitives, specifically an efficient formulation and implementation of segmented scan, on NVIDIA GPUs using the CUDA API. Using the scan primitives, we show novel GPU implementations of quicksort and sparse matrix-vector multiply, and analyze the performance of the scan primitives, several sort algorithms that use the scan primitives, and a graphical shallow-water fluid simulation using the scan framework for a tridiagonal matrix solver.
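For readers unfamiliar with the primitive named in this abstract, the following is a minimal sequential sketch of an inclusive segmented sum scan. It is a reference illustration only, not the paper's CUDA implementation; the function name `segmented_inclusive_scan` and the head-flag representation of segments are assumptions made for this example.

```python
def segmented_inclusive_scan(values, head_flags):
    """Inclusive sum scan that restarts at every segment head.

    values     -- sequence of numbers
    head_flags -- sequence of the same length; a truthy entry marks the
                  first element of a new segment (assumed representation)
    """
    if len(values) != len(head_flags):
        raise ValueError("values and head_flags must have the same length")

    result = []
    running = 0
    for value, is_head in zip(values, head_flags):
        # A segment head discards the running total; otherwise accumulate.
        running = value if is_head else running + value
        result.append(running)
    return result


if __name__ == "__main__":
    # Three segments: [1, 2, 3], [4, 5], [6]
    print(segmented_inclusive_scan([1, 2, 3, 4, 5, 6], [1, 0, 0, 1, 0, 1]))
```

One way this connects to sparse matrix-vector multiply: if each matrix row is treated as a segment of per-nonzero products, the last scan value in each segment is that row's dot product.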
Preconditioning techniques for large linear systems: A survey
 J. COMPUT. PHYS
, 2002
Abstract

Cited by 118 (5 self)
This article surveys preconditioning techniques for the iterative solution of large linear systems, with a focus on algebraic methods suitable for general sparse matrices. Covered topics include progress in incomplete factorization methods, sparse approximate inverses, reorderings, parallelization issues, and block and multilevel extensions. Some of the challenges ahead are also discussed. An extensive bibliography completes the paper.
SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems
 ACM Trans. Mathematical Software
, 2003
Abstract

Cited by 105 (19 self)
We present the main algorithmic features in the software package SuperLU_DIST, a distributed-memory sparse direct solver for large sets of linear equations. We give in detail our parallelization strategies, with a focus on scalability issues, and demonstrate the software’s parallel performance and scalability on current machines. The solver is based on sparse Gaussian elimination, with an innovative static pivoting strategy proposed earlier by the authors. The main advantage of static pivoting over classical partial pivoting is that it permits a priori determination of data structures and communication patterns, which lets us exploit techniques used in parallel sparse Cholesky algorithms to better parallelize both LU decomposition and triangular solution on large-scale distributed machines.
Weighted graph cuts without eigenvectors: A multilevel approach
 IEEE Trans. Pattern Anal. Mach. Intell
, 2007
Abstract

Cited by 84 (13 self)
A variety of clustering algorithms have recently been proposed to handle data that is not linearly separable; spectral clustering and kernel k-means are two of the main methods. In this paper, we discuss an equivalence between the objective functions used in these seemingly different methods: in particular, a general weighted kernel k-means objective is mathematically equivalent to a weighted graph clustering objective. We exploit this equivalence to develop a fast high-quality multilevel algorithm that directly optimizes various weighted graph clustering objectives, such as the popular ratio cut, normalized cut, and ratio association criteria. This eliminates the need for any eigenvector computation for graph clustering problems, which can be prohibitive for very large graphs. Previous multilevel graph partitioning methods such as Metis have suffered from the restriction of equal-sized clusters; our multilevel algorithm removes this restriction by using kernel k-means to optimize weighted graph cuts. Experimental results show that our multilevel algorithm outperforms a state-of-the-art spectral clustering algorithm in terms of speed, memory usage, and quality. We demonstrate that our algorithm is applicable to large-scale clustering tasks such as image segmentation, social network analysis, and gene network analysis. Index Terms: Clustering, data mining, segmentation, kernel k-means, spectral clustering, graph partitioning.
The design and use of algorithms for permuting large entries to the diagonal of sparse matrices
 SIAM J. MATRIX ANAL. APPL
, 1999
A Two-Dimensional Data Distribution Method For Parallel Sparse Matrix-Vector Multiplication
 SIAM REVIEW
Abstract

Cited by 72 (8 self)
A new method is presented for distributing data in sparse matrix-vector multiplication. The method is two-dimensional, tries to minimise the true communication volume, and also tries to spread the computation and communication work evenly over the processors. The method starts with a recursive bipartitioning of the sparse matrix, each time splitting a rectangular matrix into two parts with a nearly equal number of nonzeros. The communication volume caused by the split is minimised. After the matrix partitioning, the input and output vectors are partitioned with the objective of minimising the maximum communication volume per processor. Experimental results of our implementation, Mondriaan, for a set of sparse test matrices show a reduction in communication compared to one-dimensional methods, and in general a good balance in the communication work.
Hypergraph-Partitioning Based Decomposition for Parallel Sparse-Matrix Vector Multiplication
 IEEE Trans. on Parallel and Distributed Computing
Abstract

Cited by 63 (34 self)
In this work, we show that the standard graph-partitioning based decomposition of sparse matrices does not reflect the actual communication volume requirement for parallel matrix-vector multiplication. We propose two computational hypergraph models which avoid this crucial deficiency of the graph model. The proposed models reduce the decomposition problem to the well-known hypergraph partitioning problem. The recently proposed successful multilevel framework is exploited to develop a multilevel hypergraph partitioning tool PaToH for the experimental verification of our proposed hypergraph models. Experimental results on a wide range of realistic sparse test matrices confirm the validity of the proposed hypergraph models. In the decomposition of the test matrices, the hypergraph models using PaToH and hMeTiS result in up to 63% less communication volume (30% to 38% less on the average) than the graph model using MeTiS, while PaToH is only 1.3 to 2.3 times slower than MeTiS on the average. ...
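To make the notion of communication volume concrete, here is a small sketch that counts it for a row-wise partition. It assumes a column-net view (each column is a net connecting the rows that have a nonzero in it) and the connectivity-minus-one cost commonly used in hypergraph partitioning; the abstract above does not spell out its cost function, so both the metric and the name `column_net_volume` are assumptions of this example, not the paper's definition.

```python
from collections import defaultdict


def column_net_volume(nonzeros, row_part):
    """Estimate communication volume of y = A x for a row-wise partition.

    nonzeros -- iterable of (row, col) index pairs of the sparse matrix
    row_part -- row_part[i] is the processor that owns row i

    Assumed cost: a column touched by rows on k distinct processors
    contributes k - 1 words (its x entry is sent to k - 1 processors).
    """
    parts_per_column = defaultdict(set)
    for row, col in nonzeros:
        parts_per_column[col].add(row_part[row])
    return sum(len(parts) - 1 for parts in parts_per_column.values())


if __name__ == "__main__":
    # 4 x 4 pattern; rows 0-1 on processor 0, rows 2-3 on processor 1.
    pattern = [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2), (3, 0), (3, 3)]
    print(column_net_volume(pattern, [0, 0, 1, 1]))
```

In this toy pattern, columns 0 and 1 are each touched by both processors and columns 2 and 3 by one, so the count is driven by shared columns rather than by the number of cut nonzeros.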
Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate
 ACM Transactions on Mathematical Software
, 2008
Abstract

Cited by 61 (7 self)
CHOLMOD is a set of routines for factorizing sparse symmetric positive definite matrices of the form A or AA^T, updating/downdating a sparse Cholesky factorization, solving linear systems, updating/downdating the solution to the triangular system Lx = b, and many other sparse matrix functions for both symmetric and unsymmetric matrices. Its supernodal Cholesky factorization relies on LAPACK and the Level-3 BLAS, and obtains a substantial fraction of the peak performance of the BLAS. Both real and complex matrices are supported. CHOLMOD is written in ANSI/ISO C, with both C and MATLAB™ interfaces. It appears in MATLAB 7.2 as x=A\b when A is sparse symmetric positive definite, as well as in several other sparse matrix functions.
Optimizing the performance of sparse matrix-vector multiplication
, 2000
"... Copyright 2000 by EunJin Im ..."
ARMS: An Algebraic Recursive Multilevel Solver for general sparse linear systems
 Numer. Linear Alg. Appl
, 1999
Abstract

Cited by 49 (25 self)
This paper presents a general preconditioning method based on a multilevel partial solution approach. The basic step in constructing the preconditioner is to separate the initial points into two subsets. The first subset, which can be termed "coarse", is obtained by using "block" independent sets, or "aggregates". Two aggregates have no coupling between them, but nodes in the same aggregate may be coupled. The nodes not in the coarse set are part of what might be called the "fringe" set. The idea of the method is to form the Schur complement related to the fringe set. This leads to a natural block LU factorization which can be used as a preconditioner for the system. This system is then solved recursively, using as preconditioner the factorization that could be obtained from the next level. Unlike other multilevel preconditioners available, iterations between levels are allowed. One interesting aspect of the method is that it provides a common framework for many other technique...
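The block LU factorization and Schur complement mentioned in this abstract can be written out generically as follows. This is the standard two-by-two block identity, given here as a sketch; the symbols P (a permutation that orders the aggregate unknowns first), B, F, E, C, and S are notation assumed for this note and are not taken from the abstract.

```latex
\[
P A P^{T}
= \begin{pmatrix} B & F \\ E & C \end{pmatrix}
= \begin{pmatrix} I & 0 \\ E B^{-1} & I \end{pmatrix}
  \begin{pmatrix} B & F \\ 0 & S \end{pmatrix},
\qquad
S = C - E B^{-1} F .
\]
```

Because the aggregates are uncoupled from one another, B is block diagonal under this ordering and inexpensive to factor; the recursion described in the abstract would then apply the same splitting to the reduced system with matrix S.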