Scan Primitives for GPU Computing
 GRAPHICS HARDWARE 2007
, 2007
Cited by 120 (8 self)
The scan primitives are powerful, general-purpose data-parallel primitives that are building blocks for a broad range of applications. We describe GPU implementations of these primitives, specifically an efficient formulation and implementation of segmented scan, on NVIDIA GPUs using the CUDA API. Using the scan primitives, we show novel GPU implementations of quicksort and sparse matrix-vector multiply, and analyze the performance of the scan primitives, several sort algorithms that use the scan primitives, and a graphical shallow-water fluid simulation using the scan framework for a tridiagonal matrix solver.
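For readers unfamiliar with the primitive, the following is a minimal sequential sketch of what an inclusive segmented sum scan computes. It is a reference for the semantics only, not the paper's CUDA implementation; the function name `segmented_scan` and the head-flag convention (a 1 marks the first element of a segment) are assumptions made for illustration.

```python
def segmented_scan(values, flags):
    """Inclusive segmented sum scan (sequential reference).

    flags[i] == 1 marks the first element of a new segment (assumed convention).
    The running sum restarts at every flagged position.
    """
    out = []
    running = 0
    for v, f in zip(values, flags):
        # Restart the sum at a segment head, otherwise keep accumulating.
        running = v if f else running + v
        out.append(running)
    return out


if __name__ == "__main__":
    # Three segments: [1, 2, 3], [4, 5], [6]
    print(segmented_scan([1, 2, 3, 4, 5, 6], [1, 0, 0, 1, 0, 1]))
```

A parallel version produces the same output but replaces the loop with a tree-structured combination of (flag, value) pairs.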
ILUT: A Dual Threshold Incomplete LU Factorization
, 1994
Cited by 85 (6 self)
In this paper we describe an Incomplete LU factorization technique based on a strategy which combines two heuristics. This ILUT factorization extends the usual ILU(0) factorization without using the concept of level of fill-in. There are two traditional ways of developing incomplete factorization preconditioners. The first uses a symbolic factorization approach in which a level of fill is attributed to each fill-in element using only the graph of the matrix. Then each fill-in that is introduced is dropped whenever its level of fill exceeds a certain threshold. The second class of methods consists of techniques derived from modifications of a given direct solver by including a drop-off rule, based on the numerical size of the fill-ins introduced, traditionally referred to as threshold preconditioners. The first type of approach may not be reliable for indefinite problems, since it does not consider numerical values. The second is often far more expensive than the standard IL...
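As a rough illustration of a numerical drop-off rule of the kind mentioned above, the sketch below filters one working row with two thresholds: a relative tolerance and a cap on the number of entries kept. The truncated abstract does not spell out the two heuristics, so this pairing, the function name `dual_threshold_drop`, and the parameters `tau` and `p` are assumptions for illustration; a full ILUT also treats the L and U parts of the row separately and always keeps the diagonal, which this sketch omits.

```python
import math


def dual_threshold_drop(row, tau, p):
    """Apply a two-stage dropping rule to one sparse row.

    row: dict mapping column index -> value (hypothetical representation).
    tau: relative tolerance; entries below tau * ||row||_2 are dropped.
    p:   maximum number of entries kept after the tolerance test.
    """
    norm = math.sqrt(sum(v * v for v in row.values()))
    # Stage 1: drop entries that are small relative to the row norm.
    kept = {j: v for j, v in row.items() if abs(v) >= tau * norm}
    # Stage 2: keep only the p largest entries by magnitude.
    largest = sorted(kept.items(), key=lambda jv: abs(jv[1]), reverse=True)[:p]
    return dict(sorted(largest))


if __name__ == "__main__":
    print(dual_threshold_drop({0: 4.0, 1: 0.01, 2: -3.0, 3: 0.5}, tau=0.05, p=2))
```

The first stage bounds the numerical error introduced by dropping, while the second bounds memory use per row.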
Krylov subspace methods on supercomputers
 SIAM J. SCI. STAT. COMPUT
, 1989
Cited by 69 (4 self)
This paper presents a short survey of recent research on Krylov subspace methods with emphasis on implementation on vector and parallel computers. Conjugate gradient methods have proven very useful on traditional scalar computers, and their popularity is likely to increase as three-dimensional models gain importance. A conservative approach to derive effective iterative techniques for supercomputers has been to find efficient parallel/vector implementations of the standard algorithms. The main source of difficulty in the incomplete factorization preconditionings is in the solution of the triangular systems at each step. We describe in detail a few approaches consisting of implementing efficient forward and backward triangular solutions. Then we discuss polynomial preconditioning as an alternative to standard incomplete factorization techniques. Another efficient approach is to reorder the equations so as to improve the structure of the matrix to achieve better parallelism or vectorization. We give an overview of these ideas and others and attempt to comment on their effectiveness or potential for different types of architectures.
ILUM: A Multi-Elimination ILU Preconditioner For General Sparse Matrices
 SIAM J. Sci. Comput
, 1999
Cited by 54 (11 self)
Standard preconditioning techniques based on incomplete LU (ILU) factorizations offer a limited degree of parallelism, in general. A few of the alternatives advocated so far consist of either using some form of polynomial preconditioning, or applying the usual ILU factorization to a matrix obtained from a multicolor ordering. In this paper we present an incomplete factorization technique based on independent set orderings and multicoloring. We note that in order to improve robustness, it is necessary to allow the preconditioner to have an arbitrarily high accuracy, as is done with ILUs based on threshold techniques. The ILUM factorization described in this paper is in this category. It can be viewed as a multifrontal version of a Gaussian elimination procedure with threshold dropping which has a high degree of potential parallelism. The emphasis is on methods that deal specifically with general unstructured sparse matrices such as those arising from finite element methods on un...
BILUM: Block versions of multielimination and multilevel ILU preconditioner for general sparse linear systems
 SIAM J. Sci. Comput
, 1999
Cited by 53 (29 self)
We introduce block versions of the multielimination incomplete LU (ILUM) factorization preconditioning technique for solving general sparse unstructured linear systems. These preconditioners have a multilevel structure and, for certain types of problems, may exhibit properties that are typically enjoyed by multigrid methods. Several heuristic strategies for forming blocks of independent sets are introduced and their relative merits are discussed. The advantages of block ILUM over point ILUM include increased robustness and efficiency. We compare several versions of the block ILUM, point ILUM, and the dual-threshold-based ILUT preconditioners. In particular, tests with some convection-diffusion problems show that it may be possible to obtain convergence that is nearly independent of the Reynolds number as well as of the grid size.
Segmented Operations for Sparse Matrix Computation on Vector Multiprocessors
, 1993
Cited by 27 (4 self)
In this paper we present a new technique for sparse matrix multiplication on vector multiprocessors based on the efficient implementation of a segmented sum operation. We describe how the segmented sum can be implemented on vector multiprocessors such that it both fully vectorizes within each processor and parallelizes across processors. Because of our method's insensitivity to relative row size, it is better suited than the Ellpack/Itpack or the Jagged Diagonal algorithms for matrices which have a varying number of nonzero elements in each row. Furthermore, our approach requires less preprocessing (no more time than a single sparse matrix-vector multiplication), less auxiliary storage, and uses a more convenient data representation (an augmented form of the standard compressed sparse row format). We have implemented our algorithm (SEGMV) on the Cray Y-MP C90, and have compared its performance with other methods on a variety of sparse matrices from the Harwell-Boeing collection and in...
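The idea of building a sparse matrix-vector product on a segmented sum can be sketched sequentially as follows. This is an illustrative reference on plain compressed sparse row arrays, not the SEGMV algorithm or its augmented data format; the function name `spmv_segmented` and the argument names `row_ptr`, `col_idx`, and `vals` are assumptions.

```python
def spmv_segmented(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a CSR matrix using a segmented sum.

    row_ptr, col_idx, vals: standard CSR arrays (hypothetical names).
    Each matrix row is one segment of the flat array of products.
    """
    # Step 1: one elementwise product per stored nonzero (independent of row length).
    products = [v * x[c] for v, c in zip(vals, col_idx)]

    # Step 2: flag the first nonzero of every non-empty row as a segment head.
    flags = [0] * len(products)
    for start in row_ptr[:-1]:
        if start < len(products):
            flags[start] = 1

    # Step 3: inclusive segmented sum over the products.
    sums = []
    running = 0
    for p, f in zip(products, flags):
        running = p if f else running + p
        sums.append(running)

    # Step 4: the last element of each segment is that row's result; empty rows give 0.
    y = []
    for i in range(len(row_ptr) - 1):
        start, end = row_ptr[i], row_ptr[i + 1]
        y.append(sums[end - 1] if end > start else 0)
    return y


if __name__ == "__main__":
    # A = [[1, 0, 2], [0, 0, 0], [0, 3, 4]], x = [1, 1, 1]
    print(spmv_segmented([0, 2, 2, 4], [0, 2, 1, 2], [1, 2, 3, 4], [1, 1, 1]))
```

Because steps 1 and 3 operate on the flat list of nonzeros, the work per step does not depend on how unevenly the nonzeros are distributed across rows.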
Highly Parallel Sparse Triangular Solution
, 1992
Cited by 27 (4 self)
In this paper we survey a recent approach for solving sparse triangular systems of equations on highly parallel computers. This approach employs a partitioned representation of the inverse of the triangular matrix so that the solution can be computed by matrix-vector multiplication. The number of factors in the partitioned inverse is proportional to the number of general communication steps (router steps on a CM-2) required in a highly parallel algorithm. We describe partitioning algorithms that minimize the number of factors in the partitioned inverse over all symmetric permutations of the triangular matrix such that the permuted matrix continues to be triangular. For a Cholesky factor we describe an O(n) time and space algorithm to solve the partitioning problem above, where n is the order of the matrix. Our computational results on a CM-2 demonstrate the potential superiority of the partitioned inverse approach over the conventional substitution algorithm for highly parallel spars...
Sparse Numerical Linear Algebra: Direct Methods and Preconditioning
, 1996
Cited by 17 (2 self)
Most of the current techniques for the direct solution of linear equations are based on supernodal or multifrontal approaches. An important feature of these methods is that arithmetic is performed on dense submatrices and Level 2 and Level 3 BLAS (matrix-vector and matrix-matrix kernels) can be used. Both sparse LU and QR factorizations can be implemented within this framework. Partitioning and ordering techniques have seen major activity in recent years. We discuss bisection and multisection techniques, extensions to orderings to block triangular form, and recent improvements and modifications to standard orderings such as minimum degree. We also study advances in the solution of indefinite systems and sparse least-squares problems. The desire to exploit parallelism has been responsible for many of the developments in direct methods for sparse matrices over the last ten years. We examine this aspect in some detail, illustrating how current techniques have been developed or ...
Exchange of Messages of Different Sizes
 In IRREGULAR '98
Cited by 10 (4 self)
In this paper, we study the exchange of messages among a set of processors linked through an interconnection network. We focus on general, non-uniform versions of all-to-all (or complete) exchange problems in asynchronous systems with a linear cost model and messages of arbitrary sizes. We extend previous complexity results to show that the general asynchronous problems are NP-complete. We present several approximation algorithms and determine which heuristics are best suited to several parallel systems. We conclude with experimental results that show that our algorithms outperform the native all-to-all exchange algorithm on an IBM SP2 when the number of processors is odd.