Efficient Sparse Matrix-Vector Multiplication on CUDA (2008)

by N Bell, M Garland
Results 1 - 10 of 113

Implementing sparse matrix-vector multiplication on throughput-oriented processors

by Nathan Bell, Michael Garland - In SC ’09: Proceedings of the 2009 ACM/IEEE conference on Supercomputing, 2009
"... Sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra. In contrast to the uniform regularity of dense linear algebra, sparse operations encounter a broad spectrum of matrices ranging from the regular to the highly irregular. Harnessing the tremendous potential ..."
Abstract - Cited by 142 (7 self) - Add to MetaCart
Sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra. In contrast to the uniform regularity of dense linear algebra, sparse operations encounter a broad spectrum of matrices ranging from the regular to the highly irregular. Harnessing the tremendous potential of throughput-oriented processors for sparse operations requires that we expose substantial fine-grained parallelism and impose sufficient regularity on execution paths and memory access patterns. We explore SpMV methods that are well-suited to throughput-oriented architectures like the GPU and which exploit several common sparsity classes. The techniques we propose are efficient, successfully utilizing large percentages of peak bandwidth. Furthermore, they deliver excellent total throughput, averaging 16 GFLOP/s and 10 GFLOP/s in double precision for structured grid and unstructured mesh matrices, respectively, on a GeForce GTX 285. This is roughly 2.8 times the throughput previously achieved on Cell BE and more than 10 times that of a quad-core Intel Clovertown system.

Citation Context

...cture the matrix. In this paper, we summarize how data is laid out in memory and how work is assigned to parallel threads. The source code for these SpMV programs can be found in our technical report [4] and its accompanying software package. 3.1 Diagonal Format When nonzero values are restricted to a small number of matrix diagonals, the diagonal format (DIA) is an appropriate representation [19]. A...
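To make the diagonal (DIA) layout concrete, here is a minimal CUDA sketch of a DIA SpMV kernel. The names (num_rows, num_cols, num_diags, offsets, data) and the column-major storage of data are assumptions for illustration, not code taken from the technical report.

    // One thread per row; data holds num_diags diagonals stored column-major
    // (num_rows entries per diagonal), offsets[d] is the offset of diagonal d.
    __global__ void spmv_dia(int num_rows, int num_cols, int num_diags,
                             const int *offsets, const float *data,
                             const float *x, float *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < num_rows) {
            float dot = 0.0f;
            for (int d = 0; d < num_diags; d++) {
                int col = row + offsets[d];
                float val = data[d * num_rows + row];  // consecutive threads read consecutive addresses
                if (col >= 0 && col < num_cols)
                    dot += val * x[col];
            }
            y[row] = dot;
        }
    }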

One point isometric matching with the heat kernel

by Maks Ovsjanikov, Quentin Mérigot, Facundo Mémoli, Leonidas J. Guibas - Computer Graphics Forum
"... HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte p ..."
Abstract - Cited by 68 (4 self) - Add to MetaCart
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. The multidisciplinary open archive HAL is intended for the deposit and dissemination of research-level scientific documents, whether published or not, originating from French or foreign teaching and research institutions, and from public or private laboratories.

Citation Context

...to contiguous sets of rows and issue separate threads to compute the product Lv independently for each set. Note that sparse matrix multiplication has also been implemented in graphics hardware (e.g. [BG08]), which can improve efficiency. Moreover, an efficient multiresolution approach to computing the heat kernel has been recently proposed by Vaxman et al. [VBCGar]. To compute nearest neighbors in the ...
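The row-partitioned product Lv described in this snippet can be sketched, in its simplest limiting form of one row per thread, as a CSR kernel; this is a generic illustration with assumed array names (row_ptr, col_idx, vals), not the authors' implementation.

    // Sparse Laplacian times vector, CSR storage, one thread per matrix row.
    __global__ void csr_spmv_row_per_thread(int num_rows,
                                            const int *row_ptr, const int *col_idx,
                                            const float *vals, const float *v, float *Lv)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < num_rows) {
            float sum = 0.0f;
            for (int j = row_ptr[row]; j < row_ptr[row + 1]; j++)
                sum += vals[j] * v[col_idx[j]];
            Lv[row] = sum;
        }
    }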

Model-driven autotuning of sparse matrix-vector multiply on GPUs

by Jee W. Choi, Amik Singh, Richard W. Vuduc - In PPoPP, 2010
"... We present a performance model-driven framework for automated performance tuning (autotuning) of sparse matrix-vector multiply (SpMV) on systems accelerated by graphics processing units (GPU). Our study consists of two parts. First, we describe several carefully hand-tuned SpMV implementations for G ..."
Abstract - Cited by 65 (4 self) - Add to MetaCart
We present a performance model-driven framework for automated performance tuning (autotuning) of sparse matrix-vector multiply (SpMV) on systems accelerated by graphics processing units (GPU). Our study consists of two parts. First, we describe several carefully hand-tuned SpMV implementations for GPUs, identifying key GPU-specific performance limitations, enhancements, and tuning opportunities. These implementations, which include variants on classical blocked compressed sparse row (BCSR) and blocked ELLPACK (BELLPACK) storage formats, match or exceed state-of-the-art implementations. For instance, our best BELLPACK implementation achieves up to 29.0 Gflop/s in single-precision and 15.7 Gflop/s in double-precision on the NVIDIA T10P multiprocessor (C1060), enhancing prior state-of-the-art unblocked implementations (Bell and Garland, 2009) by up to 1.8× and 1.5× for single- and double-precision respectively. However, achieving this level of performance requires input matrix-dependent parameter tuning. Thus, in the second part of this study, we develop a performance model that can guide tuning. Like prior autotuning models for CPUs (e.g., Im, Yelick, and Vuduc, 2004), this model requires offline measurements and run-time estimation, but more directly models the structure of multithreaded vector processors like GPUs. We show that our model can identify the implementations that achieve within 15% of those found through exhaustive search.

Citation Context

...gle NVIDIA T10P multiprocessor-based GPU, enables improvements over the best unblocked state-of-the-art implementation by up to 1.8× and 1.5× for single and double-precision computations respectively [3]. However, BELLPACK requires careful tuning. Thus, we propose a novel and accurate performance model-driven framework. (Footnote 1: At the time, up to over twice the performance of the cublasSgemv kernel in the C...)
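For readers unfamiliar with the (B)ELLPACK family, a plain, unblocked ELLPACK SpMV kernel might look like the sketch below; the padded, column-major indices/data arrays and all names are assumptions for illustration, and the blocked BELLPACK variant evaluated in the paper adds register blocking on top of this idea.

    // ELLPACK SpMV: every row padded to max_cols_per_row entries; indices and data
    // are num_rows x max_cols_per_row, stored column-major so loads coalesce.
    __global__ void spmv_ell(int num_rows, int max_cols_per_row,
                             const int *indices, const float *data,
                             const float *x, float *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < num_rows) {
            float dot = 0.0f;
            for (int k = 0; k < max_cols_per_row; k++) {
                float val = data[k * num_rows + row];
                int   col = indices[k * num_rows + row];
                if (val != 0.0f)          // padding is stored as explicit zeros
                    dot += val * x[col];
            }
            y[row] = dot;
        }
    }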

Accelerating CUDA graph algorithms at maximum warp

by Sungpack Hong, Sang Kyun Kim, Tayo Oguntebi, Kunle Olukotun - In PPoPP, 2011
"... Graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems but their performance suffers heavily when the graph structure is highly irregular, as most real-world graphs t ..."
Abstract - Cited by 49 (3 self) - Add to MetaCart
Graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems but their performance suffers heavily when the graph structure is highly irregular, as most real-world graphs tend to be. In this study, we first observe that the poor performance is caused by work imbalance and is an artifact of a discrepancy between the GPU programming model and the underlying GPU architecture. We then propose a novel virtual warp-centric programming method that exposes the traits of underlying GPU architectures to users. Our method significantly improves the performance of applications with heavily imbalanced workloads, and enables trade-offs between workload imbalance and ALU underutilization for fine-tuning the performance. Our evaluation reveals that our method exhibits up to 9x speedup over previous GPU algorithms and 12x over single thread CPU execution on irregular graphs. When properly configured, it also yields up to 30% improvement over previous GPU algorithms on regular graphs. In addition to performance gains on graph algorithms, our programming method achieves 1.3x to 15.1x speedup on a set of GPU benchmark applications. Our study also confirms that the performance gap between GPUs and other multi-threaded CPU graph implementations is primarily due to the large difference in memory bandwidth.

Citation Context

...nel, which resulted in more than 20% improvement. We use this optimized version as our baseline. (Footnote 3: This data-structure is also known as compressed sparse row (CSR) in sparse-matrix computation domain [9].) [Figure 3. (a) A visualization of the graph data structure; (b) the node-degree distribution, plotted as # Nodes versus Degree on log scales from 1E+0 to 1E+6.] ...
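A rough sketch of the virtual warp-centric idea on a CSR-like adjacency structure: a compile-time group of W threads (the virtual warp) cooperatively scans one vertex's edge list, trading ALU utilization against work imbalance. The kernel below is a generic BFS-style illustration with assumed names, not the paper's code.

    // level[v] == cur marks the current BFS frontier; unvisited vertices hold -1.
    template <int W>
    __global__ void bfs_expand(int num_vertices, const int *row_ptr, const int *adj,
                               int *level, int cur)
    {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int v    = tid / W;    // vertex handled by this virtual warp
        int lane = tid % W;    // thread's position within the virtual warp
        if (v < num_vertices && level[v] == cur) {
            // W threads stride over v's adjacency list together.
            for (int e = row_ptr[v] + lane; e < row_ptr[v + 1]; e += W) {
                int u = adj[e];
                if (level[u] == -1)
                    atomicCAS(&level[u], -1, cur + 1);  // claim u for the next level
            }
        }
    }

A host loop would relaunch this kernel with cur = 0, 1, 2, ... until no vertex is newly labeled.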

Assembly of finite element methods on graphics processors

by Cris Cecka, Adrian J. Lew, E. Darve - International Journal for Numerical Methods in Engineering
"... Recently, graphics processing units (GPUs) have had great success in accelerating many numerical computations. We present their application to computations on unstructured meshes such as those in finite element methods. Multiple approaches in assembling and solving sparse linear systems with NVIDIA ..."
Abstract - Cited by 20 (0 self) - Add to MetaCart
Recently, graphics processing units (GPUs) have had great success in accelerating many numerical computations. We present their application to computations on unstructured meshes such as those in finite element methods. Multiple approaches in assembling and solving sparse linear systems with NVIDIA GPUs and the Compute Unified Device Architecture (CUDA) are presented and discussed. Multiple strategies for efficient use of global, shared, and local memory, methods to achieve memory coalescing, and optimal choice of parameters are introduced. We find that with appropriate preprocessing and arrangement of support data, the GPU coprocessor achieves speedups of 30 or more in comparison to a well optimized serial implementation. We also find that the optimal assembly strategy depends on the order of polynomials used in the finite-element discretization.

Citation Context

... (CNC) which implements a preconditioned conjugate gradient solver using block compressed row storage for improved register blocking, but can result in non-optimal global memory accesses. Bell et al. [23] compare the performance of the SpMV under many sparse storage formats and sparsity patterns, from regular to highly irregular. Baskaran et al. [24] provide a highly optimized SpMV kernel specifically...
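One simple GPU assembly strategy, given here only as an illustration of the design space the paper explores (not necessarily the strategy the authors recommend), is to scatter precomputed element contributions into the CSR value array with atomic additions, using a precomputed map from each element-local entry to its slot in the global matrix; all names below are assumptions.

    // element_vals: packed local stiffness entries, one block per element.
    // scatter_map:  for each local entry, its destination index in csr_vals.
    __global__ void assemble_csr_atomic(int num_local_entries,
                                        const float *element_vals,
                                        const int *scatter_map,
                                        float *csr_vals)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < num_local_entries)
            atomicAdd(&csr_vals[scatter_map[i]], element_vals[i]);
    }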

Parallel simrank computation on large graphs with iterative aggregation

by Guoming He, Haijun Feng, Cuiping Li, Hong Chen - KDD'10, 2010
"... Recently there has been a lot of interest in graph-based analysis. One of the most important aspects of graph-based analysis is to measure similarity between nodes in a graph. SimRank is a simple and influential measure of this kind, based on a solid graph theoretical model. However, existing method ..."
Abstract - Cited by 18 (2 self) - Add to MetaCart
Recently there has been a lot of interest in graph-based analysis. One of the most important aspects of graph-based analysis is to measure similarity between nodes in a graph. SimRank is a simple and influential measure of this kind, based on a solid graph theoretical model. However, existing methods on SimRank computation suffer from two limitations: 1) the computing cost can be very high in practice; and 2) they can only be applied on static graphs. In this paper, we exploit the inherent parallelism and high memory bandwidth of graphics processing units (GPU) to accelerate the computation of SimRank on large graphs. Furthermore, based on the observation that SimRank is essentially a first-order Markov Chain, we propose to utilize the iterative aggregation techniques for uncoupling Markov chains to compute SimRank scores in parallel for large graphs. The iterative aggregation method can be applied on dynamic graphs. Moreover, it can handle not only the link-updating problem but also the node-updating problem. Extensive experiments on synthetic and real data sets verify that the proposed methods are efficient and effective.

Citation Context

...ultiplications, which is indeed suitable to take advantage of parallel programming. [3] discussed the challenges and advances in implementing sparse matrix multiplication on parallel architectures, and [2] used graphics processors to further improve the performance. 8. CONCLUSION This paper addresses the issues of optimization as well as incremental update of SimRank for static and dynamic graphs. We ha...
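For intuition, the basic SimRank recurrence, S(a,b) = c / (|I(a)| |I(b)|) * sum over in-neighbors i of a and j of b of S(i,j), with S(a,a) = 1, can be written as a naive one-thread-per-pair CUDA kernel over CSR in-neighbor lists. This sketch (all names assumed) only illustrates the iterative matrix view; it is not the iterative-aggregation method the paper proposes.

    // One SimRank iteration; in_ptr/in_adj store in-neighbor lists in CSR form,
    // S_old and S_new are n x n row-major similarity matrices, c is the decay factor.
    __global__ void simrank_step(int n, float c, const int *in_ptr, const int *in_adj,
                                 const float *S_old, float *S_new)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= n * n) return;
        int a = idx / n, b = idx % n;
        if (a == b) { S_new[idx] = 1.0f; return; }
        int da = in_ptr[a + 1] - in_ptr[a];   // in-degree of a
        int db = in_ptr[b + 1] - in_ptr[b];   // in-degree of b
        if (da == 0 || db == 0) { S_new[idx] = 0.0f; return; }
        float sum = 0.0f;
        for (int p = in_ptr[a]; p < in_ptr[a + 1]; p++)
            for (int q = in_ptr[b]; q < in_ptr[b + 1]; q++)
                sum += S_old[in_adj[p] * n + in_adj[q]];
        S_new[idx] = c * sum / (da * db);
    }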

Exposing fine-grained parallelism in algebraic multigrid methods

by Nathan Bell, Steven Dalton, Luke N. Olson, 2012
"... Algebraic multigrid methods for large, sparse linear systems are a necessity in many computational simulations, yet parallel algorithms for such solvers are generally decomposed into coarse-grained tasks suitable for distributed computers with traditional processing cores. However, accelerating mu ..."
Abstract - Cited by 17 (0 self) - Add to MetaCart
Algebraic multigrid methods for large, sparse linear systems are a necessity in many computational simulations, yet parallel algorithms for such solvers are generally decomposed into coarse-grained tasks suitable for distributed computers with traditional processing cores. However, accelerating multigrid on massively parallel throughput-oriented processors, such as the GPU, demands algorithms with abundant fine-grained parallelism. In this paper, we develop a parallel algebraic multigrid method which exposes substantial fine-grained parallelism in both the construction of the multigrid hierarchy as well as the cycling or solve stage. Our algorithms are expressed in terms of scalable parallel primitives that are efficiently implemented on the GPU. The resulting solver achieves an average speedup of 1.8× in the setup phase and 5.7× in the cycling phase when compared to a representative CPU implementation.

Citation Context

...mponent is that of simplifying an array to a single value, or a reduction. In Thrust, the reduce algorithm reduces a range of numbers to a single value by successively summing values together: reduce([3, 4, 1, 5, 2]) → 15. The same algorithm can be used to determine the maximum entry, by specifying maximum for the reduction operator: reduce([3, 4, 1, 5, 2], maximum) → 5. In general, any function that is both commu...
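The reductions described in this snippet map directly onto Thrust; a minimal, self-contained example (not taken from the paper's code) is:

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>
    #include <cstdio>

    int main()
    {
        int h[] = {3, 4, 1, 5, 2};
        thrust::device_vector<int> d(h, h + 5);

        // Default binary operator is plus: 3 + 4 + 1 + 5 + 2 = 15.
        int sum = thrust::reduce(d.begin(), d.end());

        // Supplying maximum<int> as the operator returns the largest entry: 5.
        int mx = thrust::reduce(d.begin(), d.end(), 0, thrust::maximum<int>());

        printf("sum = %d, max = %d\n", sum, mx);
        return 0;
    }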

A memory efficient and fast sparse matrix vector product on a GPU

by A. Dziekonski, A. Lamecki, M. Mrozowski - PROGRESS IN ELECTROMAGNETICS RESEARCH, 2011
"... This paper proposes a new sparse matrix storage format which allows an efficient implementation of a sparse matrix vector product on a Fermi Graphics Processing Unit (GPU). Unlike previous formats it has both low memory footprint and good throughput. The new format, which we call Sliced ELLR-T has ..."
Abstract - Cited by 13 (2 self) - Add to MetaCart
This paper proposes a new sparse matrix storage format which allows an efficient implementation of a sparse matrix vector product on a Fermi Graphics Processing Unit (GPU). Unlike previous formats it has both low memory footprint and good throughput. The new format, which we call Sliced ELLR-T, has been designed specifically for accelerating the iterative solution of a large sparse and complex-valued system of linear equations arising in computational electromagnetics. Numerical tests have shown that the performance of the new implementation reaches 69 GFLOPS in complex single precision arithmetic. Compared to the optimized six-core Central Processing Unit (CPU) (Intel Xeon 5680) this performance implies a speedup by a factor of six. In terms of speed the new format is as fast as the best format published so far and at the same time it does not introduce redundant zero elements which have to be stored to ensure fast memory access. Compared to previously published solutions, significantly larger problems can be handled using low cost commodity GPUs with limited amount of on-board memory.

Citation Context

...multilevel preconditioners with the Jacobi smoother [25] are used. One of the factors that affects the efficiency of the matrix-vector product is the way the sparse matrix is stored in the GPU memory [21]. For matrices with irregular non-zero entry patterns the best results on a GPU are obtained using variants of the storage scheme known as Ellpack [21]. The Ellpack format can be modified in order to ...
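A simplified, unsliced sketch of the ELLPACK-R idea that Sliced ELLR-T builds on: the padded Ellpack arrays are kept, but a per-row length array lets each thread stop at its row's last nonzero instead of scanning the padding. Names and layout are assumptions for illustration; the paper's Sliced ELLR-T additionally slices the matrix and uses several threads per row.

    // ELLPACK-R style SpMV: column-major data/indices plus a row-length array rl.
    __global__ void spmv_ellr(int num_rows, const int *indices, const float *data,
                              const int *rl, const float *x, float *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < num_rows) {
            float dot = 0.0f;
            for (int k = 0; k < rl[row]; k++)   // rl[row] = nonzeros in this row
                dot += data[k * num_rows + row] * x[indices[k * num_rows + row]];
            y[row] = dot;
        }
    }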

DL: A Data Layout Transformation System for Heterogeneous Computing

by Geng Daniel Liu, Wen-mei W. Hwu - Proc. IEEE Conf. Innovative Parallel Computing (InPar 12), IEEE, 2012
"... For many-core architectures like the GPUs, efficient off-chip memory access is crucial to high performance; the applications are often limited by off-chip memory bandwidth. Transforming data layout is an effective way to reshape the access patterns to improve off-chip memory access behavior, but sev ..."
Abstract - Cited by 13 (2 self) - Add to MetaCart
For many-core architectures like the GPUs, efficient off-chip memory access is crucial to high performance; the applications are often limited by off-chip memory bandwidth. Transforming data layout is an effective way to reshape the access patterns to improve off-chip memory access behavior, but several challenges had limited the use of automated data layout transformation systems on GPUs, namely how to efficiently handle arrays of aggregates, and transparently marshal data between layouts required by different performance sensitive kernels and legacy host code. While GPUs have higher memory bandwidth and are natural candidates for marshaling data between layouts, the relatively constrained GPU memory capacity, compared to that of the CPU, implies that not only the temporal cost of marshaling but also the spatial overhead must be considered for any practical layout transformation systems. This paper presents DL, a practical GPU data layout transformation system that addresses these problems: first, a novel approach to laying out array of aggregate types across GPU and CPU architectures is proposed to further improve memory parallelism and kernel performance beyond what is achieved by human programmers using discrete arrays today. Our proposed new layout can be derived in situ from the traditional Array of Structure, Structure of Arrays, and adjacent Discrete Arrays layouts used by programmers. Second, DL has a run-time library implemented in OpenCL that transparently and efficiently converts, or marshals, data to accommodate application components that have different data layout requirements. We present insights that lead to the design of this highly efficient run-time marshaling library. In particular, the in situ transformation implemented in the library is comparable or faster than optimized traditional out-of-place transformations while avoiding doubling the GPU DRAM usage. Third, we show experimental results that the new layout approach leads to substantial performance improvement at the applications level even when all marshaling cost is taken into account.

Citation Context

...oid AoS(__global foo* f) {
        f[get_global_id(0)].bar *= 2.0;
    }

    __kernel void DA(__global float *bar,
                     __global int *baz) {
        bar[get_global_id(0)] *= 2.0;
    }

    struct foo_2 {
        float bar[4];
        int baz[4];
    };

    __kernel void ASTA(__global foo_2* f) {
        int gid0 = get_global_id(0);
        f[gid0/4].bar[gid0%4] *= 2.0;
    }

    Listing 1: AoS, Discrete Arrays, and ASTA

It is commonly assum...
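The in-place AoS-to-ASTA conversion in Listing 1 reduces to an index remapping: with S scalar fields per element and a tile width of T elements, field f of element i moves from flat offset i*S + f to (i/T)*T*S + f*T + (i%T). A small host-side sketch of this mapping, with illustrative names only (the DL run-time performs the marshaling in situ and in parallel on the GPU):

    #include <cstdio>

    // Flat destination offset of field f of element i after AoS -> ASTA
    // with tile width T and S scalar fields per element.
    int asta_offset(int i, int f, int T, int S)
    {
        return (i / T) * T * S + f * T + (i % T);
    }

    int main()
    {
        const int N = 8, T = 4, S = 2;        // 8 elements, 2 float-sized fields each
        float aos[N * S], asta[N * S];
        for (int i = 0; i < N; i++) {         // field 0 = "bar", field 1 = "baz"
            aos[i * S + 0] = (float)i;
            aos[i * S + 1] = 100.0f + i;
        }
        // Out-of-place marshaling for clarity; DL does this in place.
        for (int i = 0; i < N; i++)
            for (int f = 0; f < S; f++)
                asta[asta_offset(i, f, T, S)] = aos[i * S + f];
        printf("field bar of element 5 lands at asta[%d]\n", asta_offset(5, 0, T, S));
        return 0;
    }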

A Parallel Algebraic Multigrid Solver on Graphics Processing Units

by Gundolf Haase, Manfred Liebmann, Craig C. Douglas, Gernot Plank
"... Abstract. The paper presents a multi-GPU implementation of the preconditioned conjugate gradient algorithm with an algebraic multigrid preconditioner (PCG-AMG) for an elliptic model problem on a 3D unstructured grid. An efficient parallel sparse matrix-vector multiplication scheme underlying the PCG ..."
Abstract - Cited by 13 (1 self) - Add to MetaCart
The paper presents a multi-GPU implementation of the preconditioned conjugate gradient algorithm with an algebraic multigrid preconditioner (PCG-AMG) for an elliptic model problem on a 3D unstructured grid. An efficient parallel sparse matrix-vector multiplication scheme underlying the PCG-AMG algorithm is presented for the manycore GPU architecture. A performance comparison of the parallel solver shows that a single Nvidia Tesla C1060 GPU board delivers the performance of a sixteen node Infiniband cluster and a multi-GPU configuration with eight GPUs is about 100 times faster than a typical server CPU core.

Citation Context

...cation Kernel The most challenging kernel within the PCG-AMG solver is the sparse matrix-vector multiplication v := Au. Different approaches to the sparse matrix-vector multiplication are discussed in [11, 1, 2]. Due to the coalescing restriction in accessing the GPU memory it is not efficient to use the standard compressed row storage (CRS) data format. Since there is no natural blocking within the data str...
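One common way to recover coalesced reads from CRS/CSR data, shown here only as a generic sketch rather than the scheme used in this paper, is to assign a full 32-thread warp to each row so that neighboring threads load consecutive vals/col_idx entries; the partial sums are then combined with a warp-level reduction (names assumed; launch with blockDim.x a multiple of 32).

    // Warp-per-row CSR SpMV computing v := A*u.
    __global__ void csr_spmv_warp_per_row(int num_rows, const int *row_ptr,
                                          const int *col_idx, const float *vals,
                                          const float *u, float *v)
    {
        int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
        int lane    = threadIdx.x & 31;
        if (warp_id < num_rows) {
            float sum = 0.0f;
            for (int j = row_ptr[warp_id] + lane; j < row_ptr[warp_id + 1]; j += 32)
                sum += vals[j] * u[col_idx[j]];   // 32 consecutive loads per step
            for (int off = 16; off > 0; off >>= 1)  // warp-level reduction
                sum += __shfl_down_sync(0xffffffff, sum, off);
            if (lane == 0)
                v[warp_id] = sum;
        }
    }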
