## Efficient sparse matrix-vector multiplication on CUDA (2008)

Citations: 50 (1 self)

### BibTeX

```bibtex
@techreport{Bell08efficientsparse,
  author      = {Nathan Bell and Michael Garland},
  title       = {Efficient sparse matrix-vector multiplication on CUDA},
  institution = {},
  year        = {2008}
}
```


### Abstract

The massive parallelism of graphics processing units (GPUs) offers tremendous performance in many high-performance computing applications. While dense linear algebra readily maps to such platforms, harnessing this potential for sparse matrix computations presents additional challenges. Given its role in iterative methods for solving sparse linear systems and eigenvalue problems, sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra. In this paper we discuss data structures and algorithms for SpMV that are efficiently implemented on the CUDA platform for the fine-grained parallel architecture of the GPU. Given the memory-bound nature of SpMV, we emphasize memory bandwidth efficiency and compact storage formats. We consider a broad spectrum of sparse matrices, from those that are well-structured and regular to highly irregular matrices with large imbalances in the distribution of nonzeros per matrix row. We develop methods to exploit several common forms of matrix structure while offering alternatives which accommodate greater irregularity. On structured, grid-based matrices we achieve performance of 36 GFLOP/s in single precision and 16 GFLOP/s in double precision on a GeForce GTX 280 GPU. For unstructured finite-element matrices, we observe performance in excess of 15 GFLOP/s and 10 GFLOP/s in single and double precision respectively. These results compare favorably to prior state-of-the-art studies of SpMV methods on conventional multicore processors. Our double precision SpMV performance is generally two and a half times that of a Cell BE with 8 SPEs and more than ten times greater than that of a quad-core Intel Clovertown system.

### Citations

1639 | Iterative methods for sparse linear systems
- Saad
- 2003
Citation Context: ...-scale linear systems and eigenvalue problems that arise in a wide variety of scientific and engineering applications. The remaining parts of these iterative methods (e.g. the conjugate gradient method [16]) typically reduce to dense linear algebra operations that are readily handled by optimized BLAS [10] and LAPACK [1] implementations. Modern NVIDIA GPUs are throughput-oriented manycore processors th...

567 | Basic linear algebra subprograms for Fortran usage
- Lawson, Hanson, et al.
- 1979
Citation Context: ...ng applications. The remaining parts of these iterative methods (e.g. the conjugate gradient method [16]) typically reduce to dense linear algebra operations that are readily handled by optimized BLAS [10] and LAPACK [1] implementations. Modern NVIDIA GPUs are throughput-oriented manycore processors that offer very high peak computational throughput. Realizing this potential requires exposing large amo...

209 | Approximation algorithms for scheduling unrelated parallel machines
- Lenstra, Shmoys, et al.
- 1990
Citation Context: ...ata arrangement reduces to a packing problem: assign rows to threads such that the maximum work assigned to any thread is minimized. Although this minimum multiprocessor scheduling problem is NP-hard [11], we find that simple packing heuristics suffice for the domain of interest. Figure 14 illustrates packed representations of the two submatrices in Figure 13 using four threads and a greedy least-occu...
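A greedy least-occupied packing rule of the kind this context describes can be sketched as follows. This is an illustrative host-side reconstruction, not the paper's code; the function name and the choice to process rows in decreasing-work order (the standard LPT refinement of the greedy rule) are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Greedy least-occupied packing: assign each row to the thread (bin)
// that currently has the least total work. Rows are considered in
// order of decreasing nonzero count. Returns a thread index per row.
std::vector<int> pack_rows(const std::vector<int>& nnz_per_row, int num_threads) {
    // Sort row indices by descending nonzero count.
    std::vector<int> order(nnz_per_row.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return nnz_per_row[a] > nnz_per_row[b]; });

    std::vector<int> load(num_threads, 0);           // accumulated work per thread
    std::vector<int> assignment(nnz_per_row.size());
    for (int row : order) {
        // Pick the currently least-loaded thread.
        int t = static_cast<int>(
            std::min_element(load.begin(), load.end()) - load.begin());
        assignment[row] = t;
        load[t] += nnz_per_row[row];
    }
    return assignment;
}
```

Because the problem is NP-hard, this heuristic gives no optimality guarantee, but as the context notes, simple rules like this suffice in practice for SpMV row packing.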

159 | NVIDIA Tesla: A unified graphics and computing architecture
- Lindholm, Nickolls, et al.
- 2008
Citation Context: ...iented architectures will provide both (1) some form of SIMD thread execution and (2) vectorized or coalesced load/store operations. A modern NVIDIA GPU is built around an array of SM multiprocessors [12], each of which supports up to 1024 co-resident threads. A single multiprocessor is equipped with 8 scalar cores, 16384 32-bit registers, and 16KB of high-bandwidth low-latency memory. Integer and sin...

159 | Benchmarking GPUs to Tune Dense Linear Algebra
- Volkov, Demmel
Citation Context: ...al requires exposing large amounts of fine-grained parallelism and structuring computations to exhibit sufficient regularity of execution paths and memory access patterns. Recently, Volkov and Demmel [18] and Barrachina et al. [2] have demonstrated how to achieve significant percentages of peak floating point throughput and bandwidth on dense matrix operations. Dense operations are quite regular and a...

131 | Scalable parallel programming with CUDA
- Nickolls, Buck, et al.
Citation Context: ...ghly 90 GBytes/s, or 63.5% of peak, which indicates a relatively high level of efficiency on this bandwidth-limited computation. 2 Parallel Programming with CUDA In the CUDA parallel programming model [13, 14], an application consists of a sequential host program that may execute parallel programs known as kernels on a parallel device. A kernel is an SPMD (Single Program, Multiple Data) computation that is e...

129 | Scan primitives for GPU computing
- Sengupta, Harris, et al.
- 2007
Citation Context: ...COO kernel (not shown). Segmented reduction is a data-parallel operation which, like other primitives such as parallel prefix sum (scan), facilitates numerous parallel algorithms [3]. Sengupta et al. [17] discuss efficient CUDA implementations of common parallel primitives, including an application of segmented scan to SpMV. Our COO kernel is most closely related to the work of Blelloch et al. [4], wh...

104 | Optimization of sparse matrix-vector multiplication on emerging multicore platforms
- Williams, Oliker, et al.
- 2009
Citation Context: ...than uncoalesced single-word accesses on a memory bandwidth basis. 5.2 Unstructured Matrices Our unstructured matrix performance study considers the same corpus of 14 matrices used by Williams et al. [19] for benchmarking SpMV performance on several multicore processors. Table 4 lists these matrices and summarizes their basic properties. Further details regarding the origin of each matrix are provided...

100 | Prefix sums and their applications
- Blelloch
- 1990
Citation Context: ...basis of our complete COO kernel (not shown). Segmented reduction is a data-parallel operation which, like other primitives such as parallel prefix sum (scan), facilitates numerous parallel algorithms [3]. Sengupta et al. [17] discuss efficient CUDA implementations of common parallel primitives, including an application of segmented scan to SpMV. Our COO kernel is most closely related to the work of B...

91 | Toward the Optimal Preconditioned Eigensolver: Locally Optimal Block Preconditioned Conjugate Gradient Method
- Knyazev
Citation Context: ...r linear systems, this situation occurs when solving for several right-hand sides simultaneously (i.e. AX = B where B has multiple columns). Furthermore, in the case of eigensolvers such as LOBPCG [9], it is not uncommon to utilize block vectors with ten or more columns. [Figure residue omitted: axis ticks from relative-performance bar charts comparing Cell, Opteron, Xeon, and Niagara across matrices.]

65 | Sparsity: Optimization framework for sparse matrix kernels
- Im, Yelick, et al.
Citation Context: ...e PKT format is outperformed by HYB in a majority of cases considered, further improvements are possible. 6.2 Future Work We have not considered block formats, such as Block CSR or Variable-Block CSR [8], in this paper. Block formats can deliver higher performance [5, 19], particularly for matrices arising in vector-valued problems. A number of the techniques we have applied to scalar formats are comp...

34 | NVIDIA CUDA Programming Guide
- NVIDIA Corporation
- 2007
Citation Context: ...ghly 90 GBytes/s, or 63.5% of peak, which indicates a relatively high level of efficiency on this bandwidth-limited computation. 2 Parallel Programming with CUDA In the CUDA parallel programming model [13, 14], an application consists of a sequential host program that may execute parallel programs known as kernels on a parallel device. A kernel is an SPMD (Single Program, Multiple Data) computation that is e...

28 | Segmented operations for sparse matrix computation on vector multiprocessors
- Blelloch, Heroux, et al.
- 1993
Citation Context: ...l. [17] discuss efficient CUDA implementations of common parallel primitives, including an application of segmented scan to SpMV. Our COO kernel is most closely related to the work of Blelloch et al. [4], which demonstrated structure-insensitive SpMV performance on the Cray C90 vector computer. `__device__ void segmented_reduction(const int lane, const int *rows, float *vals) { // segmented reduction...`
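The `__device__` fragment quoted above is cut off mid-definition; as a point of reference, the operation it performs (summing COO values that share a row index) can be sketched serially. This is an illustrative reconstruction under assumed conventions, not the paper's kernel, which performs the same reduction cooperatively within a warp.

```cpp
#include <cstddef>
#include <vector>

// One output per segment: a segment is a maximal run of entries that
// carry the same row index.
struct Segment {
    int row;
    float sum;
};

// Serial sketch of segmented reduction over COO data sorted by row:
// values sharing a row index are summed into one partial result.
std::vector<Segment> segmented_reduce(const std::vector<int>& rows,
                                      const std::vector<float>& vals) {
    std::vector<Segment> out;
    for (std::size_t i = 0; i < rows.size(); ++i) {
        if (out.empty() || out.back().row != rows[i])
            out.push_back({rows[i], 0.0f});  // a new segment begins here
        out.back().sum += vals[i];           // accumulate within the segment
    }
    return out;
}
```

Each resulting `(row, sum)` pair is a partial dot product for that matrix row, which is why this primitive maps naturally onto COO-format SpMV.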

23 | Concurrent number cruncher: A GPU implementation of a general sparse linear solver
- Buatois, Caumon, et al.
Citation Context: ...dered, further improvements are possible. 6.2 Future Work We have not considered block formats, such as Block CSR or Variable-Block CSR [8], in this paper. Block formats can deliver higher performance [5, 19], particularly for matrices arising in vector-valued problems. A number of the techniques we have applied to scalar formats are compatible with block format extensions. Special handling for the case o...

20 | Solving dense linear systems on graphics processors
- Barrachina, Castillo, et al.
Citation Context: ...amounts of fine-grained parallelism and structuring computations to exhibit sufficient regularity of execution paths and memory access patterns. Recently, Volkov and Demmel [18] and Barrachina et al. [2] have demonstrated how to achieve significant percentages of peak floating point throughput and bandwidth on dense matrix operations. Dense operations are quite regular and are consequently often limi...

10 | Vectorized sparse matrix multiply for compressed row storage format
- D'Azevedo, Fahey, et al.
- 2005
Citation Context: ...x representations. Of these alternative formats, many are derived directly from one of the presented formats (e.g., modified CSR [15]). Others, like CSR with permutation or the jagged diagonal format [6], are natural generalizations of a basic format. Hybrid combinations of structured and unstructured formats, such as DIA and CSR [15], are also useful. `__global__ void spmv_dia_kernel(const int num_...`
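The `spmv_dia_kernel` fragment quoted above is truncated; the computation it implements, y = Ax in DIA (diagonal) format, can be sketched serially. This is a reconstruction under assumed storage conventions (one dense column of `num_rows` values per stored diagonal, zero-padded where the diagonal runs off the matrix); the names are hypothetical, not the paper's.

```cpp
#include <cstddef>
#include <vector>

// Serial sketch of DIA-format SpMV. Each stored diagonal d has an
// offset (0 = main diagonal, negative = below it, positive = above it)
// and num_rows entries in `data`, indexed as data[d * num_rows + row].
void spmv_dia(int num_rows, int num_cols,
              const std::vector<int>& offsets,
              const std::vector<float>& data,
              const std::vector<float>& x,
              std::vector<float>& y) {
    for (int row = 0; row < num_rows; ++row) {
        float dot = 0.0f;
        for (std::size_t d = 0; d < offsets.size(); ++d) {
            int col = row + offsets[d];          // column touched by this diagonal
            if (col >= 0 && col < num_cols)      // skip padded (off-matrix) slots
                dot += data[d * num_rows + row] * x[col];
        }
        y[row] = dot;
    }
}
```

In the CUDA kernel the outer loop over rows becomes one thread per row; because all threads in a warp read consecutive entries of each diagonal, the loads coalesce, which is what makes DIA attractive for structured, grid-based matrices.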

10 | SPARSKIT: A basic tool kit for sparse computations; Version 2
- Saad
- 1994
Citation Context: ...ction represent only a small portion of the complete space of sparse matrix representations. Of these alternative formats, many are derived directly from one of the presented formats (e.g., modified CSR [15]). Others, like CSR with permutation or the jagged diagonal format [6], are natural generalizations of a basic format. Hybrid combinations of structured and unstructured formats, such as DIA and CSR [1...

9 | ITPACK 2.0 User's Guide
- Grimes, Kincaid, et al.
- 1979
Citation Context: ...sity patterns that are inappropriate for DIA, such as those illustrated in Figure 6. 3.2 ELLPACK Format Another storage scheme that is well-suited to vector architectures is the ELLPACK (ELL) format [7]. For an M-by-N matrix with a maximum of K nonzeros per row, the ELLPACK format stores the nonzero values in a dense M-by-K array data, where rows with fewer than K nonzeros are zero-padded. Similar...
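The ELL layout described in this context (a dense M-by-K `data` array plus a matching array of column indices, zero-padded) can be sketched serially as follows. This is an illustrative reconstruction with assumed names; it uses row-major storage for readability, whereas the paper stores the arrays column-major so that GPU loads coalesce.

```cpp
#include <vector>

// Serial sketch of ELLPACK (ELL) SpMV. `data` and `indices` are dense
// num_rows-by-K arrays, indexed as data[row * K + k]. Padded slots hold
// a zero value (and any valid column index), so they contribute nothing
// to the dot product.
void spmv_ell(int num_rows, int K,
              const std::vector<float>& data,
              const std::vector<int>& indices,
              const std::vector<float>& x,
              std::vector<float>& y) {
    for (int row = 0; row < num_rows; ++row) {
        float dot = 0.0f;
        for (int k = 0; k < K; ++k)
            dot += data[row * K + k] * x[indices[row * K + k]];
        y[row] = dot;
    }
}
```

The inner loop always runs K iterations regardless of the actual row length, which is why ELL is efficient when nonzeros per row are roughly uniform and wasteful when a few long rows inflate K.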