Efficient sparse matrix-vector multiplication on cache-based GPUs
Abstract

Cited by 4 (1 self)
Sparse matrix-vector multiplication is an integral part of many scientific algorithms. Several studies have shown that it is a bandwidth-limited operation on current hardware. On cache-based architectures the main factors that influence performance are spatial locality in accessing the matrix, and temporal locality in reusing the elements of the vector. This paper discusses efficient implementations of sparse matrix-vector multiplication on NVIDIA's Fermi architecture, the first to introduce conventional L1 caches to GPUs. We focus on the compressed sparse row (CSR) format for developing general purpose code. We present a parametrised algorithm, show the effects of parameter tuning on performance and introduce a method for determining the near-optimal set of parameters that incurs virtually no overhead. On a set of sparse matrices from the University of Florida Sparse Matrix Collection we show an average speedup of 2.1 times over NVIDIA's CUSPARSE 4.0 library in single precision and 1.4 times in double precision. Many algorithms require repeated evaluation of sparse matrix-vector products with the same matrix, so we introduce a dynamic runtime autotuning system which improves performance by 10-15% in seven iterations. The CSR format is compared to alternative ELLPACK and HYB formats and the cost of conversion is assessed using CUSPARSE. Sparse matrix-vector multiplication performance is also analysed when solving a finite element problem with the conjugate gradient method. We show how problem-specific knowledge can be used to improve performance by up to a factor of two.
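For readers unfamiliar with the CSR format the abstract refers to, the following is a minimal serial sketch (not the paper's GPU implementation) of how CSR stores a sparse matrix as three arrays — `row_ptr`, `col_idx`, and `values` — and how the matrix-vector product traverses them; all names are illustrative:

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x, where A is stored in compressed sparse row (CSR) form.

    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i;
    col_idx[j] and values[j] give the column and value of the j-th nonzero.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of one row are contiguous, giving the spatial locality
        # in matrix accesses that the abstract mentions; accesses to x via
        # col_idx are indirect, which is where vector reuse (temporal
        # locality) matters.
        for j in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[j] * x[col_idx[j]]
        y[i] = acc
    return y


# Example: the 3x3 matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSR form.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(spmv_csr(row_ptr, col_idx, values, [1.0, 1.0, 1.0]))  # → [3.0, 3.0, 9.0]
```

A GPU version of this kernel must additionally choose how to map rows (or partial rows) to threads, which is precisely the kind of parameter the paper's autotuning explores.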