## Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks (2009)


Venue: SPAA

Citations: 13 (1 self)

### BibTeX

```bibtex
@inproceedings{Buluç09parallelsparse,
  author    = {Aydın Buluç and Jeremy T. Fineman and Matteo Frigo and John R. Gilbert and Charles E. Leiserson},
  title     = {Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks},
  booktitle = {SPAA},
  year      = {2009},
  pages     = {233--244}
}
```


### Abstract

This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and Aᵀx to be computed efficiently in parallel, where A is an n × n sparse matrix with nnz ≥ n nonzeros and x is a dense n-vector. Our algorithms use Θ(nnz) work (serial running time) and Θ(√n lg n) span (critical-path length), yielding a parallelism of Θ(nnz/(√n lg n)), which is amply high for virtually any large matrix. The storage requirement for CSB is essentially the same as that for the more standard compressed-sparse-rows (CSR) format, for which computing Ax in parallel is easy but Aᵀx is difficult. Benchmark results indicate that on one processor, the CSB algorithms for Ax and Aᵀx run just as fast as the CSR algorithm for Ax, but the CSB algorithms also scale up linearly with processors until limited by off-chip memory bandwidth.

### Citations

8530 | Introduction to Algorithms - Cormen, Leiserson, et al. - 1990 |

1518 |
Iterative methods for sparse linear systems
- Saad
- 2003
Citation Context: ...tion, however, requires storing 2nnz row and column indices, in addition to the nonzeros. The current standard storage format for sparse matrices in scientific computing, compressed sparse rows (CSR) [32], is more efficient, because it stores only n+nnz indices or pointers. This reduction in storage of CSR compared with the tuple representation tends to result in faster serial algorithms. In the domai...

1496 | The C++ Programming Language
- Stroustrup
- 1997
Citation Context: ...or implicitly converting from the concatenated to interleaved format when making a binary-search comparison. Our preliminary implementation does the latter, using a C++ function object for comparisons [35]. In practice, the overhead of performing these conversions is small, since the number of binary-search steps is small. Performing the actual address calculation and determining the pointers to x and ...

538 |
Direct Methods for Sparse Matrices
- DUFF, ERISMAN, et al.
- 1986
Citation Context: ...[28] on sparse Gaussian elimination does not discuss data structures, but it is likely that Markowitz used such a format as well. CSR and CSC have since become ubiquitous in sparse matrix computation [13, 16, 17, 21, 23, 32]. The following lemma states the well-known bound on space used by the index data in the CSR format (and hence the CSC format as well). By index data, we mean all data other than the nonzeros — that i...

531 | Cilk: An efficient multithreaded runtime system
- Blumofe, Joerg, et al.
- 1995
Citation Context: ...(nr+lgn). The parallelism is therefore Θ(nnz/(nr+lgn)). In many common situations, we have nnz = Θ(n), which we will assume for... [Footnote 1: The literature also uses the terms depth [3] and critical-path length [4].] [Figure 3 pseudocode: CSR_SPMV_T(A, x, y): n ← A.cols; for i ← 0 to n−1 do y[i] ← 0; for i ← 0 to n−1 do for k ← A.row_ptr[i] to A.row_ptr[i+1]−1 do y[A.col_ind[k]] ← y[A.col_ind[k]] + A.val[k] · x[i]]
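The CSR_SPMV_T pseudocode quoted in this snippet translates directly into a short serial routine. The sketch below (in Python, with array names matching the pseudocode; an illustration, not the paper's code) shows how each row i of A scatters contributions into y to form y = Aᵀx.

```python
def csr_spmv_t(row_ptr, col_ind, val, x, n):
    """Serial y = A^T x for an n x n matrix A stored in CSR form.

    Direct transcription of the quoted CSR_SPMV_T pseudocode: each
    nonzero a[i][j] = val[k] contributes val[k] * x[i] to y[j].
    """
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[col_ind[k]] += val[k] * x[i]
    return y

# A = [[1, 2],
#      [0, 3]] in CSR form:
row_ptr = [0, 2, 3]
col_ind = [0, 1, 1]
val = [1.0, 2.0, 3.0]
print(csr_spmv_t(row_ptr, col_ind, val, [1.0, 1.0], 2))  # [1.0, 5.0]
```

Note the scatter into y[col_ind[k]]: the writes are not confined to one output entry per loop iteration, which is exactly why parallelizing this routine over rows (as is easy for Ax) would cause races on y.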

395 | Scheduling multithreaded computations by work stealing
- Blumofe, Leiserson
- 1999
Citation Context: ...different processor, beginning a depth-first execution from some unexecuted parallel branch. Although not all work-stealing schedulers are space efficient, those maintaining the busy-leaves property [5] (e.g., as used in the Cilk work-stealing scheduler [4]) are space efficient. The “busy-leaves” property roughly says that if a procedure has begun (but not completed) executing, then there exists a p...

322 | The implementation of the Cilk-5 multithreaded language
- Frigo, Leiserson, et al.
- 1998
Citation Context: ...ms. Implementation: We parallelized our code using Cilk++ [9], which is a faithful extension of C++ for multicore and shared-memory parallel programming. Cilk++ is based on the earlier MIT Cilk system [20], and it employs dynamic load balancing and provably optimal task scheduling. The CSB code used for the experiments is freely available for academic use at http://gauss.cs.ucsb.edu/~aydin/software.ht...

301 | University of Florida sparse matrix collection: http://www.cise.ufl.edu/research/sparse
- Davis
- 1994
Citation Context: ...ral information on the sparse matrices used in our experiments, ordered by increasing number of nonzeros. The first ten matrices and Cage15 are from the University of Florida sparse matrix collection [12]. Grid3D200 is a 7-point finite difference mesh generated using the Matlab Mesh Partitioning and Graph Separator Toolbox [22]. The RMat23 matrix [26], which models scale-free graphs, is generated by u...

211 |
Reducing the bandwidth of sparse symmetric matrices
- Cuthill, McKee
- 1969
Citation Context: ...regular and thus favorable to cache-line reuse and automatic prefetching. Strategies for reducing the bandwidth of a sparse matrix by permuting its rows and columns have been studied extensively (see [11, 37], for example). Many matrices, however, cannot be permuted to have low bandwidth. For matrices with scattered nonzeros, CSB outperforms CSR, because CSR incurs many cache misses when accessing the x v...

192 | Programming parallel algorithms
- Blelloch
- 1996
Citation Context: ...on. Thus, the total span is Θ(nr+lgn). The parallelism is therefore Θ(nnz/(nr+lgn)). In many common situations, we have nnz = Θ(n), which we will assume for... [Footnote 1: The literature also uses the terms depth [3] and critical-path length [4].] [Figure 3 pseudocode: CSR_SPMV_T(A, x, y): n ← A.cols; for i ← 0 to n−1 do y[i] ← 0; for i ← 0 to n−1 do for k ← A.row_ptr[i] to A.row_ptr[i+1]−1 do y[A.col_ind[k]] ← y[A.col_ind...

139 |
A computer oriented geodetic data base and a new technique in file sequencing
- Morton
- 1966
Citation Context: ...mat uses n lg nnz + nnz lg n bits of index data when β = √n. Thus far, we have not addressed the ordering of elements within each block or the ordering of blocks. Within a block, we use a Z-Morton ordering [29], storing first all those elements in the top-left quadrant, then the top-right, bottom-left, and finally bottom-right quadrants, using the same layout recursively within each quadrant. In fact, these...
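The recursive quadrant ordering described in this snippet is the standard Z-Morton (bit-interleaved) order. As an illustration (not the paper's code), a minimal sketch in Python computes a Morton key by interleaving the row and column bits and uses it to sort block entries:

```python
def morton_key(row, col, bits=16):
    """Interleave the bits of (row, col) into a Z-Morton index.

    Row bits occupy the odd (more significant) positions, so sorting by
    this key visits the top-left quadrant first, then top-right,
    bottom-left, and bottom-right, recursively within each quadrant.
    """
    key = 0
    for b in range(bits):
        key |= ((col >> b) & 1) << (2 * b)       # column bit -> even position
        key |= ((row >> b) & 1) << (2 * b + 1)   # row bit    -> odd position
    return key

# Nonzero coordinates of a 4x4 block, one per quadrant, in arbitrary order:
coords = [(3, 3), (0, 1), (2, 0), (1, 3)]
print(sorted(coords, key=lambda rc: morton_key(*rc)))
# [(0, 1), (1, 3), (2, 0), (3, 3)]
```

The sorted output walks the quadrants exactly as described: (0, 1) lies in the top-left quadrant, (1, 3) in the top-right, (2, 0) in the bottom-left, and (3, 3) in the bottom-right.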

130 | Sparse matrices in MATLAB: Design and Implementation
- Gilbert, Moler, et al.
- 1992
Citation Context: ...[28] on sparse Gaussian elimination does not discuss data structures, but it is likely that Markowitz used such a format as well. CSR and CSC have since become ubiquitous in sparse matrix computation [13, 16, 17, 21, 23, 32]. The following lemma states the well-known bound on space used by the index data in the CSR format (and hence the CSC format as well). By index data, we mean all data other than the nonzeros — that i...

102 | Geometric mesh partitioning: Implementation and experiments
- Gilbert, Miller, et al.
Citation Context: ...ces and Cage15 are from the University of Florida sparse matrix collection [12]. Grid3D200 is a 7-point finite difference mesh generated using the Matlab Mesh Partitioning and Graph Separator Toolbox [22]. The RMat23 matrix [26], which models scale-free graphs, is generated by using repeated Kronecker products [2]. We chose parameters A = 0.7, B = C = D = 0.1 for RMat23 in order to generate skewed mat...

99 |
Direct methods for sparse linear systems
- Davis
- 2006
Citation Context: ...[28] on sparse Gaussian elimination does not discuss data structures, but it is likely that Markowitz used such a format as well. CSR and CSC have since become ubiquitous in sparse matrix computation [13, 16, 17, 21, 23, 32]. The following lemma states the well-known bound on space used by the index data in the CSR format (and hence the CSC format as well). By index data, we mean all data other than the nonzeros — that i...

95 | Oski: A library of automatically tuned sparse matrix kernels
- Demmel, Yelick
Citation Context: ...ns on 13 different matrices from our benchmark test suite. CSB_SpMV and CSB_SpMV_T use compressed sparse blocks to perform Ax and Aᵀx, respectively. CSR_SpMV (Serial) and CSR_SpMV_T (Serial) use OSKI [39] and compressed sparse rows without any matrix-specific optimizations. Star-P (y=Ax) and Star-P (y’=x’A) use Star-P [34], a parallel code based on CSR. The experiments were run on a ccNUMA architectur...

91 | Optimization of sparse matrix-vector multiplication on emerging multicore platforms
- Williams, Oliker, et al.
Citation Context: ...were conducted on Opteron. ...marizes these results. The speedups are relative to the CSB code running on a single processor, which Figure 1 shows is competitive with serial CSR codes. In another study [41] on the same Opteron architecture, multicore-specific parallelization of the CSR code for 4 cores achieved comparable speedup to what we report here, albeit on a slightly different sparse-matrix test ...

87 |
The elimination form of the inverse and its application to linear programming
- Markowitz
- 1957
Citation Context: ...e literature is an unnamed “scheme” presented in Table 1 of the 1967 article [36] by Tinney and Walker, although in 1963 Sato and Tinney [33] allude to what is probably CSR. Markowitz’s seminal paper [28] on sparse Gaussian elimination does not discuss data structures, but it is likely that Markowitz used such a format as well. CSR and CSC have since become ubiquitous in sparse matrix computation [13,...

87 |
Direct solution of sparse network equations by optimally ordered triangular factorization
- Tinney, Walker
- 1967
Citation Context: ...is obtained by storing Aᵀ in CSR format. The earliest written description of CSR that we have been able to divine from the literature is an unnamed “scheme” presented in Table 1 of the 1967 article [36] by Tinney and Walker, although in 1963 Sato and Tinney [33] allude to what is probably CSR. Markowitz’s seminal paper [28] on sparse Gaussian elimination does not discuss data structures, but it is l...

76 |
Recursive blocked algorithms and hybrid data structures for dense matrix library software
- Elmroth, Gustavson, et al.
Citation Context: ...essary for our algorithm to achieve good parallelism within a block. The choice of storing the nonzeros within blocks in a recursive layout is opposite to the common wisdom for storing dense matrices [18]. Although most compilers and architectures favor conventional row/column ordering for optimal prefetching, the choice of layout within the block becomes less significant for sparse blocks as they alr...

75 | Realistic, mathematically tractable graph generation and evolution, using Kronecker multiplication
- Leskovec, Chakrabarti, et al.
- 2005
Citation Context: ...the University of Florida sparse matrix collection [12]. Grid3D200 is a 7-point finite difference mesh generated using the Matlab Mesh Partitioning and Graph Separator Toolbox [22]. The RMat23 matrix [26], which models scale-free graphs, is generated by using repeated Kronecker products [2]. We chose parameters A = 0.7, B = C = D = 0.1 for RMat23 in order to generate skewed matrices. Webbase2001 is a ...

72 | Improving the memory-system performance of sparsematrix vector multiplication
- Toledo
- 1997
Citation Context: ...regular and thus favorable to cache-line reuse and automatic prefetching. Strategies for reducing the bandwidth of a sparse matrix by permuting its rows and columns have been studied extensively (see [11, 37], for example). Many matrices, however, cannot be permuted to have low bandwidth. For matrices with scattered nonzeros, CSB outperforms CSR, because CSR incurs many cache misses when accessing the x v...

68 | A two-dimensional data distribution method for parallel sparse matrix-vector multiplication
- Vastenhouw, Bisseling
Citation Context: ...ation has focused on reducing communication volume in a distributed-memory setting, often by using graph or hypergraph partitioning techniques to find good data distributions for particular matrices ([7, 38], for example). Good partitions generally exist for matrices whose structures arise from numerical discretizations of partial differential equations in two or three spatial dimensions. Our work, by co...

63 | Sparsity: Optimization framework for sparse matrix kernels
- Im, Yelick, et al.
Citation Context: ...j), however, is simpler for a row-major or column-major ordering among blocks. Comparison with other blocking methods: A blocked variant of CSR, called BCSR, has been used for improving register reuse [24]. In BCSR, the sparse matrix is divided into small dense blocks that are stored in consecutive memory locations. Pointers are maintained to the first block on each row of blocks. BCSR storage is c...

29 |
Yale sparse matrix package i: the symmetric codes
- Eisenstat, Gursky, et al.
- 1982

28 | A fine-grain hypergraph model for 2D decomposition of sparse matrices
- Çatalyürek, Aykanat
- 2001
Citation Context: ...ation has focused on reducing communication volume in a distributed-memory setting, often by using graph or hypergraph partitioning techniques to find good data distributions for particular matrices ([7, 38], for example). Good partitions generally exist for matrices whose structures arise from numerical discretizations of partial differential equations in two or three spatial dimensions. Our work, by co...

18 | When cache blocking sparse matrix vector multiply works and why
- Nishtala, Vuduc, et al.
- 2007
Citation Context: ...ocks, whereas CSB stores a dense collection of sparse blocks. We conjecture that it would be advantageous to apply BCSR-style register blocking to each individual sparse block of CSB. Nishtala et al. [30] have proposed a data structure similar to CSB in the context of cache blocking. Our work differs from theirs in two ways. First, CSB is symmetric without favoring rows over columns. Second, our algor...

15 |
webbase components and applications
- Stanford
Citation Context: ...by using repeated Kronecker products [2]. We chose parameters A = 0.7, B = C = D = 0.1 for RMat23 in order to generate skewed matrices. Webbase2001 is a crawl of the World Wide Web from the year 2001 [8]. [Figure 10: CSB_SPMV_T performance on Opteron (smaller matrices); panels: Asic_320k, Sme3Dc, Parabolic_fem, Rucci, Torso, kkt_power.] ...Harpertown and with gcc 4.3 on Nehalem, all with optimization flags -O2 -fno-rtt...

15 |
Accelerating sparse matrix computations via data compression
- Willcock, Lumsdaine
- 2006
Citation Context: ...ral instead of matrix specific, we did not employ speculative low-level optimizations such as software prefetching, pipelining, or matrix-specific optimizations such as index and/or value compression [25, 40], but we believe that CSB and our algorithms should not adversely affect incorporation of these approaches. Choosing the block size β: We investigated different strategies to choose the block size that...

13 |
Reducers and other Cilk++ hyperobjects
- Frigo, Halpern, et al.
- 2009
Citation Context: ...e of Cilk++ to test whether the first call has completed before making the second recursive call, and we allocate the temporary as appropriate. This test may also be implemented using Cilk++ reducers [19]. Sparse-matrix test suite: We conducted experiments on a diverse set of sparse matrices from real applications including circuit simulation, finite-element computations, linear programming, and web-co...

9 |
HPCS scalable synthetic compact applications graph analysis (SSCA2) benchmark v2.2
- Bader, Gilbert, et al.
- 2007
Citation Context: ...ifference mesh generated using the Matlab Mesh Partitioning and Graph Separator Toolbox [22]. The RMat23 matrix [26], which models scale-free graphs, is generated by using repeated Kronecker products [2]. We chose parameters A = 0.7, B = C = D = 0.1 for RMat23 in order to generate skewed matrices. Webbase2001 is a crawl of the World Wide Web from the year 2001 [8]. [Figure 10 panel titles: Asic_320k, Sme3Dc, Parabolic_fem, Rucc...]

9 | On the representation and multiplication of hypersparse matrices
- Buluc, Gilbert
- 2008
Citation Context: ...plicity of presentation, we shall assume that β is an exact power of 2 and that it divides n; relaxing these assumptions is straightforward. Many or most of the individual blocks Aij are hypersparse [6], meaning that the ratio of nonzeros to matrix dimension is asymptotically 0. For example, if β = √n and nnz = cn, the average block has dimension √n and only c nonzeros. The space to store a block ...

9 |
Converting to and from dilated integers
- Raman, Wise
Citation Context: ...our experiments, the overall best performance was achieved when β satisfies the equation ⌈lg √n⌉ ≤ lg β ≤ 3 + ⌈lg √n⌉. Merely setting β to a hard-coded value, however, is not robust. [Footnote 6: Recent research [31] addresses these conversions.] [Figure 7: The effect of block size parameter β on SpMV performan...; axes: MFlops/sec (0–500) vs. block size (32–32768), with curves for p = 2, 4, 8.]
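The dilated-integer conversions referred to in this snippet (spreading a coordinate's bits so two coordinates interleave into a Z-Morton index, and the inverse) can be sketched with the standard shift-and-mask technique. This is an illustrative implementation, not the paper's or [31]'s code:

```python
def dilate(x):
    """Spread the low 16 bits of x so each lands in an even bit position."""
    x &= 0xFFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x

def undilate(x):
    """Inverse of dilate: gather the even bits back into a compact integer."""
    x &= 0x55555555
    x = (x | (x >> 1)) & 0x33333333
    x = (x | (x >> 2)) & 0x0F0F0F0F
    x = (x | (x >> 4)) & 0x00FF00FF
    x = (x | (x >> 8)) & 0x0000FFFF
    return x

def morton(row, col):
    # Two dilated coordinates combine into a Z-Morton index: row bits
    # in odd positions, column bits in even positions.
    return (dilate(row) << 1) | dilate(col)

print(morton(1, 1))                 # 3
print(undilate(morton(5, 9) >> 1))  # 5  (recovers the row index)
```

Keeping coordinates in dilated form avoids a per-bit loop on every comparison, which is the kind of conversion overhead the footnote alludes to.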

8 |
Techniques for exploiting the Sparsity of Network Admittance Matrix
- Sato, Tinney
- 1963
Citation Context: ...ten description of CSR that we have been able to divine from the literature is an unnamed “scheme” presented in Table 1 of the 1967 article [36] by Tinney and Walker, although in 1963 Sato and Tinney [33] allude to what is probably CSR. Markowitz’s seminal paper [28] on sparse Gaussian elimination does not discuss data structures, but it is likely that Markowitz used such a format as well. CSR and CSC...

8 | Sparse matrices in Matlab*P: Design and implementation
- Shah, Gilbert
- 2004
Citation Context: ...rm Ax and Aᵀx, respectively. CSR_SpMV (Serial) and CSR_SpMV_T (Serial) use OSKI [39] and compressed sparse rows without any matrix-specific optimizations. Star-P (y=Ax) and Star-P (y’=x’A) use Star-P [34], a parallel code based on CSR. The experiments were run on a ccNUMA architecture powered by AMD Opteron 8214 (Santa Rosa) processors. ...algorithms are efficient in these measures for matrices with arbi...

7 | Seven at one stroke: Results from a cache-oblivious paradigm for scalable matrix algorithms
- Adams, Wise
- 2006
Citation Context: ...ce. As shown in Section 5, CSB supports ample parallelism for algorithms computing Ax and Aᵀx, even on sparse and irregular matrices. Blocking is also used in dense matrices. The Morton-hybrid layout [1, 27], for example, uses a parameter equivalent to our parameter β for selecting the block size. Whereas in CSB we store elements in a Morton ordering within blocks and an arbitrary ordering among blocks, ...

6 |
Analyzing block locality in Mortonorder and Morton-hybrid matrices
- Lorton, Wise
Citation Context: ...ce. As shown in Section 5, CSB supports ample parallelism for algorithms computing Ax and Aᵀx, even on sparse and irregular matrices. Blocking is also used in dense matrices. The Morton-hybrid layout [1, 27], for example, uses a parameter equivalent to our parameter β for selecting the block size. Whereas in CSB we store elements in a Morton ordering within blocks and an arbitrary ordering among blocks, ...

4 |
Sparse matrix storage formats
- Dongarra
- 2000
Citation Context: ...) format stores the nonzeros (and ideally only the nonzeros) of each matrix row in consecutive memory locations, and it stores an index to the first stored element of each row. In one popular variant [14], CSR maintains one floating-point array val[nnz] and two integer arrays, col_ind[nnz] and row_ptr[n], to store the matrix A = (aij). The row_ptr array stores the index of each row in val. That is, if...
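The CSR layout described in this snippet supports a simple serial y = Ax kernel. A minimal sketch in Python, assuming the common convention of an (n+1)-entry row_ptr whose trailing sentinel row_ptr[n] = nnz marks the end of the last row (the snippet's variant describes row_ptr[n]; the sentinel is an implementation convenience):

```python
def csr_spmv(row_ptr, col_ind, val, x, n):
    """Serial y = A x for an n x n matrix A stored in CSR form.

    val holds the nonzeros row by row, col_ind their column indices,
    and row_ptr[i] the index in val where row i begins.
    """
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += val[k] * x[col_ind[k]]
    return y

# A = [[1, 2],
#      [0, 3]]:
row_ptr = [0, 2, 3]
col_ind = [0, 1, 1]
val = [1.0, 2.0, 3.0]
print(csr_spmv(row_ptr, col_ind, val, [1.0, 1.0], 2))  # [3.0, 3.0]
```

Each output entry y[i] is computed from one contiguous slice of val, which is why parallelizing over rows is easy for Ax in CSR.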

4 | Optimizing sparse matrix-vector multiplication using index and value compression
- Kourtis, Goumas, et al.
- 2008
Citation Context: ...ral instead of matrix specific, we did not employ speculative low-level optimizations such as software prefetching, pipelining, or matrix-specific optimizations such as index and/or value compression [25, 40], but we believe that CSB and our algorithms should not adversely affect incorporation of these approaches. Choosing the block size β: We investigated different strategies to choose the block size that...

3 |
Computer Solution of Sparse Positive Definite Systems
- George, Liu
- 1981

1 |
Matrix-vector and matrix-matrix multiplication
- Dongarra, Koev, et al.
- 2000
Citation Context: ...lg(nr) span without affecting the asymptotic work, thereby achieving parallelism Θ(nnz/lgn) in all cases. Computing Aᵀx serially can be accomplished by simply interchanging the row and column indices [15], yielding the pseudocode shown in Figure 3. The work of procedure CSR_SPMV_T is Θ(nnz), the same as CSR_SPMV. Parallelizing CSR_SPMV_T is not straightforward, however. We shall review several strateg...

1 |
Available from http://www.cilk.com
- Arts, Inc, et al.
- 2009
Citation Context: ...chmark matrices we used to test the algorithms, the machines on which we ran our tests, and the other codes with which we compared our algorithms. Implementation: We parallelized our code using Cilk++ [9], which is a faithful extension of C++ for multicore and shared-memory parallel programming. Cilk++ is based on the earlier MIT Cilk system [20], and it employs dynamic load balancing and provably opt...