## Optimizing the performance of sparse matrix-vector multiplication (2000)

Citations: | 56 - 2 self |

### BibTeX

@TECHREPORT{Im00optimizingthe,

author = {Eun-jin Im},

title = {Optimizing the performance of sparse matrix-vector multiplication},

institution = {},

year = {2000}

}

### Years of Citing Articles

### OpenURL

### Abstract

Copyright 2000 by Eun-Jin Im

### Citations

3992 |
Computer Architecture: A Quantitative Approach
- Hennessy, Patterson
- 2007
(Show Context)
Citation Context ... cache is not under complete control of the software; hardware controls the selection of data values in each level of cache according to its policies on replacement, associativity, and write strategy =-=[34]-=-. More importantly, the caches can hold thousands of values, while registers only hold tens of values, so it is not practical to fill in a rectangular block of a sparse matrix so that the correspondin... |

960 |
Advanced Compiler Design and Implementation
- MUCHNICK
- 1997
(Show Context)
Citation Context ... the data representation as well. Tiling is often combined with other loop transformations such as unrolling, loop interchange, and loop skew. These are described in an advanced compiler text such as =-=[51]-=-, although we refer to the reader tos94 several other papers on more details and recent work on tiling (and a closely related notion called shackling) [18, 50, 16, 12, 44, 71, 41, 42]. This work on co... |

741 |
A set of Level 3 Basic Linear Algebra Subprograms
- Dongarra, Croz, et al.
- 1990
(Show Context)
Citation Context ...al models rather than the analytical ones used in this related work. 7.1.2 Hand-Optimized Libraries A set of basic matrix-vector operations for dense and banded matrices has been standardized in BLAS =-=[45, 24, 23]-=-, and its reference implementation is available from Netlib [52]. These basic routines are hand-coded by programmers that work for the machine vendors, who are well-trained in both the details of the ... |

535 |
Basic linear algebra subprograms for fortran usage
- Lawson, Hanson, et al.
- 1979
(Show Context)
Citation Context ...al models rather than the analytical ones used in this related work. 7.1.2 Hand-Optimized Libraries A set of basic matrix-vector operations for dense and banded matrices has been standardized in BLAS =-=[45, 24, 23]-=-, and its reference implementation is available from Netlib [52]. These basic routines are hand-coded by programmers that work for the machine vendors, who are well-trained in both the details of the ... |

510 | M.: The cache performance and optimizations of blocked algorithms
- Lam, Rothberg, et al.
- 1991
(Show Context)
Citation Context ... by computing on a small block of the array befores8 moving onto the next. This optimization is commonly known as blocking or tiling and can be performed either by an optimizing compiler or a library =-=[44, 71, 26, 50, 19, 14, 15, 17, 22, 1]-=-. Even for dense matrix-vector multiplication, the potential for reuse is relatively low, because there are only two floating point operations for every element of the matrix. Figure 2.1 shows the bas... |

451 | FFTW: An adaptive software architecture for the FFT
- Frigo, Johnson
- 1998
(Show Context)
Citation Context ...Automatically Tuned Linear Algebra Software) [70] at ORNL. Both of these generate optimized BLAS 3 routines for a given dense matrix size and machine. FFTW (the Fastest Fourier Transform in the West) =-=[28, 27]-=- from MIT generates optimized FFT routines. PHiPAC The PHiPAC project uses search over a set of possible optimizations to produce a matrix-matrix multiplication routine that is highly tuned to a parti... |

445 | An extended set of fortran basic linear algebra subroutines
- Dongarra, Croz, et al.
- 1988
(Show Context)
Citation Context ...o of floating point operations to memory operations is low. This is true for dense as well as sparse matrices. Using terminologys2 from the well-known Basic Linear Algebra Subprograms (BLAS) standard =-=[22]-=-, matrixvector operations make up the BLAS-2 class, while matrix-matrix operations make up the BLAS-3 class. Dense BLAS-2 operations perform O(n2) operations on O(n2) data (n \Thetasn square matrices)... |

371 | J.J.: Automatically tuned linear algebra software
- Whaley, Dongarra
- 1998
(Show Context)
Citation Context ...sed for small dense blocks, for example, which appear within the register-blocked sparse code. In Sparsity we currently produce a single machine-independent implementation of this kernel. ATLAS ATLAS =-=[69]-=- stands for Automatic Tuning of Linear Algebra Software. ATLAS's approach differs from PHiPAC as follows: all machine dependent code (and searching) iss97 done for square matrices assumed to reside in... |

352 |
ScaLAPACK Users’ Guide
- Blackford, Choi, et al.
- 1997
(Show Context)
Citation Context ...ques used for high performance. The BLAS approach has successfully been used to produce application-level libraries for dense linear algebra, such as LAPACK [1] (Linear Algebra PACKage) and ScaLAPACK =-=[10]-=- (Scalable Linear Algebra PACKage), which use these BLAS routines as building blocks.s95 There are two problems with trying this approach for sparse matrices. First, even for the dense case the effort... |

300 | The University of Florida sparse matrix collection
- Davis, Hu
(Show Context)
Citation Context ...numbered from 1 to 46 in the first column, and the matrices will be referenced by this line number throughout this thesis. Most of the sparse matrices are collected from Tim Davis's matrix collection =-=[20]-=- at the University of Florida. For comparison, we have also included the set of sparse matrices used in Xiaoye Li's thesis [47] in our set. In her thesis, the matrices are used for hand-coded memory h... |

297 | Improving data locality with loop transformations
- McKinley, Carr, et al.
- 1996
(Show Context)
Citation Context ... by computing on a small block of the array befores8 moving onto the next. This optimization is commonly known as blocking or tiling and can be performed either by an optimizing compiler or a library =-=[44, 71, 26, 50, 19, 14, 15, 17, 22, 1]-=-. Even for dense matrix-vector multiplication, the potential for reuse is relatively low, because there are only two floating point operations for every element of the matrix. Figure 2.1 shows the bas... |

256 | SPARSKIT: A basic tool kit for sparse matrix computations
- SAAD
- 1990
(Show Context)
Citation Context ...mory bandwidth reduction, and found some benefits on symmetric multiprocessors [38], but very little effect on uniprocessors.s100 Sparse Matrix Packages for Multiprocessors Yousef Saad built SPARSKIT =-=[61]-=- and, with Andrei Malevsky, PSPARSLIB [62]. SPARSKIT is a collection of FORTRAN subroutines for sparse matrix computations. It includes format conversion routines among various sparse matrix formats, ... |

202 | K.S.M.: Tile size selection using cache organization and data layout
- Coleman, Kinley
- 1995
(Show Context)
Citation Context ... by computing on a small block of the array befores8 moving onto the next. This optimization is commonly known as blocking or tiling and can be performed either by an optimizing compiler or a library =-=[44, 71, 26, 50, 19, 14, 15, 17, 22, 1]-=-. Even for dense matrix-vector multiplication, the potential for reuse is relatively low, because there are only two floating point operations for every element of the matrix. Figure 2.1 shows the bas... |

156 |
Scientific Computing An Introductory Survey
- Heath
- 2002
(Show Context)
Citation Context ...idx array by a factor of r \Thetasc and turns the loads of indices into simple arithmetic operations, such as integer addition. We have observed that applications such as Finite Element Methods (FEM) =-=[32]-=- naturally store small dense blocks in sparse matrix when they construct the matrix. However, most matrices do not have uniform block structure throughout, so some zero values are filled in when const... |

154 | A fast Fourier transform compiler
- Frigo
- 1999
(Show Context)
Citation Context ...Automatically Tuned Linear Algebra Software) [70] at ORNL. Both of these generate optimized BLAS 3 routines for a given dense matrix size and machine. FFTW (the Fastest Fourier Transform in the West) =-=[28, 27]-=- from MIT generates optimized FFT routines. PHiPAC The PHiPAC project uses search over a set of possible optimizations to produce a matrix-matrix multiplication routine that is highly tuned to a parti... |

147 | Motion segmentation and tracking using normalized cuts
- Shi, Malik
- 1998
(Show Context)
Citation Context ...zos [29, 30, 31, 49, 3] or block Arnoldi [64, 63, 46, 3]. Another application is image segmentation in videos, where a set of vectors is used as the starting guess for a subsequent frame in the video =-=[66]-=-. Multiplying a sparse matrix by a set of vectors has much more potential for memory hierarchy optimizations than the matrix-vector case. Matrix-vector multiplication accesses each matrix element only... |

143 | Data-centric multi-level blocking
- Kodukula, Ahmed, et al.
- 1997
(Show Context)
Citation Context ...scribed in an advanced compiler text such as [51], although we refer to the reader tos94 several other papers on more details and recent work on tiling (and a closely related notion called shackling) =-=[18, 50, 16, 12, 44, 71, 41, 42]-=-. This work on compiler-based memory hierarchy optimizations has produced tools that are very sophisticated in performing analysis of the code to determine legal transformations, and apply certain opt... |

136 |
Improving Locality and Parallelism in Nested Loops
- Wolf
- 1992
(Show Context)
Citation Context |

98 |
PETSc 2.0 users manual
- Balay, Gropp, et al.
- 1999
(Show Context)
Citation Context ...e by locating cliques and identical nodes and by reordering the matrix columns and rows. By identifying large blocks, it is able to use a higher level dense BLAS routine for better performance. PETSc =-=[5]-=-, a toolkit for scientific computations, uses this BlockSolve95 library. Spark98 at CMU [53] provides various versions of sparse matrix-vector multiplication code for shared memory and message passing... |

98 | Improving the ratio of memory operations to floating-point operations in loops
- Carr, Kennedy
- 1994
(Show Context)
Citation Context ...scribed in an advanced compiler text such as [51], although we refer to the reader tos94 several other papers on more details and recent work on tiling (and a closely related notion called shackling) =-=[18, 50, 16, 12, 44, 71, 41, 42]-=-. This work on compiler-based memory hierarchy optimizations has produced tools that are very sophisticated in performing analysis of the code to determine legal transformations, and apply certain opt... |

93 |
Vorst. Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide
- Bai, Demmel, et al.
- 2000
(Show Context)
Citation Context ...est[0]; d1 = dest[1]; for (j=row.start[i]; j!row.start[i+1]; j++,col.idx++,value+=4) - d0 += value[0] * src[*col.idx+0]; d0 += value[1] * src[*col.idx+1]; d1 += value[2] * src[*col.idx+0]; d1 += value=-=[3]-=- * src[*col.idx+1]; "" dest[0] = d0; dest[1] = d1; "" "" Figure 3.3: Loop-unrolled, register-blocked code: Row start, col idx, and value arrays represent the data structure for a register-blocked spar... |

85 | A shifted block Lanczos algorithm for solving sparse symmetric generalized eigenproblems
- Grimes, Lewis, et al.
- 1994
(Show Context)
Citation Context ... is important in several applications. In particular, it occurs in practice when there are multiple right-hand sides in an iterative solver, or in blocked eigenvalue algorithms, such as block Lanczos =-=[29, 30, 31, 49, 3]-=- or block Arnoldi [64, 63, 46, 3]. Another application is image segmentation in videos, where a set of vectors is used as the starting guess for a subsequent frame in the video [66]. Multiplying a spa... |

80 |
The block Lanczos methods for computing eigenvalues
- Golub, Underwood
- 1977
(Show Context)
Citation Context ... is important in several applications. In particular, it occurs in practice when there are multiple right-hand sides in an iterative solver, or in blocked eigenvalue algorithms, such as block Lanczos =-=[29, 30, 31, 49, 3]-=- or block Arnoldi [64, 63, 46, 3]. Another application is image segmentation in videos, where a set of vectors is used as the starting guess for a subsequent frame in the video [66]. Multiplying a spa... |

71 | LAPACK Users’ Guide, Third Edition - Anderson, Bai, et al. - 1999 |

60 |
A storage efficient WY representation for products of householder transformations
- Schreiber, Loan
- 1989
(Show Context)
Citation Context ...code that takes a different approach, is described below.) Even for dense matrices, the approach works only on low level kernels; algorithms such as LU factorization with pivoting or QR decomposition =-=[65, 9]-=- are still open problems. The second problem is that, even for dense matrices, the optimization space is large and complex, making it difficult to find the best implementation. For example, given a se... |

46 | A sparse matrix library in C++ for high performance architectures
- Dongarra, Lumsdaine, et al.
- 1994
(Show Context)
Citation Context ...es a generic numerical algorithm library, including MV++ (C++ matrix/vector classes) [56], SparseLib++ (sparse matrix computation routines on these classes) [60], and IML++ (Iterative Method Library) =-=[25]-=-. The latest work, TNT (Template Numerical Toolkit) [57], is a successor to all of the others and is still under construction. TNT provides an integrated collection of generic matrix/vector classes ba... |

44 |
Comparative analysis of the Cuthill-Mckee and the reverse Cuthill-Mckee ordering algorithms for sparse matrices
- Liu, Sherman
- 1976
(Show Context)
Citation Context ... a random ordering, sometimes better or worse compared to the ordering given in the original matrix. He also observed that nested-bisection ordering generates more blocks, while reverse Cuthill-McKee =-=[48]-=- ordering decreases the number of blocks found. While Toledo's work is similar to our work on register blocking, it is a much more narrow study of 13 matrices on one machine (IBM RS/6000) and does not... |

38 | Using Strassen's algorithm to accelerate the solution of linear systems
- Bailey, Lee, et al.
- 1991
(Show Context)
Citation Context ... possible optimizations to produce a matrix-matrix multiplication routine that is highly tuned to a particular machine. Even excluding algorithmic variants of multiplication such as Strassen's method =-=[4, 36, 67]-=-, this routine has a large design space with many parameters such a tile sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction ... |

38 | Tuning Strassen’s matrix multiplication for Memory E ciency
- Thottlehodi, Chatterjee, et al.
- 1998
(Show Context)
Citation Context ... possible optimizations to produce a matrix-matrix multiplication routine that is highly tuned to a particular machine. Even excluding algorithmic variants of multiplication such as Strassen's method =-=[4, 36, 67]-=-, this routine has a large design space with many parameters such a tile sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction ... |

35 |
Automatic data structure selection and transformation for sparse matrix computations
- Bik, Wijshoff
- 1996
(Show Context)
Citation Context ...enough that modern compiler analyses can analyze the code quite effectively. Thes101 "compiler" is a special-purpose compiler that produces code appropriate for a sparse matrix. A. J. C. Bik's thesis =-=[6, 7]-=- describes a sparse compiler that takes a dense matrix routine as input, along with a sparse matrix, and generates a sparse version of the code along with an automatically selected sparse matrix data ... |

35 | Sparse Gaussian elimination on high performance computers
- Li
- 1996
(Show Context)
Citation Context ... of the sparse matrices are collected from Tim Davis's matrix collection [20] at the University of Florida. For comparison, we have also included the set of sparse matrices used in Xiaoye Li's thesis =-=[47]-=- in our set. In her thesis, the matrices are used for hand-coded memory hierarchy optimizations for LU factorization. The first matrix is a 1000 \Thetas1000 dense matrix, which was also included in he... |

29 | Turnbull T., “Implementation of Strassen’s Algorithm for Matrix Multiplication
- Huss-Lederman, Jacobson, et al.
- 1996
(Show Context)
Citation Context ... possible optimizations to produce a matrix-matrix multiplication routine that is highly tuned to a particular machine. Even excluding algorithmic variants of multiplication such as Strassen's method =-=[4, 36, 67]-=-, this routine has a large design space with many parameters such a tile sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction ... |

29 |
hMETIS: A Hypergraph Partitioning Package
- Karypis, Kumar
- 1998
(Show Context)
Citation Context ...a matrix, each row (column) becomes one hyper-edge, and one hyper-edge connects multiple nodes which are nonzero columns (rows) in the row (column). We used the hMETIS hyper-graph partitioner package =-=[40]-=-, developed at the University of Minnesota, to bisect a hyper-graph. This package implements a multilevel algorithm [33] for graph partitioning. In figure 4.13 we show the measurement of the ratio for... |

28 |
P.E.: BlockSolve95 users manual: Scalable library software for the parallel solution of sparse linear systems
- Jones, Plassmann
- 1995
(Show Context)
Citation Context ...ory. This library allows a user to apply standard distributed memory techniques such as locally numbered submatrices and ghost variables, using the notion of a global distributed matrix. BlockSolve95 =-=[55]-=-, developed at Argonne National Laboratory, is a scalable parallel software library primarily intended for the solution of sparse linear systems that arise from physical models, especially problems in... |

27 |
Compiler Support for Sparse Matrix Computations
- Bik
- 1996
(Show Context)
Citation Context ...enough that modern compiler analyses can analyze the code quite effectively. Thes101 "compiler" is a special-purpose compiler that produces code appropriate for a sparse matrix. A. J. C. Bik's thesis =-=[6, 7]-=- describes a sparse compiler that takes a dense matrix routine as input, along with a sparse matrix, and generates a sparse version of the code along with an automatically selected sparse matrix data ... |

27 | Optimizing sparse matrix vector multiplication on smps
- Im, Yelick
- 1999
(Show Context)
Citation Context ...e the performance on distributed-memory and shared-memory systems. In previous work, we also considered reordering for memory bandwidth reduction, and found some benefits on symmetric multiprocessors =-=[38]-=-, but very little effect on uniprocessors.s100 Sparse Matrix Packages for Multiprocessors Yousef Saad built SPARSKIT [61] and, with Andrei Malevsky, PSPARSLIB [62]. SPARSKIT is a collection of FORTRAN... |

26 | Compiler blockability of dense matrix factorizations
- Carr, Lehoucq
- 1997
(Show Context)
Citation Context |

23 |
A block Arnoldi-Chebyshev method for computing the leading eigenpairs of large sparse unsymmetric matrices
- Sadkane
- 1993
(Show Context)
Citation Context .... In particular, it occurs in practice when there are multiple right-hand sides in an iterative solver, or in blocked eigenvalue algorithms, such as block Lanczos [29, 30, 31, 49, 3] or block Arnoldi =-=[64, 63, 46, 3]-=-. Another application is image segmentation in videos, where a set of vectors is used as the starting guess for a subsequent frame in the video [66]. Multiplying a sparse matrix by a set of vectors ha... |

21 |
I/O complexity: the red blue pebble game
- Hong, Kung
- 1981
(Show Context)
Citation Context ...r to pick for performance, whereas in FFTW a cache-oblivious algorithm is used, based on a recursive data structure that asymptotically minimizes the number of cache misses, according to a theorem in =-=[35]-=-. The above projects all aimed to optimize the regular operations on arrays. Aside from machine-specific parameters, the only problem-specific parameter is the size of the matrix or vector. However, i... |

16 |
Block Arnoldi and Davidson methods for unsymmetric large eigenvalue problems
- SADKANE
(Show Context)
Citation Context .... In particular, it occurs in practice when there are multiple right-hand sides in an iterative solver, or in blocked eigenvalue algorithms, such as block Lanczos [29, 30, 31, 49, 3] or block Arnoldi =-=[64, 63, 46, 3]-=-. Another application is image segmentation in videos, where a set of vectors is used as the starting guess for a subsequent frame in the video [66]. Multiplying a sparse matrix by a set of vectors ha... |

15 |
Portable High Performance GEMM-based Level 3
- Kagstrom, Ling, et al.
- 1993
(Show Context)
Citation Context ... practice. Since its original release, ATLAS has been extended to support nearly all the BLAS, not just matrix-multiplication. This is done for other Level 3 BLAS using GEMMbased BLAS as described in =-=[39]-=-. In addition, some LAPACK routines (LU, Cholesky and QR decomposition) are now directly supported by the search procedure. In other words, the search procedure attempts to maximize not the speed of m... |

15 | An experimental evaluation of tiling and shackling for memory hierarchy management
- Kodukula, Pingali, et al.
- 1999
(Show Context)
Citation Context ...scribed in an advanced compiler text such as [51], although we refer to the reader tos94 several other papers on more details and recent work on tiling (and a closely related notion called shackling) =-=[18, 50, 16, 12, 44, 71, 41, 42]-=-. This work on compiler-based memory hierarchy optimizations has produced tools that are very sophisticated in performing analysis of the code to determine legal transformations, and apply certain opt... |

13 | Static tiling for heterogeneous computing platforms
- Boulet, Dongarra, et al.
- 1999
(Show Context)
Citation Context |

9 |
Adaptive Blocking in the QR Factorization
- Bischof
- 1989
(Show Context)
Citation Context ...code that takes a different approach, is described below.) Even for dense matrices, the approach works only on low level kernels; algorithms such as LU factorization with pivoting or QR decomposition =-=[65, 9]-=- are still open problems. The second problem is that, even for dense matrices, the optimization space is large and complex, making it difficult to find the best implementation. For example, given a se... |

9 |
Aztec user’s guide: Version 1.1
- Hutchinson, Shadid, et al.
- 1995
(Show Context)
Citation Context .... While communication optimizations are a form of memory hierarchy optimization for parallel machines, the issues are quite different because the messages are entirely under programmer control. Aztec =-=[37]-=- is a parallel iterative library for solving linear systems, developed at Sandia National Laboratory. This library allows a user to apply standard distributed memory techniques such as locally numbere... |

5 |
Documentation for the Basic Linear Algebra Subprograms
- Forum
- 1999
(Show Context)
Citation Context ...ero element in the matrix. The formats of sparse matrices are diverse, and they are discussed in more detail in chapter 7. Among the suggested nine formats in the BLAS Technical Forum standardization =-=[11]-=- we focus on compressed sparse row (CSR) format in our study because it is general and relatively efficient. As shown in the example in figure 2.2, in CSR format, the nonzero elements are stored in ro... |

5 |
Implementation of an implicitly restarted block Arnoldi method
- Lehoucq, Maschhoff
- 1997
(Show Context)
Citation Context .... In particular, it occurs in practice when there are multiple right-hand sides in an iterative solver, or in blocked eigenvalue algorithms, such as block Lanczos [29, 30, 31, 49, 3] or block Arnoldi =-=[64, 63, 46, 3]-=-. Another application is image segmentation in videos, where a set of vectors is used as the starting guess for a subsequent frame in the video [66]. Multiplying a sparse matrix by a set of vectors ha... |

3 |
BLZPACK: Decsription and User’s guide
- Marques
- 1995
(Show Context)
Citation Context ... is important in several applications. In particular, it occurs in practice when there are multiple right-hand sides in an iterative solver, or in blocked eigenvalue algorithms, such as block Lanczos =-=[29, 30, 31, 49, 3]-=- or block Arnoldi [64, 63, 46, 3]. Another application is image segmentation in videos, where a set of vectors is used as the starting guess for a subsequent frame in the video [66]. Multiplying a spa... |

3 | Ordering unstructured meshes for sparse matrix computations on leading parallel systems
- Oliker, Li, et al.
- 2000
(Show Context)
Citation Context ... work is similar to our work on register blocking, it is a much more narrow study of 13 matrices on one machine (IBM RS/6000) and does not attempt to automate the optimization process. Oliker et. al. =-=[54]-=- also performed research on the efficiency of sparse matrix-vector multiplication using various ordering/partitioning algorithms on parallel systems. They compared the following three algorithms again... |

2 |
Improving data locality for caches
- Essenghir
- 1993
(Show Context)
Citation Context |