## Optimization of Sparse Matrix-vector Multiplication on Emerging Multicore Platforms (2007)

Venue: Proc. SC2007: High Performance Computing, Networking, and Storage Conference

Citations: 104 (23 self)

### BibTeX

```
@INPROCEEDINGS{Williams07optimizationof,
  author    = {Samuel Williams and Leonid Oliker and Richard Vuduc and John Shalf and Katherine Yelick and James Demmel},
  title     = {Optimization of Sparse Matrix-vector Multiplication on Emerging Multicore Platforms},
  booktitle = {Proc. SC2007: High Performance Computing, Networking, and Storage Conference},
  year      = {2007},
  pages     = {10--16}
}
```

### Abstract

We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) – one of the most heavily used kernels in scientific computing – across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD dual-core and Intel quad-core designs, the heterogeneous STI Cell, as well as the first scientific study of the highly multithreaded Sun Niagara2. We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural tradeoffs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.

### Citations

4273 |
Computer Architecture: A Quantitative Approach
- Hennessy, Patterson
- 1996
Citation Context: ...ory-bound numerical algorithms. 1 Introduction Industry has moved to chip multiprocessor (CMP) system design in order to better manage trade-offs among performance, energy efficiency, and reliability [5, 10]. However, the diversity of CMP solutions raises difficult questions about how different designs compare, for which applications each design is best-suited, and how to implement software to best utili...

314 |
Design challenges of technology scaling
- Borkar
- 1999
Citation Context: ...ory-bound numerical algorithms. 1 Introduction Industry has moved to chip multiprocessor (CMP) system design in order to better manage trade-offs among performance, energy efficiency, and reliability [5, 10]. However, the diversity of CMP solutions raises difficult questions about how different designs compare, for which applications each design is best-suited, and how to implement software to best utili...

172 |
Efficient management of parallelism in object oriented numerical software libraries
- Balay, Gropp, et al.
- 1997
Citation Context: ...ctors, variable block and diagonal structures, and locality-enhancing reordering. OSKI is a serial library, but is being integrated into higher-level parallel linear solver libraries, including PETSc [3] and Trilinos [29]. This paper compares our multicore implementations against a generic “off-the-shelf” approach in which we take the MPI-based distributed memory SpMV in PETSc 2.3.0 and replace the s...

106 | OSKI: A library of automatically tuned sparse matrix kernels
- Vuduc, Demmel, et al.
Citation Context: ...ategies — explicitly programmed and tuned for these multicore environments — attain significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations [24]. This work is a substantial expansion of our previous work [30]. We include new optimizations, new architectures (Barcelona, Victoria Falls, and the Cell PPE) as well as the first 128 thread trials o...

105 |
A graph-theoretic study of the numerical solution of sparse positive definite systems of linear equations
- Rose
- 1972
Citation Context: ...l and temporal locality by rectangular cache blocking [11], diagonal cache blocking [20], and reordering the rows and columns of the matrix. Besides classical bandwidth-reducing reordering techniques [18], recent work has proposed sophisticated 2-D partitioning schemes with theoretical guarantees on communication volume [22], and traveling salesman-based reordering to create dense block substructure [...

74 | Improving the memory-system performance of sparse-matrix vector multiplication
- Toledo
- 1997
Citation Context: ...tuning of the kind proposed in this paper, even applied to just a CSR SpMV, are also possible. Recent work on low-level tuning of SpMV by unroll-and-jam [15], software pipelining [6], and prefetching [21] influence our work. 3 Experimental Testbed Our work examines several leading CMP system designs in the context of the demanding SpMV algorithm. In this section, we briefly describe the evaluated sy...

71 | A Two-Dimensional Data Distribution Method for Parallel Sparse Matrix-Vector Multiplication
- Vastenhouw, Bisseling
- 2005
Citation Context: ...mns of the matrix. Besides classical bandwidth-reducing reordering techniques [18], recent work has proposed sophisticated 2-D partitioning schemes with theoretical guarantees on communication volume [22], and traveling salesman-based reordering to create dense block substructure [17]. Higher-level kernels and solvers provide opportunities to reuse the matrix itself, in contrast to non-symmetric SpMV...

65 | Synergistic processing in cell’s multicore architecture
- Gschwind, Hofstee, et al.
Citation Context: ... core (Power Processing Element / PPE) to handle OS and control functions, combined with up to eight simpler SIMD cores (Synergistic Processing Elements / SPEs) for the computationally intensive work [8, 9]. The SPEs differ considerably from conventional core architectures due to their use of a disjoint software-controlled local memory instead of the conventional hardware-managed cache hierarchy employe...

65 | Sparsity: Optimization framework for sparse matrix kernels
- Im, Yelick, et al.
Citation Context: ...to the complex behavior of performance on modern machines. OSKI hides the complexity of making this choice, using techniques extensively documented in the SPARSITY sparse-kernel auto-tuning framework [11]. These techniques include register- and cache-level blocking, exploiting symmetry, multiple vectors, variable block and diagonal structures, and locality-enhancing reordering. OSKI is a serial librar...

60 | Automatic performance tuning of sparse matrix kernels
- Vuduc
- 2003
Citation Context: ... index overhead. These patterns include blocks [11], variable blocking [6] (mixtures of differently-sized blocks), diagonals, which may be especially well-suited to machines with SIMD and vector units [23, 27], dense subtriangles arising in sparse triangular solve [26], symmetry [14], general pattern compression [28], value compression [13], and combinations. Others have considered improving spatial and te...

47 | Improving the Performance of Sparse Matrix-Vector Multiplication
- Pinar, Heath
- 1999
Citation Context: ...], recent work has proposed sophisticated 2-D partitioning schemes with theoretical guarantees on communication volume [22], and traveling salesman-based reordering to create dense block substructure [17]. Higher-level kernels and solvers provide opportunities to reuse the matrix itself, in contrast to non-symmetric SpMV. Such kernels include block kernels and solvers that multiply the matrix by multi...

44 | Characterizing the behavior of sparse algorithms on caches - Temam, Jalby - 1992 |

38 |
The landscape of parallel computing research: A view from Berkeley
- Asanovic, et al.
- 2006
Citation Context: ...ing memory bandwidth requirements should be more effective than improving single thread performance, since it is easier and cheaper to double the number of cores rather than double the DRAM bandwidth [1]. In a standard coordinate approach, 16 bytes of storage are required for each matrix nonzero: 8 bytes for the double-precision nonzero, plus 4 bytes each for row and column coordinates. Our data stru...

28 | Segmented operations for sparse matrix computation on vector multiprocessors
- Blelloch, Heroux, et al.
- 1993
Citation Context: ...ut is of little value on the out-of-order superscalars. Finally, the code can be further optimized using a branchless implementation, which is in effect a segmented scan of vector-length equal to one [4]. On Cell, we implement this technique using the select bits instruction. A branchless BCOO implementation simply requires resetting the running sum (selecting between the last sum or next value of Y ...

25 | On improving the performance of sparse matrix-vector multiplication
- White, Sadayappan
- 1997
Citation Context: ... index overhead. These patterns include blocks [11], variable blocking [6] (mixtures of differently-sized blocks), diagonals, which may be especially well-suited to machines with SIMD and vector units [23, 27], dense subtriangles arising in sparse triangular solve [26], symmetry [14], general pattern compression [28], value compression [13], and combinations. Others have considered improving spatial and te...

19 | Performance models for evaluation and automatic tuning of symmetric sparse matrixvector multiply
- Lee, Vuduc, et al.
- 2004
Citation Context: ...tures of differently-sized blocks), diagonals, which may be especially well-suited to machines with SIMD and vector units [23, 27], dense subtriangles arising in sparse triangular solve [26], symmetry [14], general pattern compression [28], value compression [13], and combinations. Others have considered improving spatial and temporal locality by rectangular cache blocking [11], diagonal cache blocking...

19 | When cache blocking sparse matrix vector multiply works and why
- Nishtala, Vuduc, et al.
Citation Context: ...source and destination vectors in cache, potentially causing numerous capacity misses. Prior work shows that explicitly cache blocking the nonzeros into tiles (≈ 1K × 1K) can improve SpMV performance [11, 16]. We extend this idea by accounting for cache utilization, rather than only spanning a fixed number of columns. Specifically, we first quantify the number of cache lines available for blocking, and...

19 | Sparse tiling for stationary iterative methods
- Strout, Carter, et al.
Citation Context: ...es to reuse the matrix itself, in contrast to non-symmetric SpMV. Such kernels include block kernels and solvers that multiply the matrix by multiple dense vectors [11], AᵀAx [25], and matrix powers [19, 23]. Better low-level tuning of the kind proposed in this paper, even applied to just a CSR SpMV, are also possible. Recent work on low-level tuning of SpMV by unroll-and-jam [15], software pipelining [6...

18 | Towards a fast parallel sparse matrix-vector multiplication
- Geus, Röllin
Citation Context: ...xtensive. A number evaluate techniques that compress the data structure by recognizing patterns in order to eliminate the integer index overhead. These patterns include blocks [11], variable blocking [6] (mixtures of differently-sized blocks), diagonals, which may be especially well-suited to machines with SIMD and vector units [23, 27], dense subtriangles arising in sparse triangular solve [26], symm...

18 |
Accelerating sparse matrix computations via data compression
- WILLCOCK, LUMSDAINE
Citation Context: ..., diagonals, which may be especially well-suited to machines with SIMD and vector units [23, 27], dense subtriangles arising in sparse triangular solve [26], symmetry [14], general pattern compression [28], value compression [13], and combinations. Others have considered improving spatial and temporal locality by rectangular cache blocking [11], diagonal cache blocking [20], and reordering the rows and...

16 | Scientific computing kernels on the Cell processor
- Williams
Citation Context: ...ock (not just each thread) separately. As Cell is a SIMD architecture, it is expensive to implement 1x1 blocking, especially in double precision. In addition, since Cell streams nonzeros into buffers [31], it is far easier to implement BCOO than BCSR. Thus to maximize our productivity, for each architecture we specify the minimum and maximum block size, as well as a mask that enables each format. O...

12 | Chip multiprocessing and the Cell Broadband Engine - Gschwind - 2006 |

10 | Automatic performance tuning and analysis of sparse triangular solve
- Vuduc, Kamil, et al.
- 2002
Citation Context: ...ocking [6] (mixtures of differently-sized blocks), diagonals, which may be especially well-suited to machines with SIMD and vector units [23, 27], dense subtriangles arising in sparse triangular solve [26], symmetry [14], general pattern compression [28], value compression [13], and combinations. Others have considered improving spatial and temporal locality by rectangular cache blocking [11], diagonal...

9 |
pOSKI: An extensible autotuning framework to perform optimized SpMVs on multicore architectures
- Jain
- 2008
Citation Context: ... that matrix and platform dependent tuning of SpMV for multicore is at least as important as suggested in prior work [24]. Future work will include the integration of these optimizations into OSKI [12], as well as continued exploration of optimizations for SpMV and other important numerical kernels on the latest multicore systems. 7 Acknowledgments We would like to express our gratitude to both For...

8 | Little’s Law and High Performance Computing
- Bailey
- 1997
Citation Context: ...le with standard prefetch schemes on conventional cache hierarchies, but also makes the programming model more complex. In particular, the hardware provides enough concurrency to satisfy Little’s Law [2]. In addition, this approach eliminates both conflict misses and write fills, but the capacity miss concept must be handled in software. Each SPE is a dual-issue SIMD architecture: one slot can issu...

6 |
Optimizing sparse matrix vector multiply using unroll-and-jam
- Mellor-Crummey, Garvin
Citation Context: ..., and matrix powers [19, 23]. Better low-level tuning of the kind proposed in this paper, even applied to just a CSR SpMV, are also possible. Recent work on low-level tuning of SpMV by unroll-and-jam [15], software pipelining [6], and prefetching [21] influence our work. 3 Experimental Testbed Our work examines several leading CMP system designs in the context of the demanding SpMV algorithm. In thi...

5 | Memory hierarchy optimizations and bounds for sparse AᵀAx - Vuduc, Gyulassy, et al. - 2003 |

3 |
Optimizing Sparse Matrix-Vector Multiplication Using Index and Value Compression
- Kourtis, Goumas, Koziris
- 2008
Citation Context: ...e especially well-suited to machines with SIMD and vector units [23, 27], dense subtriangles arising in sparse triangular solve [26], symmetry [14], general pattern compression [28], value compression [13], and combinations. Others have considered improving spatial and temporal locality by rectangular cache blocking [11], diagonal cache blocking [20], and reordering the rows and columns of the matrix. ...

2 |
Improving sparse matrix-vector product kernel performance and availability
- Willenbring, Anda, et al.
- 2006
Citation Context: ...lock and diagonal structures, and locality-enhancing reordering. OSKI is a serial library, but is being integrated into higher-level parallel linear solver libraries, including PETSc [3] and Trilinos [29]. This paper compares our multicore implementations against a generic “off-the-shelf” approach in which we take the MPI-based distributed memory SpMV in PETSc 2.3.0 and replace the serial SpMV compone...

1 |
Report on sparse matrix performance analysis. Intel report
- Gou, Liao, et al.
- 2006
Citation Context: ...ss obvious why the extremely powerful Clovertown core can only utilize 2.4 GB/s (22%) of its memory bandwidth, when the FSB can theoretically deliver 10.6 GB/s. Intel’s analysis of SpMV on Clovertown [7] suggested the coherency traffic for the snoopy FSB protocol is comparable in volume to read traffic. As a result, the effective FSB performance is cut in half. The other architectures examined here p...