
## When cache blocking sparse matrix vector multiply works and why (2004)



Venue: In Proceedings of the PARA’04 Workshop on the State-of-the-Art in Scientific Computing

Citations: 28 (5 self)

### Citations

478 | Automatically Tuned Linear Algebra Software - Whaley, Dongarra - 1997

314 | SPARSKIT: a basic tool kit for sparse matrix computations - Saad - 1990

138 | A scalable cross-platform infrastructure for application performance tuning using hardware counters - Browne, Dongarra, et al. - 2000

Citation context: "...cking to have a significant effect. The properties of the 14 matrices that were chosen are referenced in Table 1. We evaluate the performance model in which we use true hardware counters through PAPI [2] to predict the performance (henceforth called the PAPI model) and compare it to the model in which we use estimates of lower and upper bounds of cache and TLB misses (henceforth termed the analytic lo..."

76 | Automatic performance tuning of sparse matrix kernels - Vuduc - 2003

Citation context: "...ce memory traffic. This paper focuses on performance models for cache blocking (Sect. 2) and asks the fundamental questions of what limits exist on such performance tuning, extending our prior models [15] by accounting for the TLB (translation lookaside buffer, i.e., a buffer of the most recently used virtual-to-physical address translations), enabling accurate selection of optimal cache block sizes. Cac..."
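The cache-blocking technique this context refers to can be sketched in a few lines. The sketch below is my own illustration with invented helper names, not code from the paper, Sparsity, or OSKI: the sparse matrix is partitioned into submatrices so that each one reuses only a small, cache- and TLB-resident slice of the source vector x and destination vector y.

```python
def spmv_csr(val, col, rowptr, x, y):
    """y += A*x for one CSR submatrix; col indices are local to the block."""
    for i in range(len(rowptr) - 1):
        acc = y[i]
        for k in range(rowptr[i], rowptr[i + 1]):
            acc += val[k] * x[col[k]]
        y[i] = acc

def blocked_spmv(blocks, nrows, x):
    """blocks maps (row_start, col_start) -> (val, col, rowptr) for one
    cache block; each block touches only a contiguous slice of x and y."""
    y = [0.0] * nrows
    for (i0, j0), (val, col, rowptr) in blocks.items():
        yb = y[i0:i0 + len(rowptr) - 1]
        spmv_csr(val, col, rowptr, x[j0:], yb)  # block sees x from col_start on
        y[i0:i0 + len(yb)] = yb
    return y
```

Splitting a matrix into two column blocks gives the same product as one block, but each block's inner loop now reads only its own slice of x, which is the source of the cache and TLB savings the model quantifies.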

65 | Optimizing the Performance of Sparse Matrix-Vector Multiplication - Im - 2000

Citation context: "...timizing dense matrix kernels (dense BLAS) [16,1], performance depends on the nonzero structure of the matrix which may not be known until run-time. In prior work on the Sparsity system (version 1.0) [7], Im developed an algorithm generator and search strategy for SpM×V that was quite effective in practice. The Sparsity generators employed a variety of performance optimization techniques, including r..."

57 | Performance optimizations and bounds for sparse matrix-vector multiply - Vuduc, Demmel, et al. - 2002

Citation context: "...e found in practice that this initial estimate of TLB misses is sufficiently accurate to predict good block sizes. 3.1 Overall performance model. The overall performance model is similar to the one in [14], except that we have added one more latency term to account for the TLB misses. We model execution time as follows. First, since we want an upper bound on performance (lower bound on time), we assume..."
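The bound described in this context can be written down concretely. The sketch below is my own rendering with invented symbol names, not the paper's exact model: charge every load at least the L1 latency, add the incremental latency of whichever level a miss falls through to, and add a separate penalty per TLB miss; an upper bound on Mflop/s then follows from a lower bound on misses.

```python
def time_lower_bound(loads, l1_misses, l2_misses, tlb_misses,
                     a_l1, a_l2, a_mem, a_tlb):
    """Lower bound on time: every load pays at least an L1-hit latency,
    each miss pays the increment to the next level, TLB misses pay a_tlb."""
    t = loads * a_l1
    t += l1_misses * (a_l2 - a_l1)    # loads served from L2
    t += l2_misses * (a_mem - a_l2)   # loads served from main memory
    t += tlb_misses * a_tlb           # the added TLB-miss latency term
    return t

def mflops_upper_bound(nnz, t_seconds):
    """SpMxV performs 2 flops (one multiply, one add) per stored nonzero."""
    return 2.0 * nnz / t_seconds / 1e6
```

For example, with 10 loads, 2 L1 misses, 1 L2 miss, no TLB misses, and latencies of 1, 5, and 50 cycles, the bound is 10 + 2*4 + 1*45 = 63 cycles; plugging a lower bound on misses into the first function yields the upper bound on performance the context mentions.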

50 | Modeling application performance by convolving machine signatures with application profiles - Snavely, Carrington, et al. - 2001

Citation context: "...he cache and memory latencies were derived [15] from published processor manuals, curve fitting, and experimental work using the Saavedra-Barrera memory system microbenchmark [10] and MAPS benchmarks [11]. Due to space limitations we present a summary of the full data [8]. Figure 3 shows an evaluation of the models in Sect. 3. The Base Performance line is the performance without cache blocking while Bes..."

47 | Characterizing the behavior of sparse algorithms on caches - Temam, Jalby - 1992

Citation context: "...o sparse kernels due to the presence of indirect and irregular memory accesses known only at run-time. Nevertheless, there have been a number of notable attempts to model performance. Temam and Jalby [12], Heras et al. [6], and Fraguela et al. [3] have developed sophisticated probabilistic cache miss models, but assume uniform dist..."

46 | CPU Performance Evaluation and Execution Time Prediction Using Narrow Spectrum Benchmarking - Saavedra-Barrera - 1992

Citation context: "...nd upper bound models). The cache and memory latencies were derived [15] from published processor manuals, curve fitting, and experimental work using the Saavedra-Barrera memory system microbenchmark [10] and MAPS benchmarks [11]. Due to space limitations we present a summary of the full data [8]. Figure 3 shows an evaluation of the models in Sect. 3. The Base Performance line is the performance without..."
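A Saavedra-Barrera-style memory microbenchmark sweeps strided accesses over buffers of growing size and watches the average cost per access jump as the working set spills each cache level. The sketch below is my own and only structural: in Python the interpreter overhead swamps real cache effects, so a serious version of this measurement would be written in C.

```python
import time

def time_per_access(n, stride, reps=10):
    """Average seconds per access over `reps` strided sweeps of n elements.
    Sweeping n upward and plotting the result exposes cache capacities."""
    buf = list(range(n))
    total = 0
    t0 = time.perf_counter()
    for _ in range(reps):
        for i in range(0, n, stride):
            total += buf[i]
    t1 = time.perf_counter()
    accesses = reps * ((n + stride - 1) // stride)
    return (t1 - t0) / accesses
```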

9 | PHiPAC: A portable, high-performance, ANSI C coding methodology and its application to matrix multiply (LAPACK Working Note 111) - Bilmes, Asanović, et al. - 1996

Citation context: "...tures. It is not unusual to see SpM×V run at under 10% of the peak floating point performance of a single processor [15, Fig. 1]. Moreover, in contrast to optimizing dense matrix kernels (dense BLAS) [16,1], performance depends on the nonzero structure of the matrix which may not be known until run-time. In prior work on the Sparsity system (version 1.0) [7], Im developed an algorithm generator and sear..."

9 | Memory hierarchy performance prediction for sparse blocked algorithms - Fraguela, Doallo, et al. - 1999

Citation context: "...rect and irregular memory accesses known only at run-time. Nevertheless, there have been a number of notable attempts to model performance. Temam and Jalby [12], Heras et al. [6], and Fraguela et al. [3] have developed sophisticated probabilistic cache miss models, but assume uniform distribution of non-zero entries. These models..."

7 | Towards realistic bounds for implicit CFD codes - Gropp, Kaushik, et al. - 1999

Citation context: "...ing the TLB, and (2) we explicitly model the execution time in addition to cache misses. Gropp et al. use bounds similar to the ones we develop to analyze and tune a computational fluid dynamics code [4]; Heber et al. present a detailed performance study of a fracture mechanics code on Itanium [5]. This paper considers tuning for matrices that come from a variety of other domains, and explores perfor..."

6 | Modeling and improving locality for irregular problems: sparse matrix-vector product on cache memories as a case study - Heras, Perez, et al. - 1999

Citation context: "...e to the presence of indirect and irregular memory accesses known only at run-time. Nevertheless, there have been a number of notable attempts to model performance. Temam and Jalby [12], Heras et al. [6], and Fraguela et al. [3] have developed sophisticated probabilistic cache miss models, but assume uniform distribution of non-ze..."

6 | Performance Modeling and Analysis of Cache Blocking in Sparse Matrix Vector Multiply - Nishtala, Vuduc, et al.

Citation context: "...r domains, and explores performance modeling for cache block size selection. Due to space limitations we only present the high-level intuition and summary data. We refer the reader to the full report [8] for details. The software and algorithms described in this paper are available in OSKI (the Optimized Sparse Kernel Interface) [13]. OSKI is a collection of low-level C primitives that provide automa..."

5 | Fracture mechanics on the Intel Itanium architecture: A case study - Heber, Dolgert, et al. - 2001

Citation context: "...et al. use bounds similar to the ones we develop to analyze and tune a computational fluid dynamics code [4]; Heber et al. present a detailed performance study of a fracture mechanics code on Itanium [5]. This paper considers tuning for matrices that come from a variety of other domains, and explores performance modeling for cache block size selection. Due to space limitations we only present the hig..."

1 | OSKI: Optimized Sparse Kernel Interface (2005), http://bebop.cs.berkeley.edu/oski - Vuduc

Citation context: "...l intuition and summary data. We refer the reader to the full report [8] for details. The software and algorithms described in this paper are available in OSKI (the Optimized Sparse Kernel Interface) [13]. OSKI is a collection of low-level C primitives that provide automatically tuned computational kernels on sparse matrices, for use in solver libraries and applications. 2 Summary of the cache blockin..."
