## A BENCHMARK FOR REGISTER-BLOCKED SPARSE MATRIX-VECTOR MULTIPLY

### Citations

4934 | Computer Architecture: A Quantitative Approach - Hennessy, Patterson - 2007

2282 | Iterative Methods for Sparse Linear Systems - Saad - 1996

931 | Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers - Jouppi - 1990
Citation Context: ...are running at the same time. Setting the source vector as large as L(k−1) should cause enough capacity conflicts in this level to force use of the Lk cache. However, if the system uses a victim cache [16], or some other technique for avoiding conflict misses at the L(k−1) level, this may not suffice. We thus chose n as follows: n = 2 L(k−1) (2). 3.5. Number of nonzeros. The amount of memory consume...
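The sizing rule quoted in the snippet above (n = 2 L(k−1)) can be sketched as a small helper. The function name, and the assumptions that the cache size is given in bytes and that the vector holds 8-byte doubles, are illustrative and not taken from the paper:

```python
def source_vector_length(cache_bytes: int, elem_bytes: int = 8) -> int:
    """Pick a source vector twice as large as the L(k-1) cache, so that
    even a victim cache cannot hide the conflict misses and accesses are
    forced down to the Lk level.  Units (bytes, 8-byte doubles) are an
    assumption for illustration."""
    return 2 * cache_bytes // elem_bytes

# For a 32 KiB L1 cache: 2 * 32768 / 8 = 8192 doubles.
print(source_vector_length(32 * 1024))  # prints 8192
```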

870 | A set of Level 3 Basic Linear Algebra Subprograms - Dongarra, Du Croz, et al. - 1990
Citation Context: ...s (BLAS) describe matrix-vector and matrix-matrix operations as Level 2 and Level 3, respectively, to reflect the increased degree of potential optimization latent in each type of operation (see e.g. [11]). In practice, Level 3 operations tend to run at a high fraction of the peak floating-point rate (around 80%), whereas dense matrix-vector multiply (a Level 2 operation) ranges from around 9% of peak...

602 | FFTW: An adaptive software architecture for the FFT - Frigo, Johnson - 1998
Citation Context: ...vely in a general application. Nevertheless, the ATLAS project for automatic optimization of the BLAS (and certain LAPACK routines) takes advantage of both SSE and SSE2 [26], as does FFTW version 3.0 [12]. There are also libraries available for small matrix-vector manipulations using SSE and SSE2 (e.g. [8, 18]). Intel’s Itanium 2 processor, released in 2003, only supports the ori...

254 | A portable programming interface for performance evaluation on modern processors - Browne, Dongarra, et al. - 2000
Citation Context: ...processors. Our model should be supported by hardware counts of cache and TLB misses and page faults. We plan to use platform-independent performance counter interfaces such as PAPI whenever possible [3]. We need access to data on TLB misses and page faults because these have the most potential to disturb the expected load and store latencies that allow us to compare results. 2.4. Memory hierarchy ch...

187 | The microarchitecture of the Pentium 4 processor - Hinton, Sager, et al.
Citation Context: ...on both implement SSE. The second version of SSE, SSE2, extends SSE to parallel operations on two 64-bit double-precision values. Intel’s Pentium 4 and AMD’s Opteron both implement these instructions [1, 14]. Intel carefully points out that only specialized scientific and medical visualizations need the additional precision [4], which suggests the difficulty of using SSE2 effectively in a general applica...

76 | Measuring cache and TLB performance and their effect on benchmark runtimes - Saavedra, Smith - 1995
Citation Context: ...performance. Saavedra introduced a set of “microbenchmarks” which uses STREAM-like array accesses, but varying the array sizes and access strides, to probe the memory hierarchy [22, 21]. The Memory Access Patterns (MAPS) code carries out similar operations on a single node of a multiprocessor, and the results are used to predict parallel performance for a wide range of scientific ap...

65 | Optimizing the Performance of Sparse Matrix-Vector Multiplication - Im - 2000
Citation Context: ...arse data structure and SMVM algorithm itself, to take advantage of dense locality on a small scale. This is the approach taken by the Sparsity automatic tuning system, which will be summarized below [15]. Our benchmark uses SMVM algorithms generated by Sparsity. 1.1. Unoptimized and optimized sparse matrix formats. Although many sparse matrix storage formats are in use, the compressed sparse row (CSR...
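Since the snippet names compressed sparse row (CSR) storage as the baseline that Sparsity's register-blocked kernels improve upon, a minimal unblocked CSR matrix-vector multiply may help fix the data layout in mind. This is an illustrative sketch, not code from the benchmark:

```python
def csr_matvec(n_rows, row_ptr, col_idx, vals, x):
    """y = A*x for a matrix A in compressed sparse row (CSR) format.
    row_ptr[i]:row_ptr[i+1] spans the nonzeros of row i; col_idx holds
    their column indices and vals their values."""
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access x[col_idx[k]] is the source of SMVM's
            # irregular memory traffic.
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y

# 2x2 example with A = [[1, 2], [0, 3]] and x = [1, 1]:
print(csr_matvec(2, [0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0], [1.0, 1.0]))
# prints [3.0, 3.0]
```

Register blocking replaces the scalar inner loop with small dense r×c blocks so that entries of x stay in registers; the loop above shows only the unoptimized format.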

58 | Automated empirical optimization of software and the ATLAS project - Whaley, Petitet, et al. - 2001
Citation Context: ...ifficulty of using SSE2 effectively in a general application. Nevertheless, the ATLAS project for automatic optimization of the BLAS (and certain LAPACK routines) takes advantage of both SSE and SSE2 [26], as does FFTW version 3.0 [12]. There are also libraries available for small matrix-vector manipulations using SSE and SSE2 (e.g. [8, 18]). Intel’s Itanium 2 processor, released...

57 | Performance optimizations and bounds for sparse matrix-vector multiply - Vuduc, Demmel, et al. - 2002
Citation Context: ...ion of the peak floating-point rate (around 80%), whereas dense matrix-vector multiply (a Level 2 operation) ranges from around 9% of peak for the Sun UltraSPARC II to 21% for the Itanium 2 (see e.g. [25]). A typical DGEMV rate might be 510 Mflop/s for a 1.5 GHz Pentium 4, using Intel’s MKL BLAS implementation. In contrast, most sparse matrix formats require (algorithmically) sequential accesses to t...

50 | Modeling application performance by convolving machine signatures with application profiles - Snavely, Carrington, et al. - 2001
Citation Context: ...y Access Patterns (MAPS) code carries out similar operations on a single node of a multiprocessor, and the results are used to predict parallel performance for a wide range of scientific applications [23]. Even computer architecture textbooks have examples of such codes (see [13, pp. 513–515], for example). Analyzing the output of MAPS-like codes can be used to determine a processor’s memory hierarchy...

46 | CPU Performance Evaluation and Execution Time Prediction Using Narrow Spectrum Benchmarking - Saavedra-Barrera - 1992
Citation Context: ...performance. Saavedra introduced a set of “microbenchmarks” which uses STREAM-like array accesses, but varying the array sizes and access strides, to probe the memory hierarchy [22, 21]. The Memory Access Patterns (MAPS) code carries out similar operations on a single node of a multiprocessor, and the results are used to predict parallel performance for a wide range of scientific ap...

40 | Sustainable memory bandwidth in current high performance computers. http://reality.sgi.com/mccalpin asd/papers/bandwidth.ps - McCalpin - 1995
Citation Context: ...this scheme [19, 10]. The second approach uses simpler loops that resemble SMVM operations and may correlate with SMVM performance; examples include STREAM Triad and indirect indexed variants thereof [17]. The third technique uses memory hierarchy parameters and a mathematical or heuristic performance model. Examples of each of these methods are discussed below. We chose the first of them, because the...
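For reference, the STREAM Triad kernel and an indirectly indexed variant of the kind the snippet mentions look like the following. The Python rendering is a sketch (STREAM itself is distributed in C and Fortran), and the exact form of the indirect variant is an assumption:

```python
def triad(a, b, c, q):
    """STREAM Triad kernel: a[i] = b[i] + q * c[i] (unit-stride streams)."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]

def triad_indirect(a, b, c, q, idx):
    """Indirectly indexed variant (illustrative): the gather c[idx[i]]
    mimics the x[col_idx[k]] access pattern of sparse matrix-vector
    multiply."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[idx[i]]

a = [0.0, 0.0]
triad(a, [1.0, 2.0], [10.0, 20.0], 2.0)
print(a)  # prints [21.0, 42.0]
```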

35 | Compiling for SIMD Within A Register - Fisher, Dietz - 1998
Citation Context: ...on a single larger register (usually 64 or 128 bits) divided into several shorter data values. This began with instruction sets for short integers (such as Alpha MAX, SPARC VIS and Intel Pentium MMX) [9], and evolved into dual floating-point functional units for graphics applications [13, pp. 109–110]. Examples of processors with SIMD floating-point units grace many desktops and higher-end embedded s...

20 | Intel Architecture Optimization Reference Manual - Intel Corporation - 1999
Citation Context: ...The Streaming SIMD Extensions (SSE) to the IA-32 ISA, for example, specify parallel floating-point operations on two 32-bit single-precision values, packed into a single 64-bit floating-point register [5]. The MIPS-3D extensions to the MIPS64 ISA perform a similar task [24], as do the AltiVec instructions in IBM’s PowerPC 970 (except with four 32-bit values instead of two) [7]. Intel’s Pentium III an...

19 | Intel Itanium 2 processor reference manual for software development and optimization (order number 251110003) - Intel Corporation - 2004
Citation Context: ...’s Itanium 2 processor, released in 2003, only supports the original SSE instruction set, although its two fused multiply and add (FMAC) functional units offer potentially twice the bandwidth of SSE2 [6]. Furthermore, the Itanium 2 supports twice as many loads per cycle (four) as FMAC operations (two), unlike (for example) the Pentium 4’s NetBurst microarchitecture, which can only execute a single SI...

3 | Optimized matrix library for use with the Intel® Pentium® 4 processor’s SSE2 instructions - Devir - 2008
Citation Context: ...ines) takes advantage of both SSE and SSE2 [26], as does FFTW version 3.0 [12]. There are also libraries available for small matrix-vector manipulations using SSE and SSE2 (e.g. [8, 18]). Intel’s Itanium 2 processor, released in 2003, only supports the original SSE instruction set, although its two fused multiply and add (FMAC) functional units offer potentially twice the bandwidth...

2 | SSE2: Perform a double-precision 3D transform - Intel Corporation - 2001
Citation Context: ...es. Intel’s Pentium 4 and AMD’s Opteron both implement these instructions [1, 14]. Intel carefully points out that only specialized scientific and medical visualizations need the additional precision [4], which suggests the difficulty of using SSE2 effectively in a general application. Nevertheless, the ATLAS project for automatic optimization of the BLAS (and certain LAPACK routines) takes advantage...

2 | SparseBench: A sparse iterative benchmark, version 0.9.7 - Dongarra, Eijkhout, et al. - 2000
Citation Context: ...nce. Load average obtained with UNIX “uptime” command. Pentium M and Pentium 4 are single-user workstations, Athlon is a lightly used server, and the other machines are more heavily used. this scheme [19, 10]. The second approach uses simpler loops that resemble SMVM operations and may correlate with SMVM performance; examples include STREAM Triad and indirect indexed variants thereof [17]. The third tech...

2 | libSIMD: The single instruction multiple data library - Nicholson - 2003
Citation Context: ...ines) takes advantage of both SSE and SSE2 [26], as does FFTW version 3.0 [12]. There are also libraries available for small matrix-vector manipulations using SSE and SSE2 (e.g. [8, 18]). Intel’s Itanium 2 processor, released in 2003, only supports the original SSE instruction set, although its two fused multiply and add (FMAC) functional units offer potentially twice the bandwidth...