
## Efficient Sparse Matrix-Vector Multiplication on x86-Based Many-Core Processors

Venue: 27th International Conference on Supercomputing (ICS)

Citations: 14 (0 self)

### Citations

153 | Optimization of sparse matrix-vector multiplication on emerging multicore platforms.
- Williams, Oliker, et al.
- 2009
Citation Context: ...ctors. While SpMV is considered one of the most important computational kernels, it usually performs poorly on modern architectures, achieving less than 10% of the peak performance of microprocessors [7, 26]. Achieving higher performance usually requires carefully choosing the sparse matrix storage format and fully utilizing the underlying system architecture. Recently, Intel announced the Intel® Xeon ...

142 | Implementing Sparse Matrix-vector Multiplication on Throughput-oriented Processors
- Bell, Garland
- 2009
Citation Context: ...ision performance, 76 GB/s STREAM Triad bandwidth. For K20X, we used two SpMV codes: cuSPARSE v5.03 and CUSP v0.3 [2], which use a variety of matrix formats: CSR, COO, DIA, BCSR, ELLPACK, and ELL/HYB [1]. For each of the test matrices, we ran experiments with both codes using all the formats. For brevity, we only report the best of these results. For the dual SNB-EP system, we tested three SpMV codes...

84 | Sparsity: Optimization framework for sparse matrix kernels - Im, Yelick, et al. - 2004

76 | Automatic performance tuning of sparse matrix kernels.
- Vuduc
- 2003
Citation Context: ...SpMV optimization has been extensively studied over decades on various architectures. Most relevant for us are optimizations for CPUs and GPUs. For a comprehensive review, we refer to several survey papers [23, 8, 26, 7]. Blocking is widely used for optimizing SpMV on CPUs. Depending on the motivation, block methods can be divided into two major categories. In the first category, blocking improves the spatial and tem...

65 | Model-driven autotuning of sparse matrix-vector multiply on GPUs
- Choi, Singh, et al.
- 2010
Citation Context: ...rn of blocks of sparse matrices in order to avoid explicitly storing zeros [27, 3]. 4.2.2 Finite-Window Sorting. On GPUs, row sorting can be used to increase the nonzero density of slices for SELLPACK [5, 11]. Rows of the sparse matrix are sorted in descending order of the number of nonzeros per row. As adjacent rows in the sorted matrix have similar numbers of nonzeros, storing the sorted matrix in SELLPACK ...
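The windowed sorting step described in this context can be sketched as follows; this is an illustrative reading of finite-window sorting (sorting rows by nonzero count only within fixed-size windows to limit how far rows move), with the window size `w` as a tuning parameter and the function name invented for the example:

```python
def finite_window_sort(row_nnz, w):
    """Return a row permutation: rows sorted by descending nonzero count,
    but only within each consecutive window of w rows, so rows never move
    far from their original position (preserving locality on x)."""
    perm = []
    for start in range(0, len(row_nnz), w):
        window = list(range(start, min(start + w, len(row_nnz))))
        window.sort(key=lambda r: row_nnz[r], reverse=True)
        perm.extend(window)
    return perm

# Rows 0-2 and 3-5 are each sorted independently by nnz count.
print(finite_window_sort([1, 5, 3, 2, 8, 4], 3))  # → [1, 2, 0, 4, 5, 3]
```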

65 | Improving performance of sparse matrix-vector multiplication,
- Pinar, Heath
- 1999
Citation Context: ...including register [13, 8], cache [8, 16] and TLB [16]. In the second category, block structures are discovered in order to eliminate integer index overhead, thus reducing the bandwidth requirements [24, 20]. Besides blocking, other techniques have also been proposed to reduce the bandwidth requirements of SpMV. These techniques broadly include matrix reordering [17], value and index compression [25, 10]...

46 | Cusp: Generic parallel algorithms for sparse matrix and graph computations
- Bell, Garland
- 2010
Citation Context: ...5-2680 (Sandy Bridge EP), 20 MB L2 cache, 32 GB DDR memory, 346 Gflops peak double precision performance, 76 GB/s STREAM Triad bandwidth. For K20X, we used two SpMV codes: cuSPARSE v5.03 and CUSP v0.3 [2], which use a variety of matrix formats: CSR, COO, DIA, BCSR, ELLPACK, and ELL/HYB [1]. For each of the test matrices, we ran experiments with both codes using all the formats. For brevity, we only re...

30 | Auto-tuning Performance on Multicore Computers
- Williams
- 2008
Citation Context: ...still perform arithmetic on padding zeros. We note that earlier work has also used bit arrays for storing the sparsity pattern of blocks of sparse matrices in order to avoid explicitly storing zeros [27, 3]. 4.2.2 Finite-Window Sorting. On GPUs, row sorting can be used to increase the nonzero density of slices for SELLPACK [5, 11]. Rows of the sparse matrix are sorted in descending order of number of non...

29 | Automatically tuning sparse matrix-vector multiplication for GPU architectures, in High Performance Embedded Architectures
- Monakov, Lokhmotov, et al.
- 2010
Citation Context: ...ds on the distribution of nonzeros. When the number of nonzeros per row varies considerably, the performance degrades due to the overhead of the padding zeros. To address this problem, Monakov et al. [15] proposed the sliced ELLPACK (SELLPACK) format for GPUs, which partitions the sparse matrix into row slices and packs each slice separately, thus requiring less zero padding than ELLPACK. Table 2 also...
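The padding saved by slicing can be illustrated with a toy storage-cost calculation (function names and the `slice_height` parameter are for illustration only; ELLPACK pads every row to the global maximum row length, while sliced ELLPACK pads only to each slice's local maximum):

```python
def ellpack_padded(row_nnz):
    """Total stored entries for ELLPACK: every row padded to the max."""
    return max(row_nnz) * len(row_nnz)

def sellpack_padded(row_nnz, slice_height):
    """Total stored entries for sliced ELLPACK: each slice of
    slice_height rows is padded only to its own local maximum."""
    total = 0
    for s in range(0, len(row_nnz), slice_height):
        sl = row_nnz[s:s + slice_height]
        total += max(sl) * len(sl)
    return total

rows = [1, 2, 10, 10]
print(ellpack_padded(rows))      # → 40 (every row padded to 10)
print(sellpack_padded(rows, 2))  # → 24 (slices padded to 2 and 10)
```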

28 | When cache blocking of sparse matrix vector multiply works and why.
- Nishtala, Vuduc, et al.
- 2007
Citation Context: ...ind that they are only weakly related and that it is better to select c before selecting w. The choice of the number of block columns c depends on the structure of the sparse matrix. Nishtala et al. [16] describe situations where cache blocking is beneficial, including the presence of very long rows. We found column blocking to be beneficial for three of our test matrices: Rail4284, 12month1 and Spa...

26 | Accelerating sparse matrix computations via data compression
- Willcock, Lumsdaine
- 2006
Citation Context: ...[24, 20]. Besides blocking, other techniques have also been proposed to reduce the bandwidth requirements of SpMV. These techniques broadly include matrix reordering [17], value and index compression [25, 10], and exploiting symmetry [3]. Due to the increasing popularity of GPUs, in recent years numerous matrix formats and optimization techniques have been proposed to improve the performance of SpMV on GP...

24 | Fast sparse matrix-vector multiplication by exploiting variable block structure. In: High Performance Computing and Communications
- Vuduc, Moon
- 2005
Citation Context: ...including register [13, 8], cache [8, 16] and TLB [16]. In the second category, block structures are discovered in order to eliminate integer index overhead, thus reducing the bandwidth requirements [24, 20]. Besides blocking, other techniques have also been proposed to reduce the bandwidth requirements of SpMV. These techniques broadly include matrix reordering [17], value and index compression [25, 10]...

22 | Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication
- Buluc, Williams, et al.
- 2011
Citation Context: ...on techniques used for CPUs and GPUs. 4.1.1 Register Blocking. Register blocking [8] is one of the most effective optimization techniques for SpMV on CPUs, and is central to many sparse matrix formats [3, 4, 26]. In this section, we explain why this technique is not appropriate for KNC. In register blocking, adjacent nonzeros of the matrix are grouped into small dense blocks to facilitate SIMD, and to reuse ...
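The block-grouping idea can be sketched as a toy conversion of a dense matrix into r×c dense blocks, in the spirit of BCSR-style storage; this is an illustrative model (names invented), not the paper's KNC implementation:

```python
def to_bcsr_blocks(dense, r, c):
    """Collect the nonempty r-by-c blocks of a small dense matrix.
    Returns {(block_row, block_col): list of r*c values in row-major
    order}, with explicit zero padding inside each block; the stored
    zeros are the cost that buys index-free SIMD-friendly access."""
    n_rows, n_cols = len(dense), len(dense[0])
    blocks = {}
    for i in range(n_rows):
        for j in range(n_cols):
            if dense[i][j] != 0:
                bi, bj = i // r, j // c
                blk = blocks.setdefault((bi, bj), [0.0] * (r * c))
                blk[(i % r) * c + (j % c)] = dense[i][j]
    return blocks

# Two 2x2 blocks: one holding the diagonal pair, one mostly padding.
print(to_bcsr_blocks([[1, 0, 0, 2], [0, 3, 0, 0]], 2, 2))
```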

22 | Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations.
- Oliker, Li, et al.
- 2002
Citation Context: ...ducing the bandwidth requirements [24, 20]. Besides blocking, other techniques have also been proposed to reduce the bandwidth requirements of SpMV. These techniques broadly include matrix reordering [17], value and index compression [25, 10], and exploiting symmetry [3]. Due to the increasing popularity of GPUs, in recent years numerous matrix formats and optimization techniques have been proposed to...

19 | Fast optimal load balancing algorithms for 1D partitioning
- Pinar, Aykanat
- 2004
Citation Context: ...thod was originally designed for distributed memory systems, we adapted it to KNC with minor changes. To reduce the cost of re-partitioning, we used the 1D partitioning algorithm of Pinar and Aykanat [19]. Experiments show that the adaptive tuning converges in fewer than 10 executions. 6. EXPERIMENTAL RESULTS. 6.1 Load Balancing Results. We first test the load balancing techniques discussed in Section 5...
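A minimal sketch of cost-balanced 1D row partitioning; note this is a greedy prefix-sum heuristic for illustration, not the optimal probe-based algorithm of Pinar and Aykanat [19]:

```python
import bisect

def partition_rows(row_cost, n_parts):
    """Split rows into n_parts contiguous chunks. Each boundary is
    placed where the cumulative cost first reaches k * total/n_parts,
    found by binary search on the prefix-sum array."""
    prefix = [0]
    for cst in row_cost:
        prefix.append(prefix[-1] + cst)
    total = prefix[-1]
    bounds = [0]
    for k in range(1, n_parts):
        target = k * total / n_parts
        cut = bisect.bisect_left(prefix, target)
        bounds.append(max(bounds[-1], min(cut, len(row_cost))))
    bounds.append(len(row_cost))
    return bounds  # part p owns rows bounds[p] .. bounds[p+1]-1

# Cheap rows up front, expensive rows at the end: the cut lands late.
print(partition_rows([1, 1, 1, 1, 4, 4, 4, 4], 2))  # → [0, 6, 8]
```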

17 | Performance evaluation of the sparse matrix-vector multiplication on modern architectures.
- Goumas, Kourtis, et al.
- 2009
Citation Context: ...ctors. While SpMV is considered one of the most important computational kernels, it usually performs poorly on modern architectures, achieving less than 10% of the peak performance of microprocessors [7, 26]. Achieving higher performance usually requires carefully choosing the sparse matrix storage format and fully utilizing the underlying system architecture. Recently, Intel announced the Intel® Xeon ...

17 | Optimizing sparse matrix-vector product computations using unroll and jam
- Mellor-Crummey, Garvin
Citation Context: ...major categories. In the first category, blocking improves the spatial and temporal locality of the SpMV kernel by exploiting data reuse at various levels in the memory hierarchy, including register [13, 8], cache [8, 16] and TLB [16]. In the second category, block structures are discovered in order to eliminate integer index overhead, thus reducing the bandwidth requirements [24, 20]. Besides blocking,...

12 | Scheduling task parallelism on multi-socket
- Porterfield, Wheeler, et al.
Citation Context: ...etween the thief thread and the victim thread. To address the performance issues of both work-sharing and work-stealing schedulers, we design a hybrid work-sharing/work-stealing scheduler, inspired by [18]. In the hybrid dynamic scheduler, the sparse matrix is first partitioned into P tasks with equal numbers of nonzeros. The tasks are then evenly distributed to N task queues, each of which is shared b...
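The queue layout can be sketched with a toy single-threaded simulation; real schedulers use concurrent queues and actual thread timing, and all names here are illustrative:

```python
from collections import deque

def hybrid_schedule(task_costs, n_queues):
    """Toy simulation of a hybrid scheduler: tasks are dealt round-robin
    to n_queues shared queues; on each step, every 'worker' takes one
    task from its own queue, falling back to the fullest other queue
    (stealing) only when its own queue is empty."""
    queues = [deque() for _ in range(n_queues)]
    for i, cst in enumerate(task_costs):
        queues[i % n_queues].append(cst)
    work_done = [0] * n_queues
    active = True
    while active:
        active = False
        for q in range(n_queues):
            src = q if queues[q] else max(
                range(n_queues), key=lambda j: len(queues[j]))
            if queues[src]:
                work_done[q] += queues[src].popleft()
                active = True
    return work_done

print(hybrid_schedule([5, 1, 1, 1], 2))  # → [6, 2]
```

Every task is executed exactly once, so the per-worker totals always sum to the total cost; the sharing within each queue is what limits the stealing traffic the context describes.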

11 | clSpMV: a cross-platform OpenCL SpMV framework on GPUs
- Su, Keutzer
- 2012
Citation Context: ...f indices giving the beginning of each row are stored in rowptr. 3.1 Test Matrices. Table 1 lists the sparse matrices used in our performance evaluation. These are all the matrices used in previous papers [26, 21, 9] that are larger than the 30 MB aggregate L2 cache of KNC (using 60 cores). A dense matrix stored in sparse format is also included. These matrices come from a wide variety of applications with differ...
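For reference, a minimal CSR SpMV kernel using the `rowptr` array mentioned in this context; this is the textbook formulation, not the paper's optimized KNC code:

```python
def spmv_csr(rowptr, colidx, vals, x):
    """Compute y = A*x for a CSR matrix: row i's nonzeros live in
    vals[rowptr[i]:rowptr[i+1]], with their column indices in the
    matching slice of colidx."""
    y = [0.0] * (len(rowptr) - 1)
    for i in range(len(y)):
        for k in range(rowptr[i], rowptr[i + 1]):
            y[i] += vals[k] * x[colidx[k]]
    return y

# A = [[1, 0, 2], [0, 3, 0]], x = [1, 1, 1]  →  y = [3, 3]
print(spmv_csr([0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))
```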

11 | A new approach for sparse matrix vector product on
- Vazquez, Fernandez, et al.
- 2011
Citation Context: ...zero padding. To further reduce the memory bandwidth associated with the padding zeros, we can avoid storing zeros by storing instead the length (number of nonzeros) of each row, as done in ELLPACK-R [22] for GPUs. Here, each CUDA thread multiplies elements in a row until the end of the row is reached. For SIMD processing on CPUs, we can store the length of each column of the dense ELLPACK arrays, cor...
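A sketch of the row-length idea behind ELLPACK-R, with scalar Python standing in for the GPU/SIMD kernels and illustrative array names: `cols`/`vals` are the zero-padded dense ELLPACK arrays, and `rowlen[i]` says how many entries of row i are real, so padding is never multiplied.

```python
def spmv_ellpack_r(cols, vals, rowlen, x):
    """Compute y = A*x from ELLPACK-R-style storage: each row's inner
    loop stops at rowlen[i] instead of the padded width."""
    y = []
    for i, length in enumerate(rowlen):
        acc = 0.0
        for k in range(length):
            acc += vals[i][k] * x[cols[i][k]]
        y.append(acc)
    return y

# Row 0 has 2 real entries, row 1 has 1 (its second slot is padding).
print(spmv_ellpack_r([[0, 2], [1, 0]], [[1.0, 2.0], [3.0, 0.0]],
                     [2, 1], [1.0, 1.0, 1.0]))  # → [3.0, 3.0]
```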

10 | Implementing blocked sparse matrix-vector multiplication on NVIDIA GPUs
- Monakov, Avetisyan
Citation Context: ...blocked SpMV implementations on GPUs. Jee et al. [5] proposed the BELLPACK matrix format, which partitions the matrix into small dense blocks and organizes the blocks in ELLPACK format. Monakov et al. [14] proposed a format also using small dense blocks, but augmented with the ELLPACK format for nonzeros outside this structure. 8. CONCLUSIONS. This paper presented an efficient implementation of SpMV for t...

8 | Performance optimization and modeling of blocked sparse kernels
- Buttari, Eijkhout, et al.
- 2005
Citation Context: ...hs, e.g., Rajat31 and Rucci1, the compute bound performance is lower than the bandwidth bound performance, suggesting that their performance is actually bounded by computation. Previous work on SpMV [7, 4] attributes the poor performance of matrices with short rows to loop overheads. For wide SIMD, we argue that this is also because of the low SIMD efficiency of the CSR kernel, i.e., the low fraction of u...

5 | Sparse matrix-vector multiplication on GPGPU clusters: A new storage format and a scalable implementation
- Kreutzer, Hager, et al.
- 2012
Citation Context: ...rn of blocks of sparse matrices in order to avoid explicitly storing zeros [27, 3]. 4.2.2 Finite-Window Sorting. On GPUs, row sorting can be used to increase the nonzero density of slices for SELLPACK [5, 11]. Rows of the sparse matrix are sorted in descending order of number of nonzeros per row. As adjacent rows in the sorted matrix have similar numbers of nonzeros, storing the sorted matrix in SELLPACK ...

5 | Adaptive runtime tuning of parallel sparse matrix-vector multiplication on distributed memory systems
- Lee, Eigenmann
- 2008
Citation Context: ...erformance data from earlier calls to SpMV to repartition the matrix to achieve better load balance. This idea, which we call adaptive load balancing, was first proposed for SpMV by Lee and Eigenmann [12]. In their load balancer, the execution time of each thread is measured after each execution of SpMV. The normalized cost of each row of the sparse matrix is then approximated as the execution time of...
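A sketch of the cost-approximation step: assuming each thread's measured time is spread over its rows in proportion to their nonzero counts. This is an illustrative model with invented names, not Lee and Eigenmann's exact formula.

```python
def estimate_row_costs(bounds, thread_times, row_nnz):
    """Approximate per-row costs from per-thread timings. Thread t owns
    rows bounds[t] .. bounds[t+1]-1; its measured time is distributed
    over those rows weighted by each row's nonzero count. The resulting
    costs can then feed the next 1D repartitioning."""
    costs = [0.0] * len(row_nnz)
    for t in range(len(thread_times)):
        lo, hi = bounds[t], bounds[t + 1]
        nnz = sum(row_nnz[lo:hi]) or 1  # guard against empty chunks
        for i in range(lo, hi):
            costs[i] = thread_times[t] * row_nnz[i] / nnz
    return costs

# Thread 1 was twice as slow; its heavy row 3 absorbs most of the cost.
print(estimate_row_costs([0, 2, 4], [2.0, 4.0], [1, 1, 1, 3]))
```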

4 | SpMV: A Memory-Bound Application on the GPU Stuck Between a Rock and a Hard Place
- Davis, Chung
- 2012
Citation Context: ...ned as Pbw = 2·nnz/(Mmin/B), where Mmin is the minimum memory traffic of SpMV assuming perfect reuse of vectors x and y, and B is the memory bandwidth. Prior work has shown that SpMV is memory bandwidth bound on modern architectures [7, 26, 6]. Thus, it is expected that for all matrices the compute bound performance is larger than the memory bandwidth bound performance. The ideal balanced performance represents the achievable peak performa...
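As a worked example of this bandwidth bound (2·nnz flops over Mmin/B seconds): assuming roughly 12 bytes of minimum traffic per nonzero (8-byte double value plus 4-byte column index, vector traffic ignored) and the 76 GB/s STREAM bandwidth quoted elsewhere on this page. The byte count per nonzero is an assumption for illustration, not a figure from the paper.

```python
def bandwidth_bound_gflops(nnz, bandwidth_gbps, bytes_per_nnz=12):
    """Bandwidth-bound SpMV performance: flops = 2*nnz,
    time = M_min/B with M_min ~ bytes_per_nnz * nnz bytes."""
    m_min = bytes_per_nnz * nnz                 # minimum traffic, bytes
    time_s = m_min / (bandwidth_gbps * 1e9)     # seconds to move it
    return 2 * nnz / time_s / 1e9               # Gflop/s

# Under this model the bound is independent of nnz: 2*B/12 Gflop/s.
print(bandwidth_bound_gflops(10**6, 76))  # ~12.7 Gflop/s at 76 GB/s
```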

4 | A comparative study of blocking storage methods for sparse matrices on multicore architectures
- Karakasis, Goumas, et al.
- 2009
Citation Context: ...f indices giving the beginning of each row are stored in rowptr. 3.1 Test Matrices. Table 1 lists the sparse matrices used in our performance evaluation. These are all the matrices used in previous papers [26, 21, 9] that are larger than the 30 MB aggregate L2 cache of KNC (using 60 cores). A dense matrix stored in sparse format is also included. These matrices come from a wide variety of applications with differ...

4 | Exploiting Compression Opportunities to Improve SpMxV Performance on Shared Memory Systems
- Kourtis, Goumas, et al.
- 2010
Citation Context: ...[24, 20]. Besides blocking, other techniques have also been proposed to reduce the bandwidth requirements of SpMV. These techniques broadly include matrix reordering [17], value and index compression [25, 10], and exploiting symmetry [3]. Due to the increasing popularity of GPUs, in recent years numerous matrix formats and optimization techniques have been proposed to improve the performance of SpMV on GP...