## Segmented Operations for Sparse Matrix Computation on Vector Multiprocessors (1993)

Citations: 28 (4 self)

### BibTeX

```bibtex
@MISC{Blelloch93segmentedoperations,
  author = {Guy Blelloch and Michael A. Heroux and Marco Zagha},
  title  = {Segmented Operations for Sparse Matrix Computation on Vector Multiprocessors},
  year   = {1993}
}
```

### Abstract

In this paper we present a new technique for sparse matrix multiplication on vector multiprocessors based on the efficient implementation of a segmented sum operation. We describe how the segmented sum can be implemented on vector multiprocessors such that it both fully vectorizes within each processor and parallelizes across processors. Because of our method's insensitivity to relative row size, it is better suited than the Ellpack/Itpack or the Jagged Diagonal algorithms for matrices which have a varying number of non-zero elements in each row. Furthermore, our approach requires less preprocessing (no more time than a single sparse matrix-vector multiplication), less auxiliary storage, and uses a more convenient data representation (an augmented form of the standard compressed sparse row format). We have implemented our algorithm (SEGMV) on the Cray Y-MP C90, and have compared its performance with other methods on a variety of sparse matrices from the Harwell-Boeing collection and in...
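The abstract's central idea, forming all products in one vectorized pass and then summing each row's contiguous segment, can be sketched over the standard CSR arrays. This is an illustrative sketch, not the paper's Cray implementation; the names `segmv`, `row_ptr`, etc. are assumptions, and only the CSR representation itself comes from the paper:

```python
import numpy as np

def segmv(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product via a segmented sum over CSR data.

    values, col_idx, row_ptr: conventional CSR arrays; x: dense vector.
    Each row's products form one segment; summing each segment yields y.
    """
    # Elementwise products for every non-zero at once (fully vectorizable).
    products = values * x[col_idx]
    # Segmented sum: reduce each segment [row_ptr[i], row_ptr[i+1]).
    # Note: np.add.reduceat assumes every row has at least one non-zero.
    return np.add.reduceat(products, row_ptr[:-1])

# 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR form:
values  = np.array([1.0, 2.0, 3.0])
col_idx = np.array([0, 2, 1])
row_ptr = np.array([0, 2, 3])
y = segmv(values, col_idx, row_ptr, np.array([1.0, 1.0, 1.0]))
# y is the matrix-vector product [3.0, 3.0]
```

Because the segmented sum operates on the flat array of non-zeros, its cost is insensitive to how those non-zeros are distributed across rows, which is the property the abstract contrasts with Ellpack/Itpack and Jagged Diagonal.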

### Citations

967 | High Performance Fortran language specification
- High Performance Fortran Forum
- 1993
Citation Context: ...on [28]. Because of their usefulness for such problems, hardware support was included in the Connection Machine CM-5 [26] for segmented scans, and the proposed High Performance Fortran (HPF) standard [21] contains scan intrinsics (called PREFIX and SUFFIX) with an optional argument for specifying segments. 3.2 Using Segmented Scans for Sparse Matrix Multiplication A sparse matrix can be represented as...

617 | The NAS Parallel Benchmarks
- Bailey, Barszcz, et al.
- 1991
Citation Context: ...least affected by the structure of the matrices. In addition to measuring times for matrix multiplication on the test matrices, we have used SEGMV as the core for the NAS Conjugate Gradient benchmark [3, 4]. On 16 processors of the C90, the benchmark using our algorithm achieves 3.5 Gigaflops¹. (¹ However, our implementation uses assembly language, which is not permitted for official NAS results.) ...

293 |
Sparse matrix test problems
- Duff, Grimes, et al.
- 1989
Citation Context: ...In this section we present performance results for sparse matrix multiplication. The timings were run on a Cray Y-MP C90 in dedicated mode using sample problems from the Harwell-Boeing test set [17] and industrial application codes. We also present a performance model for the SEGMV implementation that accurately predicts its running time. 4.1 Test Suite Table 1 describes the test problems we use...

278 | Parallel prefix computation
- Ladner, Fischer
Citation Context: ...s_i = s_{i−1} ⊕ a_i, 1 < i ≤ n. Although this calculation appears to be serial because of the loop-carried dependence, if ⊕ is associative, it can be calculated efficiently in parallel in log₂ n time [24, 25]. This method also leads to an efficient algorithm for vector machines [8, 11]. The usefulness of scans in array computations has long been realized and the scan operations play a crucial role in the ...
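The log₂ n parallel scan this excerpt refers to can be sketched with recursive doubling in the Ladner-Fischer/Kogge-Stone style: each of the ⌈log₂ n⌉ passes is a single shifted vector addition. This is an illustrative sketch of the technique, not code from the cited works:

```python
import numpy as np

def scan_doubling(a):
    """Inclusive +-scan via recursive doubling: ceil(log2 n) vector steps.

    After the pass with offset d, element i holds the sum of a[max(0, i-2d+1)..i],
    so after all passes each element holds its full prefix sum.
    """
    s = np.array(a, dtype=float)
    shift = 1
    while shift < len(s):
        # One fully vectorizable step: add the copy shifted right by `shift`.
        s[shift:] = s[shift:] + s[:-shift]
        shift *= 2
    return s

prefix = scan_doubling([5, 1, 3, 4])
# prefix is [5, 6, 9, 13]
```

The serial loop with the carried dependence does n − 1 additions; the doubling form does more total work (about n log₂ n) but finishes in log₂ n vector steps, which is the trade-off the excerpt describes.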

271 | SPARSKIT: a Basic Tool Kit for Sparse Matrix Computation, http://www.cs.umn.edu/Research/arpa/SPARSKIT/paper.ps
- SAAD
- 1994
Citation Context: ...cture for this type of structured sparse matrix would store only these non-zero diagonals along with offset values to indicate where each diagonal belongs in the matrix (see the DIA data structure in [33, 20]). Feature Extraction: Another approach to optimizing sparse matrix operations is to decompose the matrix into additive submatrices and store each submatrix in a separate data structure. For example, ...

267 |
Vector Models for Data-Parallel Computing
- Blelloch
- 1990
Citation Context: ...on the efficient implementation of segmented sums. A segmented sum views a vector as partitioned into contiguous segments, each potentially of different sizes, and sums the values within each segment [6]. Unlike the algorithms mentioned above, SEGMV is insensitive to relative row size and is therefore well suited for matrices that are very irregular. Such matrices are becoming more common with the in...

175 |
A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations
- KOGGE, STONE
- 1973
Citation Context: ...s_i = s_{i−1} ⊕ a_i, 1 < i ≤ n. Although this calculation appears to be serial because of the loop-carried dependence, if ⊕ is associative, it can be calculated efficiently in parallel in log₂ n time [24, 25]. This method also leads to an efficient algorithm for vector machines [8, 11]. The usefulness of scans in array computations has long been realized and the scan operations play a crucial role in the ...

162 | Scans as Primitive Parallel Operations
- BLELLOCH
- 1989
Citation Context: ...maximum, minimum, logical-or, and logical-and. The segmented scan operations take an array of values, and in addition take a second argument that specifies how this array is partitioned into segments [5]. A scan is executed independently within each segment. For example:

    VAL  = ( 5 1 3 4 3 9 2 6 )
    FLAG = ( T F T F F F T F )
    ADD_SCAN(VAL, FLAG) = ( 5 6 3 7 10 19 2 8 )

In this example, the FLAG array has...

86 | NAS Parallel Benchmark Results
- Bailey, Barszcz, et al.
- 1994
Citation Context: ...least affected by the structure of the matrices. In addition to measuring times for matrix multiplication on the test matrices, we have used SEGMV as the core for the NAS Conjugate Gradient benchmark [3, 4]. On 16 processors of the C90, the benchmark using our algorithm achieves 3.5 Gigaflops¹. (¹ However, our implementation uses assembly language, which is not permitted for official NAS results.) ...

39 | Scan primitives for vector computers
- Chatterjee, Blelloch, et al.
- 1990
Citation Context: ...of the loop-carried dependence, if ⊕ is associative, it can be calculated efficiently in parallel in log₂ n time [24, 25]. This method also leads to an efficient algorithm for vector machines [8, 11]. The usefulness of scans in array computations has long been realized and the scan operations play a crucial role in the APL programming language. Associative operators that are often used include ad...

38 |
CVL: A C vector library
- Blelloch, Chatterjee, et al.
- 1993
Citation Context: ...another sparse matrix. A segmented copy operation can be used to distribute a different value to the elements of each row. These operations have been efficiently implemented for a variety of machines [6, 7, 10]. 5.4 CSC SEGMV and Symmetric Matrices Segmented vector operations can also be used to implement a column-oriented version of sparse matrix multiplication. This could be used along with a row-oriented...

29 |
Direct Methods for Sparse Matrices, Monographs on Numerical Analysis
- Duff, Erisman, et al.
- 1989
Citation Context: ...data structure is needed in order to handle a variety of sparse matrix patterns. There are many general-purpose sparse data structures, but we only discuss a few of them here. For more examples, see [1, 16, 18, 19, 20, 22, 27, 30, 31, 32, 33, 34]. Compressed Sparse Row: One of the most commonly used data structures is the compressed sparse row (CSR) format. The CSR format stores the entries of the matrix row-by-row in a scalar-valued array VA...

27 |
Solving sparse triangular systems on parallel computers
- Anderson, Saad
- 1989
Citation Context: ...arately [32]. A further enhancement, called the Jagged Diagonal (JAD) method, sorts the rows based on row size, and decreases the number of rows processed as the algorithm proceeds across the columns [2]. The problem with all these methods is that they can require significant preprocessing time and they usually require that the matrix be reordered and copied, therefore requiring extra memory. Also, e...

21 |
A high performance algorithm using preprocessing for sparse matrix-vector multiplication
- Agarwal, Gustavson, et al.
- 1992
Citation Context: ...pose the matrix into additive submatrices and store each submatrix in a separate data structure. For example, the Feature Extraction Based Algorithm (FEBA) presented by Agarwal, Gustavson, and Zubair [1] recursively extracts structural features from a given sparse matrix structure and uses the additive property of matrix multiplication to compute the matrix-vector product as a sequence of operations...

21 | Network learning on the Connection Machine
- Blelloch, Rosenberg
- 1987
Citation Context: ...the matrix is only used a few times (as is often the case with adaptive meshes). Segmented operations have been used for sparse matrix-vector multiplication with good success on the Connection Machine [9, 36], but the application to vector multiprocessors is new. We have implemented the SEGMV algorithm on the Cray Y-MP C90 and have compared its running time to various other algorithms on several sparse ma...

17 |
Compiling Data-Parallel Programs for Efficient Execution on Shared-Memory Multiprocessors
- Chatterjee
- 1991
Citation Context: ...another sparse matrix. A segmented copy operation can be used to distribute a different value to the elements of each row. These operations have been efficiently implemented for a variety of machines [6, 7, 10]. 5.4 CSC SEGMV and Symmetric Matrices Segmented vector operations can also be used to implement a column-oriented version of sparse matrix multiplication. This could be used along with a row-oriented...

15 |
Cray Y-MP C90: System features and early benchmark results, Parallel Comput 18
- Oed
- 1992
Citation Context: ...e segmented computation and describe the SEGMV algorithm. Section 4 contains results comparing SEGMV to other techniques, using a collection of real-life test problems run on a Cray Y-MP C90 computer [29]. Section 5 discusses generalizations of the basic sparse matrix-vector multiplication algorithm. Finally, in Section 6, we present our conclusions. 2 Previous Work This section discusses previous tec...

14 | A proposal for a sparse BLAS toolkit
- Heroux
- 1992
Citation Context: ...cture for this type of structured sparse matrix would store only these non-zero diagonals along with offset values to indicate where each diagonal belongs in the matrix (see the DIA data structure in [33, 20]). Feature Extraction: Another approach to optimizing sparse matrix operations is to decompose the matrix into additive submatrices and store each submatrix in a separate data structure. For example, ...

9 |
Solution of Linear Systems with Striped Sparse Matrices
- Melhem
- 1986
Citation Context: ...data structure is needed in order to handle a variety of sparse matrix patterns. There are many general-purpose sparse data structures, but we only discuss a few of them here. For more examples, see [1, 16, 18, 19, 20, 22, 27, 30, 31, 32, 33, 34]. Compressed Sparse Row: One of the most commonly used data structures is the compressed sparse row (CSR) format. The CSR format stores the entries of the matrix row-by-row in a scalar-valued array VA...

8 |
Efficient Parallel Processing of Image Contours
- Chen, Davis, et al.
- 1993
Citation Context: ...ithms for problems with irregular structures, including sparse matrix routines [5, 6]. Other uses of segmented scans include computer graphics [15], object recognition [37], processing image contours [12], parallel quicksort [5], machine learning [9], and network optimization [28]. Because of their usefulness for such problems, hardware support was included in the Connection Machine CM-5 [26] for segm...

8 |
3D image synthesis on the Connection Machine
- Crow, Demos, et al.
- 1989
Citation Context: ...ed scans can be used to implement many data-parallel algorithms for problems with irregular structures, including sparse matrix routines [5, 6]. Other uses of segmented scans include computer graphics [15], object recognition [37], processing image contours [12], parallel quicksort [5], machine learning [9], and network optimization [28]. Because of their usefulness for such problems, hardware support ...

8 |
Object recognition using the Connection Machine
- Tucker, Feynman, et al.
- 1988
Citation Context: ...mplement many data-parallel algorithms for problems with irregular structures, including sparse matrix routines [5, 6]. Other uses of segmented scans include computer graphics [15], object recognition [37], processing image contours [12], parallel quicksort [5], machine learning [9], and network optimization [28]. Because of their usefulness for such problems, hardware support was included in the Conne...

7 | Solving Linear Recurrences with Loop Raking
- Blelloch, Chatterjee, et al.
- 1992
Citation Context: ...of the loop-carried dependence, if ⊕ is associative, it can be calculated efficiently in parallel in log₂ n time [24, 25]. This method also leads to an efficient algorithm for vector machines [8, 11]. The usefulness of scans in array computations has long been realized and the scan operations play a crucial role in the APL programming language. Associative operators that are often used include ad...

7 |
Sparse matrix multiplication on vector computers
- Erhel
- 1990
Citation Context: ...data structure is needed in order to handle a variety of sparse matrix patterns. There are many general-purpose sparse data structures, but we only discuss a few of them here. For more examples, see [1, 16, 18, 19, 20, 22, 27, 30, 31, 32, 33, 34]. Compressed Sparse Row: One of the most commonly used data structures is the compressed sparse row (CSR) format. The CSR format stores the entries of the matrix row-by-row in a scalar-valued array VA...

6 |
NSPCG User’s Guide
- Oppe, Joubert, et al.

5 |
A new storage scheme for an efficient implementation of the sparse matrix-vector product
- Fernandes, Girdinio
- 1989

4 | Recent vectorization and parallelization of ITPACKV
- Kincaid, Oppe
- 1990

4 |
Sparse matrix vector multiplication techniques on
- Peters
- 1991
Citation Context: ...s the columns. This works well when the longest row is not much longer than the average row. A variation of this method collects the rows into groups based on sizes and processes each group separately [32]. A further enhancement, called the Jagged Diagonal (JAD) method, sorts the rows based on row size, and decreases the number of rows processed as the algorithm proceeds across the columns [2]. The pro...

4 |
Implementing the multiprefix operation on parallel and vector computers
- Sheffler
- 1993

2 |
ITPACKV 2C user's guide
- Kincaid, Oppe, et al.
- 1984
Citation Context: ...matrix-vector multiplication as a kernel, researchers have developed several methods for improving its performance on parallel and vector machines. An early technique, which is used by Ellpack/Itpack [23], pads all rows so they have an equal number of elements and then vectorizes across the columns. This works well when the longest row is not much longer than the average row. A variation of this metho...

2 | Three-dimensional finite-element analyses: implications for computer architectures
- Taylor, Ranade, et al.
- 1991
Citation Context: ...the matrix. Thus, it is difficult to construct and modify, and makes expressing a standard forward/back solve or other operations difficult. 2.3 Hardware Techniques Taylor, Ranade, and Messerschmitt [35] have proposed a modification to the compressed sparse column (CSC) data structure that stores the non-zero elements of each column with an additional zero value placed between each set of column valu...

1 |
Data structures for network algorithms on massively parallel architectures
- Nielsen, Zenios
- 1992
Citation Context: ...s [5, 6]. Other uses of segmented scans include computer graphics [15], object recognition [37], processing image contours [12], parallel quicksort [5], machine learning [9], and network optimization [28]. Because of their usefulness for such problems, hardware support was included in the Connection Machine CM-5 [26] for segmented scans, and the proposed High Performance Fortran (HPF) standard [21] contains scan intrinsics (called PREFIX and SUFFIX) with an optional argument for specifying segments. 3.2 Using Segmented Scans for Sparse Matrix Multiplication A sparse matrix can be represented as...