Results 11–20 of 26
Optimizing Sparse Matrix-Multiple Vectors Multiplication for Nuclear Configuration Interaction Calculations
Abstract

Cited by 2 (0 self)
Abstract—Obtaining highly accurate predictions of the properties of light atomic nuclei using the configuration interaction (CI) approach requires computing a few extremal eigenpairs of the many-body nuclear Hamiltonian matrix. In the Many-body Fermion Dynamics for nuclei (MFDn) code, a block eigensolver is used for this purpose. Due to the large size of the sparse matrices involved, a significant fraction of the time spent on the eigenvalue computations is associated with the multiplication of a sparse matrix (and the transpose of that matrix) with multiple vectors (SpMM and SpMM^T). Existing implementations of SpMM and SpMM^T significantly underperform expectations. Thus, in this paper, we present and analyze optimized implementations of SpMM and SpMM^T. We base our implementation on the compressed sparse blocks (CSB) matrix format and target systems with multicore architectures. We develop a performance model that allows us to understand and estimate the performance characteristics of our SpMM kernel implementations, and demonstrate the efficiency of our implementation on a series of real-world matrices extracted from MFDn. In particular, we obtain a 3–4× speedup on the requisite operations over good implementations based on the commonly used compressed sparse row (CSR) matrix format. The improvements in the SpMM kernel suggest we may attain roughly a 40% speedup in the overall execution time of the block eigensolver used in MFDn.
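The SpMM operation described above admits a compact sketch. The following toy Python routine (illustrative only; the paper's optimized kernels use the CSB format, not this naive CSR loop, and the function name is my own) shows why SpMM reuses each fetched matrix entry across all vectors:

```python
# Minimal sketch of SpMM (sparse matrix times multiple dense vectors)
# in the CSR format, on a toy 3x3 matrix. Not the paper's CSB kernel.

def spmm_csr(row_ptr, col_idx, vals, X):
    """Multiply a CSR matrix by a dense matrix X (list of rows)."""
    n_rows = len(row_ptr) - 1
    n_vecs = len(X[0])
    Y = [[0.0] * n_vecs for _ in range(n_rows)]
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a, j = vals[k], col_idx[k]
            # Apply the nonzero a to every vector at once: this data reuse
            # is what SpMM exploits over n_vecs separate SpMV calls.
            for v in range(n_vecs):
                Y[i][v] += a * X[j][v]
    return Y

# Matrix [[2, 0, 1], [0, 3, 0], [4, 0, 5]] in CSR form:
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals    = [2.0, 1.0, 3.0, 4.0, 5.0]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # two vectors, stored row-major
Y = spmm_csr(row_ptr, col_idx, vals, X)
```

Each nonzero is streamed from memory once and applied to all columns of X, whereas repeated SpMV calls would re-stream the whole matrix once per vector.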
Fast Matrix-vector Multiplications for Large-scale Logistic Regression on Shared-memory Systems
Abstract

Cited by 1 (1 self)
Abstract—Shared-memory systems such as regular desktops now possess enough memory to store large data sets. However, the training process for data classification can still be slow if we do not fully utilize the power of multicore CPUs. Many existing works propose parallel machine learning algorithms by modifying serial ones, but the convergence analysis may be complicated. Instead, we do not modify machine learning algorithms; we consider those that can take advantage of parallel matrix operations. We particularly investigate the use of parallel sparse matrix-vector multiplications in a Newton method for large-scale logistic regression. Various implementations, from simple to sophisticated, are analyzed and compared. Results indicate that under suitable settings excellent speedup can be achieved. Keywords—sparse matrix; parallel matrix-vector multiplication; classification; Newton method
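The Newton step in regularized logistic regression needs Hessian-vector products of the form (λI + XᵀDX)v, which reduce to one SpMV with X, a diagonal scaling, and one SpMV with Xᵀ. A minimal Python sketch under that standard formulation (function names are my own, not from the paper):

```python
def spmv(row_ptr, col_idx, vals, x):
    """y = A x for a CSR matrix A."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

def spmv_t(row_ptr, col_idx, vals, n_cols, x):
    """y = A^T x, scattering each row of A into y."""
    y = [0.0] * n_cols
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[col_idx[k]] += vals[k] * x[i]
    return y

def hessian_vec(row_ptr, col_idx, vals, n_feat, d, v, lam=1.0):
    """(lam*I + X^T D X) v: the product a truncated-Newton solver
    applies repeatedly inside its inner CG iteration."""
    xv = spmv(row_ptr, col_idx, vals, v)           # X v
    dxv = [d[i] * xv[i] for i in range(len(xv))]   # D (X v)
    xtdxv = spmv_t(row_ptr, col_idx, vals, n_feat, dxv)
    return [lam * v[j] + xtdxv[j] for j in range(n_feat)]
```

Since both multiplies dominate the cost, parallelizing them (the paper's subject) speeds up the whole Newton method without touching its convergence analysis.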
Efficient Multithreaded Untransposed, Transposed or Symmetric Sparse Matrix-Vector Multiplication with the Recursive Sparse Blocks Format
, 2014
Abstract

Cited by 1 (0 self)
In earlier work we introduced the “Recursive Sparse Blocks” (RSB) sparse matrix storage scheme, oriented towards cache-efficient matrix-vector multiplication (SpMV) and triangular solution (SpSV) on cache-based shared-memory parallel computers. Both the transposed (SpMV^T) and symmetric (SymSpMV) matrix-vector multiply variants are supported. RSB is a meta-format: it recursively partitions a rectangular sparse matrix into quadrants; leaf submatrices are stored in an appropriate traditional format, either Compressed Sparse Rows (CSR) or Coordinate (COO). In this work, we compare the performance of our RSB implementation of SpMV, SpMV^T, and SymSpMV to that of the state-of-the-art Intel Math Kernel Library (MKL) CSR implementation on the recent Intel Sandy Bridge processor. Our results with a few dozen large real-world matrices suggest the efficiency of the approach: in all of the cases, RSB's SymSpMV (and in most cases, SpMV^T as well) took less than half of MKL CSR's time; SpMV's advantage was smaller. Furthermore, RSB's SpMV^T is more scalable than MKL's CSR, in that it performs almost as well as SpMV. Additionally, we include comparisons to the state-of-the-art Compressed Sparse Blocks (CSB) implementation. We observed RSB to be slightly superior to CSB in SpMV^T, slightly inferior in SpMV, and better (in most cases by a factor of two or more) in SymSpMV. Although RSB is a non-traditional storage format and thus needs a special constructor, it can be assembled from CSR or any other similar row-ordered representation in the time of a few dozen matrix-vector multiply executions. Thanks to its significant advantage over MKL's CSR routines for symmetric or transposed matrix-vector multiplication, in most of the observed cases the assembly cost amortizes within fewer than fifty iterations.
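The quadrant recursion behind RSB can be sketched in a few lines of Python (a toy COO-only version with hypothetical names; real RSB chooses CSR or COO per leaf and sizes leaves to the cache, details this sketch omits):

```python
def rsb_partition(entries, r0, r1, c0, c1, leaf_nnz=2):
    """Recursively split COO triples (row, col, val) living in the index
    box [r0, r1) x [c0, c1) into quadrants until a leaf holds at most
    leaf_nnz entries. Returns (row_range, col_range, entries) leaves."""
    if len(entries) <= leaf_nnz or r1 - r0 <= 1 or c1 - c0 <= 1:
        return [((r0, r1), (c0, c1), sorted(entries))]   # leaf: keep COO
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2              # quadrant split
    leaves = []
    for (rs, re), (cs, ce) in (((r0, rm), (c0, cm)), ((r0, rm), (cm, c1)),
                               ((rm, r1), (c0, cm)), ((rm, r1), (cm, c1))):
        sub = [e for e in entries if rs <= e[0] < re and cs <= e[1] < ce]
        if sub:   # empty quadrants are simply not stored
            leaves += rsb_partition(sub, rs, re, cs, ce, leaf_nnz)
    return leaves
```

Because each leaf covers a bounded index box, leaves can be multiplied in parallel, and transposed multiplies traverse the same leaves with row/column roles swapped.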
An Extended Compression Format for the Optimization of Sparse Matrix-Vector Multiplication
Abstract

Cited by 1 (1 self)
Sparse matrix-vector multiplication (SpM×V) has been characterized as one of the most significant computational scientific kernels. The key algorithmic characteristic of the SpM×V kernel that inhibits it from achieving high performance is its very low flop:byte ratio. In this paper, we present an extended and integrated compressed storage format, called Compressed Sparse eXtended (CSX), that is able to detect and encode simultaneously multiple commonly encountered substructures inside a sparse matrix. Relying on aggressive compression of the sparse matrix's indexing structure, CSX is able to considerably reduce the memory footprint of a sparse matrix, thereby alleviating the pressure on the memory subsystem. On a diverse set of sparse matrices, CSX provided a more than 40% average performance improvement over the standard CSR format on symmetric shared-memory architectures and surpassed a 20% improvement on NUMA architectures, significantly outperforming other CSR alternatives. Additionally, it was able to adapt successfully to the nonzero element structure of the considered matrices, exhibiting very stable performance. Finally, in the context of 'real-life' multiphysics simulation software, CSX was able to accelerate the SpM×V component by nearly 40% and the total solver time by approximately 15% after 1000 linear system iterations.
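A small Python sketch of the kind of index compression CSX builds on: delta-encoding each row's column indices so most entries become small integers that fit narrow encodings (illustrative only; CSX also detects substructures such as dense runs, which this sketch does not attempt):

```python
def delta_encode(row_ptr, col_idx):
    """Per row, store the first column index followed by deltas to the
    successive ones. Since CSR column indices are sorted within a row,
    deltas are non-negative and usually small, so they can be packed in
    1 byte instead of 4 -- shrinking the indexing structure and hence
    the memory traffic that limits SpMxV's flop:byte ratio."""
    enc = []
    for i in range(len(row_ptr) - 1):
        cols = col_idx[row_ptr[i]:row_ptr[i + 1]]
        if cols:
            enc.append([cols[0]] +
                       [cols[k] - cols[k - 1] for k in range(1, len(cols))])
        else:
            enc.append([])   # empty row: nothing to encode
    return enc
```

A run of consecutive columns encodes as a string of 1s, which is exactly the kind of pattern a substructure detector can then collapse further.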
An Improved Sparse Matrix-Vector Multiply Based on Recursive Sparse Blocks Layout
Abstract

Cited by 1 (0 self)
Abstract. Recursive Sparse Blocks (RSB) is a sparse matrix layout designed for coarse-grained parallelism and reduced cache misses when operating with matrices larger than a computer's cache. By laying out the matrix in sparse, non-overlapping blocks, we allow for the shared-memory parallel execution of transposed Sparse Matrix-Vector multiply (SpMV), with higher efficiency than the traditional Compressed Sparse Rows (CSR) format. In this note we cover two issues. First, we propose two improvements to our original approach. Second, we look at the performance of standard and transposed shared-memory parallel SpMV for unsymmetric matrices using the proposed approach. We find that our implementation's performance is competitive with that of both the highly optimized, proprietary Intel MKL Sparse BLAS library's CSR routines and the Compressed Sparse Blocks (CSB) research prototype.
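Why block layouts make the transposed multiply cheap can be seen in a toy Python kernel over a list of sparse blocks: transposition merely swaps the roles of each entry's row and column index, so the same stored blocks serve both products (an illustrative sketch with hypothetical names, not the RSB implementation):

```python
def spmv_blocks(leaves, x, n_out, transpose=False):
    """Multiply using a list of blocks ((r0, r1), (c0, c1), coo_entries).
    For A*x each entry scatters by row; for A^T*x it scatters by column.
    In a parallel setting, whole blocks (rather than individual rows)
    are the unit of work, which is what keeps the transposed multiply
    as cache-friendly as the untransposed one."""
    y = [0.0] * n_out
    for (_, _), (_, _), entries in leaves:
        for r, c, v in entries:
            if transpose:
                y[c] += v * x[r]
            else:
                y[r] += v * x[c]
    return y

# One 2x2 block holding the matrix [[1, 2], [0, 3]]:
leaves = [((0, 2), (0, 2), [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)])]
```

By contrast, a row-oriented format like CSR makes Aᵀx a scatter across the whole result vector with no blocking to localize it.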
Optimization by Runtime Specialization for Sparse Matrix-Vector Multiplication
, 2014
Abstract
Runtime specialization optimizes programs based on partial information available only at run time. It is applicable when some input data is used repeatedly while other input data varies. This technique has the potential to generate highly efficient code. In this paper, we explore the potential for obtaining speedups for sparse matrix-dense vector multiplication using runtime specialization, in the case where a single matrix is to be multiplied by many vectors. We experiment with five methods involving runtime specialization, comparing them to methods that do not (including Intel's MKL library). For this work, our focus is the evaluation of the speedups that can be obtained with runtime specialization without considering the overheads of code generation. Our experiments use 23 matrices from the Matrix Market and Florida collections, and run on five different machines. In 94 of those 115 cases, the specialized code runs faster than any version without specialization. Using only the specialized methods, the average speedup with respect to Intel's MKL library ranges from 1.44x to 1.77x, depending on the machine. We have also found that the best method depends on the matrix and machine; no method is best for all matrices and machines.
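A toy Python analogue of the technique: since the matrix is fixed across the many multiplies, one can generate, at run time, a function with the sparsity pattern and values baked into straight-line code (illustrative only; the paper's methods generate native code and are far more sophisticated than this sketch):

```python
def specialize_spmv(row_ptr, col_idx, vals):
    """Generate a specialized SpMV function for one fixed CSR matrix.
    Indices and values become literals, so the generated code has no
    index loads and no inner loop -- the payoff runtime specialization
    aims for when the same matrix meets many vectors."""
    lines = ["def spmv_fixed(x):", "    return ["]
    for i in range(len(row_ptr) - 1):
        terms = [f"{vals[k]!r}*x[{col_idx[k]}]"
                 for k in range(row_ptr[i], row_ptr[i + 1])]
        lines.append("        " + (" + ".join(terms) or "0.0") + ",")
    lines.append("    ]")
    ns = {}
    exec("\n".join(lines), ns)   # compile the specialized kernel at run time
    return ns["spmv_fixed"]
```

The generation cost is paid once; it amortizes only if the matrix is reused often enough, which is why the paper treats codegen overhead as a separate question.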
Algebraic Domain Decomposition Methods for Highly Heterogeneous Problems
, 2013
Abstract
We consider solving linear systems arising from porous media flow simulations with high heterogeneities. Using a Newton algorithm to handle the nonlinearity leads to solving a sequence of linear systems with different but similar matrices and right-hand sides. The parallel solver is a Schwarz domain decomposition method. The unknowns are partitioned with a criterion based on the entries of the input matrix. This leads to substantial gains compared to a partition based only on the adjacency graph of the matrix. From the information generated during the solving of the first linear system, it is possible to build a coarse space for a two-level domain decomposition algorithm that accelerates the convergence of the subsequent linear systems. We compare two coarse spaces: a classical approach and a new one adapted to parallel
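The entry-based partitioning criterion can be illustrated by weighting the adjacency graph with the matrix coefficients, so that a graph partitioner preferentially cuts weak couplings. A hedged Python sketch (my own simplification of the idea, not the authors' algorithm):

```python
def coupling_graph(entries):
    """Build a symmetric weighted graph from matrix entries (i, j, a_ij),
    giving edge {i, j} the weight |a_ij| + |a_ji|. Partitioning by cutting
    low-weight edges keeps strongly coupled unknowns (large heterogeneous
    coefficients) in the same subdomain, unlike a partition that uses the
    adjacency pattern alone and ignores coefficient magnitudes."""
    w = {}
    for i, j, v in entries:
        if i != j:   # diagonal entries carry no coupling information
            e = (min(i, j), max(i, j))
            w[e] = w.get(e, 0.0) + abs(v)
    return w
```

With high-contrast coefficients, two patterns with identical adjacency graphs can demand very different cuts, which is exactly the gain the abstract reports over pattern-only partitioning.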
Shell: A Spatial Decomposition Data Structure for 3D Curve Traversal on Manycore Architectures
Abstract
Shared-memory manycore processors such as GPUs have been extensively used to accelerate computation-intensive algorithms and applications. When porting existing algorithms from sequential or other parallel architecture models to shared-memory manycore architectures, non-trivial modifications are often needed in order to match the execution patterns of the target algorithms with the characteristics of manycore architectures. 3D curve traversal is a fundamental process in many applications, and is commonly accelerated by spatial decomposition schemes captured in hierarchical data structures (e.g., kd-trees). However, curve traversal using hierarchical data structures needs to conduct repeated hierarchical searches. Such a search process is time-consuming on shared-memory manycore architectures since it incurs a considerable number of expensive memory accesses and much execution divergence. In this paper, we propose a novel spatial-decomposition-based data structure, called Shell, which completely avoids hierarchical search for 3D curve traversal. In Shell, a structure is built on the boundary of each region in the decomposed space, which allows any curve traversing a region to find the next neighboring region using table lookup schemes, without any hierarchical search. While our 3D curve traversal approach works for other spatial decomposition paradigms and manycore processors, we illustrate it using kd-tree decomposition on a GPU and compare with the fastest known kd-tree searching algorithms for ray traversal. Analysis and experimental results show that our approach improves ray traversal performance considerably over the kd-tree searching approaches.
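The idea of finding the next region by local computation rather than hierarchical search can be illustrated on a uniform grid, where the neighboring cell follows from comparing boundary-crossing distances (an Amanatides-Woo style stepping sketch in Python; Shell handles irregular kd-tree regions via per-boundary tables, which is considerably more involved):

```python
def traverse_grid(origin, direction, n=4):
    """Step a ray through an n^3 grid of unit cells, picking the next
    neighboring cell by comparing the ray's boundary-crossing distances.
    Each step is a local comparison -- no hierarchical search, so there
    is no divergent tree walk of the kind the abstract criticizes.
    Assumes all direction components are nonzero; illustrative only."""
    cell = [int(c) for c in origin]
    visited = [tuple(cell)]
    step = [1 if d > 0 else -1 for d in direction]
    # distance along the ray to the next cell boundary on each axis
    t_max = [((cell[a] + (step[a] > 0)) - origin[a]) / direction[a]
             for a in range(3)]
    t_delta = [abs(1.0 / direction[a]) for a in range(3)]
    while True:
        a = t_max.index(min(t_max))   # axis whose boundary is crossed first
        cell[a] += step[a]
        t_max[a] += t_delta[a]
        if not all(0 <= c < n for c in cell):
            break                     # ray left the grid
        visited.append(tuple(cell))
    return visited
```

On a manycore processor, threads running this kind of loop execute near-identical instruction streams, avoiding the memory stalls and divergence of repeated tree descents.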
Fluid Pinchoff
Abstract
This 4608² image of a combustion simulation result was rendered by a hybrid-parallel (MPI+pthreads) ray-casting volume rendering implementation running on 216,000 cores of the JaguarPF supercomputer. Combustion simulation data courtesy of J. Bell and M. Day