Results 1 - 4 of 4
Optimizing Sparse Matrix-Multiple Vectors Multiplication for Nuclear Configuration Interaction Calculations
"... Abstract—Obtaining highly accurate predictions on the properties of light atomic nuclei using the configuration interaction (CI) approach requires computing a few extremal eigenpairs of the manybody nuclear Hamiltonian matrix. In the Manybody Fermion Dynamics for nuclei (MFDn) code, a block eigen ..."
Abstract

Cited by 2 (0 self)
Abstract—Obtaining highly accurate predictions on the properties of light atomic nuclei using the configuration interaction (CI) approach requires computing a few extremal eigenpairs of the many-body nuclear Hamiltonian matrix. In the Many-body Fermion Dynamics for nuclei (MFDn) code, a block eigensolver is used for this purpose. Due to the large size of the sparse matrices involved, a significant fraction of the time spent on the eigenvalue computations is associated with the multiplication of a sparse matrix (and the transpose of that matrix) with multiple vectors (SpMM and SpMM^T). Existing implementations of SpMM and SpMM^T significantly underperform expectations. Thus, in this paper, we present and analyze optimized implementations of SpMM and SpMM^T. We base our implementation on the compressed sparse blocks (CSB) matrix format and target systems with multicore architectures. We develop a performance model that allows us to understand and estimate the performance characteristics of our SpMM kernel implementations, and demonstrate the efficiency of our implementation on a series of real-world matrices extracted from MFDn. In particular, we obtain 3-4× speedup on the requisite operations over good implementations based on the commonly used compressed sparse row (CSR) matrix format. The improvements in the SpMM kernel suggest we may attain roughly a 40% speedup in the overall execution time of the block eigensolver used in MFDn.
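For context, below is a minimal C sketch of the CSR-based SpMM baseline that the paper compares against; the row-major layout of the dense vector blocks and all names are illustrative assumptions, not the authors' code.

#include <stddef.h>

/* Sketch: Y += A * X for an m-by-k sparse matrix A in CSR form
 * (rowptr, colidx, vals) and dense X, Y with n columns each, stored
 * row-major. Each nonzero A(i, colidx[p]) updates a whole row of Y,
 * which is what distinguishes SpMM from n independent SpMV calls. */
void spmm_csr(size_t m, size_t n,
              const size_t *rowptr, const size_t *colidx,
              const double *vals, const double *X, double *Y)
{
    for (size_t i = 0; i < m; i++) {
        double *yrow = Y + i * n;                    /* row i of Y */
        for (size_t p = rowptr[i]; p < rowptr[i + 1]; p++) {
            const double a = vals[p];
            const double *xrow = X + colidx[p] * n;  /* row colidx[p] of X */
            for (size_t j = 0; j < n; j++)
                yrow[j] += a * xrow[j];
        }
    }
}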
Fast Matrix-vector Multiplications for Large-scale Logistic Regression on Shared-memory Systems
"... Abstract—Sharedmemory systems such as regular desktops now possess enough memory to store large data. However, the training process for data classification can still be slow if we do not fully utilize the power of multicore CPUs. Many existing works proposed parallel machine learning algorithms by ..."
Abstract

Cited by 1 (1 self)
Abstract—Shared-memory systems such as regular desktops now possess enough memory to store large data sets. However, the training process for data classification can still be slow if we do not fully utilize the power of multicore CPUs. Many existing works propose parallel machine learning algorithms by modifying serial ones, but the convergence analysis may be complicated. Instead, we do not modify machine learning algorithms, but consider those that can take advantage of parallel matrix operations. We particularly investigate the use of parallel sparse matrix-vector multiplications in a Newton method for large-scale logistic regression. Various implementations, from simple to sophisticated, are analyzed and compared. Results indicate that under suitable settings excellent speedup can be achieved. Keywords: sparse matrix; parallel matrix-vector multiplication; classification; Newton method
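To illustrate where the sparse matrix-vector products enter such a Newton method, here is a hedged C sketch of the Hessian-vector product Hv = v + C * X^T (D (X v)) that a conjugate-gradient inner solver evaluates repeatedly; the CSR layout, the diagonal matrix D, and the regularization constant C follow the standard regularized logistic-regression formulation and are assumptions for the example, not the paper's code.

#include <stddef.h>

/* Sketch: Hv = v + C * X^T (D (X v)) for an l-by-n data matrix X in
 * CSR form. The two passes over the CSR rows are the SpMV (X v) and
 * the transposed SpMV (X^T u) that dominate the method's cost. */
void hessian_vec(size_t l, size_t n, double C,
                 const size_t *rowptr, const size_t *colidx,
                 const double *vals, const double *D,
                 const double *v, double *tmp /* length l */, double *Hv)
{
    for (size_t i = 0; i < l; i++) {          /* tmp = D * (X v) */
        double s = 0.0;
        for (size_t p = rowptr[i]; p < rowptr[i + 1]; p++)
            s += vals[p] * v[colidx[p]];
        tmp[i] = D[i] * s;
    }
    for (size_t j = 0; j < n; j++)            /* Hv = v ... */
        Hv[j] = v[j];
    for (size_t i = 0; i < l; i++)            /* ... + C * X^T tmp */
        for (size_t p = rowptr[i]; p < rowptr[i + 1]; p++)
            Hv[colidx[p]] += C * vals[p] * tmp[i];
}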
Supervisors
"... van Nieuwpoort for their professional supervision and sincere interest in making this a better work. I should never forget our long discussions over a cup of tea and ”bitterkoekjes ” after which I felt both academically enlightened and highspirited. Modern radio telescopes, such as the Low Frequenc ..."
Abstract
van Nieuwpoort for their professional supervision and sincere interest in making this a better work. I should never forget our long discussions over a cup of tea and "bitterkoekjes", after which I felt both academically enlightened and high-spirited. Modern radio telescopes, such as the Low Frequency Array (LOFAR) in the north of the Netherlands, process the signal from the sky in software rather than in expensive special-purpose hardware. This gives astronomers unprecedented flexibility to perform a vast variety of scientific experiments. However, designing software that gives optimal performance for many different experiments, possibly also running on different hardware, is a challenging task. Since optimizing the software by hand to fit the various experiments and hardware is infeasible, we employ a technique called parameter auto-tuning to find the optimal solution. Auto-tuning is based on constructing more generic software that has the ability to explore its parameter space and choose the values ...
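To make the auto-tuning idea concrete, the following C sketch times a tunable kernel over a small parameter space and keeps the fastest setting; the kernel and its parameter range are hypothetical placeholders, and real auto-tuners explore far larger spaces with search heuristics.

#include <stdio.h>
#include <time.h>

extern void kernel(int block_size);   /* assumed tunable kernel */

/* Sketch: exhaustively try power-of-two block sizes, measure the
 * wall-clock time of each run, and return the fastest configuration. */
int tune_block_size(void)
{
    int best = 0;
    double best_t = 1e300;
    for (int b = 16; b <= 1024; b *= 2) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        kernel(b);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double t = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        if (t < best_t) { best_t = t; best = b; }
    }
    printf("best block size: %d (%.6f s)\n", best, best_t);
    return best;
}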
Improving Memory Hierarchy Utilisation for Stencil Computations on Multicore Machines
2013
"... Although modern supercomputers are composed of multicore machines, one can find scientists that still execute their legacy applications which were developed to monocore cluster where memory hierarchy is dedicated to a sole core. The main objective of this paper is to propose and evaluate an algorith ..."
Abstract
Although modern supercomputers are composed of multicore machines, one can still find scientists executing legacy applications that were developed for mono-core clusters, where the memory hierarchy is dedicated to a single core. The main objective of this paper is to propose and evaluate an algorithm that identifies an efficient block size to be applied to MPI stencil computations on multicore machines. In the light of an extensive experimental analysis, this work shows the benefits of identifying block sizes that divide the data across the various cores, and suggests a methodology that exploits the memory hierarchy available in modern machines.
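As a concrete illustration of what the block size controls, here is a minimal C sketch of a cache-blocked 2-D 5-point stencil sweep; the fixed BS, the Jacobi-style update, and the grid layout are illustrative assumptions, since the paper's contribution is an algorithm for choosing the block size rather than this kernel itself.

#include <stddef.h>

#define BS 64   /* illustrative block size; the paper tunes this value */

/* Sketch: sweep the interior of an n-by-n grid in BS-by-BS tiles so
 * each tile's working set fits in a core's share of the cache. */
void stencil_blocked(size_t n, const double *in, double *out)
{
    for (size_t ii = 1; ii + 1 < n; ii += BS)
        for (size_t jj = 1; jj + 1 < n; jj += BS)
            for (size_t i = ii; i < ii + BS && i + 1 < n; i++)
                for (size_t j = jj; j < jj + BS && j + 1 < n; j++)
                    out[i * n + j] = 0.25 * (in[(i - 1) * n + j] +
                                             in[(i + 1) * n + j] +
                                             in[i * n + j - 1] +
                                             in[i * n + j + 1]);
}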