Results 1 - 4 of 4
Optimizing Sparse Matrix-Multiple Vectors Multiplication for Nuclear Configuration Interaction Calculations
"... Abstract—Obtaining highly accurate predictions on the properties of light atomic nuclei using the configuration interaction (CI) approach requires computing a few extremal eigenpairs of the manybody nuclear Hamiltonian matrix. In the Manybody Fermion Dynamics for nuclei (MFDn) code, a block eigen ..."
Abstract

Cited by 2 (0 self)
Abstract—Obtaining highly accurate predictions on the properties of light atomic nuclei using the configuration interaction (CI) approach requires computing a few extremal eigenpairs of the many-body nuclear Hamiltonian matrix. In the Many-body Fermion Dynamics for nuclei (MFDn) code, a block eigensolver is used for this purpose. Due to the large size of the sparse matrices involved, a significant fraction of the time spent on the eigenvalue computations is associated with the multiplication of a sparse matrix (and the transpose of that matrix) with multiple vectors (SpMM and SpMM^T). Existing implementations of SpMM and SpMM^T significantly underperform expectations. Thus, in this paper, we present and analyze optimized implementations of SpMM and SpMM^T. We base our implementation on the compressed sparse blocks (CSB) matrix format and target systems with multicore architectures. We develop a performance model that allows us to understand and estimate the performance characteristics of our SpMM kernel implementations, and demonstrate the efficiency of our implementation on a series of real-world matrices extracted from MFDn. In particular, we obtain 3-4× speedup on the requisite operations over good implementations based on the commonly used compressed sparse row (CSR) matrix format. The improvements in the SpMM kernel suggest we may attain roughly a 40% speedup in the overall execution time of the block eigensolver used in MFDn.
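For context, below is a minimal C sketch of the CSR-based SpMM baseline that the paper compares against; the row-major layout of the dense vector blocks and all names are illustrative assumptions, not the authors' code.

#include <stddef.h>

/* Sketch: Y += A * X for an m-by-k sparse matrix A in CSR form
 * (rowptr, colidx, vals) and dense X, Y with n columns each, stored
 * row-major. Each nonzero A(i, colidx[p]) updates a whole row of Y,
 * which is what distinguishes SpMM from n independent SpMV calls. */
void spmm_csr(size_t m, size_t n,
              const size_t *rowptr, const size_t *colidx,
              const double *vals, const double *X, double *Y)
{
    for (size_t i = 0; i < m; i++) {
        double *yrow = Y + i * n;                    /* row i of Y */
        for (size_t p = rowptr[i]; p < rowptr[i + 1]; p++) {
            const double a = vals[p];
            const double *xrow = X + colidx[p] * n;  /* row colidx[p] of X */
            for (size_t j = 0; j < n; j++)
                yrow[j] += a * xrow[j];
        }
    }
}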
Fast Matrix-vector Multiplications for Large-scale Logistic Regression on Shared-memory Systems
"... Abstract—Sharedmemory systems such as regular desktops now possess enough memory to store large data. However, the training process for data classification can still be slow if we do not fully utilize the power of multicore CPUs. Many existing works proposed parallel machine learning algorithms by ..."
Abstract

Cited by 1 (1 self)
Abstract—Shared-memory systems such as regular desktops now possess enough memory to store large data sets. However, the training process for data classification can still be slow if we do not fully utilize the power of multicore CPUs. Many existing works propose parallel machine learning algorithms by modifying serial ones, but the convergence analysis may be complicated. Instead, we do not modify machine learning algorithms, but consider those that can take advantage of parallel matrix operations. We particularly investigate the use of parallel sparse matrix-vector multiplications in a Newton method for large-scale logistic regression. Various implementations, from simple to sophisticated, are analyzed and compared. Results indicate that under suitable settings excellent speedup can be achieved. Keywords: sparse matrix; parallel matrix-vector multiplication; classification; Newton method
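To illustrate where the sparse matrix-vector products enter such a Newton method, here is a hedged C sketch of the Hessian-vector product Hv = v + C * X^T (D (X v)) that a conjugate-gradient inner solver evaluates repeatedly; the CSR layout, the diagonal matrix D, and the regularization constant C follow the standard regularized logistic-regression formulation and are assumptions for the example, not the paper's code.

#include <stddef.h>

/* Sketch: Hv = v + C * X^T (D (X v)) for an l-by-n data matrix X in
 * CSR form. The two passes over the CSR rows are the SpMV (X v) and
 * the transposed SpMV (X^T u) that dominate the method's cost. */
void hessian_vec(size_t l, size_t n, double C,
                 const size_t *rowptr, const size_t *colidx,
                 const double *vals, const double *D,
                 const double *v, double *tmp /* length l */, double *Hv)
{
    for (size_t i = 0; i < l; i++) {          /* tmp = D * (X v) */
        double s = 0.0;
        for (size_t p = rowptr[i]; p < rowptr[i + 1]; p++)
            s += vals[p] * v[colidx[p]];
        tmp[i] = D[i] * s;
    }
    for (size_t j = 0; j < n; j++)            /* Hv = v ... */
        Hv[j] = v[j];
    for (size_t i = 0; i < l; i++)            /* ... + C * X^T tmp */
        for (size_t p = rowptr[i]; p < rowptr[i + 1]; p++)
            Hv[colidx[p]] += C * vals[p] * tmp[i];
}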
Supervisors
"... van Nieuwpoort for their professional supervision and sincere interest in making this a better work. I should never forget our long discussions over a cup of tea and ”bitterkoekjes ” after which I felt both academically enlightened and highspirited. Modern radio telescopes, such as the Low Frequenc ..."
Abstract
van Nieuwpoort for their professional supervision and sincere interest in making this a better work. I should never forget our long discussions over a cup of tea and "bitterkoekjes", after which I felt both academically enlightened and high-spirited. Modern radio telescopes, such as the Low Frequency Array (LOFAR) in the north of the Netherlands, process the signal from the sky in software rather than in expensive special-purpose hardware. This gives astronomers unprecedented flexibility to perform a vast variety of scientific experiments. However, designing software that gives optimal performance for many different experiments, possibly also running on different hardware, is a challenging task. Since optimizing the software by hand to fit the various experiments and hardware is infeasible, we employ a technique called parameter auto-tuning to find the optimal solution. Auto-tuning is based on constructing more generic software that has the ability to explore its parameter space and choose the values ...
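To make the auto-tuning idea concrete, the following C sketch times a tunable kernel over a small parameter space and keeps the fastest setting; the kernel and its parameter range are hypothetical placeholders, and real auto-tuners explore far larger spaces with search heuristics.

#include <stdio.h>
#include <time.h>

extern void kernel(int block_size);   /* assumed tunable kernel */

/* Sketch: exhaustively try power-of-two block sizes, measure the
 * wall-clock time of each run, and return the fastest configuration. */
int tune_block_size(void)
{
    int best = 0;
    double best_t = 1e300;
    for (int b = 16; b <= 1024; b *= 2) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        kernel(b);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double t = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        if (t < best_t) { best_t = t; best = b; }
    }
    printf("best block size: %d (%.6f s)\n", best, best_t);
    return best;
}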
Improving Memory Hierarchy Utilisation for Stencil Computations on Multicore Machines
2013
"... Although modern supercomputers are composed of multicore machines, one can find scientists that still execute their legacy applications which were developed to monocore cluster where memory hierarchy is dedicated to a sole core. The main objective of this paper is to propose and evaluate an algorith ..."
Abstract
Although modern supercomputers are composed of multicore machines, one can still find scientists executing legacy applications that were developed for mono-core clusters, where the memory hierarchy is dedicated to a single core. The main objective of this paper is to propose and evaluate an algorithm that identifies an efficient block size to be applied to MPI stencil computations on multicore machines. In the light of an extensive experimental analysis, this work shows the benefits of identifying block sizes that divide the data across the various cores, and suggests a methodology that exploits the memory hierarchy available in modern machines.
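As a concrete illustration of what the block size controls, here is a minimal C sketch of a cache-blocked 2-D 5-point stencil sweep; the fixed BS, the Jacobi-style update, and the grid layout are illustrative assumptions, since the paper's contribution is an algorithm for choosing the block size rather than this kernel itself.

#include <stddef.h>

#define BS 64   /* illustrative block size; the paper tunes this value */

/* Sketch: sweep the interior of an n-by-n grid in BS-by-BS tiles so
 * each tile's working set fits in a core's share of the cache. */
void stencil_blocked(size_t n, const double *in, double *out)
{
    for (size_t ii = 1; ii + 1 < n; ii += BS)
        for (size_t jj = 1; jj + 1 < n; jj += BS)
            for (size_t i = ii; i < ii + BS && i + 1 < n; i++)
                for (size_t j = jj; j < jj + BS && j + 1 < n; j++)
                    out[i * n + j] = 0.25 * (in[(i - 1) * n + j] +
                                             in[(i + 1) * n + j] +
                                             in[i * n + j - 1] +
                                             in[i * n + j + 1]);
}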