Autotuning sparse matrix-vector multiplication for multicore (2012)

by J. Byun, R. Lin, K. Yelick, and J. Demmel
Results 1 - 4 of 4

Optimizing Sparse Matrix-Multiple Vectors Multiplication for Nuclear Configuration Interaction Calculations

by Hasan Metin Aktulga, Samuel Williams, Chao Yang
"... Abstract—Obtaining highly accurate predictions on the prop-erties of light atomic nuclei using the configuration interaction (CI) approach requires computing a few extremal eigenpairs of the many-body nuclear Hamiltonian matrix. In the Many-body Fermion Dynamics for nuclei (MFDn) code, a block eigen ..."
Abstract - Cited by 2 (0 self) - Add to MetaCart
Abstract—Obtaining highly accurate predictions on the properties of light atomic nuclei using the configuration interaction (CI) approach requires computing a few extremal eigenpairs of the many-body nuclear Hamiltonian matrix. In the Many-body Fermion Dynamics for nuclei (MFDn) code, a block eigensolver is used for this purpose. Due to the large size of the sparse matrices involved, a significant fraction of the time spent on the eigenvalue computations is associated with the multiplication of a sparse matrix (and the transpose of that matrix) with multiple vectors (SpMM and SpMM^T). Existing implementations of SpMM and SpMM^T significantly underperform expectations. Thus, in this paper, we present and analyze optimized implementations of SpMM and SpMM^T. We base our implementation on the compressed sparse blocks (CSB) matrix format and target systems with multi-core architectures. We develop a performance model that allows us to understand and estimate the performance characteristics of our SpMM kernel implementations, and demonstrate the efficiency of our implementation on a series of real-world matrices extracted from MFDn. In particular, we obtain 3-4x speedup on the requisite operations over good implementations based on the commonly used compressed sparse row (CSR) matrix format. The improvements in the SpMM kernel suggest we may attain roughly a 40% speedup in the overall execution time of the block eigensolver used in MFDn.
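
For reference, a minimal CSR-based SpMM kernel (the kind of baseline the CSB implementation is compared against) can be sketched as follows; the function and parameter names are illustrative, not taken from the MFDn code:

    /* Minimal CSR-based SpMM sketch: Y = A * X, where A is sparse (CSR) and X, Y
     * are dense blocks of nvec column vectors stored row-major. This is the kind
     * of CSR baseline the paper compares against; its optimized kernels use the
     * compressed sparse blocks (CSB) format instead. Names are illustrative. */
    #include <stddef.h>

    void spmm_csr(size_t nrows, size_t nvec,
                  const size_t *rowptr, const size_t *colidx, const double *vals,
                  const double *X,   /* (ncols of A) x nvec, row-major */
                  double *Y)         /* nrows x nvec, row-major */
    {
        #pragma omp parallel for schedule(dynamic, 64)
        for (size_t i = 0; i < nrows; i++) {
            for (size_t v = 0; v < nvec; v++)
                Y[i * nvec + v] = 0.0;
            for (size_t k = rowptr[i]; k < rowptr[i + 1]; k++) {
                double a = vals[k];
                const double *xrow = &X[colidx[k] * nvec];
                for (size_t v = 0; v < nvec; v++)   /* each nonzero is reused across all vectors */
                    Y[i * nvec + v] += a * xrow[v];
            }
        }
    }

Operating on a block of vectors amortizes each access to the matrix entries across nvec results, which is why SpMM can substantially outperform repeated SpMV calls.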

Citation Context

...e core operations supported by the autotuned sequential sparse matrix library, OSKI [16]. OSKI's parallel successor, pOSKI, does not currently support SpMM, although it is a work in progress [17]. Liu et al. [18] recently investigated strategies to improve the performance of SpMM using SIMD instructions such as AVX/SSE that are available in modern multicore machines. Their driving applicatio...

Fast Matrix-vector Multiplications for Large-scale Logistic Regression on Shared-memory Systems

by Mu-chu Lee, Chih-jen Lin
"... Abstract—Shared-memory systems such as regular desktops now possess enough memory to store large data. However, the training process for data classification can still be slow if we do not fully utilize the power of multi-core CPUs. Many existing works proposed parallel machine learning algorithms by ..."
Abstract - Cited by 1 (1 self) - Add to MetaCart
Abstract—Shared-memory systems such as regular desktops now possess enough memory to store large data. However, the training process for data classification can still be slow if we do not fully utilize the power of multi-core CPUs. Many existing works proposed parallel machine learning algorithms by modifying serial ones, but convergence analysis may be complicated. Instead, we do not modify machine learning algorithms, but consider those that can take advantage of parallel matrix operations. We particularly investigate the use of parallel sparse matrix-vector multiplications in a Newton method for large-scale logistic regression. Various implementations, from easy to sophisticated ones, are analyzed and compared. Results indicate that under suitable settings excellent speedup can be achieved.
Keywords: sparse matrix; parallel matrix-vector multiplication; classification; Newton method
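
The role the parallel sparse matrix-vector products play in such a Newton method can be sketched as follows, assuming the common L2-regularized formulation whose Hessian is H = I + C * X^T * D * X; the outline below is illustrative, not the paper's implementation:

    /* Sketch of the Hessian-vector product at the core of a truncated Newton method
     * for L2-regularized logistic regression, Hv = v + C * X^T (D (X v)), where D is
     * diagonal with d_i = sigma_i * (1 - sigma_i). The products X*v and X^T*u are
     * exactly the parallel sparse matrix-vector kernels studied in the paper. */
    #include <stddef.h>
    #include <stdlib.h>

    void hessian_vector(size_t n_samples, size_t n_features, double C,
                        const size_t *rowptr, const size_t *colidx,
                        const double *vals,  /* data matrix X in CSR, n_samples rows */
                        const double *d,     /* precomputed sigma_i * (1 - sigma_i) */
                        const double *v, double *Hv)
    {
        double *u = calloc(n_samples, sizeof *u);
        #pragma omp parallel for
        for (size_t i = 0; i < n_samples; i++) {        /* u = D * (X v): one SpMV */
            double s = 0.0;
            for (size_t k = rowptr[i]; k < rowptr[i + 1]; k++)
                s += vals[k] * v[colidx[k]];
            u[i] = d[i] * s;
        }
        for (size_t j = 0; j < n_features; j++)
            Hv[j] = v[j];                               /* regularization term */
        for (size_t i = 0; i < n_samples; i++)          /* Hv += C * X^T u: transposed SpMV; */
            for (size_t k = rowptr[i]; k < rowptr[i + 1]; k++)  /* kept serial here because */
                Hv[colidx[k]] += C * vals[k] * u[i];            /* the scatter is race-prone */
        free(u);
    }

The transposed product is the harder kernel to parallelize efficiently, which is one reason the choice of implementation matters for overall training time.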

Supervisors

by Ana Lucia Varbanescu, Rob V. van Nieuwpoort
"... van Nieuwpoort for their professional supervision and sincere interest in making this a better work. I should never forget our long discussions over a cup of tea and ”bitterkoekjes ” after which I felt both academically enlightened and high-spirited. Modern radio telescopes, such as the Low Frequenc ..."
Abstract - Add to MetaCart
van Nieuwpoort for their professional supervision and sincere interest in making this a better work. I should never forget our long discussions over a cup of tea and "bitterkoekjes" after which I felt both academically enlightened and high-spirited. Modern radio telescopes, such as the Low Frequency Array (LOFAR) in the north of the Netherlands, process the signal from the sky in software rather than expensive special-purpose hardware. This gives the astronomers an unprecedented flexibility to perform a vast amount of various scientific experiments. However, designing the actual software that would give optimal performance for many different experiments, possibly also running on different hardware, is a challenging task. Since optimizing the software by hand to fit the various experiments and hardware is infeasible, we employ a technique called parameter auto-tuning to find the optimal solution. Auto-tuning is based on the construction of a more generic software which has the ability to explore its parameter space and choose the values
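
A bare-bones sketch of the exhaustive parameter search behind this kind of auto-tuning might look like the following; the timing callback and the "block size" parameter are placeholders, not code from the thesis:

    /* Time the same kernel once per candidate parameter value and keep the fastest
     * configuration. A real auto-tuner would search a multi-dimensional parameter
     * space and may prune it; this only illustrates the basic idea. */
    #include <stddef.h>

    int autotune(double (*time_kernel)(int),   /* runs the kernel once, returns elapsed seconds */
                 const int *candidates, size_t n)
    {
        int best = candidates[0];
        double best_time = time_kernel(candidates[0]);
        for (size_t i = 1; i < n; i++) {
            double t = time_kernel(candidates[i]);
            if (t < best_time) { best_time = t; best = candidates[i]; }
        }
        return best;
    }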

Citation Context

...oblem [20, 41, 14]. The second group includes domain-specific tuners which can handle a range of problems from the same domain - for example, linear algebra matrix operations and stencil calculations [12, 40, 22]. Although problem- and domain-specific, and not closely related to our work, these two groups can still have an inspirational value by showing successful auto-tuning methodologies. For instance,...

Improving Memory Hierarchy Utilisation for Stencil Computations on Multicore Machines∗

by Alexandre Sena, Aline Nascimento, Cristina Boeres, Vinod E. F. Rebello , 2013
"... Although modern supercomputers are composed of multicore machines, one can find scientists that still execute their legacy applications which were developed to monocore cluster where memory hierarchy is dedicated to a sole core. The main objective of this paper is to propose and evaluate an algorith ..."
Abstract - Add to MetaCart
Although modern supercomputers are composed of multicore machines, one can find scientists who still execute their legacy applications, which were developed for monocore clusters where the memory hierarchy is dedicated to a sole core. The main objective of this paper is to propose and evaluate an algorithm that identifies an efficient block size to be applied to MPI stencil computations on multicore machines. In light of an extensive experimental analysis, this work shows the benefits of identifying block sizes that divide the data among the various cores and suggests a methodology that exploits the memory hierarchy available in modern machines.
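
The kind of cache-blocked stencil sweep whose block size such an algorithm would choose can be sketched as follows, here a 5-point 2D Jacobi update tiled by hypothetical factors bi x bj (illustrative only, not the paper's code):

    /* Cache-blocked 5-point Jacobi sweep over an n x n grid, tiled by bi x bj so
     * that each tile fits in cache. bi and bj are exactly the tunable parameters
     * a block-size selection algorithm would pick; bounds are illustrative. */
    #include <stddef.h>

    void stencil_blocked(size_t n, size_t bi, size_t bj,
                         const double *in, double *out)   /* n x n grids, row-major */
    {
        #pragma omp parallel for collapse(2) schedule(static)
        for (size_t ii = 1; ii < n - 1; ii += bi)
            for (size_t jj = 1; jj < n - 1; jj += bj) {
                size_t imax = ii + bi < n - 1 ? ii + bi : n - 1;
                size_t jmax = jj + bj < n - 1 ? jj + bj : n - 1;
                for (size_t i = ii; i < imax; i++)        /* interior points of one tile */
                    for (size_t j = jj; j < jmax; j++)
                        out[i * n + j] = 0.25 * (in[(i - 1) * n + j] + in[(i + 1) * n + j]
                                               + in[i * n + j - 1] + in[i * n + j + 1]);
            }
    }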

Citation Context

...work proposed here, the work does not infer how the dimensions of the blocking are calculated. Also, good performance is only achieved if there is enough cache capacity to hold the blocked data. In [5], the auto-tuning framework pOSKI for sparse linear algebra kernels on multicore systems is proposed as an extension of a previous work devised for cache-based superscalar uniprocessors. pOSKI applies ...
