Results 11 – 20 of 415,255
Toward scalable matrix multiply on multithreaded architectures
 In Euro-Par ’07: Proceedings of the Thirteenth International European Conference on Parallel and Distributed Computing
, 2007
"... Abstract. We show empirically that some of the issues that affected the design of linear algebra libraries for distributed memory architectures will also likely affect such libraries for shared memory architectures with many simultaneous threads of execution, including SMP architectures and future multicore processors. The always-important matrix-matrix multiplication is used to demonstrate that a simple one-dimensional data partitioning is suboptimal in the context of dense linear algebra operations and hinders scalability. In addition we advocate the publishing of low-level interfaces to supporting ..."
Cited by 3 (2 self)
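A back-of-the-envelope sketch (not from the paper) of why a one-dimensional partitioning hinders scalability: counting the words each of p processes must receive to form C = A·B for n×n matrices, a 1D row split needs essentially all of B, while a 2D block split needs only a block row and block column.

```python
import math

def comm_words_1d(n, p):
    # 1D row partition: each process owns n/p rows of A and C,
    # but must receive essentially all n*n words of B.
    return n * n

def comm_words_2d(n, p):
    # 2D block partition (p assumed a perfect square): each process needs
    # only a block row of A and a block column of B, about 2*n*(n/sqrt(p)) words.
    q = math.isqrt(p)
    return 2 * n * (n // q)
```

For n = 4096 and p = 64, the 1D scheme moves 16,777,216 words per process versus 4,194,304 for the 2D scheme; the per-process gap widens as p grows, which is the scalability argument in a nutshell.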
A Fast 3×N Matrix Multiply Routine for Calculation of Protein RMSD
, 2014
"... The bottleneck for the rapid calculation of the root-mean-square deviation in atomic coordinates (RMSD) between pairs of protein structures for large numbers of conformations is the evaluation of a 3×N by N×3 matrix product over conformation pairs. Here we describe two matrix multiply routines specialized for the 3×N case that are able to significantly outperform (by up ..."
bioRxiv preprint, first posted online Sep. 2, 2014; doi: http://dx.doi.org/10.1101/008631 (made available under a CC-BY 4.0 International license; the copyright holder for this preprint, which was not peer-reviewed, is the author/funder)
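The product the abstract refers to can be sketched in NumPy (a generic illustration, not the paper's specialized routine):

```python
import numpy as np

def covariance_3x3(P, Q):
    # P and Q are 3×N coordinate arrays for two conformations.
    # The RMSD bottleneck is this (3×N)·(N×3) product: it reduces N atoms
    # to a 3×3 covariance matrix (9N multiply-adds) used for superposition.
    return P @ Q.T
```

Because the output is always 3×3 regardless of N, a routine specialized for this shape can beat a general GEMM tuned for large square matrices.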
A Comparison of C++ Sockets and Corba in a Distributed Matrix Multiply Application
, 1999
"... (Maximum 200 words.) This project has two primary purposes. The first is to implement a distributed matrix multiply algorithm using C++ sockets and Corba objects, with the objective of discovering what additional overhead, if any, exists in a Corba implementation. Secondly, to attempt to improve the sp ..."
Cited by 1 (0 self)
Autotuning and specialization: Speeding up matrix multiply for small matrices with compiler technology
 In the Fourth International Workshop on Automatic Performance Tuning
, 2009
"... Autotuning technology has emerged recently as a systematic process for evaluating alternative implementations of a computation to select the best-performing solution for a particular architecture. Specialization optimizes code customized to a particular class of input data. This paper presents a co ..."
Cited by 3 (2 self)
sizes on hand. In a case study of nek5000, a spectral-element-based code that extensively uses the specialized matrix multiply, we demonstrate a performance improvement for the full application of 36%.
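One way to picture the specialization the abstract describes — a toy sketch, not the paper's compiler — is generating fully unrolled code for a matrix size known ahead of time, so the generated multiply has no loop overhead and only constant indices:

```python
def specialize_matmul(n):
    # Emit a multiply with all loops unrolled for a fixed n×n size.
    # Each output element becomes one straight-line expression.
    src = ["def mm(A, B):", f"    C = [[0.0] * {n} for _ in range({n})]"]
    for i in range(n):
        for j in range(n):
            terms = " + ".join(f"A[{i}][{k}]*B[{k}][{j}]" for k in range(n))
            src.append(f"    C[{i}][{j}] = {terms}")
    src.append("    return C")
    env = {}
    exec("\n".join(src), env)   # compile the generated source
    return env["mm"]
```

A real autotuner would additionally time several generated variants and keep the fastest for the small sizes the application actually uses.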
Automatically tuned linear algebra software
 In Conference on High Performance Networking and Computing
, 1998
"... This paper describes an approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units. The production of such software for machines ranging from desktop workstations to embedded processors can be a tedious and time-consuming process. The work described here can help in automating much of this process. We will concentrate our efforts on the widely used linear algebra kernels called the Basic Linear Algebra Subroutines (BLAS). In particular, the work presented here is for general matrix multiply, DGEMM. However ..."
Cited by 468 (26 self)
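DGEMM's contract, and the cache blocking an autotuner like ATLAS searches over, can be sketched as follows (an illustration of the idea, not ATLAS itself; the tile size `nb` is the kind of parameter such a tuner sweeps):

```python
import numpy as np

def dgemm_blocked(alpha, A, B, beta, C, nb=32):
    # BLAS DGEMM semantics: C <- alpha*A@B + beta*C, computed in nb×nb tiles
    # so each tile's working set fits in cache. NumPy slicing clamps at the
    # edges, so n need not be a multiple of nb.
    n, m, kk = A.shape[0], B.shape[1], A.shape[1]
    C *= beta
    for i in range(0, n, nb):
        for p in range(0, kk, nb):
            for j in range(0, m, nb):
                C[i:i+nb, j:j+nb] += alpha * (A[i:i+nb, p:p+nb] @ B[p:p+nb, j:j+nb])
    return C
```

ATLAS-style generation would emit many such variants (tile sizes, unrollings, orderings), benchmark them on the target machine, and keep the winner.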
CoSaMP: Iterative signal recovery from incomplete and inaccurate samples
 California Institute of Technology, Pasadena
, 2008
"... Abstract. Compressive sampling offers a new paradigm for acquiring signals that are compressible with respect to an orthonormal basis. The major algorithmic challenge in compressive sampling is to approximate a compressible signal from noisy samples. This paper describes a new iterative recovery algorithm called CoSaMP that delivers the same guarantees as the best optimization-based approaches. Moreover, this algorithm offers rigorous bounds on computational cost and storage. It is likely to be extremely efficient for practical problems because it requires only matrix–vector multiplies ..."
Cited by 747 (13 self)
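A compact sketch of the CoSaMP iteration (simplified from the paper; the sampling matrix `Phi`, sparsity `s`, and fixed iteration count are assumptions for illustration), showing that each step needs only matrix–vector products plus a small least-squares solve on the candidate support:

```python
import numpy as np

def cosamp(Phi, u, s, iters=20):
    # Recover an s-sparse x from measurements u ≈ Phi @ x.
    a = np.zeros(Phi.shape[1])
    for _ in range(iters):
        proxy = Phi.T @ (u - Phi @ a)                       # matrix-vector only
        omega = np.argsort(np.abs(proxy))[-2 * s:]          # largest 2s components
        T = np.union1d(omega, np.nonzero(a)[0])             # merge supports
        b, *_ = np.linalg.lstsq(Phi[:, T], u, rcond=None)   # solve on support
        a = np.zeros_like(a)
        keep = np.argsort(np.abs(b))[-s:]                   # prune to s largest
        a[T[keep]] = b[keep]
    return a
```

The only dense work per iteration is the two matrix–vector multiplies and a least-squares problem over at most 3s columns, which is what makes the method cheap in practice.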
Emmerald: A Fast Matrix-Matrix Multiply Using Intel's SSE Instructions
 Concurrency and Computation: Practice and Experience
, 2001
"... Generalised matrix-matrix multiplication forms the kernel of many mathematical algorithms, hence a faster matrix-matrix multiply immediately benefits these algorithms. In this paper we implement efficient matrix multiplication for large matrices using the Intel Pentium single instruction multiple ..."
Cited by 9 (0 self)
Matrix Multipliers
 Economics 105 (optional), Mike Lovell
, Fall 1992
"... The elementary multiplier concept claims only to explain how an increase in final demand, such as a step up in the level of government spending on goods and services, will affect the level of aggregate Gross Domestic Product. The multiplier neglects the obvious limitations involved in working with s ..."
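The elementary multiplier, and the matrix generalization the title alludes to via the Leontief inverse, can be illustrated as follows (a textbook sketch, not taken from the handout; the coefficient matrix below is made up):

```python
import numpy as np

def simple_multiplier(mpc):
    # Elementary Keynesian multiplier: dGDP = dG / (1 - MPC).
    return 1.0 / (1.0 - mpc)

def leontief_output(A, d):
    # Matrix multiplier: total output x solving x = A@x + d, i.e. x = (I - A)^-1 @ d,
    # where A is the input-output coefficient matrix and d is final demand.
    return np.linalg.solve(np.eye(A.shape[0]) - A, d)
```

With a marginal propensity to consume of 0.8, a one-unit rise in spending raises GDP by 5 units; the Leontief version answers the same question sector by sector.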
Design of a High-Speed Matrix Multiplier Based on Balanced Word-Width Decomposition and Karatsuba Multiplication
"... Abstract — This paper presents a flexible 2×2 matrix multiplier architecture. The architecture is based on word-width decomposition for flexible but high-speed operation. The elements in the matrices are successively decomposed so that a set of small multipliers and simple adders are used to generat ..."
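The Karatsuba step the multiplier design builds on — three sub-multiplies instead of four, with the cross term recovered by additions — can be sketched for integers (an illustration of the technique, not the paper's hardware datapath):

```python
def karatsuba(x, y):
    # Karatsuba multiplication: split each operand in half and compute
    # 3 recursive products, reusing (x1+x0)(y1+y0) - x1*y1 - x0*y0
    # for the cross term instead of two extra multiplies.
    if x < 10 or y < 10:
        return x * y
    m = max(len(str(x)), len(str(y))) // 2
    p = 10 ** m
    x1, x0 = divmod(x, p)
    y1, y0 = divmod(y, p)
    z2 = karatsuba(x1, y1)
    z0 = karatsuba(x0, y0)
    z1 = karatsuba(x1 + x0, y1 + y0) - z2 - z0
    return z2 * p * p + z1 * p + z0
```

In hardware, trading a multiplier for a few adders is usually a win, which is the motivation for applying the same decomposition to matrix-element products.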
Gemmw: A Portable Level 3 BLAS Winograd Variant of Strassen's Matrix-Matrix Multiply Algorithm
, 1994
"... Matrix-matrix multiplication is normally computed using one of the BLAS or a reinvention of part of the BLAS. Unfortunately, the BLAS were designed with small matrices in mind. When huge, well-conditioned matrices are multiplied together, the BLAS perform like the blahs, even on vector machines. ..."
Cited by 39 (1 self)
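For reference, classic Strassen recursion on 2×2 blocks looks like the sketch below (the paper implements the Winograd variant, which trims the additions; this sketch assumes square power-of-two sizes and falls back to a plain multiply below a cutoff):

```python
import numpy as np

def strassen(A, B, cutoff=32):
    # Strassen's recursion: 7 half-size multiplies instead of 8.
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    a, b, c, d = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    e, f, g, k = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    m1 = strassen(a + d, e + k, cutoff)
    m2 = strassen(c + d, e, cutoff)
    m3 = strassen(a, f - k, cutoff)
    m4 = strassen(d, g - e, cutoff)
    m5 = strassen(a + b, k, cutoff)
    m6 = strassen(c - a, e + f, cutoff)
    m7 = strassen(b - d, g + k, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = m1 + m4 - m5 + m7
    C[:h, h:] = m3 + m5
    C[h:, :h] = m2 + m4
    C[h:, h:] = m1 - m2 + m3 + m6
    return C
```

The extra additions cost accuracy and memory traffic, which is why such codes only switch to the recursion above a tuned cutoff size — the regime the abstract's "huge matrices" remark refers to.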