Results 1–10 of 456,896
Improved Floating-Point Matrix Multiplier
"... The floating-point matrix multiplier is widely used in scientific computation. A great deal of effort has been made to achieve higher performance. Matrix multiplication consists of many multiplications and accumulations. Yang and Duh proposed a modular design of floating-point matrix mu ..."
Fast matrix multiplies using graphics hardware
, 2001
Cited by 122 (0 self)
"... We present a technique for large matrix-matrix multiplies using low-cost graphics hardware. The result is computed by literally visualizing the computations of a simple parallel processing algorithm. Current graphics hardware technology has limited precision and thus limits immediate applicability o ..."
Towards an algorithm for matrix multiplier blocks
in Proc. European Conf. Circuit Theory Design
, 2003
Cited by 5 (4 self)
"... The basic elements of an algorithm for designing multiplier blocks for matrices are presented. The new algorithm often produces results superior to the best of the older algorithms applied only to columns. ..."
Optimizing Matrix Multiply using PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology
, 1996
Cited by 262 (24 self)
"... Modern microprocessors can achieve high performance on linear algebra kernels, but this currently requires extensive machine-specific hand tuning. We have developed a methodology whereby near-peak performance on a wide range of systems can be achieved automatically for such routines. First, by analyz ..."
"... given system. We report on a BLAS GEMM compatible multi-level cache-blocked matrix multiply generator which produces code that achieves around 90% of peak on the Sparcstation-20/61, IBM RS/6000-590, HP 712/80i, SGI Power Challenge R8k, and SGI Octane R10k, and over 80% of peak on the SGI Indigo R4k ..."
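The snippet above describes a multi-level cache-blocked GEMM generator. As a rough illustration of the blocking idea only (a minimal Python sketch, not PHiPAC's generated ANSI C; the function name and block-size parameter are ours):

```python
# Sketch of the cache-blocking idea behind generators like PHiPAC
# (illustrative only; PHiPAC itself emits machine-tuned ANSI C).

def blocked_matmul(A, B, n, bs):
    """Multiply two n x n matrices (lists of lists) in bs x bs blocks.

    Working on small blocks keeps the active sub-matrices of A, B, and C
    resident in cache, so each element loaded from memory is reused
    roughly bs times instead of once.
    """
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):          # block row of C
        for kk in range(0, n, bs):      # block of the shared dimension
            for jj in range(0, n, bs):  # block column of C
                # Multiply one block pair of A and B into a block of C.
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C
```

A real generator in this style would additionally unroll the inner loops and pick `bs` to match each level of the target machine's cache hierarchy.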
Optical Vector Matrix Multiplier for On-Chip Computation
"... We present a novel design for an optical vector matrix multiplier that enables miniaturization of optical data and signal processing systems. A proof-of-concept design was realized using LCD display pixels as binary controlling elements with the backlight as the source. It will be further improved b ..."
The GPU on the Matrix-Matrix Multiply: Performance Study and Contributions
"... Modern graphics processing units (GPUs) have been at the leading edge of increasing chip-level parallelism over the last ten years, and the CUDA programming model has recently allowed us to exploit their power across many computational domains. Within them, dense linear algebra algorithms emerge lik ..."
"... : The Matrix-Matrix Multiply. Different programming approaches and optimization techniques have already been published in the literature, which we review and analyze to pursue further optimizations and unveil the potential of some hardware resources when programming the GPU under CUDA. Experimental results ..."
CS 267 HOMEWORK 1: MATRIX MULTIPLY
"... in the same order (column-major), but the multiply operation requires that entries of either A or B be loaded with stride M. (Without loss of generality, assume the A matrix.) Large strides result in ineffective use of cache lines, since (for sufficiently large M) each consecutive entry in a row of ..."
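The stride problem this snippet describes can be sketched as follows (a minimal illustration under the snippet's column-major assumption; the helper names are ours, not from the assignment):

```python
# Sketch of the stride-M access pattern described above: with
# column-major storage, walking along a row of an M-row matrix touches
# every M-th element of the flat array, so each access can land on a
# different cache line instead of reusing one.

def colmajor_index(i, j, M):
    """Flat index of element (i, j) in an M-row column-major matrix."""
    return i + j * M

def row_access_indices(i, M, N):
    """Flat indices visited when reading row i of an M x N matrix."""
    return [colmajor_index(i, j, M) for j in range(N)]

# Reading row 0 of a 4 x 3 column-major matrix visits flat indices
# 0, 4, 8: consecutive row entries are M = 4 elements apart, not
# adjacent. A common remedy is to copy A into a transposed (or blocked)
# buffer first so the inner loop runs at unit stride.
```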
The PHIPAC v1.0 Matrix-Multiply Distribution
, 1998
Cited by 7 (3 self)
"... Modern microprocessors can achieve high performance on linear algebra kernels, but this currently requires extensive machine-specific hand tuning. We have developed a methodology whereby near-peak performance on a wide range of systems can be achieved automatically for such routines. First, by analyz ..."
"... given system. We report on a BLAS GEMM compatible multi-level cache-blocked matrix multiply generator which produces code that achieves around 90% of peak on the Sparcstation-20/61, IBM RS/6000-590, HP 712/80i, SGI Power Challenge R8k, and SGI Octane R10k, and over 80% of peak on the SGI Indigo R4k ..."