Results 1-10 of 107
Applied Numerical Linear Algebra
Society for Industrial and Applied Mathematics, 1997
Cited by 532 (26 self)
We survey general techniques and open problems in numerical linear algebra on parallel architectures. We first discuss basic principles of parallel processing, describing the costs of basic operations on parallel machines, including general principles for constructing efficient algorithms. We illustrate these principles using current architectures and software systems, and by showing how one would implement matrix multiplication. Then, we present direct and iterative algorithms for solving linear systems of equations, linear least squares problems, the symmetric eigenvalue problem, the nonsymmetric eigenvalue problem, and the singular value decomposition. We consider dense, band and sparse matrices.
The Potential of the Cell Processor for Scientific Computing
CF'06, 2006
Cited by 72 (6 self)
The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. As a result, the high performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of using the forthcoming STI Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations, and 1D/2D FFTs. The difficulty of programming Cell, which requires assembly level intrinsics for the best performance, makes this model useful as an initial step in algorithm design and evaluation. Next, we validate the accuracy of our model by comparing results against published hardware results, as well as our own implementations on the Cell full system simulator. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different mappings of the kernels and demonstrates a simple and effective programming model for Cell's unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.
SUMMA: Scalable Universal Matrix Multiplication Algorithm
1997
Cited by 66 (4 self)
In this paper, we give a straightforward, highly efficient, scalable implementation of common matrix multiplication operations. The algorithms are much simpler than previously published methods, yield better performance, and require less work space. MPI implementations are given, as are performance results on the Intel Paragon system.
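The abstract gives no algorithmic detail, so the following is only a minimal sketch of the broadcast-and-update pattern usually associated with SUMMA, simulated in one Python process with no MPI. It assumes a square p x p process grid, a matrix order divisible by p, and one block column or block row broadcast per step. The names `split_blocks`, `local_update`, and `summa` are hypothetical and do not come from the paper.

```python
def split_blocks(M, p):
    """Partition a square matrix (list of rows) into a p x p grid of blocks."""
    nb = len(M) // p  # block size; assumes len(M) is divisible by p
    return [[[row[j * nb:(j + 1) * nb] for row in M[i * nb:(i + 1) * nb]]
             for j in range(p)] for i in range(p)]


def local_update(C, A, B):
    """Accumulate C += A * B on the blocks held by one simulated process."""
    for i in range(len(A)):
        for k in range(len(B)):
            a = A[i][k]
            for j in range(len(B[0])):
                C[i][j] += a * B[k][j]


def summa(A, B, p):
    """Simulate a SUMMA-style product C = A * B on a p x p process grid."""
    n = len(A)
    assert n % p == 0, "sketch assumes the matrix order is divisible by p"
    nb = n // p
    Ab, Bb = split_blocks(A, p), split_blocks(B, p)
    Cb = [[[[0.0] * nb for _ in range(nb)] for _ in range(p)] for _ in range(p)]
    for k in range(p):
        for i in range(p):
            # Stand-in for a broadcast along process row i from grid column k.
            a_panel = Ab[i][k]
            for j in range(p):
                # Stand-in for a broadcast along process column j from grid row k.
                b_panel = Bb[k][j]
                local_update(Cb[i][j], a_panel, b_panel)
    # Gather the distributed blocks of C back into a single matrix.
    C = []
    for i in range(p):
        for r in range(nb):
            row = []
            for j in range(p):
                row.extend(Cb[i][j][r])
            C.append(row)
    return C
```

In a real distributed implementation each (i, j) pair would be a separate process and the two panel assignments would be collective broadcasts; the loop structure here only shows which data each process needs at step k.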
Scalable and Modular Algorithms for Floating-Point Matrix Multiplication on FPGAs
In Proc. of the 18th International Parallel & Distributed Processing Symposium, 2004
Cited by 48 (10 self)
The abundant hardware resources on current FPGAs provide new opportunities to improve the performance of hardware implementations of scientific computations. In this paper, we propose two FPGA-based algorithms for floating-point matrix multiplication, a fundamental kernel in a number of scientific applications. We analyze the design tradeoffs in implementing this kernel on FPGAs. Our algorithms employ a linear array architecture with a small control logic. This architecture effectively utilizes the hardware resources on the entire FPGA and reduces the routing complexity. The processing elements (PEs) used in our algorithms are modular so that floating-point units can be easily embedded into them. In our designs, the floating-point units are optimized to maximize the number of PEs integrated on the FPGA as well as the clock speed. Experimental results show that our algorithms achieve high clock speeds and provide good scalability. Our algorithms achieve superior sustained floating-point performance compared with existing FPGA-based implementations and state-of-the-art processors.
Communication Lower Bounds for Distributed-Memory Matrix Multiplication
2004
Cited by 46 (1 self)
this paper. More specifically, we use the definitions of [10]: $\Theta(g(n))$ is the set of functions $f(n)$ such that there exist positive constants $c_1$, $c_2$, and $n_0$ such that $0 \le c_1 g(n) \le f(n) \le c_2 g(n)$ for all $n \ge n_0$; $O(g(n))$ is defined similarly using the weaker condition $0 \le f(n) \le c_2 g(n)$; $\Omega(g(n))$ is defined with the condition $0 \le c_1 g(n) \le f(n)$. The set $o(g(n))$ consists of functions $f(n)$ such that for any $c_2 > 0$ there exists a constant $n_0 > 0$ such that $0 \le f(n) < c_2 g(n)$ for all $n \ge n_0$
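A short worked instance may make these definitions concrete. The functions f(n) = 2n^2 + 3n and g(n) = n^2 are illustrative choices, not taken from the paper. With constants c_1 = 2, c_2 = 5, and n_0 = 1, the two-sided bound holds because 3n <= 3n^2 whenever n >= 1, so f(n) belongs to Theta(n^2), and therefore also to O(n^2) and Omega(n^2). The second line shows that the same f(n) belongs to o(n^3): for any c > 0, the bound holds once n exceeds 5/c.

```latex
\[
  0 \le 2n^2 \le 2n^2 + 3n \le 5n^2
  \quad \text{for all } n \ge 1
  \qquad (c_1 = 2,\ c_2 = 5,\ n_0 = 1)
\]
\[
  0 \le 2n^2 + 3n \le 5n^2 < c\,n^3
  \quad \text{for all } n \ge n_0 = \lfloor 5/c \rfloor + 1
\]
```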
Synthesis of High-Performance Parallel Programs for a Class of Ab Initio Quantum Chemistry Models
Proceedings of the IEEE, 2005
The Multicomputer Toolbox Approach to Concurrent BLAS
Proc. Scalable High Performance Computing Conf. (SHPCC, 1993
Cited by 28 (8 self)
Concurrent Basic Linear Algebra Subprograms (CBLAS) are a sensible approach to extending the successful Basic Linear Algebra Subprograms (BLAS) to multicomputers. We describe many of the issues involved in general-purpose CBLAS. Algorithms for dense matrix-vector and matrix-matrix multiplication on general $P \times Q$ logical process grids are presented, and experiments are run to demonstrate their performance characteristics. This work was supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy. Work performed under the auspices of the U.S. Department of Energy by the Lawrence Livermore National Laboratory under contract No. W-7405-ENG-48. Submitted to Concurrency: Practice & Experience. Address correspondence to: Mississippi State University, Engineering Research Center, PO Box 6176, Mississippi State, MS 39762. 601-325-8435. tony@cs.msstate.edu. Falgout, Skjellum, Smith & Still  The Multicomputer Toolbo...
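To illustrate what a matrix-vector product on a P x Q logical process grid involves, here is a minimal single-process sketch. It assumes a plain block distribution in which process (i, j) holds block (i, j) of the matrix and the matching slice of the vector, and in which the dimensions divide evenly. The function name `grid_matvec` and the chosen distribution are assumptions for illustration, not the CBLAS interface or the data layouts described in the paper.

```python
def grid_matvec(A, x, P, Q):
    """Simulate y = A * x with A block-distributed over a P x Q process grid."""
    m, n = len(A), len(x)
    assert m % P == 0 and n % Q == 0, "sketch assumes even block sizes"
    rb, cb = m // P, n // Q  # rows and columns per block
    y = []
    for i in range(P):
        rows = range(i * rb, (i + 1) * rb)
        # Each of the Q processes in grid row i multiplies its local block
        # by its local slice of x, giving a partial result of length rb.
        partials = []
        for j in range(Q):
            cols = range(j * cb, (j + 1) * cb)
            partials.append([sum(A[r][c] * x[c] for c in cols) for r in rows])
        # Stand-in for a sum-reduction across grid row i.
        y.extend(sum(vals) for vals in zip(*partials))
    return y
```

The sketch makes the communication structure visible: the local products need no communication, and the only combining step is the reduction across each grid row, whose cost depends on Q.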
Optimizing FORTRAN90 Programs for Data Motion on Massively Parallel Systems
1992
Cited by 25 (1 self)
This paper describes a general compiler optimization technique that reduces communication overhead for FORTRAN90 (and High Performance FORTRAN, currently being drafted) implementations on massively parallel machines. The main sources of communication, or data motion, for the parallel implementation of a FORTRAN90 program are array assignments (using the index triplet notation and vector indexing), array operators (e.g., CSHIFT and TRANSPOSE), and array parameter passing to and from subroutines. Coupled with the variety of ways arrays can be distributed, a FORTRAN90 implementor faces a rich space in which data motion can be organized. A model of data motion and an algebraic representation of data motion and data layout are presented. Yale Extension, a set of layout declarations for directing the compiler in distributing the data, is described. An array reference or an array operation extracted from the source FORTRAN90 program, given a particular data layout specified in Yale E...
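As a rough illustration of how data layout changes the data motion caused by an array operator, the sketch below counts how many elements of a one-dimensional array must cross process boundaries under a circular shift, for a block layout and for a cyclic layout. This is not the paper's model or the Yale Extension notation. The names `block_owner`, `cyclic_owner`, and `cshift_data_motion` are hypothetical, and the shift direction follows the usual CSHIFT convention that result element i takes the value of source element i + shift, wrapping around.

```python
def block_owner(i, n, p):
    """Owner of element i when n elements are split into p contiguous blocks."""
    block = -(-n // p)  # ceiling division gives the block length
    return i // block


def cyclic_owner(i, n, p):
    """Owner of element i when elements are dealt out round-robin to p processes."""
    return i % p


def cshift_data_motion(n, p, shift, owner):
    """Count elements whose source and destination live on different processes."""
    moved = 0
    for i in range(n):
        src = (i + shift) % n  # element that lands in position i after the shift
        if owner(src, n, p) != owner(i, n, p):
            moved += 1
    return moved
```

For 16 elements on 4 processes, a shift of 1 moves 4 elements under the block layout but all 16 under the cyclic layout, while a shift of 4 moves all 16 under the block layout and none under the cyclic layout. That contrast is the kind of layout-dependent cost a compiler would want to reason about.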