Results 1-10 of 597
Parallel Numerical Linear Algebra
, 1993
Cited by 575 (26 self)
We survey general techniques and open problems in numerical linear algebra on parallel architectures. We first discuss basic principles of parallel processing, describing the costs of basic operations on parallel machines, including general principles for constructing efficient algorithms. We illustrate these principles using current architectures and software systems, and by showing how one would implement matrix multiplication. Then, we present direct and iterative algorithms for solving linear systems of equations, linear least squares problems, the symmetric eigenvalue problem, the nonsymmetric eigenvalue problem, and the singular value decomposition. We consider dense, band and sparse matrices.
An Extended Set of Fortran Basic Linear Algebra Subprograms
 ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE
, 1986
Cited by 474 (69 self)
This paper describes an extension to the set of Basic Linear Algebra Subprograms. The extensions are targeted at matrix-vector operations which should provide for efficient and portable implementations of algorithms for high performance computers.
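As an illustration of the kind of matrix-vector operation this extended (Level 2) set covers, the general matrix-vector update y <- alpha*A*x + beta*y can be sketched as follows. This is a minimal pure-Python sketch, not the Fortran interface from the paper; the function name `gemv` and its argument order are assumptions chosen for readability.

```python
def gemv(alpha, A, x, beta, y):
    """Return alpha*A*x + beta*y for a dense matrix A given as a list of rows.

    Hypothetical helper for illustration only; it mirrors the mathematical
    form of a Level 2 matrix-vector update, not any library's calling sequence.
    """
    if len(A) != len(y) or any(len(row) != len(x) for row in A):
        raise ValueError("dimension mismatch")
    return [
        alpha * sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + beta * y_i
        for row, y_i in zip(A, y)
    ]


A = [[1.0, 2.0],
     [3.0, 4.0]]
result = gemv(2.0, A, [1.0, 1.0], 0.5, [10.0, 20.0])
print(result)  # [11.0, 24.0]
```

Each output element needs one row of A and the whole of x, which is why such operations are a natural unit for tuning on a given machine.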
Performance of various computers using standard linear equations software
, 2009
Cited by 353 (20 self)
This report compares the performance of different computer systems in solving dense systems of linear equations. The comparison involves approximately a hundred computers, ranging from the Earth Simulator to personal computers.
Automated empirical optimizations of software and the ATLAS project
 PARALLEL COMPUTING
, 2001
Cited by 330 (38 self)
This paper describes the automatically tuned linear algebra software (ATLAS) project, as well as the fundamental principles that underlie it. ATLAS is an instantiation of a new paradigm in high performance library production and maintenance, which we term automated empirical optimization of software (AEOS); this style of library management has been created in order to allow software to keep pace with the incredible rate of hardware advancement inherent in Moore's Law. ATLAS is the application of this new paradigm to linear algebra software, with the present emphasis on the basic linear algebra subprograms (BLAS), a widely used, performance-critical,
NetSolve: A Network Server for Solving Computational Science Problems
 The International Journal of Supercomputer Applications and High Performance Computing
, 1995
Cited by 283 (32 self)
This paper presents a new system, called NetSolve, that allows users to access computational resources, such as hardware and software, distributed across the network. This project has been motivated by the need for an easy-to-use, efficient mechanism for using computational resources remotely. Ease of use is obtained as a result of different interfaces, some of which do not require any programming effort from the user. Good performance is ensured by a load-balancing policy that enables NetSolve to use the computational resource available as efficiently as possible. NetSolve is designed to run on any heterogeneous network and is implemented as a fault-tolerant client-server application. Keywords: Distributed System, Heterogeneity, Load Balancing, Client-Server, Fault Tolerance, Linear Algebra, Virtual Library. University of Tennessee Technical Report No. CS-95-313. Department of Computer Science, University of Tennessee, TN 37996; Mathematical Science Section, Oak Ridge National La...
Optimizing Matrix Multiply using PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology
, 1996
Cited by 237 (24 self)
Modern microprocessors can achieve high performance on linear algebra kernels but this currently requires extensive machine-specific hand tuning. We have developed a methodology whereby near-peak performance on a wide range of systems can be achieved automatically for such routines. First, by analyzing current machines and C compilers, we've developed guidelines for writing Portable, High-Performance, ANSI C (PHiPAC, pronounced "fee-pack"). Second, rather than code by hand, we produce parameterized code generators. Third, we write search scripts that find the best parameters for a given system. We report on a BLAS GEMM compatible multi-level cache-blocked matrix multiply generator which produces code that achieves around 90% of peak on the Sparcstation-20/61, IBM RS/6000-590, HP 712/80i, SGI Power Challenge R8k, and SGI Octane R10k, and over 80% of peak on the SGI Indigo R4k. The resulting routines are competitive with vendor-optimized BLAS GEMMs.
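The two ideas named in this abstract, a blocked matrix multiply controlled by a tunable parameter and a search script that picks the parameter empirically, can be illustrated with a small sketch. This is not PHiPAC itself, which generates ANSI C; it is a pure-Python analogue under the assumption of a single block size, and the names `matmul_naive`, `matmul_blocked`, and `search_block_size` are hypothetical.

```python
import time


def matmul_naive(A, B):
    """Reference product of dense matrices given as lists of rows."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[p] * B[p][j] for p in range(inner)) for j in range(cols)]
            for row in A]


def matmul_blocked(A, B, bs):
    """Same product, computed tile by tile with square tiles of side bs."""
    n, inner, cols = len(A), len(B), len(B[0])
    C = [[0.0] * cols for _ in range(n)]
    for i0 in range(0, n, bs):
        for p0 in range(0, inner, bs):
            for j0 in range(0, cols, bs):
                # Multiply one tile of A by one tile of B into a tile of C.
                for i in range(i0, min(i0 + bs, n)):
                    row_a, row_c = A[i], C[i]
                    for p in range(p0, min(p0 + bs, inner)):
                        a, row_b = row_a[p], B[p]
                        for j in range(j0, min(j0 + bs, cols)):
                            row_c[j] += a * row_b[j]
    return C


def search_block_size(A, B, candidates):
    """Time each candidate block size once and return (fastest, timings)."""
    timings = {}
    for bs in candidates:
        start = time.perf_counter()
        matmul_blocked(A, B, bs)
        timings[bs] = time.perf_counter() - start
    return min(timings, key=timings.get), timings


n = 48
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[float(i - j) for j in range(n)] for i in range(n)]
best, timings = search_block_size(A, B, [4, 8, 16, 48])
print("fastest block size on this machine:", best)
```

In an interpreted language the timings mostly reflect loop overhead rather than cache behavior, so the sketch shows only the shape of the search, not the speedups reported in the abstract.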
Brook for GPUs: Stream Computing on Graphics Hardware
 ACM TRANSACTIONS ON GRAPHICS
, 2004
Cited by 172 (8 self)
In this paper, we present Brook for GPUs, a system for general-purpose computation on programmable graphics hardware. Brook extends C to include simple data-parallel constructs, enabling the use of the GPU as a streaming coprocessor. We present a compiler and runtime system that abstracts and virtualizes many aspects of graphics hardware. In addition, we present an analysis of the effectiveness of the GPU as a compute engine compared to the CPU, to determine when the GPU can outperform the CPU for a particular algorithm. We evaluate our system with five applications: the SAXPY and SGEMV BLAS operators, image segmentation, FFT, and ray tracing. For these applications, we demonstrate that our Brook implementations perform comparably to hand-written GPU code and up to seven times faster than their CPU counterparts.
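SAXPY, the first of the operators named above, computes y <- a*x + y element by element, which is why it suits a data-parallel stream model: each output element depends only on the matching input elements. A minimal sketch in plain Python follows; it is not Brook syntax, and the function name `saxpy` is simply the conventional BLAS name reused for illustration.

```python
def saxpy(a, x, y):
    """Return a*x + y elementwise; every element is independent of the others."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * x_i + y_i for x_i, y_i in zip(x, y)]


print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```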
ARPACK Users Guide: Solution of Large Scale Eigenvalue Problems by Implicitly Restarted Arnoldi Methods.
, 1997
Cited by 160 (17 self)
This document is intended to provide a cursory overview of the Implicitly Restarted Arnoldi/Lanczos Method that this software is based upon. The goal is to provide some understanding of the underlying algorithm, expected behavior, additional references, and capabilities as well as limitations of the software. 1.7 Dependence on LAPACK and BLAS
GEMM-Based Level 3 BLAS: High-Performance Model Implementations and Performance Evaluation Benchmark
 ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE
, 1998
Cited by 91 (8 self)
The level 3 Basic Linear Algebra Subprograms (BLAS) are designed to perform various matrix multiply and triangular system solving computations. Due to the complex hardware organization of advanced computer architectures, the development of optimal level 3 BLAS code is costly and time consuming. However, it is possible to develop a portable and high-performance level 3 BLAS library mainly relying on a highly optimized GEMM, the routine for the general matrix multiply and add operation. With suitable partitioning, all the other level 3 BLAS can be defined in terms of GEMM and a small amount of level 1 and level 2 computations. Our contribution is twofold. First, the model implementations in Fortran 77 of the GEMM-based level 3 BLAS are structured to effectively reduce data traffic in a memory hierarchy. Second, the GEMM-based level 3 BLAS performance evaluation benchmark is a tool for evaluating and comparing different implementations of the level 3 BLAS with the GEMM-based model implementations.
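To make the partitioning idea concrete, here is a hedged sketch of a blocked triangular solve (the TRSM-style problem L X = B with L lower triangular) in which most of the arithmetic is routed through a GEMM-like update and only small diagonal blocks are solved directly. It is a pure-Python illustration and not the authors' Fortran 77 model implementation; the names `gemm`, `trsm_small`, and `trsm_blocked` and the default block size are assumptions.

```python
def gemm(alpha, A, B, beta, C):
    """Return alpha*A*B + beta*C for dense matrices given as lists of rows."""
    inner = len(B)
    return [[alpha * sum(A[i][p] * B[p][j] for p in range(inner)) + beta * C[i][j]
             for j in range(len(C[0]))]
            for i in range(len(C))]


def trsm_small(L, B):
    """Forward substitution for a small lower-triangular block: solve L X = B."""
    n, m = len(L), len(B[0])
    X = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            partial = sum(L[i][p] * X[p][j] for p in range(i))
            X[i][j] = (B[i][j] - partial) / L[i][i]
    return X


def trsm_blocked(L, B, bs=2):
    """Solve L X = B block row by block row, updating the rest with gemm."""
    n = len(L)
    work = [row[:] for row in B]          # keep the caller's B unchanged
    X = []
    for k0 in range(0, n, bs):
        k1 = min(k0 + bs, n)
        L11 = [row[k0:k1] for row in L[k0:k1]]
        X1 = trsm_small(L11, work[k0:k1])  # small direct solve on the diagonal block
        X.extend(X1)
        if k1 < n:
            L21 = [row[k0:k1] for row in L[k1:]]
            # Remaining right-hand sides: B2 <- B2 - L21 * X1, a GEMM update.
            work[k1:] = gemm(-1.0, L21, X1, 1.0, work[k1:])
    return X


L = [[2.0, 0.0, 0.0],
     [1.0, 1.0, 0.0],
     [4.0, 3.0, 2.0]]
B = [[2.0], [3.0], [14.0]]
print(trsm_blocked(L, B, bs=2))  # [[1.0], [2.0], [2.0]]
```

As the block size grows relative to the matrix, a larger share of the work moves into `gemm`, which is the routine the abstract proposes to optimize once and reuse.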
An Updated Set of Basic Linear Algebra Subprograms (BLAS)
 ACM Transactions on Mathematical Software
, 2001
Cited by 89 (7 self)
This paper summarizes the BLAS Technical Forum Standard, a specification of a set of kernel routines for linear algebra, historically called the Basic Linear Algebra Subprograms and commonly known as the BLAS. The complete standard can be found in [1], and on the BLAS Technical Forum webpage (http://www.netlib.org/blas/blast-forum/).