Results 1  10
of
642
Parallel Numerical Linear Algebra
, 1993
"... We survey general techniques and open problems in numerical linear algebra on parallel architectures. We first discuss basic principles of parallel processing, describing the costs of basic operations on parallel machines, including general principles for constructing efficient algorithms. We illust ..."
Abstract

Cited by 766 (23 self)
 Add to MetaCart
We survey general techniques and open problems in numerical linear algebra on parallel architectures. We first discuss basic principles of parallel processing, describing the costs of basic operations on parallel machines, including general principles for constructing efficient algorithms. We illustrate these principles using current architectures and software systems, and by showing how one would implement matrix multiplication. Then, we present direct and iterative algorithms for solving linear systems of equations, linear least squares problems, the symmetric eigenvalue problem, the nonsymmetric eigenvalue problem, and the singular value decomposition. We consider dense, band and sparse matrices.
An Extended Set of Fortran Basic Linear Algebra Subprograms
 ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE
, 1986
"... This paper describes an extension to the set of Basic Linear Algebra Subprograms. The extensions are targeted at matrixvector operations which should provide for efficient and portable implementations of algorithms for high performance computers. ..."
Abstract

Cited by 526 (72 self)
 Add to MetaCart
This paper describes an extension to the set of Basic Linear Algebra Subprograms. The extensions are targeted at matrixvector operations which should provide for efficient and portable implementations of algorithms for high performance computers.
Performance of various computers using standard linear equations software
, 2009
"... This report compares the performance of different computer systems in solving dense systems of linear equations. The comparison involves approximately a hundred computers, ranging from the Earth Simulator to personal computers. ..."
Abstract

Cited by 409 (22 self)
 Add to MetaCart
(Show Context)
This report compares the performance of different computer systems in solving dense systems of linear equations. The comparison involves approximately a hundred computers, ranging from the Earth Simulator to personal computers.
NetSolve: A Network Server for Solving Computational Science Problems
 The International Journal of Supercomputer Applications and High Performance Computing
, 1995
"... This paper presents a new system, called NetSolve, that allows users to access computational resources, such as hardware and software, distributed across the network. This project has been motivated by the need for an easytouse, efficient mechanism for using computational resources remotely. Ease ..."
Abstract

Cited by 305 (32 self)
 Add to MetaCart
This paper presents a new system, called NetSolve, that allows users to access computational resources, such as hardware and software, distributed across the network. This project has been motivated by the need for an easytouse, efficient mechanism for using computational resources remotely. Ease of use is obtained as a result of different interfaces, some of which do not require any programming effort from the user. Good performance is ensured by a loadbalancing policy that enables NetSolve to use the computational resource available as efficiently as possible. NetSolve is designed to run on any heterogeneous network and is implemented as a faulttolerant clientserver application. Keywords Distributed System, Heterogeneity, Load Balancing, ClientServer, Fault Tolerance, Linear Algebra, Virtual Library. University of Tennessee  Technical report No cs95313 Department of Computer Science, University of Tennessee, TN 37996 y Mathematical Science Section, Oak Ridge National La...
Optimizing Matrix Multiply using PHiPAC: a Portable, HighPerformance, ANSI C Coding Methodology
, 1996
"... Modern microprocessors can achieve high performance on linear algebra kernels but this currently requires extensive machinespecific hand tuning. We have developed a methodology whereby nearpeak performance on a wide range of systems can be achieved automatically for such routines. First, by analyz ..."
Abstract

Cited by 262 (24 self)
 Add to MetaCart
Modern microprocessors can achieve high performance on linear algebra kernels but this currently requires extensive machinespecific hand tuning. We have developed a methodology whereby nearpeak performance on a wide range of systems can be achieved automatically for such routines. First, by analyzing current machines and C compilers, we've developed guidelines for writing Portable, HighPerformance, ANSI C (PHiPAC, pronounced "feepack"). Second, rather than code by hand, we produce parameterized code generators. Third, we write search scripts that and the best parameters for a given system. We report on a BLAS GEMM compatible multilevel cacheblocked matrix multiply generator which produces code that achieves around 90% of peak on the Sparcstation20/61, IBM RS/6000590, HP 712/80i, SGI Power Challenge R8k, and SGI Octane R10k, and over 80% of peak on the SGI Indigo R4k. The resulting routines are competitive with vendoroptimized BLAS GEMMs.
ARPACK Users Guide: Solution of Large Scale Eigenvalue Problems by Implicitly Restarted Arnoldi Methods.
, 1997
"... this document is intended to provide a cursory overview of the Implicitly Restarted Arnoldi/Lanczos Method that this software is based upon. The goal is to provide some understanding of the underlying algorithm, expected behavior, additional references, and capabilities as well as limitations of the ..."
Abstract

Cited by 215 (18 self)
 Add to MetaCart
(Show Context)
this document is intended to provide a cursory overview of the Implicitly Restarted Arnoldi/Lanczos Method that this software is based upon. The goal is to provide some understanding of the underlying algorithm, expected behavior, additional references, and capabilities as well as limitations of the software. 1.7 Dependence on LAPACK and BLAS
Brook for GPUs: Stream Computing on Graphics Hardware
 ACM TRANSACTIONS ON GRAPHICS
, 2004
"... In this paper, we present Brook for GPUs, a system for generalpurpose computation on programmable graphics hardware. Brook extends C to include simple dataparallel constructs, enabling the use of the GPU as a streaming coprocessor. We present a compiler and runtime system that abstracts and virtua ..."
Abstract

Cited by 204 (9 self)
 Add to MetaCart
(Show Context)
In this paper, we present Brook for GPUs, a system for generalpurpose computation on programmable graphics hardware. Brook extends C to include simple dataparallel constructs, enabling the use of the GPU as a streaming coprocessor. We present a compiler and runtime system that abstracts and virtualizes many aspects of graphics hardware. In addition, we present an analysis of the effectiveness of the GPU as a compute engine compared to the CPU, to determine when the GPU can outperform the CPU for a particular algorithm. We evaluate our system with five applications, the SAXPY and SGEMV BLAS operators, image segmentation, FFT, and ray tracing. For these applications, we demonstrate that our Brook implementations perform comparably to handwritten GPU code and up to seven times faster than their CPU counterparts.
ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers  Design Issues and Performance (Technical Paper)
, 1996
"... This paper outlines the content and performance of ScaLAPACK, a collection of mathematical software for linear algebra computations on distributed memory computers. The importance of developing standards for computational and message passing interfaces is discussed. We present the different componen ..."
Abstract

Cited by 174 (56 self)
 Add to MetaCart
This paper outlines the content and performance of ScaLAPACK, a collection of mathematical software for linear algebra computations on distributed memory computers. The importance of developing standards for computational and message passing interfaces is discussed. We present the different components and building blocks of ScaLAPACK. This paper outlines the difficulties inherent in producing correct codes for networks of heterogeneous processors. We define a theoretical model of parallel computers dedicated to linear algebra applications: the Distributed Linear Algebra Machine (DLAM). This model provides a convenient framework for developing parallel algorithms and investigating their scalability, performance and programmability. Extensive performance results on various platforms are presented and analyzed with the help of the DLAM. Finally, this paper briefly describes future directions for the ScaLAPACK library and concludes by suggesting alternative approaches to mathematical libra...
An overview of the trilinos project
 ACM Transactions on Mathematical Software
"... The Trilinos Project is an effort to facilitate the design, development, integration and ongoing support of mathematical software libraries within an objectoriented framework for the solution of largescale, complex multiphysics engineering and scientific problems. Trilinos addresses two fundament ..."
Abstract

Cited by 143 (17 self)
 Add to MetaCart
(Show Context)
The Trilinos Project is an effort to facilitate the design, development, integration and ongoing support of mathematical software libraries within an objectoriented framework for the solution of largescale, complex multiphysics engineering and scientific problems. Trilinos addresses two fundamental issues of developing software for these problems: (i) Providing a streamlined process and set of tools for development of new algorithmic implementations and (ii) promoting interoperability of independently developed software. Trilinos uses a twolevel software structure designed around collections of packages. A Trilinos package is an integral unit usually developed by a small team of experts in a particular algorithms area such as algebraic preconditioners, nonlinear solvers, etc. Packages exist underneath the Trilinos top level, which provides a common lookandfeel, including configuration, documentation, licensing, and bugtracking. Here we present the overall Trilinos design, describing our use of abstract interfaces and default concrete implementations. We discuss the services that Trilinos provides to a prospective package and how these services are used by various packages. We also illustrate how packages can be combined to rapidly develop new algorithms. Finally, we discuss how Trilinos facilitates highquality software engineering practices that are increasingly required from simulation software. Sandia is a multiprogram laboratory operated by Sandia Corporation, a LockheedMartin Company, for the United States Department of Energy under Contract DEAC0494AL85000. Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and