Results 1  10
of
584
Applied Numerical Linear Algebra
 Society for Industrial and Applied Mathematics
, 1997
"... We survey general techniques and open problems in numerical linear algebra on parallel architectures. We rst discuss basic principles of parallel processing, describing the costs of basic operations on parallel machines, including general principles for constructing e cient algorithms. We illustrate ..."
Abstract

Cited by 532 (26 self)
 Add to MetaCart
We survey general techniques and open problems in numerical linear algebra on parallel architectures. We rst discuss basic principles of parallel processing, describing the costs of basic operations on parallel machines, including general principles for constructing e cient algorithms. We illustrate these principles using current architectures and software systems, and by showing how one would implement matrix multiplication. Then, we present direct and iterative algorithms for solving linear systems of equations, linear least squares problems, the symmetric eigenvalue problem, the nonsymmetric eigenvalue problem, and the singular value decomposition. We consider dense, band and sparse matrices.
The Cache Performance and Optimizations of Blocked Algorithms
 In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems
, 1991
"... Blocking is a wellknown optimization technique for improving the effectiveness of memory hierarchies. Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks, so that data loaded into the faster levels of the memory hierarchy are reused. This ..."
Abstract

Cited by 512 (5 self)
 Add to MetaCart
Blocking is a wellknown optimization technique for improving the effectiveness of memory hierarchies. Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks, so that data loaded into the faster levels of the memory hierarchy are reused. This paper presents cache performance data for blocked programs and evaluates several optimizations to improve this performance. The data is obtained by a theoretical model of data conflicts in the cache, which has been validated by large amounts of simulation. We show that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes. The conventional wisdom of trying to use the entire cache, or even a fixed fraction of the cache, is incorrect. If a fixed block size is used for a given cache size, the block size that minimizes the expected number of cache misses is very small. Tailoring the block size according to the matrix size and cache parameters can improve the average performance and reduce the variance in performance for different matrix sizes. Finally, whenever possible, it is beneficial to copy noncontiguous reused data into consecutive locations. 1
Automated empirical optimizations of software and the ATLAS project
 Parallel Computing
, 2001
"... This paper describes the automatically tuned linear algebra software �ATLAS) project, as well as the fundamental principles that underly it. ATLAS is an instantiation of a new paradigm in high performance library production and maintenance, which we term automated empirical optimization of software ..."
Abstract

Cited by 290 (35 self)
 Add to MetaCart
This paper describes the automatically tuned linear algebra software �ATLAS) project, as well as the fundamental principles that underly it. ATLAS is an instantiation of a new paradigm in high performance library production and maintenance, which we term automated empirical optimization of software �AEOS); this style of library management has been created in order to allow software to keep pace with the incredible rate of hardware advancement inherent in Moore's Law. ATLAS is the application of this new paradigm to linear algebra software, with the present emphasis on the basic linear algebra subprograms �BLAS), a widely used, performancecritical,
NetSolve: A Network Server for Solving Computational Science Problems
 The International Journal of Supercomputer Applications and High Performance Computing
, 1995
"... This paper presents a new system, called NetSolve, that allows users to access computational resources, such as hardware and software, distributed across the network. This project has been motivated by the need for an easytouse, efficient mechanism for using computational resources remotely. Ease ..."
Abstract

Cited by 264 (31 self)
 Add to MetaCart
This paper presents a new system, called NetSolve, that allows users to access computational resources, such as hardware and software, distributed across the network. This project has been motivated by the need for an easytouse, efficient mechanism for using computational resources remotely. Ease of use is obtained as a result of different interfaces, some of which do not require any programming effort from the user. Good performance is ensured by a loadbalancing policy that enables NetSolve to use the computational resource available as efficiently as possible. NetSolve is designed to run on any heterogeneous network and is implemented as a faulttolerant clientserver application. Keywords Distributed System, Heterogeneity, Load Balancing, ClientServer, Fault Tolerance, Linear Algebra, Virtual Library. University of Tennessee  Technical report No cs95313 Department of Computer Science, University of Tennessee, TN 37996 y Mathematical Science Section, Oak Ridge National La...
A column approximate minimum degree ordering algorithm
, 2000
"... Sparse Gaussian elimination with partial pivoting computes the factorization PAQ = LU of a sparse matrix A, where the row ordering P is selected during factorization using standard partial pivoting with row interchanges. The goal is to select a column preordering, Q, based solely on the nonzero patt ..."
Abstract

Cited by 255 (52 self)
 Add to MetaCart
Sparse Gaussian elimination with partial pivoting computes the factorization PAQ = LU of a sparse matrix A, where the row ordering P is selected during factorization using standard partial pivoting with row interchanges. The goal is to select a column preordering, Q, based solely on the nonzero pattern of A such that the factorization remains as sparse as possible, regardless of the subsequent choice of P. The choice of Q can have a dramatic impact on the number of nonzeros in L and U. One scheme for determining a good column ordering for A is to compute a symmetric ordering that reduces fillin in the Cholesky factorization of ATA. This approach, which requires the sparsity structure of ATA to be computed, can be expensive both in
Linear Algebra Operators for GPU Implementation of Numerical Algorithms
 ACM Transactions on Graphics
, 2003
"... In this work, the emphasis is on the development of strategies to realize techniques of numerical computing on the graphics chip. In particular, the focus is on the acceleration of techniques for solving sets of algebraic equations as they occur in numerical simulation. We introduce a framework for ..."
Abstract

Cited by 236 (9 self)
 Add to MetaCart
In this work, the emphasis is on the development of strategies to realize techniques of numerical computing on the graphics chip. In particular, the focus is on the acceleration of techniques for solving sets of algebraic equations as they occur in numerical simulation. We introduce a framework for the implementation of linear algebra operators on programmable graphics processors (GPUs), thus providing the building blocks for the design of more complex numerical algorithms. In particular, we propose a stream model for arithmetic operations on vectors and matrices that exploits the intrinsic parallelism and efficient communication on modern GPUs. Besides performance gains due to improved numerical computations, graphics algorithms benefit from this model in that the transfer of computation results to the graphics processor for display is avoided. We demonstrate the effectiveness of our approach by implementing direct solvers for sparse matrices, and by applying these solvers to multidimensional finite difference equations, i.e. the 2D wave equation and the incompressible NavierStokes equations.
Optimizing Matrix Multiply using PHiPAC: a Portable, HighPerformance, ANSI C Coding Methodology
, 1996
"... Modern microprocessors can achieve high performance on linear algebra kernels but this currently requires extensive machinespecific hand tuning. We have developed a methodology whereby nearpeak performance on a wide range of systems can be achieved automatically for such routines. First, by analyz ..."
Abstract

Cited by 232 (24 self)
 Add to MetaCart
Modern microprocessors can achieve high performance on linear algebra kernels but this currently requires extensive machinespecific hand tuning. We have developed a methodology whereby nearpeak performance on a wide range of systems can be achieved automatically for such routines. First, by analyzing current machines and C compilers, we've developed guidelines for writing Portable, HighPerformance, ANSI C (PHiPAC, pronounced "feepack"). Second, rather than code by hand, we produce parameterized code generators. Third, we write search scripts that and the best parameters for a given system. We report on a BLAS GEMM compatible multilevel cacheblocked matrix multiply generator which produces code that achieves around 90% of peak on the Sparcstation20/61, IBM RS/6000590, HP 712/80i, SGI Power Challenge R8k, and SGI Octane R10k, and over 80% of peak on the SGI Indigo R4k. The resulting routines are competitive with vendoroptimized BLAS GEMMs.
A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers
, 1992
"... This paper describes ScaLAPACK, a distributed memory version of the LAPACK software package for dense and banded matrix computations. Key design features are the use of distributed versions of the Level LAS as building blocks, and an ob ectbased interface to the library routines. The square block s ..."
Abstract

Cited by 161 (33 self)
 Add to MetaCart
This paper describes ScaLAPACK, a distributed memory version of the LAPACK software package for dense and banded matrix computations. Key design features are the use of distributed versions of the Level LAS as building blocks, and an ob ectbased interface to the library routines. The square block scattered decomposition is described. The implementation of a distributed memory version of the rightlooking LU factorization algorithm on the Intel Delta multicomputer is discussed, and performance results are presented that demonstrated the scalability of the algorithm.
ARPACK Users Guide: Solution of Large Scale Eigenvalue Problems by Implicitly Restarted Arnoldi Methods.
, 1997
"... this document is intended to provide a cursory overview of the Implicitly Restarted Arnoldi/Lanczos Method that this software is based upon. The goal is to provide some understanding of the underlying algorithm, expected behavior, additional references, and capabilities as well as limitations of the ..."
Abstract

Cited by 136 (14 self)
 Add to MetaCart
this document is intended to provide a cursory overview of the Implicitly Restarted Arnoldi/Lanczos Method that this software is based upon. The goal is to provide some understanding of the underlying algorithm, expected behavior, additional references, and capabilities as well as limitations of the software. 1.7 Dependence on LAPACK and BLAS
Multifrontal Parallel Distributed Symmetric and Unsymmetric Solvers
, 1998
"... We consider the solution of both symmetric and unsymmetric systems of sparse linear equations. A new parallel distributed memory multifrontal approach is described. To handle numerical pivoting efficiently, a parallel asynchronous algorithm with dynamic scheduling of the computing tasks has been dev ..."
Abstract

Cited by 119 (32 self)
 Add to MetaCart
We consider the solution of both symmetric and unsymmetric systems of sparse linear equations. A new parallel distributed memory multifrontal approach is described. To handle numerical pivoting efficiently, a parallel asynchronous algorithm with dynamic scheduling of the computing tasks has been developed. We discuss some of the main algorithmic choices and compare both implementation issues and the performance of the LDL T and LU factorizations. Performance analysis on an IBM SP2 shows the efficiency and the potential of the method. The test problems used are from the RutherfordBoeing collection and from the PARASOL end users.