Parallel Numerical Linear Algebra
1993
Cited by 766 (23 self)
We survey general techniques and open problems in numerical linear algebra on parallel architectures. We first discuss basic principles of parallel processing, describing the costs of basic operations on parallel machines, including general principles for constructing efficient algorithms. We illustrate these principles using current architectures and software systems, and by showing how one would implement matrix multiplication. Then, we present direct and iterative algorithms for solving linear systems of equations, linear least squares problems, the symmetric eigenvalue problem, the nonsymmetric eigenvalue problem, and the singular value decomposition. We consider dense, band and sparse matrices.
The Cache Performance and Optimizations of Blocked Algorithms
In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 1991
Cited by 581 (4 self)
Blocking is a well-known optimization technique for improving the effectiveness of memory hierarchies. Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks, so that data loaded into the faster levels of the memory hierarchy are reused. This paper presents cache performance data for blocked programs and evaluates several optimizations to improve this performance. The data is obtained by a theoretical model of data conflicts in the cache, which has been validated by large amounts of simulation. We show that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes. The conventional wisdom of trying to use the entire cache, or even a fixed fraction of the cache, is incorrect. If a fixed block size is used for a given cache size, the block size that minimizes the expected number of cache misses is very small. Tailoring the block size according to the matrix size and cache parameters can improve the average performance and reduce the variance in performance for different matrix sizes. Finally, whenever possible, it is beneficial to copy non-contiguous reused data into consecutive locations.
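As an illustration of the blocking idea described in this abstract, here is a minimal sketch in pure Python. It is not the paper's code: the function names `matmul_naive` and `matmul_blocked` and the `block` parameter are assumptions made for this example, and Python lists do not expose real cache behavior. The sketch only shows the loop restructuring, in which the computation proceeds one submatrix (tile) at a time.

```python
def matmul_naive(a, b):
    """Reference product of two matrices stored as lists of rows."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            for j in range(p):
                c[i][j] += aik * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same product, computed tile by tile (hypothetical helper).

    The three outer loops walk over tiles of size `block`; the three
    inner loops do the work inside one tile, so the same small pieces
    of `a`, `b` and `c` are touched repeatedly before moving on.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                # min(...) handles edge tiles when block does not divide the size
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += aik * b[k][j]
    return c
```

Both functions return the same result for any positive `block`. In a compiled language, the tile size would be chosen with the cache parameters and matrix size in mind, which is the tuning question the abstract raises.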
Automatically tuned linear algebra software
Conference on High Performance Networking and Computing, 1998
Cited by 477 (30 self)
This paper describes an approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units. The production of such software for machines ranging from desktop workstations to embedded processors can be a tedious and time-consuming process. The work described here can help in automating much of this process. We will concentrate our efforts on the widely used linear algebra kernels called the Basic Linear Algebra Subroutines (BLAS). In particular, the work presented here is for general matrix multiply, DGEMM. However, much of the technology and approach developed here can be applied to the other Level 3 BLAS, and the general strategy can have an impact on basic linear algebra operations in general and may be extended to other important kernel operations.
Linear Algebra Operators for GPU Implementation of Numerical Algorithms
ACM Transactions on Graphics, 2003
Cited by 324 (9 self)
In this work, the emphasis is on the development of strategies to realize techniques of numerical computing on the graphics chip. In particular, the focus is on the acceleration of techniques for solving sets of algebraic equations as they occur in numerical simulation. We introduce a framework for the implementation of linear algebra operators on programmable graphics processors (GPUs), thus providing the building blocks for the design of more complex numerical algorithms. In particular, we propose a stream model for arithmetic operations on vectors and matrices that exploits the intrinsic parallelism and efficient communication on modern GPUs. Besides performance gains due to improved numerical computations, graphics algorithms benefit from this model in that the transfer of computation results to the graphics processor for display is avoided. We demonstrate the effectiveness of our approach by implementing direct solvers for sparse matrices, and by applying these solvers to multi-dimensional finite difference equations, i.e. the 2D wave equation and the incompressible Navier-Stokes equations.
A column approximate minimum degree ordering algorithm
2000
Cited by 319 (54 self)
Sparse Gaussian elimination with partial pivoting computes the factorization PAQ = LU of a sparse matrix A, where the row ordering P is selected during factorization using standard partial pivoting with row interchanges. The goal is to select a column preordering, Q, based solely on the nonzero pattern of A such that the factorization remains as sparse as possible, regardless of the subsequent choice of P. The choice of Q can have a dramatic impact on the number of nonzeros in L and U. One scheme for determining a good column ordering for A is to compute a symmetric ordering that reduces fill-in in the Cholesky factorization of A^T A. This approach, which requires the sparsity structure of A^T A to be computed, can be expensive both in
NetSolve: A Network Server for Solving Computational Science Problems
The International Journal of Supercomputer Applications and High Performance Computing, 1995
Cited by 305 (32 self)
This paper presents a new system, called NetSolve, that allows users to access computational resources, such as hardware and software, distributed across the network. This project has been motivated by the need for an easy-to-use, efficient mechanism for using computational resources remotely. Ease of use is obtained as a result of different interfaces, some of which do not require any programming effort from the user. Good performance is ensured by a load-balancing policy that enables NetSolve to use the computational resources available as efficiently as possible. NetSolve is designed to run on any heterogeneous network and is implemented as a fault-tolerant client-server application. Keywords: Distributed System, Heterogeneity, Load Balancing, Client-Server, Fault Tolerance, Linear Algebra, Virtual Library. University of Tennessee, Technical report No cs95313. Department of Computer Science, University of Tennessee, TN 37996; Mathematical Science Section, Oak Ridge National La...
Optimizing Matrix Multiply using PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology
1996
Cited by 262 (24 self)
Modern microprocessors can achieve high performance on linear algebra kernels, but this currently requires extensive machine-specific hand tuning. We have developed a methodology whereby near-peak performance on a wide range of systems can be achieved automatically for such routines. First, by analyzing current machines and C compilers, we've developed guidelines for writing Portable, High-Performance, ANSI C (PHiPAC, pronounced "fee-pack"). Second, rather than code by hand, we produce parameterized code generators. Third, we write search scripts that find the best parameters for a given system. We report on a BLAS GEMM-compatible multi-level cache-blocked matrix multiply generator which produces code that achieves around 90% of peak on the Sparcstation-20/61, IBM RS/6000-590, HP 712/80i, SGI Power Challenge R8k, and SGI Octane R10k, and over 80% of peak on the SGI Indigo R4k. The resulting routines are competitive with vendor-optimized BLAS GEMMs.
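The third step of this abstract, a search over parameters for a given system, can be sketched as follows. This is a toy illustration under stated assumptions, not PHiPAC itself: PHiPAC generates ANSI C and searches a much larger parameter space, whereas this sketch times one hypothetical parameter (`block`) of a pure-Python blocked multiply. The names `blocked_matmul`, `time_once` and `search_block_size` are invented for the example.

```python
import time


def blocked_matmul(a, b, block):
    """Tile-by-tile matrix product; `block` is the parameter being tuned."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += aik * b[k][j]
    return c


def time_once(fn, *args):
    """Wall-clock time of a single call, in seconds."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


def search_block_size(n, candidates, repeats=3):
    """Time each candidate block size on an n-by-n problem and keep the fastest.

    Taking the minimum over `repeats` runs reduces noise from other
    processes. Returns (best_block, timings_by_block).
    """
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for block in candidates:
        timings[block] = min(
            time_once(blocked_matmul, a, b, block) for _ in range(repeats)
        )
    best = min(timings, key=timings.get)
    return best, timings
```

A call such as `search_block_size(64, [4, 8, 16, 32])` returns the fastest candidate on the machine where it runs, together with the measured timings. In Python, interpreter overhead dominates, so the winner says little about caches; the point is only the generate-measure-select structure.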
ARPACK Users Guide: Solution of Large Scale Eigenvalue Problems by Implicitly Restarted Arnoldi Methods
1997
Cited by 215 (18 self)
This document is intended to provide a cursory overview of the Implicitly Restarted Arnoldi/Lanczos Method that this software is based upon. The goal is to provide some understanding of the underlying algorithm, expected behavior, additional references, and capabilities as well as limitations of the software.
Multifrontal Parallel Distributed Symmetric and Unsymmetric Solvers
1998
Cited by 187 (30 self)
We consider the solution of both symmetric and unsymmetric systems of sparse linear equations. A new parallel distributed memory multifrontal approach is described. To handle numerical pivoting efficiently, a parallel asynchronous algorithm with dynamic scheduling of the computing tasks has been developed. We discuss some of the main algorithmic choices and compare both implementation issues and the performance of the LDL^T and LU factorizations. Performance analysis on an IBM SP2 shows the efficiency and the potential of the method. The test problems used are from the Rutherford-Boeing collection and from the PARASOL end users.