Applied Numerical Linear Algebra
Society for Industrial and Applied Mathematics, 1997
Abstract

Cited by 532 (26 self)
We survey general techniques and open problems in numerical linear algebra on parallel architectures. We first discuss basic principles of parallel processing, describing the costs of basic operations on parallel machines, including general principles for constructing efficient algorithms. We illustrate these principles using current architectures and software systems, and by showing how one would implement matrix multiplication. Then, we present direct and iterative algorithms for solving linear systems of equations, linear least squares problems, the symmetric eigenvalue problem, the nonsymmetric eigenvalue problem, and the singular value decomposition. We consider dense, band, and sparse matrices.
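The abstract uses matrix multiplication to illustrate its principles. The Python below is a minimal sketch of blocked (tiled) multiplication, the standard way to raise data reuse, and is not the survey's own code. Each tile of C is updated from a tile of A and a tile of B that are small enough to stay in fast memory. The tile size `bs` is a hypothetical tuning parameter. In a parallel setting, the same tiles are the natural unit to distribute across processors.

```python
def matmul_blocked(A, B, bs=2):
    """Tiled C = A B for matrices given as lists of rows.

    bs is a hypothetical tile size; on real hardware it would be chosen so
    that three bs-by-bs tiles fit in cache.
    """
    n, m, p = len(A), len(B), len(B[0])
    assert all(len(row) == m for row in A), "inner dimensions must agree"
    C = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, m, bs):
            for jj in range(0, p, bs):
                # Update tile C[ii:ii+bs, jj:jj+bs] with A-tile times B-tile.
                for i in range(ii, min(ii + bs, n)):
                    Ci, Ai = C[i], A[i]
                    for k in range(kk, min(kk + bs, m)):
                        a, Bk = Ai[k], B[k]
                        for j in range(jj, min(jj + bs, p)):
                            Ci[j] += a * Bk[j]
    return C
```

The result is the same for any tile size. Only the order of the memory accesses changes.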
A Survey of Out-of-Core Algorithms in Numerical Linear Algebra
DIMACS SERIES IN DISCRETE MATHEMATICS AND THEORETICAL COMPUTER SCIENCE, 1999
Abstract

Cited by 59 (3 self)
This paper surveys algorithms that efficiently solve linear equations or compute eigenvalues even when the matrices involved are too large to fit in the main memory of the computer and must be stored on disks. The paper focuses on scheduling techniques that result in mostly sequential data accesses and in data reuse, and on techniques for transforming algorithms that cannot be effectively scheduled. The survey covers out-of-core algorithms for solving dense systems of linear equations, for the direct and iterative solution of sparse systems, for computing eigenvalues, for fast Fourier transforms, and for N-body computations. The paper also discusses reasonable assumptions on memory size, approaches for the analysis of out-of-core algorithms, and relationships between out-of-core, cache-aware, and parallel algorithms.
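The abstract stresses schedules that produce mostly sequential disk accesses. The Python below is a minimal sketch of that idea for a dense matrix-vector product y = A x and is not taken from the survey. It assumes a hypothetical on-disk layout in which A is stored row-major as raw float64 values. It also assumes that x and y fit in memory while A does not. The file is read front to back, one block of rows at a time, so every byte of A is read exactly once and in order.

```python
import array
import os
import tempfile


def write_matrix(path, rows):
    """Store a dense matrix on disk row by row as raw float64 values."""
    with open(path, "wb") as f:
        for row in rows:
            array.array("d", row).tofile(f)


def out_of_core_matvec(path, n_rows, n_cols, x, rows_per_block):
    """Compute y = A x while holding at most rows_per_block rows of A in memory."""
    y = []
    with open(path, "rb") as f:
        for start in range(0, n_rows, rows_per_block):
            nb = min(rows_per_block, n_rows - start)
            block = array.array("d")
            block.fromfile(f, nb * n_cols)  # one contiguous, sequential read
            for r in range(nb):
                off = r * n_cols
                y.append(sum(block[off + c] * x[c] for c in range(n_cols)))
    return y


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "A.bin")
        write_matrix(path, A)
        print(out_of_core_matvec(path, 3, 2, [1.0, 1.0], rows_per_block=2))
        # prints [3.0, 7.0, 11.0]
```

A matrix-vector product reads each entry of A only once, so there is no reuse to exploit and the only gain here is sequential access. Kernels such as dense factorizations also reuse each block many times, and that is where the survey's scheduling techniques matter more.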
A parallel variant of the GMRES(m)
In Proceedings of the 13th IMACS World Congress on Computational and Applied Mathematics. IMACS, Criterion, 1991
Abstract

Cited by 34 (1 self)
...itions in one step are all independent. As the inner products ...
Minimizing Communication in Sparse Matrix Solvers
Abstract

Cited by 23 (9 self)
Data communication within the memory system of a single processor node and between multiple nodes in a system is the bottleneck in many iterative sparse matrix solvers like CG and GMRES. Here, k iterations of a conventional implementation perform k sparse matrix-vector multiplications and Ω(k) vector operations such as dot products, resulting in communication that grows by a factor of Ω(k) in both the memory and the network. By reorganizing the sparse-matrix kernel to compute a set of matrix-vector products at once, and reorganizing the rest of the algorithm accordingly, we can perform k iterations by sending O(log P) messages instead of O(k · log P) messages on a parallel machine, and by reading the matrix A from DRAM to cache just once instead of k times on a sequential machine. This reduces communication to the minimum possible. We combine these techniques to form a new variant of GMRES. Our shared-memory implementation on an 8-core Intel Clovertown gets speedups of up to 4.3× over standard GMRES, without sacrificing convergence rate or numerical stability.
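The reorganized kernel computes a set of matrix-vector products at once. The Python below is a hedged sketch of *what* that kernel returns: the vectors x, Ax, ..., A^k x for a matrix in CSR (compressed sparse row) format. It does *not* reproduce the communication savings. Those come from blocking the loop so that each processor fetches the needed remote entries once for all k levels, or so that a block of A stays in cache across all k products. The names `csr_matvec` and `matrix_powers` are hypothetical. The monomial basis shown is the simplest choice, and it is known to become ill-conditioned as k grows. This sketch makes no claim about which basis the paper uses.

```python
def csr_matvec(indptr, indices, data, x):
    """y = A x for A stored in CSR form (row pointers, column indices, values)."""
    n = len(indptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for p in range(indptr[i], indptr[i + 1]):
            s += data[p] * x[indices[p]]
        y[i] = s
    return y


def matrix_powers(indptr, indices, data, x, k):
    """Return the monomial Krylov basis [x, A x, A^2 x, ..., A^k x].

    A communication-avoiding version would produce the same vectors, but
    it would read A (or exchange ghost entries) once for all k steps
    instead of once per step as this straightforward loop does.
    """
    basis = [list(x)]
    for _ in range(k):
        basis.append(csr_matvec(indptr, indices, data, basis[-1]))
    return basis
```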
Developments and Trends in the Parallel Solution of Linear Systems
Parallel Computing, 1999
Abstract

Cited by 5 (0 self)
In this review paper, we consider some important developments and trends in algorithm design for the solution of linear systems, concentrating on aspects that involve the exploitation of parallelism. We briefly discuss the solution of dense linear systems before studying the solution of sparse equations by direct and iterative methods. We consider preconditioning techniques for iterative solvers and discuss some of the present research issues in this field.

Keywords: linear systems, dense matrices, sparse matrices, tridiagonal systems, parallelism, direct methods, iterative methods, Krylov methods, preconditioning.

AMS(MOS) subject classifications: 65F05, 65F50.

1 Introduction

Solution methods for systems of linear equations

Ax = b,    (1)

where A is a coefficient matrix of order n and x and b are n-vectors, are usually grouped into two distinct classes: direct methods and iterative methods. However, ...

(Author affiliation: CCLRC - Rutherford Appleton Laboratory, Oxfordshire, England and CERFACS, Toulouse,...)
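The two classes of solution methods for Ax = b can be contrasted on a small example. The Python below is a hedged sketch and not code from the review. It uses Gaussian elimination with partial pivoting as a textbook representative of direct methods and Jacobi iteration as a representative of iterative methods. The function names and the tolerance `tol` are illustrative assumptions. Jacobi is shown because every component of the new iterate depends only on the previous iterate, so the components can be computed in parallel. It is assumed to converge here because the test matrix is strictly diagonally dominant.

```python
def solve_direct(A, b):
    """Gaussian elimination with partial pivoting followed by back substitution."""
    n = len(A)
    M = [[float(v) for v in row] + [float(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        # Partial pivoting: bring the largest remaining entry of column k up.
        piv = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[piv] = M[piv], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def solve_jacobi(A, b, tol=1e-12, max_iter=1000):
    """Jacobi iteration; converges e.g. for strictly diagonally dominant A."""
    n = len(A)
    x = [0.0] * n
    for _ in range(max_iter):
        # Each x_new[i] uses only the old x, so the n updates are independent.
        x_new = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
                 for i in range(n)]
        if max(abs(u - v) for u, v in zip(x_new, x)) < tol:
            return x_new
        x = x_new
    return x
```

The direct solver does a fixed amount of work that is known in advance. The iterative solver's cost depends on how quickly it converges, and preconditioning is aimed at exactly that.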
Applying Automated Memory Analysis to Improve Iterative Algorithms
SIAM J. Sci. Comput., 2007
Abstract

Cited by 5 (2 self)
Historically, iterative solvers have been designed to minimize the number of floating-point operations. We propose instead that iterative solvers should be designed to minimize the amount of data that must be loaded from the memory hierarchy to the CPU. In this paper, we describe automated memory analysis, a technique to improve the memory efficiency of a sparse linear iterative solver. Our automated memory analysis uses a language processor to predict the data movement required for an iterative algorithm based upon a Matlab implementation. We demonstrate how automated memory analysis is used to reduce the execution time of a component of a global parallel ocean model. In particular, code modifications identified or evaluated through automated memory analysis enable a 46% reduction in execution time for the conjugate gradient solver on a small serial problem. Further, we achieve a 9% reduction in total execution time for the full model on 64 processors. The predictive capabilities of our automated memory analysis can be used to simplify the development of memory-efficient numerical algorithms or software.
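The paper's analysis predicts data movement for an iterative solver. The Python below is a hedged hand count of the same kind of quantity for one iteration of unpreconditioned conjugate gradients. It is *not* the paper's language-processor tool, and it does not reproduce the paper's results. It assumes CSR storage, 8-byte values and 4-byte indices. It also assumes that every array is streamed exactly once per kernel, with no cache reuse between kernels. The `fused` flag models one hypothetical code modification: merging the x and r updates with the r·r dot product into a single pass over memory.

```python
def csr_spmv_bytes(n, nnz, val_bytes=8, idx_bytes=4):
    """Bytes moved by one CSR product y = A x, each array streamed once."""
    matrix = nnz * (val_bytes + idx_bytes) + (n + 1) * idx_bytes  # values, column indices, row pointers
    vectors = 2 * n * val_bytes  # read x once (ideal reuse), write y once
    return matrix + vectors


def cg_iteration_bytes(n, nnz, val_bytes=8, idx_bytes=4, fused=False):
    """Estimated memory traffic of one unpreconditioned CG iteration."""
    v = n * val_bytes  # bytes in one length-n vector
    spmv = csr_spmv_bytes(n, nnz, val_bytes, idx_bytes)  # q = A p
    if not fused:
        # p.q: read p, q | x += a p: read x, p, write x | r -= a q: read r, q, write r
        # r.r: read r | p = r + b p: read r, p, write p
        vec = 2 * v + 3 * v + 3 * v + 1 * v + 3 * v
    else:
        # p.q still needs its own pass (alpha is required first); one pass then
        # reads x, p, r, q and writes x, r while accumulating r.r;
        # p = r + b p stays separate because beta depends on r.r.
        vec = 2 * v + 6 * v + 3 * v
    return spmv + vec
```

Take a 5-point stencil (nnz ≈ 5n) as an example. Under these assumptions the sparse product moves 80n + 4 bytes and the vector operations move 96n bytes. That is why a memory-centric view counts vector kernels as well as the SpMV.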
Performance Characterization and Evaluation of Parallel PDE Solvers
2006
Abstract

Cited by 4 (4 self)
Computer simulations that solve partial differential equations (PDEs) are common in many fields of science and engineering. To decrease the execution time of the simulations, the PDEs can be solved on parallel computers. For efficient parallel implementations, the characteristics of both the hardware and the PDE solver must be taken into account. In this thesis, we explore two ways to increase the efficiency of parallel PDE solvers. First, we use full-system simulation of a parallel computer to get detailed knowledge about cache memory usage for three parallel PDE solvers. The results reveal cases of bad cache memory locality. This insight can be used to improve the performance of the PDE solvers. Second, we study the adaptive mesh refinement (AMR) partitioning problem. Using AMR, computational resources are dynamically concentrated in areas that need high accuracy. Because of the dynamic ...
Parallel iterative solution methods for linear systems arising from discretized PDE's
Lecture Notes on Parallel Iterative Methods for discretized PDE's. AGARD Special Course on Parallel Computing in CFD, available from http://www.math.ruu.nl/people/vorst/#lec, 1995
Abstract

Cited by 3 (0 self)
In these notes we will present an overview of a number of related iterative methods for the solution of linear systems of equations. These methods are so-called Krylov projection type methods, and they include popular methods such as Conjugate Gradients, Bi-Conjugate Gradients, CGS, Bi-CGSTAB, QMR, LSQR, and GMRES. We will show how these methods can be derived from simple basic iteration formulas. We will not give convergence proofs, but we will refer to the literature for these, as far as available. Iterative methods are often used in combination with so-called preconditioning operators (approximations for the inverse of the operator of the system to be solved). Since these preconditioners are not essential in the derivation of the iterative methods, we will not give much attention to them in these notes. However, we have included them in most of the actual iteration schemes in order to facilitate the use of these schemes in actual computations. For the application of the iterative schemes, one usually thinks of sparse linear systems, e.g., those arising in finite element or finite difference approximations of (systems of) partial differential equations. However, the structure of the operators plays no explicit role in any of these schemes, and these schemes might also be used successfully to solve certain large dense linear systems. Depending on the situation, that might be attractive in terms of the number of floating-point operations. It will turn out that all of the iterative methods are parallelizable in a straightforward manner. However, especially for computers with a memory hierarchy (e.g., cache or vector registers) and for distributed-memory computers, the performance can often be improved significantly by rescheduling the operations. We will discuss parallel implementations, and occasionally we will report on experimental findings.
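The notes derive Krylov methods such as Conjugate Gradients from basic iteration formulas and keep the preconditioner inside the schemes. The Python below is a hedged sketch of preconditioned CG for a symmetric positive definite A and is not taken from the notes. It uses a diagonal (Jacobi) preconditioner purely for simplicity. `M_inv_diag` is a hypothetical parameter that holds the inverse of that diagonal. Dense lists keep the sketch short, although the method only touches A through a matrix-vector product. The comments mark the operations that make parallelization straightforward: the matrix-vector product and the vector updates are local, while each dot product is a global reduction.

```python
def pcg(A, b, M_inv_diag, tol=1e-10, max_iter=None):
    """Preconditioned conjugate gradients for symmetric positive definite A.

    M_inv_diag (hypothetical name) holds the inverse diagonal of the
    preconditioner M, so applying M^{-1} is an elementwise product.
    """
    n = len(b)

    def matvec(v):
        # Dense y = A v; a sparse code would only change this routine.
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        # On a distributed-memory machine this is a global reduction.
        return sum(ui * vi for ui, vi in zip(u, v))

    x = [0.0] * n
    r = list(b)                                   # r = b - A x for x = 0
    z = [M_inv_diag[i] * r[i] for i in range(n)]  # z = M^{-1} r
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    for _ in range(max_iter if max_iter is not None else 10 * n):
        if dot(r, r) ** 0.5 <= tol * b_norm:      # relative residual test
            break
        q = matvec(p)
        alpha = rz / dot(p, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        z = [M_inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x
```

With `M_inv_diag` set to all ones, the same routine reduces to unpreconditioned CG.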