Results 1 -
8 of
8
The Design and Implementation of SOLAR, a Portable Library for Scalable Out-of-Core Linear Algebra Computations
- WORKSHOP ON I/O IN PARALLEL AND DISTRIBUTED SYSTEMS
, 1996
"... SOLAR is a portable high-performance library for out-of-core dense matrix computations. It combines portability with high performance by using existing high-performance in-core subroutine libraries and by using an optimized matrix input-output library. SOLAR works on parallel computers, workstations ..."
Abstract
-
Cited by 61 (4 self)
- Add to MetaCart
SOLAR is a portable high-performance library for out-of-core dense matrix computations. It combines portability with high performance by using existing high-performance in-core subroutine libraries and by using an optimized matrix input-output library. SOLAR works on parallel computers, workstations, and personal computers. It supports in-core computations on both shared-memory and distributed-memory machines, and its matrix input-output library supports both conventional I/O interfaces and parallel I/O interfaces. This paper discusses the overall design of SOLAR, its interfaces, and the design of several important subroutines. Experimental results show that SOLAR can factor on a single workstation an out-of-core positive-definite symmetric matrix at a rate exceeding 215 Mflops, and an out-of-core general matrix at a rate exceeding 195 Mflops. Less than 16 % of the running time is spent on I/O in these computations. These results indicate that SOLAR's portability does not compromise its performance. We expect that the combination of portability, modularity, and the use of a high-level I/O interface will make the library an important platform for research on out-of-core algorithms and on parallel I/O.
A Survey of Out-of-Core Algorithms in Numerical Linear Algebra
- DIMACS SERIES IN DISCRETE MATHEMATICS AND THEORETICAL COMPUTER SCIENCE
, 1999
"... This paper surveys algorithms that efficiently solve linear equations or compute eigenvalues even when the matrices involved are too large to fit in the main memory of the computer and must be stored on disks. The paper focuses on scheduling techniques that result in mostly sequential data acces ..."
Abstract
-
Cited by 44 (2 self)
- Add to MetaCart
This paper surveys algorithms that efficiently solve linear equations or compute eigenvalues even when the matrices involved are too large to fit in the main memory of the computer and must be stored on disks. The paper focuses on scheduling techniques that result in mostly sequential data accesses and in data reuse, and on techniques for transforming algorithms that cannot be effectively scheduled. The survey covers out-of-core algorithms for solving dense systems of linear equations, for the direct and iterative solution of sparse systems, for computing eigenvalues, for fast Fourier transforms, and for N-body computations. The paper also discusses reasonable assumptions on memory size, approaches for the analysis of out-of-core algorithms, and relationships between out-of-core, cache-aware, and parallel algorithms.
A Framework for Symmetric Band Reduction
, 1999
"... this paper, we generalize the ideas behind the RS-algorithms and the MHLalgorithm. We develop a band reduction algorithm that eliminates d subdiagonals of a symmetric banded matrix with semibandwidth b (d < b), in a fashion akin to the MHL tridiagonalization algorithm. Then, like the Rutishauser alg ..."
Abstract
-
Cited by 26 (6 self)
- Add to MetaCart
this paper, we generalize the ideas behind the RS-algorithms and the MHLalgorithm. We develop a band reduction algorithm that eliminates d subdiagonals of a symmetric banded matrix with semibandwidth b (d < b), in a fashion akin to the MHL tridiagonalization algorithm. Then, like the Rutishauser algorithm, the band reduction algorithm is repeatedly used until the reduced matrix is tridiagonal. If d = b 1, it is the MHL-algorithm; and if d = 1 is used for each reduction step, it results in the Rutishauser algorithm. However, d need not be chosen this way; indeed, exploiting the freedom we have in choosing d leads to a class of algorithms for banded reduction and tridiagonalization with favorable computational properties. In particular, we can derive algorithms with
Parallel Tridiagonalization through Two-Step Band Reduction
- In Proceedings of the Scalable High-Performance Computing Conference
, 1994
"... We present a two-step variant of the "successive band reduction" paradigm for the tridiagonalization of symmetric matrices. Here we reduce a full matrix first to narrow-banded form and then to tridiagonal form. The first step allows easy exploitation of block orthogonal transformations. In the secon ..."
Abstract
-
Cited by 22 (12 self)
- Add to MetaCart
We present a two-step variant of the "successive band reduction" paradigm for the tridiagonalization of symmetric matrices. Here we reduce a full matrix first to narrow-banded form and then to tridiagonal form. The first step allows easy exploitation of block orthogonal transformations. In the second step, we employ a new blocked version of a banded matrix tridiagonalization algorithm by Lang. In particular, we are able to express the update of the orthogonal transformation matrix in terms of block transformations. This expression leads to an algorithm that is almost entirely based on BLAS-3 kernels and has greatly improved data movement and communication characteristics. We also present some performance results on the Intel Touchstone DELTA and the IBM SP1. 1 Introduction Reduction to tridiagonal form is a major step in eigenvalue computations for symmetric matrices. If the matrix is full, the conventional Householder tridiagonalization approachthereof [8] is the method of This work...
Parallel Reduction to Condensed Forms for Symmetric Eigenvalue Problems using Aggregated Fine-Grained and Memory-Aware Kernels
"... This paper introduces a novel implementation in reducing a symmetric dense matrix to tridiagonal form, which is the preprocessing step toward solving symmetric eigenvalue problems. Based on tile algorithms, the reduction follows a two-stage approach, where the tile matrix is first reduced to symmetr ..."
Abstract
-
Cited by 3 (2 self)
- Add to MetaCart
This paper introduces a novel implementation in reducing a symmetric dense matrix to tridiagonal form, which is the preprocessing step toward solving symmetric eigenvalue problems. Based on tile algorithms, the reduction follows a two-stage approach, where the tile matrix is first reduced to symmetric band form prior to the final condensed structure. The challenging trade-off between algorithmic performance and task granularity has been tackled through a grouping technique, which consists of aggregating fine-grained and memory-aware computational tasks during both stages, while sustaining the applications overall high performance. A dynamic runtime environment system then schedules the different tasks in an out-of-order fashion. The performance for the tridiagonal reduction reported in this paper is unprecedented. Our implementation results in up to 50-fold and 12-fold improvement (130 Gflop/s) compared to the equivalent routines from LAPACK V3.2 and Intel MKL V10.3, respectively, on an eight socket hexa-core AMD Opteron multicore shared-memory system with a matrix size of 24000 × 24000. 1.
On Tridiagonalizing and Diagonalizing Symmetric Matrices with Repeated Eigenvalues
- PREPRINT ANL/MCS-P5454-1095, MATHEMATICS AND COMPUTER SCIENCE DIVISION, ARGONNE NATIONAL LABORATORY
, 1995
"... We describe a divide-and-conquer tridiagonalization approach for matrices with repeated eigenvalues. Our algorithm hinges on the fact that, under easily constructively verifiable conditions, a symmetric matrix with bandwidth b and k distinct eigenvalues must be block diagonal with diagonal blocks ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
We describe a divide-and-conquer tridiagonalization approach for matrices with repeated eigenvalues. Our algorithm hinges on the fact that, under easily constructively verifiable conditions, a symmetric matrix with bandwidth b and k distinct eigenvalues must be block diagonal with diagonal blocks of size at most bk. A slight modification of the usual orthogonal band-reduction algorithm allows us to reveal this structure, which then leads to potential parallelism in the form of independent diagonal blocks. Compared with the usual Householder reduction algorithm, the new approach exhibits improved data locality, significantly more scope for parallelism, and the potential to reduce arithmetic complexity by close to 50% for matrices that have only two numerically distinct eigenvalues. The actual improvement depends to a large extent on the number of distinct eigenvalues and a good estimate thereof. However, at worst the algorithm behaves like a successive bandreduction approach to tridia...
A Comprehensive Study of Task Coalescing for Selecting Parallelism Granularity in a Two-Stage Bidiagonal Reduction
"... Abstract—We present new high performance numerical kernels combined with advanced optimization techniques that significantly increase the performance of parallel bidiagonal reduction. Our approach is based on developing efficient finegrained computational tasks as well as reducing overheads associat ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
Abstract—We present new high performance numerical kernels combined with advanced optimization techniques that significantly increase the performance of parallel bidiagonal reduction. Our approach is based on developing efficient finegrained computational tasks as well as reducing overheads associated with their high-level scheduling during the so-called bulge chasing procedure that is an essential phase of a scalable bidiagonalization procedure. In essence, we coalesce multiple tasks in a way that reduces the time needed to switch execution context between the scheduler and useful computational tasks. At the same time, we maintain the crucial information about the tasks and their data dependencies between the coalescing groups. This is the necessary condition to preserve numerical correctness of the computation. We show our annihilation strategy based on multiple applications of single orthogonal
Quantitative Performance Modeling of Scientific Computations and Creating Locality in Numerical Algorithms
, 1995
"... you design an efficient out-of-core iterative algorithm? These are the two questions answered in this thesis. ..."
Abstract
- Add to MetaCart
you design an efficient out-of-core iterative algorithm? These are the two questions answered in this thesis.

