Results 1 – 4 of 4
Locality of Reference in LU Decomposition with Partial Pivoting
SIAM Journal on Matrix Analysis and Applications, 1997
Abstract

Cited by 98 (9 self)
This paper presents a new partitioned algorithm for LU decomposition with partial pivoting. The new algorithm, called the recursively partitioned algorithm, is based on a recursive partitioning of the matrix. The paper analyzes the locality of reference in the new algorithm and the locality of reference in a known and widely used partitioned algorithm for LU decomposition called the right-looking algorithm. The analysis reveals that the new algorithm performs a factor of $\Theta(\sqrt{M/n})$ fewer I/O operations (or cache misses) than the right-looking algorithm, where $n$ is the order of the matrix and $M$ is the size of primary memory. The analysis also determines the optimal block size for the right-looking algorithm. Experimental comparisons between the new algorithm and the right-looking algorithm show that an implementation of the new algorithm outperforms a similarly coded right-looking algorithm on six different RISC architectures, that the new algorithm performs fewer cache misses than any other algorithm tested, and that it benefits more from Strassen's matrix-multiplication algorithm.
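To make the recursive partitioning concrete, here is a minimal NumPy sketch of a recursively partitioned LU factorization with partial pivoting, in the spirit of the algorithm the abstract describes. The function names, the simple halving scheme, and the in-place packed storage are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def recursive_lu(A):
    """Recursively partitioned LU with partial pivoting (illustrative sketch).

    Factors A in place so that tril(A, -1) + I is L and triu(A) is U,
    with A_original[piv] == L @ U. Returns the row-permutation vector piv.
    """
    n = A.shape[0]
    piv = np.arange(n)
    _rlu(A, piv, 0, n)
    return piv

def _rlu(A, piv, j0, w):
    """Factor the w-column panel starting at column j0 (rows j0..n-1)."""
    if w == 1:
        # Base case: one column -- choose pivot, swap full rows, scale below.
        p = j0 + int(np.argmax(np.abs(A[j0:, j0])))
        if p != j0:
            A[[j0, p], :] = A[[p, j0], :]
            piv[[j0, p]] = piv[[p, j0]]
        if A[j0, j0] != 0.0:
            A[j0 + 1:, j0] /= A[j0, j0]
        return
    h = w // 2
    # Recursively factor the left half; full-row swaps permute the right
    # half (and earlier L columns) automatically.
    _rlu(A, piv, j0, h)
    # Triangular solve U12 = L11^{-1} A12 on the right half of the panel.
    L11 = np.tril(A[j0:j0 + h, j0:j0 + h], -1) + np.eye(h)
    A[j0:j0 + h, j0 + h:j0 + w] = np.linalg.solve(L11, A[j0:j0 + h, j0 + h:j0 + w])
    # Schur-complement update -- one large matrix multiply.
    A[j0 + h:, j0 + h:j0 + w] -= A[j0 + h:, j0:j0 + h] @ A[j0:j0 + h, j0 + h:j0 + w]
    # Recursively factor the right half.
    _rlu(A, piv, j0 + h, w - h)
```

At each level the panel is split in half: the left half is factored recursively, a triangular solve produces the corresponding block of U, the trailing update is a single large matrix multiply, and the right half is factored recursively. Concentrating the work in large multiplies is what yields the locality advantage, and is also why such an algorithm can benefit from Strassen-style multiplication.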
A Survey of OutofCore Algorithms in Numerical Linear Algebra
DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 1999
Abstract

Cited by 60 (3 self)
This paper surveys algorithms that efficiently solve linear equations or compute eigenvalues even when the matrices involved are too large to fit in the main memory of the computer and must be stored on disks. The paper focuses on scheduling techniques that result in mostly sequential data accesses and in data reuse, and on techniques for transforming algorithms that cannot be effectively scheduled. The survey covers out-of-core algorithms for solving dense systems of linear equations, for the direct and iterative solution of sparse systems, for computing eigenvalues, for fast Fourier transforms, and for N-body computations. The paper also discusses reasonable assumptions on memory size, approaches for the analysis of out-of-core algorithms, and relationships between out-of-core, cache-aware, and parallel algorithms.
Quantitative Performance Modeling of Scientific Computations and Creating Locality in Numerical Algorithms
, 1995
Abstract

Cited by 17 (3 self)
... you design an efficient out-of-core iterative algorithm? These are the two questions answered in this thesis. The first ...
Performance Comparison of the CRAY X-MP/24 with SSD and the CRAY-2
Abstract
Abstract. The CRAY-2 is considered to be one of the most powerful supercomputers. Its state-of-the-art technology features a faster clock and more memory than any other supercomputer available today. In this report the single-processor performance of the CRAY-2 is compared with that of the older, more mature CRAY X-MP. Benchmark results are included for both the slow (DRAM) and the fast (MOS) memory versions of the CRAY-2. Our comparison is based on a kernel benchmark set aimed at evaluating the performance of these two machines on some standard tasks in scientific computing. Particular emphasis is placed on evaluating the impact of the availability of large real memory on the CRAY-2 versus fast secondary memory on the CRAY X-MP with SSD. Our benchmark includes large linear equation solvers and FFT routines, which test the capabilities of the different approaches to providing large memory. We find that in spite of its higher processor speed the CRAY-2 does not perform as well as the CRAY X-MP on the Fortran kernel benchmark. We also find that for large-scale applications which have regular and predictable memory access patterns, a high-speed secondary memory device such as the SSD can provide performance equal to that of the large real memory of the CRAY-2.