Cache-oblivious B-trees
, 2000
Cited by 135 (22 self)
Abstract. This paper presents two dynamic search trees attaining near-optimal performance on any hierarchical memory. The data structures are independent of the parameters of the memory hierarchy, e.g., the number of memory levels, the block-transfer size at each level, and the relative speeds of memory levels. The performance is analyzed in terms of the number of memory transfers between two memory levels with an arbitrary block-transfer size of $B$; this analysis can then be applied to every adjacent pair of levels in a multilevel memory hierarchy. Both search trees match the optimal search bound of $\Theta(1 + \log_{B+1} N)$ memory transfers. This bound is also achieved by the classic B-tree data structure on a two-level memory hierarchy with a known block-transfer size $B$. The first search tree supports insertions and deletions in $\Theta(1 + \log_{B+1} N)$ amortized memory transfers, which matches the B-tree's worst-case bounds. The second search tree supports scanning $S$ consecutive elements optimally in $\Theta(1 + S/B)$ memory transfers and supports insertions and deletions in $\Theta(1 + \log_{B+1} N + \frac{\log^2 N}{B})$ amortized memory transfers, matching the performance of the B-tree for $B = \Omega(\log N \log \log N)$.
Cache-Oblivious Algorithms
, 1999
Cited by 79 (1 self)
This thesis presents "cache-oblivious" algorithms that use asymptotically optimal amounts of work, and move data asymptotically optimally among multiple levels of cache. An algorithm is cache oblivious if no program variables dependent on hardware configuration parameters, such as cache size and cache-line length, need to be tuned to minimize the number of cache misses. We show that the ordinary algorithms for matrix transposition, matrix multiplication, sorting, and Jacobi-style multipass filtering are not cache optimal. We present algorithms for rectangular matrix transposition, FFT, sorting, and multipass filters, which are asymptotically optimal on computers with multiple levels of caches. For a cache with size $Z$ and cache-line length $L$, where $Z = \Omega(L^2)$, the number of cache misses for an $m \times n$ matrix transpose is $\Theta(1 + mn/L)$. The number of cache misses for either an $n$-point FFT or the sorting of $n$ numbers is $\Theta(1 + (n/L)(1 + \log_Z n))$. The cache complexity of computing n ...
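The idea behind a cache-oblivious matrix transpose can be illustrated with a short divide-and-conquer sketch. It is an illustration under assumptions, not the thesis's own code: the function names and the `base` cutoff are hypothetical, and `base` only limits recursion overhead rather than encoding any cache size or cache-line length.

```python
def transpose_recursive(a, b, r0, r1, c0, c1, base=16):
    """Write the transpose of the block a[r0:r1][c0:c1] into b.

    The larger dimension is halved at each step, so sub-blocks eventually
    become small enough to fit in any cache level without the code ever
    naming a cache parameter. `base` must be at least 1.
    """
    rows, cols = r1 - r0, c1 - c0
    if rows <= base and cols <= base:
        # Small block: copy element by element.
        for i in range(r0, r1):
            for j in range(c0, c1):
                b[j][i] = a[i][j]
    elif rows >= cols:
        mid = r0 + rows // 2  # split the row range
        transpose_recursive(a, b, r0, mid, c0, c1, base)
        transpose_recursive(a, b, mid, r1, c0, c1, base)
    else:
        mid = c0 + cols // 2  # split the column range
        transpose_recursive(a, b, r0, r1, c0, mid, base)
        transpose_recursive(a, b, r0, r1, mid, c1, base)


def transpose(a, base=16):
    """Return the transpose of a rectangular list-of-lists matrix."""
    m = len(a)
    n = len(a[0]) if m else 0
    b = [[None] * m for _ in range(n)]
    transpose_recursive(a, b, 0, m, 0, n, base)
    return b
```

Calling `transpose([[1, 2, 3], [4, 5, 6]])` yields the 3 × 2 result `[[1, 4], [2, 5], [3, 6]]`. Python lists of lists are not contiguous in memory, so the sketch shows only the recursion pattern, not the cache behavior.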
A Locality-Preserving Cache-Oblivious Dynamic Dictionary
, 2002
Cited by 73 (21 self)
This paper presents a simple dictionary structure designed for a hierarchical memory. The proposed data structure is cache oblivious and locality preserving. A cache-oblivious data structure has memory performance optimized for all levels of the memory hierarchy even though it has no memory-hierarchy-specific parameterization. A locality-preserving dictionary maintains elements of similar key values stored close together for fast access to ranges of data with consecutive keys.
Cache-oblivious priority queue and graph algorithm applications
 In Proc. 34th Annual ACM Symposium on Theory of Computing
, 2002
Cited by 68 (10 self)
In this paper we develop an optimal cache-oblivious priority queue data structure, supporting insertion, deletion, and deletemin operations in $O(\frac{1}{B} \log_{M/B} \frac{N}{B})$ amortized memory transfers, where $M$ and $B$ are the memory and block transfer sizes of any two consecutive levels of a multilevel memory hierarchy. In a cache-oblivious data structure, $M$ and $B$ are not used in the description of the structure. The bounds match the bounds of several previously developed external-memory (cache-aware) priority queue data structures, which all rely crucially on knowledge about $M$ and $B$. Priority queues are a critical component in many of the best known external-memory graph algorithms, and using our cache-oblivious priority queue we develop several cache-oblivious graph algorithms.
The Design and Implementation of SOLAR, a Portable Library for Scalable Out-of-Core Linear Algebra Computations
 WORKSHOP ON I/O IN PARALLEL AND DISTRIBUTED SYSTEMS
, 1996
Cited by 64 (5 self)
SOLAR is a portable high-performance library for out-of-core dense matrix computations. It combines portability with high performance by using existing high-performance in-core subroutine libraries and by using an optimized matrix input-output library. SOLAR works on parallel computers, workstations, and personal computers. It supports in-core computations on both shared-memory and distributed-memory machines, and its matrix input-output library supports both conventional I/O interfaces and parallel I/O interfaces. This paper discusses the overall design of SOLAR, its interfaces, and the design of several important subroutines. Experimental results show that SOLAR can factor on a single workstation an out-of-core positive-definite symmetric matrix at a rate exceeding 215 Mflops, and an out-of-core general matrix at a rate exceeding 195 Mflops. Less than 16% of the running time is spent on I/O in these computations. These results indicate that SOLAR's portability does not compromise its performance. We expect that the combination of portability, modularity, and the use of a high-level I/O interface will make the library an important platform for research on out-of-core algorithms and on parallel I/O.
A Survey of Out-of-Core Algorithms in Numerical Linear Algebra
 DIMACS SERIES IN DISCRETE MATHEMATICS AND THEORETICAL COMPUTER SCIENCE
, 1999
Cited by 59 (3 self)
This paper surveys algorithms that efficiently solve linear equations or compute eigenvalues even when the matrices involved are too large to fit in the main memory of the computer and must be stored on disks. The paper focuses on scheduling techniques that result in mostly sequential data accesses and in data reuse, and on techniques for transforming algorithms that cannot be effectively scheduled. The survey covers out-of-core algorithms for solving dense systems of linear equations, for the direct and iterative solution of sparse systems, for computing eigenvalues, for fast Fourier transforms, and for N-body computations. The paper also discusses reasonable assumptions on memory size, approaches for the analysis of out-of-core algorithms, and relationships between out-of-core, cache-aware, and parallel algorithms.
Applying recursion to serial and parallel QR factorization leads to better performance
Cited by 51 (4 self)
this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.
Cache-oblivious algorithms and data structures
 IN LECTURE NOTES FROM THE EEF SUMMER SCHOOL ON MASSIVE DATA SETS
, 2002
Cited by 36 (3 self)
A recent direction in the design of cache-efficient and disk-efficient algorithms and data structures is the notion of cache obliviousness, introduced by Frigo, Leiserson, Prokop, and Ramachandran in 1999. Cache-oblivious algorithms perform well on a multilevel memory hierarchy without knowing any parameters of the hierarchy, only knowing the existence of a hierarchy. Equivalently, a single cache-oblivious algorithm is efficient on all memory hierarchies simultaneously. While such results might seem impossible, a recent body of work has developed cache-oblivious algorithms and data structures that perform as well or nearly as well as standard external-memory structures which require knowledge of the cache/memory size and block transfer size. Here we describe several of these results with the intent of elucidating the techniques behind their design. Perhaps the most exciting of these results are the data structures, which form general building blocks immediately
Recursive Array Layouts and Fast Matrix Multiplication
, 1999
Cited by 31 (0 self)
The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional column-major or row-major array layouts to incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts to improve performance and reduce variability. Previous work on recursive matrix multiplication is extended to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication, and the more complex algorithms of Strassen and Winograd. While recursive layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2 to 2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms. For a purely sequential implementation, it is possible to reorder computation to conserve memory space and improve performance between ...
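As a concrete picture of a recursive array layout, the sketch below computes a Z-order (Morton) index by interleaving the bits of the row and column. The abstract does not say which layouts the paper evaluates, so this layout is only a common example, and the names `morton_index` and `to_morton_layout` are hypothetical.

```python
def morton_index(row, col, bits=16):
    """Interleave the low `bits` bits of row and col into a Z-order index.

    Column bits go to even positions and row bits to odd positions, so each
    2x2 block, then each 4x4 block, and so on, occupies a contiguous range.
    """
    z = 0
    for k in range(bits):
        z |= ((col >> k) & 1) << (2 * k)
        z |= ((row >> k) & 1) << (2 * k + 1)
    return z


def to_morton_layout(matrix):
    """Flatten an n x n matrix (n a power of two, an assumption of this
    sketch) into a list ordered by Morton index."""
    n = len(matrix)
    flat = [None] * (n * n)
    for i in range(n):
        for j in range(n):
            flat[morton_index(i, j)] = matrix[i][j]
    return flat
```

For a 4 × 4 matrix whose entry at (i, j) is 4i + j, the first eight elements of the flattened list are 0, 1, 4, 5, 2, 3, 6, 7: the top-left 2 × 2 quadrant comes first, followed by the top-right quadrant.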