## Cache-oblivious algorithms (Extended Abstract) (1999)

Venue: | In Proc. 40th Annual Symposium on Foundations of Computer Science |

Citations: | 12 - 1 self |

### BibTeX

@INPROCEEDINGS{Frigo99cache-obliviousalgorithms,
    author = {Matteo Frigo and Charles E. Leiserson and Harald Prokop and Sridhar Ramachandran},
    title = {Cache-oblivious algorithms (Extended Abstract)},
    booktitle = {Proc. 40th Annual Symposium on Foundations of Computer Science},
    year = {1999},
    pages = {285--297},
    publisher = {IEEE Computer Society Press}
}

### Abstract

This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size $Z$ and cache-line length $L$ where $Z = \Omega(L^2)$, the number of cache misses for an $m \times n$ matrix transpose is $\Theta(1 + mn/L)$. The number of cache misses for either an $n$-point FFT or the sorting of $n$ numbers is $\Theta(1 + (n/L)(1 + \log_Z n))$. We also give a $\Theta(mnp)$-work algorithm to multiply an $m \times n$ matrix by an $n \times p$ matrix that incurs $\Theta(1 + (mn + np + mp)/L + mnp/(L\sqrt{Z}))$ cache faults. We introduce an “ideal-cache” model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We also provide preliminary empirical results on the effectiveness of cache-oblivious algorithms in practice.
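The cache-oblivious idea described in the abstract can be illustrated with a divide-and-conquer matrix transpose. What follows is a minimal Python sketch under stated assumptions, not the authors' code: the names `cache_oblivious_transpose` and `_transpose_rec` and the `base` cutoff are hypothetical, and nested Python lists stand in for a contiguous row-major array, so the sketch shows only the recursion pattern and not real cache behavior. The point is that the recursion halves the longer dimension until the block is small, and no cache size or line length appears anywhere in the code.

```python
def _transpose_rec(a, b, r0, r1, c0, c1, base):
    """Write the transpose of the block a[r0:r1][c0:c1] into b."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= base and cols <= base:
        # Small block: copy directly. `base` is a fixed constant that only
        # limits recursion overhead; it is not tuned to any cache parameter.
        for i in range(r0, r1):
            for j in range(c0, c1):
                b[j][i] = a[i][j]
    elif rows >= cols:
        # Split the longer dimension (rows) in half and recurse on both parts.
        mid = r0 + rows // 2
        _transpose_rec(a, b, r0, mid, c0, c1, base)
        _transpose_rec(a, b, mid, r1, c0, c1, base)
    else:
        # Split the longer dimension (columns) in half and recurse.
        mid = c0 + cols // 2
        _transpose_rec(a, b, r0, r1, c0, mid, base)
        _transpose_rec(a, b, r0, r1, mid, c1, base)


def cache_oblivious_transpose(a, base=4):
    """Return the transpose of an m x n matrix given as a list of rows."""
    m = len(a)
    n = len(a[0]) if m else 0
    b = [[None] * m for _ in range(n)]
    if m and n:
        _transpose_rec(a, b, 0, m, 0, n, base)
    return b


if __name__ == "__main__":
    print(cache_oblivious_transpose([[1, 2, 3], [4, 5, 6]]))
```

In a language with contiguous arrays, the same recursion would eventually reach blocks that fit in whatever cache is present, which is the intuition behind the $\Theta(1 + mn/L)$ miss bound stated in the abstract.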

### Citations

8523 | Introduction to Algorithms - Cormen, Leiserson, et al. - 2001 |

3973 | Computer Architecture: A Quantitative Approach, 3rd ed. - Hennessy, Patterson, et al. - 2002 |

2435 | The Design and Analysis of Computer Algorithms - Aho, Hopcroft, et al. - 1974 |
Citation Context: ...alyzed in terms of a single measure, the ideal-cache model uses two measures. An algorithm with an input of size $n$ is measured by its work complexity $W(n)$ (its conventional running time in a RAM model [4]) and its cache complexity $Q(n; Z, L)$ (the number of cache misses it incurs as a function of the size $Z$ and line length $L$ of the ideal cache). When $Z$ and $L$ are clear from context, we denote the cache compl...

1869 | Randomized Algorithms - Motwani, Raghavan - 1995 |

728 | Amortized efficiency of list update and paging rules - Sleator, Tarjan - 1985 |
Citation Context: ...n; Z, L) cache misses on a problem of size $n$ using a $(Z, L)$ ideal cache. Then, the same algorithm incurs $Q(n; Z, L) \le 2Q^*(n; Z/2, L)$ cache misses on a $(Z, L)$ cache that uses LRU replacement. Proof. Sleator and Tarjan [30] have shown that the cache misses on a $(Z, L)$ cache using LRU replacement are $(Z/L)/((Z - Z^*)/L + 1)$-competitive with optimal replacement on a $(Z^*, L)$ ideal cache if both caches start empty. It follows that the nu...

537 | The input/output complexity of sorting and related problems - Aggarwal, Vitter - 1988 |
Citation Context: ...general DFT problem. 4. Funnelsort. Cache-oblivious algorithms, like the familiar two-way merge sort, are not optimal with respect to cache misses. The $Z$-way mergesort suggested by Aggarwal and Vitter [3] has optimal cache complexity, but although it apparently works well in practice [23], it is cache aware. This section describes a cache-oblivious sorting algorithm called “funnelsort.” This algorithm...

490 | A Study of Replacement Algorithms for a Virtual-Storage Computer - Belady - 1966 |
Citation Context: ...d anywhere in the cache. If the cache is full, a cache line must be evicted. The ideal cache uses the optimal off-line strategy of replacing the cache line whose next access is furthest in the future [7], and thus it exploits temporal locality perfectly. Unlike various other hierarchical-memory models [1, 2, 5, 8] in which algorithms are analyzed in terms of a single measure, the ideal-cache model us...

449 | FFTW: an adaptive software architecture for the FFT - Frigo, Johnson - 1998 |
Citation Context: ...thm for LU-decomposition with pivoting. For $n \times n$ matrices, Toledo’s algorithm uses $\Theta(n^3)$ work and incurs $\Theta(1 + n^2/L + n^3/(L\sqrt{Z}))$ cache misses. More recently, our group has produced an FFT library called FFTW [18], which in its most recent incarnation [17], employs a register-allocation and scheduling algorithm inspired by our cache-oblivious FFT algorithm. The general idea that divide-and-conquer enhances mem...

373 | Gaussian elimination is not optimal - Strassen - 1969 |
Citation Context: ...These results require the tall-cache assumption (1) for matrices stored in row-major layout format, but the assumption can be relaxed for certain other layouts. We also show that Strassen’s algorithm [31] for multiplying $n \times n$ matrices, which uses $\Theta(n^{\lg 7})$ work, incurs $\Theta(1 + n^2/L + n^{\lg 7}/(L\sqrt{Z}))$ cache misses. In [9] with others, two of the present authors analyzed an optimal divide-and-conquer algorithm for n...

319 | External memory algorithms and data structures: dealing with massive data - Vitter |
Citation Context: ...introduce parallelism, and they give algorithms for matrix multiplication, FFT, sorting, and other problems in both a two-level model [35] and several parallel hierarchical memory models [36]. Vitter [33] provides a comprehensive survey of external-memory algorithms. 8. Conclusion. The theoretical work presented in this paper opens two important avenues for future research. The first is to determine th...

234 | Algorithms for parallel memory I: Two level memories - Vitter, Shriver - 1994 |
Citation Context: .../O at different levels to proceed in parallel. Vitter and Shriver introduce parallelism, and they give algorithms for matrix multiplication, FFT, sorting, and other problems in both a two-level model [35] and several parallel hierarchical memory models [36]. Vitter [33] provides a comprehensive survey of external-memory algorithms. 8. Conclusion. The theoretical work presented in this paper opens two i...

167 | I/O complexity: the red-blue pebble game - Hong, Kung - 1981 |
Citation Context: ...atrices, the cache complexity $\Theta(n + n^2/L + n^3/(L\sqrt{Z}))$ of the REC-MULT algorithm is the same as the cache complexity of the cache-aware BLOCK-MULT algorithm and also matches the lower bound by Hong and Kung [21]. This lower bound holds for all algorithms that execute the $\Theta(n^3)$ operations given by the definition of matrix multiplication: $c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$. No tight lower bounds for the general problem of ma...

153 | A fast Fourier transform compiler - Frigo - 1999 |
Citation Context: ... $n \times n$ matrices, Toledo’s algorithm uses $\Theta(n^3)$ work and incurs $\Theta(1 + n^2/L + n^3/(L\sqrt{Z}))$ cache misses. More recently, our group has produced an FFT library called FFTW [18], which in its most recent incarnation [17], employs a register-allocation and scheduling algorithm inspired by our cache-oblivious FFT algorithm. The general idea that divide-and-conquer enhances memory locality has been known for a long time...

131 | An algorithm for the machine computation of the complex Fourier series - Cooley, Tukey - 1965 |
Citation Context: ...ing the discrete Fourier transform of a complex array of $n$ elements, where $n$ is an exact power of 2. The basic algorithm is the well-known “six-step” variant [6, 36] of the Cooley-Tukey FFT algorithm [13]. Using the cache-oblivious transposition algorithm, however, the FFT becomes cache-oblivious, and its performance matches the lower bound by Hong and Kung [21]. Recall that the discrete Fourier trans...

129 | FFTs in external or hierarchical memory - Bailey - 1990 |
Citation Context: ...which uses $O(mn)$ work and incurs $O(1 + mn/L)$ cache misses, which is optimal. Using matrix transposition as a subroutine, we convert a variant [36] of the “six-step” fast Fourier transform (FFT) algorithm [6] into an optimal cache-oblivious algorithm. This FFT algorithm uses $O(n \lg n)$ work and incurs $O(1 + (n/L)(1 + \log_Z n))$ cache misses. The problem of matrix transposition is defined as follows. Given an $m \times n$ matri...

128 | A model for hierarchical memory - Aggarwal, Alpern, et al. - 1987 |
Citation Context: ...ptimal off-line strategy of replacing the cache line whose next access is furthest in the future [7], and thus it exploits temporal locality perfectly. Unlike various other hierarchical-memory models [1, 2, 5, 8] in which algorithms are analyzed in terms of a single measure, the ideal-cache model uses two measures. An algorithm with an input of size $n$ is measured by its work complexity $W(n)$, its conventional r...

112 | The influence of caches on the performance of sorting - LaMarca, Ladner - 1997 |
Citation Context: ...-way merge sort, are not optimal with respect to cache misses. The $Z$-way mergesort suggested by Aggarwal and Vitter [3] has optimal cache complexity, but although it apparently works well in practice [23], it is cache aware. This section describes a cache-oblivious sorting algorithm called “funnelsort.” This algorithm has optimal... [Figure 3: Illustration of a $k$-merger.] A $k$-merger...

109 | Hierarchical memory with block transfer - Aggarwal, Chandra, et al. - 1987 |
Citation Context: ...ptimal off-line strategy of replacing the cache line whose next access is furthest in the future [7], and thus it exploits temporal locality perfectly. Unlike various other hierarchical-memory models [1, 2, 5, 8] in which algorithms are analyzed in terms of a single measure, the ideal-cache model uses two measures. An algorithm with an input of size $n$ is measured by its work complexity $W(n)$, its conventional r...

104 | An analysis of dag-consistent distributed shared-memory algorithms - Blumofe, Frigo, et al. - 1996 |
Citation Context: ...e assumption can be relaxed for certain other layouts. We also show that Strassen’s algorithm [31] for multiplying $n \times n$ matrices, which uses $\Theta(n^{\lg 7})$ work, incurs $\Theta(1 + n^2/L + n^{\lg 7}/(L\sqrt{Z}))$ cache misses. In [9] with others, two of the present authors analyzed an optimal divide-and-conquer algorithm for $n \times n$ matrix multiplication that contained no tuning parameters, but we did not study cache-obliviousness pe...

96 | Locality of reference in LU decomposition with partial pivoting - Toledo - 1997 |
Citation Context: ...l 1997. This matrix-multiplication algorithm, as well as a cache-oblivious algorithm for LU-decomposition without pivoting, eventually appeared in [9]. Shortly after leaving our research group, Toledo [32] independently proposed a cache-oblivious algorithm for LU-decomposition with pivoting. For $n \times n$ matrices, Toledo’s algorithm uses $\Theta(n^3)$ work and incurs $\Theta(1 + n^2/L + n^3/(L\sqrt{Z}))$ cache misses. More recently, ou...

78 | Cache-oblivious algorithms - Prokop - 1999 |
Citation Context: ...cation, matrix transpose, FFT, and sorting are optimal in multilevel models with explicit memory management. Proof. Their complexity bounds satisfy the regularity condition (14). It can also be shown [26] that cache-oblivious algorithms satisfying (14) are also optimal (in expectation) in the previously studied SUMH [5, 34] and HMM [1] models. Thus, all the algorithmic results in this paper apply to th...

76 | Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code - Frens, S - 1997 |
Citation Context: ... but no tuning parameter need be set, since submatrices of size $O(\sqrt{L}) \times O(\sqrt{L})$ are cache-obliviously stored on cache lines. The advantages of bit-interleaved and related layouts have been studied in [11, 12, 16]. One of the practical disadvantages of bit-interleaved layouts is that index calculations on conventional microprocessors can be costly, a deficiency we hope that processor architects will remedy. Fo...

75 | Fast Fourier transforms: a tutorial review and a state of the art - Duhamel, Vetterli - 1990 |
Citation Context: ...he array $Y$ given by $Y[i] = \sum_{j=0}^{n-1} X[j]\,\omega_n^{-ij}$ (9), where $\omega_n = e^{2\pi\sqrt{-1}/n}$ is a primitive $n$th root of unity and $0 \le i < n$. Many algorithms evaluate Equation (9) in $O(n \lg n)$ time for all integers $n$ [15]. In this paper, however, we assume that $n$ is an exact power of 2, and we compute Equation (9) according to the Cooley-Tukey algorithm, which works recursively as follows. In the base case where n = O(1...

73 | Nonlinear array layouts for hierarchical memory systems - Chatterjee, Jain, et al. - 1999 |
Citation Context: ... but no tuning parameter need be set, since submatrices of size $O(\sqrt{L}) \times O(\sqrt{L})$ are cache-obliviously stored on cache lines. The advantages of bit-interleaved and related layouts have been studied in [11, 12, 16]. One of the practical disadvantages of bit-interleaved layouts is that index calculations on conventional microprocessors can be costly, a deficiency we hope that processor architects will remedy. Fo...

65 | Algorithms for parallel memory II: Hierarchical multilevel memories. Algorithmica - Vitter, Shriver - 1994 |
Citation Context: ...rsive cache-oblivious algorithm for transposing an $m \times n$ matrix which uses $O(mn)$ work and incurs $O(1 + mn/L)$ cache misses, which is optimal. Using matrix transposition as a subroutine, we convert a variant [36] of the “six-step” fast Fourier transform (FFT) algorithm [6] into an optimal cache-oblivious algorithm. This FFT algorithm uses $O(n \lg n)$ work and incurs $O(1 + (n/L)(1 + \log_Z n))$ cache misses. The problem of m...

51 | Deterministic distribution sort in shared and distributed memory multiprocessors - Nodine, Vitter - 1993 |
Citation Context: ...from Section 4, the distribution-sorting algorithm uses $O(n \lg n)$ work to sort $n$ elements, and it incurs $O(1 + (n/L)(1 + \log_Z n))$ cache misses. Unlike previous cache-efficient distribution-sorting algorithms [1, 3, 25, 34, 36], which use sampling or other techniques to find the partitioning elements before the distribution step, our algorithm uses a “bucket splitting” technique to select pivots incrementally during the dis...

48 | Recursive array layout and fast parallel matrix multiplication - Chatterjee, Lebeck, et al. - 1999 |
Citation Context: ... but no tuning parameter need be set, since submatrices of size $O(\sqrt{L}) \times O(\sqrt{L})$ are cache-obliviously stored on cache lines. The advantages of bit-interleaved and related layouts have been studied in [11, 12, 16]. One of the practical disadvantages of bit-interleaved layouts is that index calculations on conventional microprocessors can be costly, a deficiency we hope that processor architects will remedy. Fo...

48 | Towards a theory of cache-efficient algorithms - Sen, Chatterjee |
Citation Context: ... [29]. Previous theoretical work on understanding hierarchical memories and the I/O-complexity of algorithms has been studied in cache-aware models lacking an automatic replacement strategy, although [10, 28] are recent exceptions. [Figure 4: Average time to transpose an $N \times N$ matrix, divided by $N^2$; iterative versus recursive, time in microseconds against $N$.] Hong and Kung...

42 | Uniform Memory Hierarchies - Alpern, Carter, et al. - 1990 |
Citation Context: ...ptimal off-line strategy of replacing the cache line whose next access is furthest in the future [7], and thus it exploits temporal locality perfectly. Unlike various other hierarchical-memory models [1, 2, 5, 8] in which algorithms are analyzed in terms of a single measure, the ideal-cache model uses two measures. An algorithm with an input of size $n$ is measured by its work complexity $W(n)$, its conventional r...

38 | Extending the Hong-Kung model to memory hierarchies - Savage - 1995 |
Citation Context: ...ve lower bounds on the I/O-complexity of matrix multiplication, FFT, and other problems. The red-blue pebble game models temporal locality using two levels of memory. The model was extended by Savage [27] for deeper memory hierarchies. Aggarwal and Vitter [3] introduced spatial locality and investigated a two-level memory in which a block of $P$ contiguous items can be transferred in one step. They obta...

37 | An Algorithm for Computing the Mixed Radix Fast Fourier Transform - Singleton - 1969 |
Citation Context: ... employs a register-allocation and scheduling algorithm inspired by our cache-oblivious FFT algorithm. The general idea that divide-and-conquer enhances memory locality has been known for a long time [29]. Previous theoretical work on understanding hierarchical memories and the I/O-complexity of algorithms has been studied in cache-aware models lacking an automatic replacement strategy, although [10, ...

25 | Large-scale sorting in uniform memory hierarchies - Vitter, Nodine - 1993 |
Citation Context: ...from Section 4, the distribution-sorting algorithm uses $O(n \lg n)$ work to sort $n$ elements, and it incurs $O(1 + (n/L)(1 + \log_Z n))$ cache misses. Unlike previous cache-efficient distribution-sorting algorithms [1, 3, 25, 34, 36], which use sampling or other techniques to find the partitioning elements before the distribution step, our algorithm uses a “bucket splitting” technique to select pivots incrementally during the dis...

11 | Towards an optimal bit-reversal permutation program - Carter, Gatlin - 1998 |
Citation Context: ... [29]. Previous theoretical work on understanding hierarchical memories and the I/O-complexity of algorithms has been studied in cache-aware models lacking an automatic replacement strategy, although [10, 28] are recent exceptions. [Figure 4: Average time to transpose an $N \times N$ matrix, divided by $N^2$; iterative versus recursive, time in microseconds against $N$.] Hong and Kung...

6 | On the algebraic complexity of functions - Winograd - 1970 |
Citation Context: ... $\sum_{k=1}^{n} a_{ik} b_{kj}$. No tight lower bounds for the general problem of matrix multiplication are known. By using an asymptotically faster algorithm, such as Strassen’s algorithm [31] or one of its variants [37], both the work and cache complexity can be reduced. When multiplying $n \times n$ matrices, Strassen’s algorithm, which is cache oblivious, requires only 7 recursive multiplications of $n/2 \times n/2$ matrices and a ...

3 | Efficient portability across memory hierarchies - Bilardi, Peserico - 2000 |

3 | The cache performance and optimizations of blocked algorithms - Lam, Rothberg, et al. - 1991 |
Citation Context: ...s, all the algorithmic results in this paper apply to these models, matching the best bounds previously achieved. Other simulation results can be shown. For example, by using the copying technique of [22], cache-oblivious algorithms for matrix multiplication and other problems can be designed that are provably optimal on direct-mapped caches. 7. Related work. In this section, we discuss the origin of th...