## Cache oblivious algorithms (2003)

Venue: Algorithms for Memory Hierarchies, LNCS 2625

Citations: 14 (0 self)

### BibTeX

@INPROCEEDINGS{Kumar03cacheoblivious,

author = {Piyush Kumar},

title = {Cache oblivious algorithms},

booktitle = {Algorithms for Memory Hierarchies, LNCS 2625},

year = {2003},

pages = {193--212},

publisher = {Springer-Verlag}

}

### Abstract

The cache oblivious model is a simple and elegant model to design algorithms that perform well in hierarchical memory models ubiquitous on current systems. This model was first formulated in [22] and has since been a topic of intense research. Analyzing and designing algorithms and data structures in this model involves not only an asymptotic analysis of the number of steps executed in terms of the input size, but also the movement of data optimally among the different levels of the memory hierarchy. This chapter is intended as an introduction to the “ideal-cache” model of [22] and techniques used to design cache oblivious algorithms. The chapter also presents some experimental insights and results. Part of this work was done while the author was visiting MPI-Saarbrücken.

### Citations

8996 |
Introduction to Algorithms
- Cormen, Leiserson, et al.
- 2001
Citation Context ...ory access. The Random Access Model (RAM) in which we do analysis of algorithms today does not take into account differences in speeds of random access of memory depending upon the locality of access [18]. Although there exist models which can deal with multi-level memory hierarchies, they are quite complicated to use [1, 3, 6, 5, 2, 25, 33, 34]. It seems there is a trade-... |

4280 |
Computer Architecture, A Quantitative Approach
- Hennessy, Patterson
- 2003
Citation Context ...mory access is often categorized into two different types, code reusing recently accessed locations (temporal) and code referencing data items that are close to recently accessed data items (spatial) [24]. Caches use both temporal and spatial locality to improve speed. Surprisingly many things can be categorized as caches, for example, registers, L1, L2, TLB, Memory, Disk, Tape etc. (Chapter ??). The w... |

830 |
Matrix multiplication via arithmetic progressions
- Coppersmith, Winograd
- 1990
Citation Context ...j In 1969, Strassen surprised the world by showing an upper bound of $O(N^{\log_2 7})$ [36] using a divide and conquer algorithm. This bound was later improved and the best upper bound today is $O(N^{2.376})$ [17]. We will use Strassen’s algorithm as given in [18] for the cache oblivious analysis. The algorithm breaks the three matrices x, y, z into four submatrices of size $\frac{N}{2} \times \frac{N}{2}$, rewriting the equation z... |

747 |
Amortized efficiency of list update and paging rules
- Sleator, Tarjan
- 1985
Citation Context ...consider the optimal replacement policy. The most commonly used replacement policy is LRU (least recently used). In [22] the following lemma, whose proof is omitted here, was proved using a result of [35]. Lemma 1 ([22]). An algorithm that causes $Q^*(n, M, B)$ cache misses on a problem of size n using an (M, B)-ideal cache incurs $Q(n, M, B) \le 2Q^*(n, \frac{M}{2}, B)$ cache misses on an (M, B) cache that uses LRU... |
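To make the (M, B) cache with LRU replacement in Lemma 1 concrete, here is a minimal Python sketch of a miss counter. It is an illustration only, not code from the chapter: the function name `lru_misses` and the choice of a fully associative cache of `M // B` blocks are assumptions made for this example.

```python
from collections import OrderedDict


def lru_misses(addresses, M, B):
    """Count misses for a fully associative LRU cache of M words in blocks of B words.

    Hypothetical helper for illustration; M and B follow the (M, B) notation above.
    """
    capacity = M // B          # number of blocks the cache can hold
    cache = OrderedDict()      # keys are block ids, ordered from least to most recent
    misses = 0
    for addr in addresses:
        block = addr // B
        if block in cache:
            cache.move_to_end(block)       # hit: mark block as most recently used
        else:
            misses += 1                    # miss: load the block
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return misses


# A scan of N contiguous words starting at a block boundary touches N / B blocks.
print(lru_misses(range(1000), 64, 8))  # 125
```

Feeding the same access trace to this counter with different values of M and B is one way to observe how a single algorithm behaves across cache sizes it was never tuned for.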

557 | Concrete Mathematics - Graham, Knuth, et al. - 1994 |

554 |
The input/output complexity of sorting and related problems
- Aggarwal, Vitter
- 1988
Citation Context ...ve been other attempts to capture this hierarchical information the cache oblivious model seems to be one of the most simple and elegant ones. The cache oblivious model is a two level model (like the [4] model that has been used in the other chapters so far) but with the assumption that the parameters M, B are unknown to the algorithm (see Figure 1). It can work efficiently on most machines with multi... |

389 |
Gaussian Elimination is Not Optimal
- Strassen
- 1969
Citation Context ...d we wish to compute their product z, i.e. there are $N^2$ outputs where the (i, j)’th output is $z_{i,j} = \sum_{k=1}^{N} x_{i,k} \cdot y_{k,j}$. In 1969, Strassen surprised the world by showing an upper bound of $O(N^{\log_2 7})$ [36] using a divide and conquer algorithm. This bound was later improved and the best upper bound today is $O(N^{2.376})$ [17]. We will use Strassen’s algorithm as given in [18] for the cache oblivious analy... |

178 | A comparison of sorting algorithms for the Connection Machine CM-2
- Blelloch, Leiserson, et al.
- 1991
Citation Context ...The function Sample(X, k) in step 2 takes as input an array X and the size of the sample that it needs to return as output. A very similar sampling is needed in Sample Sort [11] to select splitters. We will use the same algorithm here. What we need is a random sample such that the $c_i$’s do not vary too much from $\sqrt{N}$ with high probability. To do this, draw a random sample of β... |

178 | External-memory graph algorithms
- Chiang, Goodrich, et al.
- 1995
Citation Context ...ber of cache misses of scan and sorting functions done by an optimal cache oblivious implementation. (Table columns: Data Structure/Algorithm; Cache Complexity; Operations.) Array Reversal: scan(N). List Ranking [16]: sort(N). LU Decomposition [38]: $\Theta(1 + \frac{N^2}{B} + \frac{N^3}{B\sqrt{M}})$, on N × N matrices. FFT [22]: sort(N). B-Trees [10, 13]: amortized $O(\log_B N)$, insertions/deletions. Priority Queues [8,... |

177 |
Tight Bounds on the Complexity of Parallel Sorting
- Leighton
- 1985
Citation Context ... the recurrence: $Q(N) \le \begin{cases} O(1 + \frac{N}{B}) & \text{if } N \le \alpha M \\ 2\sqrt{N}\,Q(\sqrt{N}) + Q(\sqrt{N}\log N) + O(1 + \frac{N}{B}) & \text{otherwise} \end{cases}$ (6), which solves to $Q(N) \le O(\frac{N}{B} \log_{M/B} \frac{N}{B})$. Exercise 9: A sorting algorithm called columnsort [30] was shown to be practical for large inputs [15]. The cache complexity of column sort can be defined by the following recurrence: $Q(N) = \begin{cases} O(1 + \frac{N}{B}) & \text{if } N \le \alpha M \\ 8N^{1/4}\,Q(N^{3/4}) + O(1 + \frac{N}{B}) & \text{otherwise} \end{cases}$... |

175 | I/O-complexity: The red-blue pebble game - Hong, Kung - 1981 |

168 | A fast Fourier transform compiler
- Frigo
- 1999
Citation Context ...her constant. Can one really afford to hide these constants in the design of a cache oblivious algorithm in real code? Despite these limitations the model does perform very well for some applications [14, 28, 20], but might be outperformed by more coding effort combined with cache aware algorithms [32, 14, 28, 20]. Here’s an excerpt from an experimental paper by Chatterjee and Sen [14]. Our major conclusion... |

150 |
Seminumerical Algorithms, volume 2 of The Art of Computer Programming
- Knuth
- 1981
Citation Context ...ndom subset of size $n^{\epsilon}$ from an array of size n without incurring too many cache misses. This could be done in work complexity $O(n^{\epsilon})$ and cache complexity $O(\min(1 + n^{\epsilon}, \frac{N}{B}))$. Use algorithm S of Knuth [27], Sec 3.4.2 with N = n, n = $n^{\epsilon}$. By using the HyperGeometric distribution, a method mentioned as an exercise in Knuth’s book is cache oblivious for sampling without replacement. We can sample with o... |
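The selection sampling technique referred to above (Knuth's Algorithm S) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the chapter's code: the function name `selection_sample` and the optional `rng` parameter are hypothetical. The point of interest is that the input is read in one left-to-right pass, with no knowledge of M or B.

```python
import random


def selection_sample(items, n, rng=random):
    """Pick n of the N items uniformly at random in one sequential pass (Algorithm S).

    Hypothetical helper for illustration; the sample keeps the input order.
    """
    N = len(items)
    if not 0 <= n <= N:
        raise ValueError("sample size must be between 0 and len(items)")
    chosen = []
    for t, item in enumerate(items):
        if len(chosen) == n:
            break  # sample complete, no need to read further
        # Select the current item with probability (n - m) / (N - t),
        # where m items have been chosen among the first t examined.
        if (N - t) * rng.random() < n - len(chosen):
            chosen.append(item)
    return chosen


sample = selection_sample(list(range(100)), 10, random.Random(0))
print(len(sample))  # 10
```

When the number of items still needed equals the number of items left, the selection test always succeeds, so the function returns exactly n items.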

143 | Cache-oblivious B-trees
- Bender, Demaine, et al.
Citation Context ...structure or algorithm [18]. In this section we take an example to illustrate how amortization could help in design and analysis of a cache oblivious data structure called the packed memory structure [9]. This structure can help one maintain a dynamic van Emde Boas layout (section 6) of a strongly weight balanced search tree [9]. This structure can also be used as a cache oblivious linked list data s... |

133 |
A model for hierarchical memory
- Aggarwal, Alpern, et al.
- 1987
Citation Context ...cache assumption. The tall cache assumption states that $M = \Omega(B^2)$, which is usually true in practice. It is notable that regular optimal cache oblivious algorithms are also optimal in SUMH [5] and HMM [1] models. Recently, compiler support for cache oblivious type algorithms has also been looked into [39, 40]. 3 Algorithm design tools. In cache oblivious algorithm design some algorithm des... |

121 | The influence of caches on the performance of sorting
- LaMarca, Ladner
- 1997
Citation Context ...ion we outline some theory and experimentation related to sorting in the cache oblivious model. Some excellent references for reading more on the influence of caches on the performance of sorting are [29, 31], Chapter ??. 7.1 Randomized distribution sorting. There are two optimal sorting algorithms known, funnel sort and distribution sort. Funnel sort is derived from merge sort and distribution sort is a g... |

117 | A uniform memory hierarchy model of computation
- Alpern, Carter, et al.
- 1994
Citation Context ...gh probability, the expected cache complexity follows the recurrence: $Q(N) \le \begin{cases} O(1 + \frac{N}{B}) & \text{if } N \le \alpha M \\ 2\sqrt{N}\,Q(\sqrt{N}) + Q(\sqrt{N}\log N) + O(1 + \frac{N}{B}) & \text{otherwise} \end{cases}$ (6), which solves to $Q(N) \le O(\frac{N}{B} \log_{M/B} \frac{N}{B})$. Exercise 9: A sorting algorithm called columnsort [30] was shown to be practical for large inputs [15]. The cache complexity of column sort can be defined by the following recurrence: Q(N) = { O(1 + N ... |

113 |
Hierarchical memory with block transfer
- Aggarwal, Chandra, et al.
- 1987
Citation Context ... of N × P size. There are three cases. Case I: $\max\{N, P\} \le \alpha B$. In this case, $Q(N, P) \le \frac{NP}{B} + O(1)$. Case II: $N \le \alpha B < P$. In this case, $Q(N, P) \le \begin{cases} O(1 + N) & \text{if } \frac{\alpha B}{2} \le P \le \alpha B \\ 2Q(N, P/2) + O(1) & \text{if } N \le \alpha B < P \end{cases}$ (3). Case III: $P \le \alpha B < N$. Analogous to Case II. Fig. 6. Same experiment as in Figure 5 but now P = 1000. Fig. 7. The graph compares a simple for loop implementation with a blocked cache obliviou... |
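The recurrence in Case II (halving the longer dimension P until the subproblem is small) corresponds to a divide and conquer transposition. The Python sketch below shows one such recursion; it is an assumption-laden illustration rather than the chapter's implementation. The names `transpose_into` and `transposed` and the `base` cutoff are hypothetical, and Python lists of lists are not contiguous in memory, so only the recursion structure carries over, not the cache behavior.

```python
def transpose_into(a, b, r0, r1, c0, c1, base=4):
    """Write the transpose of the block a[r0:r1][c0:c1] into b by halving the longer side.

    Hypothetical helper for illustration; `base` is an arbitrary cutoff for the base case.
    """
    rows, cols = r1 - r0, c1 - c0
    if rows <= base and cols <= base:
        # Base case: the block is small, copy it element by element.
        for i in range(r0, r1):
            for j in range(c0, c1):
                b[j][i] = a[i][j]
    elif rows >= cols:
        mid = r0 + rows // 2  # split the row range in half
        transpose_into(a, b, r0, mid, c0, c1, base)
        transpose_into(a, b, mid, r1, c0, c1, base)
    else:
        mid = c0 + cols // 2  # split the column range in half
        transpose_into(a, b, r0, r1, c0, mid, base)
        transpose_into(a, b, r0, r1, mid, c1, base)


def transposed(a):
    """Return the P x N transpose of a non-empty N x P list of lists."""
    n, p = len(a), len(a[0])
    b = [[None] * n for _ in range(p)]
    transpose_into(a, b, 0, n, 0, p)
    return b


matrix = [[i * 7 + j for j in range(7)] for i in range(5)]
print(transposed(matrix) == [list(col) for col in zip(*matrix)])  # True
```

Enlarging the base case so that several elements are moved at once is a common practical adjustment; the cutoff of 4 here is arbitrary.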

97 | Locality of reference in LU decomposition with partial pivoting
- Toledo
- 1997
Citation Context ...g functions done by an optimal cache oblivious implementation. (Table columns: Data Structure/Algorithm; Cache Complexity; Operations.) Array Reversal: scan(N). List Ranking [16]: sort(N). LU Decomposition [38]: $\Theta(1 + \frac{N^2}{B} + \frac{N^3}{B\sqrt{M}})$, on N × N matrices. FFT [22]: sort(N). B-Trees [10, 13]: amortized $O(\log_B N)$, insertions/deletions. Priority Queues [8, 12]: $O(\frac{1}{B} \log_{M/B} \ldots)$, insertions/deletions/deletemin. 10 Open problems. S... |

82 | Cache interference phenomena
- Temam, Fricker, et al.
- 1994
Citation Context ... objects map to the same location in the cache and are referenced in temporal proximity, the accesses will become costlier than they are assumed in the model (also known as cache interference problem [37]). Also, k-way set associative caches are implemented by using more comparators. (See Chapter ??) Instruction/Unified Caches: Does not deal with the issue of instruction caches. Rarely executed, speci... |

76 | A locality-preserving cache-oblivious dynamic dictionary
- Bender, Duan, et al.
- 2002
Citation Context ...Boas layout (section 6) of a strongly weight balanced search tree [9]. This structure can also be used as a cache oblivious linked list data structure and has been used to design dynamic dictionaries [10]. Our description mostly follows [9]. In the packed memory problem, also known as ordered file maintenance problem, we want to store N elements in an array of size |A| = O(N). Note that |A| is not N b... |

69 | Cache-oblivious priority queue and graph algorithm applications
- Arge, Bender, et al.
- 2002
Citation Context ...16]: sort(N). LU Decomposition [38]: $\Theta(1 + \frac{N^2}{B} + \frac{N^3}{B\sqrt{M}})$, on N × N matrices. FFT [22]: sort(N). B-Trees [10, 13]: amortized $O(\log_B N)$, insertions/deletions. Priority Queues [8, 12]: $O(\frac{1}{B} \log_{M/B} \ldots)$, insertions/deletions/deletemin. 10 Open problems. Sorting strings in the cache oblivious model is still open. Optimal shortest paths and minimum spanning forests still need to be explored in the model. Optimal simple ... |

66 | Cache oblivious search trees via binary trees of small height
- Brodal, Fagerberg, et al.
- 2002
Citation Context ...(Table columns: Data Structure/Algorithm; Cache Complexity; Operations.) Array Reversal: scan(N). List Ranking [16]: sort(N). LU Decomposition [38]: $\Theta(1 + \frac{N^2}{B} + \frac{N^3}{B\sqrt{M}})$, on N × N matrices. FFT [22]: sort(N). B-Trees [10, 13]: amortized $O(\log_B N)$, insertions/deletions. Priority Queues [8, 12]: $O(\frac{1}{B} \log_{M/B} \ldots)$, insertions/deletions/deletemin. 10 Open problems. Sorting strings in the cache oblivious model is still open. Optimal sho... |

64 | AlphaSort: a RISC machine sort
- Nyberg, Barclay, et al.
- 1994
Citation Context ...ion we outline some theory and experimentation related to sorting in the cache oblivious model. Some excellent references for reading more on the influence of caches on the performance of sorting are [29, 31], Chapter ??. 7.1 Randomized distribution sorting. There are two optimal sorting algorithms known, funnel sort and distribution sort. Funnel sort is derived from merge sort and distribution sort is a g... |

54 | Towards a theory of cache-efficient algorithms - Sen, Chatterjee, et al. |

50 |
A sparse table implementation of priority queues
- Itai, Konheim, et al.
- 1981
Citation Context ...noted by d(u) is equal to the number of elements stored in the subarray divided by the size of the subarray. Associated with each subarray is a density threshold. This thresholding is very similar to [26] except that [9] also uses a lower bound threshold. So this means that d(u) for each subarray is bounded between an interval depending on the height of the node u in the tree. We will describe the dat... |

41 |
Uniform memory hierarchies
- Alpern, Carter, et al.
- 1990
Citation Context ...uire a tall cache assumption. The tall cache assumption states that $M = \Omega(B^2)$, which is usually true in practice. It is notable that regular optimal cache oblivious algorithms are also optimal in SUMH [5] and HMM [1] models. Recently, compiler support for cache oblivious type algorithms has also been looked into [39, 40]. 3 Algorithm design tools. In cache oblivious algorithm design some algorithm des... |

40 | Extending the Hong-Kung Model to Memory Hierarchies - Savage - 1995 |

35 | Funnel heap — a cache oblivious priority queue
- Brodal, Fagerberg
- 2002
Citation Context ...16]: sort(N). LU Decomposition [38]: $\Theta(1 + \frac{N^2}{B} + \frac{N^3}{B\sqrt{M}})$, on N × N matrices. FFT [22]: sort(N). B-Trees [10, 13]: amortized $O(\log_B N)$, insertions/deletions. Priority Queues [8, 12]: $O(\frac{1}{B} \log_{M/B} \ldots)$, insertions/deletions/deletemin. 10 Open problems. Sorting strings in the cache oblivious model is still open. Optimal shortest paths and minimum spanning forests still need to be explored in the model. Optimal simple ... |

32 |
Optimised predecessor data structures for internal memory
- Rahman, Cole, et al.
- 2001
Citation Context ...rithm in real code? Despite these limitations the model does perform very well for some applications [14, 28, 20], but might be outperformed by more coding effort combined with cache aware algorithms [32, 14, 28, 20]. Here’s an excerpt from an experimental paper by Chatterjee and Sen [14]. Our major conclusions are as follows: Limited associativity in the mapping from main memory addresses to cache sets can sign... |

32 | Transforming loops to recursion for multi-level memory hierarchies
- Yi, Adve, et al.
- 2000
Citation Context ... It is notable that regular optimal cache oblivious algorithms are also optimal in SUMH [5] and HMM [1] models. Recently, compiler support for cache oblivious type algorithms has also been looked into [39, 40]. 3 Algorithm design tools. In cache oblivious algorithm design some algorithm design techniques are used ubiquitously. One of them is a scan of an array which is laid out in contiguous memory. Irrespe... |

27 | Ahnentafel Indexing into Morton-ordered Arrays, or Matrix Locality for Free
- Wise
- 2001
Citation Context ... It is notable that regular optimal cache oblivious algorithms are also optimal in SUMH [5] and HMM [1] models. Recently, compiler support for cache oblivious type algorithms has also been looked into [39, 40]. 3 Algorithm design tools. In cache oblivious algorithm design some algorithm design techniques are used ubiquitously. One of them is a scan of an array which is laid out in contiguous memory. Irrespe... |

26 | Cache-Efficient Matrix Transposition
- Chatterjee, Sen
Citation Context ...are not so impressive except Figure 7. The algorithm given here is not the best for matrix transposition. If the reader is interested in practicality of matrix transposition, excellent references are [14, 39]. Also note that if one expands the base case (instead of moving single elements around, one moves multiple elements) the performance improves. Changing the environment (OS, Processor etc.) has a sign... |

23 | A comparison of cache aware and cache oblivious static search trees using program instrumentation, in: Experimental Algorithmics: From Algorithm Design to Robust and Efficient Software
- Ladner, Fortna, et al.
Citation Context ...est way to test the speed up given by cache oblivious layouts. For more detailed experimental results on comparing searching in cache aware and cache oblivious search trees, the reader is referred to [28]. There is a big difference between the graphs reported here for searching and in [28]. One of the reasons might be that the size of the nodes were fixed to be 4 bytes in [28] whereas the experiments ... |

22 | Columnsort lives! An efficient out-of-core sorting program
- Chaudhry, Cormen, et al.
- 2001
Citation Context ...N) + $Q(\sqrt{N}\log N) + O(1 + \frac{N}{B})$ otherwise (the base case applies if $N \le \alpha M$) (6), which solves to $Q(N) \le O(\frac{N}{B} \log_{M/B} \frac{N}{B})$. Exercise 9: A sorting algorithm called columnsort [30] was shown to be practical for large inputs [15]. The cache complexity of column sort can be defined by the following recurrence: $Q(N) = \begin{cases} O(1 + \frac{N}{B}) & \text{if } N \le \alpha M \\ 8N^{1/4}\,Q(N^{3/4}) + O(1 + \frac{N}{B}) & \text{otherwise} \end{cases}$ where α is a sufficiently small constant. Note... |

20 |
Virtual memory algorithms
- Aggarwal, Chandra
- 1988
Citation Context ...lify the recurrence $Q(N) = \frac{1}{N} \sum_{i=1}^{N-1} (Q(i) + Q(N - i)) + (1 + \lceil \frac{N}{B} \rceil)$ (1) to $Q(N) = \frac{2}{N} \left[ \sum_{i=1}^{N-1} Q(i) \right] + \Theta(1 + \frac{N}{B})$ (2) and then you can take help from [23]. Exercise 3: Prove that mergesort also does $O(\frac{N}{B} \log_2 \frac{N}{B})$ cache misses. As we said earlier, the number of cache misses randomized quicksort makes is not optimal. The sorting lower bound in the cache oblivious model is also the same as the exter... |

17 | Portable High-Performance Programs
- Frigo
- 1999
Citation Context ...ween them as a normal dual processor system. Write-through caches: L1 caches in many new CPUs are write through, i.e. they transmit a written value to the L2 cache immediately [21, 24]. Write through caches are simpler to manage and can always discard cache data without any bookkeeping (read misses cannot result in writes). With write through caches (e.g. DECStation 3100, Intel It... |

16 |
Cache oblivious algorithms
- Frigo, Leiserson, et al.
- 1999
Citation Context ...bstract. The cache oblivious model is a simple and elegant model to design algorithms that perform well in hierarchical memory models ubiquitous on current systems. This model was first formulated in [22] and has since been a topic of intense research. Analyzing and designing algorithms and data structures in this model involves not only an asymptotic analysis of the number of steps executed in terms ... |

15 | On computing Voronoi diagrams by divide-prune-and-conquer
- Amato, Ramos
- 1996
Citation Context ... lot of seemingly simple algorithms that were based on this paradigm are already cache oblivious! For instance, Strassen’s matrix multiplication, quicksort, mergesort, closest pair [18], convex hulls [7], median selection [18] are all algorithms that are cache oblivious, though not all of them are optimal in this model. This means that they might be cache oblivious but can be modified to make fewer c... |

13 | Matrix multiplication: a case study of algorithm engineering
- Eiron, Rodeh, et al.
- 1998
Citation Context ...$\frac{N^2}{B} + \frac{N^3}{B\sqrt{M}})$ (5) ...st case of the implementation is $O(N + \frac{N^2}{B} + \frac{N^3}{B\sqrt{M}})$. The experimental results are shown in Figure 8. An excellent reference for practical matrix multiplication results is [19]. Exercise 7: Implement and analyze a matrix multiplication routine for N × N matrices. Fig. 8. Blocked cache oblivious matrix multiplication compared with simple for loop... |