## Transforming Loops to Recursion for Multi-Level Memory Hierarchies (2000)

Venue: Proceedings of the SIGPLAN ’00 Conference on Programming Language Design and Implementation

Citations: 32 (4 self)

### BibTeX

@INPROCEEDINGS{Yi00transformingloops,
  author    = {Qing Yi and Vikram Adve and Ken Kennedy},
  title     = {Transforming Loops to Recursion for Multi-Level Memory Hierarchies},
  booktitle = {Proceedings of the SIGPLAN '00 Conference on Programming Language Design and Implementation},
  year      = {2000},
  pages     = {169--181}
}

### Abstract

Recently, there have been several experimental and theoretical results showing significant performance benefits of recursive algorithms on both multi-level memory hierarchies and on shared-memory systems. In particular, such algorithms have the data reuse characteristics of a blocked algorithm that is simultaneously blocked at many different levels. Most existing applications, however, are written using ordinary loops. We present a new compiler transformation that can be used to convert loop nests into recursive form automatically. We show that the algorithm is fast and effective, handling loop nests with arbitrary nesting and control flow. The transformation achieves substantial performance improvements for several linear algebra codes even on a current system with a two level cache hierarchy. As a side-effect of this work, we also develop an improved algorithm for transitive dependence analysis (a powerful technique used in the recursion transformation and other loop transformations) that ...
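The core transformation the abstract describes can be pictured with a small hand-written sketch (not the paper's compiler algorithm): a triply nested matrix-multiply loop is recast as a divide-and-conquer recursion that halves one loop range at a time until a small base case, where the original loop nest runs over a block. The `BASE` threshold and the largest-range splitting heuristic are illustrative assumptions:

```python
BASE = 2  # illustrative base-case size; a real compiler would tune this


def matmul_rec(A, B, C, i0, i1, j0, j1, k0, k1):
    """Recursively accumulate C[i0:i1][j0:j1] += A[i0:i1][k0:k1] * B[k0:k1][j0:j1]."""
    if i1 - i0 <= BASE and j1 - j0 <= BASE and k1 - k0 <= BASE:
        # Base case: the original loop nest, restricted to a small block.
        for i in range(i0, i1):
            for j in range(j0, j1):
                for k in range(k0, k1):
                    C[i][j] += A[i][k] * B[k][j]
        return
    # Recursive case: halve the largest remaining loop range.
    d = max(i1 - i0, j1 - j0, k1 - k0)
    if d == i1 - i0:
        m = (i0 + i1) // 2
        matmul_rec(A, B, C, i0, m, j0, j1, k0, k1)
        matmul_rec(A, B, C, m, i1, j0, j1, k0, k1)
    elif d == j1 - j0:
        m = (j0 + j1) // 2
        matmul_rec(A, B, C, i0, i1, j0, m, k0, k1)
        matmul_rec(A, B, C, i0, i1, m, j1, k0, k1)
    else:
        m = (k0 + k1) // 2
        matmul_rec(A, B, C, i0, i1, j0, j1, k0, m)
        matmul_rec(A, B, C, i0, i1, j0, j1, m, k1)
```

Because each recursion level works on a sub-block roughly half the size of its parent's, the computation behaves as if blocked for every cache level at once, which is the "simultaneously blocked at many different levels" property the abstract refers to.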

### Citations

9158 | Introduction to Algorithms
- Cormen, Leiserson, et al.
- 1998
Citation Context: ...cursive algorithms on uniprocessor cache hierarchies, on page-based software distributed shared memory systems, and for preserving locality while doing dynamic load-balancing in shared memory systems [21, 10]. Some algorithms they studied include FFT, matrix transpose, matrix multiplication, and sorting. Furthermore, they have shown that "cache-oblivious" divide-and-conquer algorithms provide asymptotical...

1378 | Program slicing
- Weiser
- 1984
Citation Context: ...e procedure for a particular key statement skey. The function uses a technique called "iteration space slicing" [28] to compute these iteration sets. This technique is analogous to "program slicing" [26], except that it operates on iteration sets (i.e., instances) of statements rather than entire statements. For a given set of iterations, I0, of a statement S0, iteration space slicing computes the ...
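The distinction drawn here, slicing over statement *instances* rather than whole statements, can be sketched with a toy backward slice. The `dep` mapping below is a made-up dependence (sink iteration j reads what source iteration j-1 wrote); a real implementation would derive it from dependence analysis and represent it as symbolic integer mappings:

```python
def backward_slice(I0, dep):
    """Iteration-space slice: the source-statement instances that must
    precede the given set I0 of sink-statement instances, where dep maps
    a sink iteration to the set of source iterations it depends on."""
    out = set()
    for j in I0:
        out |= dep(j)  # collect source iterations feeding sink iteration j
    return out


# Hypothetical dependence: S2 at iteration j reads A[j-1], written by S1 at i = j-1.
dep = lambda j: {j - 1} if j >= 1 else set()
print(sorted(backward_slice({3, 4, 5}, dep)))  # [2, 3, 4]
```

A slice of whole statements (classic program slicing) would have to include every iteration of S1; the instance-level slice keeps only the three iterations actually needed.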

748 | A data locality optimizing algorithm
- Wolf, Lam
- 1991
Citation Context: ...ely accessible to application programmers. A number of code transformations have been proposed for improving locality in programs, including blocking, loop fusion, loop interchange, and loop reversal [11, 30, 29, 17, 4, 18, 8, 25]. The recursion transformation (as used here) is essentially a form of blocking, with two key differences. First, it combines the effect of blocking at multiple different levels into a single transformat...

531 | The Cache Performance and Optimizations of Blocked Algorithms
- Lam, Rothberg, et al.
- 1991
Citation Context: ...rization. We observe, however, that the recursive versions suffer from similar problems with conflict misses as does blocking, and require similar strategies (e.g., buffer copying) to reduce such misses [17, 8, 12]. The next section describes our algorithm for the recursion transformation, assuming transitive dependence information exists. Section 3 describes our improved algorithm for transitive dependence ana...

308 | Improving data locality with loop transformations
- McKinley, Carr, et al.
- 1996
Citation Context: ...ely accessible to application programmers. A number of code transformations have been proposed for improving locality in programs, including blocking, loop fusion, loop interchange, and loop reversal [11, 30, 29, 17, 4, 18, 8, 25]. The recursion transformation (as used here) is essentially a form of blocking, with two key differences. First, it combines the effect of blocking at multiple different levels into a single transformat...

248 | Strategies for cache and local memory management by global program transformation
- Gannon, Jalby, et al.
- 1988
Citation Context: ...ely accessible to application programmers. A number of code transformations have been proposed for improving locality in programs, including blocking, loop fusion, loop interchange, and loop reversal [11, 30, 29, 17, 4, 18, 8, 25]. The recursion transformation (as used here) is essentially a form of blocking, with two key differences. First, it combines the effect of blocking at multiple different levels into a single transformat...

189 | More iteration space tiling
- Wolfe
- 1989

151 | Data-centric multi-level blocking
- Kodukula, Ahmed, et al.
- 1997
Citation Context: ...loop structure amenable to blocking [5]. Wolf and Lam [29] present a unified algorithm that selects such compound sequences of transformations directly using a model of reuse in loops. Kodukula et al. [16] proposed an alternative approach called data shackling, where a tiling transformation on a loop nest is described in terms of a tiling of key arrays in the loop nest (a data shackle). Multi-level blo...

121 | Recursion Leads to Automatic Variable Blocking for Dense Linear-Algebra Algorithms
- Gustavson
- 1997
Citation Context: ...t bear out these observations, showing significant performance benefits of recursive algorithms on both uniprocessor cache hierarchies and on shared-memory systems. In particular, Gustavson and Elmroth [13, 9] have demonstrated significant performance benefits from recursive versions of Cholesky and QR factorization, and Gaussian elimination with pivoting. For example, a single recursive version of Cholesky ...

108 | An analysis of dag-consistent distributed shared-memory algorithms
- Blumofe, Frigo, et al.
- 1996
Citation Context: ...cursive algorithms on uniprocessor cache hierarchies, on page-based software distributed shared memory systems, and for preserving locality while doing dynamic load-balancing in shared memory systems [21, 10]. Some algorithms they studied include FFT, matrix transpose, matrix multiplication, and sorting. Furthermore, they have shown that "cache-oblivious" divide-and-conquer algorithms provide asymptotical...

108 | New tiling techniques to improve cache temporal locality
- Song, Li
- 1999

93 | Improving memory hierarchy performance for irregular applications
- Mellor-Crummey, Whalley, et al.
- 2001
Citation Context: ...m this use of iteration space slicing. Finally, the automatic recursion transformation can play an important complementary role to several recursive data organizing techniques that have been proposed [7, 20]. For example, Chatterjee et al. show that recursive reordering of data produces significant performance benefits on modern memory hierarchies, and they argue that recursive control structures may be ne...

84 | The Omega Library interface guide
- Kelly, Maslov, et al.
- 1995
Citation Context: ...sitive dependence analysis algorithm described in section 3. All the iteration sets or functions described in this section are represented as symbolic integer sets or mappings using the Omega library [14]. 2.2 Overview of Algorithm. The recursion transformation algorithm is shown in Figure 2. To simplify the description of the algorithm, we initially ignore IF statements and loops with non-unit strides...

76 | Code generation for multiple mappings
- Kelly, Pugh, et al.
- 1995
Citation Context: ...eration sets for the statements in a loop nest, directly synthesizes a loop nest to enumerate exactly those instances of the statements while preserving the lexicographic order of statement instances [15]. The techniques we use are similar to those described for code generation in [1]. To generate code for the recursive calls, we divide the ranges of all recursive loops. For example, one simple choice...

74 | Nonlinear array layouts for hierarchical memory systems
- Chatterjee, Jain, et al.
Citation Context: ...m this use of iteration space slicing. Finally, the automatic recursion transformation can play an important complementary role to several recursive data organizing techniques that have been proposed [7, 20]. For example, Chatterjee et al. show that recursive reordering of data produces significant performance benefits on modern memory hierarchies, and they argue that recursive control structures may be ne...

62 | Itanium processor microarchitecture
- Sharangpani, Arora
- 2000
Citation Context: ...ressively deeper memory hierarchies to achieve high performance. For example, systems designed for the forthcoming Itanium processor are expected to use 3 levels of cache, two on chip and one off-chip [23]. Systems based on the IBM Power4 processor are also expected to use three levels of cache. Furthermore, in shared-memory multiprocessor systems, the memory shared between different processors effective...

58 | Using Integer Sets for Data-Parallel Program Analysis and Optimization
- Adve, Mellor-Crummey
- 1998
Citation Context: ...o enumerate exactly those instances of the statements while preserving the lexicographic order of statement instances [15]. The techniques we use are similar to those described for code generation in [1]. To generate code for the recursive calls, we divide the ranges of all recursive loops. For example, one simple choice is to divide each range by half. Given m recursive loops with bound parameters l...

52 | Applying recursion to serial and parallel QR factorization leads to better performance
- Elmroth, Gustavson
- 2000
Citation Context: ...t bear out these observations, showing significant performance benefits of recursive algorithms on both uniprocessor cache hierarchies and on shared-memory systems. In particular, Gustavson and Elmroth [13, 9] have demonstrated significant performance benefits from recursive versions of Cholesky and QR factorization, and Gaussian elimination with pivoting. For example, a single recursive version of Cholesky ...

41 | Hierarchical tiling for improved superscalar performance
- Carter, Ferrante, et al.
- 1995
Citation Context: ...tion, using an algorithm that is independent of loop structure. A key advantage of blocking over recursion is that much smaller block sizes can be used with blocking (including blocking for registers [6]), whereas recursion would incur high overhead for very small block sizes. This suggests that it might be beneficial to use blocking within the base-case code to achieve small block sizes, while using ...

37 | Blocking Linear Algebra Codes for Memory Hierarchies
- Carr, Kennedy
- 1989
Citation Context: ...ersions of LU and Cholesky. To implement either blocking or recursion for LU or Cholesky with pivoting, a compiler would need to recognize that row interchange and whole column update are commutative [3]. We are confident that, in a compiler with that analysis, conversion to recursive form would be possible for the pivoting versions. We also study one physical simulation application, Erlebacher (erle)...

28 | Compiler Blockability of Dense Matrix Factorizations
- Carr, Lehoucq
- 1997
Citation Context: ...ance of the compiler-generated recursive code with the original code for each benchmark, with one- and two-level blocked versions of mm, and with a one-level blocked version of lu that we adopted from [5]. In each group of bars, the bars are scaled relative to the tallest bar and the absolute value of the tallest bar is shown above it. (Note that different groups of bars in the same graph may be scaled...

27 | Architecture-cognizant divide and conquer algorithms
- Gatlin, Carter
- 1999
Citation Context: ...rization. We observe, however, that the recursive versions suffer from similar problems with conflict misses as does blocking, and require similar strategies (e.g., buffer copying) to reduce such misses [17, 8, 12]. The next section describes our algorithm for the recursion transformation, assuming transitive dependence information exists. Section 3 describes our improved algorithm for transitive dependence ana...

25 | Iteration space slicing for locality
- Pugh, Rosser
- 1999
Citation Context: ... an additional analysis step, as discussed in Section 4.) A key step in our algorithm is based on a loop transformation technique called iteration space slicing, recently described by Pugh and Rosser [27, 28]. Iteration space slicing uses transitive dependence analysis on the dependence graph to compute the instances of a particular statement that must precede or follow a given set of instances of another...

24 | Tile Size Selection Using Cache Organization and Data Layout
- Coleman, McKinley
- 1995
Citation Context: ...rization. We observe, however, that the recursive versions suffer from similar problems with conflict misses as does blocking, and require similar strategies (e.g., buffer copying) to reduce such misses [17, 8, 12]. The next section describes our algorithm for the recursion transformation, assuming transitive dependence information exists. Section 3 describes our improved algorithm for transitive dependence ana...

18 | Space-limited procedures: a methodology for portable highperformance. In: Conference on programming models for massively parallel computers
- Alpern, Carter, et al.
- 1995
Citation Context: ...s and effective parallelization for shared memory systems. As described in the Introduction, researchers have applied recursion by hand for both single-processor and shared-memory multiprocessor codes [13, 21, 10, 12, 2]. The variety of experimental benefits these studies have demonstrated, as well as the theoretical results of Frigo et al. [10], provide strong motivation for developing compiler support to make this t...

16 | A Study of Instruction Cache Organizations and Replacement Policies
- Smith, Goodman
- 1983
Citation Context: ...erforms worse than the 2-way associative cache. This can be seen in the L1 cache for the original and one-level blocked versions of mm. This phenomenon is a known defect of the LRU replacement policy [24]. It happens because a row of matrix C and a row of matrix A together just exceed the cache size, so that each element of C is evicted from the fully associative cache just before it would be reused. ...

15 | Iteration space slicing and its application to communication optimization
- Pugh, Rosser
- 1997
Citation Context: ... an additional analysis step, as discussed in Section 4.) A key step in our algorithm is based on a loop transformation technique called iteration space slicing, recently described by Pugh and Rosser [27, 28]. Iteration space slicing uses transitive dependence analysis on the dependence graph to compute the instances of a particular statement that must precede or follow a given set of instances of another...

13 | Fine-grained analysis of array computations
- Rosser
- 1998
Citation Context: ...e dependence analysis is a path summary problem instead of simply a reachability problem on directed graphs. Previous work used symbolic integer sets to represent and propagate transitive dependences [22], and an adapted Floyd-Warshall algorithm to solve the all-pairs path summary problem up front. Because integer set operations are costly, and the adapted Floyd-Warshall algorithm has O(N^3) complexi...
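The all-pairs path-summary computation mentioned in this snippet can be pictured with a plain Floyd-Warshall transitive closure over a statement-level dependence graph. The cited work propagates symbolic integer mappings along paths; this boolean version is an illustrative simplification that only answers "does a transitively depend on b":

```python
def transitive_deps(n, edges):
    """Boolean Floyd-Warshall closure: reach[a][b] is True iff statement a
    transitively depends on statement b via the given dependence edges."""
    reach = [[False] * n for _ in range(n)]
    for a, b in edges:
        reach[a][b] = True
    # O(n^3) closure: try routing every pair (a, b) through intermediate k.
    for k in range(n):
        for a in range(n):
            for b in range(n):
                reach[a][b] = reach[a][b] or (reach[a][k] and reach[k][b])
    return reach
```

Replacing the booleans with symbolic dependence relations (and boolean `or`/`and` with relation union/composition) turns this reachability computation into the path-summary problem described above, which is why each step becomes a costly integer set operation.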

4 | Improving the ratio of memory operations to operations in loops
- Carr, Kennedy

1 | MHSIM: A Con Simulator for Multi-level Memory Hierarchies
- Mellor-Crummy, Whalley
Citation Context: ...ent base sizes (512/i for i = 1 to 7). Figure 10: Results from simulation of 2-way and fully-associative caches. 4.3 Cache Simulation. We used the Memory Hierarchy Simulator (MHSIM) from Rice University [19] to study the cache performance of the recursive and blocked codes, focusing on mm and lu. In order to study cache conflicts, we compared a two-way associative cache with a fully associative one of the...