## Towards optimal multi-level tiling for stencil computations (2007)

Venue: | 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS) |

Citations: | 11 - 0 self |

### BibTeX

@INPROCEEDINGS{Renganarayana07towardsoptimal,

author = {Lakshminarayanan Renganarayana and Manjukumar Harthikote-Matha and Rinku Dewri and Sanjay Rajopadhye},

title = {Towards optimal multi-level tiling for stencil computations},

booktitle = {21st IEEE International Parallel and Distributed Processing Symposium (IPDPS)},

year = {2007}

}

### Abstract

Stencil computations form the performance-critical core of many applications. Tiling and parallelization are two important optimizations to speed up stencil computations. Many tiling and parallelization strategies are applicable to a given stencil computation. The best strategy depends not only on the combination of the two techniques, but also on many parameters: tile and loop sizes in each dimension; computation-communication balance of the code; processor architecture; message startup costs; etc. The best choices can only be determined through design-space exploration, which is extremely tedious and error-prone to do via exhaustive experimentation. We characterize the space of multi-level tilings and parallelizations for 2D/3D Gauss-Seidel stencil computations. A systematic exploration of a part of this space enabled us to derive a design which is up to a factor of two faster than the standard implementation.
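To make the abstract's terms concrete, the following is a minimal sketch of what tiling a Gauss-Seidel stencil means at a single level. It is not the multi-level scheme of the paper: it assumes a 2D, in-place, 5-point update, tiles only the two spatial loops (not the time loop), and runs sequentially. The function names and the `tile_i`/`tile_j` parameters are hypothetical, chosen for illustration. Rectangular tiles are legal here because every dependence within a sweep points towards increasing `i` or `j`, so the tiled loop nest computes exactly the same values as the untiled one.

```python
def gauss_seidel_reference(grid, steps):
    """Untiled in-place 5-point Gauss-Seidel sweeps over the grid interior."""
    n, m = len(grid), len(grid[0])
    for _ in range(steps):
        for i in range(1, n - 1):
            for j in range(1, m - 1):
                grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                     + grid[i][j - 1] + grid[i][j + 1])
    return grid


def gauss_seidel_tiled(grid, steps, tile_i, tile_j):
    """Same sweeps with the spatial loops split into tile_i x tile_j tiles.

    Tiles are visited in lexicographic order, and points inside a tile are
    visited in the original order, so all dependences are preserved.
    """
    n, m = len(grid), len(grid[0])
    for _ in range(steps):
        for ii in range(1, n - 1, tile_i):        # first row of the tile
            for jj in range(1, m - 1, tile_j):    # first column of the tile
                for i in range(ii, min(ii + tile_i, n - 1)):
                    for j in range(jj, min(jj + tile_j, m - 1)):
                        grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                             + grid[i][j - 1] + grid[i][j + 1])
    return grid
```

Choosing `tile_i` and `tile_j` is exactly the kind of parameter the abstract lists; this sketch only fixes the loop structure and leaves the choice of sizes open.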

### Citations

4011 | Convex Optimization
- Boyd, Vandenberghe
- 2004
Citation Context ... refer the interested reader to [22] which shows the use of GPs to solve optimal tiling problems. Geometric programs can be transformed into convex optimization problems using a variable substitution [3] and solved efficiently using polynomial time interior point methods [13]. Integer solutions can be found by using a branch-and-bound algorithm. We use YALMIP [15], a tool that provides a high-level symbolic interface in MATLAB to define and solve GPs for integer soluti... |

729 | A data locality optimizing algorithm
- Wolf, Lam
- 1991
Citation Context ... implementations for each combination of tiling and parallelization scheme and experiment with them to find a good one, or to even eliminate the obviously poor ones. There have been extensive studies [27, 23, 14, 30, 6, 11] on tiling stencil computations for locality. Schemes for tiling stencil computations for parallelism can be classified based on whether or not they tile the outermost time loop. The commonly used dat... |

392 | Automatically tuned linear algebra software
- Whaley, Dongarra
- 1998
Citation Context ... issues will enable us to develop high performance, multi-version, platform specific implementations of stencil computations. As an analogy, consider the matrix multiplication code generated by ATLAS [26]. The generated final code has different versions for different shapes of matrices, and makes several platform specific choices for optimizations. Our experiments show that stencil computations are si... |

248 | Supernode Partitioning
- Irigoin, Triolet
- 1988
Citation Context ...implementation of pattern matchers in general compilers [24] to identify stencil computations, highlight the potential for performance improvements from loop transformations and optimizations. Tiling [10, 31, 32] is a loop transformation that can be used for (i) partitioning data and computations among parallel processors and (ii) reordering computations within a single processor to improve data locality. For... |

186 | More iteration space tiling
- Wolfe
- 1989
Citation Context ...e the outermost time loop. The commonly used data partitioning scheme [7] does not tile the time loop and uses the “owner-computes” rule to determine the computation distribution. Early work by Wolfe [28] shows that skewing can be used to enable tiling of the time loops. Recently, Wonnacott [29] shows that time skewing can be used to tile for parallelism as well as locality. Several important issues a... |

168 | The organization of computations for uniform recurrence equations
- Karp, Miller, et al.
- 1967
Citation Context ... analysis purposes without loss of generality). An important property is that the tile graph with such unit dependence vectors can be viewed as an n-dimensional system of uniform recurrence equations [12]. Such a view allows us to use the powerful systolic array synthesis methods [20, 21] to formally reason about optimal parallelizations of the tile graph. In the context of exploring the space of poss... |

91 | Partitioning and mapping of algorithms into fixed size systolic arrays
- Moldovan, Fortes
- 1986
Citation Context ... tile” or a Strip. • It precludes adaptation to run on fewer processors in multiple passes, using another common systolic technique called LPGS (for Locally Parallel Globally Sequential) partitioning [17]. This means that $s_i = \frac{N_i}{2p}$, i.e., each macro tile is a $\frac{N_i}{p} \times N_j$ strip. A processor performs the following steps: receive data required to execute the strip, execute the strip and send computed dat... |

68 | The Mapping of Linear Recurrence Equations on Regular Arrays
- Quinton, Van Dongen
- 1989
Citation Context ... tile graph with such unit dependence vectors can be viewed as an n-dimensional system of uniform recurrence equations [12]. Such a view allows us to use the powerful systolic array synthesis methods [20, 21] to formally reason about optimal parallelizations of the tile graph. In the context of exploring the space of possible tiling and parallelizations, such a formal reasoning helps in constraining the s... |

61 | Tiling Optimizations for 3D Scientific Computations
- Rivera, Tseng
Citation Context ... implementations for each combination of tiling and parallelization scheme and experiment with them to find a good one, or to even eliminate the obviously poor ones. There have been extensive studies [27, 23, 14, 30, 6, 11] on tiling stencil computations for locality. Schemes for tiling stencil computations for parallelism can be classified based on whether or not they tile the outermost time loop. The commonly used dat... |

59 | Quantifying the multi-level nature of tiling interactions
- Mitchell, Högstedt, et al.
- 1997
Citation Context ...hat are the trade-offs between these schemes? • how do the tiling choices at the parallelization level affect the choices at locality? (Footnote 1: Mitchell et al. [16] point out that ignoring such tiling interactions will lead to suboptimal solutions.) • what are the globally optimal tile sizes? A study of these issues will enable us to develop high performance, multi-version, platform specific implementation... |

50 | Loop Tiling for Parallelism
- Xue
- 2000
Citation Context ...implementation of pattern matchers in general compilers [24] to identify stencil computations, highlight the potential for performance improvements from loop transformations and optimizations. Tiling [10, 31, 32] is a loop transformation that can be used for (i) partitioning data and computations among parallel processors and (ii) reordering computations within a single processor to improve data locality. For... |

46 | Synthesizing systolic arrays from recurrence equations
- Rajopadhye, Fujimoto
- 1990
Citation Context ... tile graph with such unit dependence vectors can be viewed as an n-dimensional system of uniform recurrence equations [12]. Such a view allows us to use the powerful systolic array synthesis methods [20, 21] to formally reason about optimal parallelizations of the tile graph. In the context of exploring the space of possible tiling and parallelizations, such a formal reasoning helps in constraining the s... |

42 | Fortran at Ten Gigaflops: the Connection Machine Convolution Compiler
- Bromley, Heller, et al.
- 1991
Citation Context ...Their inclusion in major benchmarks like SPEC [25], HPFBENCH [9], PARKBENCH [19], and NAS Parallel Benchmarks [18], clearly show their importance. The development of special purpose stencil compilers [4] and implementation of pattern matchers in general compilers [24] to identify stencil computations, highlight the potential for performance improvements from loop transformations and optimizations. Ti... |

38 | On Tiling as a Loop Transformation
- Xue
- 1997
Citation Context ...implementation of pattern matchers in general compilers [24] to identify stencil computations, highlight the potential for performance improvements from loop transformations and optimizations. Tiling [10, 31, 32] is a loop transformation that can be used for (i) partitioning data and computations among parallel processors and (ii) reordering computations within a single processor to improve data locality. For... |

37 | An infeasible interior-point algorithm for solving primal and dual geometric programs
- Kortanek, Xu, Ye
- 1996
Citation Context ... optimal tiling problems. Geometric programs can be transformed into convex optimization problems using a variable substitution [3] and solved efficiently using polynomial time interior point methods [13]. Integer solutions can be found by using a branch-and-bound algorithm. We use YALMIP [15], a tool that provides a high-level symbolic interface in MATLAB to define and solve GPs for integer soluti... |

33 | Determining the idle time of a tiling
- Hogstedt, Carter, et al.
- 1997
Citation Context ...ollows from the fact that $\frac{N_k}{s_k P}$ gives the number of passes executed by a processor and $\frac{N_i + s_k}{s_i}$ is the number of tiles executed by a processor in one pass. The slope $\frac{s_k}{s_i}$ (also known as the rise [8]) plays a fundamental role in determining the latency. [Figure 4. (Left) Skewed dependences that make this tiling legal. (Right) Semi-oblique strips tiling.] Processor $p_{P-1}$ [...] $s_k$ can start its first ti... |

31 | Using time skewing to eliminate idle time due to memory bandwidth and network limitations
- Wonnacott
- 2000
Citation Context ...time loop and uses the “owner-computes” rule to determine the computation distribution. Early work by Wolfe [28] shows that skewing can be used to enable tiling of the time loops. Recently, Wonnacott [29] shows that time skewing can be used to tile for parallelism as well as locality. Several important issues are not addressed by these authors. For a given stencil computation, • what is the space of l... |

30 | Impact of modern memory subsystems on cache optimizations for stencil computations
- Kamil, Husbands, et al.
- 2005
Citation Context ... implementations for each combination of tiling and parallelization scheme and experiment with them to find a good one, or to even eliminate the obviously poor ones. There have been extensive studies [27, 23, 14, 30, 6, 11] on tiling stencil computations for locality. Schemes for tiling stencil computations for parallelism can be classified based on whether or not they tile the outermost time loop. The commonly used dat... |

30 | Achieving scalable locality with time skewing
- Wonnacott
- 2002

28 | Regular partitioning for synthesizing fixed-size systolic arrays
- Darte
Citation Context ...is has two important consequences. • Every processor is active only on alternate time steps. This problem can easily be corrected by a well known systolic technique called clustering or serialization [5]. We allocate two adjacent virtual processors to a single physical processor which alternates between the two tiles and is thus always busy. This combined two-tile unit is called a “macro tile” or a S... |

21 | Automatic optimization of communication in compiling out-of-core stencil codes
- Bordawekar, Choudhary, et al.
- 1996
Citation Context ...of processors. We consider a 3D iteration space and characterize the possible multi-level tilings and parallelizations. Our analytical BSP style cost models are inspired by theirs. Bordawekar et al. [2] present a technique for optimizing communication for out-of-core distributed stencil computations. They show how a compiler can choose the tiling parameters based on the stencil computation and proce... |

16 | Compiling stencils in high performance Fortran
- Roth, Mellor-Crummey, et al.
- 1997
Citation Context ..., PARKBENCH [19], and NAS Parallel Benchmarks [18], clearly show their importance. The development of special purpose stencil compilers [4] and implementation of pattern matchers in general compilers [24] to identify stencil computations, highlight the potential for performance improvements from loop transformations and optimizations. Tiling [10, 31, 32] is a loop transformation that can be used for (... |

14 | A geometric programming framework for optimal multi-level tiling
- Renganarayana, Rajopadhye
Citation Context ...ight that permits this transformation is the property that the tile sizes always take positive values only. An introduction of GPs is beyond the scope of this paper. We refer the interested reader to [22] which shows the use of GPs to solve optimal tiling problems. Geometric programs can be transformed into convex optimization problems using a variable substitution [3] and solved efficiently using pol... |

13 | Optimal Semi-Oblique Tiling
- Andonov, Balev, et al.
- 2003
Citation Context ... of the 3D iteration space by si, sj, and sk, respectively. The tile graph consists of nodes representing tiles and edges between them representing the dependences between tiles. It is well known that [1, 32] if the si’s are large as compared to the elements of the dependence vectors of the original loop, then the dependencies between the tiles are unit vectors (or binary combinations thereof, which can b... |

13 | Automatic tiling of iterative stencil loops
- Li, Song
- 2004

3 | HPFBench: A High Performance Fortran benchmark
- Hu, Jin, et al.
- 1998
Citation Context ...roduction Stencil computations form the basis for a wide range of scientific applications from simple Jacobi to complex multigrid solvers. Their inclusion in major benchmarks like SPEC [25], HPFBENCH [9], PARKBENCH [19], and NAS Parallel Benchmarks [18], clearly show their importance. The development of special purpose stencil compilers [4] and implementation of pattern matchers in general compilers ... |

2 | YALMIP: A toolbox for modeling and optimization in MATLAB
- Löfberg
- 2004
Citation Context ...oblems using a variable substitution [3] and solved efficiently using polynomial time interior point methods [13]. Integer solutions can be found by using a branch-and-bound algorithm. We use YALMIP [15], a tool that provides a high-level symbolic interface in MATLAB to define and solve GPs for integer solutions. The number of (tile) variables of our GPs are related to number of dimensions tiled an... |

1 | Tight bounds on cache use for stencil operations on rectangular grids
- Frumkin, Van der Wijngaart

1 | Solving PDEs on loosely-coupled parallel processors
- Gropp
- 1987
Citation Context ...putations for locality. Schemes for tiling stencil computations for parallelism can be classified based on whether or not they tile the outermost time loop. The commonly used data partitioning scheme [7] does not tile the time loop and uses the “owner-computes” rule to determine the computation distribution. Early work by Wolfe [28] shows that skewing can be used to enable tiling of the time loops. R... |