## Cache Accurate Time Skewing in Iterative Stencil Computations

Citations: 4 (1 self)

### BibTeX

@MISC{Strzodka_1cache,

author = {Robert Strzodka and Mohammed Shaheen and Hans-Peter Seidel},

title = {Cache Accurate Time Skewing in Iterative Stencil Computations},

year = {}

}

### Abstract

We present a time skewing algorithm that breaks the memory wall for certain iterative stencil computations. A stencil computation, even with constant weights, is a completely memory-bound algorithm. For example, for a large 3D domain of 500³ doubles and 100 iterations on a quad-core Xeon X5482 3.2 GHz system, a hand-vectorized and parallelized naive 7-point stencil implementation achieves only 1.4 GFLOPS because the system memory bandwidth limits the performance. Although many efforts have been undertaken to improve the performance of such nested loops, for large data sets they still lag far behind synthetic benchmark performance. The state-of-the-art automatic locality optimizer PluTo [1] achieves 3.7 GFLOPS for the above stencil, whereas a parallel benchmark executing the inner stencil computation directly on registers performs at 25.1 GFLOPS. In comparison, our algorithm achieves 13.0 GFLOPS (52% of the stencil peak benchmark). We present results for 2D and 3D domains in double precision, including problems with gigabyte-sized data sets. The results are compared against hand-optimized naive schemes, PluTo, the stencil peak benchmark and results from the literature. For constant stencils of slope one we break the dependence on the low system bandwidth and achieve at least 50% of the stencil peak, thus performing within a factor of two of an ideal system with infinite bandwidth (the benchmark runs on registers without memory access). For large stencils and banded matrices the additional data transfers let the limitations of the system bandwidth come into play again; however, our algorithm still gains a large improvement over the other schemes.

Keywords: memory wall, memory bound, stencil, banded matrix, time skewing, temporal blocking, wavefront
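The general idea of time skewing mentioned in the abstract can be illustrated on a much smaller problem. The following is a minimal Python sketch and not the authors' algorithm: it assumes a 1D 3-point averaging stencil (slope one) with fixed boundary values, and the names `naive_sweeps`, `time_skewed_sweeps` and `tile` are hypothetical. The naive version streams the whole array once per iteration, while the skewed version advances one spatial tile through all iterations before moving on, shifting the tile one cell to the left per time step so that the stencil's data dependencies are respected.

```python
def naive_sweeps(u0, steps):
    """Reference: apply a 3-point averaging stencil `steps` times,
    sweeping the whole array once per time step."""
    old, new = list(u0), list(u0)  # both copies hold the fixed boundary values
    for _ in range(steps):
        for i in range(1, len(old) - 1):
            new[i] = (old[i - 1] + old[i] + old[i + 1]) / 3.0
        old, new = new, old
    return old


def time_skewed_sweeps(u0, steps, tile):
    """Illustrative time skewing (assumed scheme, not the paper's method):
    each tile of `tile` cells is carried through all `steps` time steps
    before the next tile starts; the tile moves left by one cell per step."""
    n = len(u0)
    buf = [list(u0), list(u0)]  # time level t lives in buf[t % 2]
    # Tile bases extend past the right edge so that the left-shifted
    # tiles still cover the last cells at the final time steps.
    for base in range(1, n - 1 + steps, tile):
        for t in range(steps):
            src, dst = buf[t % 2], buf[(t + 1) % 2]
            lo = max(1, base - t)             # clip to interior cells
            hi = min(n - 1, base + tile - t)  # exclusive upper bound
            for i in range(lo, hi):
                dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
    return buf[steps % 2]


if __name__ == "__main__":
    u0 = [float((i * 7) % 5) for i in range(40)]
    same = time_skewed_sweeps(u0, 6, 8) == naive_sweeps(u0, 6)
    print("skewed result matches naive result:", same)
```

Both functions perform the same arithmetic per cell, so they should produce identical values; only the traversal order differs. In a compiled implementation, the point of the skewed order is that a tile small enough to stay in cache is reused across many time steps instead of being reloaded from main memory at every iteration.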

### Citations

167 | I/O complexity: the red-blue pebble game
- Hong, Kung
- 1981
Citation Context ...all divided dimensions. This minimizes the surface area to volume ratio of the space-time tiles and thus reduces cache misses. It is the best general strategy to traverse a space-time of unknown size [21]. However, knowing the typical cache size of 128KiB–4MiB per core and domain sizes (100–1000)^d, d = 2, 3 we contribute an algorithm that does the exact opposite: we tile only one spatial dimension (... |

106 | New tiling techniques to improve cache temporal locality
- Song, Li
- 1999
Citation Context ...n several stencil iterations ahead of the rest, we need to respect data dependencies induced by the form of the stencil. So-called time skewing techniques have been described by Wolf [6], Song et al. [7] and Wonnacott [8]. Thereby, the time axis corresponds to the number of iterations that the stencil is applied to the entire spatial domain, e.g., this can be the explicit time steps of a PDE solver, ... |

62 | A practical automatic polyhedral parallelizer and locality optimizer
- Bondhugula, Hartono, et al.
- 2008
Citation Context ...ave been undertaken to improve the performance of such nested loops, for large data sets they still lag far behind synthetic benchmark performance. The state-of-the-art automatic locality optimizer PluTo [1] achieves 3.7 GFLOPS for the above stencil, whereas a parallel benchmark executing the inner stencil computation directly on registers performs at 25.1 GFLOPS. In comparison, our algorithm achieves 13... |

43 | Automatic Parallelization of Loop Programs for Distributed Memory Architectures
- Griebl
Citation Context ... size, which is thus labelled cache oblivious [9], [15], [16]. A more general approach for optimizing iterative stencil computations is to use a loop transformation and parallelization framework [1], [17]–[20]. We compare our results against one of them in detail, namely PluTo [1], which is an easy-to-use fully automatic tool and a good indicator of the performance that can be achieved immediately on... |

32 | Implicit and explicit optimizations for stencil computations - Kamil, Datta, et al. - 2006 |

31 | Using time skewing to eliminate idle time due to memory bandwidth and network limitations
- Wonnacott
- 2000
Citation Context ...iterations ahead of the rest, we need to respect data dependencies induced by the form of the stencil. So-called time skewing techniques have been described by Wolf [6], Song et al. [7] and Wonnacott [8]. Thereby, the time axis corresponds to the number of iterations that the stencil is applied to the entire spatial domain, e.g., this can be the explicit time steps of a PDE solver, or the iterations ... |

29 | Cache oblivious stencil computations
- Frigo, Strumpen
- 2005
Citation Context ...nly data locality, or parallelism, or both equally. A third approach is to use a hierarchical tiling that adapts automatically to the available cache size, which is thus labelled cache oblivious [9], [15], [16]. A more general approach for optimizing iterative stencil computations is to use a loop transformation and parallelization framework [1], [17]–[20]. We compare our results against one of them ... |

28 | Optimizations and performance modeling of stencil computations on modern microprocessors - Datta, Kamil, et al. - 2009 |

26 | An autotuning framework for parallel multicore stencil computations
- Kamil, Chan, et al.
- 2010
Citation Context ...such cases up to the point where tight lower and upper bounds on the number of data loads can be given [4]. Recent results show large benefits in applying these techniques on multi-core architectures [5]. But no matter how efficiently we load the data into the caches, for data exceeding the cache size, we still read every vector component at least once per timestep from the main memory and for repeat... |

25 | Effective automatic parallelization of stencil computations - Krishnamoorthy, Baskaran, et al. |

21 | International Technology Roadmap for Semiconductors (2004). http://public.itrs.net
- SEMATECH
- 2002
Citation Context ...e doubling at the same pace. Intensive research into alternative technologies, e.g., stacked memory or optical connection, is underway but an economic solution for the mass-market is not yet in sight [3]. A. Related Work For small discrete vectors that fit into the processor’s caches, the cache bandwidth is the decisive factor of performance, but stencils in scientific computing typically operate on ... |

19 | Multi-level Tiling: M for the Price of One - Kim, Renganarayanan, et al. - 2007 |

15 | 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs
- Nguyen, Satish, et al.
- 2010
Citation Context ... parallel. These requirements lead to skewed tiles in the space-time, see Fig. 2. The tile dimensions form a large optimization space which can be explored empirically [9]–[11] and systematically [12]–[14], whereby it makes a big difference if the exploration targets mainly data locality, or parallelism, or both equally. A third approach is to use a hierarchical tiling that adapts automatically to the ... |

14 | Parametric multi-level tiling of imperfectly nested loops - Hartono, Baskaran, et al. - 2009 |

13 | Parameterized tiling revisited
- Baskaran, Hartono, et al.
- 2010
Citation Context ... which is thus labelled cache oblivious [9], [15], [16]. A more general approach for optimizing iterative stencil computations is to use a loop transformation and parallelization framework [1], [17]–[20]. We compare our results against one of them in detail, namely PluTo [1], which is an easy-to-use fully automatic tool and a good indicator of the performance that can be achieved immediately on these... |

10 | Towards optimal multi-level tiling for stencil computations - Renganarayanan, Harthikote-Matha, et al. - 2007 |

10 | Efficient temporal blocking for stencil computations by multicore-aware wavefront parallelization
- Wellein, Hager, et al.
- 2009
Citation Context ...not necessary. Instead of diagonal wavefronts, we consider axis-aligned wavefronts, and our tile placement is also different. The pipelined temporal blocking by Wittmann et al. [11] and Wellein et al. [22] can also be seen as a variant of space-time wavefront processing. However, they use the term ‘wavefront’ completely differently, describing the parallelization along the time axis, which benefits fro... |

8 | Improving parallelism and locality with asynchronous algorithms
- Liu, Li
Citation Context ...perty, however, they use the diamond shape only in 1D with a traditional bottom-up processing of the tile in cache. The second property avoids the problem of dependent tiles encountered by Liu and Li [24], where they have to relax the numerical properties of the scheme in order to gain better parallelization. As in CATS1, we pursue the goal of maximizing the wavefront size without reverting to multi-d... |

6 | More iteration space tiling
- Wolf
- 1989
Citation Context ...arts of the domain several stencil iterations ahead of the rest, we need to respect data dependencies induced by the form of the stencil. So-called time skewing techniques have been described by Wolf [6], Song et al. [7] and Wonnacott [8]. Thereby, the time axis corresponds to the number of iterations that the stencil is applied to the entire spatial domain, e.g., this can be the explicit time steps ... |

6 | Cache oblivious parallelograms in iterative stencil computations
- Strzodka, Shaheen, et al.
- 2010
Citation Context ...ta locality, or parallelism, or both equally. A third approach is to use a hierarchical tiling that adapts automatically to the available cache size, which is thus labelled cache oblivious [9], [15], [16]. A more general approach for optimizing iterative stencil computations is to use a loop transformation and parallelization framework [1], [17]–[20]. We compare our results against one of them in det... |

5 | Multicore-Aware Parallel Temporal Blocking of Stencil Codes for Shared and Distributed Memory
- Wittmann, Hager, et al.
- 2010
Citation Context ...he misses and ideally also in parallel. These requirements lead to skewed tiles in the space-time, see Fig. 2. The tile dimensions form a large optimization space which can be explored empirically [9]–[11] and systematically [12]–[14], whereby it makes a big difference if the exploration targets mainly data locality, or parallelism, or both equally. A third approach is to use a hierarchical tiling that... |

4 | Mapping the FDTD Application to Many-Core Chip Architectures
- Orozco, Gao
- 2009
Citation Context ...volume ratio (cache miss reduction), they are independent of each other when arranged side-by-side (parallel execution), and require only one tile form to cover the plane (simplicity). Orozco and Gao [23] give a quantitative analysis for the first property, however, they use the diamond shape only in 1D with a traditional bottom-up processing of the tile in cache. The second property avoids the proble... |

2 | Tight bounds on cache use for stencil operations on rectangular grids
- Frumkin, Van der Wijngaart
- 2002
Citation Context ...er than the cache capacity. Substantial work has been performed to optimize the data locality in such cases up to the point where tight lower and upper bounds on the number of data loads can be given [4]. Recent results show large benefits in applying these techniques on multi-core architectures [5]. But no matter how efficiently we load the data into the caches, for data exceeding the cache size, we... |

1 | The memory gap
- Wilkes
- 2000
Citation Context ...nly a non-linear stencil computation with many operations could prevent it from being memory bound. This imbalance between the computation power and system bandwidth is called the memory wall problem [2]. The costly introduction of double, triple and quad-channel memory buses has temporarily stopped the further deterioration of this problem, but in the long run we will see a growing discrepancy again... |