## Tuning a Finite Difference Computation for Parallel Vector Processors

Citations: 1 (0 self)

### Citations

234 | Benchmarking GPUs to tune dense linear algebra
- Volkov, Demmel
- 2008
Citation Context: …vector rotate (warp shuffle) operations. An alternative way is to implement the aligned, CPU-like algorithm with very long vectors. This is in fact preferable, given the low bandwidth of local memory [7]. Time-slicing was developed as a cache-aware algorithm with a fast cache. However, the fast GPU local memory and the L1 cache are even smaller than the total capacity of the registers and are not use…
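The aligned, rotate-based access pattern mentioned in this context can be sketched in plain Python (a stand-in model of vector registers; the width `W` and all names are illustrative assumptions, not taken from the paper):

```python
# Model of the aligned stencil access pattern: instead of unaligned
# vector loads of u[i-1..i+W-2] and u[i+1..i+W], load only aligned
# chunks and form the shifted operands by combining neighbouring
# chunks - the software analogue of a register rotate / warp shuffle.

W = 4  # hypothetical vector width

def shift_left(prev_chunk, chunk):
    # vector holding u[i-1..i+W-2]: last lane of the previous chunk
    # followed by the first W-1 lanes of the current chunk
    return [prev_chunk[-1]] + chunk[:-1]

def shift_right(chunk, next_chunk):
    # vector holding u[i+1..i+W]: lanes 1..W-1 of the current chunk
    # followed by the first lane of the next chunk
    return chunk[1:] + [next_chunk[0]]

def stencil_step(u):
    """One 3-point averaging step over the interior, aligned chunks only."""
    n = len(u)
    out = u[:]  # boundary chunks are left unchanged
    for base in range(W, n - W, W):
        prev_chunk = u[base - W:base]
        chunk      = u[base:base + W]
        next_chunk = u[base + W:base + 2 * W]
        left  = shift_left(prev_chunk, chunk)
        right = shift_right(chunk, next_chunk)
        out[base:base + W] = [(l + c + r) / 3.0
                              for l, c, r in zip(left, chunk, right)]
    return out

u = [float(i * i) for i in range(16)]
v = stencil_step(u)
# interior results agree with the scalar 3-point stencil
assert all(abs(v[i] - (u[i-1] + u[i] + u[i+1]) / 3.0) < 1e-12
           for i in range(W, 16 - W))
```

Every memory access in the loop is chunk-aligned; only register-level recombination produces the shifted operands, which is the point of the technique.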

123 | New tiling techniques to improve cache temporal locality
- Song, Li
- 1999
Citation Context: …in the space-time domain to order the points (i, t) such that the data dependences are respected. A systematic way uses trapezoidal shapes constructed of diagonal slices in space-time, called time-skewing [4], [5]; see Fig. 4. This is effective if at least two preceding diagonals fit in fast (cache) memory. The work of the start-up slices is wasted. Note that a straightforward implementa…
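A minimal 1D sketch of the diagonal ordering described in this context, in Python (the wavefront `w = 2t + i` is one legal skewed schedule; the trapezoidal tiles of time-skewing in [4], [5] group such diagonals into cache-sized blocks):

```python
# Compute the full space-time table U[t][i] of a 3-point stencil along
# diagonals w = 2*t + i. Every dependence (i-1,t), (i,t), (i+1,t) of a
# point (i,t+1) lies on a strictly smaller diagonal, so the skewed
# order respects the data dependences and reproduces the row-by-row
# result exactly.

N, T = 10, 5

def naive(u0):
    U = [u0[:]]
    for t in range(T):
        prev = U[-1]
        row = prev[:]                       # boundaries stay fixed
        for i in range(1, N - 1):
            row[i] = (prev[i-1] + prev[i] + prev[i+1]) / 3.0
        U.append(row)
    return U

def skewed(u0):
    U = [[None] * N for _ in range(T + 1)]
    U[0] = u0[:]
    for w in range(2, 2 * T + N):           # sweep diagonals w = 2*t + i
        for t in range(1, T + 1):
            i = w - 2 * t
            if 1 <= i <= N - 2:
                U[t][i] = (U[t-1][i-1] + U[t-1][i] + U[t-1][i+1]) / 3.0
            elif i in (0, N - 1):
                U[t][i] = U[t-1][i]         # fixed boundary
    return U

u0 = [float(i) for i in range(N)]
assert naive(u0) == skewed(u0)
```

Both traversals perform the identical floating point operations, only in a different order, so the results match bit for bit; a cache-aware implementation would additionally keep the last two diagonals resident in fast memory.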

83 | Debunking the 100x GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU
- Lee, Kim, et al.
- 2010
Citation Context: …processor performance and to hide main memory latency. Both x86 CPUs with SSE and AVX vectors and current Nvidia GPUs meet this requirement. There are many claims comparing CPU and GPU performance; see [17]. A fair comparison may be based on a single PC or a server configuration, see Table III (comparison of CPU and GPU FLOP/s: for a PC with a multi-core CPU and one GPU the ratio is about 1:6…).

70 | 3D finite difference computation on GPUs using CUDA
- Micikevicius
Citation Context: …on an Intel ‘Core’ core, and 6.5 GF on 8 cores for a 2D problem. A 3D higher-order finite difference stencil in time-stepping is optimised to 130 GF on previous-generation Nvidia GPUs with 690 GF peak in [15]. A coupled 3D finite difference implementation on Tsubame 2.0, a multi-GPU cluster based on Nvidia Fermi, is discussed in [16]. V. CONCLUSION We were able to develop highly efficient parallel, wide, vectori…

68 | Tiling optimizations for 3D scientific computations
- Rivera, Tseng
- 2000
Citation Context: …times numerical algorithms were based on finite differences. However, attempts to optimise for memory hierarchies, namely main memory and cache, started with cache-aware block tiling in space-time [9], the introduction of time-skewing [4], [5], and extensions to grid hierarchies [10] and a cache-oblivious space-filling Z curve in space-time [11]. More recent is the overview [12], which e.g. describ…

47 | Cache oblivious stencil computations
- Frigo, Strumpen
- 2005
Citation Context: …started with cache-aware block tiling in space-time [9], the introduction of time-skewing [4], [5], and extensions to grid hierarchies [10] and a cache-oblivious space-filling Z curve in space-time [11]. More recent is the overview [12], which e.g. describes a 3D time-skewing implementation at 37% of peak performance on a 2004 AMD Opteron processor with 4.4 GF peak. An automatic tuning of 3D tim…

44 | Optimization and performance modeling of stencil computations on modern microprocessors
- Datta, Kamil, et al.
- 2009
Citation Context: …tiling in space-time [9], the introduction of time-skewing [4], [5], and extensions to grid hierarchies [10] and a cache-oblivious space-filling Z curve in space-time [11]. More recent is the overview [12], which e.g. describes a 3D time-skewing implementation at 37% of peak performance on a 2004 AMD Opteron processor with 4.4 GF peak. An automatic tuning of 3D time-stepping stencils on various dou…

41 | Likwid: A lightweight performance-oriented tool suite for x86 multicore environments
- Treibig, Hager, et al.
- 2010
Citation Context: …ut FMA) and rotate by a single SSE shuffle or a sequence of AVX instructions. The algorithm is memory bound, which can also be verified with the aid of CPU performance counters using tools like ‘likwid’ [3]. Note that loop unrolling and data that fit into cache are essential to the success of vectorisation. The ratio of register performance to fastest cache/local memory performance (2 − 2.4) is even wo…
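The "memory bound" claim in this context follows from a simple roofline estimate; a sketch in Python (the peak figure is the 30 GF measured single-core value cited in entries [1], [2] below, the bandwidth and per-point counts are assumed, illustrative numbers, not measurements from the paper):

```python
# Back-of-the-envelope roofline check for a memory-bound stencil.
# Attainable performance is capped by bandwidth times arithmetic
# intensity whenever that product falls below the compute peak.

flops_per_point = 3          # e.g. two adds and one multiply
bytes_per_point = 8          # one 4-byte float read + one write (ideal caching)
intensity = flops_per_point / bytes_per_point    # flop/byte = 0.375

peak_gflops = 30.0           # measured single-core SSE/AVX peak (see [1], [2])
mem_bw_gbs  = 10.0           # assumed sustained main memory bandwidth

attainable = min(peak_gflops, mem_bw_gbs * intensity)
assert attainable < peak_gflops   # bandwidth-limited: the code is memory bound
```

Performance counters (e.g. via likwid) confirm the same conclusion empirically by showing memory traffic near the sustained bandwidth while the FPU is far from peak.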

26 | Better performance at lower occupancy
- Volkov
- 2010
Citation Context: …e like the floating point performance for Kepler. The code is memory bound again. In order to optimise memory bandwidth at low occupancy, the number of memory load and store operations may be reduced [8]. For example, 64-bit memory accesses by loads and stores of ‘float2’ values instead of 32-bit ‘float’ values improve the effective memory bandwidth; see Tab. I. This way a 32-element coalesced vector…
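The effect of the ‘float2’ idea in this context can be made concrete with a simple instruction count (an illustrative model only, not a measurement; the element count is arbitrary):

```python
# The same bytes move through memory in both cases, but 64-bit
# accesses need half as many load/store instructions per thread,
# which improves effective bandwidth at low occupancy.

n_elements = 1 << 20              # floats streamed through one kernel

loads_float  = n_elements         # one 32-bit load per 'float'
loads_float2 = n_elements // 2    # one 64-bit load per 'float2' pair

bytes_moved = 4 * n_elements      # identical traffic in both cases
assert loads_float2 == loads_float // 2
```

Fewer memory instructions mean more bytes in flight per issued instruction, which is what raises the measured bandwidth in Tab. I.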

24 | Heterogeneous Computing with OpenCL
- Gaster, Kaeli, et al.
- 2011
Citation Context: …Nvidia GPUs. A common programming pattern in OpenCL (and Cuda) is the use of local (shared) memory to prefetch data and share data between threads, together with fast synchronisation within a processor [6]. This way e.g. a vector rotate can be easily implemented and we can develop an unaligned vector version of Sec. III-A. However, local memory is prohibitively slow compared to registers; see Fig. 2. Not…
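The local-memory prefetch pattern named in this context can be sketched in Python (a sequential stand-in for an OpenCL/Cuda work-group; the group size and all names are illustrative assumptions):

```python
# Sketch of the shared/local memory pattern: a work-group first
# stages a tile plus a one-element halo on each side into fast local
# memory, synchronises, then every work-item reads its neighbours
# from the tile instead of from global memory.

GROUP = 8  # hypothetical work-group size

def stencil_with_local_memory(u):
    n = len(u)
    out = u[:]                                # boundary tiles unchanged
    for base in range(GROUP, n - GROUP, GROUP):
        # stage: copy tile plus halo (a cooperative load in a real kernel)
        local = u[base - 1:base + GROUP + 1]
        # (a barrier would follow here in OpenCL/Cuda)
        for lid in range(GROUP):              # the work-items of the group
            out[base + lid] = (local[lid] + local[lid + 1]
                               + local[lid + 2]) / 3.0
    return out

u = [float(i) for i in range(32)]
v = stencil_with_local_memory(u)
assert v[GROUP] == (u[GROUP - 1] + u[GROUP] + u[GROUP + 1]) / 3.0
```

Each global element is loaded once per tile and reused three times from `local`, which is the bandwidth saving the pattern is designed for; as the context notes, on the hardware discussed here register-resident rotates beat this pattern because local memory is slow relative to registers.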

20 | Time skewing: A value-based approach to optimizing for memory locality (Rutgers Univ.)
- McCalpin, Wonnacott
- 1999
Citation Context: …space-time domain to order the points (i, t) such that the data dependences are respected. A systematic way uses trapezoidal shapes constructed of diagonal slices in space-time, called time-skewing [4], [5]; see Fig. 4. This is effective if at least two preceding diagonals fit in fast (cache) memory. The work of the start-up slices is wasted. Note that a straightforward implementation…

10 | A generalized framework for auto-tuning stencil computations
- Kamil, Chan, et al.
- 2009
Citation Context: …4.4 GF peak performance, 2004 AMD Opteron processor. An automatic tuning of 3D time-stepping stencils on various double precision 10 GF peak AMD and Intel cores shows 1.6 GF sequentially and 11 GF on 16 cores [13]. Another group [14] mentions 1 GF on an 11.2 GF peak double precision Intel ‘Core’ core, and 6.5 GF on 8 cores for a 2D problem. A 3D higher-order finite difference stencil in time-stepping is optimised…

7 | Data locality optimizations for multigrid methods on structured grids
- Weiß
- 2001
Citation Context: …optimise for memory hierarchies, namely main memory and cache, started with cache-aware block tiling in space-time [9], the introduction of time-skewing [4], [5], and extensions to grid hierarchies [10] and a cache-oblivious space-filling Z curve in space-time [11]. More recent is the overview [12], which e.g. describes a 3D time-skewing implementation at 37% of peak performance on a 4.4 GF peak perform…

7 | Peta-scale phase-field simulation for dendritic solidification
- Shimokawabe, Aoki, et al.
- 2011
Citation Context: …is optimised to 130 GF on previous-generation Nvidia GPUs with 690 GF peak in [15]. A coupled 3D finite difference implementation on Tsubame 2.0, a multi-GPU cluster based on Nvidia Fermi, is discussed in [16]. V. CONCLUSION We were able to develop highly efficient parallel, wide, vectorised time-slice implementations of the finite difference model problem for CPUs and GPUs. All optimisation techniques had…

4 | Sandy Bridge spans generations
- Gwennap
- 2010
Citation Context: …zer architecture implements AVX vectors as a combination of two SSE vectors internally, such that a single core can issue either two independent SSE instructions or one AVX instruction per cycle; see [1], [2]. This results in 33.6 GF single-core peak performance (30 GF measured) for independent add and multiply instructions of both SSE- and AVX-type vectors with exclusive access to the shared floating p…

2 | Intel’s Sandy Bridge microarchitecture, http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937
- Kanter
- 2010
Citation Context: …architecture implements AVX vectors as a combination of two SSE vectors internally, such that a single core can issue either two independent SSE instructions or one AVX instruction per cycle; see [1], [2]. This results in 33.6 GF single-core peak performance (30 GF measured) for independent add and multiply instructions of both SSE- and AVX-type vectors with exclusive access to the shared floating point…

1 | A framework that supports in writing performance-optimized stencil-based codes (Universität Erlangen-Nürnberg)
- Stürmer, Rüde
- 2010
Citation Context: …2004 AMD Opteron processor. An automatic tuning of 3D time-stepping stencils on various double precision 10 GF peak AMD and Intel cores shows 1.6 GF sequentially and 11 GF on 16 cores [13]. Another group [14] mentions 1 GF on an 11.2 GF peak double precision Intel ‘Core’ core, and 6.5 GF on 8 cores for a 2D problem. A 3D higher-order finite difference stencil in time-stepping is optimised to 130 GF on previo…