## PARALLEL THREE-DIMENSIONAL NONEQUISPACED FAST FOURIER TRANSFORMS AND THEIR APPLICATION TO PARTICLE SIMULATION

### BibTeX

@MISC{Pippig_parallelthree-dimensional,

author = {Michael Pippig and Daniel Potts},

title = {PARALLEL THREE-DIMENSIONAL NONEQUISPACED FAST FOURIER TRANSFORMS AND THEIR APPLICATION TO PARTICLE SIMULATION},

year = {}

}

### OpenURL

### Abstract

Abstract. In this paper we describe a parallel algorithm for calculating nonequispaced fast Fourier transforms on massively parallel distributed memory architectures. These algorithms are implemented in an open source software library called PNFFT. Furthermore, we derive a parallel fast algorithm for the computation of the Coulomb potentials and forces in a charged particle system, which is based on the parallel nonequispaced fast Fourier transform. To prove the high scalability of our algorithms we provide performance results on a BlueGene/P system using up to 65536 cores. Key words and phrases: parallel nonequispaced fast Fourier transform, parallel fast summation, parallel particle mesh methods, NFFT

### Citations

401 | The design and implementation of FFTW3
- Frigo, Johnson
- 2005
(Show Context)
Citation Context ...e for the important interplay between the development of fast algorithms and sustainable software engineering in order to produce high performance software. Without a doubt, the FFTW software library =-=[16, 17]-=- is an outstanding implementation of the fast Fourier transform and one of the most important software packages in scientific computing. It offers support of shared memory parallelism and also distrib... |

349 | Computer Simulation Using Particles - Hockney, Eastwood |

244 |
Scalable molecular dynamics with NAMD
- Phillips, Braun, et al.
- 2005
(Show Context)
Citation Context ...oth algorithms employ the NFFT in order to achieve a fast algorithm. For periodic boundary conditions, there has been a broad variety of publications about parallel particle-mesh algorithms, e.g. see =-=[57, 50, 39, 43, 38, 26, 54]-=-. However, these implementations use simple cutoff schemes in order to deal with Coulomb interactions for non-periodic boundary conditions and thereby totally ignore the long range nature of these int... |

182 |
A smooth particle mesh Ewald method
- Essmann, Perera, et al.
- 2004
(Show Context)
Citation Context ...(xj) 0 : l /∈ In,m(xj) . An approximation of the adjoint transform is given by A ⊢⊣ f ≈ DF ⊢⊣ B ⊤ f. The gradients (2.4) can be approximated by means of the analytic derivative of the window function =-=[11]-=-, i.e., ∇A ˆf ≈ ∇BF D ˆf with { ∇ϕ ∇B := (∇bjl) j=1,...,M; , ∇bjl := l∈In ( xj − n−1 ⊙ l ) : l ∈ In,m(xj) 0 : l /∈ In,m(xj) . Note that the NFFT and gradient NFFT only differ in the multiplication wit... |

145 |
Fast Fourier transforms for nonequispaced data
- Dutt, Rokhlin
- 1993
(Show Context)
Citation Context ...er transform (NDFT), which is a generalization of the discrete Fourier transform to nonequispaced nodes. Especially its fast approximate realization called nonequispaced fast Fourier transform (NFFT) =-=[7, 2, 43, 46, 42, 18, 27]-=- multiplied the application areas, since it led to fast algorithms in computerized tomography [14, 8], particle simulation [40, 21] and spectral methods on adaptive grids, just to name a few examples.... |

111 | Fast fourier transform for nonequispaced data: A tutorial
- Potts, Steidl, et al.
- 1998
(Show Context)
Citation Context ...er transform (NDFT), which is a generalization of the discrete Fourier transform to nonequispaced nodes. Especially its fast approximate realization called nonequispaced fast Fourier transform (NFFT) =-=[7, 2, 43, 46, 42, 18, 27]-=- multiplied the application areas, since it led to fast algorithms in computerized tomography [14, 8], particle simulation [40, 21] and spectral methods on adaptive grids, just to name a few examples.... |

103 | Using MPI-2: Advanced Features of the Message-Passing Interface - Gropp, Lusk, et al. - 1999 |

95 |
On the fast Fourier transform of functions with singularities
- Beylkin
- 1995
(Show Context)
Citation Context ...er transform (NDFT), which is a generalization of the discrete Fourier transform to nonequispaced nodes. Especially its fast approximate realization called nonequispaced fast Fourier transform (NFFT) =-=[7, 2, 43, 46, 42, 18, 27]-=- multiplied the application areas, since it led to fast algorithms in computerized tomography [14, 8], particle simulation [40, 21] and spectral methods on adaptive grids, just to name a few examples.... |

84 | Selection of a convolution function for Fourier inversion using gridding
- Jackson, Meyer, et al.
- 1991
(Show Context)
Citation Context ...proposed. In our parallel NFFT implementation the user is free to choose between the (dilated) Gaussian [7, 43, 6], (dilated) cardinal central B–splines [2, 43], and (dilated) Kaiser–Bessel functions =-=[24, 15]-=-. We point out that the approximation error introduced by the NFFT decays exponentially with the number of summands m. Error estimates for the multivariate case were presented in [9], see also [27, Ap... |

83 | Nonuniform fast Fourier transforms using Min-Max interpolation
- Fessler, Sutton
(Show Context)
Citation Context ...fast approximate realization called nonequispaced fast Fourier transform (NFFT) [7, 2, 43, 46, 42, 18, 27] multiplied the application areas, since it led to fast algorithms in computerized tomography =-=[14, 8]-=-, particle simulation [40, 21] and spectral methods on adaptive grids, just to name a few examples. An extensive list of applications can be found e.g. in [18]. Roughly speaking, the nonequispaced fas... |

59 |
GROMACS 4: Algorithms for highly efficient, load-balanced, and scalable molecular simulation
- Hess, Kutzner, et al.
(Show Context)
Citation Context ...oth algorithms employ the NFFT in order to achieve a fast algorithm. For periodic boundary conditions, there has been a broad variety of publications about parallel particle-mesh algorithms, e.g. see =-=[57, 50, 39, 43, 38, 26, 54]-=-. However, these implementations use simple cutoff schemes in order to deal with Coulomb interactions for non-periodic boundary conditions and thereby totally ignore the long range nature of these int... |

53 |
Die Berechnung optischer und elektrostatischer Gitterpotentiale. Annalen der Physik 369
- Ewald
- 1921
(Show Context)
Citation Context ...lomb potential must be computed by an FFT within a precomputation step. In contrast, for periodic boundary conditions the Fourier coefficients are given analytically by the well known Ewald splitting =-=[15]-=-. Furthermore, the fast summation algorithm for nonperiodic boundary conditions makes use of oversampled FFTs and rescaling of the particle coordinates into a box of roughly half the size of the initi... |

46 | Fast approximate Fourier transforms for irregularly spaced data
- Ware
- 1998
(Show Context)
Citation Context |

44 | Nonuniform fast Fourier transform
- Duijndam, Schonewille
- 1999
(Show Context)
Citation Context ... error small, several functions ϕ with good localization in time and frequency domain have been proposed. In our parallel NFFT implementation the user is free to choose between the (dilated) Gaussian =-=[7, 43, 6]-=-, (dilated) cardinal central B–splines [2, 43], and (dilated) Kaiser–Bessel functions [24, 15]. We point out that the approximation error introduced by the NFFT decays exponentially with the number of... |

35 | Accelerating the nonuniform fast Fourier transform
- Greengard, Lee
(Show Context)
Citation Context |

33 |
How to mesh up ewald sums: A theoretical and numerical comparison of various paricle mesh routines
- Deserno, Holm
- 1998
(Show Context)
Citation Context ...in Alg. 3 for the fast evaluation of the potentials (5.1) and the fields (5.2). The steps of this algorithm are very similar to P3M (Particle-Particle– Particle-Mesh) algorithms [22, Ch. 8], see also =-=[4]-=- for an overview of particle-mesh algorithms. Analogously, Alg. 3 is called P2NFFT (Particle-Particle–NFFT), since the short range particle-particle interactions are computed in the same way, while th... |

31 | Stability results for scattered data interpolation by trigonometric polynomials
- Kunis, Potts
(Show Context)
Citation Context ...the matrix A is not square. Even for the square case, it is usually not orthogonal. Therefore, the definition of an inverse NFFT is not canonical, but can be realized by an iterative method, see e.g. =-=[29]-=-. Instead, it is customary to define the adjoint transform by the matrix-vector product that is equivalent to the sums ˆhk = ˆh = A ⊢⊣ f, M∑ fje +2πikxj , k ∈ IN , (2.3) j=1 with the vector ˆ h := ( ˆ... |

31 |
Fast summation at nonequispaced knots by nffts
- Potts, Steidl
(Show Context)
Citation Context ...called nonequispaced fast Fourier transform (NFFT) [7, 2, 43, 46, 42, 18, 27] multiplied the application areas, since it led to fast algorithms in computerized tomography [14, 8], particle simulation =-=[40, 21]-=- and spectral methods on adaptive grids, just to name a few examples. An extensive list of applications can be found e.g. in [18]. Roughly speaking, the nonequispaced fast Fourier transform consists o... |

28 |
Non-equispaced fast Fourier transforms with applications to tomography
- Fourmont
- 2003
(Show Context)
Citation Context ...proposed. In our parallel NFFT implementation the user is free to choose between the (dilated) Gaussian [7, 43, 6], (dilated) cardinal central B–splines [2, 43], and (dilated) Kaiser–Bessel functions =-=[24, 15]-=-. We point out that the approximation error introduced by the NFFT decays exponentially with the number of summands m. Error estimates for the multivariate case were presented in [9], see also [27, Ap... |

28 | Particle-mesh ewald and rrespa for parallel molecular dynamics simulations
- Plimpton, Pollock, et al.
- 1997
(Show Context)
Citation Context ...thematics, 09107 Chemnitz, Germany ‡ michael.pippig@mathematik.tu-chemnitz.de § potts@mathematik.tu-chemnitz.de 12 MICHAEL PIPPIG AND DANIEL POTTS publicly available, parallel FFT software libraries =-=[49, 48, 42, 41, 37, 36, 46, 44]-=- based on this approach have been proposed during the last years. Following the example of the FFTW software library, the NFFT algorithm has been implemented in the publicly available NFFT software li... |

27 |
Fast Fourier Transforms for nonequispaced data,” in Approximation Theory
- Elbel, Steidl
- 1998
(Show Context)
Citation Context ...el functions [24, 15]. We point out that the approximation error introduced by the NFFT decays exponentially with the number of summands m. Error estimates for the multivariate case were presented in =-=[9]-=-, see also [27, Appendix C]. In the case of the Gaussian window function, the evaluations of the exponential function exp() can be reduced substantially, see [18] and [27, Appendix C]. 4. The Parallel... |

22 | Fast convolution with radial kernels at nonequispaced knots - Potts, Steidl, et al. - 2004 |

13 |
inhomogeneity correction based on gridding reconstruction
- Eggers, Knopp, et al.
(Show Context)
Citation Context ...fast approximate realization called nonequispaced fast Fourier transform (NFFT) [7, 2, 43, 46, 42, 18, 27] multiplied the application areas, since it led to fast algorithms in computerized tomography =-=[14, 8]-=-, particle simulation [40, 21] and spectral methods on adaptive grids, just to name a few examples. An extensive list of applications can be found e.g. in [18]. Roughly speaking, the nonequispaced fas... |

12 |
Fast NFFT based summation of radial functions
- Fenn, Steidl
(Show Context)
Citation Context ...) for an appropriate degree of smoothness p ∈ N. Several regularizations of 1 r are possible, e.g., by algebraic polynomials, splines, trigonometric polynomials or two point Taylor interpolation, see =-=[13]-=-. The potentials φj in equation (5.1) can be approximated by where φ NE j := R(0) + ∑ φ RF j := ∑ k∈IN φj ≈ hj := φ NE j l∈INE ε (j) I ql ( 1 + φ RF j , ‖xj − xl‖2 ) − R(‖xj − xl‖2) , (5.3) ( M∑ ˆRk q... |

12 | Using NFFT3 - a software library for various nonequispaced fast Fourier transforms
- Keiner, Kunis, et al.
(Show Context)
Citation Context |

11 | Numerical Simulation in Molecular Dynamics - Griebel, Knapek, et al. - 2007 |

11 |
ESPResSo - An Extensible Simulation Package for Research on Soft Matter Systems
- Limbach, Arnold, et al.
- 2006
(Show Context)
Citation Context ...oth algorithms employ the NFFT in order to achieve a fast algorithm. For periodic boundary conditions, there has been a broad variety of publications about parallel particle-mesh algorithms, e.g. see =-=[57, 50, 39, 43, 38, 26, 54]-=-. However, these implementations use simple cutoff schemes in order to deal with Coulomb interactions for non-periodic boundary conditions and thereby totally ignore the long range nature of these int... |

9 |
Force fields for silicas and aluminophosphates based on ab initio calculations
- Beest, Kramer
- 1990
(Show Context)
Citation Context ...cle NFFT (P2NFFT) for each process r ∈ P P . we use a cubic box filled with 12960 particles of a silica melt. It was generated by a simulation of a melting silica crystal using the potential given in =-=[1]-=-. This particle system consist of positive and negative charged ions which are sufficiently homogeneously distributed. We duplicate the initial test system of 12960 particles for 4, 9 and 20 times in ... |

9 |
NFFT 3.0, C subroutine library. http://www.tu-chemnitz.de/~potts/nfft. http://www.tu-chemnitz.de/~potts/nfft
- Keiner, Kunis, et al.
(Show Context)
Citation Context ... on this approach have been proposed during the last years. Following the example of the FFTW software library, the NFFT algorithm has been implemented in the publicly available NFFT software library =-=[27, 26]-=-, which also offers support of shared memory parallelism [45] and a parallel implementation for {michael.pippig,potts}@mathematik.tu-chemnitz.de, Chemnitz University of Technology, Department of Mathe... |

7 |
Comments on P3M, FMM, and the Ewald method for large periodic Coulombic systems
- Polloc, Glosli
- 1996
(Show Context)
Citation Context |

6 | Gennery: A portable 3d FFT package for distributed-memory parallel architectures
- Ding, Ferraro, et al.
- 1995
(Show Context)
Citation Context ...mensional decomposition of the input array. However, it has been argued that the one-dimensional data decomposition is not scalable enough for modern massive parallel distributed memory architectures =-=[5, 10, 12, 44]-=-. Whenever the dimensionality of the input array is at least three, a more scalable two-dimensional domain decomposition can be applied. Several publicly available parallel FFT software libraries [39,... |

6 |
R.S.: A Volumetric FFT for BlueGene/L
- Eleftheriou, Moreira, et al.
- 2003
(Show Context)
Citation Context ...mensional decomposition of the input array. However, it has been argued that the one-dimensional data decomposition is not scalable enough for modern massive parallel distributed memory architectures =-=[5, 10, 12, 44]-=-. Whenever the dimensionality of the input array is at least three, a more scalable two-dimensional domain decomposition can be applied. Several publicly available parallel FFT software libraries [39,... |

6 | 2006): “Fast Gauss transforms with complex parameters using NFFTs
- Kunis, Potts, et al.
(Show Context)
Citation Context ... library [36], are fast approximate algorithms to compute the sums in (2.2) and the adjoint transform (2.3). These both transforms are also the cornerstone for the nonequispaced convolution, see e.g. =-=[30]-=-. In addition, we implemented a fast approximate algorithm for computing the gradients (2.4). 3. The Three-Dimensional NFFT Algorithm. This section summarizes the mathematical theory and ideas behind ... |

4 |
The Error-Controlled Fast Multipole Method for Open and Periodic Boundary Conditions,” in Fast Methods for Long-Range Interactions in Complex Systems
- Kabadshow, Dachsel
(Show Context)
Citation Context ...resents results on distributed memory parallelization of particle-mesh algorithms with non-periodic boundary conditions. A numerical comparison with parallel implementation of other methods, e.g. FMM =-=[25]-=-, will be published elsewhere. The outline of this paper is as follows: We start in Section 2 with the introduction of the notation and definitions that we will use in the remainder of the paper. Next... |

3 |
Ewald summation based on nonuniform fast Fourier transform
- Hedman, Laaksonen
(Show Context)
Citation Context ...called nonequispaced fast Fourier transform (NFFT) [7, 2, 43, 46, 42, 18, 27] multiplied the application areas, since it led to fast algorithms in computerized tomography [14, 8], particle simulation =-=[40, 21]-=- and spectral methods on adaptive grids, just to name a few examples. An extensive list of applications can be found e.g. in [18]. Roughly speaking, the nonequispaced fast Fourier transform consists o... |

3 | Particle simulation based on nonequispaced fast Fourier transforms,” in Fast Methods for Long-Range Interactions in Complex Systems
- Pippig, Potts
(Show Context)
Citation Context ...stems with non-periodic boundary conditions. This is similar to Ewald-like particle-mesh algorithms, which only work for periodic boundary conditions, see e.g. [19, Ch. 7] for an overview. Indeed, in =-=[38]-=- we point out that the building blocks of the fast summation [40, 41] for non-periodic boundary conditions and the fast Ewald summation [21] for periodic boundary conditions are very similar. Especial... |

2 |
Library Support for Parallel Sorting in Scientific Computations
- Dachsel, Hofmann, et al.
- 2007
(Show Context)
Citation Context ...es xj, j ∈ ⋃ l∈Mr I P NE εI (l). We perform these two tasks at once using a fine-grained data distribution operation [23] that is implemented within a software library for parallel sorting algorithms =-=[3]-=-. Note that the serial fast summation algorithm requires the condition ‖xj‖2 < 1 4 −εB for every node xj, j = 1, . . . , M. Therefore, we set C = ( 1 1 1 ⊤ 4 −εB, 4 −εB, 4 −εB) and our parallel adjoin... |

2 |
Performance of the 3D FFT on the 6D network torus QCDOC parallel supercomputer
- Fang, Deng, et al.
(Show Context)
Citation Context ...mensional decomposition of the input array. However, it has been argued that the one-dimensional data decomposition is not scalable enough for modern massive parallel distributed memory architectures =-=[5, 10, 12, 44]-=-. Whenever the dimensionality of the input array is at least three, a more scalable two-dimensional domain decomposition can be applied. Several publicly available parallel FFT software libraries [39,... |

2 |
Beyond homogeneous decomposition: Scaling Long-Range Forces on Massively Parallel Systems
- Richards, Surh, et al.
- 2009
(Show Context)
Citation Context |

1 |
Fine-grained Data Distribution Operations for Particle Codes
- Hofmann, Rünger
- 2009
(Show Context)
Citation Context ...volved in the calculation of the near field sums (5.1) and (5.5), i.e., all the nodes xj, j ∈ ⋃ l∈Mr I P NE εI (l). We perform these two tasks at once using a fine-grained data distribution operation =-=[23]-=- that is implemented within a software library for parallel sorting algorithms [3]. Note that the serial fast summation algorithm requires the condition ‖xj‖2 < 1 4 −εB for every node xj, j = 1, . . .... |

1 | The nonequispaced FFT on graphics processing units - Kunis, Kunis |

1 |
2DECOMP&FFT, Parallel FFT subroutine library. http://www.2decomp.org
- Li
(Show Context)
Citation Context ... 44]. Whenever the dimensionality of the input array is at least three, a more scalable two-dimensional domain decomposition can be applied. Several publicly available parallel FFT software libraries =-=[39, 34, 32, 31, 37, 35]-=- based on this approach have been proposed during the last years. Following the example of the FFTW software library, the NFFT algorithm has been implemented in the publicly available NFFT software li... |

1 |
2DECOMP & FFT - A Highly Scalable 2D Decomposition Library and FFT Interface
- Li, Laizet
- 2010
(Show Context)
Citation Context ... 44]. Whenever the dimensionality of the input array is at least three, a more scalable two-dimensional domain decomposition can be applied. Several publicly available parallel FFT software libraries =-=[39, 34, 32, 31, 37, 35]-=- based on this approach have been proposed during the last years. Following the example of the FFTW software library, the NFFT algorithm has been implemented in the publicly available NFFT software li... |

1 |
P3DFFT, Parallel FFT subroutine library. http://code.google.com/p
- Pekurovsky
(Show Context)
Citation Context ... 44]. Whenever the dimensionality of the input array is at least three, a more scalable two-dimensional domain decomposition can be applied. Several publicly available parallel FFT software libraries =-=[39, 34, 32, 31, 37, 35]-=- based on this approach have been proposed during the last years. Following the example of the FFTW software library, the NFFT algorithm has been implemented in the publicly available NFFT software li... |

1 |
Parallel FFT subroutine library
- PFFT
- 2011
(Show Context)
Citation Context |

1 |
Parallel Nonequispaced FFT subroutine library
- PNFFT
- 2011
(Show Context)
Citation Context ...the vectors ∇f := (∇fj) j=1,...,M ∈ C3M , ˆf := ( ˆ fk)k∈IN ∈ C|IN |, and the matrix ∇A := ( ∇e−2πikxj ) ∈ C j=1,...,M; k∈IN 3M×|IN |. The parallel NFFT (PNFFT) algorithms, implemented in our library =-=[36]-=-, are fast approximate algorithms to compute the sums in (2.2) and the adjoint transform (2.3). These both transforms are also the cornerstone for the nonequispaced convolution, see e.g. [30]. In addi... |

1 |
PFFT - An extension of FFTW to massively parallel architectures
- Pippig
(Show Context)
Citation Context |

1 |
Parallel FFT subroutine library. http://www.sandia.gov/~sjplimp/docs/fft/ README.html
- Plimpton
(Show Context)
Citation Context |

1 |
An Implementation of Parallel 3-D FFT with 2-D Decomposition on a Massively Parallel Cluster of Multi-core Processors
- Takahashi
(Show Context)
Citation Context |

1 | OpenMP parallelization in the NFFT software library
- Volkmer
- 2012
(Show Context)
Citation Context ...wing the example of the FFTW software library, the NFFT algorithm has been implemented in the publicly available NFFT software library [27, 26], which also offers support of shared memory parallelism =-=[45]-=- and a parallel implementation for {michael.pippig,potts}@mathematik.tu-chemnitz.de, Chemnitz University of Technology, Department of Mathematics, 09107 Chemnitz, Germany 1graphic processing units [2... |