Results 1–10 of 33
A high-order 3D boundary integral equation solver for elliptic PDEs in smooth domains
Journal of Computational Physics, 2005
Cited by 43 (7 self)
We present a high-order boundary integral equation solver for 3D elliptic boundary value problems on domains with smooth boundaries. We use Nyström’s method for discretization and we combine it with special quadrature rules for the singular kernels that appear in the boundary integrals. The overall asymptotic complexity of our method is O(N^(3/2)), where N is the number of discretization points on the boundary of the domain, and corresponds to linear complexity in the number of uniformly sampled evaluation points. A kernel-independent fast summation algorithm is used to accelerate the evaluation of the discretized integral operators. We describe a high-order accurate method for evaluating the solution at arbitrary points inside the domain, including points close to the domain boundary. We demonstrate how our solver, combined with a regular-grid spectral solver, can be applied to problems with distributed sources. We present numerical results for the Stokes, Navier, and Poisson problems.
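The core idea of Nyström discretization mentioned above can be illustrated in one dimension (a minimal sketch, not the paper’s 3D solver with singular quadrature): a second-kind integral equation is turned into a dense linear system by replacing the integral with a quadrature rule. The kernel, right-hand side, and trapezoidal rule below are illustrative choices.

```python
import numpy as np

# Minimal 1-D sketch of the Nystrom method (not the paper's 3-D solver):
# discretize the second-kind Fredholm equation
#     u(x) - int_0^1 K(x, y) u(y) dy = f(x)
# with a quadrature rule, turning it into a dense linear system.

def nystrom_solve(K, f, n=20):
    """Solve u - integral(K u) = f on [0, 1] with the trapezoidal rule."""
    x = np.linspace(0.0, 1.0, n)
    w = np.full(n, x[1] - x[0])        # trapezoidal weights
    w[0] *= 0.5
    w[-1] *= 0.5
    A = np.eye(n) - K(x[:, None], x[None, :]) * w[None, :]
    return x, np.linalg.solve(A, f(x))

# Constant kernel K = 1/2 and f = 1: the exact solution is the constant
# u = 2, since c - c/2 = 1 for a constant solution c.
x, u = nystrom_solve(lambda x, y: 0.5 + 0.0 * x * y,
                     lambda x: np.ones_like(x))
```

For smooth kernels this converges at the order of the quadrature rule; the singular kernels of boundary integrals are what require the special rules described in the abstract.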
BOTTOM-UP CONSTRUCTION AND 2:1 BALANCE REFINEMENT OF LINEAR OCTREES IN PARALLEL
Cited by 32 (9 self)
Abstract. In this article, we propose new parallel algorithms for the construction and 2:1 balance refinement of large linear octrees on distributed memory machines. Such octrees are used in many problems in computational science and engineering, e.g., object representation, image analysis, unstructured meshing, finite elements, adaptive mesh refinement, and N-body simulations. Fixed-size scalability and isogranular analysis of the algorithms, using an MPI-based parallel implementation, was performed on a variety of input data and demonstrated good scalability for different processor counts (1 to 1024 processors) on the Pittsburgh Supercomputing Center’s TCS1 AlphaServer. The results are consistent for different data distributions. Octrees with over a billion octants were constructed and balanced in less than a minute on 1024 processors. Like other existing algorithms for constructing and balancing octrees, our algorithms have O(n log n) work and O(n) storage complexity. Under reasonable assumptions on the distribution of octants and the work per octant, the parallel time complexity is O((n/np) log(n/np) + np log np), where n is the final number of leaves and np is the number of processors. Key words: linear octrees, balance refinement, Morton encoding, large-scale parallel computing, space-filling curves. AMS subject classifications: 65N50, 65Y05, 68W10, 68W15.
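The Morton encoding named in the key words is the device that makes an octree "linear": interleaving the coordinate bits of each octant gives a sortable key along a space-filling curve. A minimal sketch (bit ordering and the level suffix that real linear-octree codes append vary between implementations):

```python
# Sketch of the Morton (Z-order) encoding that linear octrees rely on:
# interleaving the bits of an octant's (x, y, z) anchor coordinates gives
# a single key, so sorting keys orders octants along a space-filling curve.

def morton3(x: int, y: int, z: int, bits: int = 21) -> int:
    """Interleave the low `bits` bits of x, y, z into one Morton key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key

# Sorting octant anchors by Morton key yields the "linear" octree ordering.
anchors = [(1, 0, 0), (0, 1, 1), (0, 0, 0)]
linear = sorted(anchors, key=lambda a: morton3(*a))
```

Because the sorted key order preserves spatial locality, a linear octree can be partitioned across processors by simply splitting the sorted key array, which is what enables the distributed construction and balancing algorithms described above.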
Optimizing and Tuning the Fast Multipole Method for State-of-the-Art Multicore Architectures
Cited by 25 (9 self)
This work presents the first extensive study of single-node performance optimization, tuning, and analysis of the fast multipole method (FMM) on modern multicore systems. We consider single and double precision with numerous performance enhancements, including low-level tuning, numerical approximation, data structure transformations, OpenMP parallelization, and algorithmic tuning. Among our numerous findings, we show that optimization and parallelization can improve double-precision performance by 25× on Intel’s quad-core Nehalem, 9.4× on AMD’s quad-core Barcelona, and 37.6× on Sun’s Victoria Falls (dual sockets on all systems). We also compare our single-precision version against our prior state-of-the-art GPU-based code and show, surprisingly, that the most advanced multicore architecture (Nehalem) reaches parity in both performance and power efficiency with NVIDIA’s most advanced GPU architecture.
Fast multipole method for the biharmonic equation in three dimensions
J. Comput. Phys., 2006
Cited by 17 (8 self)
The evaluation of sums (matrix-vector products) of the solutions of the three-dimensional biharmonic equation can be accelerated using the fast multipole method, while memory requirements can also be significantly reduced. We develop a complete translation theory for these equations. It is shown that translations of elementary solutions of the biharmonic equation can be achieved by considering the translation of a pair of elementary solutions of the Laplace equation. The extension of the theory to the case of polyharmonic equations in R^3 is also discussed. An efficient way of performing the FMM for the biharmonic equation using the solution of a complex-valued FMM for the Laplace equation is presented. Compared to previous methods presented for the biharmonic equation, our method appears more efficient. The theory is implemented and numerical tests presented that demonstrate the performance of the method for varying problem sizes and accuracy requirements. In our implementation, the FMM for the biharmonic equation is faster than the direct matrix-vector product for a matrix size of N = 550 at a relative L2 accuracy of 10^(-4), and N = 3550 at 10^(-12).
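The link between biharmonic and Laplace elementary solutions can be checked numerically: in 3D, r = |x| is biharmonic away from the origin, since its Laplacian is 2/r, a scalar multiple of the Laplace kernel 1/r, which is itself harmonic. The finite-difference step size below is an illustrative choice.

```python
import math

# Numerically check that r = |x| is biharmonic in 3-D away from the origin:
# the Laplacian of r is 2/r (proportional to the Laplace kernel 1/r), and
# applying the Laplacian again gives zero. This mirrors the pairing of
# biharmonic and Laplace elementary solutions described in the abstract.

def laplacian(f, p, h=5e-3):
    """Central-difference Laplacian of f at the 3-D point p."""
    total = 0.0
    for i in range(3):
        q_plus = list(p); q_plus[i] += h
        q_minus = list(p); q_minus[i] -= h
        total += (f(q_plus) + f(q_minus) - 2.0 * f(p)) / (h * h)
    return total

r = lambda p: math.sqrt(p[0]**2 + p[1]**2 + p[2]**2)
p = [1.0, 2.0, 2.0]                 # |p| = 3, well away from the singularity

lap_r = laplacian(r, p)                              # approx 2/3
bilap_r = laplacian(lambda q: laplacian(r, q), p)    # approx 0
```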
Petascale direct numerical simulation of blood flow on 200K cores and heterogeneous architectures
On the Limits of GPU Acceleration
Cited by 12 (0 self)
This paper throws a small “wet blanket” on the hot topic of GPGPU acceleration, based on experience analyzing and tuning both multithreaded CPU and GPU implementations of three computations in scientific computing. These computations—(a) iterative sparse linear solvers; (b) sparse Cholesky factorization; and (c) the fast multipole method—exhibit complex behavior and vary in computational intensity and memory reference irregularity. In each case, algorithmic analysis and prior work might lead us to conclude that an idealized GPU can deliver better performance, but we find that, with at least equal-effort CPU tuning and consideration of realistic workloads and calling contexts, two modern quad-core CPU sockets can roughly match one or two GPUs.
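The "computational intensity" argument above can be made concrete with a back-of-the-envelope estimate for sparse matrix-vector multiply, the kernel behind computation (a). The byte counts below are illustrative assumptions (double-precision CSR with 32-bit column indices, no cache reuse of the input vector), not figures from the paper.

```python
# Rough arithmetic-intensity estimate for CSR sparse matrix-vector multiply
# (SpMV), a memory-bound kernel of the kind the paper analyzes. Assumed
# traffic per nonzero: one 8-byte matrix value, one 4-byte column index,
# and one 8-byte load from the input vector.

flops_per_nnz = 2.0                  # one multiply + one add per nonzero
bytes_per_nnz = 8.0 + 4.0 + 8.0      # value + column index + vector element

intensity = flops_per_nnz / bytes_per_nnz   # flops per byte
```

At roughly 0.1 flop/byte, such a kernel is bandwidth-bound on both CPUs and GPUs, which is why the achievable GPU advantage is capped near the memory-bandwidth ratio of the two platforms rather than their peak-flop ratio.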
X10 as a parallel language for scientific computation: practice and experience
Cited by 12 (7 self)
Abstract—X10 is an emerging Partitioned Global Address Space (PGAS) language intended to significantly increase the productivity of developing scalable HPC applications. The language has now matured to a point where it is meaningful to consider writing large-scale scientific application codes in X10. This paper reports our experiences writing three codes from ...
A Free-Space Adaptive FMM-Based PDE Solver in Three Dimensions
2008
Cited by 9 (1 self)
We present a kernel-independent, adaptive fast multipole method (FMM) of arbitrary order accuracy for solving elliptic PDEs in three dimensions with radiation boundary conditions. The algorithm requires only a Green’s function evaluation routine for the governing equation and a representation of the source distribution (the right-hand side) that can be evaluated at arbitrary points. The performance of the FMM is accelerated in two ways. First, we construct a piecewise polynomial approximation of the right-hand side and compute far-field expansions in the FMM from the coefficients of this approximation. Second, we precompute tables of quadratures to handle the near-field interactions on adaptive octree data structures, keeping the total storage requirements in check through the exploitation of symmetries. We present numerical examples for the Laplace, modified Helmholtz and Stokes equations.
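A kernel-independent FMM needs only point evaluations of the Green’s function. The O(N²) direct sum it accelerates can be sketched as follows; the free-space Laplace kernel 1/(4π|x−y|) is used as the example kernel, and the FMM machinery itself is omitted.

```python
import math

# Direct O(N^2) evaluation of potentials u(x_i) = sum_j G(x_i, y_j) q_j,
# the computation a kernel-independent FMM accelerates to roughly O(N).
# Only point evaluations of the kernel G are required, which is what the
# "kernel-independent" formulation exploits.

def laplace_green(x, y):
    """Free-space Green's function of the 3-D Laplace equation."""
    return 1.0 / (4.0 * math.pi * math.dist(x, y))

def direct_sum(targets, sources, charges, green=laplace_green):
    return [sum(green(x, y) * q for y, q in zip(sources, charges))
            for x in targets]

# One unit source at the origin, evaluated at distance 1:
u = direct_sum(targets=[(1.0, 0.0, 0.0)],
               sources=[(0.0, 0.0, 0.0)],
               charges=[1.0])
```

Swapping `laplace_green` for another kernel (modified Helmholtz, Stokeslet components, ...) changes nothing else in the evaluation, which is the property the solver above relies on.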
Hybrid MPI-thread parallelization of the Fast Multipole Method
in 6th International Symposium on Parallel and Distributed Computing (ISPDC)
Cited by 5 (0 self)
We present in this paper multithread and multiprocess parallelizations of the Fast Multipole Method (FMM) for the Laplace equation, for uniform and non-uniform distributions. These parallelizations apply to the original FMM formulation and to our new matrix formulation with BLAS (Basic Linear Algebra Subprograms) routines. Differences between the multithread and multiprocess versions are detailed, and a hybrid MPI-thread approach gains parallel efficiency and memory scalability over the pure MPI one on clusters of SMP nodes. On 128 processors, we obtain 85% (respectively 75%) parallel efficiency for uniform (respectively non-uniform) distributions with up to 100 million particles.
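The efficiency figures quoted above follow the standard definition: parallel efficiency is speedup divided by processor count. A quick check of the 85% figure (the serial and parallel times below are made-up illustrative values chosen to reproduce it):

```python
# Parallel efficiency = speedup / processor count = (T1 / Tp) / p.
# The timings here are hypothetical; they are chosen only so the example
# reproduces the abstract's 85% efficiency on 128 processors.

def parallel_efficiency(t1: float, tp: float, p: int) -> float:
    """Efficiency of a run taking tp seconds on p processors vs t1 serial."""
    return (t1 / tp) / p

t1 = 1088.0   # hypothetical serial time (seconds)
tp = 10.0     # hypothetical time on 128 processors
eff = parallel_efficiency(t1, tp, p=128)   # 108.8x speedup on 128 procs
```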
A Parallel and Incremental Extraction of Variational Capacitance With Stochastic Geometric Moments
Cited by 3 (1 self)
Abstract—This paper presents a parallel and incremental solver for stochastic capacitance extraction. The random geometrical variation is described by stochastic geometrical moments, which lead to a densely augmented system equation. To efficiently extract the capacitance and solve the system equation, a parallel fast multipole method (FMM) is developed in the framework of stochastic geometrical moments. This can efficiently estimate the stochastic potential interaction and its matrix-vector product (MVP) with charge. Moreover, a generalized minimal residual (GMRES) method with incremental update is developed to calculate both the nominal value and the variance. Our overall extraction flow is called piCAP. A number of experiments show that piCAP efficiently handles large-scale on-chip capacitance extraction with variations. Specifically, a parallel MVP in piCAP is faster than a serial MVP, and an incremental GMRES in piCAP is faster than non-incremental GMRES methods. Index Terms—Capacitance extraction, fast multipole method, process variation.
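The GMRES iteration at the core of this flow is a standard Krylov solver. A minimal usage sketch with SciPy’s implementation (a small dense stand-in for the capacitance system equation, not piCAP’s incremental variant):

```python
import numpy as np
from scipy.sparse.linalg import gmres

# Minimal GMRES usage sketch (SciPy's standard solver, not piCAP's
# incremental variant): solve A x = b for a small symmetric system
# standing in for the capacitance system equation.

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

x, info = gmres(A, b, atol=1e-12)   # info == 0 signals convergence
```

In an FMM-accelerated flow, A would not be formed explicitly; the matrix-vector product can instead be supplied matrix-free, e.g. through `scipy.sparse.linalg.LinearOperator`, with the FMM providing each product.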