Results 1-10 of 68
A Decomposition of Multi-Dimensional Point Sets with Applications to k-Nearest-Neighbors and n-Body Potential Fields
J. ACM, 1992
Cited by 281 (4 self)
Abstract
We define the notion of a well-separated pair decomposition of points in d-dimensional space. We then develop efficient sequential and parallel algorithms for computing such a decomposition. We apply the resulting decomposition to the efficient computation of k-nearest neighbors and n-body potential fields.
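The well-separatedness condition at the heart of this decomposition is easy to state concretely. The sketch below is a hypothetical pure-Python illustration (the function names are our own, and a centroid-based enclosing ball stands in for the paper's exact construction): two point sets are s-well-separated if each fits in a ball of radius r and the balls are at least s*r apart.

```python
import math

def bounding_ball(points):
    """Center and radius of a simple (not minimal) enclosing ball:
    the centroid, plus the farthest-point distance as radius."""
    d = len(points[0])
    center = [sum(p[i] for p in points) / len(points) for i in range(d)]
    radius = max(math.dist(p, center) for p in points)
    return center, radius

def well_separated(a, b, s):
    """True if point sets a and b are s-well-separated: each is enclosed
    in a ball of radius r, and the balls are at least s*r apart."""
    ca, ra = bounding_ball(a)
    cb, rb = bounding_ball(b)
    r = max(ra, rb)
    gap = math.dist(ca, cb) - 2 * r   # gap between the two radius-r balls
    return gap >= s * r
```

For instance, two tight clusters far apart are well-separated for s = 2, while adjacent clusters are not.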
A Parallel Hashed Oct-Tree N-Body Algorithm
1993
Cited by 200 (14 self)
Abstract
We report on an efficient adaptive N-body method which we have recently designed and implemented. The algorithm computes the forces on an arbitrary distribution of bodies in a time which scales as N log N with the particle number. The accuracy of the force calculations is analytically bounded, and can be adjusted via a user-defined parameter between a few percent relative accuracy and machine arithmetic accuracy. Instead of using pointers to indicate the topology of the tree, we identify each possible cell with a key. The mapping of keys into memory locations is achieved via a hash table. This allows the program to access data in an efficient manner across multiple processors. Performance of the parallel program is measured on the 512-processor Intel Touchstone Delta system. We also comment on a number of wide-ranging applications which can benefit from the application of this type of algorithm.
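The key-based tree representation described above can be sketched as follows (a hypothetical illustration in the spirit of Warren-Salmon keys; the exact bit layout and function names are our own): interleaving the bits of a cell's integer coordinates yields a key, a sentinel bit keeps keys of different levels distinct, and a plain dictionary plays the role of the hash table.

```python
def morton_key(ix, iy, iz, level):
    """Interleave the bits of the integer cell coordinates (ix, iy, iz)
    at the given refinement level, prepending a sentinel 1-bit so that
    keys of different levels never collide."""
    key = 1  # sentinel bit; the root cell (level 0) has key 1
    for bit in reversed(range(level)):
        key = (key << 3) | (((ix >> bit) & 1) << 2) \
                         | (((iy >> bit) & 1) << 1) \
                         | ((iz >> bit) & 1)
    return key

def parent_key(key):
    """The parent cell's key is obtained by dropping the last 3 bits."""
    return key >> 3

# A plain dict stands in for the hash table mapping keys to cell data.
cells = {}
cells[morton_key(3, 1, 2, 2)] = {"mass": 1.0}
```

Note that `parent_key(morton_key(3, 1, 2, 2))` equals `morton_key(1, 0, 1, 1)`, i.e. halving each coordinate walks up the tree, which is what makes pointerless traversal possible.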
Special Purpose Parallel Computing
Lectures on Parallel Computation, 1993
Cited by 81 (6 self)
Abstract
A vast amount of work has been done in recent years on the design, analysis, implementation and verification of special purpose parallel computing systems. This paper presents a survey of various aspects of this work. A long, but by no means complete, bibliography is given. 1. Introduction. Turing [365] demonstrated that, in principle, a single general purpose sequential machine could be designed which would be capable of efficiently performing any computation which could be performed by a special purpose sequential machine. The importance of this universality result for subsequent practical developments in computing cannot be overstated. It showed that, for a given computational problem, the additional efficiency advantages which could be gained by designing a special purpose sequential machine for that problem would not be great. Around 1944, von Neumann produced a proposal [66, 389] for a general purpose stored-program sequential computer which captured the fundamental principles of...
The Parallel Multipole Method on the Connection Machine
SIAM J. on Scientific & Statistical Computing, 12(6):1420-1437, 1991
Cited by 54 (6 self)
Abstract
This paper reports on a fast implementation of the three-dimensional non-adaptive Parallel Multipole Method (PMM) on the Connection Machine system model CM-2. The data interactions within the decomposition tree are modeled by a hierarchy of three-dimensional grids forming a pyramid in which parent nodes have degree eight. The base of the pyramid is embedded in the Connection Machine as a three-dimensional grid. The standard grid embedding feature is used. For 10 or more particles per processor the communication time is insignificant. The evaluation of the potential field for a system with 128k particles takes 5 seconds, and a million-particle system about 3 minutes. The maximum number of particles that can be represented in 2 Gbytes of primary storage is ~50 million. The execution rate of this implementation of the PMM is about 1.7 Gflops/s for a particle-to-processor ratio of 10 or greater. A further speed improvement is possible by an improved use of the memory hierarchy associate...
Fast Multipole Methods on Graphical Processors
 Journal of Computational Physics
Cited by 47 (6 self)
Abstract
The Fast Multipole Method allows the rapid evaluation of sums of radial basis functions centered at points distributed inside a computational domain at a large number of evaluation points to a specified accuracy ɛ. The method scales as O(N) compared to the direct method with complexity O(N^2), which allows one to solve larger scale problems. Graphical processing units (GPUs) are now increasingly viewed as data-parallel compute coprocessors that can provide significant computational performance at low price. We describe acceleration of the FMM using the data-parallel GPU architecture. The FMM has a complex hierarchical (adaptive) structure, which is not easily implemented on data-parallel processors. We describe strategies for parallelization of all components of the FMM, develop a model to explain the performance of the algorithm on GPU architectures, and determine optimal settings for the FMM on the GPU, which are different from those on usual CPUs. Some innovations in the FMM algorithm, including the use of modified stencils, real polynomial basis functions for the Laplace kernel, and decompositions of the translation operators, are also described. We obtained accelerations of the Laplace kernel FMM on a single NVIDIA GeForce 8800 GTX GPU in the range 30-60 compared to a serial CPU implementation for benchmark cases of up to a million in size. For a problem with a million sources, the summations involved are performed in approximately one second. This performance is equivalent to solving the same problem at a 24-43 Teraflop rate if we use straightforward summation.
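For reference, the O(N^2) direct summation that the FMM accelerates can be written in a few lines. This is a hypothetical pure-Python sketch of the Laplace-kernel sum, not code from the paper:

```python
import math

def direct_laplace_sum(sources, charges, targets):
    """Brute-force O(N*M) evaluation of the Laplace-kernel sum
    phi(y) = sum_i q_i / |y - x_i| -- the baseline that the FMM
    reduces to O(N) at a fixed prescribed accuracy."""
    phi = []
    for y in targets:
        total = 0.0
        for x, q in zip(sources, charges):
            r = math.dist(x, y)
            if r > 0.0:          # skip the self-interaction
                total += q / r
        phi.append(total)
    return phi
```

Every target-source pair is visited once, which is exactly the quadratic cost the hierarchical method avoids.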
Scalable Parallel Formulations of the Barnes-Hut Method for n-Body Simulations
In Proceedings of Supercomputing '94, 1994
Cited by 47 (7 self)
Abstract
In this paper, we present two new parallel formulations of the Barnes-Hut method. These parallel formulations are especially suited for simulations with irregular particle densities. We first present a parallel formulation that uses a static partitioning of the domain and assignment of subdomains to processors. We demonstrate that this scheme delivers acceptable load balance, and coupled with two collective communication operations, it yields good performance. We present a second parallel formulation which combines static decomposition of the domain with an assignment of subdomains to processors based on Morton ordering. This alleviates the load imbalance inherent in the first scheme. The second parallel formulation is inspired by the two best-known current parallel algorithms for the Barnes-Hut method. We present an experimental evaluation of these schemes on a 256-processor nCUBE2 parallel computer for an astrophysical simulation.
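The Barnes-Hut method underlying both formulations approximates a distant cluster of bodies by its center of mass. The standard multipole acceptance test can be sketched as follows (a hypothetical illustration; the function name and default theta are our own choices, not from the paper):

```python
import math

def accept_cell(cell_size, cell_center_of_mass, point, theta=0.5):
    """Barnes-Hut multipole acceptance test: a tree cell may be
    approximated by its center of mass when s/d < theta, where s is
    the cell's side length and d is the distance from the evaluation
    point to the cell's center of mass. Smaller theta means higher
    accuracy and more tree cells opened."""
    d = math.dist(cell_center_of_mass, point)
    return d > 0 and cell_size / d < theta
```

A tree walk applies this test at each cell: if it passes, the cell contributes a single center-of-mass interaction; otherwise its eight children are visited recursively.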
The Future Fast Fourier Transform
SIAM J. Sci. Computing, 1999
Cited by 20 (0 self)
Abstract
It seems likely that improvements in arithmetic speed will continue to outpace advances in communications bandwidth. Furthermore, as more and more problems involve huge datasets, it is becoming increasingly likely that data will be distributed across many processors because one processor does not have sufficient storage capacity. For these reasons, we propose that an inexact DFT, such as an approximate matrix-vector approach based on singular values or a variation of the Dutt-Rokhlin fast-multipole-based algorithm [9], may outperform any exact parallel FFT. The speedup may be as large as a factor of three in situations where FFT run time is dominated by communication. For the multipole idea we further propose that a method of "virtual charges" may improve accuracy, and we provide an analysis of the singular values that are needed for the approximate matrix-vector approaches.
Parallel Hierarchical Solvers and Preconditioners for Boundary Element Methods
Purdue University, 1997
Cited by 19 (6 self)
Abstract
The method of moments is an important tool for solving boundary integral equations arising in a variety of applications. It transforms the physical problem into a dense linear system. Due to the large number of variables and the associated computational requirements, these systems are solved iteratively using methods such as GMRES, CG and its variants. The core operation of these iterative solvers is the application of the system matrix to a vector. This requires Θ(n^2) operations and memory using accurate dense methods. The computational complexity can be reduced to O(n log n) and the memory requirement to Θ(n) using hierarchical approximation techniques. The algorithmic speedup from approximation can be combined with parallelism to yield very fast dense solvers. In this paper, we present efficient parallel formulations of dense iterative solvers based on hierarchical approximations for solving the integral form of the Laplace equation. We study the impact of various parameters o...
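The point that iterative solvers touch the system matrix only through matrix-vector products can be made concrete with a matrix-free solver: any routine with the right signature, including a hierarchical O(n log n) approximation, can be plugged in for the matvec. The following minimal pure-Python conjugate gradient is our own illustration, not the paper's code:

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=200):
    """Matrix-free conjugate gradient for a symmetric positive definite
    system: the matrix enters only through the matvec callback, the one
    operation that hierarchical (FMM-style) approximations accelerate."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                       # residual for the zero initial guess
    p = list(r)                       # initial search direction
    rs = sum(v * v for v in r)
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

Replacing an exact dense matvec by a hierarchical approximation changes nothing else in the solver, which is why the Θ(n^2)-to-O(n log n) reduction translates directly into overall speedup.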
Scalable Fast Multipole Methods on Distributed Heterogeneous Architectures
Cited by 15 (4 self)
Abstract
We fundamentally reconsider implementation of the Fast Multipole Method (FMM) on a computing node with a heterogeneous CPU-GPU architecture with multicore CPU(s) and one or more GPU accelerators, as well as on an interconnected cluster of such nodes. The FMM is a divide-and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used in a time-stepping or iterative loop. Using the observation that the local summation and the analysis-based translation parts of the FMM are independent, we map these respectively to the GPUs and CPUs. Careful analysis of the FMM is performed to distribute work optimally between the multicore CPUs and the GPU accelerators. We first develop a single-node version where the CPU part is parallelized using OpenMP and the GPU part via CUDA. New parallel algorithms for creating FMM data structures are presented together with load balancing strategies for the single-node and distributed multiple-node versions. Our implementation can perform the N-body sum for 128M particles on 16 nodes in 4.23 seconds, a performance not achieved by others in the literature on such clusters. ACM computing classification: C.1.2 [Multiple Data Stream Architectures]: Parallel processors; C.1.m [Miscellaneous]:
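The observation that the near-field summation and the far-field translations are independent is what permits the CPU/GPU split described above. A toy sketch of that overlap using Python threads (the worker functions below are hypothetical stand-ins, not the paper's kernels):

```python
import concurrent.futures

def local_direct_sum(particles):
    """Stand-in for the GPU part: dense near-field interactions."""
    return sum(p * p for p in particles)

def multipole_translations(particles):
    """Stand-in for the CPU part: far-field translation operators."""
    return sum(particles)

def fmm_step(particles):
    """Run the two independent phases concurrently on two threads,
    mirroring the paper's mapping of local sums to GPUs and
    translations to CPUs, then combine the partial results."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        near = pool.submit(local_direct_sum, particles)
        far = pool.submit(multipole_translations, particles)
        return near.result() + far.result()
```

In the real implementation the combination step adds the near-field and far-field contributions per particle; the scheduling structure, two independent phases joined at the end, is the same.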