Results 1  10
of
44
A Decomposition of MultiDimensional Point Sets with Applications to kNearestNeighbors and nBody Potential Fields
 J. ACM
, 1992
"... We define the notion of a wellseparated pair decomposition of points in ddimensional space. We then develop efficient sequential and parallel algorithms for computing such a decomposition. We apply the resulting decomposition to the efficient computation of knearest neighbors and nbody potential ..."
Abstract

Cited by 241 (4 self)
 Add to MetaCart
We define the notion of a wellseparated pair decomposition of points in ddimensional space. We then develop efficient sequential and parallel algorithms for computing such a decomposition. We apply the resulting decomposition to the efficient computation of knearest neighbors and nbody potential fields.
A Parallel Hashed OctTree NBody Algorithm
, 1993
"... We report on an efficient adaptive Nbody method which we have recently designed and implemented. The algorithm computes the forces on an arbitrary distribution of bodies in a time which scales as N log N with the particle number. The accuracy of the force calculations is analytically bounded, and c ..."
Abstract

Cited by 147 (11 self)
 Add to MetaCart
We report on an efficient adaptive Nbody method which we have recently designed and implemented. The algorithm computes the forces on an arbitrary distribution of bodies in a time which scales as N log N with the particle number. The accuracy of the force calculations is analytically bounded, and can be adjusted via a user defined parameter between a few percent relative accuracy, down to machine arithmetic accuracy. Instead of using pointers to indicate the topology of the tree, we identify each possible cell with a key. The mapping of keys into memory locations is achieved via a hash table. This allows the program to access data in an efficient manner across multiple processors. Performance of the parallel program is measured on the 512 processor Intel Touchstone Delta system. We also comment on a number of wideranging applications which can benefit from application of this type of algorithm.
Special Purpose Parallel Computing
 Lectures on Parallel Computation
, 1993
"... A vast amount of work has been done in recent years on the design, analysis, implementation and verification of special purpose parallel computing systems. This paper presents a survey of various aspects of this work. A long, but by no means complete, bibliography is given. 1. Introduction Turing ..."
Abstract

Cited by 77 (5 self)
 Add to MetaCart
A vast amount of work has been done in recent years on the design, analysis, implementation and verification of special purpose parallel computing systems. This paper presents a survey of various aspects of this work. A long, but by no means complete, bibliography is given. 1. Introduction Turing [365] demonstrated that, in principle, a single general purpose sequential machine could be designed which would be capable of efficiently performing any computation which could be performed by a special purpose sequential machine. The importance of this universality result for subsequent practical developments in computing cannot be overstated. It showed that, for a given computational problem, the additional efficiency advantages which could be gained by designing a special purpose sequential machine for that problem would not be great. Around 1944, von Neumann produced a proposal [66, 389] for a general purpose storedprogram sequential computer which captured the fundamental principles of...
The Parallel Multipole Method on the Connection Machine
 SIAM J. ON SCIENTIFIC & STATISTICAL COMPUTING, 12(6):14201437, 1991
, 1991
"... This paper reports on a fast implementation of the threedimensional nonadaptive Parallel Multipole Method (PMM) on the Connection Machine system model CM2. The data interactions within the decomposition tree are modeled by a hierarchy of three dimensional grids forming a pyramid in which parent n ..."
Abstract

Cited by 46 (5 self)
 Add to MetaCart
This paper reports on a fast implementation of the threedimensional nonadaptive Parallel Multipole Method (PMM) on the Connection Machine system model CM2. The data interactions within the decomposition tree are modeled by a hierarchy of three dimensional grids forming a pyramid in which parent nodes have degree eight. The base of the pyramid is embedded in the Connection Machine as a three dimensional grid. The standard grid embedding feature is used. For 10 or more particles per processor the communication time is insignificant. The evaluation of the potential field for a system with 128k particles takes 5 seconds, and a million particle system about 3 minutes. The maximum number of particles that can be represented in 2G bytes of primary storage is ~ 50 million. The execution rate of this implementation of the PMM is at about 1.7 Gflops/sec for a particleprocessorratio of 10 or greater. A further speed improvement is possible by an improved use of the memory hierarchy associate...
Scalable Parallel Formulations of the BarnesHut Method for nBody Simulations
 IN PROCEEDINGS OF SUPERCOMPUTING '94
, 1994
"... In this paper, we present two new parallel formulations of the BarnesHut method. These parallel formulations are especially suited for simulations with irregular particle densities. We first present a parallel formulation that uses a static partitioning of the domain and assignment of subdomains to ..."
Abstract

Cited by 39 (7 self)
 Add to MetaCart
In this paper, we present two new parallel formulations of the BarnesHut method. These parallel formulations are especially suited for simulations with irregular particle densities. We first present a parallel formulation that uses a static partitioning of the domain and assignment of subdomains to processors. We demonstrate that this scheme delivers acceptable load balance, and coupled with two collective communication operations, it yields good performance. We present a second parallel formulation which combines static decomposition of the domain with an assignment of subdomains to processors based on Morton ordering. This alleviates the load imbalance inherent in the first scheme. The second parallel formulation is inspired by two currently best known parallel algorithms for the BarnesHut method. We present an experimental evaluation of these schemes on a 256 processor nCUBE2 parallel computer for an astrophysical simulation.
Fast Multipole Methods on Graphical Processors
 Journal of Computational Physics
"... The Fast Multipole Method allows the rapid evaluation of sums of radial basis functions centered at points distributed inside a computational domain at a large number of evaluation points to a specified accuracy ɛ. The method scales as O (N) compared to the direct method with complexity O(N 2), whic ..."
Abstract

Cited by 18 (5 self)
 Add to MetaCart
The Fast Multipole Method allows the rapid evaluation of sums of radial basis functions centered at points distributed inside a computational domain at a large number of evaluation points to a specified accuracy ɛ. The method scales as O (N) compared to the direct method with complexity O(N 2), which allows one to solve larger scale problems. Graphical processing units (GPU) are now increasingly viewed as data parallel compute coprocessors that can provide significant computational performance at low price. We describe acceleration of the FMM using the data parallel GPU architecture. The FMM has a complex hierarchical (adaptive) structure, which is not easily implemented on dataparallel processors. We described strategies for parallelization of all components of the FMM, develop a model to explain the performance of the algorithm on the GPU architectures, and determined optimal settings for the FMM on the GPU, which are different from those on usual CPUs. Some innovations in the FMM algorithm, including the use of modified stencils, real polynomial basis functions for the Laplace kernel, and decompositions of the translation operators, are also described. We obtained accelerations of the Laplace kernel FMM on a single NVIDIA GeForce 8800 GTX GPU in the range 3060 compared to a serial CPU implementation for benchmark cases of up to million size. For a problem with a million sources, the summations involved are performed in approximately one second. This performance is equivalent to solving of the same problem at 2443 Teraflop rate if we use straightforward summation. 1
nBody Simulations Using Message Passing Parallel Computers
, 1995
"... In this paper, we present new parallel formulations of the BarnesHut method for nbody simulations on message passing computers. These parallel formulations partition the domain efficiently incurring minimal communication overhead. This is in contrast to existing schemes that are based on sorting a ..."
Abstract

Cited by 13 (3 self)
 Add to MetaCart
In this paper, we present new parallel formulations of the BarnesHut method for nbody simulations on message passing computers. These parallel formulations partition the domain efficiently incurring minimal communication overhead. This is in contrast to existing schemes that are based on sorting a large number of keys or on the use of global data structures. The new formulations are augmented by alternate communication strategies which serve to minimize communication overhead. The impact of these communication strategies is experimentally studied. We report on experimental results obtained from an astrophysical simulation on an nCUBE2 parallel computer. 1 nBody Simulation Techniques The nbody problem simulates the behavior of n particles, each of which is influenced by every other particle during a timestep. An exact formulation of this problem therefore requires calculation of n 2 interactions between each pair of particles. Many approximate algorithms have been devised that r...
Load Balancing and Data Locality in the Parallelization of the Fast Multipole Algorithm
, 1996
"... Scientific problems are often irregular, large and computationally intensive. Efficient parallel implementations of algorithms that are employed in finding solutions to these problems play an important role in the development of science. This thesis studies the parallelization of a certain class of ..."
Abstract

Cited by 13 (9 self)
 Add to MetaCart
Scientific problems are often irregular, large and computationally intensive. Efficient parallel implementations of algorithms that are employed in finding solutions to these problems play an important role in the development of science. This thesis studies the parallelization of a certain class of irregular scientific problems, the Nbody problem, using a classical hierarchical algorithm: the Fast Multipole Algorithm (FMA). Hierarchical Nbody algorithms in general, and the FMA in particular, are amenable to parallel execution. However, performance gains are difficult to obtain, due to load imbalances that are primarily caused by the irregular distribution of bodies and of computation domains. Understanding application characteristics is essential for obtaining high performance implementations on parallel machines. After surveying the available parallelism in the FMA, we address the problem of exploiting this parallelism with partitioning and scheduling techniques that optimally map i...
Parallel Hierarchical Solvers and Preconditioners for Boundary Element Methods
 Purdue University
, 1997
"... The method of moments is an important tool for solving boundary integral equations arising in a variety of applications. It transforms the physical problem into a dense linear system. Due to the large number of variables and the associated computational requirements, these systems are solved iterati ..."
Abstract

Cited by 13 (6 self)
 Add to MetaCart
The method of moments is an important tool for solving boundary integral equations arising in a variety of applications. It transforms the physical problem into a dense linear system. Due to the large number of variables and the associated computational requirements, these systems are solved iteratively using methods such as GMRES, CG and its variants. The core operation of these iterative solvers is the application of the system matrix to a vector. This requires `(n 2 ) operations and memory using accurate dense methods. The computational complexity can be reduced to O(n log n) and the memory requirement to \Theta(n) using hierarchical approximation techniques. The algorithmic speedup from approximation can be combined with parallelism to yield very fast dense solvers. In this paper, we present efficient parallel formulations of dense iterative solvers based on hierarchical approximations for solving the integral form of Laplace equation. We study the impact of various parameters o...
The future fast fourier transform
 SIAM J. Sci. Computing
, 1999
"... It seems likely that improvements in arithmetic speed will continue to outpace advances in communications bandwidth. Furthermore, as more and more problems are working on huge datasets, it is becoming increasingly likely that data will be distributed across many processors because one processor does ..."
Abstract

Cited by 13 (0 self)
 Add to MetaCart
It seems likely that improvements in arithmetic speed will continue to outpace advances in communications bandwidth. Furthermore, as more and more problems are working on huge datasets, it is becoming increasingly likely that data will be distributed across many processors because one processor does not have sufficient storage capacity. For these reasons, we propose that an inexact DFT such as an approximate matrixvector approach based on singular values or a variation of the DuttRokhlin fastmultipolebased algorithm [9] may outperform any exact parallel FFT. The speedup may be as large as a factor of three in situations where FFT run time is dominated by communication. For the multipole idea we further propose that a method of “virtual charges ” may improve accuracy, and we provide an analysis of the singular values that are needed for the approximate matrixvector approaches. 1