Experiences with parallel N-body simulation
, 2000
"... This paper describes our experiences developing highperformance code for astrophysical Nbody simulations. Recent Nbody methods are based on an adaptive tree structure. The tree must be built and maintained across physically distributed memory; moreover, the communication requirements are irregul ..."
Abstract

Cited by 25 (0 self)
This paper describes our experiences developing high-performance code for astrophysical N-body simulations. Recent N-body methods are based on an adaptive tree structure. The tree must be built and maintained across physically distributed memory; moreover, the communication requirements are irregular and adaptive. Together with the need to balance the computational workload among processors, these issues pose interesting challenges and tradeoffs for high-performance implementation. Our implementation was guided by the need to keep solutions simple and general. We use a technique for implicitly representing a dynamic global tree across multiple processors which substantially reduces the programming complexity as well as the performance overheads of distributed memory architectures. The contributions include methods to vectorize the computation and minimize communication time which are theoretically and experimentally justified. The code has been tested by varying the number and distribution of bodies on different configurations of the Connection Machine CM5. The overall performance on instances with 10 million bodies is typically over 48 percent of the peak machine rate, which compares favorably with other approaches.
Highly Portable and Efficient Implementations of Parallel Adaptive N-Body Methods
 In SC'97
, 1997
"... We describe the design of several portable and efficient parallel implementations of adaptive Nbody methods, including the adaptive Fast Multipole Method, the adaptive version of Anderson's Method, and the BarnesHut algorithm. Our codes are based on a communication and work partitioning schem ..."
Abstract

Cited by 15 (2 self)
We describe the design of several portable and efficient parallel implementations of adaptive N-body methods, including the adaptive Fast Multipole Method, the adaptive version of Anderson's Method, and the Barnes-Hut algorithm. Our codes are based on a communication and work partitioning scheme that allows an efficient implementation of adaptive multipole methods even on high-latency systems. Our test runs demonstrate high performance and speedup on several parallel architectures, including traditional MPPs, shared-memory machines, and networks of workstations connected by Ethernet.

1 Introduction
The N-body problem is the problem of simulating the movement of a set of bodies (or particles) under the influence of gravitational, electrostatic, or other types of force. Algorithms for N-body simulations have a number of important applications in fields such as astrophysics, molecular dynamics, fluid dynamics, and even computer graphics [12]. A large number of algorithms for N-body simula...
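The baseline that all of these hierarchical methods improve upon is the direct all-pairs force evaluation. A minimal sketch of that O(N²) summation for the gravitational case (G = 1, with a Plummer-style softening parameter to avoid the singularity at zero separation); the function name, tuple layout, and softening value are illustrative choices, not taken from any of the papers listed here:

```python
import math

def direct_accels(bodies, eps=1e-3):
    """O(N^2) pairwise gravitational accelerations (G = 1).
    bodies: list of (x, y, z, mass) tuples -- an illustrative layout.
    eps softens close encounters: r^2 -> r^2 + eps^2."""
    n = len(bodies)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        xi, yi, zi, _ = bodies[i]
        for j in range(n):
            if i == j:
                continue  # no self-interaction
            xj, yj, zj, mj = bodies[j]
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            r2 = dx * dx + dy * dy + dz * dz + eps * eps
            f = mj / (r2 * math.sqrt(r2))  # m_j / r^3
            acc[i][0] += f * dx
            acc[i][1] += f * dy
            acc[i][2] += f * dz
    return acc
```

The doubly nested loop is what makes direct summation prohibitive for the 10-million-body runs described above, and what the tree-based methods replace with hierarchical approximation.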
Load Balancing and Data Locality in the Parallelization of the Fast Multipole Algorithm
, 1996
"... Scientific problems are often irregular, large and computationally intensive. Efficient parallel implementations of algorithms that are employed in finding solutions to these problems play an important role in the development of science. This thesis studies the parallelization of a certain class of ..."
Abstract

Cited by 13 (9 self)
Scientific problems are often irregular, large and computationally intensive. Efficient parallel implementations of algorithms that are employed in finding solutions to these problems play an important role in the development of science. This thesis studies the parallelization of a certain class of irregular scientific problems, the N-body problem, using a classical hierarchical algorithm: the Fast Multipole Algorithm (FMA). Hierarchical N-body algorithms in general, and the FMA in particular, are amenable to parallel execution. However, performance gains are difficult to obtain, due to load imbalances that are primarily caused by the irregular distribution of bodies and of computation domains. Understanding application characteristics is essential for obtaining high performance implementations on parallel machines. After surveying the available parallelism in the FMA, we address the problem of exploiting this parallelism with partitioning and scheduling techniques that optimally map i...
The Parallel Implementation of N-body Algorithms
, 1994
"... This dissertation studies issues critical to efficient Nbody simulations on parallel computers. The Nbody problem poses several challenges for distributedmemory implementation: adaptive distributed data structures, irregular data access patterns, and irregular and adaptive communication patterns. ..."
Abstract

Cited by 13 (1 self)
This dissertation studies issues critical to efficient N-body simulations on parallel computers. The N-body problem poses several challenges for distributed-memory implementation: adaptive distributed data structures, irregular data access patterns, and irregular and adaptive communication patterns. We introduce new techniques to maintain dynamic irregular data structures, to vectorize irregular computational structures, and for efficient communication. We report results from experiments on the Connection Machine CM5. The results demonstrate the performance advantages of design simplicity; the code provides generality of use on various message-passing architectures. Our methods have been used as the basis of a C++ library that provides abstractions for tree computations to ease the development of different N-body codes. This dissertation also presents the atomic message model to capture the important factors of efficient communication in message-passing systems. The atomic model was m...
Numerical Study of Three-Dimensional Flow using Fast Parallel Particle Algorithms.
, 1994
"... Numerical studies of turbulent flows have always been prone to crude approximations due to the limitations in computing power. With the advent of supercomputers, new turbulence models and fast particle algorithms, more highly resolved models can now be computed. Vortex Methods are gridfree and so a ..."
Abstract

Cited by 12 (1 self)
Numerical studies of turbulent flows have always been prone to crude approximations due to the limitations in computing power. With the advent of supercomputers, new turbulence models and fast particle algorithms, more highly resolved models can now be computed. Vortex Methods are grid-free and so avoid a number of shortcomings of grid-based methods for solving turbulent fluid flow equations; these include such problems as poor resolution and numerical diffusion. In these methods, the continuum vorticity field is discretised into a collection of Lagrangian elements, known as vortex elements, which are free to move in the flow field they collectively induce. The vortex element interaction constitutes an N-body problem, which may be calculated by a direct pairwise summation method, in a time proportional to N². This time complexity may be reduced by use of fast particle algorithms. The most common algorithms are known as the N-body treecodes and have a hierarchical structure. An inde...
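The hierarchical structure these treecodes share can be sketched as a minimal Barnes-Hut-style quadtree in 2D: distant groups of bodies are replaced by a single point mass at their centre of mass, opened only when they subtend too large an angle. This is an illustrative monopole-only sketch, not the method of any paper listed here; it assumes bodies have distinct positions and uses a softened distance with G = 1:

```python
import math

class Node:
    """Quadtree cell storing a monopole (total mass and centre of mass)."""
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half  # cell centre and half-width
        self.mass = 0.0
        self.comx = 0.0   # mass-weighted coordinate sums
        self.comy = 0.0
        self.body = None      # single (x, y, m) if this is an occupied leaf
        self.children = None  # four sub-quadrants once subdivided

    def insert(self, x, y, m):
        if self.children is None and self.body is None:
            self.body = (x, y, m)          # empty leaf: store the body here
        else:
            if self.children is None:
                self._subdivide()          # occupied leaf: split and push down
            self._child_for(x, y).insert(x, y, m)
        self.mass += m
        self.comx += m * x
        self.comy += m * y

    def _subdivide(self):
        h = self.half / 2
        self.children = [Node(self.cx + dx * h, self.cy + dy * h, h)
                         for dx in (-1, 1) for dy in (-1, 1)]
        bx, by, bm = self.body
        self.body = None
        self._child_for(bx, by).insert(bx, by, bm)  # re-insert the old body

    def _child_for(self, x, y):
        return self.children[(1 if x >= self.cx else 0) * 2
                             + (1 if y >= self.cy else 0)]

def accel(node, x, y, theta=0.5, eps=1e-3):
    """Acceleration at (x, y); a cell is opened when width / distance >= theta."""
    if node.mass == 0.0:
        return 0.0, 0.0
    dx = node.comx / node.mass - x
    dy = node.comy / node.mass - y
    d = math.sqrt(dx * dx + dy * dy) + eps   # softened distance
    if node.children is None or (2 * node.half) / d < theta:
        f = node.mass / d**3                 # treat the cell as a point mass (G = 1)
        return f * dx, f * dy
    ax = ay = 0.0
    for child in node.children:
        cax, cay = accel(child, x, y, theta, eps)
        ax += cax
        ay += cay
    return ax, ay
```

Because far-away cells are accepted as point masses, each evaluation touches O(log N) cells rather than N bodies, which is the source of the speedup the abstract refers to.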
A Data Parallel Implementation of Hierarchical N-body Methods
 In Proceedings of Supercomputing, the SCXY Conference Series
, 1994
"... The O(N) hierarchical Nbody algorithms and Massively Parallel Processors allow particle systems of 100 million particles or more to be simulated in acceptable time. We describe a data parallel implementation of Anderson's method and demonstrate both efficiency and scalability of the implementa ..."
Abstract

Cited by 11 (1 self)
The O(N) hierarchical N-body algorithms and Massively Parallel Processors allow particle systems of 100 million particles or more to be simulated in acceptable time. We describe a data parallel implementation of Anderson's method and demonstrate both efficiency and scalability of the implementation on the Connection Machine CM5/5E systems. The communication time for large particle systems amounts to about 10-25%, and the overall efficiency is about 35%. On a CM5E the overall performance is about 60 Mflop/s per node, independent of the number of nodes.
Software Issues In High-Performance Computing And A Framework For The Development Of HPC Applications
 Computing, U. Vishkin, Ed.: ACM
, 1994
"... We identify the following key problems faced by HPC software: (1) the large gap between HPC design and implementation models in application development, (2) achieving high performance for a single application on different HPC platforms, and (3) accommodating constant changes in both problem spe ..."
Abstract

Cited by 8 (5 self)
We identify the following key problems faced by HPC software: (1) the large gap between HPC design and implementation models in application development, (2) achieving high performance for a single application on different HPC platforms, and (3) accommodating constant changes in both problem specification and target architecture as computational methods and architectures evolve. To attack these problems, we suggest an application development methodology in which high-level architecture-independent specifications are elaborated, through an iterative refinement process which introduces architectural detail, into a form which can be translated to efficient low-level architecture-specific programming notations. A tree-structured development process permits multiple architectures to be targeted with implementation strategies appropriate to each architecture, and also provides a systematic means to accommodate changes in specification and target architecture. We describe the Pr...
Implementing N-body Algorithms Efficiently in Data-Parallel Languages
 Harvard University, Division of Applied Sciences
, 1996
"... The optimization techniques for hierarchical O(N ) Nbody algorithms described here focus on managing the data distribution and the data references, both between the memories of different nodes, and within the memory hierarchy of each node. We show how the techniques can be expressed in dataparal ..."
Abstract

Cited by 7 (5 self)
The optimization techniques for hierarchical O(N) N-body algorithms described here focus on managing the data distribution and the data references, both between the memories of different nodes, and within the memory hierarchy of each node. We show how the techniques can be expressed in data-parallel languages, such as High Performance Fortran (HPF) and Connection Machine Fortran (CMF). The effectiveness of our techniques is demonstrated on an implementation of Anderson's hierarchical O(N) N-body method for the Connection Machine system CM5/5E. Of the total execution time, communication accounts for about 10-20% of the total time, with the average efficiency for arithmetic operations being about 40% and the total efficiency (including communication) being about 35%. For the CM5E, a performance in excess of 60 Mflop/s per node (peak 160 Mflop/s per node) has been measured. © 1996 John Wiley & Sons, Inc.

1 INTRODUCTION
Achieving high efficiency in hierarchical methods on mas...