Results 1 – 10 of 103
A Parallel Hashed Oct-Tree N-Body Algorithm
, 1993
Abstract

Cited by 200 (14 self)
We report on an efficient adaptive N-body method which we have recently designed and implemented. The algorithm computes the forces on an arbitrary distribution of bodies in a time which scales as N log N with the particle number. The accuracy of the force calculations is analytically bounded, and can be adjusted via a user-defined parameter from a few percent relative accuracy down to machine arithmetic accuracy. Instead of using pointers to indicate the topology of the tree, we identify each possible cell with a key. The mapping of keys into memory locations is achieved via a hash table. This allows the program to access data in an efficient manner across multiple processors. Performance of the parallel program is measured on the 512-processor Intel Touchstone Delta system. We also comment on a number of wide-ranging applications which can benefit from application of this type of algorithm.
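The key-based cell addressing described in this abstract is commonly built on interleaved-bit ("Morton") keys. The sketch below is a hypothetical simplification, not the actual HOT code: coordinates are assumed to be pre-quantized to integers, a leading 1 bit keeps keys of different tree levels distinct, and a plain dict stands in for the hash table.

```python
def morton_key(ix, iy, iz, bits):
    """Interleave the bits of three integer cell coordinates into one key.
    A leading 1 bit distinguishes keys of different tree depths."""
    key = 1
    for b in range(bits - 1, -1, -1):
        key = (key << 3) | (((ix >> b) & 1) << 2) | (((iy >> b) & 1) << 1) | ((iz >> b) & 1)
    return key

# A dict plays the role of the hash table mapping keys to cell data,
# so any processor can locate a cell from its key alone.
cells = {}
cells[morton_key(3, 5, 1, 3)] = {"mass": 1.0}
```

Because the key encodes both the cell's position and its depth, parent/child keys can be derived by shifting, which is what makes pointer-free tree traversal possible.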
Commutativity Analysis: A New Analysis Technique for Parallelizing Compilers
 ACM TRANSACTIONS ON PROGRAMMING LANGUAGES AND SYSTEMS
, 1997
Abstract

Cited by 86 (11 self)
This article presents a new analysis technique, commutativity analysis, for automatically parallelizing computations that manipulate dynamic, pointer-based data structures. Commutativity analysis views the computation as composed of operations on objects. It then analyzes the program at this granularity to discover when operations commute (i.e., generate the same final result regardless of the order in which they execute). If all of the operations required to perform a given computation commute, the compiler can automatically generate parallel code. We have implemented a prototype compilation system that uses commutativity analysis as its primary analysis technique
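The central claim above — that operations which produce the same final state in any execution order can safely run in parallel — can be illustrated with a toy example of our own (not taken from the paper):

```python
import itertools

class Node:
    """Toy object whose 'add' operation commutes: the final state is
    independent of the order in which the additions are applied."""
    def __init__(self):
        self.total = 0.0

    def add(self, x):
        self.total += x

ops = [1.5, 2.0, 3.25]   # exactly representable, so + is order-independent here
results = set()
for order in itertools.permutations(ops):
    n = Node()
    for x in order:
        n.add(x)
    results.add(n.total)
# Every execution order yields one and the same final state, which is the
# property a parallelizing compiler would exploit.
```

Note that floating-point addition is not associative in general; the values here are chosen so every ordering is exact, which keeps the illustration honest.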
The Design, Implementation, and Evaluation of Jade
 ACM Transactions on Programming Languages and Systems
, 1998
Abstract

Cited by 83 (7 self)
In this article we discuss the design goals and decisions that determined the final form of Jade and present an overview of the Jade implementation. We also present our experience using Jade to implement several complete scientific and engineering applications. We use this experience to evaluate how the different Jade language features were used in practice and how well Jade as a whole supports the process of developing parallel applications. We find that the basic idea of preserving the serial semantics simplifies the program development process, and that the concept of using data access specifications to guide the parallelization offers significant advantages over more traditional control-based approaches. We also find that the Jade data model can interact poorly with concurrency patterns that write disjoint pieces of a single aggregate data structure, although this problem arises in only one of the applications.
Working Sets, Cache Sizes, and Node Granularity Issues for Large-Scale Multiprocessors
 In Proceedings of the 20th Annual International Symposium on Computer Architecture
, 1993
Abstract

Cited by 78 (5 self)
The distribution of resources among processors, memory and caches is a crucial question faced by designers of large-scale parallel machines. If a machine is to solve problems with a certain data set size, should it be built with a large number of processors each with a small amount of memory, or a smaller number of processors each with a large amount of memory? How much cache memory should be provided per processor for cost-effectiveness? And how do these decisions change as larger problems are run on larger machines? In this paper, we explore the above questions based on the characteristics of five important classes of large-scale parallel scientific applications. We first show that all the applications have a hierarchy of well-defined per-processor working sets, whose size, performance impact and scaling characteristics can help determine how large the different levels of a multiprocessor's cache hierarchy should be. Then, we use these working sets together with certain other important characteristics of the applications, such as communication-to-computation ratios, concurrency, and load balancing behavior, to reflect upon the broader question of the granularity of processing nodes in high-performance multiprocessors.
Load Balancing and Data Locality in Adaptive Hierarchical N-Body Methods: Barnes-Hut, Fast Multipole, and Radiosity
 Journal Of Parallel and Distributed Computing
, 1995
Abstract

Cited by 76 (2 self)
... processes, are increasingly being used to solve large-scale problems in a variety of scientific/engineering domains. Applications that use these methods are challenging to parallelize effectively, however, owing to their non-uniform, dynamically changing characteristics and their need for long-range communication.
Application scheduling and processor allocation in multiprogrammed parallel processing systems
 Performance Evaluation
, 1994
A Portable Parallel Particle Program
 Computer Physics Communications
, 1995
Abstract

Cited by 66 (9 self)
We describe our implementation of the parallel hashed oct-tree (HOT) code, and in particular its application to neighbor finding in a smoothed particle hydrodynamics (SPH) code. We also review the error bounds on the multipole approximations involved in treecodes, and extend them to include general cell-cell interactions. Performance of the program on a variety of problems (including gravity, SPH, vortex method and panel method) is measured on several parallel and sequential machines. There are two strategies that can be applied in the quest for more knowledge from bigger and better particle simulations. One can use the brute-force approach: simple algorithms on bigger and faster machines (and bigger and faster now means massively parallel). To compute the gravitational force and potential for a single interaction takes 28 floating-point operations (here we count a division as 4 floating-point operations and a square root as 4 floating-point operations). A typical grav...
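The per-interaction cost quoted above (with divisions and square roots weighted at 4 flops each) comes from an expression of roughly the following shape. The function and variable names below are ours, and the sketch computes acceleration only, not the potential, so it accounts for part of the 28-flop budget:

```python
import math

def grav_accel(xi, xj, mj):
    """Acceleration on a particle at xi due to a mass mj at xj (G = 1).
    Component subtractions, a dot product, one square root and one
    division make up most of the per-interaction flop count."""
    dx = [b - a for a, b in zip(xi, xj)]          # 3 subtractions
    r2 = dx[0]*dx[0] + dx[1]*dx[1] + dx[2]*dx[2]  # 3 mults + 2 adds
    inv_r = 1.0 / math.sqrt(r2)                   # sqrt + division
    s = mj * inv_r * inv_r * inv_r                # m / r^3
    return [s * d for d in dx]                    # 3 mults

a = grav_accel([0.0, 0.0, 0.0], [2.0, 0.0, 0.0], 8.0)
```

Multiplying a count like this by N² pairs is what makes the brute-force strategy expensive and motivates the tree methods surveyed in this listing.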
Skeletons from the Treecode Closet
 J. Comp. Phys
, 1994
Abstract

Cited by 59 (12 self)
We consider treecodes (N-body programs which use a tree data structure) from the standpoint of their worst-case behavior. That is, we derive upper bounds on the largest possible errors that are introduced into a calculation by use of various multipole acceptability criteria (MAC). We find that the conventional Barnes-Hut MAC can introduce potentially unbounded errors unless θ < 1/√3, and that this behavior, while rare, is demonstrable in astrophysically reasonable examples. We consider two other MACs closely related to the BH MAC. While they don't admit the same unbounded errors, they nevertheless require extraordinary amounts of CPU time to guarantee modest levels of accuracy. We derive new error bounds based on some additional, easily computed moments of the mass distribution. These error bounds form the basis for four new MACs which can be used to limit the absolute or relative error introduced by each multipole evaluation, or, with the introduction of some additional data struc...
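The conventional Barnes-Hut criterion discussed above compares a cell's size to its distance from the evaluation point. A minimal sketch, with names of our choosing and the θ < 1/√3 safety bound from the abstract:

```python
import math

def bh_mac_accept(cell_size, dist, theta):
    """Conventional Barnes-Hut MAC: accept the multipole approximation
    for a cell when size/distance < theta.  Per the abstract, worst-case
    error is only bounded when theta < 1/sqrt(3)."""
    return cell_size / dist < theta

theta = 1.0 / math.sqrt(3) - 1e-9   # just inside the analytically safe regime
accepted = bh_mac_accept(1.0, 2.0, theta)   # far cell: approximation accepted
rejected = bh_mac_accept(1.0, 1.0, theta)   # near cell: must be opened
```

Cells that fail the test are opened and their children tested recursively, which is how the tree walk trades accuracy against the N log N running time.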
Commutativity analysis: A new analysis framework for parallelizing compilers
 In Programming Language Design and Implementation (PLDI
, 1996
Abstract

Cited by 52 (8 self)
This paper presents a new analysis technique, commutativity analysis, for automatically parallelizing computations that manipulate dynamic, pointer-based data structures. Commutativity analysis views the computation as composed of operations on objects. It then analyzes the program at this granularity to discover when operations commute (i.e., generate the same final result regardless of the order in which they execute). If all of the operations required to perform a given computation commute, the compiler can automatically generate parallel code. We have implemented a prototype compilation system that uses commutativity analysis as its primary analysis framework. We have used this system to automatically parallelize two complete scientific computations: the Barnes-Hut N-body solver and the Water code. This paper presents performance results for the generated parallel code running on the Stanford DASH machine. These results provide encouraging evidence that commutativity analysis can serve as the basis for a successful parallelizing compiler.
A Parallel Adaptive Fast Multipole Method
 In Proceedings of Supercomputing 93
, 1993
Abstract

Cited by 42 (0 self)
We present parallel versions of a representative N-body application that uses Greengard and Rokhlin's adaptive Fast Multipole Method (FMM). While parallel implementations of the uniform FMM are straightforward and have been developed on different architectures, the adaptive version complicates the task of obtaining effective parallel performance owing to the non-uniform and dynamically changing nature of the problem domains to which it is applied. We propose and evaluate two techniques for providing load balancing and data locality, both of which take advantage of key insights into the method and its typical applications. Using the better of these techniques, we demonstrate 45-fold speedups on galactic simulations on a 48-processor Stanford DASH machine, a state-of-the-art shared address space multiprocessor, even for relatively small problems. We also show good speedups on a 2-ring Kendall Square Research KSR1. Finally, we summarize some key architectural implications of this importan...