Results 1–10 of 74
A Parallel Hashed Oct-Tree N-Body Algorithm
, 1993
"... We report on an efficient adaptive Nbody method which we have recently designed and implemented. The algorithm computes the forces on an arbitrary distribution of bodies in a time which scales as N log N with the particle number. The accuracy of the force calculations is analytically bounded, and c ..."
Abstract

Cited by 147 (11 self)
 Add to MetaCart
We report on an efficient adaptive N-body method which we have recently designed and implemented. The algorithm computes the forces on an arbitrary distribution of bodies in a time which scales as N log N with the particle number. The accuracy of the force calculations is analytically bounded, and can be adjusted via a user-defined parameter between a few percent relative accuracy, down to machine arithmetic accuracy. Instead of using pointers to indicate the topology of the tree, we identify each possible cell with a key. The mapping of keys into memory locations is achieved via a hash table. This allows the program to access data in an efficient manner across multiple processors. Performance of the parallel program is measured on the 512-processor Intel Touchstone Delta system. We also comment on a number of wide-ranging applications which can benefit from application of this type of algorithm.
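The key-based cell identification described above is commonly realized with bit-interleaved (Morton) keys; the sketch below assumes that scheme and is illustrative only, not the paper's exact encoding. A leading marker bit lets cells at different tree depths share one key space, and an ordinary hash table replaces child pointers.

```python
def morton_key(ix, iy, iz, bits=10):
    """Interleave the bits of three integer cell coordinates into one key.
    The leading 1 bit marks the key length, so cells at different tree
    depths map to distinct keys (assumed hashed-tree convention)."""
    key = 1  # placeholder/root bit
    for b in reversed(range(bits)):
        key = (key << 3) \
            | (((ix >> b) & 1) << 2) \
            | (((iy >> b) & 1) << 1) \
            | ((iz >> b) & 1)
    return key

# A hash table mapping keys to cell data stands in for tree pointers,
# which is what makes remote cells addressable across processors.
cells = {}
cells[morton_key(5, 3, 7)] = {"mass": 1.0, "com": (0.5, 0.3, 0.7)}
```

With one bit per axis, the eight children of the root get keys 8 through 15, one per octant.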
Working Sets, Cache Sizes, and Node Granularity Issues for Large-Scale Multiprocessors
 In Proceedings of the 20th Annual International Symposium on Computer Architecture
, 1993
"... The distribution of resources among processors, memory and caches is a crucial question faced by designers of largescale parallel machines. If a machine is to solve problems with a certain data set size, should it be built with a large number of processors each with a small amount of memory, or a s ..."
Abstract

Cited by 72 (4 self)
 Add to MetaCart
The distribution of resources among processors, memory, and caches is a crucial question faced by designers of large-scale parallel machines. If a machine is to solve problems with a certain data set size, should it be built with a large number of processors each with a small amount of memory, or a smaller number of processors each with a large amount of memory? How much cache memory should be provided per processor for cost-effectiveness? And how do these decisions change as larger problems are run on larger machines? In this paper, we explore the above questions based on the characteristics of five important classes of large-scale parallel scientific applications. We first show that all the applications have a hierarchy of well-defined per-processor working sets, whose size, performance impact, and scaling characteristics can help determine how large different levels of a multiprocessor's cache hierarchy should be. Then, we use these working sets together with certain other important characteristics of the applications, such as communication-to-computation ratios, concurrency, and load balancing behavior, to reflect upon the broader question of the granularity of processing nodes in high-performance multiprocessors.
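The working-set methodology above reduces to a simple check: a cache level is adequate when the relevant per-processor working set fits in it. The sketch below is a hypothetical illustration of that reasoning; the scaling exponent and the fit test are assumptions for illustration, not the paper's measured values.

```python
def per_processor_working_set(data_set_size, n_procs, exponent=1.0):
    """Hypothetical scaling sketch: a per-processor working set that
    grows with the processor's partition of the data, roughly
    (N/P)**exponent. Real exponents are application-specific."""
    return (data_set_size / n_procs) ** exponent

def cache_level_adequate(working_set_bytes, cache_bytes):
    # A working set "fits" when it is no larger than the cache level;
    # crossing this threshold is what shifts the miss-rate regime.
    return working_set_bytes <= cache_bytes
```

For example, a data set that overflows a cache on 2 processors may fit once it is partitioned across 8, which is one way node granularity and cache size trade off.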
Commutativity Analysis: A New Analysis Technique for Parallelizing Compilers
 ACM TRANSACTIONS ON PROGRAMMING LANGUAGES AND SYSTEMS
, 1997
"... This article presents a new analysis technique, commutativity analysis, for automatically parallelizing computations that manipulate dynamic, pointerbased data structures. Commutativity analysis views the computation as composed of operations on objects. It then analyzes the program at this granula ..."
Abstract

Cited by 71 (9 self)
 Add to MetaCart
This article presents a new analysis technique, commutativity analysis, for automatically parallelizing computations that manipulate dynamic, pointer-based data structures. Commutativity analysis views the computation as composed of operations on objects. It then analyzes the program at this granularity to discover when operations commute (i.e., generate the same final result regardless of the order in which they execute). If all of the operations required to perform a given computation commute, the compiler can automatically generate parallel code. We have implemented a prototype compilation system that uses commutativity analysis as its primary analysis technique.
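The commuting-operations idea can be made concrete with a toy example. The `Body` class and its `add_force` method below are hypothetical, not from the article; they show the property the analysis looks for: every execution order of the operations leaves the object in the same final state, so the operations may safely run in parallel.

```python
class Body:
    """Toy object whose only update operation commutes."""
    def __init__(self):
        self.force = 0.0

    def add_force(self, f):
        # Addition is commutative and associative (up to floating-point
        # rounding), so any interleaving of add_force calls on the same
        # object produces the same final force.
        self.force += f

# Two opposite execution orders yield identical final states, which is
# exactly the condition a commutativity-based compiler must prove.
a, b = Body(), Body()
a.add_force(1.5); a.add_force(2.5)
b.add_force(2.5); b.add_force(1.5)
```

An operation like `self.force = f` (overwrite rather than accumulate) would not commute, and would block automatic parallelization at this granularity.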
Application scheduling and processor allocation in multiprogrammed parallel processing systems
 Performance Evaluation
, 1994
"... ..."
Load Balancing and Data Locality in Adaptive Hierarchical N-body Methods: Barnes-Hut, Fast Multipole, and Radiosity
 Journal Of Parallel and Distributed Computing
, 1995
"... processes, are increasingly being used to solve largescale problems in a variety of scientific/engineering domains. Applications that use these methods are challenging to parallelize effectively, however, owing to their nonuniform, dynamically changing characteristics and their need for longrang ..."
Abstract

Cited by 62 (2 self)
 Add to MetaCart
... processes, are increasingly being used to solve large-scale problems in a variety of scientific/engineering domains. Applications that use these methods are challenging to parallelize effectively, however, owing to their non-uniform, dynamically changing characteristics and their need for long-range communication.
The Design, Implementation, and Evaluation of Jade
 ACM Transactions on Programming Languages and Systems
, 1998
"... this article we discuss the design goals and decisions that determined the final form of Jade and present an overview of the Jade implementation. We also present our experience using Jade to implement several complete scientific and engineering applications. We use this experience to evaluate how th ..."
Abstract

Cited by 62 (4 self)
 Add to MetaCart
In this article we discuss the design goals and decisions that determined the final form of Jade and present an overview of the Jade implementation. We also present our experience using Jade to implement several complete scientific and engineering applications. We use this experience to evaluate how the different Jade language features were used in practice and how well Jade as a whole supports the process of developing parallel applications. We find that the basic idea of preserving the serial semantics simplifies the program development process, and that the concept of using data access specifications to guide the parallelization offers significant advantages over more traditional control-based approaches. We also find that the Jade data model can interact poorly with concurrency patterns that write disjoint pieces of a single aggregate data structure, although this problem arises in only one of the applications. Categories and Subject Descriptors: D.1.3 [Programming Te...
A Portable Parallel Particle Program
 Computer Physics Communications
, 1995
"... We describe our implementation of the parallel hashed octtree (HOT) code, and in particular its application to neighbor finding in a smoothed particle hydrodynamics (SPH) code. We also review the error bounds on the multipole approximations involved in treecodes, and extend them to include general ..."
Abstract

Cited by 53 (7 self)
 Add to MetaCart
We describe our implementation of the parallel hashed oct-tree (HOT) code, and in particular its application to neighbor finding in a smoothed particle hydrodynamics (SPH) code. We also review the error bounds on the multipole approximations involved in treecodes, and extend them to include general cell-cell interactions. Performance of the program on a variety of problems (including gravity, SPH, vortex method, and panel method) is measured on several parallel and sequential machines.

1 Introduction

There are two strategies that can be applied in the quest for more knowledge from bigger and better particle simulations. One can use the brute-force approach: simple algorithms on bigger and faster machines (and bigger and faster now means massively parallel). To compute the gravitational force and potential for a single interaction takes 28 floating-point operations (here we count a division as 4 floating-point operations and a square root as 4 floating-point operations). A typical grav...
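The 28-flop figure quoted above can be checked against a minimal interaction kernel. The sketch below is illustrative (softening omitted, and exact operation-counting conventions vary); it applies the abstract's convention of counting a division or square root as 4 floating-point operations each.

```python
import math

def grav_interaction(dx, dy, dz, m):
    """One gravitational force/potential interaction on a unit-mass body
    from a source of mass m at separation (dx, dy, dz).
    Flop tally (div = 4, sqrt = 4, per the paper's convention):
    r2: 5, sqrt: 4, m/(r2*r): 1 mul + 4, fx/fy/fz: 3, -m/r: 4 (+ negate)."""
    r2 = dx * dx + dy * dy + dz * dz
    r = math.sqrt(r2)
    inv_r3 = m / (r2 * r)
    fx, fy, fz = dx * inv_r3, dy * inv_r3, dz * inv_r3
    pot = -m / r
    return fx, fy, fz, pot
```

At unit separation along x with m = 1, the kernel returns a unit force along x and potential −1, which is a quick sanity check on the algebra.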
Commutativity analysis: A new analysis framework for parallelizing compilers
 In Programming Language Design and Implementation (PLDI
, 1996
"... This paper presents a new analysis technique, commutativity analysis, for automatically parallelizing computations that manipulate dynamic, pointerbased data structures. Commutativity analysis views the computation as composed of operations on objects. It then analyzes the program at this granulari ..."
Abstract

Cited by 48 (8 self)
 Add to MetaCart
This paper presents a new analysis technique, commutativity analysis, for automatically parallelizing computations that manipulate dynamic, pointer-based data structures. Commutativity analysis views the computation as composed of operations on objects. It then analyzes the program at this granularity to discover when operations commute (i.e., generate the same final result regardless of the order in which they execute). If all of the operations required to perform a given computation commute, the compiler can automatically generate parallel code. We have implemented a prototype compilation system that uses commutativity analysis as its primary analysis framework. We have used this system to automatically parallelize two complete scientific computations: the Barnes-Hut N-body solver and the Water code. This paper presents performance results for the generated parallel code running on the Stanford DASH machine. These results provide encouraging evidence that commutativity analysis can serve as the basis for a successful parallelizing compiler.
Skeletons from the Treecode Closet
 J. Comp. Phys
, 1994
"... We consider treecodes (Nbody programs which use a tree data structure) from the standpoint of their worstcase behavior. That is, we derive upper bounds on the largest possible errors that are introduced into a calculation by use of various multipole acceptability criteria (MAC). We find that the ..."
Abstract

Cited by 41 (10 self)
 Add to MetaCart
We consider treecodes (N-body programs which use a tree data structure) from the standpoint of their worst-case behavior. That is, we derive upper bounds on the largest possible errors that are introduced into a calculation by use of various multipole acceptability criteria (MAC). We find that the conventional Barnes-Hut MAC can introduce potentially unbounded errors unless θ < 1/√3, and that this behavior, while rare, is demonstrable in astrophysically reasonable examples. We consider two other MACs closely related to the BH MAC. While they don't admit the same unbounded errors, they nevertheless require extraordinary amounts of CPU time to guarantee modest levels of accuracy. We derive new error bounds based on some additional, easily computed moments of the mass distribution. These error bounds form the basis for four new MACs which can be used to limit the absolute or relative error introduced by each multipole evaluation, or, with the introduction of some additional data struc...
The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors
, 1995
"... Designers of distributed shared memory (DSM) multiprocessors are moving toward the use of commodity parts, not only in the processor and memory subsytem but also in the communication architecture. While the desire to use commodity parts and not perturb the underlying uniprocessor node can compromise ..."
Abstract

Cited by 37 (3 self)
 Add to MetaCart
Designers of distributed shared memory (DSM) multiprocessors are moving toward the use of commodity parts, not only in the processor and memory subsystem but also in the communication architecture. While the desire to use commodity parts and not perturb the underlying uniprocessor node can compromise the efficiency of the communication architecture, the impact on the end performance of applications is unclear. In this paper we study this performance impact through detailed simulation and analytical modeling, using a range of important applications and computational kernels. We characterize the communication architectures of DSM machines by four parameters, similar to those in the LogP model. The l (latency) and o (occupancy of the communication controller in this model) parameters are the keys to performance in these machines, with the g (gap, or node-to-network bandwidth) parameter not being a bottleneck in recent and upcoming machines. Conventional wisdom is that latency is the domina...
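The latency/occupancy/gap parameters above compose into a simple cost model. The formula below is a hedged LogP-style sketch, an assumed form for illustration rather than the paper's actual model: each message pays controller occupancy o, back-to-back messages are separated by at least max(o, g), and the last message additionally pays network latency l.

```python
def send_time(n_msgs, l, o, g):
    """LogP-style sketch (illustrative): time for one processor to issue
    n_msgs messages and have the last arrive.
    - o: occupancy the controller spends per message
    - g: gap enforced by node-to-network bandwidth
    - l: network latency paid once, by the last message
    Whichever of o and g is larger sets the per-message spacing, which
    is why occupancy can dominate even when raw bandwidth (g) is ample."""
    return (n_msgs - 1) * max(o, g) + o + l
```

With o = 2, g = 3, l = 10 (arbitrary units), one message costs 12 and three cost 18; shrinking l helps the single-message case, while shrinking o or g helps the streaming case.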