A Parallel Hashed Oct-Tree N-Body Algorithm
, 1993
Cited by 138 (11 self)
We report on an efficient adaptive N-body method which we have recently designed and implemented. The algorithm computes the forces on an arbitrary distribution of bodies in a time which scales as N log N with the particle number. The accuracy of the force calculations is analytically bounded, and can be adjusted via a user defined parameter between a few percent relative accuracy, down to machine arithmetic accuracy. Instead of using pointers to indicate the topology of the tree, we identify each possible cell with a key. The mapping of keys into memory locations is achieved via a hash table. This allows the program to access data in an efficient manner across multiple processors. Performance of the parallel program is measured on the 512 processor Intel Touchstone Delta system. We also comment on a number of wide-ranging applications which can benefit from application of this type of algorithm.
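The key-and-hash-table idea in this abstract can be illustrated with a minimal sketch. The specific key layout below (a leading placeholder bit followed by three interleaved coordinate bits per tree level) and the names `cell_key`, `parent_key`, and `child_keys` are assumptions made for illustration; the abstract only states that cells are identified by keys and located through a hash table. A Python `dict` stands in for the hash table.

```python
def cell_key(ix, iy, iz, level):
    """Build a key for the cell at integer coordinates (ix, iy, iz) on `level`.

    Assumed layout: a leading 1 bit, then one 3-bit octant index per level.
    The leading bit keeps keys from different levels distinct.
    """
    key = 1  # the root cell
    for bit in range(level - 1, -1, -1):
        octant = (((ix >> bit) & 1) << 2) | (((iy >> bit) & 1) << 1) | ((iz >> bit) & 1)
        key = (key << 3) | octant
    return key


def parent_key(key):
    """Drop the last octant index to obtain the parent cell's key."""
    return key >> 3


def child_keys(key):
    """Append each of the eight octant indices to obtain the children's keys."""
    return [(key << 3) | octant for octant in range(8)]


# The "tree" is a hash table from key to cell data; no pointers are stored.
cells = {cell_key(0, 0, 0, 0): {"mass": 0.0}}
leaf = cell_key(1, 0, 1, level=1)
cells[leaf] = {"mass": 2.5}
cells[parent_key(leaf)]["mass"] += cells[leaf]["mass"]
```

Because a parent or child key is computed arithmetically from a cell's own key, any process that knows a key can ask for the corresponding cell without following pointers, which is the property the abstract credits for efficient access across processors.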
Working Sets, Cache Sizes, and Node Granularity Issues for Large-Scale Multiprocessors
- In Proceedings of the 20th Annual International Symposium on Computer Architecture
, 1993
Cited by 71 (4 self)
The distribution of resources among processors, memory and caches is a crucial question faced by designers of large-scale parallel machines. If a machine is to solve problems with a certain data set size, should it be built with a large number of processors each with a small amount of memory, or a smaller number of processors each with a large amount of memory? How much cache memory should be provided per processor for cost-effectiveness? And how do these decisions change as larger problems are run on larger machines? In this paper, we explore the above questions based on the characteristics of five important classes of large-scale parallel scientific applications. We first show that all the applications have a hierarchy of well-defined per-processor working sets, whose size, performance impact and scaling characteristics can help determine how large different levels of a multiprocessor's cache hierarchy should be. Then, we use these working sets together with certain other important characteristics of the applications such as communication to computation ratios, concurrency, and load balancing behavior, to reflect upon the broader question of the granularity of processing nodes in high-performance multiprocessors.
Commutativity Analysis: A New Analysis Technique for Parallelizing Compilers
- ACM Transactions on Programming Languages and Systems
, 1997
Cited by 61 (7 self)
This article presents a new analysis technique, commutativity analysis, for automatically parallelizing computations that manipulate dynamic, pointer-based data structures. Commutativity analysis views the computation as composed of operations on objects. It then analyzes the program at this granularity to discover when operations commute (i.e., generate the same final result regardless of the order in which they execute). If all of the operations required to perform a given computation commute, the compiler can automatically generate parallel code. We have implemented a prototype compilation system that uses commutativity analysis as its primary analysis technique
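The notion of commuting operations can be made concrete with a small runtime check. This is only an illustration of the definition given in the abstract (same final result regardless of execution order); the compiler described there reasons about commutativity statically, which this sketch does not attempt. The `Accumulator` class and the `commute` helper are hypothetical names introduced here.

```python
import copy


class Accumulator:
    """Toy object with one commuting and one non-commuting update."""

    def __init__(self):
        self.total = 0
        self.log = []

    def add(self, x):
        # Addition is order-independent, so two add() calls commute.
        self.total += x

    def record(self, x):
        # The list remembers call order, so two record() calls do not commute.
        self.log.append(x)


def commute(obj, op_a, op_b):
    """Return True if applying op_a then op_b leaves the same state as op_b then op_a.

    This checks one starting state only; it is a test, not a proof.
    """
    first = copy.deepcopy(obj)
    second = copy.deepcopy(obj)
    op_a(first)
    op_b(first)
    op_b(second)
    op_a(second)
    return vars(first) == vars(second)


adds_commute = commute(Accumulator(), lambda o: o.add(2), lambda o: o.add(3))
records_commute = commute(Accumulator(), lambda o: o.record(2), lambda o: o.record(3))
```

Here `adds_commute` is `True` and `records_commute` is `False`, so only the `add` operations could safely run in either order, which is the condition the abstract names for generating parallel code.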
Application scheduling and processor allocation in multiprogrammed parallel processing systems
- Performance Evaluation
, 1994
A Portable Parallel Particle Program
- Computer Physics Communications
, 1995
Cited by 49 (7 self)
We describe our implementation of the parallel hashed oct-tree (HOT) code, and in particular its application to neighbor finding in a smoothed particle hydrodynamics (SPH) code. We also review the error bounds on the multipole approximations involved in treecodes, and extend them to include general cell-cell interactions. Performance of the program on a variety of problems (including gravity, SPH, vortex method and panel method) is measured on several parallel and sequential machines.
1 Introduction
There are two strategies that can be applied in the quest for more knowledge from bigger and better particle simulations. One can use the brute force approach; simple algorithms on bigger and faster machines (and bigger and faster now means massively parallel). To compute the gravitational force and potential for a single interaction takes 28 floating point operations (here we count a division as 4 floating point operations and a square root as 4 floating point operations). A typical grav...
Load Balancing and Data Locality in Adaptive Hierarchical N-body Methods: Barnes-Hut, Fast Multipole, and Radiosity
- Journal of Parallel and Distributed Computing
, 1995
Cited by 49 (2 self)
... processes, are increasingly being used to solve large-scale problems in a variety of scientific/engineering domains. Applications that use these methods are challenging to parallelize effectively, however, owing to their nonuniform, dynamically changing characteristics and their need for long-range communication.
The Design, Implementation, and Evaluation of Jade
- ACM Transactions on Programming Languages and Systems
, 1998
Cited by 47 (2 self)
In this article we discuss the design goals and decisions that determined the final form of Jade and present an overview of the Jade implementation. We also present our experience using Jade to implement several complete scientific and engineering applications. We use this experience to evaluate how the different Jade language features were used in practice and how well Jade as a whole supports the process of developing parallel applications. We find that the basic idea of preserving the serial semantics simplifies the program development process, and that the concept of using data access specifications to guide the parallelization offers significant advantages over more traditional control-based approaches. We also find that the Jade data model can interact poorly with concurrency patterns that write disjoint pieces of a single aggregate data structure, although this problem arises in only one of the applications. Categories and Subject Descriptors: D.1.3 [Programming Te
A Parallel Adaptive Fast Multipole Method
- In Proceedings of Supercomputing 93
, 1993
Cited by 35 (0 self)
We present parallel versions of a representative N-body application that uses Greengard and Rokhlin's adaptive Fast Multipole Method (FMM). While parallel implementations of the uniform FMM are straightforward and have been developed on different architectures, the adaptive version complicates the task of obtaining effective parallel performance owing to the nonuniform and dynamically changing nature of the problem domains to which it is applied. We propose and evaluate two techniques for providing load balancing and data locality, both of which take advantage of key insights into the method and its typical applications. Using the better of these techniques, we demonstrate 45-fold speedups on galactic simulations on a 48-processor Stanford DASH machine, a state-of-the-art shared address space multiprocessor, even for relatively small problems. We also show good speedups on a 2-ring Kendall Square Research KSR-1. Finally, we summarize some key architectural implications of this importan...
The Effects of Latency, Occupancy, and Bandwidth in Distributed Shared Memory Multiprocessors
, 1995
Cited by 34 (3 self)
Designers of distributed shared memory (DSM) multiprocessors are moving toward the use of commodity parts, not only in the processor and memory subsystem but also in the communication architecture. While the desire to use commodity parts and not perturb the underlying uniprocessor node can compromise the efficiency of the communication architecture, the impact on the end performance of applications is unclear. In this paper we study this performance impact through detailed simulation and analytical modeling, using a range of important applications and computational kernels. We characterize the communication architectures of DSM machines by four parameters, similar to those in the LogP model. The l (latency) and o (occupancy of the communication controller in this model) parameters are the keys to performance in these machines, with the g (gap or node-to-network bandwidth) parameter not being a bottleneck in recent and upcoming machines. Conventional wisdom is that latency is the domina...
Skeletons from the Treecode Closet
- J. Comp. Phys
, 1994
Cited by 32 (10 self)
We consider treecodes (N-body programs which use a tree data structure) from the standpoint of their worst-case behavior. That is, we derive upper bounds on the largest possible errors that are introduced into a calculation by use of various multipole acceptability criteria (MAC). We find that the conventional Barnes-Hut MAC can introduce potentially unbounded errors unless $\theta < 1/\sqrt{3}$, and that this behavior, while rare, is demonstrable in astrophysically reasonable examples. We consider two other MACs closely related to the BH MAC. While they don't admit the same unbounded errors, they nevertheless require extraordinary amounts of CPU time to guarantee modest levels of accuracy. We derive new error bounds based on some additional, easily computed moments of the mass distribution. These error bounds form the basis for four new MACs which can be used to limit the absolute or relative error introduced by each multipole evaluation, or, with the introduction of some additional data struc...
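A small numerical sketch may help show why the opening angle matters for the Barnes-Hut MAC. It assumes the commonly used form of the criterion, in which a cell of side length s is accepted for a body at distance d from the cell's center of mass when s/d is less than theta; the abstract does not spell this form out, and the function name `bh_mac_accepts` is hypothetical. The example places the center of mass at one corner of a unit cell and the body near the opposite corner, inside the same cell.

```python
import math


def bh_mac_accepts(cell_size, center_of_mass, body, theta):
    """Assumed Barnes-Hut test: accept the cell's multipole if s / d < theta."""
    d = math.dist(center_of_mass, body)
    return d > 0 and cell_size / d < theta


# Unit cell spanning [0, 1]^3 with all of its mass concentrated at one corner.
corner_com = (0.0, 0.0, 0.0)
# A body that lies inside the same cell, close to the opposite corner.
inside_body = (0.9, 0.9, 0.9)

accepted_loose = bh_mac_accepts(1.0, corner_com, inside_body, theta=1.0)
accepted_tight = bh_mac_accepts(1.0, corner_com, inside_body, theta=0.5)
```

With `theta=1.0` the test accepts a cell that contains the body itself, the kind of configuration in which a multipole approximation can fail badly. With `theta=0.5` the body would need to be more than 2 cell widths from the center of mass, which exceeds the cell diagonal of sqrt(3), so no point inside the cell can pass; this matches the 1/sqrt(3) threshold quoted in the abstract.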

