Efficient Support for Irregular Applications on Distributed-Memory Machines
1995
Cited by 81 (12 self)
Irregular computation problems underlie many important scientific applications. Although these problems are computationally expensive, and so would seem appropriate for parallel machines, their irregular and unpredictable run-time behavior makes this type of parallel program difficult to write and adversely affects run-time performance. This paper explores three issues -- partitioning, mutual exclusion, and data transfer -- crucial to the efficient execution of irregular problems on distributed-memory machines. Unlike previous work, we studied the same programs running in three alternative systems on the same hardware base (a Thinking Machines CM-5): the CHAOS irregular application library, Transparent Shared Memory (TSM), and eXtensible Shared Memory (XSM). CHAOS and XSM performed equivalently for all three applications. Both systems were faster than TSM, by margins ranging from 13% to 991%.
Run-time and compile-time support for adaptive irregular problems
SUPERCOMPUTING’94, 1994
Cited by 49 (9 self)
In adaptive irregular problems the data arrays are accessed via indirection arrays, and data access patterns change during computation. Implementing such problems on distributed-memory machines requires support for dynamic data partitioning, efficient preprocessing, and fast data migration. This research presents efficient runtime primitives for such problems. This new set of primitives is part of the CHAOS library. It subsumes the previous PARTI library, which targeted only static irregular problems. To demonstrate the efficacy of the runtime support, two real adaptive irregular applications have been parallelized using CHAOS primitives: a molecular dynamics code (CHARMM) and a particle-in-cell code (DSMC). The paper also proposes extensions to Fortran D that allow compilers to generate more efficient code for adaptive problems. These language extensions have been implemented in the Syracuse Fortran 90D/HPF prototype compiler. The performance of the compiler-parallelized codes is compared with the hand-parallelized versions.
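As an illustration of why indirection arrays force a preprocessing step, the following is a minimal sketch and not the CHAOS or PARTI API. It assumes a simple block distribution of the data array; the function name `off_processor_refs` and all of its parameters are hypothetical. Given the indirection array used by one processor, it lists the global indices that live on other processors and would therefore have to be fetched before the loop runs.

```python
def off_processor_refs(indirection, n_data, n_procs, my_rank):
    """Return the sorted global indices that `my_rank` must fetch remotely.

    Assumes (hypothetically) a block distribution: processor p owns the
    global indices [p * block, (p + 1) * block).
    """
    block = -(-n_data // n_procs)  # ceiling division: elements per processor
    # A reference is remote when the owner of the index is not this processor.
    return sorted({g for g in indirection if g // block != my_rank})


# x[ia[i]] is only known at run time, so the remote set must be computed then.
ia = [0, 5, 2, 7, 5, 1]
print(off_processor_refs(ia, n_data=8, n_procs=2, my_rank=0))
```

In an adaptive problem the indirection array changes during the computation, so a scan of this kind would have to be repeated whenever the access pattern changes.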
Study of Scalable Declustering Algorithms for Parallel Grid Files
In Proceedings of the Tenth International Parallel Processing Symposium, 1996
Cited by 24 (9 self)
Efficient storage and retrieval of large multidimensional datasets is an important concern for large-scale scientific computations such as long-running time-dependent simulations which periodically generate snapshots of the state. The main challenge for efficiently handling such datasets is to minimize response time for multidimensional range queries. The grid file is one of the well-known access methods for multidimensional and spatial data. We investigate effective and scalable declustering techniques for grid files with the primary goal of minimizing response time and the secondary goal of maximizing the fairness of data distribution. The main contributions of this paper are (1) analytic and experimental evaluation of existing index-based declustering techniques and their extensions for grid files, and (2) development of a proximity-based declustering algorithm called minimax which is experimentally shown to scale and to consistently achieve better response time compared to availabl...
Selective, accurate, and timely self-invalidation using last-touch prediction
In Proceedings of the 27th Annual International Symposium on Computer Architecture, 2000
Cited by 23 (5 self)
Communication in cache-coherent distributed shared memory (DSM) often requires invalidating (or writing back) cached copies of a memory block, incurring high overheads. This paper proposes Last-Touch Predictors (LTPs) that learn and predict the “last touch” to a memory block by one processor before the block is accessed and subsequently invalidated by another. By predicting a last-touch and (self-)invalidating the block in advance, an LTP hides the invalidation time, significantly reducing the coherence overhead. The key behind accurate last-touch prediction is trace-based correlation, associating a last-touch with the sequence of instructions (i.e., a trace) touching the block from a coherence miss until the block is invalidated. Correlating instructions enables an LTP to identify a last-touch to a memory block uniquely throughout an application’s execution. In this paper, we use results from running shared-memory applications on a simulated DSM to evaluate LTPs. The results indicate that: (1) our base case LTP design, maintaining trace signatures on a per-block basis, substantially improves prediction accuracy over previous self-invalidation schemes to an average of 79%; (2) our alternative LTP design, maintaining a global trace signature table, reduces storage overhead but only achieves an average accuracy of 58%; (3) last-touch prediction based on a single instruction only achieves an average accuracy of 41% due to instruction reuse within and across computation; and (4) LTP enables selective, accurate, and timely self-invalidation in DSM, speeding up program execution on average by 11%.
Compiler and Run-Time Support for Adaptive Load Balancing in Software Distributed Shared Memory Systems
In Proceedings of the Fourth Workshop on Languages, Compilers, and Run-Time Systems for Parallel Computing, 1998
Cited by 16 (3 self)
Networks of workstations offer inexpensive and highly available high-performance computing environments. A critical issue for achieving good performance in any parallel system is load balancing, even more so in workstation environments where the machines might be shared among many users. In this paper, we present and evaluate a system that combines compiler and run-time support to achieve load balancing dynamically on software distributed shared memory programs. We use information provided by the compiler to help the run-time system distribute the work of the parallel loops, not only according to the relative power of the processors, but also in such a way as to minimize communication and page sharing. 1 Introduction. Clusters of workstations, whether uniprocessors or symmetric multiprocessors (SMPs), offer cost-effective and highly available parallel computing environments. Software distributed shared memory (SDSM) provides a shared memory abstraction on a distributed memory machine...
An Adaptive Approach to Data Placement
In Proceedings of the 10th International Symposium on Parallel Processing, 1996
Cited by 15 (7 self)
Programming distributed-memory machines requires careful placement of data to balance the computational load among the nodes and minimize excess data movement between the nodes. Most current approaches to data placement require the programmer or compiler to place data initially and then possibly to move it explicitly during a computation. This paper describes a new, adaptive approach. It is implemented in the Adapt system, which takes an initial data placement, efficiently monitors how well it performs, and changes the placement whenever the monitoring indicates that a different placement would perform better. Adapt frees the programmer from having to specify data placements, and it can use run-time information to find better placements than compilers. Moreover, Adapt automatically supports a "variable block" placement, which is especially useful for applications with nearest-neighbor communication but an imbalanced workload. For applications in which the best data placement varies dyna...
A General Interprocedural Framework for Placement of Split-phase Large Latency Operations
Cited by 11 (3 self)
Overlapping split-phase large latency operations with computations is a standard technique for improving performance on modern architectures. In this paper, we present a general interprocedural technique for overlapping such accesses with computation. We have developed an Interprocedural Balanced Code Placement (IBCP) framework, which performs analysis on arbitrary recursive procedures and arbitrary control flow and replaces synchronous operations with a balanced pair of asynchronous operations. We have evaluated this scheme in the context of overlapping I/O operations with computation. We demonstrate how this analysis is useful for applications which perform frequent and large accesses to disks, including applications which snapshot or checkpoint their computations or out-of-core applications.
Object-Oriented Runtime Support for Complex Distributed Data Structures
1995
Cited by 10 (1 self)
Object-oriented applications utilize language constructs such as pointers to synthesize dynamic complex data structures, such as linked lists, trees and graphs, with elements consisting of complex composite data types. Traditionally, however, applications executed on distributed memory parallel architectures in single-program multiple-data (SPMD) mode use distributed (multi-dimensional) data arrays. Good performance has been achieved by applying runtime techniques to such applications executing in a loosely synchronous manner. Existing runtime systems that rely solely on global indices are not always applicable to object-oriented applications, since no global names or indices are imposed upon dynamic complex data structures linked by pointers. We describe a portable object-oriented runtime library that has been designed to support applications that use dynamic distributed data structures, including both arrays and pointer-based data structures. In particular, CHAOS++ deals with complex ...
CHAOS++: A Runtime Library for Supporting Distributed Dynamic Data Structures
GREGORY V. WILSON, EDITOR, PARALLEL PROGRAMMING USING C, 1995
Cited by 7 (2 self)
Traditionally, applications executed on distributed memory architectures in single-program multiple-data (SPMD) mode use distributed (multi-dimensional) data arrays. Good performance has been achieved by applying runtime techniques to such applications executing in a loosely synchronous manner. However, many applications utilize language constructs such as pointers to synthesize dynamic complex data structures, such as linked lists, trees and graphs, with elements consisting of complex composite data types. Existing runtime systems that solely rely on global indices cannot be used for these applications, as no global names or indices are imposed upon the elements of these data structures. CHAOS++ is a portable object-oriented runtime library that supports applications using dynamic distributed data structures, including both arrays and pointer-based data structures. In particular, CHAOS++ deals with complex data types and pointer-based data structures by providing mobile objects and gl...
Parallel DSMC Solution of Three-Dimensional Flow Over a Finite Flat Plate
1994
Cited by 5 (1 self)
This paper describes a parallel implementation of the direct simulation Monte Carlo (DSMC) method. Runtime library support is used for scheduling and execution of communication between nodes, and domain decomposition is performed dynamically to maintain a good load balance. Performance tests are conducted using the code to evaluate various remapping and remapping-interval policies, and it is shown that a one-dimensional chain-partitioning method works best for the problems considered. The parallel code is then used to simulate the Mach 20 nitrogen flow over a finite-thickness flat plate. It is shown that the parallel algorithm produces results which compare well with experimental data. Moreover, it yields significantly faster execution times than the scalar code, as well as very good load-balance characteristics.
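To make the idea of one-dimensional chain partitioning concrete, here is a minimal sketch. It is not the paper's algorithm: it assumes a simple prefix-sum heuristic that cuts a chain of per-cell loads into contiguous pieces of roughly equal total load, and the function name `chain_partition` and its parameters are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def chain_partition(loads, n_parts):
    """Split a 1-D chain of cell loads into n_parts contiguous ranges.

    Returns a list of half-open (start, end) index ranges. The k-th cut is
    placed just after the first cell at which the running load reaches
    k/n_parts of the total load (a heuristic assumed for this sketch).
    """
    prefix = list(accumulate(loads))  # running total of load along the chain
    total = prefix[-1]
    cuts = [0]
    for k in range(1, n_parts):
        target = total * k / n_parts
        cut = bisect_left(prefix, target) + 1
        # Keep cuts non-decreasing and inside the chain.
        cuts.append(min(max(cut, cuts[-1]), len(loads)))
    cuts.append(len(loads))
    return [(cuts[i], cuts[i + 1]) for i in range(n_parts)]


# Six cells with uneven particle counts, split between two processors.
print(chain_partition([4, 1, 1, 1, 1, 4], 2))
```

Because each processor receives a contiguous range, remapping after the load changes only moves cells across neighboring boundaries in the chain.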

