Communication Optimizations for Irregular Scientific Computations on Distributed Memory Architectures
Journal of Parallel and Distributed Computing, 1993
Cited by 134 (17 self)
This paper describes a number of optimizations that can be used to support the efficient execution of irregular problems on distributed memory parallel machines. These primitives (1) coordinate interprocessor data movement, (2) manage the storage of, and access to, copies of off-processor data, (3) minimize interprocessor communication requirements and (4) support a shared name space. We present a detailed performance and scalability analysis of the communication primitives. This performance and scalability analysis is carried out using a workload generator, kernels from real applications and a large unstructured adaptive application (the molecular dynamics code CHARMM).
1 Introduction
Over the past few years we have developed a methodology to produce efficient distributed memory code for sparse and unstructured problems in which array accesses are made through a level of indirection. In such problems the dependency structure is determined by variable values known only at runtime. In...
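The access pattern described above (array references made through an indirection array whose values are known only at runtime) can be illustrated with a small single-process sketch. It assumes a simple block distribution and splits the work into a preprocessing pass and a loop pass. The names `owner`, `inspect`, and `execute` are hypothetical and are not taken from the paper or from any library; the sketch only shows how duplicate off-processor references can be collapsed into one request each and how references can be translated to local buffer slots.

```python
def owner(global_index, block_size):
    """Owning processor under a simple block distribution (an assumption)."""
    return global_index // block_size


def inspect(indirection, my_rank, block_size):
    """Preprocessing pass: scan the indirection array once and build a schedule.

    Returns (schedule, local_refs):
      schedule   maps each remote rank to the sorted global indices to fetch,
                 with duplicates removed so each element is requested once;
      local_refs rewrites every reference as an index into a local array laid
                 out as [owned elements | ghost buffer].
    """
    needed = sorted({g for g in indirection if owner(g, block_size) != my_rank})
    ghost_slot = {g: block_size + i for i, g in enumerate(needed)}

    schedule = {}
    for g in needed:
        schedule.setdefault(owner(g, block_size), []).append(g)

    local_refs = [
        g - my_rank * block_size if owner(g, block_size) == my_rank else ghost_slot[g]
        for g in indirection
    ]
    return schedule, local_refs


def execute(owned, ghosts, local_refs):
    """Loop pass: run the loop body (here a sum) using only local indices."""
    local = owned + ghosts
    return sum(local[i] for i in local_refs)


# Rank 1 owns global elements 4..7; x[g] == 10 * g for illustration.
block_size = 4
indirection = [5, 2, 9, 2, 6, 9]
schedule, local_refs = inspect(indirection, my_rank=1, block_size=block_size)
owned = [40, 50, 60, 70]
ghosts = [20, 90]  # values for global indices 2 and 9, in ghost-slot order
total = execute(owned, ghosts, local_refs)
```

In a real distributed run the `ghosts` list would be filled by a gather driven by `schedule`; here it is written out by hand so the sketch stays runnable in one process.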
Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading
1996
Cited by 91 (6 self)
Attractive inter-residue contact energies for proteins have been re-evaluated with the same assumptions and approximations used originally by us in 1985, but with a significantly larger set of protein crystal structures. An additional repulsive packing energy term, operative at higher densities to prevent overpacking, has also been estimated for all 20 amino acids as a function of the number of contacting residues, based on their observed distributions. The two terms of opposite sign are intended to be used together to provide an estimate of the overall energies of inter-residue interactions in simplified proteins without atomic details. To overcome the problem of how to utilize the many homologous proteins in the Protein Data Bank, a new scheme has been devised to assign different weights to each protein, based on similarities among amino acid sequences. A total of 1168 protein structures containing 1661 subunit sequences are actually used here. After the sequence weights have been applied, these correspond to an effective number of residue–residue contacts of 113,914, or about six
Improving Memory Hierarchy Performance for Irregular Applications Using Data and Computation Reorderings
International Journal of Parallel Programming, 2001
Cited by 83 (2 self)
The performance of irregular applications on modern computer systems is hurt by the wide gap between CPU and memory speeds because these applications typically underutilize multi-level memory hierarchies, which help hide this gap. This paper investigates using data and computation reorderings to improve memory hierarchy utilization for irregular applications. We evaluate the impact of reordering on data reuse at different levels in the memory hierarchy. We focus on coordinated data and computation reordering based on space-filling curves and we introduce a new architecture-independent multi-level blocking strategy for irregular applications. For two particle codes we studied, the most effective reorderings reduced overall execution time by a factor of two and four, respectively. Preliminary experience with a scatter benchmark derived from a large unstructured mesh application showed that careful data and computation ordering reduced primary cache misses by a factor of two compared to a random ordering.
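A minimal sketch of coordinated data and computation reordering along a space-filling curve follows. It assumes a Morton (Z-order) curve over integer 2D cell coordinates, which is one common choice; the abstract does not say which curve the authors use, and the function names here are hypothetical. The data array is permuted into curve order, and the interaction list is renumbered and sorted so that the loop visits memory in the same order.

```python
def interleave_bits(x, y, bits=16):
    """Morton (Z-order) key: interleave the bits of two non-negative integers."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def morton_order(cells):
    """Permutation that sorts integer (x, y) cell coordinates along the Z-curve."""
    return sorted(range(len(cells)), key=lambda i: interleave_bits(*cells[i]))


def reorder(cells, pairs):
    """Permute the data, then renumber and sort the interaction list so the
    computation walks memory in the new order."""
    perm = morton_order(cells)  # new position -> old index
    new_index = {old: new for new, old in enumerate(perm)}
    new_cells = [cells[old] for old in perm]
    new_pairs = sorted((new_index[a], new_index[b]) for a, b in pairs)
    return new_cells, new_pairs


cells = [(3, 3), (0, 0), (2, 1), (1, 0), (0, 1)]
pairs = [(0, 2), (1, 3), (4, 1)]
new_cells, new_pairs = reorder(cells, pairs)
```

Sorting the renumbered pairs is the "computation" half of the reordering: without it, the data would be laid out well but the loop would still visit it in the original order.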
Efficient Support for Irregular Applications on Distributed-Memory Machines
1995
Cited by 81 (12 self)
Irregular computation problems underlie many important scientific applications. Although these problems are computationally expensive, and so would seem appropriate for parallel machines, their irregular and unpredictable run-time behavior makes this type of parallel program difficult to write and adversely affects run-time performance. This paper explores three issues -- partitioning, mutual exclusion, and data transfer -- crucial to the efficient execution of irregular problems on distributed-memory machines. Unlike previous work, we studied the same programs running in three alternative systems on the same hardware base (a Thinking Machines CM-5): the CHAOS irregular application library, Transparent Shared Memory (TSM), and eXtensible Shared Memory (XSM). CHAOS and XSM performed equivalently for all three applications. Both systems were somewhat (13%) to significantly faster (991%) than TSM.
An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction
J. Mol. Biol., 1998
Cited by 76 (15 self)
Any algorithm that attempts to predict protein structure requires a discriminatory function that can distinguish between correct and incorrect conformations. These discriminatory functions can be
Runtime support and compilation methods for user-specified irregular data distributions
IEEE Transactions on Parallel and Distributed Systems, 1995
Cited by 55 (11 self)
This paper describes two new ideas by which a High Performance Fortran compiler can deal with irregular computations effectively. The first mechanism invokes a user specified mapping procedure via a set of proposed compiler directives. The directives allow use of program arrays to describe graph connectivity, spatial location of array elements, and computational load. The second mechanism is a conservative method for compiling irregular loops in which dependence arises only due to reduction operations. This mechanism in many cases enables a compiler to recognize that it is possible to reuse previously computed information from inspectors (e.g., communication schedules, loop iteration partitions, and information that associates off-processor data copies with on-processor buffer locations). This paper also presents performance results for these mechanisms from a Fortran 90D compiler implementation.
Using Prediction to Accelerate Coherence Protocols
1998
Cited by 52 (4 self)
Most large shared-memory multiprocessors use directory protocols to keep per-processor caches coherent. Some memory references in such systems, however, suffer long latencies for misses to remotely cached blocks. To ameliorate this latency, researchers have augmented standard coherence protocols with optimizations for specific sharing patterns, such as read-modify-write, producer-consumer, and migratory sharing. This paper seeks to replace these directed solutions with general prediction logic that monitors coherence activity and triggers appropriate coherence actions. This paper takes the first step toward using general prediction to accelerate coherence protocols by developing and evaluating the Cosmos coherence message predictor. Cosmos predicts the source and type of the next coherence message for a cache block using logic that is an extension of Yeh and Patt's two-level PAp branch predictor. For five scientific applications running on 16 processors, Cosmos has prediction accuracie...
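The two-level structure mentioned in this abstract can be sketched in software as follows. This is an illustrative model only, not the Cosmos design: it assumes a per-block history of the last `depth` messages (first level) that indexes a per-block pattern table remembering which message followed that history last time (second level). The class name, the `depth` parameter, and the message type strings are all hypothetical.

```python
from collections import defaultdict, deque


class MessagePredictor:
    """Two-level predictor over (sender, message_type) tuples, per cache block."""

    def __init__(self, depth=2):
        self.depth = depth
        # Level 1: recent message history for each block address.
        self.history = defaultdict(lambda: deque(maxlen=depth))
        # Level 2: per-block table mapping a history pattern to the message
        # that followed it last time.
        self.patterns = defaultdict(dict)

    def predict(self, block):
        """Return the predicted next message, or None if the pattern is new."""
        return self.patterns[block].get(tuple(self.history[block]))

    def observe(self, block, message):
        """Record the message that actually arrived and update both levels."""
        hist = self.history[block]
        if len(hist) == self.depth:
            self.patterns[block][tuple(hist)] = message
        hist.append(message)


# A repeating three-message pattern on one block (made-up message names).
trace = [("P1", "get_exclusive"), ("P2", "get_shared"), ("P1", "upgrade")] * 3
predictor = MessagePredictor(depth=2)
hits = 0
for msg in trace:
    if predictor.predict(0x40) == msg:
        hits += 1
    predictor.observe(0x40, msg)
```

In this toy trace the predictor misses during the first pass plus the two messages needed to fill the history, then predicts every later message correctly, which is the behavior a pattern-history scheme is meant to capture for stable sharing patterns.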
ECO: Efficient Collective Operations for Communication on Heterogeneous Networks
International Parallel Processing Symposium, 1995
Cited by 50 (4 self)
PVM and other distributed computing systems have enabled the use of networks of workstations for parallel computation, but their approach of treating a network as a collection of point-to-point connections does not promote efficient communication -- particularly collective communication. ECO is a package which solves this problem with programs which analyze the network and establish efficient communication patterns which are used by a library of collective operations. The analysis is done off-line, so that after paying the one-time cost of analyzing the network, the execution of application programs is not delayed. This paper gives performance results from using ECO to implement the collective communication in CHARMM, a widely used macromolecular dynamics package. ECO facilitates the development of data parallel applications by providing a simple interface to routines which use the available heterogeneous networks efficiently. This approach gives a naive programmer the abili...
Run-time and compile-time support for adaptive irregular problems
Supercomputing '94, 1994
Cited by 49 (9 self)
In adaptive irregular problems the data arrays are accessed via indirection arrays, and data access patterns change during computation. Implementing such problems on distributed memory machines requires support for dynamic data partitioning, efficient preprocessing and fast data migration. This research presents efficient runtime primitives for such problems. This new set of primitives is part of the CHAOS library. It subsumes the previous PARTI library which targeted only static irregular problems. To demonstrate the efficacy of the runtime support, two real adaptive irregular applications have been parallelized using CHAOS primitives: a molecular dynamics code (CHARMM) and a particle-in-cell code (DSMC). The paper also proposes extensions to Fortran D which can allow compilers to generate more efficient code for adaptive problems. These language extensions have been implemented in the Syracuse Fortran 90D/HPF prototype compiler. The performance of the compiler parallelized codes is compared with the hand parallelized versions.
Coherent network interfaces for fine-grain communication
Proceedings of the 23rd Annual International Symposium on Computer Architecture, 1996
Cited by 48 (14 self)
Historically, processor accesses to memory-mapped device registers have been marked uncachable to insure their visibility to the device. The ubiquity of snooping cache coherence, however, makes it possible for processors and devices to interact with cachable, coherent memory operations. Using coherence can improve performance by facilitating burst transfers of whole cache blocks and reducing control overheads (e.g., for polling). This paper begins an exploration of network interfaces (NIs) that use coherence -- coherent network interfaces (CNIs) -- to improve communication performance. We restrict this study to NI/CNIs that reside on coherent memory or I/O buses, to NI/CNIs that are much simpler than processors, and to the performance of fine-grain messaging from user process to user process. Our first contribution is to develop and optimize two mechanisms that CNIs use to communicate with processors. A cachable device register -- derived from cachable control registers [39, 40] -- is a coherent, cachable block of memory used to transfer status, control, or data between a device and a processor. Cachable queues generalize cachable device registers from one cachable, coherent memory block to a contiguous region of cachable, coherent blocks managed as a circular queue. Our second contribution is a taxonomy and comparison of four CNIs with a more conventional NI. Microbenchmark results show that CNIs can improve the round-trip latency and achievable bandwidth of a small 64-byte message by 37% and 125% respectively on the memory bus and 74% and 123% respectively on a coherent I/O bus. Experiments with five macrobenchmarks show that CNIs can improve the performance by 17-53% on the memory bus and 30-88% on the I/O bus.
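The circular-queue management mentioned in this abstract can be pictured with a plain software ring buffer. This is only an analogy under stated assumptions: it models a fixed number of message slots with head and tail indices, and it does not model cache coherence, the bus, or the device. The class and method names are hypothetical.

```python
class CircularQueue:
    """Fixed-capacity ring of message slots with separate head and tail indices."""

    def __init__(self, num_blocks):
        self.slots = [None] * num_blocks
        self.head = 0   # next slot the consumer reads
        self.tail = 0   # next slot the producer writes
        self.count = 0  # number of occupied slots

    def enqueue(self, message):
        """Producer side: return False instead of overwriting when full."""
        if self.count == len(self.slots):
            return False
        self.slots[self.tail] = message
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return True

    def dequeue(self):
        """Consumer side: return None when the queue is empty."""
        if self.count == 0:
            return None
        message = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return message


queue = CircularQueue(num_blocks=2)
accepted = [queue.enqueue(m) for m in ("msg0", "msg1", "msg2")]
first = queue.dequeue()
```

Because the indices wrap with a modulo, the same contiguous region is reused indefinitely, which is the property the abstract attributes to managing a region of blocks as a circular queue.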

