Where is Time Spent in Message-Passing and Shared-Memory Programs?
- In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VI)
, 1994
Cited by 51 (4 self)
Message passing and shared memory are two techniques parallel programs use for coordination and communication. This paper studies the strengths and weaknesses of these two mechanisms by comparing equivalent, well-written message-passing and shared-memory programs running on similar hardware. To ensure that our measurements are comparable, we produced two carefully tuned versions of each program and measured them on closely related simulators of a message-passing and a shared-memory machine, both of which are based on the same underlying hardware assumptions. We examined the behavior and performance of each program carefully. Although the cost of computation in each pair of programs was similar, synchronization and communication differed greatly. We found that message passing's advantage over shared memory is not clear-cut. Three of the four shared-memory programs ran at roughly the same speed as their message-passing equivalents, even though their communication patterns were different.
Implicit Coscheduling: Coordinated Scheduling with Implicit Information in Distributed Systems
- ACM Transactions on Computer Systems
, 1998
Cited by 44 (2 self)
In this thesis, we formalize the concept of an implicitly-controlled system, also referred to as an implicit system. In an implicit system, cooperating components do not explicitly contact other components for control or state information; instead, components infer remote state by observing naturally-occurring local events and their corresponding implicit information, i.e., information available outside of a defined interface. Many systems, particularly in distributed and networked environments, have leveraged implicit control to simplify the implementation of services with autonomous components. To concretely demonstrate the advantages of implicit control, we propose and implement implicit coscheduling, an algorithm for dynamically coordinating the time...
LoPC: Modeling Contention in Parallel Algorithms
, 1997
Cited by 41 (9 self)
Parallel algorithm designers need computational models that take first-order system costs into account, but are also simple enough to use in practice. This paper introduces the LoPC model, which is inspired by the LogP model but accounts for contention for message processing resources in parallel algorithms on a multiprocessor or network of workstations. LoPC takes the L, o, and P parameters directly from the LogP model and uses them to predict the cost of contention, C.
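To make the parameters concrete, here is a minimal sketch of the usual LogP accounting for one small message: send overhead, then network latency, then receive overhead. It is an illustration only, not the LoPC model itself. The names `LogPParams` and `message_time` and all numeric values are hypothetical, and the contention delay is passed in by the caller rather than predicted from the other parameters as LoPC does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogPParams:
    """Illustrative LogP-style parameters (values used below are made up)."""
    L: float  # network latency for one small message
    o: float  # processor overhead to send or to receive one message
    P: int    # number of processors


def message_time(params: LogPParams, contention: float = 0.0) -> float:
    """End-to-end time of one small message.

    Send overhead + latency + receive overhead, plus an optional
    contention delay that the caller supplies (it is NOT derived here).
    """
    if contention < 0:
        raise ValueError("contention delay must be non-negative")
    return params.o + params.L + params.o + contention


# Hypothetical machine: latency 10 units, overhead 2 units, 16 processors.
params = LogPParams(L=10.0, o=2.0, P=16)
contention_free = message_time(params)
with_contention = message_time(params, contention=3.5)
```

Keeping contention as a separate additive term mirrors how the abstract frames it: a cost predicted on top of the parameters inherited from LogP.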
Performance Analysis of MPI Collective Operations
- In: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS’05) - Workshop 15
, 2005
Cited by 36 (6 self)
Previous studies of application usage show that the performance of collective communications is critical for high-performance computing and is often overlooked when compared to point-to-point performance. In this paper we attempt to analyze and improve collective communication in the context of the widely deployed MPI programming paradigm by extending accepted models of point-to-point communication, such as Hockney, LogP/LogGP, and PLogP. The predictions from the models were compared to the experimentally gathered data and our findings were used to optimize the implementation of collective operations in the FT-MPI library.
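As a rough illustration of extending a point-to-point model to a collective, the sketch below uses the Hockney form, in which sending m bytes costs alpha + beta * m, and multiplies it by the number of rounds in a binomial-tree broadcast. This is a textbook-style estimate and not necessarily the formulation used in the paper; the function names and the parameter values are hypothetical.

```python
def hockney_time(message_bytes: int, alpha: float, beta: float) -> float:
    """Point-to-point time under the Hockney model.

    alpha: per-message startup cost; beta: transfer cost per byte.
    """
    if message_bytes < 0:
        raise ValueError("message size must be non-negative")
    return alpha + beta * message_bytes


def binomial_broadcast_time(num_procs: int, message_bytes: int,
                            alpha: float, beta: float) -> float:
    """Estimated time of a binomial-tree broadcast.

    The tree needs ceil(log2(num_procs)) rounds, and each round is
    assumed to cost one full point-to-point transfer of the message.
    """
    if num_procs < 1:
        raise ValueError("need at least one process")
    rounds = (num_procs - 1).bit_length()  # equals ceil(log2(num_procs))
    return rounds * hockney_time(message_bytes, alpha, beta)


# Hypothetical parameters: 5 time units of startup, 0.25 units per byte.
point_to_point = hockney_time(1024, alpha=5.0, beta=0.25)
broadcast_8 = binomial_broadcast_time(8, 1024, alpha=5.0, beta=0.25)
```

Comparing such predictions against measured times is the kind of check the abstract describes; where the two diverge, the model is missing a cost.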
CICO: A Practical Shared-Memory Programming Performance Model
- Workshop on Portability and Performance for Parallel Processing
, 1993
Cited by 14 (1 self)
A programming performance model provides a programmer with feedback on the cost of program operations and is a necessary basis for writing efficient programs. Many shared-memory performance models do not accurately capture the cost of interprocessor communication caused by non-local memory references, particularly in computers with caches. This paper describes a simple and practical programming performance model--called check-in, check-out (CICO)--for cache-coherent, shared-memory parallel computers. CICO consists of two components. The first is a collection of annotations that a programmer adds to a program to elucidate the communication arising from shared-memory references. The second is a model that calculates the communication cost of these annotations. An annotation's cost models the cost of the memory references that it summarizes and serves as a metric to compare alternative implementations. Several examples demonstrate that CICO accurately predicts cache misses and identifies changes that improve program performance.
Program Development and Performance Prediction on BSP Machines Using Opal
, 1994
Cited by 11 (0 self)
Machine. This uses combining networks on a butterfly topology with a hashed address space to try to hide the network latency. [ Abolhassan et al., 1991 ] analyses Ranade's approach in a quantitative way by giving cost models for implementing various parts of the PRAM machine. This is then used to demonstrate an improvement on Ranade's Fluent machine using multiple butterflies and parallel slackness. It is then shown that the proposed improved Fluent machine would have a price/performance ratio similar to that of conventional distributed memory architectures. Other attempts at realising the PRAM model involve its simulation on conventional distributed memory architectures. This method usually involves hashing the address space of the PRAM across the distributed memory of the machine and replication of variables [ Mehlhorn and Vishkin, 1984 ], or using multiple hash functions [ Abolhassan et al., 1991 ]. 2.2 BSP A Bulk Synchronous Parallel machine consists of a number of processor memo...
Improving the Performance of MPI Derived Datatypes by Optimizing Memory-Access Cost
- In Proceedings of the IEEE International Conference on Cluster Computing
, 2003
Cited by 11 (4 self)
The MPI Standard supports derived datatypes, which allow users to describe noncontiguous memory layout and communicate noncontiguous data with a single communication function. This feature enables an MPI implementation to optimize the transfer of noncontiguous data. In practice, however, few MPI implementations implement derived datatypes in a way that performs better than what the user can achieve by manually packing data into a contiguous buffer and then calling an MPI function. In this paper, we present a technique for improving the performance of derived datatypes by automatically using packing algorithms that are optimized for memory-access cost. The packing algorithms use memory-optimization techniques that the user cannot apply easily without advanced knowledge of the memory architecture. We present performance results for a matrix-transpose example that demonstrate that our implementation of derived datatypes significantly outperforms both manual packing by the user and the existing derived-datatype code in the MPI implementation (MPICH).
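The manual packing that the abstract uses as a baseline can be pictured with a small sketch: copying one column of a row-major matrix, which is a strided and therefore noncontiguous access, into a contiguous buffer before sending. Plain Python lists stand in for memory here and no MPI call is made. The function names are hypothetical, and this is not the memory-optimized packing algorithm from the paper.

```python
from typing import List, Sequence


def pack_column(matrix_flat: Sequence[float], num_rows: int,
                num_cols: int, col: int) -> List[float]:
    """Gather one column of a row-major matrix into a contiguous list."""
    if len(matrix_flat) != num_rows * num_cols:
        raise ValueError("matrix size does not match its dimensions")
    if not 0 <= col < num_cols:
        raise IndexError("column index out of range")
    # Consecutive column elements sit num_cols entries apart in memory.
    return [matrix_flat[row * num_cols + col] for row in range(num_rows)]


def unpack_column(buffer: Sequence[float], matrix_flat: List[float],
                  num_rows: int, num_cols: int, col: int) -> None:
    """Scatter a contiguous buffer back into one column, in place."""
    if len(buffer) != num_rows:
        raise ValueError("buffer length must equal the number of rows")
    if len(matrix_flat) != num_rows * num_cols:
        raise ValueError("matrix size does not match its dimensions")
    if not 0 <= col < num_cols:
        raise IndexError("column index out of range")
    for row in range(num_rows):
        matrix_flat[row * num_cols + col] = buffer[row]


# A 3x4 matrix stored row by row: element (r, c) holds 4 * r + c.
matrix = [float(value) for value in range(12)]
packed = pack_column(matrix, num_rows=3, num_cols=4, col=1)

# Receiving side: write the packed column into an empty 3x4 matrix.
received = [0.0] * 12
unpack_column(packed, received, num_rows=3, num_cols=4, col=1)
```

A derived datatype lets the library perform this gather itself from a description of the layout (count, block length, stride), which is what gives an implementation room to choose a faster packing strategy.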
Multigrid Equation Solvers for Large Scale Nonlinear Finite Element Simulations
, 1999
Cited by 10 (5 self)
The finite element method has grown, in the past 40 years, to be a popular method for the simulation of physical systems in science and engineering. The finite element method is used in a wide array of industries. In fact, just about any enterprise that makes a physical product can, and probably does, use finite element technology. The success of the finite element method is due in large part to its ability to allow the use of accurate formulations of partial differential equations (PDEs) on arbitrarily general physical domains with complex boundary conditions. Additionally, the rapid growth in the computational power available in today's computers - for an ever more affordable price - has made finite element technology...
Latency and Bandwidth Requirements of Massively Parallel Programs: FFT as a Case Study
- In Second International Euro-Par Conference, Volume I, Number 1123 in LNCS
, 1999
Cited by 5 (5 self)
In this paper we compare three routing algorithms for massively parallel architectures, each offering an increasing degree of adaptivity: a deterministic algorithm, a minimal adaptive based on Duato's methodology and a non-minimal adaptive, the Chaos routing. Rather than using a synthetic benchmark, the comparison is done with a real application, the transpose FFT algorithm. The simulation results collected on bi-dimensional tori with up to 256 processing nodes show that both adaptive algorithms suffer from post-saturation problems that degrade the network throughput.

