MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface
, 2002
MagPIe: MPI’s Collective Communication Operations for Clustered Wide Area Systems
- Proc PPoPP'99
, 1999
Abstract - Cited by 173 (27 self)
Writing parallel applications for computational grids is a challenging task. To achieve good performance, algorithms designed for local area networks must be adapted to the differences in link speeds. An important class of algorithms are collective operations, such as broadcast and reduce. We have developed MAGPIE, a library of collective communication operations optimized for wide area systems. MAGPIE's algorithms send the minimal amount of data over the slow wide area links, and only incur a single wide area latency. Using our system, existing MPI applications can be run unmodified on geographically distributed systems. On moderate cluster sizes, using a wide area latency of 10 milliseconds and a bandwidth of 1 MByte/s, MAGPIE executes operations up to 10 times faster than MPICH, a widely used MPI implementation; application kernels improve by up to a factor of 4. Due to the structure of our algorithms, MAGPIE's advantage increases for higher wide area latencies.
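As a rough illustration of the wide-area optimization this abstract describes, the sketch below (hypothetical cluster sizes, not MagPIe code) counts how many messages cross the slow WAN links under a flat broadcast versus a cluster-aware one that sends the data once per remote cluster:

```python
# Hypothetical cluster layout: cluster id -> number of processes.
def wan_messages_flat(clusters, root_cluster):
    """Flat broadcast: the root sends to every other process directly,
    so every message to a process in a remote cluster crosses the WAN."""
    return sum(n for cid, n in clusters.items() if cid != root_cluster)

def wan_messages_hierarchical(clusters, root_cluster):
    """Cluster-aware broadcast: the root sends once to a coordinator in
    each remote cluster; coordinators re-broadcast over the fast LAN."""
    return len(clusters) - 1

clusters = {"A": 8, "B": 8, "C": 8}
print(wan_messages_flat(clusters, "A"))          # 16 WAN messages
print(wan_messages_hierarchical(clusters, "A"))  # 2 WAN messages
```

Because each remote cluster is reached by exactly one WAN message, every process sees only a single wide-area latency on the critical path, which matches the claim in the abstract.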
Exploiting Hierarchy in Parallel Computer Networks to Optimize Collective Operation Performance
, 2000
Abstract - Cited by 89 (12 self)
The efficient implementation of collective communication operations has received much attention. Initial efforts modeled network communication and produced "optimal" trees based on those models. However, the models used by these initial efforts assumed equal point-to-point latencies between any two processes. This assumption is violated in heterogeneous systems such as clusters of SMPs and wide-area "computational grids", and as a result, collective operations that utilize the trees generated by these models perform suboptimally. In response, more recent work has focused on creating topology-aware trees for collective operations that minimize communication across slower channels (e.g., a wide-area network). While these efforts have significant communication benefits, they all limit their view of the network to only two layers. We present a strategy based upon a multilayer view of the network. By creating multilevel topology trees we take advantage of communication cost differences at every lev...
Bandwidth-efficient Collective Communication for Clustered Wide Area Systems
- In Proc. International Parallel and Distributed Processing Symposium (IPDPS 2000), Cancun
, 1999
Abstract - Cited by 36 (3 self)
Metacomputing infrastructures couple multiple clusters (or MPPs) via wide-area networks and thus allow parallel programs to run on geographically distributed resources. A major problem in programming such wide-area parallel applications is the difference in communication costs inside and between clusters. Latency and bandwidth of WANs often are orders of magnitude worse than those of local networks. Our MagPIe library eases wide-area parallel programming by providing an efficient implementation of MPI's collective communication operations. MagPIe exploits the hierarchical structure of clustered wide-area systems and minimizes the communication overhead over the WAN links. In this paper, we present improved algorithms for collective communication that achieve shorter completion times by simultaneously using the aggregate bandwidth of the available wide-area links. Our new algorithms split messages into multiple segments that are sent in parallel over different WAN links, thus resulting ...
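The message-splitting idea in this abstract can be sketched in a few lines (a toy illustration, not the paper's implementation; segment sizing is a simple even split):

```python
import math

def split_for_links(msg: bytes, nlinks: int):
    """Split a message into at most nlinks roughly equal segments, one
    per wide-area link, so the segments can travel in parallel."""
    seg = math.ceil(len(msg) / nlinks)
    return [msg[i:i + seg] for i in range(0, len(msg), seg)]

segments = split_for_links(b"abcdefghij", 3)   # 3 segments: 4 + 4 + 2 bytes
assert b"".join(segments) == b"abcdefghij"     # lossless reassembly

# With k parallel links of bandwidth b, the wide-area transfer time for
# a message of size n drops from roughly n/b to n/(k*b) plus one latency,
# which is the aggregate-bandwidth effect the abstract describes.
```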
BIP-SMP: High Performance Message Passing over a Cluster of Commodity SMPs
- In Supercomputing (SC’99)
, 1999
Optimizing Threaded MPI Execution on SMP Clusters
- In Proc. of 15th ACM International Conference on Supercomputing
, 2001
Abstract - Cited by 30 (1 self)
Our previous work has shown that using threads to execute MPI programs can yield great performance gain on multiprogrammed shared-memory machines. This paper investigates the design and implementation of a thread-based MPI system on SMP clusters. Our study indicates that with a proper design for threaded MPI execution, both point-to-point and collective communication performance can be improved substantially, compared to a process-based MPI implementation in a cluster environment. Our contribution includes a hierarchy-aware and adaptive communication scheme for threaded MPI execution and a thread-safe network device abstraction that uses event-driven synchronization and provides separated collective and point-to-point communication channels. This paper describes the implementation of our design and illustrates its performance advantage on a Linux SMP cluster.
A Multilevel Approach to Topology-aware Collective Operations in Computational Grids
, 2002
Abstract - Cited by 14 (0 self)
The efficient implementation of collective communication operations has received much attention. Initial efforts produced "optimal" trees based on network communication models that assumed equal point-to-point latencies between any two processes. This assumption is violated in most practical settings, however, particularly in heterogeneous systems such as clusters of SMPs and wide-area "computational Grids," with the result that collective operations perform suboptimally. In response, more recent work has focused on creating topology-aware trees for collective operations that minimize communication across slower channels (e.g., a wide-area network). While these efforts have significant communication benefits, they all limit their view of the network to only two layers. We present a strategy based upon a multilayer view of the network. By creating multilevel topology-aware trees we take advantage of communication cost differences at every level in the network. We used this strategy to implement topology-aware versions of several MPI collective operations in MPICH-G2, the Globus Toolkit™-enabled ver...
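The multilevel-tree construction this abstract argues for can be sketched as a recursion over a nested topology (the three-level grid/site/node layout and the pick-the-first-rank representative rule are illustrative assumptions, not the paper's algorithm):

```python
def rep(group):
    """Representative (local root) of a group: the first rank in its
    first subgroup, chosen recursively."""
    return group[0] if isinstance(group, list) else rep(next(iter(group.values())))

def tree_edges(group):
    """Build a multilevel broadcast tree: at each level the group's
    representative forwards to the representative of every sibling
    subgroup; inside a leaf group it sends to each local member."""
    if isinstance(group, list):
        return [(group[0], r) for r in group[1:]]
    subs = list(group.values())
    edges = [(rep(subs[0]), rep(s)) for s in subs[1:]]
    for s in subs:
        edges += tree_edges(s)
    return edges

# Hypothetical topology: grid -> sites -> nodes -> MPI ranks.
topo = {"siteA": {"node0": [0, 1], "node1": [2, 3]},
        "siteB": {"node2": [4, 5]}}
edges = tree_edges(topo)
# Spanning tree over 6 ranks; only the edge (0, 4) crosses the
# wide-area (site) level, and only node-to-node edges cross nodes.
```

The point of the multilevel view is visible in the edge list: exactly one edge crosses each slow boundary, instead of the many cross-boundary edges a flat or two-layer tree could produce.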
Accurately measuring MPI broadcasts in a computational grid
- In Proc. IEEE Symp. on High Performance Distributed Computing (HPDC-8)
, 1999
Abstract - Cited by 14 (3 self)
An MPI library’s implementation of broadcast communication can significantly affect the performance of applications built with that library. In order to choose between similar implementations or to evaluate available libraries, accurate measurements of broadcast performance are required. As we demonstrate, existing methods for measuring broadcast performance are either inaccurate or inadequate. Fortunately, we have designed an accurate method for measuring broadcast performance, even in a challenging grid environment. Measuring broadcast performance is not easy. Simply sending one broadcast after another allows them to proceed through the network concurrently, thus resulting in inaccurate per-broadcast timings. Existing methods either fail to eliminate this pipelining effect or eliminate it
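The pipelining effect the abstract warns about can be shown with a toy pipeline model (the tree depth and per-hop time are hypothetical parameters; this illustrates the measurement pitfall, not the paper's own method, which the snippet leaves truncated):

```python
def naive_per_broadcast(n, depth, hop_time):
    """Time n back-to-back broadcasts through a depth-stage pipeline
    and divide by n. Successive broadcasts overlap in the network, so
    this estimate can fall far below the true single-broadcast latency."""
    total = (depth + n - 1) * hop_time   # pipelined completion time
    return total / n

def true_latency(depth, hop_time):
    """Latency of one isolated broadcast through the same pipeline."""
    return depth * hop_time

# With a 4-hop broadcast tree and 1 ms per hop, 100 back-to-back
# broadcasts yield a naive estimate of ~1.03 ms per broadcast,
# versus a true single-broadcast latency of 4 ms.
```

Forcing each broadcast to complete (e.g., by waiting for an acknowledgement) before starting the next removes the overlap, at the cost of a more careful timing harness.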
A Multiprotocol Communication Support for the Global Address Space Programming Model on the IBM SP
- Proc. EuroPar-2000, Springer Verlag LNCS-1900
, 2000
Abstract - Cited by 13 (6 self)
The paper describes an efficient communication support for the global address space programming model on the IBM SP, a commercial example of the SMP (symmetric multi-processor) clusters. Our approach integrates shared memory with active messages, threads and remote memory copy between nodes. The shared memory operations offer substantial performance improvement over LAPI, IBM's one-sided communication library, within an SMP node. Based on the experiments with the SPLASH-2 LU benchmark and a molecular dynamics simulation, our multiprotocol support for the global address space is found to improve performance and scalability of applications. This approach could also be used in optimizing the MPI-2 one-sided communication on the SMP clusters. 1 Introduction. This work is motivated by applications that require support for a shared-memory programming style rather than just message passing. Many of them are characterized by irregular data structures, and dynamic or unpredictable dat...
Load-Balancing Scatter Operations for Grid Computing
- In 12th Heterogeneous Computing Workshop (HCW’2003), IEEE CS
, 2003
Abstract - Cited by 12 (4 self)
We present solutions to statically load-balance scatter operations in parallel codes run on Grids. Our load-balancing strategy is based on the modification of the data distributions used in scatter operations. We need to modify the user source code, but we want to keep the code as close as possible to the original. Hence, we study the replacement of scatter operations with a parameterized scatter, allowing a custom distribution of data. The paper presents: 1) a general algorithm which finds an optimal distribution of data across processors; 2) a quicker guaranteed heuristic relying on hypotheses on communications and computations; 3) a policy on the ordering of the processors. Experimental results with an MPI scientific code of seismic tomography illustrate the benefits obtained from our load-balancing.
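The core idea of a parameterized, speed-proportional scatter distribution can be sketched as follows (a simple largest-remainder rounding scheme, assumed for illustration; the paper's actual algorithm also accounts for communication costs and processor ordering):

```python
def proportional_counts(total, speeds):
    """Statically split `total` data items across processors in
    proportion to their measured speeds, using largest-remainder
    rounding so the counts always sum exactly to `total`."""
    s = sum(speeds)
    raw = [total * v / s for v in speeds]
    counts = [int(r) for r in raw]          # floor of each ideal share
    leftover = total - sum(counts)
    # Hand the leftover items to the processors whose ideal shares
    # had the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i],
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

# Processors with relative speeds 1, 2, 2 receive 2, 4, 4 of 10 items.
print(proportional_counts(10, [1, 2, 2]))   # [2, 4, 4]
```

Such a counts vector is exactly what a parameterized scatter (e.g., MPI's `MPI_Scatterv`) consumes in place of the uniform block sizes of a plain scatter.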