MPI-StarT: Delivering Network Performance to Numerical Applications (1998)

by Parry Husbands, James C. Hoe
Venue: In SC
Results 1 - 10 of 35 citing documents

MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface

by Nicholas T. Karonis, Brian Toonen, Ian Foster, 2002
"... ..."
Abstract - Cited by 303 (15 self) - Add to MetaCart
Abstract not found
(Show Context)

Citation Context

...level topology-aware approach [32] relative both to topology-unaware binomial trees and earlier topology-aware approaches that distinguish only between “intracluster” and “intercluster” communications [30, 35]. As we explain in the next subsection, MPICH-G2’s topology-aware collective operations are constructed in terms of topology discovery mechanisms that can also be used by topology-aware applications. ...

MagPIe: MPI’s Collective Communication Operations for Clustered Wide Area Systems

by Thilo Kielmann, Rutger F. H. Hofman, Henri E. Bal, Aske Plaat, Raoul A. F. Bhoedjang - Proc. PPoPP'99, 1999
"... Writing parallel applications for computational grids is a challenging task. To achieve good performance, algorithms designed for local area networks must be adapted to the differences in link speeds. An important class of algorithms are collective operations, such as broadcast and reduce. We have d ..."
Abstract - Cited by 173 (27 self) - Add to MetaCart
Writing parallel applications for computational grids is a challenging task. To achieve good performance, algorithms designed for local area networks must be adapted to the differences in link speeds. An important class of algorithms is collective operations, such as broadcast and reduce. We have developed MAGPIE, a library of collective communication operations optimized for wide area systems. MAGPIE's algorithms send the minimal amount of data over the slow wide area links, and only incur a single wide area latency. Using our system, existing MPI applications can be run unmodified on geographically distributed systems. On moderate cluster sizes, using a wide area latency of 10 milliseconds and a bandwidth of 1 MByte/s, MAGPIE executes operations up to 10 times faster than MPICH, a widely used MPI implementation; application kernels improve by up to a factor of 4. Due to the structure of our algorithms, MAGPIE's advantage increases for higher wide area latencies.
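The two-level scheme this abstract describes maps naturally onto MPI communicators. The sketch below is a minimal illustration, not MagPIe's code: the function name, the caller-supplied my_cluster id, and the use of a plain MPI_Bcast among coordinators are all assumptions (MagPIe would use a flat tree from the root so each message crosses the WAN exactly once and incurs a single wide-area latency).

#include <mpi.h>

/* Two-level broadcast sketch: one WAN message per cluster, then a fast
 * local broadcast. my_cluster identifies this process's cluster. */
void two_level_bcast(void *buf, int count, MPI_Datatype type,
                     int root, MPI_Comm comm, int my_cluster)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    int key = (rank == root) ? 0 : rank + 1;      /* root sorts first */

    /* Intra-cluster communicator; local rank 0 acts as coordinator. */
    MPI_Comm local;
    MPI_Comm_split(comm, my_cluster, key, &local);
    int lrank;
    MPI_Comm_rank(local, &lrank);

    /* Communicator over the root and the per-cluster coordinators. */
    MPI_Comm wan;
    MPI_Comm_split(comm, (lrank == 0) ? 0 : MPI_UNDEFINED, key, &wan);

    if (wan != MPI_COMM_NULL) {   /* step 1: cross the slow WAN once per cluster */
        MPI_Bcast(buf, count, type, 0, wan);
        MPI_Comm_free(&wan);
    }
    MPI_Bcast(buf, count, type, 0, local);  /* step 2: cheap local redistribution */
    MPI_Comm_free(&local);
}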

Citation Context

...ted design philosophy. MagPIe implements the complete set of MPI's collective operations with in-depth treatment of wide-area optimality and associativity of the reduction operations. Husbands et al. [20] report significantly improved performance on a cluster of SMPs with a handcrafted two-level implementation of MPI Bcast. Banikazemi et al. [4] investigate optimal communication structures for multica...

Exploiting Hierarchy in Parallel Computer Networks to Optimize Collective Operation Performance

by Nicholas Karonis, Bronis R. De Supinski, Ian Foster, William Gropp, Ewing Lusk, John Bresnahan, 2000
"... The ecient implementation of collective communication operations has received much attention. Initial eorts modeled network communication and produced \optimal" trees based on those models. However, the models used by these initial eorts assumed equal point-to-point latencies between any two pr ..."
Abstract - Cited by 89 (12 self) - Add to MetaCart
The efficient implementation of collective communication operations has received much attention. Initial efforts modeled network communication and produced “optimal” trees based on those models. However, the models used by these initial efforts assumed equal point-to-point latencies between any two processes. This assumption is violated in heterogeneous systems such as clusters of SMPs and wide-area “computational grids”, and as a result, collective operations that utilize the trees generated by these models perform suboptimally. In response, more recent work has focused on creating topology-aware trees for collective operations that minimize communication across slower channels (e.g., a wide-area network). While these efforts have significant communication benefits, they all limit their view of the network to only two layers. We present a strategy based upon a multilayer view of the network. By creating multilevel topology trees we take advantage of communication cost differences at every lev...
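The multilevel strategy can be conveyed in a few lines of MPI. The sketch below is an assumed rendering, not the paper's implementation: it takes a per-process array colors[l] giving the process's group at network level l (site, cluster, node, ...), assumes the root is rank 0, and recursively descends one layer at a time, so the slowest remaining links carry each message only between group leaders.

#include <mpi.h>

/* Multilevel broadcast sketch. colors[l] = this process's group id at
 * network level l, slowest layer first; the root is assumed to be rank 0. */
void multilevel_bcast(void *buf, int count, MPI_Datatype type,
                      MPI_Comm comm, const int *colors, int nlevels)
{
    if (nlevels == 0) {                      /* finest layer: plain broadcast */
        MPI_Bcast(buf, count, type, 0, comm);
        return;
    }
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Split into groups for this layer; rank 0 of each group leads it. */
    MPI_Comm group;
    MPI_Comm_split(comm, colors[0], rank, &group);
    int grank;
    MPI_Comm_rank(group, &grank);

    /* Group leaders exchange over this layer's (slowest remaining) links. */
    MPI_Comm leaders;
    MPI_Comm_split(comm, (grank == 0) ? 0 : MPI_UNDEFINED, rank, &leaders);
    if (leaders != MPI_COMM_NULL) {
        MPI_Bcast(buf, count, type, 0, leaders);
        MPI_Comm_free(&leaders);
    }

    /* Recurse one layer down, inside the group. */
    multilevel_bcast(buf, count, type, group, colors + 1, nlevels - 1);
    MPI_Comm_free(&group);
}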

Citation Context

...gorithm may generate O(log N) intercluster messages, while a topology-aware algorithm generates only 1, for a cost saving of a factor of O(log N) if intercluster message costs dominate. Previous work [13, 15] has demonstrated that topology-aware collective operations can indeed reduce communication costs by reducing the amount of communication performed over slow channels. However, this work limited the d...

Bandwidth-efficient Collective Communication for Clustered Wide Area Systems

by Thilo Kielmann, Henri E. Bal, Sergei Gorlatch - In Proc. International Parallel and Distributed Processing Symposium (IPDPS 2000), Cancun, 2000
"... Metacomputing infrastructures couple multiple clusters (or MPPs) via wide-area networks and thus allow parallel programs to run on geographically distributed resources. A major problem in programming such wide-area parallel applications is the difference in communication costs inside and between clu ..."
Abstract - Cited by 36 (3 self) - Add to MetaCart
Metacomputing infrastructures couple multiple clusters (or MPPs) via wide-area networks and thus allow parallel programs to run on geographically distributed resources. A major problem in programming such wide-area parallel applications is the difference in communication costs inside and between clusters. Latency and bandwidth of WANs often are orders of magnitude worse than those of local networks. Our MagPIe library eases wide-area parallel programming by providing an efficient implementation of MPI's collective communication operations. MagPIe exploits the hierarchical structure of clustered wide-area systems and minimizes the communication overhead over the WAN links. In this paper, we present improved algorithms for collective communication that achieve shorter completion times by simultaneously using the aggregate bandwidth of the available wide-area links. Our new algorithms split messages into multiple segments that are sent in parallel over different WAN links, thus resulting ...
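The message-splitting idea reads directly as MPI code. The following sketch is illustrative only: it assumes each cluster designates p "gateway" processes with precomputed peer ranks in the other cluster, that the payload size divides evenly by p, and that every sending gateway already holds the full buffer; names such as parallel_wan_bcast are invented for this example.

#include <mpi.h>

/* Each of a cluster's p gateways ships one slice of the message to its
 * counterpart in the remote cluster; the receivers reassemble locally.
 * local = communicator of this cluster's gateways; peer = world rank of
 * my counterpart gateway. Assumes p divides nbytes evenly. */
void parallel_wan_bcast(char *buf, int nbytes, MPI_Comm local,
                        int peer, int sending_cluster)
{
    int p, r;
    MPI_Comm_size(local, &p);
    MPI_Comm_rank(local, &r);
    int slice = nbytes / p;
    char *mine = buf + r * slice;    /* the slice this gateway handles */

    if (sending_cluster) {
        /* All p WAN links carry distinct slices concurrently. */
        MPI_Send(mine, slice, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(mine, slice, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        /* Reassemble the full message over the fast local network. */
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      buf, slice, MPI_CHAR, local);
    }
}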

Citation Context

... multiple receivers. Some work has been performed on optimizing single collective operations (e.g. broadcast) for clusters of SMPs which (like wide-area networks) also exhibit hierarchical structures [22, 26]. KELP [2] and SIMPLE [3] both provide programming frameworks where parallel applications can be designed in a BSP-like manner. Their implementations target at clusters of SMPs where communication ste...

BIP-SMP: High Performance Message Passing over a Cluster of Commodity SMPs

by Patrick Geoffray, Loïc Prylli, Bernard Tourancheau - IN SUPERCOMPUTING (SC’99), 1999
"... ..."
Abstract - Cited by 31 (2 self) - Add to MetaCart
Abstract not found

Optimizing Threaded MPI Execution on SMP Clusters

by Hong Tang, Tao Yang - IN PROC. OF 15TH ACM INTERNATIONAL CONFERENCE ON SUPERCOMPUTING, 2001
"... Our previous work has shown that using threads to execute MPI programs can yield great performance gain on multiprogrammed shared-memory machines. This paper investigates the design and implementation of a thread-based MPI system on SMP clusters. Our study indicates that with a proper design for thr ..."
Abstract - Cited by 30 (1 self) - Add to MetaCart
Our previous work has shown that using threads to execute MPI programs can yield great performance gain on multiprogrammed shared-memory machines. This paper investigates the design and implementation of a thread-based MPI system on SMP clusters. Our study indicates that with a proper design for threaded MPI execution, both point-to-point and collective communication performance can be improved substantially, compared to a process-based MPI implementation in a cluster environment. Our contribution includes a hierarchy-aware and adaptive communication scheme for threaded MPI execution and a thread-safe network device abstraction that uses event-driven synchronization and provides separated collective and point-to-point communication channels. This paper describes the implementation of our design and illustrates its performance advantage on a Linux SMP cluster.
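In today's MPI terms (MPI-3, so later than this paper, and process-based rather than thread-based), the hierarchy-aware idea can be expressed by splitting processes into a node-local communicator plus a communicator of per-node leaders; the example below illustrates that general scheme, not the paper's system.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ranks on the same SMP node can communicate via shared memory. */
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node);
    int node_rank;
    MPI_Comm_rank(node, &node_rank);

    /* One representative per node handles cross-node traffic. */
    MPI_Comm leaders;
    MPI_Comm_split(MPI_COMM_WORLD, (node_rank == 0) ? 0 : MPI_UNDEFINED,
                   rank, &leaders);

    /* Hierarchy-aware broadcast from world rank 0: network first,
     * then shared memory inside each node. */
    int value = (rank == 0) ? 42 : 0;
    if (leaders != MPI_COMM_NULL) {
        MPI_Bcast(&value, 1, MPI_INT, 0, leaders);
        MPI_Comm_free(&leaders);
    }
    MPI_Bcast(&value, 1, MPI_INT, 0, node);
    printf("rank %d got %d\n", rank, value);

    MPI_Comm_free(&node);
    MPI_Finalize();
    return 0;
}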

A Multilevel Approach to Topology-aware Collective Operations in Computational Grids

by Nicholas T. Karonis, Bronis De Supinski, Ian Foster, Ewing Lusk, Sebastien Lacour, 2002
"... The ecient implementation of collective communication operations has received much attention. Initial eorts produced \optimal " trees based on network communi-cation models that assumed equal point-to-point latencies between any two processes. This assumption is violated in most practical setti ..."
Abstract - Cited by 14 (0 self) - Add to MetaCart
The efficient implementation of collective communication operations has received much attention. Initial efforts produced “optimal” trees based on network communication models that assumed equal point-to-point latencies between any two processes. This assumption is violated in most practical settings, however, particularly in heterogeneous systems such as clusters of SMPs and wide-area “computational Grids”, with the result that collective operations perform suboptimally. In response, more recent work has focused on creating topology-aware trees for collective operations that minimize communication across slower channels (e.g., a wide-area network). While these efforts have significant communication benefits, they all limit their view of the network to only two layers. We present a strategy based upon a multilayer view of the network. By creating multilevel topology-aware trees we take advantage of communication cost differences at every level in the network. We used this strategy to implement topology-aware versions of several MPI collective operations in MPICH-G2, the Globus Toolkit™-enabled ver...

Citation Context

...gorithm may generate O(log N) intercluster messages, while a topology-aware algorithm generates only 1, for a cost saving of a factor of O(log N) if intercluster message costs dominate. Previous work [13, 16] has demonstrated that topology-aware collective operations can indeed reduce communication costs by reducing the amount of communication performed over slow channels. However, this work limited the d...

Accurately measuring MPI broadcasts in a computational grid

by Bronis R. De Supinski - In Proc. IEEE Symp. on High Performance Distributed Computing (HPDC-8), 1999
"... An MPI library’s implementation of broadcast communication can significantly affect the performance of applications built with that library. In order to choose between similar implementations or to evaluate available libraries, accurate measurements of broadcast performance are required. As we demon ..."
Abstract - Cited by 14 (3 self) - Add to MetaCart
An MPI library’s implementation of broadcast communication can significantly affect the performance of applications built with that library. In order to choose between similar implementations or to evaluate available libraries, accurate measurements of broadcast performance are required. As we demonstrate, existing methods for measuring broadcast performance are either inaccurate or inadequate. Fortunately, we have designed an accurate method for measuring broadcast performance, even in a challenging grid environment. Measuring broadcast performance is not easy. Simply sending one broadcast after another allows them to proceed through the network concurrently, thus resulting in inaccurate per broadcast timings. Existing methods either fail to eliminate this pipelining effect or eliminate it ...
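The pipelining pitfall, and one simple remedy in the spirit of (though not identical to) the paper's approach, can be shown in a few lines: separate consecutive broadcasts with a zero-byte acknowledgment from a rotating receiver so that rounds cannot overlap in the network. Function and variable names here are invented; the returned average still includes the ack's cost, which a careful benchmark would measure and subtract.

#include <mpi.h>

/* Average per-broadcast time over iters rounds, with an ack from a
 * rotating receiver between rounds to keep broadcasts from overlapping.
 * Requires at least two processes; rank 0 is the broadcast root. */
double timed_bcast(void *buf, int count, MPI_Datatype type,
                   int iters, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    MPI_Barrier(comm);                       /* common start line */
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        MPI_Bcast(buf, count, type, 0, comm);
        int acker = 1 + i % (size - 1);      /* rotate the acknowledger */
        if (rank == acker)
            MPI_Send(NULL, 0, MPI_BYTE, 0, 0, comm);
        else if (rank == 0)
            MPI_Recv(NULL, 0, MPI_BYTE, acker, 0, comm,
                     MPI_STATUS_IGNORE);
    }
    /* Still includes the ack round-trip; subtract a measured ping time
     * for a closer estimate. */
    return (MPI_Wtime() - t0) / iters;
}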

A Multiprotocol Communication Support for the Global Address Space Programming Model on the IBM SP

by Jarek Nieplocha, Jialin Ju, T. P. Straatsma - Proc. EuroPar-2000, Springer Verlag LNCS-1900, 2000
"... . The paper describes an efficient communication support for the global address space programming model on the IBM SP, a commercial example of the SMP (symmetric multi-processor) clusters. Our approach integrates shared memory with active messages, threads and remote memory copy between nodes. T ..."
Abstract - Cited by 13 (6 self) - Add to MetaCart
The paper describes an efficient communication support for the global address space programming model on the IBM SP, a commercial example of the SMP (symmetric multi-processor) clusters. Our approach integrates shared memory with active messages, threads and remote memory copy between nodes. The shared memory operations offer substantial performance improvement over LAPI, IBM's one-sided communication library, within an SMP node. Based on the experiments with the SPLASH-2 LU benchmark and a molecular dynamics simulation, our multiprotocol support for the global address space is found to improve performance and scalability of applications. This approach could also be used in optimizing the MPI-2 one-sided communication on the SMP clusters. This work is motivated by applications that require support for a shared-memory programming style rather than just message passing. Many of them are characterized by irregular data structures, and dynamic or unpredictable dat...

Citation Context

...em configurations. Shared memory has been exploited before in low-level one-sided messaging systems like Active Messages [4] or Nexus [5], and higher-level two-sided message-passing interfaces like MPI [6] on SMP clusters. However, as the programming models based on the global-address space and message passing are fundamentally distinct, different strategies are needed to pass benefits of shared memory...

Load-Balancing Scatter Operations for Grid Computing

by Stéphane Genaud, Arnaud Giersch, Frédéric Vivien - IN 12TH HETEROGENEOUS COMPUTING WORKSHOP (HCW’2003), IEEE CS, 2003
"... We present solutions to statically load-balance scatter operations in parallel codes run on Grids. Our loadbalancing strategy is based on the modification of the data distributions used in scatter operations. We need to modify the user source code, but we want to keep the code as close as possible t ..."
Abstract - Cited by 12 (4 self) - Add to MetaCart
We present solutions to statically load-balance scatter operations in parallel codes run on Grids. Our load-balancing strategy is based on the modification of the data distributions used in scatter operations. We need to modify the user source code, but we want to keep the code as close as possible to the original. Hence, we study the replacement of scatter operations with a parameterized scatter, allowing a custom distribution of data. The paper presents: 1) a general algorithm which finds an optimal distribution of data across processors; 2) a quicker guaranteed heuristic relying on hypotheses on communications and computations; 3) a policy on the ordering of the processors. Experimental results with an MPI scientific code of seismic tomography illustrate the benefits obtained from our load-balancing.
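The "parameterized scatter" described above corresponds closely to the standard MPI_Scatterv call, which accepts per-process counts and displacements. In the sketch below, counts[] is assumed to come from a prior load-balancing step that sizes each share to the measured speed of its processor; the helper name is invented.

#include <mpi.h>
#include <stdlib.h>

/* Scatter with per-process counts: counts[i] elements go to rank i.
 * counts is assumed to be the output of a prior load-balancing step
 * and must be identical on all processes. */
void balanced_scatter(const double *all, double *mine,
                      const int *counts, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int *displs = malloc(size * sizeof *displs);
    displs[0] = 0;
    for (int i = 1; i < size; i++)
        displs[i] = displs[i - 1] + counts[i - 1];

    /* Unlike MPI_Scatter, each process may receive a different amount. */
    MPI_Scatterv(all, counts, displs, MPI_DOUBLE,
                 mine, counts[rank], MPI_DOUBLE, root, comm);
    free(displs);
}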

Citation Context

...rogeneous environment consists in using a communication library adapted to heterogeneity. Thus, much work has been devoted to that purpose: for MPI, numerous projects including Magpie [15], MPI-StarT [13], and MPICH-G2 [8], aim at improving communications performance in presence of heterogeneous networks. Most of the gain is obtained by reworking the design of collective communication primitives. For ...
