MagPIe: MPI’s Collective Communication Operations for Clustered Wide Area Systems
- Proc PPoPP'99
, 1999
Cited by 138 (26 self)
Writing parallel applications for computational grids is a challenging task. To achieve good performance, algorithms designed for local area networks must be adapted to the differences in link speeds. An important class of algorithms are collective operations, such as broadcast and reduce. We have developed MAGPIE, a library of collective communication operations optimized for wide area systems. MAGPIE's algorithms send the minimal amount of data over the slow wide area links, and only incur a single wide area latency. Using our system, existing MPI applications can be run unmodified on geographically distributed systems. On moderate cluster sizes, using a wide area latency of 10 milliseconds and a bandwidth of 1 MByte/s, MAGPIE executes operations up to 10 times faster than MPICH, a widely used MPI implementation; application kernels improve by up to a factor of 4. Due to the structure of our algorithms, MAGPIE's advantage increases for higher wide area latencies.
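To illustrate why a cluster-aware broadcast sends less data over wide area links, the following Python sketch counts wide-area messages for a flat binomial-tree broadcast and for a two-level scheme in which the root sends once to one coordinator per remote cluster. It is a simplified model, not MAGPIE's actual implementation: the function names, the binomial tree as the flat baseline, and the 16-rank, 4-cluster layout are all assumptions made for illustration.

```python
def binomial_tree_edges(n):
    """Return (sender, receiver) pairs of a binomial-tree broadcast rooted at rank 0."""
    edges = []
    step = 1
    while step < n:
        # In each round, every rank that already holds the data sends to rank + step.
        for sender in range(step):
            receiver = sender + step
            if receiver < n:
                edges.append((sender, receiver))
        step *= 2
    return edges


def wan_messages_flat(cluster_of):
    """Count tree edges whose endpoints sit in different clusters."""
    edges = binomial_tree_edges(len(cluster_of))
    return sum(1 for s, r in edges if cluster_of[s] != cluster_of[r])


def wan_messages_cluster_aware(cluster_of):
    """Two-level scheme (assumed): one wide-area message per remote cluster,
    followed by a purely local broadcast inside each cluster."""
    return len(set(cluster_of)) - 1


# Hypothetical layout: 16 ranks placed in blocks of 4 on 4 clusters.
layout = [rank // 4 for rank in range(16)]
flat = wan_messages_flat(layout)
aware = wan_messages_cluster_aware(layout)
print(f"flat binomial tree: {flat} WAN messages, cluster-aware: {aware}")
```

In this toy layout the flat tree crosses cluster boundaries far more often than the two-level scheme. The actual gap depends on how ranks are mapped to clusters and on the tree shape used by the MPI implementation.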
Exploiting Hierarchy in Parallel Computer Networks to Optimize Collective Operation Performance
, 2000
Cited by 67 (10 self)
The efficient implementation of collective communication operations has received much attention. Initial efforts modeled network communication and produced "optimal" trees based on those models. However, the models used by these initial efforts assumed equal point-to-point latencies between any two processes. This assumption is violated in heterogeneous systems such as clusters of SMPs and wide-area "computational grids", and as a result, collective operations that utilize the trees generated by these models perform suboptimally. In response, more recent work has focused on creating topology-aware trees for collective operations that minimize communication across slower channels (e.g., a wide-area network). While these efforts have significant communication benefits, they all limit their view of the network to only two layers. We present a strategy based upon a multilayer view of the network. By creating multilevel topology trees we take advantage of communication cost differences at every lev...
Bandwidth-efficient Collective Communication for Clustered Wide Area Systems
- In Proc. International Parallel and Distributed Processing Symposium (IPDPS 2000), Cancun
, 1999
Cited by 24 (3 self)
Metacomputing infrastructures couple multiple clusters (or MPPs) via wide-area networks and thus allow parallel programs to run on geographically distributed resources. A major problem in programming such wide-area parallel applications is the difference in communication costs inside and between clusters. Latency and bandwidth of WANs often are orders of magnitude worse than those of local networks. Our MagPIe library eases wide-area parallel programming by providing an efficient implementation of MPI's collective communication operations. MagPIe exploits the hierarchical structure of clustered wide-area systems and minimizes the communication overhead over the WAN links. In this paper, we present improved algorithms for collective communication that achieve shorter completion times by simultaneously using the aggregate bandwidth of the available wide-area links. Our new algorithms split messages into multiple segments that are sent in parallel over different WAN links, thus resulting ...
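The idea of splitting a message into segments that travel in parallel over different WAN links can be sketched as follows. This is only a back-of-the-envelope model, not the MagPIe algorithm: the helper names, the simple latency-plus-size-over-bandwidth cost formula, and the link parameters (10 ms latency, 1 MByte/s per link, 4 links) are assumptions chosen for illustration.

```python
def split_segments(data, links):
    """Split data into `links` contiguous segments whose lengths differ by at most one."""
    base, extra = divmod(len(data), links)
    segments, start = [], 0
    for i in range(links):
        end = start + base + (1 if i < extra else 0)
        segments.append(data[start:end])
        start = end
    return segments


def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Assumed cost model: one latency plus size divided by bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


# Hypothetical scenario: a 1 MByte message and 4 independent WAN links.
message = bytes(1_000_000)
latency, bandwidth, links = 0.010, 1_000_000, 4

single_link = transfer_time(len(message), latency, bandwidth)
# With parallel links, completion time is set by the largest segment.
largest = max(len(seg) for seg in split_segments(message, links))
parallel = transfer_time(largest, latency, bandwidth)
print(f"one link: {single_link:.2f} s, {links} links in parallel: {parallel:.2f} s")
```

The model ignores contention and the cost of forwarding segments inside the receiving clusters, so it only indicates the upper bound on what aggregate bandwidth can buy.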
Optimizing Threaded MPI Execution on SMP Clusters
- In Proc. of 15th ACM International Conference on Supercomputing
, 2001
Cited by 23 (1 self)
Our previous work has shown that using threads to execute MPI programs can yield great performance gain on multiprogrammed shared-memory machines. This paper investigates the design and implementation of a thread-based MPI system on SMP clusters. Our study indicates that with a proper design for threaded MPI execution, both point-to-point and collective communication performance can be improved substantially, compared to a process-based MPI implementation in a cluster environment. Our contribution includes a hierarchy-aware and adaptive communication scheme for threaded MPI execution and a thread-safe network device abstraction that uses event-driven synchronization and provides separated collective and point-to-point communication channels. This paper describes the implementation of our design and illustrates its performance advantage on a Linux SMP cluster.
A Multiprotocol Communication Support for the Global Address Space Programming Model on the IBM SP
- Proc. EuroPar-2000, Springer Verlag LNCS-1900
, 2000
Cited by 11 (6 self)
The paper describes an efficient communication support for the global address space programming model on the IBM SP, a commercial example of the SMP (symmetric multi-processor) clusters. Our approach integrates shared memory with active messages, threads and remote memory copy between nodes. The shared memory operations offer substantial performance improvement over LAPI, IBM's one-sided communication library, within an SMP node. Based on the experiments with the SPLASH-2 LU benchmark and a molecular dynamics simulation, our multiprotocol support for the global address space is found to improve performance and scalability of applications. This approach could also be used in optimizing the MPI-2 one-sided communication on the SMP clusters. 1 Introduction This work is motivated by applications that require support for a shared-memory programming style rather than just message passing. Many of them are characterized by irregular data structures, and dynamic or unpredictable dat...
Load-Balancing Scatter Operations for Grid Computing
- In 12th Heterogeneous Computing Workshop (HCW’2003). IEEE CS
, 2003
Cited by 9 (1 self)
We present solutions to statically load-balance scatter operations in parallel codes run on Grids. Our load-balancing strategy is based on the modification of the data distributions used in scatter operations. We need to modify the user source code, but we want to keep the code as close as possible to the original. Hence, we study the replacement of scatter operations with a parameterized scatter, allowing a custom distribution of data. The paper presents: 1) a general algorithm which finds an optimal distribution of data across processors; 2) a quicker guaranteed heuristic relying on hypotheses on communications and computations; 3) a policy on the ordering of the processors. Experimental results with an MPI scientific code of seismic tomography illustrate the benefits obtained from our load-balancing.
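A custom data distribution for a parameterized scatter can be illustrated with a simple proportional split. The Python sketch below is not the paper's optimal algorithm or its heuristic; it only assigns items in proportion to an assumed relative speed per processor, ignoring communication costs. The function names and the speed values are hypothetical. The resulting counts and displacements have the shape that a variable-count scatter such as MPI_Scatterv expects.

```python
def proportional_counts(total_items, speeds):
    """Assign items to processors in proportion to their relative speeds.

    Uses the largest-remainder method so that the counts sum exactly
    to total_items.
    """
    total_speed = sum(speeds)
    shares = [total_items * s / total_speed for s in speeds]
    counts = [int(share) for share in shares]
    leftover = total_items - sum(counts)
    # Hand the remaining items to the processors with the largest fractional parts.
    by_remainder = sorted(range(len(speeds)),
                          key=lambda i: shares[i] - counts[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


def displacements(counts):
    """Starting offset of each processor's chunk in the send buffer."""
    offsets, position = [], 0
    for count in counts:
        offsets.append(position)
        position += count
    return offsets


# Hypothetical example: the third processor is twice as fast as the others.
counts = proportional_counts(100, [1, 1, 2])
print(counts, displacements(counts))
```

A distribution that also accounts for per-processor communication time, as the abstract describes, would need link parameters in addition to compute speeds.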
A General-Purpose Model for Heterogeneous Computation
, 2000
Cited by 5 (2 self)
Heterogeneous computing environments are becoming an increasingly popular platform for executing parallel applications. Such environments consist of a diverse set of machines and offer considerably more computational power at a lower cost than a parallel computer. Efficient heterogeneous parallel applications must account for the differences inherent in such an environment. For example, faster machines should possess more data items than their slower counterparts and communication should be minimized over slow network links. Current parallel applications are not designed with such heterogeneity in mind. Thus, a new approach is necessary for designing efficient heterogeneous parallel programs.
Exploiting Hierarchy in Heterogeneous Environments
- In IEEE/ACM IPDPS’2001
, 2001
Cited by 3 (0 self)
Heterogeneous cluster environments are becoming an increasingly popular platform for executing parallel applications. Efficient heterogeneous parallel applications must account for the differences inherent in such an environment. Specifically, faster machines should possess more data items than their slower counterparts and communication should be minimized over slow network links. We propose the k-Heterogeneous Bulk Synchronous Parallel (HBSP^k) model, which is based on the BSP model of computation, as a framework for developing applications for heterogeneous systems. The BSP model is appropriate for 1-level (one communication network) heterogeneous systems. HBSP^k extends BSP hierarchically to address k-level heterogeneous machines. The utility of the model is demonstrated through the design and analysis of the gather and one-to-all broadcast operations. Experimental results demonstrate the improved performance that results from effectively exploiting the heterogeneity of the underlying system. By hiding the non-uniformity of the underlying system from the application developer, the HBSP^k model offers a framework that encourages the design of heterogeneous parallel software.
Performance of HPC Middleware over InfiniBand WAN
Cited by 2 (1 self)
High performance interconnects such as InfiniBand (IB) have enabled large scale deployments of High Performance Computing (HPC) systems. High performance communication and IO middleware such as MPI and NFS over RDMA have also been redesigned to leverage the performance of these modern interconnects. With the advent of long haul InfiniBand (IB WAN), IB applications now have inter-cluster reaches. While this technology is intended to enable high performance network connectivity across WAN links, it is important to study and characterize the actual performance that the existing IB middleware achieve in these emerging IB WAN scenarios. In this paper, we study and analyze the performance characteristics of the following three HPC middleware: (i) IPoIB (IP traffic over IB), (ii) MPI and (iii) NFS over RDMA. We

