ECO: Efficient Collective Operations for Communication on Heterogeneous Networks
- In International Parallel Processing Symposium
, 1995
Cited by 50 (4 self)
PVM and other distributed computing systems have enabled the use of networks of workstations for parallel computation, but their approach of treating a network as a collection of point-to-point connections does not promote efficient communication, particularly collective communication. ECO is a package that solves this problem with programs that analyze the network and establish efficient communication patterns, which are then used by a library of collective operations. The analysis is done off-line, so that after paying the one-time cost of analyzing the network, the execution of application programs is not delayed. This paper gives performance results from using ECO to implement the collective communication in CHARMM, a widely used macromolecular dynamics package. ECO facilitates the development of data parallel applications by providing a simple interface to routines which use the available heterogeneous networks efficiently. This approach gives a naive programmer the abili...
A New Parallel Method for Molecular Dynamics Simulation of Macromolecular Systems
, 1994
Cited by 29 (3 self)
Short-range molecular dynamics simulations of molecular systems are commonly parallelized by replicated-data methods, where each processor stores a copy of all atom positions. This enables computation of bonded 2-, 3-, and 4-body forces within the molecular topology to be partitioned among processors straightforwardly. A drawback to such methods is that the inter-processor communication scales as N, the number of atoms, independent of P, the number of processors. Thus, their parallel efficiency falls off rapidly when large numbers of processors are used. In this article a new parallel method for simulating macromolecular or small-molecule systems is presented, called force-decomposition. Its memory and communication costs scale as N/√P, allowing larger problems to be run faster on greater numbers of processors. Like replicated-data techniques, and in contrast to spatial-decomposition approaches, the new method can be simply load-balanced and performs well eve...
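The scaling contrast stated in this abstract can be illustrated with a small sketch. It is not taken from the paper: the function names are hypothetical and all constant factors are assumed to be 1, so only the growth rates (N versus N/√P) are meaningful.

```python
import math


def replicated_data_volume(n_atoms: int, n_procs: int) -> float:
    """Per-processor communication under replicated data.

    Scales as N and does not shrink as processors are added
    (constant factor assumed to be 1 for illustration).
    """
    return float(n_atoms)


def force_decomposition_volume(n_atoms: int, n_procs: int) -> float:
    """Per-processor communication under force decomposition.

    Scales as N / sqrt(P) (constant factor assumed to be 1).
    """
    return n_atoms / math.sqrt(n_procs)


if __name__ == "__main__":
    n_atoms = 100_000  # hypothetical system size
    for n_procs in (16, 64, 256, 1024):
        rd = replicated_data_volume(n_atoms, n_procs)
        fd = force_decomposition_volume(n_atoms, n_procs)
        print(f"P={n_procs:5d}  replicated={rd:10.0f}  force-decomp={fd:10.0f}")
```

Under these assumptions the gap between the two methods grows as √P, which is why the replicated-data approach loses efficiency at large processor counts.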
Comparing the Communication Performance and Scalability of a Linux and a NT Cluster of PCs, a Cray Origin 2000, an IBM SP and a Cray T3E-600
- In Proceedings of IEEE Computer Society International Workshop on Cluster Computing
Cited by 13 (5 self)
This paper presents scalability and communication performance results for a cluster of PCs running Linux with the GM communication library, a cluster of PCs running Windows NT with the HPVM communication library, a Cray T3E-600, an IBM SP and a Cray Origin 2000. Both PC clusters were using a Myrinet network. Six communication tests using MPI routines were run for a variety of message sizes and numbers of processors. The tests were chosen to span commonly used communication patterns, from low contention (a ping-pong between processors, a right shift, a binary tree broadcast and a synchronization barrier) to high contention (a naive broadcast and an all-to-all). For most of the tests the T3E provides the best performance and scalability. For an 8 byte message the NT cluster performs about the same as the T3E for most of the tests. For all the tests but one, the T3E, the Origin and the SP outperform the two clusters for the largest message size (10 Kbytes or 1 Mbyte). Keywords: Parallel Computers; Communication Performance; Scalability; Cray Origin 2000; Cray T3E-600; IBM SP; Cluster of PCs Running Windows NT; Cluster of PCs Running Linux; MPI Library.
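To make the difference between the "naive broadcast" and "binary tree broadcast" patterns concrete, here is a minimal sketch that counts sequential send steps for each. The cost model is an assumption, not the paper's: each process sends one message per step, the naive root sends to every other rank in turn, and in the tree rank r forwards to ranks 2r+1 and 2r+2. The function names are hypothetical.

```python
def naive_broadcast_steps(n_procs: int) -> int:
    """Root sends to each of the other ranks one after another."""
    return max(n_procs - 1, 0)


def binary_tree_broadcast_steps(n_procs: int) -> int:
    """Rank r forwards to children 2r+1 and 2r+2, one send per step.

    Returns the step at which the last rank receives the message.
    """
    if n_procs <= 0:
        return 0
    arrival = [0] * n_procs  # step at which each rank holds the message
    for rank in range(n_procs):
        # Children always have higher ranks, so arrival[rank] is final here.
        for offset, child in enumerate((2 * rank + 1, 2 * rank + 2), start=1):
            if child < n_procs:
                arrival[child] = arrival[rank] + offset
    return max(arrival)


if __name__ == "__main__":
    for p in (2, 8, 64, 128):
        print(p, naive_broadcast_steps(p), binary_tree_broadcast_steps(p))
```

In this model the naive pattern grows linearly with the number of processors while the tree grows logarithmically; the model ignores message size and network contention, which the paper's measurements do capture.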
Comparing the Scalability of the Cray T3E-600 and the Cray Origin 2000 Using SHMEM Routines
Cited by 7 (5 self)
This paper presents relative scalability results for the Cray T3E-600 and the Cray Origin 2000 on five communication tests for a variety of message sizes and for 4, 8, 12, ..., 128 processors. The five communication tests were chosen to span commonly used communication patterns, from low contention (accessing distant messages, a right shift and a binary tree broadcast) to high contention (a "basic" broadcast and an all-to-all). Both machines scaled roughly the same on all tests except the right shift with short messages, where the T3E scaled significantly better than the Origin. However, it should be noted that the T3E outperformed the Origin 2000 for most of these tests. Keywords: Cray T3E-600; Cray Origin 2000; Relative scalability; Performance evaluation using SHMEM library. 1 Introduction The ability of a parallel computer to efficiently use large numbers of processors is an important performance criterion. Such a computer is often called "scalable". ...
Self-consistent MPI performance requirements
- In Recent Advances in Parallel Virtual Machine and Message Passing Interface. 14th European PVM/MPI Users’ Group Meeting
Cited by 5 (3 self)
The MPI Standard does not make any performance guarantees, but users expect (and like) MPI implementations to deliver good performance. A common-sense expectation of performance is that an MPI function should perform no worse than a combination of other MPI functions that can implement the same functionality. In this paper, we formulate some performance requirements and conditions that good MPI implementations can be expected to fulfill by relating aspects of the MPI standard to each other. Such a performance formulation could be used by benchmarks and tools, such as SKaMPI and Perfbase, to automatically verify whether a given MPI implementation fulfills basic performance requirements. We present examples where some of these requirements are not satisfied, demonstrating that there remains room for improvement in MPI implementations.
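The kind of automatic check this abstract describes could look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's tooling: the function `find_violations`, the rule list, the 5% tolerance, and all timing values are hypothetical. The two rules rely only on the fact that MPI_Allreduce can be implemented as MPI_Reduce followed by MPI_Bcast, and MPI_Allgather as MPI_Gather followed by MPI_Bcast.

```python
from typing import Dict, List, Sequence, Tuple

# A rule says: the first function should be no slower than the listed
# functions run back to back, since together they provide the same result.
Rule = Tuple[str, Sequence[str]]


def find_violations(
    timings: Dict[str, float],
    rules: Sequence[Rule],
    tolerance: float = 0.05,
) -> List[str]:
    """Return the functions that are slower than their composed equivalent.

    `timings` maps a function name to a measured time in seconds.
    `tolerance` is an assumed allowance for measurement noise.
    """
    violated = []
    for function, components in rules:
        composed = sum(timings[name] for name in components)
        if timings[function] > composed * (1.0 + tolerance):
            violated.append(function)
    return violated


if __name__ == "__main__":
    # Hypothetical measurements, not real benchmark data.
    timings = {
        "MPI_Allreduce": 120e-6,
        "MPI_Reduce": 60e-6,
        "MPI_Bcast": 40e-6,
        "MPI_Allgather": 50e-6,
        "MPI_Gather": 30e-6,
    }
    rules = [
        ("MPI_Allreduce", ("MPI_Reduce", "MPI_Bcast")),
        ("MPI_Allgather", ("MPI_Gather", "MPI_Bcast")),
    ]
    print(find_violations(timings, rules))
```

A real check would take its timings from a benchmark run at matching message sizes and process counts; the sketch only shows the comparison step.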
The Performance of the MPI Collective Communication Routines for Large Messages on the Cray T3E-600, the Cray Origin 2000, and the IBM SP.
- The Journal of
Cited by 4 (2 self)
We have implemented eight of the MPI collective routines using MPI point-to-point communication routines with algorithms designed to be efficient for large messages. The performance of our implementations of these collective routines is compared with the vendor implementations on the Cray T3E-600, the Cray Origin 2000 and on the IBM SP. Many of our implementations significantly outperformed vendor implementations on the T3E and the Origin 2000. On the SP, only our implementation of the broadcast significantly outperformed IBM's implementation. Keywords: MPI; Collective Communication Routines for Large Messages; Cray T3E; Origin 2000; IBM SP. 1 Introduction Today, MPI [14] is probably the most used message passing library for programming distributed memory parallel computers. Implementations of MPI are available for all commercially available parallel platforms. The MPI collective communication routines provide important functionality for scientific computing and the algorithms chosen...
NIC-based Reduction Algorithms for Large-scale Clusters
- International Journal of High Performance Computing and Networking (IJHPCN)
, 2005
Cited by 2 (1 self)
Efficient algorithms for reduction operations across a group of processes are crucial for good performance in many large-scale, parallel scientific applications. While previous algorithms limit processing to the host CPU, we utilize the programmable processors and local memory available on modern cluster network interface cards (NICs) to explore a new dimension in the design of reduction algorithms. In this paper, we present the benefits and challenges, design issues and solutions, analytical models, and experimental evaluations of a family of NIC-based reduction algorithms. Performance and scalability evaluations were conducted on the ASCI Linux Cluster (ALC), a 960-node, 1920-processor machine at Lawrence Livermore National Laboratory, which uses the Quadrics QsNet interconnect. We find NIC-based reductions on modern interconnects to be more efficient than host-based implementations in both scalability and consistency. In particular, at large scale (1812 processes), NIC-based reductions of small integer and floating-point arrays provided respective speedups of 121% and 39% over the host-based, production-level MPI implementation. In addition, the standard deviations in timings for the NIC-based reductions were as much as two orders of magnitude smaller than for the host-based reductions.
A Tunable Collective Communication Framework on a Cluster of SMPs
- In IASTED International Conference on Parallel and Distributed Computing and Networks, 2004. Proceedings of the Seventh International Conference on High Performance Computing and Grid in Asia Pacific Region (HPCAsia’04)
Cited by 1 (1 self)
In this paper we investigate a tunable MPI collective communications library on a cluster of SMPs. Most tunable collective communications libraries select optimal algorithms for inter-node communication on a given platform. We add another layer of intra-node communications composed of several tunable shared memory operations. We explore the advantages of our approach, and discuss when to use it and when to switch to another approach on the shared memory layer. Experimental results indicate that collective communications designed with this approach and properly tuned can outperform vendor implementations.
High-Performance MPI Broadcast Algorithm for Grid Environments Utilizing Multi-lane NICs
Cited by 1 (0 self)
The performance of MPI collective operations, such as broadcast and reduction, is heavily affected by network topologies, especially in grid environments. Many techniques to construct efficient broadcast trees have been proposed for grids. On the other hand, recent high performance computing nodes are often equipped with multi-lane network interface cards (NICs), which most previous collective communication methods fail to harness effectively. Our new broadcast algorithm for grid environments harnesses almost all downward and upward bandwidths of multi-lane NICs; a message to be broadcast is split into two pieces, which are broadcast along two independent binary trees in a pipelined fashion, and swapped between both trees. The salient feature of our algorithm is generality; it works effectively on both large clusters and grid environments. It can also be applied to nodes with a single NIC, by making multiple sockets share the NIC. Experiments on an emulated network environment show that we achieve higher performance than traditional methods, regardless of network topologies or message sizes.
Discovery and Application of Network Information
, 2000
Cited by 1 (0 self)
USAF, under agreement number F30602-96-1-0287. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Advanced Research

