Learning from the Success of MPI
, 2001
Cited by 10 (1 self)
The Message Passing Interface (MPI) has been extremely successful as a portable way to program high-performance parallel computers.
Runtime Compression of MPI Messages to Improve the Performance and Scalability of Parallel Applications
- High-Performance Computing, Networking and Storage Conference
, 2004
Cited by 8 (5 self)
Communication-intensive parallel applications spend a significant amount of their total execution time exchanging data between processes, which leads to poor performance in many cases. In this paper, we investigate message compression in the context of large-scale parallel message-passing systems to reduce the communication time of individual messages and to improve the bandwidth of the overall system. We implement and evaluate the cMPI message-passing library, which quickly compresses messages on-the-fly with a low enough overhead that a net execution time reduction can be obtained. Our results on six large-scale benchmark applications show that execution speed improves by up to 98% when message compression is enabled.
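The idea of compressing a message only when it pays off can be illustrated with a short sketch. This is not the cMPI implementation: the abstract does not name its compression algorithm, so zlib at its fastest level stands in for it, and the one-byte header flags and the function names `pack_message` and `unpack_message` are hypothetical.

```python
import zlib

# Hypothetical 1-byte header flags; the real cMPI wire format is not described.
RAW, COMPRESSED = b"\x00", b"\x01"

def pack_message(payload: bytes, level: int = 1) -> bytes:
    """Compress with a fast zlib level; send the raw bytes if there is no gain."""
    candidate = zlib.compress(payload, level)
    if len(candidate) < len(payload):
        return COMPRESSED + candidate
    return RAW + payload

def unpack_message(wire: bytes) -> bytes:
    """Undo pack_message on the receiving side."""
    flag, body = wire[:1], wire[1:]
    return zlib.decompress(body) if flag == COMPRESSED else body

if __name__ == "__main__":
    halo = bytes(1000) + b"boundary" * 50  # highly regular payload
    wire = pack_message(halo)
    assert unpack_message(wire) == halo
    print(len(halo), "->", len(wire), "bytes on the wire")
```

The raw fallback keeps small or incompressible messages from growing, which is one simple way to keep the overhead low enough for a net gain.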
Issues in Developing a Thread-Safe MPI Implementation
- In Recent Advances in Parallel Virtual Machine and Message Passing Interface, 13th European PVM/MPI Users’ Group Meeting
, 2006
Cited by 6 (2 self)
The MPI-2 Standard has carefully specified the interaction between MPI and user-created threads, with the goal of enabling users to write multithreaded programs while also enabling MPI implementations to deliver high performance. In this paper, we describe and analyze what the MPI Standard says about thread safety and what it implies for an implementation. We classify the MPI functions based on their thread-safety requirements and discuss several issues to consider when implementing thread safety in MPI. We use the example of generating new context ids (required for creating new communicators) to demonstrate how a simple solution for the single-threaded case cannot be used when there are multiple threads and how a naïve thread-safe algorithm can be expensive. We then present an algorithm for generating context ids that works efficiently in both single-threaded and multithreaded cases.
Thread safety in an MPI implementation: Requirements and analysis
- Parallel Computing
, 2007
Cited by 6 (2 self)
The MPI-2 Standard has carefully specified the interaction between MPI and user-created threads. The goal of this specification is to allow users to write multithreaded MPI programs while also allowing MPI implementations to deliver high performance. However, a simple reading of the thread-safety specification does not reveal what its implications are for an implementation and what implementers must be aware (and careful) of. In this paper, we describe and analyze what the MPI Standard says about thread safety and what it implies for an implementation. We classify the MPI functions based on their thread-safety requirements and discuss several issues to consider when implementing thread safety in MPI. We use the example of generating new context ids (required for creating new communicators) to demonstrate how a simple solution for the single-threaded case does not naturally extend to the multithreaded case and how a naïve thread-safe algorithm can be expensive. We then present an algorithm for generating context ids that works efficiently in both single-threaded and multithreaded cases. Key words: Message Passing Interface (MPI), thread safety, MPI implementation, multithreaded programming
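The abstract does not spell out the simple single-threaded solution for context ids. The sketch below assumes one common scheme: every process keeps a bitmask of free ids, the masks are combined with a bitwise AND across processes, and the lowest common free bit becomes the new id. The allreduce is simulated with a plain list of masks, and all names (`allreduce_and`, `allocate_context_id`) are hypothetical rather than taken from any MPI implementation.

```python
from functools import reduce

def allreduce_and(masks):
    """Stand-in for a bitwise-AND allreduce across the participating processes."""
    return reduce(lambda a, b: a & b, masks)

def lowest_set_bit(mask):
    """Index of the lowest set bit, or -1 if no bit is set."""
    return (mask & -mask).bit_length() - 1

def allocate_context_id(masks):
    """Pick the lowest id that is free on every process and mark it used."""
    common = allreduce_and(masks)
    ctx = lowest_set_bit(common)
    if ctx < 0:
        raise RuntimeError("no context id is free on all processes")
    return ctx, [m & ~(1 << ctx) for m in masks]

if __name__ == "__main__":
    # Bit i set means context id i is still free on that process.
    masks = [0b1110, 0b1011, 0b1111]
    ctx, masks = allocate_context_id(masks)
    print("allocated context id", ctx)
```

Under this assumption, the multithreaded difficulty is easy to picture: two threads of one process could read the same local mask while their reductions are in flight, so the read and the update would need to be coordinated.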
Efficient Communication Using Message Prediction for Cluster of Multiprocessors
- Proceedings of the CANPC’00, Fourth Workshop on Communication, Architecture, and Applications for Network-based Parallel Computing, held in conjunction with HPCA-6
, 1999
Cited by 5 (1 self)
With the increasing uniprocessor and SMP computation power available today, interprocessor communication has become an important factor that limits the performance of clusters of workstations. Many factors, including communication hardware overhead, communication software overhead, and the user environment overhead (multithreading, multiuser), affect the performance of the communication subsystems in such systems. A significant portion of the software communication overhead is attributable to message copying. Ideally, it is desirable to have a true zero-copy protocol where the message is moved directly from the send buffer in its user space to the receive buffer in the destination without any intermediate buffering. However, because message-passing applications at the send side do not know the final receive buffer addresses, early-arriving messages have to be buffered in a temporary area. In this paper, we show that there is a message reception communication locality in...
Toward Efficient Support for Multithreaded MPI Communication
, 2008
Cited by 5 (2 self)
To make the most effective use of parallel machines that are being built out of increasingly large multicore chips, researchers are exploring the use of programming models comprising a mixture of MPI and threads. Such hybrid models require efficient support from an MPI implementation for MPI messages sent from multiple threads simultaneously. In this paper, we explore the issues involved in designing such an implementation. We present four approaches to building a fully thread-safe MPI implementation, with decreasing levels of critical-section granularity (from coarse-grain locks to fine-grain locks to lock-free operations) and correspondingly increasing levels of complexity. We describe how we have structured our implementation to support all four approaches and enable one to be selected at build time. We present performance results with a message-rate benchmark to demonstrate the performance implications of the different approaches.
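The difference between coarse-grain and fine-grain critical sections can be shown with a toy example. This is only an illustration of the locking idea, not the structure of the implementation the abstract describes: the `MessageQueues` class, its per-destination counters, and the `granularity` argument are hypothetical, and the choice is made at construction time here rather than at build time. Lock-free operation is not shown.

```python
import threading

class MessageQueues:
    """Per-destination send counters guarded by coarse or fine locks."""

    def __init__(self, num_dests, granularity="coarse"):
        self.counts = [0] * num_dests
        if granularity == "coarse":
            shared = threading.Lock()          # one lock for every queue
            self.locks = [shared] * num_dests
        elif granularity == "fine":
            self.locks = [threading.Lock() for _ in range(num_dests)]
        else:
            raise ValueError(granularity)

    def post_send(self, dest):
        # The critical section covers only the queue being updated.
        with self.locks[dest]:
            self.counts[dest] += 1

def run(granularity, num_threads=4, sends=1000):
    """Have each thread post `sends` messages to its own destination."""
    queues = MessageQueues(num_threads, granularity)

    def worker(dest):
        for _ in range(sends):
            queues.post_send(dest)

    threads = [threading.Thread(target=worker, args=(d,)) for d in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return queues.counts

if __name__ == "__main__":
    print(run("coarse"), run("fine"))
```

With the coarse setting every thread contends for the same lock even though the threads touch different queues; with the fine setting they never block each other, at the cost of managing more locks.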
Time Warp Simulator Designs for Clusters of SMPs
, 1999
Cited by 1 (1 self)
Traditionally, parallel discrete event simulators based on the Time Warp synchronization protocol have been implemented using either the shared memory programming model or the distributed memory, message passing programming model. This was because the preferred hardware platform was either a shared memory multiprocessor workstation or a network of uniprocessor workstations. However, the advent of clumps (clusters of multiprocessors), has mandated a change in this dichotomous view. Programming for clumps can be quite novel as the platform allows the implementor to apply both shared memory and distributed memory programming techniques within the same framework. This thesis explores the design and implementation issues involved in exploiting this new platform for Time Warp simulations. Specifically, this thesis presents a few strategies for implementing Time Warp simulators on clumps. In addition, experiences in implementing these strategies on an extant distributed memory, message passing Time Warp simulator (warped) are presented. Performance results comparing the modified clump-specific simulation kernel to the unmodified distributed memory, message passing simulation kernel are also presented.
To my parents.
Acknowledgements: I wish to thank my advisor Dr. Philip A. Wilsey for providing valuable guidance during the course of this work. I thank my thesis committee members Dr. Harold Carter and Dr. Santosh Pande for their suggestions on this thesis work. Mal, Ramanan, and Umesh provided valuable suggestions during the course of this work. I thank them for taking the time and interest in this work. I would also like to thank Mal and Ramanan for reading the initial drafts of my thesis and providing insightful comments. Working in the Computer Architecture Design Labor...
Reducing Communication Time through Message Prefetching
- Intl. Conf. on Parallel and Distributed Processing Techniques and Applications
Cited by 1 (1 self)
The latency of large messages often leads to poor performance of parallel applications. In this paper, we investigate a novel latency reduction technique where message receivers prefetch messages from senders before the matching sends are called. When the send is finally called, only the parts of the message that have changed since the prefetch need to be transmitted, resulting in a smaller message. Our message prefetching technique initiates communication while the sender is still in the computation phase and thus overlaps computation with communication to hide part of the message latency. We implement and evaluate our technique in the context of an MPI runtime library. The results show that the execution speed of five MPI applications improves by up to 24% when message prefetching is enabled.
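Sending only what changed since the prefetch can be sketched as a block-wise diff. The abstract does not say how changes are detected, so comparing fixed-size blocks is an assumption made for illustration, as are the block size and the function names `diff_blocks` and `apply_blocks`.

```python
BLOCK = 4  # hypothetical block size in bytes

def diff_blocks(prefetched: bytes, current: bytes, block: int = BLOCK):
    """Return (offset, data) pairs for the blocks that changed since the prefetch."""
    if len(prefetched) != len(current):
        raise ValueError("this sketch assumes the buffer length is unchanged")
    return [
        (off, current[off:off + block])
        for off in range(0, len(current), block)
        if current[off:off + block] != prefetched[off:off + block]
    ]

def apply_blocks(prefetched: bytes, delta):
    """Patch the receiver's prefetched copy with the changed blocks."""
    buf = bytearray(prefetched)
    for off, data in delta:
        buf[off:off + len(data)] = data
    return bytes(buf)

if __name__ == "__main__":
    early = b"aaaabbbbccccdd"   # what the receiver prefetched
    final = b"aaaaBBBBccccdD"   # the send buffer when the send is called
    delta = diff_blocks(early, final)
    assert apply_blocks(early, delta) == final
    print(sum(len(d) for _, d in delta), "of", len(final), "bytes transmitted")
```

The saving depends on how much of the buffer the sender rewrites between the prefetch and the send; if every block changes, the delta is as large as the full message.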
Implementation and Evaluation of MPI on an SMP Cluster
, 1999
Cited by 1 (0 self)
An MPI library, called MPICH-PM/CLUMP, has been implemented on a cluster of SMPs. MPICH-PM/CLUMP realizes zero copy message passing between nodes while using one copy message passing within a node to achieve high performance communication. To realize one copy message passing on an SMP, a kernel primitive has been designed which enables a process to read the data of another process. The get protocol using this primitive was added to MPICH. MPICH-PM/CLUMP has been run on an SMP cluster consisting of 64 Pentium II dual processors and Myrinet. It achieves 98 MBytes/sec between nodes and 100 MBytes/sec within a node.
1 Introduction
As SMP machines have come into wide use, they are becoming less expensive. Notably, a dual Pentium II machine is now almost the same price as a single Pentium II machine plus the cost of the additional CPU and main memory. This low cost drives construction of cluster systems using dual Pentium II machines. Another advantage is that such a cluster requ...
Tolerating Message Latency through the Early Release of Blocked Receives
Cited by 1 (0 self)
Large message latencies often lead to poor performance of parallel applications. In this paper, we investigate a latency-tolerating technique that immediately releases all blocking receives, even when the message has not yet (completely) arrived, and enforces execution correctness through page protection. This approach eliminates false message data dependencies on incoming messages and allows the computation to proceed as early as possible. We implement and evaluate our early-release technique in the context of an MPI runtime library. The results show that the execution speed of MPI applications improves by up to 60% when early release is enabled. Our approach also enables faster and easier parallel programming as it frees programmers from adopting more complex nonblocking receives and from tuning message sizes to explicitly reduce false message data dependencies.

