CUMULVS: Providing Fault-Tolerance, Visualization and Steering of Parallel Applications
- International Journal of High Performance Computing Applications
, 1996
Cited by 103 (5 self)
The use of visualization and computational steering can often assist scientists in analyzing large-scale scientific applications. Fault-tolerance to failures is of great importance when running on a distributed system. However, the details of implementing these features are complex and tedious, leaving many scientists with inadequate development tools. CUMULVS is a library that enables programmers to easily incorporate interactive visualization and computational steering into existing parallel programs. The library is divided into two pieces: one for the application program and one for the, possibly commercial, visualization and steering front-end. Together these two libraries encompass all the connection and data protocols needed to dynamically attach multiple independent viewer front-ends to a running parallel application. Viewer programs can also steer one or more user-defined parameters to "close the loop" for computational experiments and analyses. CUMULVS allows the pr...
The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
- In Proceedings, LACSI Symposium, Santa Fe
, 2003
Cited by 67 (7 self)
As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance. Experimental results show negligible communication performance impact due to the incorporation of the checkpoint support capabilities into LAM/MPI.
The Interoperable Message Passing Interface (IMPI) Extensions to LAM/MPI
- MPI Developer's Conference, Ithaca
, 2000
Cited by 21 (1 self)
Interoperable MPI (IMPI) is a protocol specification to allow multiple MPI implementations to cooperate on a single MPI job. Unlike portable MPI implementations, an IMPI-connected parallel job allows the use of vendor-tuned message passing libraries on given target architectures, thus potentially allowing higher levels of performance than previously possible. Additionally, the IMPI protocol uses a low number of connections, which may be suitable for parallel computations across WAN distances. The IMPI specification defines a low-level wireline protocol that MPI implementations use to communicate with each other; each point-to-point and collective function in MPI-1 automatically uses this low-level protocol when communicating with a remote MPI implementation. When running IMPI jobs, the only change visible to the user is the sequence of steps necessary to run the job; any correct MPI program will run correctly under IMPI. In this paper, we provide an overview of IMPI, describe its incor...
Open MPI’s TEG point-to-point communications methodology: Comparison to existing implementations
- In Proceedings, 11th European PVM/MPI Users’ Group Meeting
, 2004
Cited by 16 (8 self)
TEG is a new methodology for point-to-point messaging developed as a part of the Open MPI project. Initial performance measurements are presented, showing comparable ping-pong latencies in a single NIC configuration, but with bandwidths up to 30% higher than that achieved by other leading MPI implementations. Homogeneous dual-NIC configurations further improved performance, but the heterogeneous case requires continued investigation.
The design and implementation of checkpoint/restart process fault tolerance for Open MPI
- In Workshop on Dependable Parallel, Distributed and Network-Centric Systems (DPDNS), in conjunction with IPDPS
, 2007
Cited by 14 (2 self)
To be able to fully exploit ever larger computing platforms, modern HPC applications and system software must be able to tolerate inevitable faults. Historically, MPI implementations that incorporated fault tolerance capabilities have been limited by lack of modularity, scalability and usability. This paper presents the design and implementation of an infrastructure to support checkpoint/restart fault tolerance in the Open MPI project. We identify the general capabilities required for distributed checkpoint/restart and realize these capabilities as extensible frameworks within Open MPI’s modular component architecture. Our design features an abstract interface for providing and accessing fault tolerance services without sacrificing performance, robustness, or flexibility. Although our implementation includes support for some initial checkpoint/restart mechanisms, the framework is meant to be extensible and to encourage experimentation with alternative techniques within a production-quality MPI implementation.
TEG: A high-performance, scalable, multi-network point-to-point communications methodology
- In Proceedings, 11th European PVM/MPI Users’ Group Meeting
, 2004
Cited by 13 (3 self)
TEG is a new component-based methodology for point-to-point messaging. Developed as part of the Open MPI project, TEG provides a configurable fault-tolerant capability for high-performance messaging that utilizes multi-network interfaces where available. Initial performance comparisons with other MPI implementations show comparable ping-pong latencies, but with bandwidths up to 30% higher.
Analysis of the component architecture overhead in Open MPI
- In Proceedings, 12th European PVM/MPI Users’ Group Meeting
, 2005
Cited by 9 (3 self)
Component architectures provide a useful framework for developing an extensible and maintainable code base upon which large-scale software projects can be built. Component methodologies have only recently been incorporated into applications by the High Performance Computing community, in part because of the perception that component architectures necessarily incur an unacceptable performance penalty. The Open MPI project is creating a new implementation of the Message Passing Interface standard, based on a custom component architecture – the Modular Component Architecture (MCA) – to enable straightforward customization of a high-performance MPI implementation. This paper reports on a detailed analysis of the performance overhead in Open MPI introduced by the MCA. We compare the MCA-based implementation of Open MPI with a modified version that bypasses the component infrastructure. The overhead of the MCA is shown to be low, on the order of 1%, for both latency and bandwidth microbenchmarks as well as for the NAS Parallel Benchmark suite.
A Thread Taxonomy for MPI
- In Second MPI Developer's Conference, Los Alamos
, 1996
Cited by 9 (2 self)
In 1994, we presented extensions to MPI and offered an early paper on potential thread extensions to MPI, as well as non-blocking collective extensions to MPI [14]. The present paper is a thorough review of thread issues in MPI, including alternative models, their computational uses, and the impact on implementations. A number of issues are addressed: barriers to thread safety in MPI implementations with MPICH as an example and changes of the semantics of non-thread-safe MPI calls, different thread models, their uses, and possible integration. Minimal portable thread management and synchronization mechanisms API extensions for MPI are considered. A tentative design for multi-threaded thread-safe ADI and Channel Device for MPICH is proposed. We consider threads as both an implementation device for MPI and as a user-level mechanism to achieve fine-grain concurrency. The reduction of the process to a simple resource container (as considered by Mach), with the thread as the main named comp...
Distributed Computations Driven by Resource Consumption
- In IEEE International Conference on Computer Languages (ICCL'98)
, 1998
Cited by 6 (4 self)
Millions of computers are now connected together by the Internet. At a fast pace, applications are taking advantage of these new capabilities, and are becoming parallel and distributed, e.g. applets on the WWW or agent technology. As we live in a world with finite resources, an important challenge is to be able to control computations in such an environment. For instance, a user might like to suspend a computation because another one seems to be more promising. In this paper, we present a paradigm that allows the programmer to monitor and control computations, whether parallel or distributed, by mastering their resource consumption. We describe an implementation on top of the thread library PPCR and the message-passing library Nexus.
1 Introduction
As we live in a world with finite resources, it is of paramount importance for the user to be able to monitor and control computations. This task is all the more complex since computations may be parallel, distributed, and most probably ma...
A Uniform Approach to Programming the World Wide Web
- Computer Systems Science and Engineering
, 1998
Cited by 5 (4 self)
We propose a uniform model for programming distributed web applications. The model is based on the concept of web computation places and provides mechanisms to coordinate distributed computations at these places, including peer-to-peer communication between places and a uniform mechanism to initiate computation in remote places. Computations can interact with the flow of HTTP requests and responses, typically as clients, proxies or servers in the web architecture. We have implemented the model using the global pointers and remote service requests provided by the Nexus communication library. We present the model and its rationale, with some illustrative examples, and we describe the implementation.
1 Introduction
Many web applications require a significant amount of computation which may be distributed and requires coordination; these applications use the web infrastructure to good advantage but are often constrained by the architecture, which is fundamentally client-server. A variety...

