A high-performance, portable implementation of the MPI message passing interface standard
- Parallel Computing
, 1996
Cited by 651 (37 self)
MPI (Message Passing Interface) is a specification for a standard library for message passing that was defined by the MPI Forum, a broadly based group of parallel computer vendors, library writers, and applications specialists. Multiple implementations of MPI have been developed. In this paper, we describe MPICH, unique among existing implementations in its design goal of combining portability with high performance. We document its portability and performance and describe the architecture by which these features are simultaneously achieved. We also discuss the set of tools that accompany the free distribution of MPICH, which constitute the beginnings of a portable parallel programming environment. A project of this scope inevitably imparts lessons about parallel computing, the specification being followed, the current hardware and software environment for parallel computing, and project management; we describe those we have learned. Finally, we discuss future developments for MPICH, including those necessary to accommodate extensions to the MPI Standard now being contemplated by the MPI Forum.
CoCheck: Checkpointing and Process Migration for MPI
- IN PROCEEDINGS OF THE 10TH INTERNATIONAL PARALLEL PROCESSING SYMPOSIUM (IPPS ’96)
, 1996
Cited by 175 (4 self)
Checkpointing of parallel applications can be used as the core technology to provide process migration. Both checkpointing and migration are important issues for parallel applications on networks of workstations. The CoCheck environment, which we present in this paper, introduces a new approach to providing checkpointing and migration for parallel applications. Unlike existing systems, CoCheck sits on top of the message passing library rather than inside it, and achieves consistency at a level above the message passing system. It uses an existing single-process checkpointer which is available for a wide range of systems. Hence, CoCheck can be easily adapted both to different message passing systems and to new machines.
Managing Checkpoints for Parallel Programs
- In Workshop on Job Scheduling Strategies for Parallel Processing (IPPS '96)
Cited by 51 (1 self)
Checkpointing is a valuable tool for any scheduling system to have. With the ability to checkpoint, schedulers are not locked into a single allocation of resources to jobs, but instead can stop running jobs and re-allocate resources without sacrificing any completed computations. Checkpointing techniques are not new, but they have not been widely available on parallel platforms. We have implemented CoCheck, a system for checkpointing message passing parallel programs. Parallel programs tend to be large in terms of their aggregate memory utilization, so the size of their checkpoint is also large. Because of this, checkpoints must be handled carefully to avoid overloading the system when checkpoints take place. Today's distributed file systems do not handle this situation well. We therefore propose the use of checkpoint servers which are specifically designed to move checkpoints from the checkpointing process, across the interconnection network, and on to stable storage. A scheduling s...
Interfacing Condor and PVM to harness the cycles of workstation clusters
- Journal on Future Generations of Computer Systems
, 1995
Cited by 45 (7 self)
A continuing challenge to the scientific research and engineering communities is how to fully utilize computational hardware. In particular, the proliferation of clusters of high performance workstations has become an increasingly attractive source of compute power. Developments to take advantage of this environment have previously focused primarily on managing the resources, or on providing interfaces so that a number of machines can be used in parallel to solve large problems. Both approaches are desirable, and indeed should be complementary. Unfortunately, the resource management and parallel processing systems are usually developed by independent groups, and they usually do not interact well. To bridge this gap, we have developed a framework for interfacing these two sorts of systems. Using this framework, we have interfaced PVM, a popular system for parallel programming, with Condor, a powerful resource management system. This combined system is operational, and we have ma...
Scalable Networked Information Processing Environment (SNIPE)
- in Proceedings of SuperComputing '97
, 1997
Cited by 32 (10 self)
SNIPE is a metacomputing system that aims to provide a reliable, secure, fault-tolerant environment for long-term distributed computing applications and data stores across the global Internet. This system combines global naming and replication of both processing and data to support large scale information processing applications, leading to better availability and reliability than currently available with typical cluster computing and/or distributed computer environments.
Keywords: SNIPE, RCDS, MetaComputing, scalable, secure, reliable
Acknowledgements: This work was supported in part by the Office of Scientific Computing, U.S. Department of Energy, under Contract DE-AC05-96OR22464, by DARPA under Contract DAAH 04-95-1-0595, and by the National Science Foundation's Center for Research on Parallel Computation, Science and Technology Center Cooperative Agreement No. CCR-8809615.
1. Introduction
The beginning of the 21st century will present new challenges for large-scale applications i...
Process Hijacking
, 1999
Cited by 31 (7 self)
Process checkpointing is a basic mechanism required for providing High Throughput Computing service on distributively owned resources. We present a new process checkpoint and migration technique, called process hijacking, that uses dynamic program re-writing techniques to add checkpointing capability to a running program. Process hijacking makes it possible to checkpoint and migrate proprietary applications that cannot be re-linked with a checkpoint library, and it makes it possible to dynamically hand off an ordinary running process to a distributed resource management system such as Condor. We discuss the problems of adding checkpointing capability to a program already in execution: (1) loading new code into the running process, and (2) replacing functions of the process with calls to dynamically loaded functions. We use the DynInst API process editing library, augmented with a new call for replacing functions, to solve these problems. We discuss problems associated with migrating a ...
P-GRADE: a Grid Programming Environment
Cited by 17 (1 self)
P-GRADE provides a high-level graphical environment to develop parallel applications transparently both for parallel systems and the Grid. P-GRADE supports the interactive execution of parallel programs as well as the creation of a Condor, Condor-G or Globus job to execute parallel programs in the Grid. In P-GRADE, the user can generate either PVM or MPI code according to the underlying Grid where the parallel application should be executed. PVM applications generated by P-GRADE can migrate between different Grid sites and, as a result, P-GRADE guarantees reliable, fault-tolerant parallel program execution in the Grid. The GRM/PROVE performance monitoring and visualisation toolset has been extended towards the Grid and connected to a general Grid monitor (Mercury) developed in the EU GridLab project. Using the Mercury/GRM/PROVE Grid application monitoring infrastructure, any parallel application launched by P-GRADE can be remotely monitored and analysed at run time even if the application migrates among Grid sites. P-GRADE supports workflow definition and co-ordinated multi-job execution for the Grid. Such workflow management can provide parallel execution at both the inter-job and intra-job level. An automatic checkpoint mechanism for parallel programs supports the migration of parallel jobs inside the workflow, providing a fault-tolerant workflow execution mechanism. The paper describes all of these features of P-GRADE and their implementation concepts.
TOOL-SET - An Integrated Tool Environment for PVM
- Ecole Normale Superieure de Lyon
, 1995
Cited by 12 (6 self)
THE TOOL-SET for PVM will comprise a set of integrated tools which can either be used individually or in concert. THE TOOL-SET will be composed of a debugger, a performance analyzer, a visualizer, a deterministic execution controller, a load balancer including a checkpoint generator, and a parallel file system. All tools will be available under the GNU General Public License Agreement. First versions of the tool environment will be released in spring 1996.
1 Introduction
PVM has become a de-facto standard for writing parallel applications based on the message passing paradigm. It is also being used as a platform for developing tools for parallel programming. Although some of them are rather sophisticated, only a few are currently usable by application programmers. One reason for this fact is that many tools are pure research prototypes providing only rudimentary, clumsy interfaces. A second reason is that they cover only a single aspect of parallel program development and don't sup...
Load balancing HPF programs by migrating virtual processors
- IN SECOND INTERNATIONAL WORKSHOP ON HIGH-LEVEL PROGRAMMING MODELS AND SUPPORTIVE ENVIRONMENTS, HIPS '97
, 1997
Cited by 9 (2 self)
This paper explores the integration of load balancing features in the data parallel language HPF, targeting semi-regular applications. We show that the HPF virtual processors are good candidates to be the unit of migration. Then, we compare three possible implementations and show that threads provide a good trade-off between efficiency and ease of implementation. We finally describe a preliminary implementation. The experimental results, obtained with Gaussian elimination with partial pivoting, are promising.
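For readers unfamiliar with the benchmark named in this abstract, the following is a minimal serial sketch of Gaussian elimination with partial pivoting. It is not the paper's HPF code and involves no load balancing or migration; the function name `solve_partial_pivot` and the list-of-lists matrix layout are assumptions made for illustration. It may help show why the kernel is semi-regular: the active part of the matrix shrinks with every elimination step, so a static distribution of rows tends to become unbalanced.

```python
def solve_partial_pivot(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Illustrative serial sketch (hypothetical helper, not from the paper).
    `a` is a square matrix as a list of rows, `b` the right-hand side.
    """
    n = len(a)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]

    for col in range(n):
        # Partial pivoting: choose the row with the largest magnitude
        # entry in the current column, at or below the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]

        # Eliminate the column below the pivot. Only rows col+1..n-1 are
        # touched, so the amount of work shrinks at every step.
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]

    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x
```

Calling `solve_partial_pivot([[0, 2], [1, 1]], [4, 3])` exercises the pivoting step, since the first diagonal entry is zero and the rows must be swapped before elimination can proceed.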
Distributed Resource Management for Parallel Applications in Networks of Workstations
- In HPCN Europe
, 1997
Cited by 7 (5 self)
Running parallel applications in a network of workstations (NOW) requires the use of a resource management system with batch queueing and load balancing functionalities to utilize idle workstations in the NOW and to avoid load imbalance in the network. A resource management system for parallel jobs requires special functionalities to schedule jobs to hosts and to support checkpointing and migration of parallel applications. This paper describes the essential components of a distributed resource management system supporting parallel computations in a NOW and how to reuse existing resource management components for this approach. The implementation of a distributed resource manager demonstrates the practical relevance of the design concept.
1 Introduction
Networks of workstations (NOWs) nowadays offer computational power to run even resource-intensive parallel Scientific Computing applications, e.g. in computational fluid dynamics. A resource management system including batch queue...

