Design, Implementation, and Performance of Checkpointing in NetSolve
- In International Conference on Dependable Systems and Networks (FTCS-30 & DCCA-8)
, 2000
Cited by 14 (9 self)
While a variety of checkpointing techniques and systems have been documented for long-running programs, they are typically not available to programmers who are not systems experts. This paper details a project that integrates three technologies, NetSolve, Starfish, and IBP, for the seamless integration of fault-tolerance into long-running applications. We discuss the design and implementation of this project, and present performance results from executing on both local and wide-area networks.
1 Introduction
Checkpointing and rollback recovery is a well-studied research area for enabling long-running applications to be fault-tolerant. Many basic checkpointing algorithms [6, 11] and optimization techniques [12] have been developed for uniprocessor and parallel computing systems, and several checkpointing libraries and systems have been implemented [1, 5, 8, 10, 14, 17, 18, 20, 22]. However, for the typical scientific user, actually using a checkpointing system is a difficult task. All sys...
OMIS 2.0 - A Universal Interface for Monitoring Systems
- Proceedings of 4th European PVM/MPI Users' Group Meeting
, 1997
Cited by 11 (2 self)
The OMIS project aims at defining a standard interface between tools for parallel systems and monitoring systems. Monitoring systems act as mediators between tools and the parallel program running on some target architecture. Their task is to observe and manipulate the program according to the tools' commands. A standardized interface will allow different research groups to develop tools that can be used concurrently with the same program. OCM, an OMIS-compliant monitoring system, is the first implementation of such an environment. It is designed for PVM programs running on workstation clusters. This paper gives an outline of the goals of the OMIS project and describes important details of the OCM design.
1 Motivation
In parallel and distributed computing, programmers are still confronted with a lack of support tools for their work. Specifically, when the first prototype is running, it is hard to find efficient debugging facilities to correct e.g. erro...
Deploying Fault Tolerance and Task Migration with NetSolve
, 1999
Cited by 9 (2 self)
Computational power grids are computing environments with massive resources for processing and storage. While these resources may be pervasive, harnessing them is a major challenge for the average user. NetSolve is a software environment that addresses this concern. A fundamental feature of NetSolve is its integration of fault-tolerance and task migration in a way that is transparent to the end user. In this paper, we discuss how NetSolve's structure allows for the seamless integration of fault-tolerance and migration in grid applications, and present the specific approaches that have been and are currently being implemented within NetSolve.
Key words: Fault-tolerance, Scientific Computing, Computational Servers, Checkpointing, Migration.
1 Introduction
The advances in computer and network technologies that are shaping the global information infrastructure are also producing a new vision of how that infrastructure will be used. The concept of a Computational Power Grid has emerged ...
A Checkpointing Strategy for Scalable Recovery on Distributed Parallel Systems
- In SC97: High Performance Networking and Computing
, 1997
Cited by 9 (1 self)
In this paper, we describe a new scheme for checkpointing parallel applications on message-passing scalable distributed memory systems. The novelty of our scheme is that a checkpointed application can be restored, from its checkpointed state, in a reconfigured form. Thus, a parallel application may be checkpointed while executing with t_1 tasks on p_1 processors, and then restarted from the checkpointed state with t_2 tasks on p_2 processors. As a result, applications can recover from partial failures in the underlying system. Also, the reconfigurable checkpointed states can be migrated from one parallel system to another even if they do not have the same number of processors. We describe a new programming model for implementing a reconfigurable checkpointing scheme for parallel programs. This new model is derived from the DRMS programming model, developed in the context of run-time reconfiguration of parallel applications. A key component of our implementation is the distri...
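The idea of restarting a checkpoint with a different number of tasks can be illustrated with a small sketch. It is not the DRMS implementation described in the paper: it assumes a one-dimensional array with a plain block distribution, and the helper names `block_bounds`, `checkpoint`, and `restart` are hypothetical. The point is only that a checkpoint stored independently of the task layout can be re-split for any task count.

```python
def block_bounds(n, tasks, rank):
    """Half-open [start, stop) range of a length-n array owned by `rank`
    under a block distribution over `tasks` tasks (hypothetical helper)."""
    base, extra = divmod(n, tasks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def checkpoint(local_blocks):
    """Merge the per-task blocks into one layout-independent array."""
    return [x for block in local_blocks for x in block]


def restart(saved, tasks):
    """Split a saved array into per-task blocks for a new task count."""
    n = len(saved)
    return [saved[slice(*block_bounds(n, tasks, r))] for r in range(tasks)]


data = list(range(10))
before = restart(data, 4)    # layout while running with 4 tasks
saved = checkpoint(before)   # checkpoint no longer depends on the task count
after = restart(saved, 3)    # restart with 3 tasks, e.g. after losing a node
print(before)
print(after)
```

Storing the global array rather than per-task dumps is what makes the restart independent of the original task count; a real system would also have to handle multi-dimensional distributions and task-private state.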
The average availability of parallel checkpointing systems and its importance in selecting runtime parameters
- In 29th International Symposium on Fault-Tolerant Computing
, 1999
Cited by 7 (1 self)
Performance prediction of checkpointing systems in the presence of failures is a well-studied research area. While the literature abounds with performance models of checkpointing systems, none address the issue of selecting runtime parameters other than the optimal checkpointing interval. In particular, the issue of processor allocation is typically ignored. In this paper, we briefly present a performance model for long-running parallel computations that execute with checkpointing enabled. We then discuss how it is relevant to today’s parallel computing environments and software, and present case studies of using the model to select runtime parameters.
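To make the link between processor allocation and the checkpointing interval concrete, here is a minimal sketch. It does not reproduce the paper's performance model. It uses Young's well-known first-order approximation, interval = sqrt(2 × checkpoint cost × MTBF), and assumes independent, exponentially distributed node failures in which any single failure stops the job, so the system MTBF shrinks as 1/p. The function names and the numbers are illustrative assumptions.

```python
import math


def system_mtbf(node_mtbf, processors):
    """MTBF of a job on `processors` nodes, assuming independent exponential
    node failures and that any single failure interrupts the job."""
    return node_mtbf / processors


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the checkpoint interval.
    Reasonable only when checkpoint_cost is much smaller than mtbf."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


# Illustrative numbers only: 50 s per checkpoint, 40,000 s MTBF per node.
for p in (4, 16, 64):
    mtbf = system_mtbf(40_000.0, p)
    print(p, mtbf, young_interval(50.0, mtbf))
```

Under these assumptions, using more processors shortens the interval between failures and therefore calls for more frequent checkpoints, which is one reason the processor count belongs among the runtime parameters to be selected.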
Breaking the Curse of Dynamics by Task Migration: Pilot Experiments in the Polder Metacomputer
- In Recent Advances in Parallel Virtual Machine and Message Passing Interface, Volume 1332 of Lecture Notes in Computer Science
, 1997
Cited by 5 (2 self)
With the advent of high-speed networks, distributed cluster computing and metacomputing have attracted enormous interest. However, the software methods and techniques needed to make the full potential of these distributed environments available are not yet mature. In this paper, we focus on dynamic load balancing of resources and applications as one of the crucial techniques to optimize performance in distributed environments. Some design and implementation details are described, and early experimental results are presented.
Dynamic Load Distribution in MIST
- In International Conference on Parallel and Distributed Processing Techniques and Applications
, 1997
Cited by 4 (0 self)
This paper presents an algorithm for scheduling parallel applications in large-scale, multiuser, heterogeneous distributed systems. The approach is primarily targeted at systems that harvest idle cycles in general-purpose workstation networks, but is also applicable to clustered computer systems and massively parallel processors. The algorithm handles unequal processor capacities, multiple architecture types and dynamic variations in the number of processes and available processors. Scheduling decisions are driven by the desire to minimize turnaround time while maintaining fairness among competing applications. For efficiency, the virtual processors (VPs) of each application are gang scheduled on some subset of the available physical processors.
Lightweight Process Migration and Memory Prefetching in openMosix
Cited by 4 (1 self)
We propose a lightweight process migration mechanism and an adaptive memory prefetching scheme called AMPoM (Adaptive Memory Prefetching in openMosix), whose goal is to reduce the migration freeze time in openMosix while ensuring the execution efficiency of migrants. To minimize the freeze time, our system transfers only a few pages to the destination node during process migration. After the migration, AMPoM analyzes the spatial locality of memory access and iteratively prefetches memory pages from the remote node to hide the latency of inter-node page faults. AMPoM adopts a unique algorithm to decide which and how many pages to prefetch. It tends to prefetch more aggressively when a sequential access pattern develops, when the paging rate of the process is high or when the network is busy. This advanced strategy makes AMPoM highly adaptive to different application behaviors and system dynamics. The HPC Challenge benchmark results show that AMPoM can avoid 98% of migration freeze time while preventing 85-99% of page fault requests after the migration. Compared to openMosix, which does not support remote page faults, AMPoM induces a modest overhead of 0-5% additional runtime. When the working set of a migrant is small, AMPoM outperforms openMosix considerably due to the reduced amount of data transfer. These results indicate that by exploiting memory access locality and prefetching, process migration can be a lightweight operation with little software overhead in remote paging.
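The general idea of prefetching more aggressively once a sequential access pattern appears can be sketched as follows. This is not AMPoM's algorithm, which the abstract does not specify. It is a toy policy, assumed here for illustration, that doubles a prefetch window on each sequential page fault and resets it otherwise; the class name `PrefetchWindow` and the `max_window` parameter are hypothetical.

```python
class PrefetchWindow:
    """Toy sequential-access detector (illustrative, not AMPoM itself)."""

    def __init__(self, max_window=32):
        self.max_window = max_window  # upper bound on pages fetched per fault
        self.window = 1               # current number of pages to fetch
        self.last_page = None         # last page fetched by the previous fault

    def on_fault(self, page):
        """Return the page numbers to fetch for a fault on `page`."""
        if self.last_page is not None and page == self.last_page + 1:
            # The fault continues the previous run: grow the window.
            self.window = min(self.window * 2, self.max_window)
        else:
            # Non-sequential access: fall back to fetching a single page.
            self.window = 1
        pages = list(range(page, page + self.window))
        self.last_page = pages[-1]
        return pages


pf = PrefetchWindow(max_window=4)
for fault in (10, 11, 13, 17, 40):
    print(fault, pf.on_fault(fault))
```

A policy like this trades extra network traffic for fewer remote page faults; a real scheme would also weigh signals such as the paging rate and network conditions that the abstract mentions.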
Adaptive Load Balancing of Distributed SPMD Computations: A Transparent Approach
- Future Generation Computer Systems 16 (2000) 571–584; Tech. Rep. DISP-RR-97.02, Dipartimento di Informatica, Sistemi e Produzione, Università di Roma Tor Vergata
, 1997
Cited by 3 (0 self)
Efficient parallel computing on distributed platforms still presents many obstacles. This paper addresses the important issue of masking the power heterogeneity and variability of non-dedicated nodes. To this purpose, we present a load balancing support that autonomously adapts the workload of Single Program Multiple Data (SPMD) applications to platform conditions. This support checks the load status of the nodes at the beginning of and during program execution and, if necessary, carries out data migrations from overloaded to underloaded nodes without requiring the programmer to insert load balancing primitives. As an additional important contribution to the transparency and efficiency of the framework, we propose a stochastic model for the automatic choice of the optimum interval of activation of the load balancer. Unlike task migration supports for task parallelism and other data migration frameworks for master/slave-based applications, our load balancer is transparent and works for the en...
Running Scientific Computations In A Web Operating System Environment
, 1999
Cited by 3 (0 self)
The efficient execution of scientific applications on the world-wide interconnected networks (the Internet, Web) requires an integration of fine and coarse grain load balancing strategies (Banicescu, Ghafoor, and Bilderback 1998). While the implementation of the new Web Operating System (WOS, a specialized operating system for global computing) is in progress (Kropf, Plaice, and Unger 1997), we identify the most competitive fine and coarse grain load balancing strategies that proved to be extremely effective in networks of workstations (Russ, Banicescu, Ghafoor, Janapareddi, Robinson, and Lu 1998), and propose a novel scheme that uses an integration of both for the WOS. The paper will assess the feasibility and the significance of the proposed scheme and will outline details of the implementation.
1. INTRODUCTION
With the rapid development of new forms and concepts of networked and mobile computing, it is increasingly clear that operating system environments must evolve such that all...

