Distributed peer-to-peer control in harness
- In ICCS, 2002
Cited by 10 (6 self)
Abstract. Harness is an adaptable, fault-tolerant virtual machine environment for next-generation heterogeneous distributed computing, developed as a follow-on to PVM. It additionally enables the assembly of applications from plug-ins and provides fault tolerance. This work describes the distributed control, which manages global state replication to ensure high availability of service. Group communication services achieve agreement on an initial global state and a linear history of global state changes at all members of the distributed virtual machine. This global state is replicated to all members so that the system can easily recover from single, multiple and cascaded faults. A peer-to-peer ring network architecture and tunable multi-point failure conditions provide heterogeneity and scalability. Finally, the integration of the distributed control into the multi-threaded kernel architecture of Harness offers a fault-tolerant global state database service for plug-ins and applications.
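The replication idea in this abstract (every member holds the full global state plus the same linear history of changes, so any survivor can restore a newcomer) can be illustrated with a toy sketch. This is not the Harness protocol: the `Member` and `Ring` classes and their methods are hypothetical names, the ring is simulated in a single thread, and no real agreement protocol, networking, or failure detection is modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """One hypothetical DVM member holding a full replica of the global state."""
    name: str
    state: dict = field(default_factory=dict)
    history: list = field(default_factory=list)  # linear history of applied changes

    def apply(self, change):
        key, value = change
        self.state[key] = value
        self.history.append(change)


class Ring:
    """Toy ring (assumption): a change visits every live member once, in ring order."""

    def __init__(self, names):
        self.members = [Member(n) for n in names]

    def broadcast(self, change):
        # Single-threaded pass around the ring, so all members see the same order.
        for member in self.members:
            member.apply(change)

    def fail(self, name):
        # Drop a faulty member; the survivors still hold the full state.
        self.members = [m for m in self.members if m.name != name]

    def join(self, name):
        # A newcomer copies state and history from any surviving member.
        donor = self.members[0]
        newcomer = Member(name, dict(donor.state), list(donor.history))
        self.members.append(newcomer)
        return newcomer


ring = Ring(["a", "b", "c", "d"])
ring.broadcast(("plugin", "loaded"))
ring.fail("b")
ring.fail("c")  # multiple faults: the state survives on "a" and "d"
ring.broadcast(("task", 7))
fresh = ring.join("e")
```

Because each replica is complete, the state remains recoverable as long as at least one member survives; the cost is that every change must reach every member.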
High Availability for Ultra-Scale High-End Scientific Computing
- Proceedings of COSET-2, 2005
Cited by 10 (9 self)
Ultra-scale architectures for scientific high-end computing with tens to hundreds of thousands of processors, such as the IBM Blue Gene/L and the Cray X1, suffer from availability deficiencies, which impact the efficiency of running computational jobs by forcing frequent checkpointing of applications. Most systems are unable to handle runtime system configuration changes caused by failures and require a complete restart of essential system services, such as the job scheduler or MPI, or even of the entire machine. In this paper, we present a flexible, pluggable and component-based high availability framework that expands today's high availability efforts, which focus on keeping a single server alive, to include all machines cooperating in a high-end scientific computing environment, while allowing adaptation to system properties and application needs.
A Lightweight Kernel for the Harness Metacomputing Framework
- Proceedings of HCW, 2005
Cited by 6 (5 self)
Harness is a pluggable heterogeneous Distributed Virtual Machine (DVM) environment for parallel and distributed scientific computing. This paper describes recent improvements in the Harness kernel design. By using a lightweight approach and moving previously integrated system services into software modules, the software becomes more versatile and adaptable. This paper outlines these changes and explains the major Harness kernel components in more detail. A short overview is given of ongoing efforts in integrating RMIX, a dynamic heterogeneous reconfigurable communication framework, into the Harness environment as a new plug-in software module. We describe the overall impact of these changes and how they relate to other ongoing work.

