Symmetric active/active high availability for high-performance computing system services
- Journal of Computers (JCP), 2006
Cited by 6 (4 self)
This paper summarizes our efforts over the last 3-4 years in providing symmetric active/active high availability for high-performance computing (HPC) system services. This work paves the way for high-level reliability, availability and serviceability in extreme-scale HPC systems by focusing on the most critical components, head and service nodes, and by reinforcing them with appropriate high availability solutions. This paper presents our accomplishments in the form of concepts and respective prototypes, discusses existing limitations, outlines possible future work, and describes the relevance of this research to other, planned efforts.
Transparent symmetric active/active replication for service-level high availability
- In Proceedings of the 7th IEEE International Symposium on Cluster Computing and the Grid
Cited by 4 (2 self)
As service-oriented architectures become more important in parallel and distributed computing systems, individual service instance reliability as well as appropriate service redundancy becomes an essential necessity in order to increase overall system availability. This paper focuses on providing redundancy strategies using service-level replication techniques. Based on previous research using symmetric active/active replication, this paper proposes a transparent symmetric active/active replication approach that allows for more reuse of code between individual service-level replication implementations by using a virtual communication layer. Service- and client-side interceptors are utilized in order to provide total transparency. Clients and servers are unaware of the replication infrastructure as it provides all necessary mechanisms internally.
A fast delivery protocol for total order broadcasting
- In Proceedings of the 16th IEEE International Conference on Computer Communications and Networks
Cited by 4 (3 self)
Sequencer, privilege-based, and communication history algorithms are popular approaches to implement total ordering, where communication history algorithms are most suitable for parallel computing systems, because they provide the best performance under heavy workload. Unfortunately, the post-transmission delay of communication history algorithms is most apparent when a system is idle. In this paper, we propose a fast delivery protocol to reduce the latency of message ordering. The protocol optimizes the total ordering process by waiting for messages only from a subset of the machines in the group, and by fast acknowledging messages on behalf of other machines. Our test results indicate that the fast delivery protocol is suitable for both idle and heavy load systems, while reducing the latency of message ordering.
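As a rough illustration of the baseline this abstract refers to (not the paper's fast delivery protocol itself), the delivery rule of a communication history algorithm can be sketched as follows. The sketch assumes Lamport-style timestamps and FIFO channels between members; the class name `HistoryOrderer`, its methods, and the convention that a payload of `None` stands for a bare acknowledgement are all hypothetical. A message is held back until every other member has been heard from with a later timestamp, which is why delivery stalls when the group is idle.

```python
import heapq


class HistoryOrderer:
    """Receiver side of a communication-history total order (illustrative only)."""

    def __init__(self, members):
        # Highest timestamp seen so far from each group member.
        self.latest = {m: 0 for m in members}
        # Min-heap of undelivered messages, ordered by (timestamp, sender).
        self.pending = []

    def receive(self, sender, timestamp, payload=None):
        """Record a message and return the payloads that became deliverable.

        A payload of None models a bare acknowledgement: it advances the
        sender's timestamp but is never delivered itself.
        """
        self.latest[sender] = max(self.latest[sender], timestamp)
        if payload is not None:
            heapq.heappush(self.pending, (timestamp, sender, payload))

        delivered = []
        while self.pending:
            ts, src, _ = self.pending[0]
            # With FIFO channels, once every other member has sent something
            # with a later timestamp, no earlier message can still arrive.
            others_ahead = all(
                self.latest[m] > ts for m in self.latest if m != src
            )
            if not others_ahead:
                break
            delivered.append(heapq.heappop(self.pending)[2])
        return delivered


orderer = HistoryOrderer(["A", "B", "C"])
orderer.receive("A", 1, "a1")   # held: B and C have not been heard from
orderer.receive("B", 2, "b2")   # held: C has not been heard from
orderer.receive("C", 3)         # C's acknowledgement releases "a1"
orderer.receive("A", 3)         # A's acknowledgement releases "b2"
```

In this sketch the wait covers every other member of the group. The abstract describes relaxing that wait to a subset of machines and acknowledging on behalf of others; how the paper selects the subset is not stated here, so it is left out of the example.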
Towards high availability for high-performance computing system services: Accomplishments and limitations
- In Proceedings of High Availability and Performance Workshop, 2006
Cited by 3 (3 self)
During the last several years, our teams at Oak Ridge National Laboratory, Louisiana Tech University, and Tennessee Technological University focused on efficient redundancy strategies for head and service nodes of high-performance computing (HPC) systems in order to pave the way for high availability (HA) in HPC. These nodes typically run critical HPC system services, like job and resource management, and represent single points of failure and control for an entire HPC system. The overarching goal of our research is to provide high-level reliability, availability, and serviceability (RAS) for HPC systems by combining HA and HPC technology. This paper summarizes our accomplishments, such as developed concepts and implemented proof-of-concept prototypes, and describes existing limitations, such as performance issues, which need to be dealt with for production-type deployment.
On programming models for service-level high availability
- In Proceedings of 2nd International Conference on Availability, Reliability and Security
Cited by 3 (3 self)
This paper provides an overview of existing programming models for service-level high availability and investigates their differences, similarities, advantages, and disadvantages. Its goal is to help to improve reuse of code and to allow adaptation to quality of service requirements. It further aims at encouraging a discussion about these programming models and their provided quality of service, such as availability, performance, serviceability, usability and applicability. Within this context, the presented research focuses on providing high availability for services running on head and service nodes of high-performance computing systems. The proposed conceptual service model and the discussed service-level high availability programming models are applicable to many parallel and distributed computing scenarios as a networked system's availability can invariably be improved by increasing the availability of its most critical services.
Symmetric active/active replication for dependent services
- In Proceedings of the 3rd International Conference on Availability, Reliability and Security
Cited by 1 (1 self)
During the last several years, we have established the symmetric active/active replication model for service-level high availability and implemented several proof-of-concept prototypes. One major deficiency of our model is its inability to deal with dependent services, since its original architecture is based on the client-service model. This paper extends our model to dependent services using its already existing mechanisms and features. The presented concept is based on the idea that a service may also be a client of another service, and multiple services may be clients of each other. A high-level abstraction is used to illustrate dependencies between clients and services, and to decompose dependencies between services into respective client-service dependencies. This abstraction may be used for providing high availability in distributed computing systems with complex service-oriented architectures.

