Large-Scale Newscast Computing on the Internet
- 2002
Cited by 39 (14 self)
This paper introduces the newscast model of computation for large-scale computing on the Internet. The engine realizing this model is a lazy fully distributed information propagation protocol among the participants which is responsible for membership management and communication. It maintains a constantly changing communication graph over the participants. This graph has useful emergent properties like small diameter and sufficiently random structure without deploying special purpose protocols to achieve these properties. For adding a new participant only the address of an arbitrary member is needed and for removal no action is necessary. We provide theoretical and empirical evidence that besides being simple and lightweight our newscast computing engine is extremely scalable and robust. We also suggest some interesting application areas including information dissemination, monitoring of large systems, resource sharing and efficient multicasting.
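The abstract above does not spell out the protocol, so the following is only a minimal sketch of a gossip-style membership exchange in the spirit it describes. It assumes that each participant keeps a small cache of (address, timestamp) entries, that two participants merge their caches when they talk, and that only the freshest entries are kept. The names `CACHE_SIZE`, `merge`, `exchange` and `run` are hypothetical and not taken from the paper.

```python
import random

CACHE_SIZE = 4  # hypothetical cache size; not a value from the paper


def merge(*caches, size=CACHE_SIZE):
    """Merge {address: timestamp} caches, keeping the `size` freshest entries."""
    merged = {}
    for cache in caches:
        for addr, ts in cache.items():
            if ts > merged.get(addr, -1):
                merged[addr] = ts
    freshest = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return dict(freshest[:size])


def exchange(caches, a, b, now):
    """Participants a and b pool their caches plus a fresh entry for each of them."""
    combined = merge(caches[a], caches[b], {a: now, b: now})
    caches[a] = dict(combined)
    caches[b] = dict(combined)


def run(n=20, rounds=10, seed=1):
    """Every new participant starts out knowing only participant 0."""
    rng = random.Random(seed)
    caches = {i: ({0: 0} if i else {}) for i in range(n)}
    for t in range(1, rounds + 1):
        for a in range(n):
            peers = [p for p in caches[a] if p != a]
            if not peers:
                continue  # nobody known yet; wait to be contacted
            exchange(caches, a, rng.choice(peers), t)
    return caches
```

In this sketch a joining participant needs only one known address, and a departed participant simply stops refreshing its entry, so older entries are eventually pushed out of the caches without any explicit removal step.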
Experimental analysis of a gossip-based service for scalable, distributed failure detection and consensus
- Cluster Computing, http://www.hcs.ufl.edu/pubs/GOSSIP2001.pdf
Cited by 3 (3 self)
Abstract. Gossip protocols and services provide a means by which failures can be detected in large, distributed systems in an asynchronous manner without the limits associated with reliable multicasting for group communications. Extending the gossip protocol such that a system reaches consensus on detected faults can be performed via a flat structure, or it can be hierarchically distributed across cooperating layers of nodes. In this paper, the performance of gossip services employing flat and hierarchical schemes is analyzed on an experimental testbed in terms of consensus time, resource utilization and scalability. Performance associated with a hierarchically arranged gossip scheme is analyzed with varying group sizes and is shown to scale well. Resource utilization of the gossip-style failure detection and consensus service is measured in terms of network bandwidth utilization and CPU utilization. Analytical models are developed for resource utilization and performance projections are made for large system sizes.
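As an illustration of the general idea of gossip-style failure detection with consensus in a flat structure, here is a minimal sketch. It assumes a heartbeat-counter scheme: each node keeps a table of the highest heartbeat seen per node and the local time it was last updated, suspects nodes whose entry has gone stale, and treats a node as failed once every non-suspected observer suspects it. The function names and the `t_fail` timeout are hypothetical, and the paper's actual service may differ.

```python
def merge_heartbeats(local, remote, now):
    """Fold a received gossip message into the local table.

    local:  {node: (heartbeat, last_updated)}
    remote: {node: heartbeat} as carried in the gossip message
    """
    for node, hb in remote.items():
        known_hb, _ = local.get(node, (-1, now))
        if hb > known_hb:
            local[node] = (hb, now)  # progress seen, so refresh the timestamp
    return local


def suspected(local, now, t_fail):
    """Nodes whose heartbeat has not advanced within t_fail time units."""
    return sorted(n for n, (_, seen) in local.items() if now - seen > t_fail)


def consensus_failed(suspicions):
    """Nodes that every non-suspected observer currently suspects.

    suspicions: {observer: set of nodes that observer suspects}
    """
    all_suspects = set().union(*suspicions.values())
    failed = set()
    for cand in all_suspects:
        voters = [o for o in suspicions if o != cand and o not in all_suspects]
        if voters and all(cand in suspicions[o] for o in voters):
            failed.add(cand)
    return failed
```

A hierarchical variant would run the same logic within groups and gossip summaries between layers; that part is not sketched here.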
GEMS: Gossip-Enabled Monitoring Service for Scalable Heterogeneous Distributed Systems
- Cluster Comput
Cited by 2 (0 self)
Abstract. Gossip protocols have proven to be an effective means by which failures can be detected in large, distributed systems in an asynchronous manner without the limitations associated with reliable multicasting for group communications. In this paper, we discuss the development and features of a Gossip-Enabled Monitoring Service (GEMS), a highly responsive and scalable resource monitoring service, to monitor health and performance information in heterogeneous distributed systems. GEMS has many novel and essential features such as detection of network partitions and dynamic insertion of new nodes into the service. Easily extensible, GEMS also incorporates facilities for distributing arbitrary system and application-specific data. We present experiments and analytical projections demonstrating scalability, fast response times and low resource utilization requirements, making GEMS a potent solution for resource monitoring in distributed computing.
Fault management in P2P-MPI
- In Proceedings of the International Conference on Grid and Pervasive Computing, GPC’07, LNCS, 2007
Cited by 1 (1 self)
Abstract. We present in this paper the recent developments in P2P-MPI, a grid middleware, concerning fault management, which covers fault-tolerance for applications and fault detection. P2P-MPI provides a transparent fault tolerance facility based on replication of computations. Applications are monitored by a distributed set of external modules called failure detectors. The contribution of this paper is the analysis of the advantages and drawbacks of such detectors for a real implementation, and its integration in P2P-MPI. We pay particular attention to the reliability of the failure detection service and to the failure detection speed. We propose a variant of the binary round-robin protocol, which is more reliable than the application execution in any case. Experiments on applications of up to 256 processes, carried out on Grid’5000, show that the real detection times closely match the predictions. Keywords: Grid computing, middleware, parallelism, fault-tolerance.
GEMS: Gossip-Enabled Monitoring Service for Heterogeneous Distributed Systems
- http://www.hcs.ufl.edu/pubs/GEMS2002.pdf, submitted to Journal of Network and Systems Management
Cited by 1 (1 self)
Abstract – Gossip protocols provide a scalable means for detecting failures in heterogeneous distributed systems in an asynchronous manner without the limits associated with group communication. In this paper, we discuss the development and features of a hierarchical Gossip-Enabled Monitoring Service (GEMS), which extends the gossip-style failure detection service to support resource monitoring. By dividing the system into groups of nodes and layers of communication, the GEMS paradigm scales well. Easily extensible, GEMS incorporates facilities for distributing arbitrary system and application-specific data. In this paper we present experiments and analytical projections demonstrating fast response times and low resource utilization requirements, making GEMS a superior solution for resource monitoring issues in distributed computing. Also, we demonstrate the utility of GEMS through the development of a simple dynamic load balancing service for which GEMS forms the information base.
Fault-Management in P2P-MPI
- Int J Parallel Prog (2009) 37:433–461, DOI 10.1007/s10766-009-0115-8
Abstract. We present in this paper a study on fault management in a grid middleware. The middleware is our home-grown software called P2P-MPI. This framework is MPJ compliant, allows users to execute message passing parallel programs, and aims to support environments using commodity hardware. Hence, running programs is failure prone and particular attention must be paid to fault management. Fault management covers two issues: fault-tolerance and fault detection. Fault-tolerance deals with the program execution: P2P-MPI provides a transparent fault tolerance facility based on replication of computations. Fault detection concerns the monitoring of the program execution by the system. The monitoring is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is the evaluation of the failure probability of an application depending on the replication degree. The failure probability depends on the execution length, and we propose a model to evaluate the duration of a replicated parallel program. Then, we give an expression of the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. The criteria of our evaluation are the reliability of the failure detection service and
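The listing does not reproduce the paper's expression for the required replication degree. As a rough illustration of the kind of calculation involved, the sketch below uses a simplified model that is an assumption here, not the authors' model: each replica fails independently with probability p over the whole execution, a logical process is lost only when all r of its replicas fail, and the application fails when any of its n logical processes is lost. The function names and the `r_max` bound are hypothetical.

```python
def app_failure_probability(p, n, r):
    """Failure probability of an n-process application with replication degree r.

    Assumed model: independent replica failures with probability p each;
    a logical process is lost with probability p**r.
    """
    return 1.0 - (1.0 - p ** r) ** n


def min_replication_degree(p, n, threshold, r_max=64):
    """Smallest r that keeps the assumed failure probability at or below threshold."""
    for r in range(1, r_max + 1):
        if app_failure_probability(p, n, r) <= threshold:
            return r
    raise ValueError("no replication degree up to r_max meets the threshold")
```

In the paper the failure probability also depends on the execution length, which itself grows with replication; this sketch takes p as a fixed input and ignores that coupling.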
Fault management in P2P-MPI
- Author manuscript, published in Proceedings of the International Conference on Grid and Pervasive Computing, GPC'07, 4459 (2007), 2010
Abstract. We present in this paper the recent developments in P2P-MPI, a grid middleware, concerning fault management, which covers fault-tolerance for applications and fault detection. P2P-MPI provides a transparent fault tolerance facility based on replication of computations. Applications are monitored by a distributed set of external modules called failure detectors. The contribution of this paper is the analysis of the advantages and drawbacks of such detectors for a real implementation, and its integration in P2P-MPI. We pay particular attention to the reliability of the failure detection service and to the failure detection speed. We propose a variant of the binary round-robin protocol, which is more reliable than the application execution in any case. Experiments on applications of up to 256 processes, carried out on Grid’5000, show that the real detection times closely match the predictions. Keywords: Grid computing, middleware, parallelism, fault-tolerance.

