Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems
- ICS06, 2006
Cited by 7 (6 self)
Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with thousands of nodes, such as IBM’s Blue Gene/L. A shorter mean-time-to-failure can be addressed by adding fault tolerance to reconfigure working nodes to ensure that communication and computation can progress. However, existing approaches fall short in providing scalability and small reconfiguration overhead within the fault-tolerant layer. This paper contributes a scalable approach to reconfigure the communication infrastructure after node failures. We propose a decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults. Our protocol shows reconfiguration response times on the order of hundreds of microseconds using MPI over Blue Gene/L and single-digit milliseconds using TCP over Gigabit. The protocol can be adapted to match the network topology to further increase performance. We also verify experimental results against a performance model, which demonstrates the scalability of the approach. Hence, the membership service is suitable for deployment in the communication layer of MPI runtime systems, and we have integrated an early version into LAM/MPI.
Scalable Hierarchical Locking for Distributed Systems
- Journal of Parallel and Distributed Computing, 2003
Cited by 5 (2 self)
Middleware components are becoming increasingly important as applications share computational resources in distributed environments, such as high-end clusters with an ever larger number of processors, computational grids, and increasingly large server farms. One of the main challenges in such environments is to achieve scalability of synchronization. In general, concurrency services arbitrate resource requests in distributed systems. However, current concurrency protocols lack scalability. Adding such guarantees enables resource sharing and computing with distributed objects in systems with a large number of nodes.
unknown title
, 2007
www.elsevier.com/locate/jpdc A priority-based distributed group mutual exclusion algorithm when group access is non-uniform
Byte-Range Asynchronous Locking in Distributed Settings
- Author manuscript, published in 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP 2009), 2009
This paper investigates a mutual exclusion algorithm for distributed systems. We introduce a new algorithm based on the Naimi-Trehel algorithm, taking advantage of the distributed approach of Naimi-Trehel while allowing partial locks to be requested. Such ranged locks offer semantics close to POSIX file locking, where threads lock some parts of the shared file. We evaluate our algorithm by comparing its performance with that of the original Naimi-Trehel algorithm and of a centralized mutual exclusion algorithm. The performance metric considered is the average time to obtain a lock.

