A Log(n) Multi-Mode Locking Protocol for Distributed Systems (0)

by Nirmit Desai, Frank Mueller

Results 1 - 5 of 5

Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems

by Jyothish Varma, Chao Wang, Frank Mueller, Christian Engelmann, Stephen L. Scott - ICS06, 2006
Cited by 7 (6 self)
Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with thousands of nodes, such as IBM’s Blue Gene/L. A shorter mean-time-to-failure can be addressed by adding fault tolerance to reconfigure working nodes to ensure that communication and computation can progress. However, existing approaches fall short in providing scalability and small reconfiguration overhead within the fault-tolerant layer. This paper contributes a scalable approach to reconfigure the communication infrastructure after node failures. We propose a decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults. Our protocol shows response times on the order of hundreds of microseconds and single-digit milliseconds for reconfiguration using MPI over Blue Gene/L and TCP over Gigabit, respectively. The protocol can be adapted to match the network topology to further increase performance. We also verify experimental results against a performance model, which demonstrates the scalability of the approach. Hence, the membership service is suitable for deployment in the communication layer of MPI runtime systems, and we have integrated an early version into LAM/MPI.

High Performance Distributed Lock Management Services using Network-based Remote Atomic Operations

by S. Narravula, A. Mamidala, A. Vishnu, K. Vaidyanathan, D. K. Panda - In Proceedings of Int’l Symposium on Cluster Computing and the Grid (CCGrid), 2007
Cited by 7 (3 self)
Recently there has been a massive increase in computing requirements for parallel applications. These parallel applications and supporting cluster services often need to share system-wide resources. The coordination of these applications is typically managed by a distributed lock manager. The performance of the lock manager is extremely critical for application performance. Researchers have shown that the use of two-sided communication protocols such as TCP/IP (used by current-generation lock managers) can have a significant impact on the scalability of distributed lock managers. In addition, existing locking designs based on one-sided communication support locking in exclusive access mode only and can pose significant scalability limitations on applications that need shared and exclusive access modes, such as cooperative/file-system caching. Hence the utility of these existing designs in high-performance scenarios can be limited. In this paper, we present a novel protocol for distributed locking services that utilizes the advanced network-level one-sided atomic operations provided by InfiniBand. Our approach augments existing approaches by eliminating the need for two-sided communication protocols in the critical locking path. Further, we also demonstrate that our approach provides significantly higher performance in scenarios needing both shared and exclusive mode access to resources. Our experimental results show a 39% improvement in basic locking latencies over traditional send/receive-based implementations. Further, we also observe a significant (up to 317% for 16 nodes) improvement over existing RDMA-based distributed queuing schemes for shared mode locking scenarios.
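
To make the idea of shared and exclusive locking over one-sided atomics more concrete, here is a minimal single-process sketch in Python. It is not the protocol from the paper, whose details the abstract does not give. It only assumes the two 64-bit atomics that InfiniBand exposes (compare-and-swap and fetch-and-add) and simulates them on a local integer. The word layout (owner id in the high 32 bits, shared-holder count in the low 32 bits) and all names (`LockWord`, `try_shared`, `try_exclusive`) are hypothetical.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


class LockWord:
    """Simulated 64-bit lock word living in a remote node's memory.

    Assumed layout: high 32 bits = exclusive owner id (0 means none),
    low 32 bits = number of shared holders.
    """

    def __init__(self):
        self.value = 0

    def compare_and_swap(self, expected, new):
        # Stand-in for a remote atomic compare-and-swap; returns the old value.
        old = self.value
        if old == expected:
            self.value = new
        return old

    def fetch_and_add(self, delta):
        # Stand-in for a remote atomic fetch-and-add with 64-bit wraparound.
        old = self.value
        self.value = (old + delta) & MASK64
        return old


def try_exclusive(word, node_id):
    # Succeeds only when there is no exclusive owner and no shared holder.
    return word.compare_and_swap(0, node_id << 32) == 0


def release_exclusive(word, node_id):
    # Subtract the owner field; tolerates transient shared-count increments.
    word.fetch_and_add(-(node_id << 32))


def try_shared(word):
    # Optimistically register as a shared holder, then back out if an
    # exclusive owner was present at the time of the increment.
    old = word.fetch_and_add(1)
    if old >> 32 == 0:
        return True
    word.fetch_and_add(-1)
    return False


def release_shared(word):
    word.fetch_and_add(-1)
```

In this sketch no two-sided message is needed to take or drop a lock, which is the property the abstract emphasizes. A real design would also need fairness and a way to queue waiters, which this sketch omits.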

Scalable Hierarchical Locking for Distributed Systems

by Nirmit Desai, Frank Mueller - Journal of Parallel and Distributed Computing, 2003
Cited by 5 (2 self)
Middleware components are becoming increasingly important as applications share computational resources in distributed environments, such as high-end clusters with an ever larger number of processors, computational grids, and increasingly large server farms. One of the main challenges in such environments is to achieve scalability of synchronization. In general, concurrency services arbitrate resource requests in distributed systems. But concurrency protocols currently lack scalability. Adding such guarantees enables resource sharing and computing with distributed objects in systems with a large number of nodes.

Distributed Queue-based Locking using Advanced Network Features

by Ananth Devulapalli, Pete Wyckoff, 2005
A Distributed Lock Manager (DLM) provides advisory locking services to applications such as databases and file systems that run on distributed systems. Lock management at the server is implemented using First-In-First-Out (FIFO) queues. In this paper, we demonstrate a novel way of delegating the lock management to the participating lock-requesting nodes, using advanced network primitives such as Remote Direct Memory Access (RDMA) and Atomic operations. This nicely complements the original idea of DLM, where management of the lock space is distributed. Our implementation achieves better load balancing, reduction in server load and improved throughput over traditional designs.
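
The delegation idea described above, where requesters rather than the server maintain the FIFO queue, can be sketched as follows. This is an illustrative single-process simulation and not the authors' implementation. It assumes the server exposes one word holding the current queue tail, that requesters update it with an atomic swap (a hypothetical primitive here, which could be emulated with a compare-and-swap loop), and that "remote writes" are plain attribute assignments. All class and method names are made up for the example.

```python
class TailWord:
    """Simulated server-side word that names the last node in the queue."""

    def __init__(self):
        self.value = None

    def swap(self, new):
        # Stand-in for an atomic swap; returns the previous tail.
        old, self.value = self.value, new
        return old

    def compare_and_swap(self, expected, new):
        # Stand-in for an atomic compare-and-swap; returns the old value.
        old = self.value
        if old is expected:
            self.value = new
        return old


class Node:
    """A lock-requesting node that manages its own place in the queue."""

    def __init__(self, name, tail):
        self.name = name
        self.tail = tail
        self.successor = None
        self.holds_lock = False

    def acquire(self):
        # Enqueue by swapping ourselves in as the new tail.
        pred = self.tail.swap(self)
        if pred is None:
            self.holds_lock = True   # queue was empty: lock is ours
        else:
            pred.successor = self    # simulated remote write to predecessor
        return self.holds_lock

    def release(self):
        self.holds_lock = False
        if self.successor is None:
            # No known successor: try to mark the queue empty.
            if self.tail.compare_and_swap(self, None) is self:
                return
            # A concurrent implementation would wait here for the successor
            # link to arrive; this single-threaded sketch never gets here.
            raise RuntimeError("successor link not yet visible")
        nxt, self.successor = self.successor, None
        nxt.holds_lock = True        # simulated remote write: hand over lock
```

The server only hosts the tail word; hand-off happens directly between requesters, in FIFO order.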

Dissertation Committee: Approved by

by Matthew J. Koop, Prof D. K. P, Prof F. Qin
In the past decade, rapid advances have taken place in the field of computer and network design, enabling us to connect thousands of computers together to form high-performance clusters. These clusters are used to solve computationally challenging scientific problems. The Message Passing Interface (MPI) is a popular model for writing applications for these clusters. There is a vast array of scientific applications that use MPI on clusters. As the applications operate on larger and more complex data, the size of the compute clusters is scaling higher and higher. The scalability and the performance of the MPI library are very important for the end application performance. InfiniBand is a cluster interconnect that is based on open standards and is gaining rapid acceptance. This dissertation explores the different transports provided by InfiniBand to determine the scalability and performance aspects of each. Further, new MPI designs have been proposed and implemented for transports that have never been used for MPI in the past. These designs have significantly decreased the resource consumption, increased the performance, and increased the reliability of ultra-scale InfiniBand clusters.