Subtleties in tolerating correlated failures in wide-area storage systems
- In Proceedings of the 3rd Symposium on Networked Systems Design and Implementation (NSDI), 2006
Cited by 13 (0 self)
High availability is widely accepted as an explicit requirement for distributed storage systems. Tolerating correlated failures is a key issue in achieving high availability in today’s wide-area environments. This paper systematically revisits previously proposed techniques for addressing correlated failures. Using several real-world failure traces, we qualitatively answer four important questions regarding how to design systems to tolerate such failures. Based on our results, we identify a set of design principles that system builders can use to tolerate correlated failures. We show how these lessons can be effectively used by incorporating them into IRISSTORE, a distributed read-write storage layer that provides high availability. Our results using IRISSTORE on the PlanetLab over an 8-month period demonstrate its ability to withstand large correlated failures and meet preconfigured availability targets.
Exploring event correlation for failure prediction in coalitions of clusters
- In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC’07), 2007
Cited by 7 (0 self)
In large-scale networked computing systems, component failures become the norm rather than the exception. Failure prediction is a crucial technique for self-managing resource burdens. Failure events in coalition systems exhibit strong correlations in the time and space domains. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify the temporal correlation and a stochastic model to describe spatial correlation. We further utilize the information of application allocation to discover more correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. We implemented a failure prediction framework, called PREdictor of Failure Events Correlated Temporal-Spatially (hPrefects), which explores correlations among failures and forecasts the time-between-failure of future instances. We evaluate the performance of hPrefects in both offline prediction of failures using the Los Alamos HPC traces and online prediction in an institute-wide cluster coalition environment. Experimental results show that the system achieves more than 76% accuracy in offline prediction and more than 70% accuracy in online prediction from May 2006 to April 2007.
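For readers unfamiliar with the term, a spherical covariance function can be sketched as follows. The abstract does not give the formula, so this uses the textbook form from geostatistics and treats its range parameter as the adjustable timescale; the function name, the `sill` parameter, and the sample lags are illustrative assumptions, not the hPrefects implementation.

```python
def spherical_covariance(lag, timescale, sill=1.0):
    """Textbook spherical covariance for a time lag.

    Assumed form (not taken from the paper):
        C(h) = sill * (1 - 1.5*(h/a) + 0.5*(h/a)**3)  for 0 <= h < a
        C(h) = 0                                      for h >= a
    where a is the adjustable timescale (range) parameter.
    """
    if timescale <= 0:
        raise ValueError("timescale must be positive")
    ratio = abs(lag) / timescale
    if ratio >= 1.0:
        # Events further apart than the timescale are treated as uncorrelated.
        return 0.0
    return sill * (1.0 - 1.5 * ratio + 0.5 * ratio ** 3)


if __name__ == "__main__":
    # Hypothetical lags (in hours) between two failure events, timescale of 10 hours.
    for lag in (0, 2, 5, 10, 20):
        print(lag, spherical_covariance(lag, timescale=10.0))
```

Under this form, a larger timescale lets failures that are further apart in time still count as correlated, which is presumably what makes the parameter useful to tune.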
Availability in bittorrent systems
- In Proc. of INFOCOM, 2007
Cited by 6 (1 self)
In this paper, we investigate the problem of highly available, massive-scale file distribution in the Internet. To this end, we conduct a large-scale measurement study of BitTorrent, a popular class of systems that use swarms of actively downloading peers to assist each other in file distribution. The first generation of BitTorrent systems used a central tracker to enable coordination among peers, resulting in low availability due to the tracker’s single point of failure. Our study analyzes the prevalence and impact of two recent trends to improve BitTorrent availability: (i) use of multiple trackers, and (ii) use of Distributed Hash Tables (DHTs), both of which also help to balance load better. The study considered more than 1,400 trackers and 24,000 DHT nodes (extracted from about 20,000 torrents) over a period of two months. We find that both trends improve availability, but for different and somewhat unexpected reasons. Our findings include: (i) multiple trackers improve availability, but the improvement largely comes from the choice of a single highly available tracker, (ii) such improvement is reduced by the presence of correlated failures, (iii) multiple trackers can significantly reduce the connectivity of the overlay formed by peers, (iv) the DHT improves information availability, but induces a higher response latency to peer queries.
Group Therapy for Systems: Using link attestations to manage failures
- In IPTPS, 2006
Cited by 5 (1 self)
Managing failures and configuring systems properly are of critical importance for robust distributed services. Unfortunately, protocols offering strong fault-tolerance guarantees are generally too costly and insensitive to performance criteria. Yet, system management in practice is often ad-hoc and ill-defined, leading to under-utilized capacity or adverse effects from poorly-behaving machines. This paper proposes a new abstraction called link-attestation groups (LA-Groups) for building robust distributed systems. Developers specify application-level correctness conditions or performance requirements for nodes. Nodes vouch for each other's acceptability within small groups of nodes through digitally-signed link attestations, and then apply a link-state protocol to determine these group relationships.
Computational Risk Management for Building Highly Reliable Network Services
- In Proceedings of the 1st Workshop on Hot Topics in System Dependability, 2005
Cited by 3 (0 self)
consistent high performance to clients in the presence of failures and bursty demand is expensive and inefficient. Resources often need to be heavily overprovisioned to accommodate peak demand and the cost of such overprovisioning "prices out" many applications that could stand to benefit from a performance safety-net and ultimately provide more reliable service to end users. To address these problems, we propose an approach based on a shared Computational Service Provider (CSP). A CSP is an entity which provides massive amounts of widely distributed computation and storage and makes resources available through a mix of spot and derivative markets. Services obtain resources through the CSP and, drawing inspiration from finance, employ quantitative risk management techniques for trading off cost, performance, and risk to probabilistically achieve target levels of delivered client performance.
Exploiting Redundancy for Robust Sensing
, 2005
Cited by 1 (1 self)
views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government or any other entity.
Research Interests Sensor Networks, Distributed Systems, and Databases Education
, 2003
I am writing to apply for a tenure-track position as an assistant professor in your department. Currently, I am a Ph.D. candidate in the Computer Science Department at Carnegie Mellon University. I expect to finish my thesis dissertation by August 2005. My research interests cover a wide variety of topics in the intersection of sensor networks, databases, and distributed systems. The distinguishing aspect of my research is that I seek efficient and theoretically sound techniques to qualitatively enhance the robustness of large-scale distributed systems and I validate the techniques by implementing them in real systems. In my thesis work, I have developed new algorithms and techniques for distributed sensing services, which exploit the unique properties and requirements of target systems to provide substantially higher availability than possible with state-of-the-art techniques. I have enclosed my Curriculum Vitae along with a list of references, my research and teaching statements, and selected publications. For your convenience, my application materials are also available online at
Research Statement
My goal is to design and develop deployable large-scale systems. My work has a strong focus on building systems as I firmly believe that this exercise gives a deeper understanding of a problem and exposes issues that are otherwise hard to grasp while working at an abstract level. My current research goal is to design a Scalable Distributed Information Management System (SDIMS) that aggregates and manages information in a large-scale distributed system comprising tens of thousands of machines distributed across the Internet. Previously, I have worked on prefetching for improving World Wide Web performance and on transparent mobility for uninterrupted connectivity between devices in personal networks. In the SDIMS research, I seek to construct a distributed operating systems backbone that simplifies design, development, and deployment of distributed applications in large-scale networked systems. Recently, there has been an emergence of large-scale wide-area networked systems such as enterprise networks comprising tens of thousands of machines spread across multiple sites, traffic engineering systems with large numbers of sensors and cameras installed on highways to monitor the flow of traffic, etc. Distributed applications on such networked systems monitor and react to changes in the information at individual nodes and to reconfigurations in the system. A few examples of such applications are multicast, file location, system monitoring and management, publish-subscribe, domain name service, and resource discovery. These applications require a key component that gathers and manages information at individual
A Quantitative Approach
- Computer, 2005
wide-area storage systems; replicas must be created as storage nodes permanently fail to avoid data loss. Many failures in the wide-area are transient, however, where the node returns with data intact. Given a goal of minimizing replicas created to maintain a desired replication level, creating replicas in response to transient failures is wasted effort. In this paper, we present a principled way of minimizing costs while maintaining a desired data availability. Design choices include choosing data redundancy type, number of replicas, extra redundancy, and data placement. We demonstrate via trace-driven simulation that significant maintenance efficiency gains can be realized in existing storage systems with the correct choice of strategies and parameters. For example, we show that DHash can reduce its costs by a factor of 31 while maintaining the same desired data availability.
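To make the link between the number of replicas and a desired data availability concrete, the sketch below computes the smallest replica count that meets a target under the simplest possible model. It assumes that every node is up with the same probability and that nodes fail independently, which is an illustrative simplification and not the model used in the paper; the correlated failures discussed in other entries of this listing would make the estimate optimistic. The function and parameter names are hypothetical.

```python
def min_replicas(node_availability, target_availability, max_replicas=1000):
    """Smallest n with 1 - (1 - a)**n >= target, for per-node availability a.

    Assumes independent failures and identical nodes (a simplification).
    """
    if not 0.0 < node_availability < 1.0:
        raise ValueError("node_availability must be strictly between 0 and 1")
    if not 0.0 < target_availability < 1.0:
        raise ValueError("target_availability must be strictly between 0 and 1")
    all_down = 1.0  # probability that every replica so far is unavailable
    for n in range(1, max_replicas + 1):
        all_down *= 1.0 - node_availability
        if 1.0 - all_down >= target_availability:
            return n
    raise ValueError("target not reachable within max_replicas")


if __name__ == "__main__":
    # Hypothetical numbers: nodes that are up half the time, 99% target.
    print(min_replicas(0.5, 0.99))
```

Any replicas kept beyond this minimum correspond to the "extra redundancy" design choice mentioned in the abstract: spare copies that let a system ride out transient failures without creating new replicas.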

