Results 1 - 10 of 35
Microreboot - A Technique for Cheap Recovery
2004
"... A significant fraction of software failures in large-scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expensive, causing nontrivial service disruption or downtime even when clusters and failover are employed. In this work we sep ..."
Abstract
-
Cited by 171 (2 self)
A significant fraction of software failures in large-scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expensive, causing nontrivial service disruption or downtime even when clusters and failover are employed. In this work we separate process recovery from data recovery to enable microrebooting -- a fine-grained technique for surgically recovering faulty application components without disturbing the rest of the application.
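As a concrete illustration of the idea, here is a minimal Java sketch of a microrebootable container, assuming session state is kept in a store outside the components; the names (Component, Container, stateStore) are illustrative, not the paper's implementation.

```java
// Minimal microreboot sketch. Component instances are disposable because
// session state lives in a store that survives the reboot (an assumption
// standing in for the paper's state-separation design).
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

interface Component {
    String handle(String request, Map<String, String> sessionState);
}

class Container {
    private final Map<String, Component> components = new ConcurrentHashMap<>();
    // Kept outside the components: data recovery is decoupled from
    // process recovery, so a microreboot loses no session data.
    private final Map<String, String> stateStore = new ConcurrentHashMap<>();

    void deploy(String name, Component c) { components.put(name, c); }

    // Surgically restart one suspected-faulty component; the rest of the
    // application and all session state are untouched.
    void microreboot(String name, Component freshInstance) {
        components.remove(name);              // discard possibly-corrupt instance
        components.put(name, freshInstance);  // re-initialize from scratch
    }

    String dispatch(String name, String request) {
        return components.get(name).handle(request, stateStore);
    }
}
```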
FAB: Building Distributed Enterprise Disk Arrays from Commodity Components
2004
"... This paper describes the design, implementation, and evaluation of a Federated Array of Bricks (FAB), a distributed disk array that provides the reliability of traditional enterprise arrays with lower cost and better scalability. FAB is built from a collection of bricks, small storage appliances con ..."
Abstract
-
Cited by 123 (7 self)
This paper describes the design, implementation, and evaluation of a Federated Array of Bricks (FAB), a distributed disk array that provides the reliability of traditional enterprise arrays with lower cost and better scalability. FAB is built from a collection of bricks, small storage appliances containing commodity disks, CPU, NVRAM, and network interface cards. FAB deploys a new majority-voting-based algorithm to replicate or erasure-code logical blocks across bricks and a reconfiguration algorithm to move data in the background when bricks are added or decommissioned. We argue that voting is practical and necessary for reliable, high-throughput storage systems such as FAB. We have implemented a FAB prototype on a 22-node Linux cluster. This prototype sustains 85 MB/second of throughput for a database workload, and 270 MB/second for a bulk-read workload. In addition, it can outperform traditional master-slave replication through performance decoupling and can handle brick failures and recoveries smoothly without disturbing client requests.
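The voting scheme can be pictured with a rough Java sketch, assuming timestamped block replicas and synchronous calls in place of FAB's parallel RPCs and its full two-phase timestamp protocol; Brick, Versioned, and QuorumArray are illustrative names.

```java
// Hedged sketch of majority-voting replication: a write counts as durable
// once a majority of bricks store (value, timestamp); a read returns the
// highest-timestamped value seen among a majority of replies.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Versioned {
    final byte[] value; final long timestamp;
    Versioned(byte[] v, long ts) { value = v; timestamp = ts; }
}

class Brick {
    private final Map<Long, Versioned> blocks = new HashMap<>();
    void store(long block, Versioned v) { blocks.put(block, v); }
    Versioned load(long block) { return blocks.get(block); }
}

class QuorumArray {
    private final List<Brick> bricks;
    QuorumArray(List<Brick> bricks) { this.bricks = bricks; }
    private int majority() { return bricks.size() / 2 + 1; }

    void write(long block, byte[] data, long timestamp) {
        int acks = 0;
        for (Brick b : bricks) {                  // really parallel RPCs in FAB
            b.store(block, new Versioned(data, timestamp));
            if (++acks >= majority()) return;     // majority stored: durable
        }
    }

    byte[] read(long block) {
        List<Versioned> replies = new ArrayList<>();
        for (Brick b : bricks) {
            Versioned v = b.load(block);
            if (v != null) replies.add(v);
            if (replies.size() >= majority()) break;  // a majority suffices
        }
        return replies.stream().max(Comparator.comparingLong(v -> v.timestamp))
                      .map(v -> v.value).orElse(null);
    }
}
```

The invariant the sketch relies on is that any read majority intersects any write majority, so the highest timestamp a reader sees reflects the latest completed write.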
Detecting Application-Level Failures in Component-based Internet Services
2004
"... Pinpoint is an application-generic framework for using statistical learning techniques to detect and localize likely application-level failures in component-based Internet services. Assuming that most of the system is working most of the time, Pinpoint looks for anomalies in low-level behaviors that ..."
Abstract
-
Cited by 75 (7 self)
Pinpoint is an application-generic framework for using statistical learning techniques to detect and localize likely application-level failures in component-based Internet services. Assuming that most of the system is working most of the time, Pinpoint looks for anomalies in low-level behaviors that are likely to reflect high-level application faults, and correlates these anomalies to their potential causes within the system. In our experiments, Pinpoint correctly detected and localized 70-88% of the faults we injected into our testbed system, depending on the type of fault, compared to the 50-70% detected by current techniques. By demonstrating the applicability of statistical learning and providing an application-generic platform on which additional machine learning techniques can be applied to the problem of fast failure detection, we hope to hasten the adoption of statistical approaches to dependability for complex software systems.
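A toy version of the detection step might look like the following, assuming per-component interaction counts as the "low-level behavior"; the chi-square-style scoring is an illustrative stand-in for Pinpoint's actual statistical models.

```java
// Illustrative sketch (not Pinpoint's code): score each component by how far
// its observed traffic share deviates from a baseline learned while the
// system was assumed healthy; high-scoring components are likely fault sites.
import java.util.HashMap;
import java.util.Map;

class AnomalyDetector {
    // baseline.get(component) = expected fraction of requests touching it
    private final Map<String, Double> baseline = new HashMap<>();

    void learnBaseline(Map<String, Long> healthyCounts) {
        double total = healthyCounts.values().stream().mapToLong(Long::longValue).sum();
        healthyCounts.forEach((c, n) -> baseline.put(c, n / total));
    }

    // Chi-square-style deviation score per component for the current window.
    Map<String, Double> score(Map<String, Long> observedCounts) {
        double total = observedCounts.values().stream().mapToLong(Long::longValue).sum();
        Map<String, Double> scores = new HashMap<>();
        baseline.forEach((c, expectedFrac) -> {
            double expected = expectedFrac * total;
            double observed = observedCounts.getOrDefault(c, 0L);
            scores.put(c, Math.pow(observed - expected, 2) / Math.max(expected, 1e-9));
        });
        return scores;
    }
}
```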
Dynamically Scaling Applications in the Cloud
"... Scalability is said to be one of the major advantages brought by the cloud paradigm and, more specifically, the one that makes it different to an “advanced outsourcing ” solution. However, there are some important pending issues before makingthedreamedautomatedscalingforapplicationscome true. In thi ..."
Abstract
-
Cited by 49 (1 self)
Scalability is said to be one of the major advantages brought by the cloud paradigm and, more specifically, the one that distinguishes it from an "advanced outsourcing" solution. However, some important issues remain open before the dream of automated scaling for applications comes true. In this paper, the most notable initiatives towards whole-application scalability in cloud environments are presented. We present relevant efforts at the edge of state-of-the-art technology, providing an encompassing overview of the trends they each follow. We also highlight pending challenges that will likely be addressed in new research efforts, and present an ideal scalable cloud system. Categories and Subject Descriptors: C.4 [Performance of Systems]: reliability, availability, and serviceability; design studies
OnCall: Defeating spikes with a free-market server cluster
In Proceedings of the 1st International Conference on Autonomic Computing (ICAC)
2004
"... Even with reasonable overprovisioning, today’s Internet application clusters are unable to handle major traffic spikes and flash crowds. As an alternative to fixed-size, dedicated clusters, we propose a dynamically-shared application cluster model based on virtual machines. The system is dubbed “OnC ..."
Abstract
-
Cited by 35 (2 self)
Even with reasonable overprovisioning, today’s Internet application clusters are unable to handle major traffic spikes and flash crowds. As an alternative to fixed-size, dedicated clusters, we propose a dynamically-shared application cluster model based on virtual machines. The system is dubbed “OnCall” for the extra computing capacity that is always on call in case of traffic spikes. OnCall’s approach to spike management relies on the use of an economically-efficient marketplace of cluster resources. OnCall works autonomically by allowing applications to trade computing capacity on a free market through the use of automated market policies; the appropriate applications are then automatically activated on the traded nodes. As demonstrated in our prototype implementation, OnCall allows applications to handle spikes while still maintaining inter-application performance isolation and providing useful resource guarantees to all applications on the cluster.
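A minimal sketch of the market-clearing step, assuming sealed per-node bids and a highest-bidder-first policy; the real OnCall market policies are richer than this.

```java
// Free-market node allocation sketch in the spirit of OnCall. Bid and the
// highest-bidder-first policy are illustrative assumptions, not the paper's
// exact market mechanism.
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class Bid {
    final String app; final double pricePerNode; final int nodesWanted;
    Bid(String app, double price, int nodes) {
        this.app = app; pricePerNode = price; nodesWanted = nodes;
    }
}

class NodeMarket {
    // Grant spare nodes to the highest bidders until capacity runs out;
    // the winning applications are then activated on the traded nodes.
    static void clear(List<Bid> bids, int spareNodes) {
        PriorityQueue<Bid> byPrice =
            new PriorityQueue<>(Comparator.comparingDouble((Bid b) -> b.pricePerNode).reversed());
        byPrice.addAll(bids);
        while (spareNodes > 0 && !byPrice.isEmpty()) {
            Bid b = byPrice.poll();
            int granted = Math.min(b.nodesWanted, spareNodes);
            spareNodes -= granted;
            System.out.printf("activate %s on %d node(s) at %.2f/node%n",
                              b.app, granted, b.pricePerNode);
        }
    }
}
```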
Combining Statistical Monitoring and Predictable Recovery for Self-Management
In Proc. Workshop on Self-Managed Systems
2004
"... Complex distributed Internet services form the basis not only of e-commerce but increasingly of mission-critical network-based applications. What is new is that the workload and internal architecture of three-tier enterprise applications presents the opportunity for a new approach to keeping them ..."
Abstract
-
Cited by 20 (1 self)
Complex distributed Internet services form the basis not only of e-commerce but increasingly of mission-critical network-based applications. What is new is that the workload and internal architecture of three-tier enterprise applications present the opportunity for a new approach to keeping them running in the face of many common recoverable failures. The core of the approach is anomaly detection and localization based on statistical machine learning techniques. Unlike previous approaches, we propose anomaly detection and pattern mining not only for operational statistics such as mean response time, but also for structural behaviors of the system---what parts of the system, in what combinations, are being exercised in response to different kinds of external stimuli. In addition, rather than building baseline models a priori, we extract them by observing the behavior of the system over a short period of time during normal operation. We explain the necessary underlying assumptions and why they can be realized by systems research, report on some early successes using the approach, describe benefits of the approach that make it competitive as a path toward self-managing systems, and outline some research challenges. Our hope is that this approach will enable "new science" in the design of self-managing systems by allowing the rapid and widespread application of statistical learning theory (SLT) techniques to problems of system dependability.
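The "structural behavior" idea can be sketched as follows, assuming each request is summarized by the set of components it exercised; the paper's actual models are learned statistically rather than by exact set membership.

```java
// Sketch: learn, during a short window of normal operation, which component
// combinations each request type exercises, then flag requests whose paths
// fall outside the learned set. Names here are illustrative assumptions.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class PathModel {
    // requestType -> component sets observed during normal operation
    private final Map<String, Set<Set<String>>> normal = new HashMap<>();

    void observeNormal(String requestType, Set<String> componentsUsed) {
        normal.computeIfAbsent(requestType, k -> new HashSet<>()).add(componentsUsed);
    }

    // An unseen combination of components for a request type is anomalous.
    boolean isAnomalous(String requestType, Set<String> componentsUsed) {
        Set<Set<String>> known = normal.get(requestType);
        return known == null || !known.contains(componentsUsed);
    }
}
```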
Centrifuge: Integrated Lease Management and Partitioning for Cloud Services
In Proceedings of USENIX NSDI
2010
"... Abstract: Making cloud services responsive is critical to providing a compelling user experience. Many largescale sites, including LinkedIn, Digg and Facebook, address this need by deploying pools of servers that operate purely on in-memory state. Unfortunately, current technologies for partitioning ..."
Abstract
-
Cited by 17 (3 self)
Making cloud services responsive is critical to providing a compelling user experience. Many large-scale sites, including LinkedIn, Digg and Facebook, address this need by deploying pools of servers that operate purely on in-memory state. Unfortunately, current technologies for partitioning requests across these in-memory server pools, such as network load balancers, lead to a frustrating programming model where requests for the same state may arrive at different servers. Leases are a well-known technique that can provide a better programming model by assigning each piece of state to a single server. However, in-memory server pools host an extremely large number of items, and granting a lease per item requires fine-grained leasing that is not supported in prior datacenter lease managers. This paper presents Centrifuge, a datacenter lease manager that solves this problem by integrating partitioning and lease management. Centrifuge consists of a set of libraries linked in by the in-memory servers and a replicated state machine that assigns responsibility for data items (including leases) to these servers. Centrifuge has been implemented and deployed in production as part of Microsoft’s Live Mesh, a large-scale commercial cloud service in continuous operation since April 2008. When cloud services within Mesh were built using Centrifuge, they required fewer lines of code and did not need to introduce their own subtle protocols for distributed consistency. As cloud services become ever more complicated, this kind of reduction in complexity is an increasingly urgent need.
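A simplified picture of integrated partitioning and leasing, assuming the manager hands out hash ranges rather than per-item leases; the range layout and API below are assumptions, not Centrifuge's actual interface.

```java
// Sketch: the manager assigns hash ranges to servers, so a single lease
// covers every item in a range instead of requiring one lease per item.
import java.util.Map;
import java.util.TreeMap;

class LeaseManager {
    // start of hash range -> owning server; clients consult and cache this
    private final TreeMap<Long, String> owners = new TreeMap<>();

    void assign(long rangeStart, String server) { owners.put(rangeStart, server); }

    // One coarse-grained range lease covers all keys that hash into it,
    // which is what makes per-item leasing unnecessary.
    String ownerOf(String key) {
        long h = key.hashCode() & 0x7fffffffL;
        Map.Entry<Long, String> e = owners.floorEntry(h);
        // Wrap around to the last range when h precedes the first boundary.
        return e != null ? e.getValue() : owners.lastEntry().getValue();
    }
}
```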
Autonomous Recovery in Componentized Internet Applications
2004
"... In this paper we show how to reduce downtime of J2EE appli-cations by rapidly and automatically recovering from transient and intermittent software failures, without requiring applica-tion modifications. Our prototype combines three application-agnostic techniques: macroanalysis for fault detection ..."
Abstract
-
Cited by 16 (1 self)
In this paper we show how to reduce downtime of J2EE applications by rapidly and automatically recovering from transient and intermittent software failures, without requiring application modifications. Our prototype combines three application-agnostic techniques: macroanalysis for fault detection and localization, microrebooting for rapid recovery, and external management of recovery actions. The individual techniques are autonomous and work across a wide range of componentized Internet applications, making them well-suited to the rapidly changing software of Internet services. The proposed framework has been integrated with JBoss, an open-source J2EE application server. Our prototype provides an execution platform that can automatically recover J2EE applications within seconds of the manifestation of a fault. Our system can provide a subset of a system’s active end users with the illusion of continuous uptime, in spite of failures occurring behind the scenes, even when there is no functional redundancy in the system.
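The recovery policy can be sketched as a small control loop, assuming a monitor that names a suspect component; the interfaces stand in for the paper's JBoss integration and are not its actual API.

```java
// Illustrative recovery-policy sketch: try a cheap microreboot of the
// suspected component first, escalating to a full application reboot only
// if the anomaly persists. All interfaces here are assumptions.
interface Monitor { String suspectedFaultyComponent(); } // null if healthy

interface AppServer {
    void microreboot(String component);
    void fullReboot();
}

class RecoveryManager {
    private final Monitor monitor;
    private final AppServer server;
    RecoveryManager(Monitor m, AppServer s) { monitor = m; server = s; }

    void step() throws InterruptedException {
        String suspect = monitor.suspectedFaultyComponent();
        if (suspect == null) return;        // nothing to recover
        server.microreboot(suspect);        // cheap, so try it first
        Thread.sleep(2000);                 // let recovery take effect
        if (suspect.equals(monitor.suspectedFaultyComponent())) {
            server.fullReboot();            // escalate only when needed
        }
    }
}
```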
Tempest: Soft State Replication in the Service Tier
"... Soft state in the middle tier is key to enabling scalable and responsive three tier service architectures. While softstate can be reconstructed upon failure, replicating it across multiple service instances is critical for rapid fail-over and high availability. Current techniques for storing and man ..."
Abstract
-
Cited by 15 (3 self)
Soft state in the middle tier is key to enabling scalable and responsive three-tier service architectures. While soft state can be reconstructed upon failure, replicating it across multiple service instances is critical for rapid fail-over and high availability. Current techniques for storing and managing replicated soft state require mapping data structures to different abstractions such as database records, which can be difficult and introduce inefficiencies. Tempest is a system that provides programmers with data structures that look very similar to conventional Java Collections but are automatically replicated. We evaluate Tempest against alternatives such as in-memory databases, and we show that Tempest scales well in real-world service architectures.
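The programming model Tempest aims for might be sketched like this, assuming a broadcast-based replicator; Tempest's real transport and consistency machinery are more involved.

```java
// Sketch: a data structure that looks like a plain Java Map but pushes every
// update to peer replicas. The Replicator interface is an assumption
// standing in for Tempest's actual replication transport.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

interface Replicator { void broadcast(String key, String value); }

class ReplicatedMap {
    private final Map<String, String> local = new ConcurrentHashMap<>();
    private final Replicator replicator;
    ReplicatedMap(Replicator r) { replicator = r; }

    // Looks like Map.put, but the update also flows to peer replicas,
    // so a fail-over replica already holds the soft state.
    public String put(String key, String value) {
        replicator.broadcast(key, value);
        return local.put(key, value);
    }

    public String get(String key) { return local.get(key); }

    // Called by the transport when a peer's update arrives.
    void applyRemote(String key, String value) { local.put(key, value); }
}
```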
A Microrebootable System -- Design, Implementation, and Evaluation
2004
"... A significant fraction of software failures in large scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expensive, causing nontrivial service disruption or downtime even when clusters and failover are employed. In this work we use ..."
Abstract
-
Cited by 14 (5 self)
A significant fraction of software failures in large-scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expensive, causing nontrivial service disruption or downtime even when clusters and failover are employed. In this work we use separation of process recovery from data recovery to enable frequent use of the microreboot, a fine-grained recovery mechanism that restarts only suspected-faulty application components without disturbing the rest. We evaluate this recovery approach on an eBay-like Internet auction application running on our microreboot-enabled application server. We find that microreboots recover from most of the same failures as full reboots, but do so an order of magnitude faster, resulting in an order of magnitude savings in lost work. Unlike full reboots, microreboot-based recovery is sufficiently inexpensive to be employed at the first sign of failure, even when mistakes in failure detection are likely. The cost of our microreboot-enabling modifications is a reduction of less than 1% in failure-free steady-state throughput.
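Back-of-the-envelope arithmetic shows why cheap recovery changes the policy; the numbers below are assumptions for illustration, not the paper's measurements.

```java
// If a microreboot costs ~1s of disruption versus ~60s for a full reboot
// (assumed figures), recovery can be triggered aggressively, because even
// frequent false alarms cost little.
class RecoveryCost {
    public static void main(String[] args) {
        double fullRebootSec = 60.0;           // assumed full-reboot disruption
        double microRebootSec = 1.0;           // assumed microreboot disruption
        double falseAlarmsPerRealFault = 5.0;  // assumed noisy detector

        double microPolicy = (1 + falseAlarmsPerRealFault) * microRebootSec;
        double fullPolicy = fullRebootSec;     // fires only on confirmed faults
        System.out.printf("microreboot policy: %.0fs lost per fault%n", microPolicy);
        System.out.printf("full-reboot policy: %.0fs lost per fault%n", fullPolicy);
    }
}
```

Under these assumed costs, even a detector that raises five false alarms per real fault loses far less work with aggressive microreboots than a conservative full-reboot policy would.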