Skipnet: A scalable overlay network with practical locality properties
, 2003
Cited by 253 (5 self)
Scalable overlay networks such as Chord, Pastry, and Tapestry have recently emerged as a flexible infrastructure for building large peer-to-peer systems. In practice, two disadvantages of such systems are that it is difficult to control where data is stored and difficult to guarantee that routing paths remain within an administrative domain. SkipNet is a scalable overlay network that provides controlled data placement and routing locality guarantees by organizing data primarily by lexicographic key ordering. SkipNet also allows for both fine-grained and coarse-grained control over data placement, where content can be placed either on a pre-determined node or distributed uniformly across the nodes of a hierarchical naming subtree. An additional useful consequence of SkipNet’s locality properties is that partition failures, in which an entire organization disconnects from the rest of the system, result in two disjoint, but well-connected overlay networks.
Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
, 2007
Cited by 108 (7 self)
Component failure in large-scale IT installations is becoming an ever larger problem as the number of components in a single cluster approaches a million. In this paper, we present and analyze field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites. About 100,000 disks are covered by this data, some for an entire lifetime of five years. The data include drives with SCSI and FC, as well as SATA interfaces. The mean time to failure (MTTF) of those drives, as specified in their datasheets, ranges from 1,000,000 to 1,500,000 hours, suggesting a nominal annual failure rate of at most 0.88%. We find that in the field, annual disk replacement rates typically exceed 1%, with 2-4% common and up to 13% observed on some systems. This suggests that field replacement is a fairly different process than one might predict based on datasheet MTTF. We also find evidence, based on records of disk replacements in the field, that failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation. That is, replacement rates in our data grew constantly with age, an effect often assumed not to set in until after a nominal lifetime of 5 years. Interestingly, we observe little difference in replacement rates between SCSI, FC and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component-specific factors. On the other hand, we see only one instance of a customer rejecting an entire population of disks as a bad batch, in this case because of media error rates, and this instance involved SATA disks. Time between replacement, a proxy for time between failure, is not well modeled by an exponential distribution and exhibits significant levels of correlation, including autocorrelation and long-range dependence.
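The "at most 0.88%" figure in the abstract above can be reproduced with simple arithmetic: divide the hours in a year by the datasheet MTTF. The following is a minimal sketch, assuming a constant failure rate and 8,760 hours per year; the name `nominal_afr` is illustrative and does not come from the paper.

```python
# Assumption: a year is 365 days of continuous operation (8,760 hours).
HOURS_PER_YEAR = 24 * 365


def nominal_afr(mttf_hours: float) -> float:
    """Nominal annual failure rate (as a fraction) implied by a datasheet MTTF.

    Uses the common approximation AFR = hours per year / MTTF, which
    assumes the failure rate is constant over the drive's lifetime.
    """
    return HOURS_PER_YEAR / mttf_hours


if __name__ == "__main__":
    # The two datasheet MTTF values quoted in the abstract.
    for mttf in (1_000_000, 1_500_000):
        print(f"MTTF {mttf:,} h -> nominal AFR {nominal_afr(mttf):.2%}")
```

An MTTF of 1,000,000 hours gives 8,760 / 1,000,000, or about 0.88% per year, and the longer 1,500,000-hour MTTF gives a lower rate, which is why 0.88% is the upper bound across the quoted datasheets.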
Path-Based Failure and Evolution Management
- In Proceedings of the International Symposium on Networked Systems Design and Implementation (NSDI’04)
, 2004
Cited by 91 (5 self)
We present a new approach to managing failures and evolution in large, complex distributed systems using runtime paths. We use the paths that requests follow as they move through the system as our core abstraction, and our "macro" approach focuses on component interactions rather than the details of the components themselves. Paths record component performance and interactions, are user- and request-centric, and occur in sufficient volume to enable statistical analysis, all in a way that is easily reusable across applications. Automated statistical analysis of multiple paths allows for the detection and diagnosis of complex failures and the assessment of evolution issues. In particular, our approach enables significantly stronger capabilities in failure detection, failure diagnosis, impact analysis, and understanding system evolution. We explore these capabilities with three real implementations, two of which service millions of requests per day. Our contributions include the approach; the maintainable, extensible, and reusable architecture; the various statistical analysis engines; and the discussion of our experience with a high-volume production service over several years.
Undo for Operators: Building an Undoable E-mail Store
- In Proceedings of the 2003 USENIX Annual Technical Conference
, 2003
Cited by 65 (3 self)
System operators play a critical role in maintaining server dependability yet lack powerful tools to help them do so. To help address this unfulfilled need, we describe Operator Undo, a tool that provides a forgiving operations environment by allowing operators to recover from their own mistakes, from unanticipated software problems, and from intentional or accidental data corruption. Operator Undo starts by intercepting and logging user interactions with a network service before they enter the system, creating a record of user intent. During an undo cycle, all system hard state is physically rewound, allowing the operator to perform arbitrary repairs; after repairs are complete, lost user data is reintegrated into the repaired system by replaying the logged user interactions while tracking and compensating for any resulting externally-visible inconsistencies. We describe the design and implementation of an application-neutral framework for Operator Undo, and detail the process by which we instantiated the framework in the form of an undo-capable e-mail store supporting SMTP mail delivery and IMAP mail retrieval. Our proof-of-concept e-mail implementation imposes only a small performance overhead, and can store days or weeks of recovery log on a single disk.
A large-scale study of failures in high-performance computing systems
- In Proc. of the 2006 International Conference on Dependable Systems and Networks (DSN’06)
, 2006
PeerReview: Practical accountability for distributed systems
Cited by 62 (8 self)
We describe PeerReview, a system that provides accountability in distributed systems. PeerReview ensures that Byzantine faults whose effects are observed by a correct node are eventually detected and irrefutably linked to a faulty node. At the same time, PeerReview ensures that a correct node can always defend itself against false accusations. These guarantees are particularly important for systems that span multiple administrative domains, which may not trust each other. PeerReview works by maintaining a secure record of the messages sent and received by each node. The record is used to automatically detect when a node’s behavior deviates from that of a given reference implementation, thus exposing faulty nodes. PeerReview is widely applicable: it only requires that a correct node’s actions are deterministic, that nodes can sign messages, and that each node is periodically checked by a correct node. We demonstrate that PeerReview is practical by applying it to three different types of distributed systems: a network filesystem, a peer-to-peer system, and an overlay multicast system.
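One common way to keep a tamper-evident record of sent and received messages is a hash-chained, append-only log. The abstract above does not say how PeerReview's secure record is built, so the following is only an illustrative sketch under that assumption; the class `MessageLog` and the function `entry_hash` are hypothetical names, and message signing is omitted.

```python
import hashlib

# Assumption: the chain starts from a fixed all-zero hash.
GENESIS = b"\x00" * 32


def entry_hash(prev_hash: bytes, seq: int, direction: str, payload: bytes) -> bytes:
    """Hash one log entry together with the hash of the entry before it."""
    h = hashlib.sha256()
    h.update(prev_hash)
    h.update(seq.to_bytes(8, "big"))
    h.update(direction.encode("utf-8"))
    h.update(hashlib.sha256(payload).digest())
    return h.digest()


class MessageLog:
    """Append-only log in which each entry commits to all earlier entries."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, bytes, bytes]] = []

    def append(self, direction: str, payload: bytes) -> bytes:
        """Record a message ("send" or "recv") and return the new top hash."""
        prev = self.entries[-1][2] if self.entries else GENESIS
        digest = entry_hash(prev, len(self.entries), direction, payload)
        self.entries.append((direction, payload, digest))
        return digest

    def verify(self) -> bool:
        """Recompute the chain and report whether every stored hash matches."""
        prev = GENESIS
        for seq, (direction, payload, digest) in enumerate(self.entries):
            if entry_hash(prev, seq, direction, payload) != digest:
                return False
            prev = digest
        return True


if __name__ == "__main__":
    log = MessageLog()
    log.append("send", b"hello")
    log.append("recv", b"ack")
    print("log intact:", log.verify())
```

Because each hash covers its predecessor, altering or dropping an earlier entry changes every later hash, so a checker holding a recent top hash can detect the modification.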
Practical Dynamic Software Updating
, 2008
Cited by 55 (20 self)
This dissertation makes the case that programs can be updated while they run, with modest programmer effort, while providing certain update safety guarantees, and without imposing a significant performance overhead. Few systems are designed with on-the-fly updating in mind. Those systems that permit it support only a very limited class of updates, and generally provide no guarantees that following the update, the system will behave as intended. We tackle the on-the-fly updating problem using a compiler-based approach called dynamic software updating (DSU), in which a program is patched with new code and data while it runs. The challenge is in making DSU practical: it should support changes to programs as they occur in practice, yet be safe, easy to use, and not impose a large overhead. This dissertation makes both theoretical contributions—formalisms for reasoning about, and ensuring update safety—and practical contributions—Ginseng, a DSU implementation for C. Ginseng supports a broad range of changes to C programs, and performs a suite of safety analyses to ensure certain update safety
Detecting Application-Level Failures in Component-based Internet Services
, 2004
Cited by 53 (10 self)
Pinpoint is an application-generic framework for using statistical learning techniques to detect and localize likely application-level failures in component-based Internet services. Assuming that most of the system is working most of the time, Pinpoint looks for anomalies in low-level behaviors that are likely to reflect high-level application faults, and correlates these anomalies to their potential causes within the system. In our experiments, Pinpoint correctly detected and localized 70-88% of the faults we injected into our testbed system, depending on the type of fault, as compared to the 50-70% detected by current techniques. By demonstrating the applicability of statistical learning and providing an application-generic platform on which additional machine learning techniques can be applied to the problem of fast failure detection, we hope to hasten the adoption of statistical approaches to dependability for complex software systems.
Understanding and dealing with operator mistakes in internet services
- In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI ’04)
, 2004
Cited by 42 (12 self)
Operator mistakes are a significant source of unavailability in modern Internet services. In this paper, we first characterize these mistakes by performing an extensive set of experiments using human operators and a realistic three-tier auction service. The mistakes we observed range from software misconfiguration, to fault misdiagnosis, to incorrect software restarts. We next propose to validate operator actions before they are made visible to the rest of the system. We demonstrate how to accomplish this task via the creation of a validation environment that is an extension of the online system, where components can be validated using real workloads before they are migrated into the running service. We show that our prototype validation system can detect 66% of the operator mistakes that we have observed.
CONMan: A Step Towards Network Manageability
- In Proc. of ACM SIGCOMM
, 2007
Cited by 36 (1 self)
Networks are hard to manage and in spite of all the so-called holistic management packages, things are getting worse. We argue that the difficulty of network management can partly be attributed to a fundamental flaw in the existing architecture: protocols expose all their internal details and hence, the complexity of the ever-evolving data plane encumbers the management plane. Guided by this observation, in this paper we explore an alternative approach and propose Complexity Oblivious Network Management (CONMan), a network architecture in which the management interface of data-plane protocols includes minimal protocol-specific information. This restricts the operational complexity of protocols to their implementation and allows the management plane to achieve high-level policies in a structured fashion. We built the CONMan interface of a few protocols and a management tool that can achieve high-level configuration goals based on this interface. Our preliminary experience with applying this tool to real-world VPN configuration indicates the architecture’s potential to alleviate the difficulty of configuration management.

