CrystalBall: Predicting and Preventing Inconsistencies in Deployed Distributed Systems
Cited by 22 (3 self)
We propose a new approach for developing and deploying distributed systems, in which nodes predict distributed consequences of their actions, and use this information to detect and avoid errors. Each node continuously runs a state exploration algorithm on a recent consistent snapshot of its neighborhood and predicts possible future violations of specified safety properties. We describe a new state exploration algorithm, consequence prediction, which explores causally related chains of events that lead to property violation. This paper describes the design and implementation of this approach, termed CrystalBall. We evaluate CrystalBall on RandTree, BulletPrime, Paxos, and Chord distributed system implementations. We identified new bugs in mature Mace implementations of three systems. Furthermore, we show that if the bug is not corrected during system development, CrystalBall is effective in steering the execution away from inconsistent states at runtime.
Distributed Clustering for Robust Aggregation in Large Networks
Cited by 4 (1 self)
We present a scalable protocol for robust data aggregation in a large, error-prone network. The protocol aggregates the multidimensional distribution of any number of data samples (sensor reads) and removes data errors using constant size synopses, by clustering samples and detecting outliers. Initial simulations show that the protocol achieves robustness to both crashes and data errors.
State Monitoring in Cloud Datacenters
2011
Cited by 1 (1 self)
Monitoring the global state of a distributed cloud application is a critical function of cloud datacenter management. State monitoring must meet two demanding objectives: a high level of correctness, which ensures a zero or low error rate, and high communication efficiency, which demands minimal communication cost in detecting state updates. Most existing work follows an instantaneous model, which triggers a state alert whenever a constraint is violated. This model may cause frequent and unnecessary alerts due to momentary value bursts and outliers, and countermeasures taken in response to such alerts may in turn cause problematic operations. In this paper, we present a WIndow-based StatE monitoring (WISE) framework for efficiently managing cloud applications. Window-based state monitoring reports an alert only when a state violation persists continuously within a time window. We show that it is not only more resilient to value bursts and outliers, but can also save considerable communication when implemented in a distributed manner, based on four technical contributions. First, we present the architectural design and deployment options for window-based state monitoring with centralized parameter tuning. Second, we develop a new distributed parameter tuning scheme that enables WISE to scale to many more monitoring nodes, as each node tunes its monitoring parameters reactively without global information. Third, we introduce two optimization techniques, including their design rationale, correctness and usage model, to further reduce the communication cost. Finally, we provide an in-depth empirical study of the scalability of WISE, and evaluate the improvement brought by the distributed tuning scheme and the two performance optimizations. Our results show that WISE reduces communication by 50-90 percent compared with instantaneous monitoring approaches, and that the improved WISE gains a clear scalability advantage over its centralized version.
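The difference between the instantaneous model and the window-based model described in this abstract can be illustrated with a minimal single-stream sketch. This is not the WISE implementation: the function names, the threshold, and the sample values are hypothetical, and the distributed parameter tuning that the abstract describes is not modeled here.

```python
def instantaneous_alerts(values, threshold):
    """Indices at which a sample exceeds the threshold (one alert per violation)."""
    return [i for i, v in enumerate(values) if v > threshold]


def window_alerts(values, threshold, window):
    """Indices at which the threshold has been exceeded for `window` consecutive samples."""
    alerts = []
    run = 0  # length of the current run of consecutive violations
    for i, v in enumerate(values):
        run = run + 1 if v > threshold else 0
        if run >= window:
            alerts.append(i)
    return alerts


# Hypothetical samples: one momentary burst at index 1, one sustained violation at 4..7.
samples = [10, 95, 12, 11, 91, 92, 93, 94, 10]
burst_and_sustained = instantaneous_alerts(samples, threshold=90)
sustained_only = window_alerts(samples, threshold=90, window=3)
```

In this sketch the momentary burst triggers an instantaneous alert but never a window-based one, which is the resilience property the abstract claims for window-based monitoring.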
Self-Tuning, Bandwidth-Aware Monitoring for Dynamic Data Streams
SMART is a monitoring system that maximizes result precision of continuous aggregate queries over dynamic data streams. While prior approaches minimize bandwidth cost under fixed precision constraints, they may still overload a monitoring system during traffic bursts. To facilitate practical deployment of monitoring systems, SMART therefore bounds the worst-case bandwidth cost for overload resilience. The primary challenge for SMART is how to dynamically select updates at each node to maximize query precision while keeping per-node monitoring bandwidth below a specified budget. To address this challenge, SMART’s hierarchical algorithm (1) allocates bandwidth budgets in a near-optimal manner to maximize global precision and (2) self-tunes bandwidth settings to improve precision under dynamic workloads. Our prototype implementation of SMART provides key solutions to (a) prioritize pending updates for multi-attribute queries, (b) build bounded fan-in, load-aware aggregation trees to improve accuracy, and (c) combine temporal batching with arithmetic filtering to reduce load and to quantify result staleness. Our evaluation using simulations and a network monitoring application shows that SMART incurs low overheads, improves accuracy by up to an order of magnitude compared to uniform bandwidth allocation, and performs close to the optimal algorithm under modest bandwidth budgets.
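Arithmetic filtering, one of the load-reduction techniques named in this abstract, is commonly understood as suppressing an update while the new value stays within a tolerance of the last value reported. The following is a generic sketch of that idea only. The function name, the tolerance `delta`, and the readings are hypothetical, and SMART's budget allocation, prioritization, and temporal batching are not modeled.

```python
def arithmetic_filter(readings, delta):
    """Return (time, value) pairs for readings that differ from the last
    reported value by more than `delta`; all other readings are suppressed."""
    reported = []
    last = None  # last value actually sent upstream
    for t, value in enumerate(readings):
        if last is None or abs(value - last) > delta:
            reported.append((t, value))
            last = value
    return reported


# Hypothetical sensor readings; only three of the seven are sent.
readings = [100, 101, 103, 110, 111, 109, 120]
sent = arithmetic_filter(readings, delta=5)
```

A larger `delta` sends fewer updates at the cost of precision, which is the kind of trade-off a bandwidth-aware monitor has to tune.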
Robust Large-Scale Distributed Systems
2008
My research has focused on constructing robust large-scale distributed systems. The bulk of this work can be understood in the context of two intertwined efforts: constructing cooperative and peer-to-peer services and understanding the fundamental principles of large-scale data replication.
Cooperative and peer-to-peer services
Cooperative and peer-to-peer services both seek to provide a way to scale services beyond what can be provided by even a high-end server machine. Cooperative services do this by treating a service as a parallel program and then running the program across a cluster of machines. Key challenges include rearchitecting services not only to provide good performance by balancing parallelism and locality but also to ensure good reliability and simple management. Peer-to-peer services go further and enlist machines controlled by different users to collectively provide a service to each other. In addition to the problems of cooperative services, peer-to-peer services have to cope with new issues of trust that arise when a service runs across machines spanning multiple administrative domains with limited trust or competing interests. Some highlights of this stream of work include:
• Serverless file systems and cooperative caching. We constructed xFS [ADN+96] to explore an extreme point in the design space of constructing file systems as parallel programs: xFS’s goals included
Almost-Invariants: From Bugs in Distributed Systems to Invariants
It is notoriously hard to develop dependable distributed systems. This is partly due to the difficulties in foreseeing various corner cases and failure scenarios while implementing a system that will be deployed over an asynchronous network. In contrast, reasoning about the desired distributed system behavior and the corresponding invariants is easier than reasoning about the code itself. Further, the invariants can be used for testing, theorem proving, and runtime enforcement. In this paper, we propose an approach to observe the system behavior and automatically infer invariants which reveal implementation bugs. Using our tool, Avenger, we automatically generate a large number of potentially relevant properties, check them within the time and spatial domains using traces of system executions, and filter out all but a few properties before reporting them to the developer. Our key insight in filtering is that a good candidate for an invariant is the one that holds in all but a few cases, i.e., an “almost-invariant”. Our experimental results with the BGP, RandTree, and Chord implementations demonstrate Avenger’s ability to identify the almost-invariants that lead the developer to programming errors.
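The filtering insight in this abstract (keep candidate properties that hold in all but a few observed cases) can be sketched as follows. This is not Avenger's code. The function name, the 5% cutoff, the choice to discard properties with zero violations, and the synthetic trace are all assumptions made for illustration.

```python
def almost_invariants(candidates, states, max_violation_rate=0.05):
    """Map each candidate property name to the indices of the states that violate it,
    keeping only properties violated at least once but in at most
    `max_violation_rate` of the observed states (an assumed cutoff)."""
    if not states:
        return {}
    kept = {}
    for name, predicate in candidates.items():
        violations = [i for i, s in enumerate(states) if not predicate(s)]
        rate = len(violations) / len(states)
        if 0 < rate <= max_violation_rate:
            kept[name] = violations
    return kept


# Synthetic trace of 40 node states; state 17 is a hypothetical buggy state.
trace = [{"children": i % 3, "depth": i % 5} for i in range(40)]
trace[17]["children"] = 5

candidates = {
    "children_le_2": lambda s: s["children"] <= 2,  # violated once
    "depth_nonneg": lambda s: s["depth"] >= 0,      # never violated
    "children_eq_0": lambda s: s["children"] == 0,  # violated often
}
suspects = almost_invariants(candidates, trace)
```

Only `children_le_2` survives the filter, and the single violating state it reports is where a developer would start looking for a bug.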
Do you know your IQ? A research agenda for information quality in systems
Information quality (IQ) is a measure of how fit information is for a purpose. Sometimes called Quality of Information (QoI) by analogy with Quality of Service (QoS), it quantifies whether the correct information is being used to make a decision or take an action. Failure to understand whether information is of adequate quality can lead to bad decisions and catastrophic effects. The results can include system outages, increased costs, lost revenue, and worse. Quantifying information quality can help improve decision making, but the ultimate goal should be to select or construct information sources that have the appropriate balance between information quality and the cost of providing it. In this paper, we provide a brief introduction to the field, argue the case for applying information quality metrics in the systems domain, and propose a research agenda to explore this space.
Keeping Track of 70,000+ Servers: The Akamai Query System
The Akamai platform is a network of over 73,000 servers supporting numerous web infrastructure services including the distribution of static and dynamic HTTP content, delivery of live and on-demand streaming media, high-availability storage, accelerated web applications, and intelligent routing. The maintenance of such a network requires significant monitoring infrastructure to enable detailed understanding of its state at all times. For that purpose, Akamai has developed and uses Query, a distributed monitoring system in which all Akamai machines participate. Query collects data at the edges of the Internet and aggregates it at several hundred places to be used to answer SQL queries about the state of the Akamai network. We explain the design of Query, outline some of its critical features, discuss who some of its users are and what Query allows them to do, and explain how Query scales to meet demand as the Akamai network grows.
Scaling a Monitoring Infrastructure for the Akamai Network
We describe the design of, and experience with, Query, a monitoring system that supports the Akamai EdgePlatform. Query is a foundation of Akamai’s approach to administering its distributed computing platform, allowing administrators, operations staff, developers, customers, and automated systems near real-time access to data about activity in Akamai’s network. Users extract information regarding the current state of the network via a SQL-like interface. Versions of Query have been deployed since the inception of Akamai’s platform, and it has scaled to support a distributed platform of 60,000+ servers, collecting over 200 gigabytes of data and answering over 30,000 queries approximately every 2 minutes.
Categories and Subject Descriptors: C.2.4 [Distributed Systems]: Distributed applications, Distributed databases
A Quality-Centric Data Model for Distributed Stream Management Systems
It is challenging for large-scale stream management systems to always return perfect results when processing data streams originating from distributed sources. Data sources and intermediate processing nodes may fail during the lifetime of a stream query. In addition, individual nodes may become overloaded due to processing demands. In practice, users have to accept incomplete or inaccurate query results because of failure or overload. In this case, stream processing systems would benefit from knowing the impact of imperfect processing on data quality when making decisions about query optimisation and fault recovery. In addition, users would want to know how much the result quality was degraded. In this paper, we propose a quality-centric relational stream data model that can be used together with existing query processing methods over distributed data streams. Besides giving useful feedback about the quality of tuples to users, the model provides the distributed stream management system with information on how to optimise query processing and enhance fault tolerance. We demonstrate how our data model can be applied to an existing distributed stream management system. Our evaluation shows that it enables quality-aware load-shedding, while introducing only a small per-tuple overhead.

