CrystalBall: Predicting and Preventing Inconsistencies in Deployed Distributed Systems
Cited by 22 (3 self)
We propose a new approach for developing and deploying distributed systems, in which nodes predict distributed consequences of their actions, and use this information to detect and avoid errors. Each node continuously runs a state exploration algorithm on a recent consistent snapshot of its neighborhood and predicts possible future violations of specified safety properties. We describe a new state exploration algorithm, consequence prediction, which explores causally related chains of events that lead to property violation. This paper describes the design and implementation of this approach, termed CrystalBall. We evaluate CrystalBall on RandTree, BulletPrime, Paxos, and Chord distributed system implementations. We identified new bugs in mature Mace implementations of three systems. Furthermore, we show that if the bug is not corrected during system development, CrystalBall is effective in steering the execution away from inconsistent states at runtime.
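The idea of exploring reachable states from a snapshot and checking a safety property can be illustrated with a small sketch. This is a plain bounded breadth-first search over a toy model, not the consequence prediction algorithm from the paper; the names `explore`, `successors`, and `at_most_one_root`, and the two-node model itself, are assumptions made up for illustration.

```python
from collections import deque

def explore(start, successors, is_safe, max_depth):
    """Breadth-first search from `start`. Returns the first event path that
    reaches a state violating `is_safe`, or None if no violation is found
    within `max_depth` events."""
    seen = {start}
    queue = deque([(start, [])])
    while queue:
        state, path = queue.popleft()
        if not is_safe(state):
            return path
        if len(path) == max_depth:
            continue  # depth bound reached; do not expand further
        for event, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [event]))
    return None

# Hypothetical toy model: state[i] is True when node i believes it is the root.
def successors(state):
    for i, is_root in enumerate(state):
        if not is_root:
            nxt = state[:i] + (True,) + state[i + 1:]
            yield (f"node{i}_becomes_root", nxt)

# Safety property (assumed for the example): at most one root at any time.
def at_most_one_root(state):
    return sum(state) <= 1

violation = explore((False, False), successors, at_most_one_root, max_depth=3)
print(violation)
```

A runtime system in the spirit of the abstract could use the returned event path to decide which pending action to delay or block; how CrystalBall itself does this is described in the paper, not here.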
Cloud Resource Orchestration: A Data-Centric Approach
2011
Cited by 4 (4 self)
Cloud computing provides users near instant access to seemingly unlimited resources, and provides service providers the opportunity to deploy complex information technology infrastructure, as a service, to their customers. Providers benefit from economies of scale and multiplexing gains afforded by sharing of resources through virtualization of the underlying physical infrastructure. However, the scale and highly dynamic nature of cloud platforms impose significant new challenges to cloud service providers. In particular, realizing sophisticated cloud services requires a cloud control framework that can orchestrate cloud resource provisioning, configuration, utilization and decommissioning across a distributed set of physical resources. In this paper we advocate a data-centric approach to cloud orchestration. Following this approach, cloud resources are modeled as structured data that can be queried by a declarative language, and updated with well-defined transactional semantics. We examine the feasibility, benefits and challenges of the approach, and present our design and prototype implementation of the Data-centric Management Framework (DMF) as a solution, with data models, query languages and semantics that are specifically designed for cloud resource orchestration.
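As a rough illustration of modeling resources as structured data that can be queried and updated transactionally, the following sketch keeps records in memory and commits an update only if every constraint still holds. It is not DMF's data model or query language; `ResourceStore`, `within_capacity`, `add_vm`, and the record fields are hypothetical names chosen for the example.

```python
import copy

class ResourceStore:
    """Toy store: resources are dicts; updates are all-or-nothing."""

    def __init__(self, records, constraints):
        self.records = records          # list of dicts, one per resource
        self.constraints = constraints  # callables: records -> bool

    def query(self, predicate):
        return [r for r in self.records if predicate(r)]

    def transact(self, update):
        """Apply `update` to a copy and commit only if all constraints hold."""
        draft = copy.deepcopy(self.records)
        update(draft)
        if all(check(draft) for check in self.constraints):
            self.records = draft
            return True
        return False  # the draft is discarded, so the store is unchanged

# Assumed constraint: VMs placed on a host may not exceed its memory.
def within_capacity(records):
    hosts = {r["id"]: r["mem_gb"] for r in records if r["kind"] == "host"}
    used = {}
    for r in records:
        if r["kind"] == "vm":
            used[r["host"]] = used.get(r["host"], 0) + r["mem_gb"]
    return all(used.get(h, 0) <= cap for h, cap in hosts.items())

store = ResourceStore(
    [{"kind": "host", "id": "h1", "mem_gb": 8},
     {"kind": "vm", "id": "vm1", "host": "h1", "mem_gb": 4}],
    [within_capacity],
)

def add_vm(vm_id, mem_gb):
    def update(records):
        records.append({"kind": "vm", "id": vm_id, "host": "h1", "mem_gb": mem_gb})
    return update

ok = store.transact(add_vm("vm2", 4))        # fits: 4 + 4 <= 8
rejected = store.transact(add_vm("vm3", 1))  # would need 9 GB on an 8 GB host
vms = store.query(lambda r: r["kind"] == "vm")
```

Copying the whole record list per transaction keeps the sketch short; a real framework would need finer-grained isolation and would also have to undo actions already taken on physical resources.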
Declarative Automated Cloud Resource Orchestration
Cited by 4 (4 self)
As cloud computing becomes widely deployed, one of the challenges faced involves the ability to orchestrate a highly complex set of subsystems (compute, storage, network resources) that span large geographic areas serving diverse clients. To ease this process, we present COPE (Cloud Orchestration Policy Engine), a distributed platform that allows cloud providers to perform declarative automated cloud resource orchestration. In COPE, cloud providers specify system-wide constraints and goals using COPElog, a declarative policy language geared towards specifying distributed constraint optimizations. COPE takes policy specifications and cloud system states as input and then optimizes compute, storage and network resource allocations within the cloud such that provider operational objectives and customer SLAs can be better met. We describe our proposed integration with a cloud orchestration platform, and present initial evaluation results that demonstrate the viability of COPE using production traces from a large hosting company in the US. We further discuss an orchestration scenario that involves geographically distributed data centers, and conclude with the current status of our work.
Understanding Transactional Memory Performance
Cited by 2 (0 self)
Transactional memory promises to generalize transactional programming to mainstream languages and data structures. The purported benefit of transactions is that they are easier to program correctly than fine-grained locking and perform just as well. This performance claim is not always borne out because an application may violate a common-case assumption of the TM designer or because of external system effects. This paper carefully studies a range of factors that can adversely influence transactional memory performance. In order to help programmers assess the suitability of their code for transactional memory, this paper introduces a formal model of transactional memory as well as a tool, called Syncchar. Syncchar can predict the speedup of a conversion from locks to transactions within 25% for the STAMP benchmarks. We also use the Syncchar tool to diagnose and eliminate a starvation pathology in the TxLinux kernel, improving the performance of the Modified Andrew Benchmark by 55% over Linux. The paper also presents the first detailed study of how the performance of user-level transactional programs (from the STAMP benchmarks) is influenced by factors outside of the transactional memory system. The study includes data about the interaction of transactional programs with the architecture, memory allocator, and compiler. Because many factors influence the performance of transactional programs, getting good performance from transactions is more difficult than commonly appreciated.
Tolerating Concurrency Bugs Using Transactions as Lifeguards
Cited by 1 (0 self)
Parallel programming is hard, because it is impractical to test all possible thread interleavings. One promising approach to improve a multi-threaded program’s reliability is to constrain a production run’s thread interleavings in such a way that untested interleavings are avoided as much as possible. Such an approach would avoid hard-to-test rare thread interleavings in production runs, and thereby improve correctness. However, a key challenge in realizing this goal is in determining thread interleaving constraints from the tested correct interleavings, and enforcing them efficiently in production runs. In this paper, we propose a new method to determine thread interleaving constraints from the tested interleavings in the form of lifeguard transactions (LifeTxes). An untested code region initially is contained in a single LifeTx. As the code region is tested over more thread interleavings, its original LifeTx is automatically split into multiple smaller LifeTxes so that the newly tested interleavings are permitted in production runs. To efficiently enforce LifeTx constraints in production runs, we propose a hardware design similar to the eager conflict detection capability that exists in conventional hardware transactional memory (TM) systems, but without the need for versioning, rollback and unbounded TM support. We show that 11 out of 14 real concurrency bugs in programs like Apache, MySQL and Mozilla could be avoided using the proposed approach for a negligible performance overhead.
Atomic Boxes: Coordinated Exception Handling with Transactional Memory
Cited by 1 (1 self)
In concurrent programs, raising an exception in one thread does not prevent others from operating on an inconsistent shared state. Instead, exceptions should ideally be handled in coordination by all the threads that are affected by their cause. In this paper, we propose a Java language extension for coordinated exception handling where a named abox (atomic box) is used to demarcate a region of code that must execute atomically and in isolation. Upon an exception raised inside an abox, threads executing in dependent aboxes roll back their changes and execute their recovery handler in coordination. We provide a dedicated compiler framework, CXH, to experimentally evaluate our atomic box construct. Our evaluation indicates that, in addition to enabling recovery, an atomic box executes a reasonably small region of code twice as fast as when using a failbox, the existing coordination alternative that has no recovery support.
Improving Server Applications with System Transactions
Cited by 1 (0 self)
Server applications must process requests as quickly as possible. Because some requests depend on earlier requests, there is often a tension between increasing throughput and maintaining the proper semantics for dependent requests. Operating system transactions make it easier to write reliable, high-throughput server applications because they allow the application to execute non-interfering requests in parallel, even if the requests operate on OS state, such as file data. By changing less than 200 lines of application code, we improve performance of a replicated Byzantine Fault Tolerant (BFT) system by up to 88% using server-side speculation, and we improve concurrent performance up to 80% for an IMAP email server by changing only 40 lines. Achieving these results requires substantial enhancements to system transactions, including the ability to pause and resume transactions, and an API to commit transactions in a pre-defined order.
MECHANISMS FOR UNBOUNDED, CONFLICT-ROBUST HARDWARE TRANSACTIONAL MEMORY
COPYRIGHT 2010 Colin Blundell. This dissertation is dedicated to my wife, Angelina. Without you, this would not have been possible. Acknowledgements: This dissertation would not have been possible without the love and support of my family. My deepest thanks go to my wife, Angelina. The opportunity to meet her has been the greatest reward of my decision to go to graduate school. She is both the source of my success and the reason that this success has meaning. I also thank Jacob for the joy that he has brought to my life and Merlin for his constant good humor, support, and loyalty. The support of my mother, father, and brother has been instrumental in me reaching this point. They have shared in the joy of my successes and have helped me weather the setbacks. The foundation of my later success was laid in my parents' teaching when I was young. My brother David has been there with me through the ups and downs of our entire lives; he is the best friend a brother could ever hope for.
AUTOMATIC RECOVERY FOR REQUEST ORIENTED SYSTEMS (© 2010 by Andrew David Lenharth)
Gracefully recovering from software and hardware faults is important to ensuring highly reliable and available systems. Operating systems have privileged access to all aspects of system operation, thus a fault related to them is able to affect the entire system. Existing approaches to operating system recovery either do not protect the entire system or require a completely new operating system design. This dissertation presents a new approach to fault recovery in operating systems called Recovery Domains. This approach allows recovery from unanticipated faults in commodity operating systems. Recovery is organized around the concept of a dynamic request. Operating system entry points initiate requests to perform some action. System calls, for example, are a request by an application to the operating system. When a fault is detected, the recovery system rolls back the effects of the offending recovery domain while leaving the remainder of the system running. To ensure that the entire system (including the state of other concurrent kernel threads) remains consistent
Operating System Support for Application-Specific Speculation
Speculative execution is a technique that allows serial tasks to execute in parallel. An implementation of speculative execution can be divided into two parts: (1) a policy that specifies what operations and values to predict, what actions to allow during speculation, and how to compare results; and (2) the mechanisms that support speculative execution, such as checkpointing, rollback, causality tracking, and output buffering. In this paper, we show how to separate policy from mechanism. We implement a speculation mechanism in the operating system, where it can coordinate speculations across all applications and kernel state. Policy decisions are delegated to applications, which have the most semantic information available to direct speculation. We demonstrate how custom policies can be used in existing applications to add new features that would otherwise be difficult to implement. Using custom policies in our separated speculation system, we can hide 85% of program load time by predicting the program’s launch, decrease SSL connection latency by 15% in Firefox, and increase a BFT client’s request rate by 82%. Despite the complexity of the applications, small modifications can implement these features since they only specify policy choices and rely on the system to realize those policies. We provide this increased programmability with a modest performance trade-off, executing only 8% slower than an optimized, application-implemented speculation system.
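The split between policy and mechanism described above can be sketched in a few lines. This is a user-level toy, not the operating system interface from the paper: `speculate`, the `policy` dictionary with its `predict` and `equal` entries, and the `handle` function are all hypothetical names, and the slow operation runs sequentially after the speculative work rather than in parallel with it.

```python
def speculate(policy, slow_op, continuation):
    """Mechanism: run `continuation` on a predicted value with buffered
    output, then validate the prediction against the real result."""
    predicted = policy["predict"]()
    buffered = []  # speculative output is held back, not released
    result = continuation(predicted, buffered.append)
    actual = slow_op()
    if policy["equal"](predicted, actual):
        return result, buffered, True   # commit the speculative work
    buffered = []                       # roll back: discard speculative output
    result = continuation(actual, buffered.append)
    return result, buffered, False

# Policy (assumed for the example): predict that the slow call returns 200,
# and compare predictions to actual values by plain equality.
policy = {"predict": lambda: 200, "equal": lambda a, b: a == b}

def handle(status, emit):
    emit(f"status={status}")
    return status == 200

hit = speculate(policy, lambda: 200, handle)   # prediction correct
miss = speculate(policy, lambda: 404, handle)  # prediction wrong, re-executed
```

The mechanism never inspects what is being predicted; swapping in a different `policy` changes what is speculated on without touching `speculate`, which is the separation the abstract argues for.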

