Consistency in Distributed Systems

Sebastian Burckhardt
Abstract. Data replication is a common technique for programming distributed systems, and is often important to achieve performance or reliability goals. Unfortunately, the replication of data can compromise its consistency, and thereby break programs that are unaware. In particular, in weakly consistent systems, programmers must assume some responsibility to properly deal with queries that return stale data, and to avoid state corruption under conflicting updates. The fundamental tension between performance (favoring weak consistency) and correctness (favoring strong consistency) is a recurring theme when designing concurrent and distributed systems, and is both practically relevant and of theoretical interest. In this course, we investigate how to understand and formalize consistency guarantees, and how we can determine if a system implementation is correct with respect to such specifications. We start by examining consensus, a classic problem in distributed systems, and then proceed to study various specifications and implementations of eventually consistent systems.

As more and more developers write programs that execute on a virtualized cloud infrastructure, they find themselves confronted with the subtleties that have long been the hallmark of distributed systems research. Devising message protocols, reading and writing weakly consistent shared data, and handling failures are notoriously challenging, and are gaining relevance for a new generation of developers. With this in mind, I devised this course to provide a mix of techniques and results that may prove either interesting, or useful, or both.

In the first half, I present well-known results and techniques from the area of distributed systems research, including:

- A beautiful, classic result: the impossibility of implementing consensus in the presence of silent crashes on an asynchronous system.

In the second half, I focus on the main topic, which is consistency models for shared data. This part includes:

- A formalization of strong consistency (sequential consistency, linearizability) and a proof of the CAP theorem.

These lecture notes are not meant to serve as a transcript. Rather, their purpose is to complement the slides. Update: since giving the original lectures at the LASER summer school, I have expanded and revised much of the material presented in Sects. 3 and 4. The result is now available as a short textbook.

Preliminaries

We introduce some basic mathematical notations for sets, sequences, and relations.

Sets. We assume standard set notations. Note that we write A ⊆ B to denote ∀a ∈ A : a ∈ B. In particular, the notation A ⊆ B neither implies nor rules out A = B. We let N be the set of all natural numbers (starting with the number 1), and N0 = N ∪ {0}. The power set P(A) is the set of all subsets of A.

Sequences. Given a set A, we let A* be the set of finite sequences (or "words") of elements of A, including the empty sequence, which is denoted ε. We let A+ ⊆ A* be the set of nonempty sequences of elements of A; thus A* = A+ ∪ {ε}. For two sequences u, v ∈ A*, we write u · v for their concatenation (which is also in A*). If f : A → B is a function and w ∈ A* is a sequence, then we let f(w) ∈ B* be the sequence obtained by applying f to each element of w. Sometimes we write A^ω for the set of ω-infinite sequences of elements of A.
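To make the notation concrete, here is a small sketch in Python (our own illustration; modeling finite sequences as tuples is an assumption of the sketch, not part of the text):

    # Finite sequences over a set A, modeled here as Python tuples; the empty
    # sequence corresponds to the empty tuple.
    eps = ()
    u, v = (1, 2), (3,)
    w = u + v                     # concatenation u · v
    assert w == (1, 2, 3)

    def fmap(f, w):
        # f(w): apply f : A -> B to each element, yielding a sequence in B*
        return tuple(f(a) for a in w)

    assert fmap(lambda a: 10 * a, w) == (10, 20, 30)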
Multisets. A finite multiset m over some base set A is defined to be a function m : A → N0 such that m(a) = 0 for almost all a (that is, for all but finitely many). The idea is that we represent the multiset by the function that specifies how many times each element of A occurs in it. We let M(A) denote the set of all finite multisets over A. When convenient, we interpret an element a as the singleton multiset containing a. For typical operations on multisets, we use a mix of symbols taken from set notations and vector notations.

Orders. A partial order r over a set A is an irreflexive and transitive relation r ⊆ A × A; we write a −r→ b to mean (a, b) ∈ r. Note that partial orders are acyclic (if there were a cycle, transitivity would imply a −r→ a for some a, contradicting irreflexivity). We often visualize partial orders as directed acyclic graphs. Moreover, in such drawings, we usually omit transitively implied edges, to avoid overloading the picture.

A partial order does not necessarily order all elements. In fact, that is precisely what distinguishes it from a total order: a partial order r over A is a total order if for all a, b ∈ A such that a ≠ b, either a −r→ b or b −r→ a. All total orders are also partial orders.

Many authors define partial orders to be reflexive rather than irreflexive. We chose to define them as irreflexive, to keep them more similar to total orders, and to keep the definition more consistent with our favorite visualization, directed acyclic graphs, whose vertices never have self-loops. This choice is only superficial and not a deep distinction: consider the familiar notations < and ≤. Conceptually, they represent the same ordering relation, but one of them is irreflexive and the other reflexive. In fact, if r is a total or partial order, we sometimes write a <r b to represent a −r→ b.

A total order can be used to sort a set. For a finite set A′ ⊆ A and a total order r over A′, we let A′.sort(r) ∈ A* be the sequence obtained by sorting the elements of A′ in ascending <r order.

Models and Machines

To reason about protocols and consistency, we need terminology and notation that help us abstract from details. In particular, we need models for machines, and ways to characterize their behavior by stating and then proving or refuting their properties.

Labeled Transition Systems

Labeled transition systems (LTSs) provide a useful formalization and terminology that applies to a wide range of machines. An LTS consists of a set Cnf of configurations (some of which are designated as initial), a set Act of actions, and a transition relation whose elements we write as s −a→ s′. When using an LTS to model a system, a configuration represents a global snapshot of the state of every component of the system. Actions are abstractions that can model a number of activities, such as sending or receiving of messages, interacting with a user, doing some internal processing, or combinations thereof. Labeled transition systems are often visualized using labeled graphs, with vertices representing the states and labeled edges representing the actions.

We say an action a ∈ Act is enabled in a state s ∈ Cnf if there exists an s′ ∈ Cnf such that s −a→ s′. More than one action can be enabled in a state, and in general, an action can lead to more than one successor state. We say an action a is deterministic if the latter is never the case, that is, if for all s ∈ Cnf, there is at most one s′ ∈ Cnf such that s −a→ s′.

Defining an LTS to represent a concurrent system helps us to reason precisely about its executions and their correctness.
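As an illustration of these definitions, here is a minimal LTS sketch in Python (the set-of-triples encoding and the tiny two-state example are our own, not from the text):

    # An LTS given by configurations, initial configurations, actions, and a
    # transition relation encoded as (state, action, state) triples.
    Cnf = {"s0", "s1"}
    Ini = {"s0"}
    Act = {"a", "b"}
    trans = {("s0", "a", "s1"), ("s1", "b", "s0")}

    def enabled(s, a):
        # a is enabled in s if some transition s -a-> s' exists
        return any(src == s and act == a for (src, act, dst) in trans)

    def deterministic(a):
        # a is deterministic if no state has two distinct a-successors
        succ = {}
        for (src, act, dst) in trans:
            if act == a:
                if succ.get(src, dst) != dst:
                    return False
                succ[src] = dst
        return True

    assert enabled("s0", "a") and not enabled("s0", "b")
    assert deterministic("a") and deterministic("b")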
An execution fragment E is a (finite or infinite) alternating sequence of states and actions

    c0 −a1→ c1 −a2→ c2 −a3→ · · ·

and an execution is an execution fragment that starts in an initial state. We formalize these definitions as follows.

Definition 2. Given some LTS, an execution fragment E is an alternating sequence of configurations and actions c0 a1 c1 a2 c2 . . . such that ci−1 −ai→ ci for all i ≥ 1. We write E.len for the number of actions in E (with E.len = ∞ if E is infinite), E.cnf(i) for the i-th configuration ci, and E.act(i) for the i-th action ai.

We define pre(E) = E.cnf(0) and post(E) = E.cnf(E.len) (we write post(E) = ⊥ if E.len = ∞). Two execution fragments E1, E2 can be concatenated to form another execution fragment E1 · E2 if E1.len ≠ ∞ and post(E1) = pre(E2).

We say a configuration c′ ∈ Cnf is reachable from a configuration c ∈ Cnf if there exists an execution fragment E such that c = pre(E) and c′ = post(E). We say a configuration c is reachable if it is reachable from an initial configuration.

Reasoning about executions usually involves reasoning about events. An event is an occurrence of an action (the same action can occur several times in an execution, each occurrence being a separate event). Technically, we define the events of an execution fragment E to be the set of numbers Evt(E) = {1, 2, . . . , E.len}. Then, for events e, e′ ∈ Evt(E), e < e′ means that e occurs before e′ in the execution, and E.act(e) is the action of event e.

Given an execution fragment E of an LTS L, we let trc(E) ∈ (L.Act* ∪ L.Act^ω) be the (finite or infinite) sequence of actions in E, called the trace of E. If all actions of L are deterministic, then E is completely determined by pre(E) and trc(E). For that reason, traces are sometimes called schedules. In our proofs, we often need to take an existing execution and modify it slightly by reordering certain actions. Given a configuration c and a deterministic action a, we write post(c, a) for the uniquely determined c′ satisfying c −a→ c′, or ⊥ if there is no such c′ (because a is not enabled in c). Similarly, we write post(c, w), for an action sequence w ∈ Act*, to denote the configuration reached from c by performing the actions in w, or ⊥ if that is not possible. In the remainder of this text, all of our LTSs are constructed in such a way that all actions are deterministic.

Working with deterministic actions can have practical advantages. For testing and debugging protocols, we often need to analyze or reproduce failures based on partial information about the execution, such as a trace log. If the log contains the sequence of actions in the order they happened, and if the actions are deterministic, it means that the log contains sufficient information to fully reproduce the execution.

Asynchronous Message Protocols

An LTS can express many different kinds of concurrent systems, but we care mostly about message passing protocols in this context. Therefore, we specialize the general LTS definition above to define such systems. Throughout this text, we assume that Pid is a set of process identifiers (possibly infinite, to model dynamic creation). Furthermore, we assume that there is a total order defined on the process identifiers Pid; for example, Pid = N.

Definition 3. A protocol definition is a tuple Φ = (Pst, Msg, Act, ini, ori, dst, pid, cnd, rcv, snd, upd) where

- Pst is a set of process states, with a function ini assigning to each process its initial state;
- Msg is a set of messages, with functions ori and dst determining the origin and the destination of each message;
- Act is a set of actions, with functions pid (the process performing the action), cnd (a condition on the local process state that must hold for the action to be performed), rcv (the message received by the action, or ⊥ if none), snd (the multiset of messages sent by the action), and upd (the update applied to the local process state);
- only finitely many actions apply at a time.

We call actions a that receive no message (i.e. rcv(a) = ⊥) spontaneous.

Definition 4. A protocol Φ induces an LTS L_Φ whose configurations are pairs (P, M), with P being a function that maps each process identifier to the current state of that process, and M being a multiset that represents the messages that are currently "in flight". For a configuration c, we write c.P and c.M to denote its components. When reasoning about an execution E of L_Φ, we use corresponding notational shortcuts for the components of its configurations.
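The following sketch (our own; the toy step function and the Counter-based encoding are illustrative assumptions) shows deterministic replay from a trace log, and one way to represent a protocol configuration (P, M):

    from collections import Counter

    # Deterministic replay: post(c, w) follows schedule w from configuration c,
    # where step(c, a) yields the unique successor of c under a, or None
    # (standing in for "bottom") if a is not enabled.
    def post(c, w, step):
        for a in w:
            if c is None:
                return None
            c = step(c, a)
        return c

    # A toy deterministic step function on integer configurations.
    toy = lambda c, a: c + 1 if a == "inc" else (c * 2 if a == "double" else None)
    assert post(3, ["inc", "double"], toy) == 8

    # A protocol configuration is a pair (P, M): P maps process ids to local
    # states; M is the multiset of in-flight messages, here a Counter.
    c0 = ({1: "init", 2: "init"}, Counter({("Proposal", 1, 0): 1}))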
Example. Consider a simple protocol where the processes try to reach consensus on a single bit. We assume that the initial state of each process contains the bit value it is going to propose. We can implement a simple leader-based protocol to reach consensus by fixing some leader process l ∈ Pid. The idea is based on a "race to the leader", which works in three stages: (1) each process sends a message containing the bit value it is proposing to the leader; (2) the leader, upon receiving any such message, announces this value to all other processes; and (3) upon receiving the announced message, each recipient decides on that value. We show how to write pseudocode for this protocol in the accompanying figure. The pseudocode is organized into sections:

- Each message has a name and several named, typed parameters. The functions ori and dst (which determine the origin and destination of each message) are defined in the comment at the end of each line.
- The remaining sections define the actions, with one section per action. The entries have the following meaning:
  • The first line of each action section defines the action label, which is a name together with named, typed parameters. All action labels together constitute the set Act. The comment at the end of the line defines the pid function, which determines the process to which this action belongs.
  • The receives section defines the rcv function. If a receives line is present, it defines the message that is received by this action; if there is no receives line, the action is spontaneous.
  • The sends section defines the snd function. It specifies the message, or the multiset of messages, to be sent by this action. We use the multiset notations described in Sect. 1; in particular, the sum symbol is used to describe a collection of messages. We omit this section if no messages are sent.
  • The condition section defines the cnd function, representing a condition that is necessary for this action to be performed. It describes a predicate over the local process state (i.e. over the variables defined in the process state section). We omit this section if the action is unconditional.
  • The updates section defines the upd function, by specifying how to update the local process state. We omit this section if the process state is not changed.

One could conceivably formalize these definitions and produce a practically usable programming language for protocols; in fact, this has already been done for the programming language used by the Murφ tool. Consider the consensus protocol shown in the figure.

Consensus Protocols

What makes a protocol a consensus protocol? Somehow, we start out with a bit on each participant describing its preference. When the protocol is done, everyone should agree on some bit value that was one of the proposed values. And there should be progress: eventually, the protocol should terminate with a decision. We now formalize what we mean by a consensus protocol, by adding two functions, pref and dec, that formalize the notions of initial preference and of decisions.

Definition 5. A consensus protocol is a tuple (Pst, Msg, Act, ini, ori, dst, pid, cnd, rcv, snd, upd, pref, dec) such that

- (Pst, . . . , upd) is a protocol.

For example, for the strawman protocol, we define pref(p, b).preference = b and pref(p, b).decision = ⊥, and we define dec(s) = s.decision.
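A minimal executable rendering of this "race to the leader" might look as follows (a sketch under simplifying assumptions: a single FIFO queue stands in for the message multiset, and all field names are our own; only the message names Proposal and Announcement come from the text):

    import random
    from collections import deque

    LEADER, PIDS = 0, [0, 1, 2]
    state = {p: {"preference": random.choice([0, 1]),
                 "decision": None, "decided": False} for p in PIDS}
    msgs = deque()   # in-flight messages (a queue here, for simplicity)

    # Stage 1: every process proposes its preference to the leader.
    for p in PIDS:
        msgs.append(("Proposal", LEADER, state[p]["preference"]))

    # Stages 2 and 3: deliver messages until quiescence.
    while msgs:
        kind, dst, b = msgs.popleft()
        if kind == "Proposal" and not state[LEADER]["decided"]:
            state[LEADER]["decided"] = True     # blocks further announcements
            state[LEADER]["decision"] = b       # the race winner's bit
            for q in PIDS:
                if q != LEADER:
                    msgs.append(("Announcement", q, b))
        elif kind == "Announcement":
            state[dst]["decision"] = b          # the recipient decides b
        # late Proposal messages fall through and are simply ignored

    assert len({state[p]["decision"] for p in PIDS}) == 1   # agreement holds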
Next, we formalize the correctness conditions we briefly outlined at the beginning of this section, and then examine whether they hold for our strawman. For an execution E, we define the following properties:

1. Stability. If a value is decided at a process p, it remains decided forever.
2. Agreement. No two processes decide differently.
3. Validity. If a value is decided, this value must match the preference of at least one of the processes.
4. Termination. Eventually, a decision is reached on all correct processes. (We talk more about failures later. For now, just assume that the set F of faulty processes is empty.)

Does our strawman protocol satisfy all of these properties, for all of its executions? Certainly, this is true for the first three.

1. Strawman satisfies agreement and stability. There can be at most one announce event, because only the leader can perform the announce action, and the leader sets the decided variable to true after doing the announce, which prevents further announce actions. Therefore, all decide actions must receive an Announcement message sent by the same announce event, and thus all the actions that write a decision value write the same value. Decision values are stable: no action writes ⊥ to the decision variable.
2. Strawman satisfies validity. Any announce event (for some bit b) receives a Proposal message that must have originated in some propose event (with the same bit b), which has as a precondition that the variable proposal = b. Thus, b matches the preference of that process.

Termination, however, is not satisfied for all executions. For example, in an execution of length 0, no decision is reached. Perhaps it would be more reasonable to restrict our attention to complete executions, i.e. executions that are infinite or cannot be extended any further. But even so, consider an infinite execution in which the processes perform propose actions over and over, and nothing else: clearly, no progress is made and an unbounded number of messages is sent. No decision is reached.

Still, it appears that this criticism is not fair! It is hard to imagine how any protocol could achieve termination unless the transport layer and the process scheduler cooperate. Clearly, if the system simply does not deliver messages, or never executes actions even though they are enabled, nothing good can happen. We need fairness: some assumptions about the "minimal level of service" we may expect. Informally, what we want to require is that messages are eventually delivered unless they become undeliverable, and that spontaneous actions are eventually performed unless they become disabled.

We say an action a ∈ Act receives message m ∈ Msg if rcv(a) = m. We say m ∈ Msg is receivable in a configuration s if there exists an action a that is enabled in s and that receives m.

Definition 7. A message m is neglected by an execution E if it is receivable in infinitely many configurations, but received by only finitely many actions. A spontaneous action a is neglected by an execution E if it is enabled in infinitely many configurations, but performed only finitely many times.

Definition 8. An execution E of some protocol Φ is fair if it neglects no messages and no spontaneous actions.

Definition 9. A consensus protocol is a correct consensus protocol if all of its fair complete executions satisfy stability, agreement, validity, and termination.

Strawman is correct. We already discussed agreement and validity. Termination is also satisfied for fair executions, for the following reasons. Because the propose action is always enabled for every p, it must happen at least once (in fact, it will happen infinitely many times for every p). After it happens just once, announce is enabled, and remains enabled as long as it does not happen. Thus announce must happen (otherwise fairness is violated). But then, for each q, decide is enabled, and thus must happen eventually.
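On finite executions, the safety properties can be checked mechanically. A sketch (our own; it assumes an execution is summarized as a list of snapshots of the decision variables, one per step):

    # Each snapshot maps pid -> decided value, or None if undecided so far.
    def stable(snapshots):
        # once decided, a process never changes or loses its decision
        return all(prev[p] is None or prev[p] == cur[p]
                   for prev, cur in zip(snapshots, snapshots[1:]) for p in prev)

    def agreement(snapshots):
        decided = {v for s in snapshots for v in s.values() if v is not None}
        return len(decided) <= 1

    def validity(snapshots, preferences):
        decided = {v for s in snapshots for v in s.values() if v is not None}
        return decided <= set(preferences.values())

    run = [{0: None, 1: None}, {0: 1, 1: None}, {0: 1, 1: 1}]
    assert stable(run) and agreement(run) and validity(run, {0: 1, 1: 0})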
Fair Schedulers. The definition of fairness is purposefully quite general; it does not describe how exactly a scheduler guarantees fairness. However, it is useful to consider how to construct a scheduler that does. One way is to always schedule an action that has maximal seniority (Definition 10), in the sense that it is executing a spontaneous action or receiving a message that has been waiting (i.e. been enabled/receivable but not executed/received) the longest.

Lemma 1. Every execution produced by such a most-senior-first scheduler is fair.

Proof. Assume to the contrary that there exists an execution that is not fair, that is, one that neglects a message or spontaneous action. First, suppose a message m is neglected. This means that the message is receivable infinitely often, but received only finitely many times. Consider the first configuration where it is receivable after the last time it is received, say E.cnf(k). Since m is receivable in infinitely many configurations {E.cnf(k′) | k′ > k} but never received, there must be infinitely many configurations {E.cnf(k′) | k′ > k} in which some enabled action is more senior than the one that receives m (otherwise the scheduler would pick that one). However, an action can only be more senior than the one that receives m if it either receives some message that has been waiting (i.e. has been receivable without being received) at least as long as m, or is a spontaneous action that has been waiting (i.e. has been enabled without being performed) at least as long as m. But there can be only finitely many such messages or spontaneous actions, since there are only finitely many configurations {E.cnf(j) | j ≤ k}, and each such configuration has only finitely many receivable messages and enabled spontaneous actions, by the last condition in Definition 3; thus we have a contradiction. Now, suppose a spontaneous action is neglected. We get a contradiction by the same reasoning.

Independence. The notion of independence of actions and schedules is also often useful. We can define independence for general labeled transition systems: two actions are independent if, in any configuration where both are enabled, neither disables the other, and executing them in either order leads to the same configuration. For protocols, actions performed by different processes are independent. This is because executing an action of process p can only remove messages destined for p from the message pool, and thus cannot disable any action of another process. Moreover, actions of different processes always commute, because their effects on the local states target the states of different processes, and their effects on the message pool commute.

We call two schedules s, s′ ∈ Act* independent if for all a ∈ s and a′ ∈ s′, the actions a and a′ are independent. Note that if two schedules s, s′ are independent and both possible in some configuration c, then post(c, s · s′) = post(c, s′ · s). Visually, this can be seen by a typical tiling argument.
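The commutation claim can be illustrated directly in code: actions at different processes touch disjoint parts of the configuration, so executing them in either order gives the same result (a toy sketch, with our own encoding of configurations and actions):

    # Configurations: pid -> local state. An action (pid, delta) updates only
    # the local state of one process, so actions of different processes commute.
    def step(c, a):
        pid, delta = a
        c2 = dict(c)
        c2[pid] += delta
        return c2

    def commute(c, a1, a2):
        # post(c, a1 · a2) == post(c, a2 · a1)
        return step(step(c, a1), a2) == step(step(c, a2), a1)

    assert commute({1: 0, 2: 0}, (1, 5), (2, 7))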
Failures

As we probably all know from experience, failures are common in distributed systems. Failures can originate in the transport layer (a logical abstraction of the network, including switches, links, proxies, etc.) or in the nodes (the computers running the protocol software). Sometimes, the distinction is not that clear (for example, messages that are waiting in buffers are conceptually in the transport layer, but are subject to loss if the node fails). We now show how, given a protocol Φ (Definition 3) and its LTS L_Φ (Definition 4), we can model failures by adding failure actions to this LTS.

Modeling Transport Failures. Failures for message delivery often include (1) reordering, (2) loss, (3) duplication, and (4) injection of messages. In our protocol model, reorderings are already allowed, thus we do not consider them to be a failure. To model message loss, we can add an action that removes a message from the message pool; similarly, we can add an action that adds an extra copy of a message to model duplication, and an action that adds an arbitrary message to model injection. However, we will not talk more about injection, which is considered a byzantine failure, and which opens up a whole new category of challenges and results.

Masking Transport Failures. Protocols can mask message reordering, loss, and duplication by affixing sequence numbers to messages, and by using send and receive buffers. Receivers can detect missing messages in the sequence and re-request them. In fact, socket protocols (such as TCP) use this type of mechanism (e.g. a sliding window) to achieve reliable in-order delivery of a byte stream. In practice, however, just using TCP is not always good enough, because TCP connections can themselves fail. Often, resilience against transport failures needs to be built into the protocol in some form. A common trick to tolerate message duplication in services is to design the service calls to be idempotent, meaning that executing a request twice has the same effect as executing it just once. For example, setting the value of some parameter twice is harmless. Properly written REST protocols use the verb PUT to mark such requests as idempotent, allowing browsers and proxies to duplicate them.

Modeling Node Failures. Typical node failures considered by protocol designers are crash failures (a process permanently stops at some point) and crash-recovery failures (a process stops at some point, then recovers later). Sometimes, byzantine failures are also considered, where faulty nodes exhibit arbitrary behavior, but we are skipping that topic. Typical terminology is to call a process correct if it never experiences a crash failure, and if it encounters only finitely many crash-recovery failures. We let F ⊂ Pid be the subset of faulty processes, i.e. processes that may be incorrect (it is acceptable for processes in F to be actually correct in any given execution).

In a crash failure, the process state is permanently lost, and the process never takes another action. In a crash-recovery failure, the process can recover some or all of its state from some form of durable storage (if it cannot, there is little reason for a process to continue under the same identity). The part of the state that is lost in crashes is called "soft state". Often, message buffers are soft state, so it is possible that messages are lost or duplicated if the crash occurs during a transition that receives or sends messages.
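Following the failure actions just described, here is a sketch of loss and duplication as operations on the message multiset M, plus a crashed-set for silent node crashes (our own encoding, with M a Counter as before):

    from collections import Counter

    def lose(M, m):
        # transport failure: drop one copy of message m (assumes m is in M)
        M2 = Counter(M)
        M2[m] -= 1
        return M2 + Counter()        # adding the empty Counter drops zero counts

    def duplicate(M, m):
        # transport failure: add an extra copy of message m
        M2 = Counter(M)
        M2[m] += 1
        return M2

    crashed = set()
    def crash(p):
        # silent node failure: p takes no further actions and receives nothing;
        # other processes cannot observe this directly.
        crashed.add(p)

    M = Counter({"m1": 2})
    assert lose(M, "m1") == Counter({"m1": 1})
    assert duplicate(M, "m1")["m1"] == 3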
In asynchronous systems, it is often important to distinguish between silent crashes and noisy crashes. Silent crashes mean that other processes have no way to distinguish between a slow response and a crashed process, which can be a real problem, as we shall see below. Noisy crashes mean that other processes can use failure detectors to get information about whether a crash occurred. In some situations (e.g. inside a data center), it is often quite feasible to build failure detectors, in particular approximate failure detectors, and they can be very helpful for designing protocols. However, in other situations failure detection is impossible. For example, if a server loses contact with a JavaScript app running in somebody's browser, it does not know whether this was a temporary connection failure and the app will reconnect at some future time, or whether the user has closed the browser and will never return.

In the following, we consider only silent crash failures. To model them, we use a modified definition of fairness: we allow executions to be 'unfair' if this unfairness is consistent with processes crashing, in the sense that crashed processes perform no more actions and receive no more messages after they crash.

Definition 12. An execution E of L_Φ for some Φ is a complete F-fair execution if there exists a partial function fails : F → ⊥ ∪ {0 . . . E.len} such that every process p ∈ F with fails(p) ≠ ⊥ performs no actions and receives no messages after position fails(p) in E, and such that E neglects no messages and no spontaneous actions other than those belonging to such crashed processes.

Asynchronous Consensus Under Silent Crash Failures is Impossible

We now show the famous impossibility result for asynchronous consensus protocols under just one silent crash failure, following the same proof structure as Fischer, Lynch and Paterson.

Definition 13. A simple consensus protocol is a consensus protocol (Pst, Msg, Act, ini, ori, dst, pid, cnd, rcv, snd, upd, pref, dec) whose only actions are message receives and one spontaneous step per process, and whose actions have no guard (cnd is always true).

Theorem 1. No simple consensus protocol is correct under a single silent crash failure; that is, if |F| ≥ 1, then not all F-fair executions satisfy stability, agreement, validity, and termination.

Proof. Assume to the contrary that all F-fair executions with |F| ≤ 1 satisfy validity, agreement, stability, and termination. We then prove (using a sequence of lemmas) that a contradiction results. The key to the proof is the idea of examining the valence of a system configuration, meaning how many different decisions are possible when starting in that configuration. For a system configuration c ∈ Cnf_Φ, we define V(c) ⊆ {0, 1} to be the set of decision values reachable from c; we call c bivalent if V(c) = {0, 1}.

Using two lemmas (Lemma 2, which guarantees the existence of a bivalent initial configuration, and Lemma 3, which guarantees that bivalence can be preserved while eventually performing any chosen action), we now construct an infinite, fair execution consisting entirely of bivalent configurations, which contradicts the correctness of the protocol:

1. Start with some bivalent initial configuration, whose existence is guaranteed by Lemma 2.
2. Pick the most senior enabled action a (as defined in Definition 10).
3. Execute the action sequence w ∈ Act* (whose existence is guaranteed by Lemma 3), then the action a, and end up in another bivalent configuration.
4. Continue with step 2.

This construction yields an infinite execution; it is fair because we pick the most senior enabled action in step 2 and then execute it after finitely many other steps w, which means that there is no neglect (as explained in the proof of Lemma 1).
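On a finite-state model, the valence V(c) can be computed by exhaustive search over the reachable configurations. A sketch (the helper functions successors and decisions are assumptions of the sketch, not part of the text):

    # Exhaustive search for V(c0); successors(c) enumerates the configurations
    # reachable in one step, decisions(c) the decision values visible in c.
    def valence(c0, successors, decisions):
        seen, frontier, vals = {c0}, [c0], set()
        while frontier:
            c = frontier.pop()
            vals |= decisions(c)
            for c2 in successors(c):
                if c2 not in seen:
                    seen.add(c2)
                    frontier.append(c2)
        return vals          # c0 is bivalent iff this returns {0, 1}

    succ = {"c0": ["d0", "d1"], "d0": [], "d1": []}
    dec = {"c0": set(), "d0": {0}, "d1": {1}}
    assert valence("c0", succ.__getitem__, dec.__getitem__) == {0, 1}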
Finally, we can lift the restriction and allow general protocols as defined in Definition 5: no such protocol can be a correct consensus protocol under a single silent crash failure either.

Proof (Sketch only). The idea is to construct a simple consensus protocol P′ that simulates P, and whose F-fair executions correspond to F-fair executions of P. Thus, P cannot be correct; otherwise we could use it to build a correct simple consensus protocol, which we know does not exist. The messages are the same (Msg′ = Msg). The local state Pst′ stores (1) the process state Pst, (2) an "inbox", i.e. a multiset representing messages that are available, (3) a step counter recording how many times this process has taken a step, and (4) a data structure recording the timestamps (i.e. step counts) for messages in Msg and spontaneous actions in Act, used to calculate the seniority of actions as defined in Definition 10. On receive(p, m), the received message is simply added to the inbox. On run(p), we look for the most senior action, and execute it. The key requirement is that for every fair execution E′ of P′ we find a corresponding fair execution E of P. Consider a message m: if it does not get neglected in E′, it must be received, meaning that it reaches the inbox; and because run(dst(m)) does not get neglected in E′, it executes infinitely many times. Because the scheduler that is simulated by run is fair, as shown by Lemma 1, the simulated execution is fair as well.

Ways Around Impossibility. Impossibility results are often called negative results, but in fact, they usually help us to discover new ways in which to change our approach or our definitions, in order to succeed. There are many ways to work around the impossibility result we just proved:

- The result applies only to asynchronous systems. We can solve consensus in synchronous systems, e.g. if we have some bounds on message delays.
- The result assumes that crashes are silent. We can solve consensus if we have failure detectors (the literature offers an extensive list of consensus algorithms based on failure detectors).
- The result requires that all fair executions terminate. We can use a protocol that is always safe, and whose fair executions are likely, though not guaranteed, to terminate.

The last item is perhaps the most interesting. In the next section, we show an asynchronous protocol for consensus that can be tuned to terminate quite efficiently in practice.

The PAXOS Protocol

We now have a closer look at the PAXOS protocol for asynchronous consensus by Leslie Lamport. The basic idea is to perform a leader-based consensus: a leader p performs a voting round (whose goal is to reach consensus on a bit) by sending a proposal for a consensus value to all participants, and if p gets a majority to agree with the proposal, p informs all participants about the winning value. Voting rounds can fail for various reasons, but a leader can always start a new round, which can still succeed (i.e. the protocol never gets stuck with no chance of success). The trick is to (1) design the protocol to satisfy agreement, validity, and stability even if there are many competing leaders, and (2) make it unlikely (using ad-hoc heuristics) that there are many competing leaders at a time, so that termination is likely in practice.

There are three roles of participants (leaders, acceptors, learners), which we represent by three process subsets Pid_l, Pid_a, Pid_r of Pid. Leaders conduct the organizational part of a voting round (solicit, collect, and analyze votes); acceptors perform the actual voting; and learners are informed about the successful outcome, if any. It is perfectly acceptable (and common in practice) for a process to play multiple roles. If everybody plays every role, we have Pid_l = Pid_a = Pid_r = Pid. The number of acceptors must be finite (|Pid_a| < ∞) so that they can form majorities.
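The reason majorities matter is that any two majority quorums over a finite acceptor set intersect; this intersection is what ties competing rounds together in the correctness argument below. A quick illustrative check:

    from itertools import combinations

    acceptors = {1, 2, 3, 4, 5}
    majority = len(acceptors) // 2 + 1      # strictly more than half

    quorums = [set(q) for q in combinations(acceptors, majority)]
    assert all(q1 & q2 for q1 in quorums for q2 in quorums)   # all pairs intersect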
Some key ideas include:

- Voting rounds are identified by a unique round identifier. This identifier is a tuple (n, p) consisting of a sequence number n and the process identifier p of the leader for this round. There is just one leader for each round, but different rounds can be initiated by different leaders, possibly concurrently.
- Each round has two and a half phases. In the first phase, the leader sends an inquiry message to all acceptors. The acceptors respond with a special message containing the last vote they cast (in a previous round), or a pseudo-vote containing their initial preference (if they have not cast any votes in a real round yet).
- When the leader has received a last-vote message from a quorum (i.e. a majority) of acceptors, it starts the second phase. In this phase, it proposes a consensus value and asks the quorum to vote for it.
- If the leader receives votes from all members of the quorum, it informs all learners about the successful outcome.

We show the definitions of the local states (for each role) and of the message formats in the accompanying figure; in particular, it defines the types Round = (N0 × Pid) and Vote = (Round × {0, 1}), both using lexicographic order.

The following properties of the protocol are key to ensure consensus even under concurrent voting rounds:

- Rounds are totally ordered (lexicographically, first by sequence number, then by process id). Participants are no longer allowed to participate in a lower round once they are participating in a higher round.
- When transitioning from the first phase (gather last-vote messages) to the second phase (send out proposal messages), the leader chooses the consensus value belonging to the highest vote among all the last-vote messages. This ensures that if a prior round was actually successful (i.e. it garnered a majority of votes), the new round uses the same bit value.

The following lemma formalizes these intuitions, and constitutes the core of the correctness proof.

Lemma 4 (Competing Leaders). If E is an execution with announce(n, p, b, Q) ∈ trc(E) and propose(n′, p′, b′, Q′, lv′) ∈ trc(E) and (n, p) < (n′, p′), then b′ = b.

Proof. By contradiction. Assume the lemma is not true; then there exist E, p, n, b, Q, p′, n′, b′, Q′, lv′ falsifying the condition, and without loss of generality we can assume (n′, p′) is chosen minimal among all such. To perform propose(n′, p′, b′, Q′, lv′), the leader p′ received several LastVote messages; let ((n″, p″), b″) = max over q ∈ Q′ of lv′(q) be the maximal vote received. Distinguish cases:

- (n″, p″) < (n, p): this is impossible. Because Q and Q′ must intersect, there exists q ∈ Q ∩ Q′. Since q must have voted in round (n, p) before answering in round (n′, p′) (once participating in the higher round, it may no longer vote in the lower one), the LastVote message sent from q to p′ must contain a vote whose round is no lower than (n, p) (note that the lastvote variable is monotonically increasing), contradicting the maximality of (n″, p″).
- (n″, p″) = (n, p): in that case, b″ = b, because all votes for the same round have the same bit value; since the leader adopts the value of the maximal vote, b′ = b. Contradiction.
- (n″, p″) > (n, p): because n is at least 1, so is n″; thus ((n″, p″), b″) is a vote for a non-zero round, so there must exist some propose(n″, p″, b″, _, _) in the execution, with (n, p) < (n″, p″) < (n′, p′). Because we chose (n′, p′) minimal among all such violations of the lemma, this implies b″ = b, and hence b′ = b. Contradiction.

The following theorem shows that no matter how many crashes occur, how many messages are lost, or how many leaders are competing, safety is always guaranteed.

Theorem 2. All executions of PAXOS satisfy agreement, validity, and stability.

Proof. Validity is easy, because all votes can be tracked back to some initial pseudo-vote, which is the preference of some process. Stability and agreement follow because if we had two events announce(n, p, b, Q) and announce(n′, p′, b′, Q′) with b ≠ b′, and (n, p) < (n′, p′) without loss of generality, then there must also be a propose(n′, p′, b′, Q′, lv′), which contradicts Lemma 4.
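To make the two and a half phases concrete, here is a deliberately simplified, loss-free simulation of a single leader driving one round (a sketch only: synchronous calls stand in for messages, there are no competing leaders or crashes, and all names are our own, echoing the LastVote idea from the text):

    class Acceptor:
        def __init__(self, preference):
            self.joined = (0, 0)                    # highest round joined so far
            self.lastvote = ((0, 0), preference)    # pseudo-vote: initial preference

        def inquire(self, rnd):
            # Phase 1: join round rnd if it is the highest seen, and report
            # the last vote cast (a LastVote message).
            if rnd > self.joined:
                self.joined = rnd
                return self.lastvote
            return None                             # ignore lower rounds

        def vote(self, rnd, b):
            # Phase 2: vote for (rnd, b) unless a higher round was joined since.
            if rnd >= self.joined:
                self.joined = rnd
                self.lastvote = (rnd, b)
                return True
            return False

    def run_round(rnd, acceptors):
        # Phase 1: gather LastVote messages from a majority quorum.
        replies = [(a, lv) for a in acceptors if (lv := a.inquire(rnd)) is not None]
        if len(replies) <= len(acceptors) // 2:
            return None                         # no quorum; a higher round may retry
        # Choose the value of the highest vote among the replies; this is what
        # preserves the outcome of any earlier successful round.
        _, value = max(lv for _, lv in replies)
        # Phase 2 (and a half): ask the quorum to vote; on full success,
        # announce the winning value to the learners.
        if all(a.vote(rnd, value) for a, _ in replies):
            return value
        return None

    accs = [Acceptor(b) for b in (0, 1, 1)]
    assert run_round((1, 0), accs) in (0, 1)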
Of course, termination is not possible for arbitrary fair schedules in the presence of failures, because of Theorem 1. However, the following property holds: success always remains possible as long as there remain some non-crashed leader, some non-crashed learner, and a majority of non-crashed acceptors (more than |Pid_a|/2). The reason is that:

- A leader cannot get stuck in any state: if it is waiting for something (such as the receipt of some message), and that something is not happening (for example, due to a crash), the leader can perform the spontaneous action abandon to return to a neutral state, from which it can start a new, higher round.
- If a leader p starts a new round (n, p) that is higher than any previous round, if no other leaders start even higher rounds, if more than |Pid_a|/2 acceptors remain, and if there are no more crashes, then the round succeeds.

The PAXOS algorithm shown here, and the correctness proof, are both based on the original paper by Lamport.

Strong Consistency and CAP

In this section we examine how to understand the consistency of shared data. We explore the cost of strong consistency (in terms of reliability or performance), and we develop abstractions that help system implementors to articulate the consistency guarantees they are providing to programmers.

Objects and Operations

We assume that the shared data is organized as a collection of named objects Obj. As in the last section, we assume a set of processes Pid. The sets of objects and processes may be infinite, to model their dynamic creation. Processes interact with the shared data by performing operations on objects. Each object x ∈ Obj has a type τ = type(x) ∈ Type, whose type signature (Op_τ, Val_τ) determines the set of supported operations Op_τ and the set of their return values Val_τ. We assume that a special value ⊥ ∈ Val_τ belongs to all sets Val_τ and is used for operations that return no value.

Example 1. An integer register intreg can be defined as follows: Val_intreg = Z ∪ {⊥}, and Op_intreg = {rd} ∪ {wr(a) | a ∈ Z}.

Example 2. A counter object ctr can be defined as follows: Val_ctr = Z ∪ {⊥}, and Op_ctr = {rd, inc}.

Sequential Semantics. The type of an object, as defined above, does not actually describe the semantics of the operations, only their syntax. We formally specify the sequential semantics of a data type τ by a function S_τ which, given an operation and a sequence of prior operations, specifies the expected return value. For a register, read operations return the value of the last preceding write, or zero if there is no prior write. For a counter, read operations return the number of preceding increments. Thus, for any sequence of operations ξ:

    S_intreg(rd, ξ) = a    if ξ = ξ1 · wr(a) · ξ2 and ξ2 does not contain wr operations
    S_intreg(rd, ξ) = 0    if ξ does not contain wr operations
    S_ctr(rd, ξ) = (the number of inc operations in ξ)

Our definition of the sequential semantics uses sequences of prior operations (representing all earlier updates), rather than the current state of an object, to define the behavior of reads. This choice is useful: for many implementations, there are multiple versions of the state, and these versions are often best understood as the result of using various update sequences (such as logs), subsequences, or segments. Moreover, for objects such as the integer register, only the last update matters, since it completely overwrites all information in the object. For the counter, however, all updates matter. Similarly, for objects that have multiple fields and support partial updates, e.g. updates that modify individual fields, it is not enough to look at the last update to determine the current state of the object.
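The sequential semantics of the two example types can be transcribed directly into code (the encoding of operations as strings and tuples is our own):

    # Operations encoded as "rd", "inc", or ("wr", a); xi is the sequence of
    # prior operations.
    def S_intreg(op, xi):
        if op == "rd":
            last = 0                        # zero if there is no prior write
            for o in xi:
                if isinstance(o, tuple) and o[0] == "wr":
                    last = o[1]             # value of the last preceding write
            return last
        return None                         # wr(a) returns bottom

    def S_ctr(op, xi):
        if op == "rd":
            return sum(1 for o in xi if o == "inc")   # number of increments
        return None                         # inc returns bottom

    assert S_intreg("rd", [("wr", 5), "rd", ("wr", 7)]) == 7
    assert S_ctr("rd", ["inc", "inc"]) == 2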
In general, operations may both read and modify the state. Operations that return no value are called update-only operations. Similarly, we call an operation o of a type τ read-only if it has no side effect, i.e. if for all o′ ∈ Op_τ and u, v ∈ Op_τ*, we have S_τ(o′, u · o · v) = S_τ(o′, u · v).

What is an Object? There is often some ambiguity in what we should consider to be an object. For example, consider a cloud table storage API that provides tables that store records (consisting of several fields that have values) indexed by keys. Then:

- We can consider each record to be an object, named by the combination of the table name and the key, and supporting operations for reading and writing fields or removing the object.
- We can consider the whole table to be an object, named by the table name. Operations specify the key (and the field, if accessing individual fields).
- We can consider each field to be an object, named by the combination of the table name, the key, and the field name. This approach seems most consistent with the types shown above (integer registers, counters).
- We can consider the entire storage to be a single object, and have operations target a specific (table, key, field) combination.

We propose the following definition, or perhaps we should say guideline:

- An object is the largest unit of data that can be written atomically without using transactions.
- A transactional domain is the largest unit of data that can be written atomically by using transactions.

Traditional databases follow a philosophy without objects (nothing can be written outside of a transaction) and with large transactional domains (the entire database), which requires strong transaction support. Cloud storage and web programming rely more commonly on moderately to large sized objects, and on transactional domains that do not contain all the data (transaction support is typically nonexistent, or at best limited). The reason is that the latter approach is easier to provide as a scalable service. Unfortunately, it is also harder to program against.

Strong Consistency

Intuitively, programmers expect operations on shared data to be linearizable. Informally, this means that when they call into some API to read or write a shared value, they expect a behavior that is consistent with (i.e. observationally indistinguishable from):

- a single copy of the shared data being maintained somewhere, and
- the read or write operations being applied to that copy somewhere in between the call and the return.

Unfortunately, guaranteeing these conditions can be a performance and reliability problem if communication between processes is expensive and/or unavailable. Many systems thus relax consistency.

A good test to see whether a system is indeed linearizable (in fact, sequentially consistent) is shown in the accompanying figure: process A writes to a location x and then reads a location y, while process B writes to y and then reads x; a process "wins" if its read does not observe the other's write. If we run these two concurrently on a sequentially consistent or linearizable system, there is at most one winner:

- If the system decides that A's write to x happens before B's write to y, then it must also happen before B's read from x; thus the value read must be 1, so B does not win.
- If the system decides that B's write to y happens before A's write to x, then it must also happen before A's read from y; thus the value read must be 1, so A does not win.

This reasoning still seems a bit informal: talking about 'happens before' without a solid foundation can get quite confusing. In order to reason more rigorously, we first need a precise definition of what sequential consistency and linearizability mean.
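The test can be phrased as a small litmus program. Under sequential consistency, every execution is some interleaving of the two programs, and the sketch below (our own encoding) checks that no interleaving produces two winners:

    from itertools import permutations

    # Dekker-style litmus test: A writes x := 1 then reads y; B writes y := 1
    # then reads x. A process "wins" if its read returns the initial value 0.
    def run(schedule):
        store = {"x": 0, "y": 0}
        prog = {"A": [("wr", "x"), ("rd", "y")], "B": [("wr", "y"), ("rd", "x")]}
        res = {}
        for p in schedule:                    # one interleaving of the two programs
            kind, var = prog[p].pop(0)
            if kind == "wr":
                store[var] = 1
            else:
                res[p] = store[var]
        return res["A"] == 0, res["B"] == 0   # (A wins, B wins)

    for sched in set(permutations("AABB")):   # all interleavings
        a_wins, b_wins = run(sched)
        assert not (a_wins and b_wins)        # at most one winner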
Abstract Executions. To specify consistency models, we use abstract executions. The basic idea is very simple:

1. A consistency model is formalized as a set of abstract executions, which are mathematical structures (visualized using graphs) consisting of operation events (the vertices) and relations (the edges), subject to conditions. Abstract executions capture "the essence" of an execution (that is, what operations occurred, and how those operations are related), without including low-level details (such as exactly what messages were sent when and where).
2. We describe what it means for a concrete execution of a system to correspond to an abstract execution.
3. We say that a system is correct if all of its concrete executions correspond to some abstract execution of the consistency model.
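For illustration, an abstract execution can be represented as a small data structure of events and named relations (a sketch; the relation name "vis" is our own placeholder, since which relations appear depends on the consistency model being specified):

    from dataclasses import dataclass, field

    @dataclass
    class Event:
        pid: int       # process that performed the operation
        op: str        # the operation, e.g. "wr(1)" or "rd"
        rval: object   # the return value observed

    @dataclass
    class AbstractExecution:
        events: list = field(default_factory=list)
        # Relations over events, as named sets of (index, index) edges; which
        # relations are present depends on the consistency model being defined.
        relations: dict = field(default_factory=dict)

    # A write and a read related by an (illustrative) relation named "vis".
    ax = AbstractExecution(
        events=[Event(1, "wr(1)", None), Event(2, "rd", 1)],
        relations={"vis": {(0, 1)}})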