## Algorithms for Sequential Decision Making (1996)

Citations: 177 (8 self)

### BibTeX

    @MISC{Littman96algorithmsfor,
      author = {Michael Lederman Littman},
      title = {Algorithms for Sequential Decision Making},
      year = {1996}
    }

### Abstract

Sequential decision making is a fundamental task faced by any intelligent agent in an extended interaction with its environment; it is the act of answering the question "What should I do now?" In this thesis, I show how to answer this question when "now" is one of a finite set of states, "do" is one of a finite set of actions, "should" is maximize a long-run measure of reward, and "I" is an automated planning or learning system (agent). In particular,

### Citations

10922 |
Computers and Intractability: A Guide to the Theory of NP-Completeness
- Garey, Johnson
- 1979
Citation Context: ...e in S0? A polynomial-time algorithm for solving this problem could be used to solve quantified-boolean-formula problems in polynomial time. Since the quantified-boolean-formula problem is PSPACE-hard [55], this shows that the polynomial-horizon, boolean-reward pomdp problem is also PSPACE-hard. The proof is due to Papadimitriou and Tsitsiklis [116]. 6.5.3 Infinite Horizon, Deterministic. The unobserv... |

8530 |
Introduction to Algorithms
- Cormen, Leiserson, et al.
- 1990
Citation Context: ...ction for a given policy appears in Table 2.1; it will be used later in more complex algorithms. The system of linear equations can be solved by Gaussian elimination or any of a number of other methods [40]. Now we know how to compute a value function, given a policy. We can also define a policy based on a value function. Given any value function V, the greedy policy with respect to that value function,... |
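The two operations this excerpt describes — solving the linear system for a fixed policy's value function, then taking the greedy policy with respect to a value function — can be sketched as follows. The randomly generated MDP is a stand-in for illustration, not an example from the thesis:

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP: transition tensor T[s, a, s'],
# reward R[s, a], discount factor g. All numbers are illustrative.
n_states, n_actions, g = 3, 2, 0.9
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.standard_normal((n_states, n_actions))

def evaluate_policy(pi):
    """Solve V = R_pi + g * P_pi V exactly (Gaussian elimination via linalg.solve)."""
    P_pi = T[np.arange(n_states), pi]      # P_pi[s, s'] under policy pi
    R_pi = R[np.arange(n_states), pi]
    return np.linalg.solve(np.eye(n_states) - g * P_pi, R_pi)

def greedy_policy(V):
    """Policy that is greedy with respect to a value function V."""
    Q = R + g * T @ V                      # Q[s, a] one-step lookahead values
    return Q.argmax(axis=1)

pi = np.zeros(n_states, dtype=int)         # arbitrary initial policy
V = evaluate_policy(pi)
pi_better = greedy_policy(V)
```

Alternating these two operations is exactly policy iteration; the solve call is the `|S| × |S|` linear-system step the excerpt refers to.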

2611 |
Dynamic Programming
- Bellman
- 1957
Citation Context: ...m the traditional model and compute solutions in the form of policies instead of action sequences. Much of the content of this chapter is a recapitulation of work in the operations-research literature [126, 15, 44, 46, 68, 13] and the reinforcement-learning literature [153, 173, 10, 145]. The concepts and background introduced here will be built upon in all the succeeding chapters. 2.2 Markov Decision Processes. Markov deci... |

2343 | Communication complexity
- Papadimitriou, Sipser
- 1984
Citation Context: ... algorithms. If P ≠ NP, then the classes from NP-complete up to EXPTIME do not have efficient algorithms. Additional background information on complexity classes can be found in Papadimitriou's textbook [115] and the summary in Section A.1. 1.3.3 Reinforcement-learning Algorithms. In a reinforcement-learning scenario, the agent must solve the same basic problem faced in planning, but must do so without a... |

2112 | A New Approach to Linear Filtering and Prediction Problems - Kalman - 1960 |

1740 |
STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving
- Fikes, Nilsson
- 1971
Citation Context: ...r-science perspective. I undertake this type of analysis throughout the thesis. Planning has been one of the primary subject areas in artificial intelligence since the development of the STRIPS system [53]. Early work on planning focused on the generation of plans for reaching some goal state in a deterministic environment. A more recent trend has been to consider decision-theoretic planning, in which ... |

1459 |
Theory of linear and integer programming
- Schrijver
- 1986
Citation Context: ...e components of the optimal value function for M are rational numbers with numerator and denominator needing no more than B bits, and B is bounded by a polynomial in |S|, |A|, and B. Proof: It is well known [140, 162] that the solution to a rational linear program in which each numerator and denominator is represented using no more than B bits can itself be written using rational numbers. The value of each variab... |

1321 |
Learning from Delayed Rewards
- Watkins
- 1989
Citation Context: ...es instead of action sequences. Much of the content of this chapter is a recapitulation of work in the operations-research literature [126, 15, 44, 46, 68, 13] and the reinforcement-learning literature [153, 173, 10, 145]. The concepts and background introduced here will be built upon in all the succeeding chapters. 2.2 Markov Decision Processes. Markov decision processes are the simplest family of models I will consid... |
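Watkins's Q-learning, cited here, can be sketched in a few lines of tabular code. The two-state chain below is a hypothetical toy environment, not an example from the thesis:

```python
import random

# Tabular Q-learning on a hypothetical 2-state chain:
# in state 0, action 1 moves to state 1 (reward 0); in state 1, action 1
# pays reward 1 and resets to state 0; action 0 stays put with reward 0.
alpha, gamma = 0.1, 0.9
Q = [[0.0, 0.0], [0.0, 0.0]]   # Q[state][action]

def step(s, a):
    if s == 0 and a == 1:
        return 1, 0.0
    if s == 1 and a == 1:
        return 0, 1.0
    return s, 0.0

random.seed(0)
s = 0
for _ in range(20000):
    a = random.randrange(2)                 # exploratory (uniform random) behavior
    s2, r = step(s, a)
    # Watkins's update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
    Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
    s = s2
```

Because the update bootstraps from `max(Q[s2])` rather than the action actually taken, the learned values converge toward the optimal policy's values even under the purely random behavior used here.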

1303 | Reinforcement Learning: A Survey
- Kaelbling, Littman, et al.
- 1996
Citation Context: ...ortions of this chapter have appeared in earlier papers: "Planning and acting in partially observable stochastic domains" [73] with Kaelbling and Cassandra, "An introduction to reinforcement learning" [74] with Kaelbling and Moore, and "On the complexity of solving Markov decision problems" [96] with Dean and Kaelbling. Consider the problem of creating a policy to guide a robot through an office building... |

1227 | Learning to predict by the methods of temporal differences - Sutton - 1988 |

1040 |
Theory of Games and Economic Behavior
- Neumann, Morgenstern
- 1944
Citation Context: ...researchers have adapted mdp-based learning algorithms to a very general class of games [90] and many researchers have used reinforcement learning in these environments; economists and game theorists [168, 166, 143] have studied Markov games as a model for understanding the behavior of individuals in multiagent systems. 4.2 Alternating Markov Games. In this chapter, I describe alternating Markov games, in which s... |

825 | Planning and acting in partially observable stochastic domains
- Kaelbling, Littman, et al.
- 1998
Citation Context: ...sed by many other researchers. Chapter 2: Markov Decision Processes. Portions of this chapter have appeared in earlier papers: "Planning and acting in partially observable stochastic domains" [73] with Kaelbling and Cassandra, "An introduction to reinforcement learning" [74] with Kaelbling and Moore, and "On the complexity of solving Markov decision problems" [96] with Dean and Kaelbling. Cons... |

811 |
Linear Programming and Extensions
- Dantzig
- 1963
Citation Context: ...actical use; however, refinements of Karmarkar's [78] polynomial-time algorithm are competitive with the fastest practical algorithms. Another algorithm for solving linear programs, the simplex method [41], is theoretically inefficient but runs extremely quickly in practice. An excellent book by Schrijver [140] describes the theory of linear programs and the algorithms used to solve them. Appendix B ... |

801 |
Matrix multiplications via arithmetic progressions
- Coppersmith, Winograd
- 1990
Citation Context: ...on a2 results in a transition to a random state. In theory, policy evaluation can be performed faster, because it primarily requires inverting an |S| × |S| matrix, which can be done in O(|S|^2.376) time [39]. [Figure 2.3 caption: Simple policy iteration requires an exponential number of iterations to generate an optimal solution to the family of mdps il...] |

772 |
A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains
- Baum, Petrie, et al.
- 1970
Citation Context: ... from Chapter 7; both algorithmic methods and learning methods are appropriate. This section describes several attempts at learning the model itself. Chrisman [34] showed how the Baum-Welch algorithm [11] for learning hidden Markov models (HMMs) could be adapted to learning transition and observation functions for pomdps. He, and later McCallum [104], gave heuristic state-splitting rules to attempt to... |

647 | A new polynomial-time algorithm for linear programming
- Karmarkar
- 1984
Citation Context: ...omial in B. There are algorithms for solving rational linear programs that take time polynomial in the number of variables and constraints as well as the number of bits used to represent the coefficients [78, 79]. Thus, mdps can be solved in time polynomial in |S|, |A|, and B. Descendants of Karmarkar's algorithm [78] are considered among the most practically efficient linear-programming algorithms. It is popula... |
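The linear-programming route to solving an mdp that this excerpt alludes to can be sketched concretely: minimize the sum of the values subject to V(s) ≥ R(s,a) + g·Σ T(s,a,s')V(s') for every state-action pair. The two-state mdp and the use of SciPy's `linprog` are illustrative assumptions, not the thesis's setup:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 2-state, 2-action MDP, discount g; all numbers illustrative.
g = 0.9
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # T[s, a, s']
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # R[s, a]

n_s, n_a = R.shape
A_ub, b_ub = [], []
for s in range(n_s):
    for a in range(n_a):
        # Constraint V(s) >= R(s,a) + g * sum_s' T(s,a,s') V(s'),
        # rewritten in <= form: g*T[s,a]·V - V(s) <= -R(s,a).
        row = g * T[s, a].copy()
        row[s] -= 1.0
        A_ub.append(row)
        b_ub.append(-R[s, a])

res = linprog(c=np.ones(n_s), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * n_s)
V_star = res.x   # optimal value function of the MDP
```

Because every feasible V dominates the optimal value function componentwise, minimizing the sum pins the solution to exactly V*, which is why rational-LP bit-complexity bounds carry over to mdps.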

619 | Parallel and Distributed Computation: Numerical Methods - Bertsekas, Tsitsiklis - 1989 |

613 | Some studies in machine learning using the game of checkers
- Samuel
- 1959
Citation Context: ...ating Games. As mentioned in the introduction, game playing is one of the best-studied domains for reinforcement learning. One application, well ahead of its time, was Samuel's checkers-playing system [134]; it employed a training scheme similar to the updates used in value iteration and Q-learning. Tesauro [160] used the TD(λ) algorithm [154] to find an excellent policy for backgammon. Tesauro's work is ... |

561 |
A stochastic approximation method
- ROBBINS, MONRO
- 1951
Citation Context: ...(1 − α_t(x_t))U(x) + α_t(x_t)(r_t + V(x'_t)), if x = x_t; U(x), otherwise — and define U_{t+1} = H_t(U_t, V) (3.7). Conditions for the convergence of U_t to HV are provided by classic stochastic-approximation theory [130]. A more advanced reinforcement-learning problem is computing V = HV, the fixed point of H, instead of the value of HV for a fixed value function. Consider the natural learning algorithm that begins with ... |
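A minimal instance of the Robbins-Monro scheme the excerpt invokes: estimate an unknown mean from noisy samples using step sizes α_t = 1/t, which satisfy the classic conditions Σα_t = ∞ and Σα_t² < ∞. The target distribution here is an illustrative assumption:

```python
import random

# Robbins-Monro stochastic approximation: drive theta toward the root of
# f(theta) = E[X] - theta, observing only noisy samples of X.
random.seed(1)
theta = 0.0
for t in range(1, 200001):
    x = random.gauss(3.0, 1.0)        # noisy observation with unknown mean 3
    theta += (1.0 / t) * (x - theta)  # step size 1/t: sum diverges, sum of squares converges
```

With α_t = 1/t this iteration is exactly the running sample mean; reinforcement-learning updates like the one in equation (3.7) replace the sample with a bootstrapped target but rely on the same convergence conditions.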

526 | Learning to act using real-time dynamic programming
- Barto, Bradtke, et al.
- 1995
Citation Context: ...and building a complete universal plan is infeasible. Using a reinforcement-learning algorithm in such an environment can help the agent find appropriate behavior for the most common and important states [9, 43]. The most noteworthy example of this technique remains one of the biggest successes of reinforcement learning: Tesauro's backgammon-learning program [159], which is now reliably ranked as one of the wo... |

516 |
Dynamic Programming and Markov Processes
- Howard
- 1960
Citation Context: ...m the traditional model and compute solutions in the form of policies instead of action sequences. Much of the content of this chapter is a recapitulation of work in the operations-research literature [126, 15, 44, 46, 68, 13] and the reinforcement-learning literature [153, 173, 10, 145]. The concepts and background introduced here will be built upon in all the succeeding chapters. 2.2 Markov Decision Processes. Markov deci... |

498 | Markov games as a framework for multi-agent reinforcement learning
- Littman
- 1994
Citation Context: ...[garbled columns from Table 3.1 — Examples of generalized Markov decision processes and their summary operators, with rows for risk-sensitive mdps [62], exploration-sensitive mdps [71], Markov games [90] (see text), and information-state mdps [117]]... between the summa... |

473 | Integrated architectures for learning, planning, and reacting based on approximating dynamic programming
- Sutton
- 1990
Citation Context: ...sive to gather, at the cost of invoking a full-fledged planning system whenever the agent needs to update its policy. An intermediate approach is to plan incrementally as new experience is encountered [155, 111, 119]. In simulated reinforcement learning, a reinforcement-learning agent is introduced into an environment with a known structure, but is forced to behave as if the structure is not known. Although this... |

457 | Dynamic Programming and Optimal Control - Bertsekas - 2007 |

373 |
Dynamic Programming: Deterministic and Stochastic Models
- Bertsekas
- 1987
Citation Context: ...m the traditional model and compute solutions in the form of policies instead of action sequences. Much of the content of this chapter is a recapitulation of work in the operations-research literature [126, 15, 44, 46, 68, 13] and the reinforcement-learning literature [153, 173, 10, 145]. The concepts and background introduced here will be built upon in all the succeeding chapters. 2.2 Markov Decision Processes. Markov deci... |

373 | Temporal difference learning and TD-gammon - Tesauro - 1995 |

336 |
The Optimal Control of Partially Observable Markov Processes
- Sondik
- 1971
Citation Context: ...mdps with informational discounting easier to solve than general pomdps? Are they decidable? Can the idea of informational discounting make it possible to analyze approximate state estimators? Sondik [149] defines the class of finitely transient policies, and shows that these policies can be represented as finite-memory policies. The pomdp decision problem described earlier is decidable for the class of pom... |

324 | Universal plans for reactive robots in unpredictable environments
- Schoppers
- 1987
Citation Context: ...n of part, use painting procedure A if part right-side up, use painting procedure B if part upside down. An extreme form of conditional plan is a stationary policy, sometimes called a "universal plan" [138]. This type of policy has no specified sequence at all; instead, the agent examines the entire state at each step and then chooses the action most appropriate in the current state. For an agent to foll... |

314 | Prioritized sweeping: reinforcement learning with less data and less time
- Moore, Atkeson
- 1993
Citation Context: ...sive to gather, at the cost of invoking a full-fledged planning system whenever the agent needs to update its policy. An intermediate approach is to plan incrementally as new experience is encountered [155, 111, 119]. In simulated reinforcement learning, a reinforcement-learning agent is introduced into an environment with a known structure, but is forced to behave as if the structure is not known. Although this... |

307 |
The complexity of Markov decision processes
- Papadimitriou, Tsitsiklis
- 1987
Citation Context: ...s efficiently in parallel, and the other links the deterministic mdp problem to a general class of shortest-path problems, which results in an efficient sequential algorithm. Papadimitriou and Tsitsiklis [116] give a dynamic-programming algorithm for solving deterministic mdps efficiently on a parallel machine. A sequential version of their algorithm (given in Table 2.6) runs in |S|^2 + |S||A| + 2|S|^4 time. ... |

295 |
The optimal control of partially observable Markov processes over a finite horizon
- Smallwood, Sondik
- 1973
Citation Context: ...[pseudocode residue from Table 7.10: Computing a useful policy tree at x, given action a]... will we know when we are done?" The first question was answered in the context of Smallwood and Sondik's [148] pomdp algorithm. The subroutine UsefulPolicyTreeFromState in Table 7.10 shows how to construct a t-step policy tree for action a that is useful at information state x,... |

276 |
Reinforcement learning with selective perception and hidden state
- McCallum
- 1995
Citation Context: ...pomdps, it is not sufficient for learning; even if immediate observations are enough to make optimal action choices, learning which choices to make can require additional information about past history [103]. Q-learning. Many researchers have used Q-learning and other mdp-based reinforcement-learning algorithms to learn policies for partially observable domains. One interesting example is Wilson's work on... |

275 | Acting optimally in partially observable stochastic domains
- Cassandra, Kaelbling, et al.
- 1994
Citation Context: ...endix have appeared in earlier papers: "Planning and acting in partially observable stochastic domains" [73] with Kaelbling and Cassandra, "Acting optimally in partially observable stochastic domains" [32] with Cassandra and Kaelbling, and "An introduction to reinforcement learning" [74] with Kaelbling and Moore. Chapter 2 began with an example of a robot deciding how to navigate in a large office buildin... |

258 | An algorithm for probabilistic planning
- Kushmerick, Hanks, et al.
- 1995
Citation Context: ...ard over the infinite horizon [102], maximum expected undiscounted reward until goal (cost-to-go) [29], minimax expected undiscounted goal probability [36], maximum expected undiscounted goal probability [87], maximum multiagent discounted expected reward [22]. Table 1.3: Several popular objective functions. Under the discounted objective, the discount factor (strictly between 0 and 1) controls how much effect future rewards... |

253 | Probabilistic robot navigation in partially observable environments
- Simmons, Koenig
- 1995
Citation Context: ...o make the corresponding pomdps solvable. Some progress has been made: Hansen [60] blended completely unobservable and completely observable mdps to form an intermediate model, and Simmons and Koenig [144] controlled a robot using a pomdp model. From the interest that has been generated, I believe it is likely that a great deal of additional progress will be made in the next few years. Chapter 7: Info... |

251 | Generalization in reinforcement learning: Safely approximating the value function
- Boyan, Moore
- 1995
Citation Context: ...horizon [62], minimax expected average reward over the infinite horizon [183], maximum expected average reward over the infinite horizon [102], maximum expected undiscounted reward until goal (cost-to-go) [29], minimax expected undiscounted goal probability [36], maximum expected undiscounted goal probability [87], maximum multiagent discounted expected reward [22]. Table 1.3: Several popular objective functio... |

246 |
Temporal credit assignment in reinforcement learning
- Sutton
- 1984
Citation Context: ...es instead of action sequences. Much of the content of this chapter is a recapitulation of work in the operations-research literature [126, 15, 44, 46, 68, 13] and the reinforcement-learning literature [153, 173, 10, 145]. The concepts and background introduced here will be built upon in all the succeeding chapters. 2.2 Markov Decision Processes. Markov decision processes are the simplest family of models I will consid... |

237 | Residual Algorithms: Reinforcement Learning with Function Approximation
- Baird
- 1995
Citation Context: ... for solving a particular class of continuous state-space mdps, Gordon [58] and Tsitsiklis and Van Roy [164] demonstrated closely related provably convergent dynamic-programming algorithms, and Baird [7] derived a gradient-descent rule for adjusting the parameters representing a value function in a reinforcement-learning setting; a survey of these techniques and others has recently been compiled [30].... |

237 |
Stochastic Games
- Shapley
- 1953
Citation Context: ...these quantities can be defined in terms of each other. In this section, I discuss methods for finding these quantities. 3.3.1 Value Iteration. The method of value iteration, or successive approximations [13, 143], is a way of iteratively computing arbitrarily good approximations to the optimal value function V*. A single step of the process starts with an estimate, V_{t-1}, of the optimal value function, and pro... |
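The successive-approximation step the excerpt describes — back up an estimate V_{t-1} through one Bellman update and repeat — can be sketched on a small mdp. The two-state model and all its numbers are illustrative assumptions, not an example from the thesis:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP with discount g.
g = 0.9
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # T[s, a, s']
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # R[s, a]

def value_iteration(eps=1e-10):
    """Iterate V_t(s) = max_a [R(s,a) + g * sum_s' T(s,a,s') V_{t-1}(s')]."""
    V = np.zeros(len(R))
    while True:
        V_new = (R + g * (T @ V)).max(axis=1)   # one Bellman backup
        if np.max(np.abs(V_new - V)) < eps:
            return V_new
        V = V_new

V_star = value_iteration()
```

Each backup is a contraction with factor g, so successive estimates approach the optimal value function geometrically; the loop stops once consecutive iterates agree to within `eps`.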

232 | Learning Policies for Partially Observable Environments: Scaling Up
- Littman, Cassandra, et al.
- 1995
Citation Context: ...domains" [32] with Cassandra and Kaelbling, "The witness algorithm: Solving partially observable Markov decision processes" [92], "Learning policies for partially observable environments: Scaling up" [94] with Cassandra and Kaelbling, and "An efficient algorithm for dynamic programming in partially observable Markov decision processes" [95] with Cassandra and Kaelbling. In this chapter, I present a numbe... |

224 | Exploiting structure in policy construction
- Dearden, R, et al.
- 1995
Citation Context: ...w that the problem is somehow inherently intractable? There are representations for rewards and transitions that make it possible to specify compact models for mdps with exponential-size state spaces [87, 21, 24, 113]. What are the complexity issues? It is probably computationally intractable to find ε-optimal policies using compact representations, but are there useful subclasses of mdps that can be solved efficien... |

224 | The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces
- Moore, Atkeson
- 1995
Citation Context: ...tinuum of possible beliefs that result from a particular finite-state problem. I also consider a particular limiting case of finite state spaces, that of an environment with a single state. Other authors [112] have addressed environments with continuous state spaces. Finite vs. continuous actions: The set of action choices can also be either finite or continuous; again I will be mainly concerned with the finit... |

214 |
Interactions between learning and evolution
- Ackley, Littman
- 1991
Citation Context: ...d" environments faced by biological agents. In particular, the only true reward signal in a biological system is death, which is perceptible by the agent too late to be of use. Simulation experiments [3] have shown that, over the span of many generations, artificial agents can evolve their own proximal reward functions that are useful in predicting the relative goodness and badness of situations; in p... |

213 |
NP is as easy as detecting unique solutions
- Valiant, Vazirani
- 1986
Citation Context: ...les and negated variables). A satisfying assignment maps each of the variables to either "true" or "false" so the entire formula evaluates to "true." There is a result, proved by Valiant and Vazirani [165], that implies that there exists a polynomial-time algorithm for finding a satisfying assignment for a formula that is guaranteed to have at most one satisfying assignment only if RP = NP. I will s... |

207 | Stable Function Approximation in Dynamic Programming
- Gordon
- 1995
Citation Context: ...nd dynamic programming using approximate value functions is attracting increasing interest. Boyan and Moore [29] examined methods for solving a particular class of continuous state-space mdps, Gordon [58] and Tsitsiklis and Van Roy [164] demonstrated closely related provably convergent dynamic-programming algorithms, and Baird [7] derived a gradient-descent rule for adjusting the parameters representi... |

207 | Convergence of stochastic iterative dynamic programming algorithms
- Jaakkola, Jordan, et al.
- 1994
Citation Context: ... value function. The specific generalized mdp model presented here is both more and less general than Szepesvari's model; however, Theorem 3.7 is useful in both frameworks. Jaakkola, Jordan, and Singh [69] and Tsitsiklis [163] developed the connection between stochastic-approximation theory and reinforcement learning, focusing on the mdp model. The mathematics and insight used in Theorem 3.7 are not su... |

200 | Probabilistic planning with information gathering and contingent execution
- Draper, Hanks, et al.
- 1994
Citation Context: ...t makes it difficult and worthy of study is that the states of the environment are represented in a propositional form. Let us consider a simple example, adapted from a paper by Draper, Hanks, and Weld [50, 49]. The environment is a manufacturing plant and the agent's task is to process and ship a particular widget. At any moment in time, the widget is either painted (PA) or not, flawed (FL) or not, blemished (... |

195 | Learning and sequential decision making
- Barto, Sutton, et al.
- 1989
Citation Context: ...es instead of action sequences. Much of the content of this chapter is a recapitulation of work in the operations-research literature [126, 15, 44, 46, 68, 13] and the reinforcement-learning literature [153, 173, 10, 145]. The concepts and background introduced here will be built upon in all the succeeding chapters. 2.2 Markov Decision Processes. Markov decision processes are the simplest family of models I will consid... |

194 | Reinforcement Learning with Perceptual Aliasing: The Perceptual Distinctions Approach
- Chrisman
- 1992
Citation Context: ...f "external memory." ...using techniques from Chapter 7; both algorithmic methods and learning methods are appropriate. This section describes several attempts at learning the model itself. Chrisman [34] showed how the Baum-Welch algorithm [11] for learning hidden Markov models (HMMs) could be adapted to learning transition and observation functions for pomdps. He, and later McCallum [104], gave heur... |

191 |
A survey of partially observable Markov decision processes: Theory, models, and algorithms
- Monahan
- 1982
Citation Context: ...obot navigation, pomdps are useful for solving problems of factory process control, resource allocation under uncertainty, cost-sensitive testing, and a variety of other complex real-world challenges [109]. One important facet of the pomdp approach is that there is no distinction drawn between actions taken to change the state of the world and actions taken to gain information. This is important becaus...