
## Algorithms for Sequential Decision Making (1996)

Citations: 211 (8 self)

### Citations

14035 | Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman
- Garey, Johnson
- 1979
Citation Context: ...e in S0? A polynomial-time algorithm for solving this problem could be used to solve quantified-boolean-formula problems in polynomial time. Since the quantified-boolean-formula problem is PSPACE-hard [55], this shows that the polynomial-horizon, boolean-reward POMDP problem is also PSPACE-hard. The proof is due to Papadimitriou and Tsitsiklis [116]. 6.5.3 Infinite Horizon, Deterministic. The unobserv...

10589 | Introduction to Algorithms
- Cormen, Leiserson, et al.
- 2001
Citation Context: ...ction for a given policy appears in Table 2.1; it will be used later in more complex algorithms. The system of linear equations can be solved by Gaussian elimination or any of a number of other methods [40]. Now we know how to compute a value function, given a policy. We can also define a policy based on a value function. Given any value function V, the greedy policy with respect to that value function,...
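The linear system referenced in this context is the policy-evaluation equation V = R + γTV for a fixed policy; a minimal NumPy sketch (the 3-state transition matrix and rewards are illustrative, not from the thesis):

```python
import numpy as np

# Hypothetical 3-state MDP under a fixed policy: T is the policy's
# transition matrix T[s, s'], R the expected immediate reward per state.
T = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9]])
R = np.array([0.0, 1.0, 5.0])
gamma = 0.95  # discount factor

# V = R + gamma * T V  rearranges to  (I - gamma * T) V = R,
# a linear system solvable by Gaussian elimination.
V = np.linalg.solve(np.eye(3) - gamma * T, R)
```

Solving the system directly is exact; iterative methods reach the same fixed point more slowly.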

3931 | Dynamic Programming
- Bellman
- 1957
Citation Context: ...m the traditional model and compute solutions in the form of policies instead of action sequences. Much of the content of this chapter is a recapitulation of work in the operations-research literature [126, 15, 44, 46, 68, 13] and the reinforcement-learning literature [153, 173, 10, 145]. The concepts and background introduced here will be built upon in all the succeeding chapters. 2.2 Markov Decision Processes. Markov deci...

2214 | STRIPS: A new approach to the application of theorem proving to problem solving
- Fikes, Nilsson
- 1971
Citation Context: ...r-science perspective. I undertake this type of analysis throughout the thesis. Planning has been one of the primary subject areas in artificial intelligence since the development of the STRIPS system [53]. Early work on planning focused on the generation of plans for reaching some goal state in a deterministic environment. A more recent trend has been to consider decision-theoretic planning, in which ...

1711 | Reinforcement learning: A survey
- Kaelbling, Littman, et al.
- 1996
Citation Context: ...ortions of this chapter have appeared in earlier papers: "Planning and acting in partially observable stochastic domains" [73] with Kaelbling and Cassandra, "An introduction to reinforcement learning" [74] with Kaelbling and Moore, and "On the complexity of solving Markov decision problems" [96] with Dean and Kaelbling. Consider the problem of creating a policy to guide a robot through an office building...

1275 | Linear Programming and Extensions
- Dantzig
- 1963
Citation Context: ...actical use; however, refinements of Karmarkar's [78] polynomial-time algorithm are competitive with the fastest practical algorithms. Another algorithm for solving linear programs, the simplex method [41], is theoretically inefficient but runs extremely quickly in practice. An excellent book by Schrijver [140] describes the theory of linear programs and the algorithms used to solve them. Appendix B ...

1132 | A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics
- Baum, Petrie, et al.
- 1970
Citation Context: ...from Chapter 7; both algorithmic methods and learning methods are appropriate. This section describes several attempts at learning the model itself. Chrisman [34] showed how the Baum-Welch algorithm [11] for learning hidden Markov models (HMMs) could be adapted to learning transition and observation functions for POMDPs. He, and later McCallum [104], gave heuristic state-splitting rules to attempt to...

1095 | Planning and acting in partially observable stochastic domains
- Kaelbling, Littman, et al.
- 1998
Citation Context: ...sed by many other researchers. Chapter 2: Markov Decision Processes. Portions of this chapter have appeared in earlier papers: "Planning and acting in partially observable stochastic domains" [73] with Kaelbling and Cassandra, "An introduction to reinforcement learning" [74] with Kaelbling and Moore, and "On the complexity of solving Markov decision problems" [96] with Dean and Kaelbling. Cons...

986 | Matrix multiplication via arithmetic progressions
- Coppersmith, Winograd
- 1990
Citation Context: ...on a2 results in a transition to random state... In theory, policy evaluation can be performed faster, because it primarily requires inverting a |S| x |S| matrix, which can be done in O(|S|^2.376) time [39]. [Figure 2.2 residue omitted.] Figure 2.3: Simple policy iteration requires an exponential number of iterations to generate an optimal solution to the family of MDPs il...

854 | A new polynomial-time algorithm for linear programming. Combinatorica
- Karmarkar
- 1984
Citation Context: ...omial in B. There are algorithms for solving rational linear programs that take time polynomial in the number of variables and constraints as well as the number of bits used to represent the coefficients [78, 79]. Thus, MDPs can be solved in time polynomial in |S|, |A|, and B. Descendants of Karmarkar's algorithm [78] are considered among the most practically efficient linear-programming algorithms. It is popula...

751 | Dynamic Programming and Optimal Control. Athena Scientific
- Bertsekas
- 2005

740 | Dynamic Programming and Markov Processes
- Howard
- 1960
Citation Context: ...m the traditional model and compute solutions in the form of policies instead of action sequences. Much of the content of this chapter is a recapitulation of work in the operations-research literature [126, 15, 44, 46, 68, 13] and the reinforcement-learning literature [153, 173, 10, 145]. The concepts and background introduced here will be built upon in all the succeeding chapters. 2.2 Markov Decision Processes. Markov deci...

632 | Learning to act using real-time dynamic programming
- Barto, Bradtke, et al.
- 1995
Citation Context: ...and building a complete universal plan is infeasible. Using a reinforcement-learning algorithm in such an environment can help the agent find appropriate behavior for the most common and important states [9, 43]. The most noteworthy example of this technique remains one of the biggest successes of reinforcement learning: Tesauro's backgammon-learning program [159], which is now reliably ranked as one of the wo...

601 | Markov games as a framework for multi-agent reinforcement learning
- Littman
- 1994
Citation Context: [garbled fragments of Table 3.1, "Examples of generalized Markov decision processes and their summary operators," listing risk-sensitive MDPs [62], exploration-sensitive MDPs [71], Markov games [90] (see text), and information-state MDPs [117]] ...between the summa...

482 | Dynamic Programming: Deterministic and Stochastic Models
- Bertsekas
- 1987
Citation Context: ...m the traditional model and compute solutions in the form of policies instead of action sequences. Much of the content of this chapter is a recapitulation of work in the operations-research literature [126, 15, 44, 46, 68, 13] and the reinforcement-learning literature [153, 173, 10, 145]. The concepts and background introduced here will be built upon in all the succeeding chapters. 2.2 Markov Decision Processes. Markov deci...

327 | Acting optimally in partially observable stochastic domains
- Cassandra, Kaelbling, et al.
- 1994
Citation Context: ...endix have appeared in earlier papers: "Planning and acting in partially observable stochastic domains" [73] with Kaelbling and Cassandra, "Acting optimally in partially observable stochastic domains" [32] with Cassandra and Kaelbling, and "An introduction to reinforcement learning" [74] with Kaelbling and Moore. Chapter 2 began with an example of a robot deciding how to navigate in a large office buildin...

322 | Reinforcement learning with selective perception and hidden state
- McCallum
- 1996
Citation Context: ...POMDPs, it is not sufficient for learning; even if immediate observations are enough to make optimal action choices, learning which choices to make can require additional information about past history [103]. Q-learning: Many researchers have used Q-learning and other MDP-based reinforcement-learning algorithms to learn policies for partially observable domains. One interesting example is Wilson's work on...
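The tabular Q-learning update referenced in this context fits in a few lines; the two-state chain environment below is a hypothetical illustration, not from the thesis:

```python
import random

# Hypothetical deterministic two-state chain for illustration:
# action 1 in state 0 moves to state 1 and pays reward 1; everything
# else moves to state 0 and pays 0.
def step(s, a):
    if s == 0 and a == 1:
        return 1, 1.0
    return 0, 0.0

alpha, gamma = 0.1, 0.9
Q = [[0.0, 0.0], [0.0, 0.0]]   # Q[s][a], initialized to zero
s = 0
random.seed(0)
for _ in range(5000):
    a = random.randrange(2)    # explore uniformly at random
    s_next, r = step(s, a)
    # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a').
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
    s = s_next
```

With this deterministic environment the values settle near the fixed point Q*(0,1) = 1/(1 - 0.81).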

306 | Residual algorithms: Reinforcement learning with function approximation
- Baird
- 1995
Citation Context: ...for solving a particular class of continuous state-space MDPs, Gordon [58] and Tsitsiklis and Van Roy [164] demonstrated closely related provably convergent dynamic-programming algorithms, and Baird [7] derived a gradient-descent rule for adjusting the parameters representing a value function in a reinforcement-learning setting; a survey of these techniques and others has recently been compiled [30]....

305 | Generalization in reinforcement learning: Safely approximating the value function
- Boyan, Moore
- 1995
Citation Context: ...horizon [62], minimax expected average reward over the infinite horizon [183], maximum expected average reward over the infinite horizon [102], maximum expected undiscounted reward until goal (cost-to-go) [29], minimax expected undiscounted goal probability [36], maximum expected undiscounted goal probability [87], maximum multiagent discounted expected reward [22]. Table 1.3: Several popular objective functio...

296 | Learning Policies for Partially Observable Environments: Scaling Up
- Littman, Cassandra, et al.
- 1995
Citation Context: ...domains" [32] with Cassandra and Kaelbling, "The witness algorithm: Solving partially observable Markov decision processes" [92], "Learning policies for partially observable environments: Scaling up" [94] with Cassandra and Kaelbling, and "An efficient algorithm for dynamic programming in partially observable Markov decision processes" [95] with Cassandra and Kaelbling. In this chapter, I present a numbe...

285 | An algorithm for probabilistic planning
- Kushmerick, Hanks, et al.
- 1995
Citation Context: ...ard over the infinite horizon [102], maximum expected undiscounted reward until goal (cost-to-go) [29], minimax expected undiscounted goal probability [36], maximum expected undiscounted goal probability [87], maximum multiagent discounted expected reward [22]. Table 1.3: Several popular objective functions. Under the discounted objective, the discount factor 0 < γ < 1 controls how much effect future rewards...
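The effect of the discount factor mentioned in this context can be checked with a few lines of arithmetic; the constant reward stream below is illustrative:

```python
# Discounted return: with 0 < gamma < 1, a reward stream r_0, r_1, ...
# is valued as sum_t gamma^t * r_t; a constant reward of 1 forever is
# worth 1 / (1 - gamma) in the limit.
gamma = 0.9
rewards = [1.0] * 50
ret = sum(gamma ** t * r for t, r in enumerate(rewards))
# The finite sum equals (1 - gamma**50) / (1 - gamma), just below 10.
```

Smaller values of gamma shrink this sum toward the immediate reward, which is the "how much effect future rewards have" knob the text describes.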

275 | Discounted dynamic programming
- Blackwell
- 1965
Citation Context: ...ductory textbook [132] views all of artificial intelligence from an agent perspective. Interrelationships between the undiscounted-reward and the discounted-reward criteria were discussed by Blackwell [19]. A survey of results concerning the average-reward criterion was written by Arapostathis et al. [4]. Fernandez-Gaucherand, Ghosh and Marcus [52] explored combinations of discounted and average reward ...

263 | Stable function approximation in dynamic programming
- Gordon
- 1995
Citation Context: ...nd dynamic programming using approximate value functions is attracting increasing interest. Boyan and Moore [29] examined methods for solving a particular class of continuous state-space MDPs, Gordon [58] and Tsitsiklis and Van Roy [164] demonstrated closely related provably convergent dynamic-programming algorithms, and Baird [7] derived a gradient-descent rule for adjusting the parameters representi...

255 | On the convergence of stochastic iterative dynamic programming algorithms
- Jaakkola, Jordan, et al.
- 1994
Citation Context: ...value function. The specific generalized MDP model presented here is both more and less general than Szepesvari's model; however, Theorem 3.7 is useful in both frameworks. Jaakkola, Jordan, and Singh [69] and Tsitsiklis [163] developed the connection between stochastic-approximation theory and reinforcement learning, focusing on the MDP model. The mathematics and insight used in Theorem 3.7 are not su...

249 | Exploiting structure in policy construction
- Boutilier, Dearden, et al.
- 1995
Citation Context: ...w that the problem is somehow inherently intractable? There are representations for rewards and transitions that make it possible to specify compact models for MDPs with exponential-size state spaces [87, 21, 24, 113]. What are the complexity issues? It is probably computationally intractable to find ε-optimal policies using compact representations, but are there useful subclasses of MDPs that can be solved efficien...

243 | Interactions between learning and evolution
- Ackley, Littman
- 1991
Citation Context: ...d" environments faced by biological agents. In particular, the only true reward signal in a biological system is death, which is perceptible by the agent too late to be of use. Simulation experiments [3] have shown that, over the span of many generations, artificial agents can evolve their own proximal reward functions that are useful in predicting the relative goodness and badness of situations; in p...

242 | From local actions to global tasks: Stigmergy and collective robotics
- Beckers, Holland, et al.
- 1994
Citation Context: ...rst, learn the POMDP model from experience, then (or concurrently) find an optimal policy for the model. Given a model, a policy can be found (footnote 2: "This type of memory can be viewed as a form of stigmergy [12]. The idea behind stigmergy is that the actions of an agent change the environment in a way that affects later behavior, resulting in a form of 'external memory.'") ...using techniques from Chapter 7; both...

232 | Packet routing in dynamically changing networks: A reinforcement learning approach
- Boyan, Littman
- 1994
Citation Context: ...k-monitoring example was inspired by conversations with my colleagues at Bellcore; a simple study of reinforcement learning in a different telecommunications domain was undertaken by Boyan and Littman [27]. Additional applications are described in Puterman's textbook [126]. Kaelbling's book [72] provides a philosophical discussion of agents, along with an exploration of several kinds of sequential deci...

230 | How Good is the Simplex Algorithm?
- Klee, Minty
- 1972
Citation Context: ...methods differ as to their choice of pivot rule, the rule for choosing which constraints to swap in and out at each iteration. Although simplex methods seem to perform well in practice, Klee and Minty [81] showed that one of Dantzig's choices of pivoting rule could lead the simplex algorithm to take an exponential number of iterations on some linear programs. Since then, other pivoting rules have been ...

224 | A survey of algorithmic methods for partially observed Markov decision processes
- Lovejoy
- 1991
Citation Context: ...representation of the exact value functions produced in value iteration; algorithms have been developed that attempt to represent approximations of the infinite-horizon value function more directly [100], but I will not discuss these representations here. 7.3 Algorithms for Solving Information-state MDPs. The information-state MDP is a special kind of MDP, and many different algorithms are available fo...

223 | Probabilistic Planning with Information Gathering and Contingent Execution
- Draper, Hanks, et al.
- 1994
Citation Context: ...t makes it difficult and worthy of study is that the states of the environment are represented in a propositional form. Let us consider a simple example, adapted from a paper by Draper, Hanks, and Weld [50, 49]. The environment is a manufacturing plant and the agent's task is to process and ship a particular widget. At any moment in time, the widget is either painted (PA) or not, flawed (FL) or not, blemished (...

217 | Reinforcement learning with perceptual aliasing: The perceptual distinctions approach
- Chrisman
- 1992
Citation Context: ...f "external memory." ...using techniques from Chapter 7; both algorithmic methods and learning methods are appropriate. This section describes several attempts at learning the model itself. Chrisman [34] showed how the Baum-Welch algorithm [11] for learning hidden Markov models (HMMs) could be adapted to learning transition and observation functions for POMDPs. He, and later McCallum [104], gave heur...

206 | The complexity of stochastic games
- Condon
- 1992
Citation Context: ...e infinite horizon [183], maximum expected average reward over the infinite horizon [102], maximum expected undiscounted reward until goal (cost-to-go) [29], minimax expected undiscounted goal probability [36], maximum expected undiscounted goal probability [87], maximum multiagent discounted expected reward [22]. Table 1.3: Several popular objective functions. Under the discounted objective, the discount ...

204 | Learning and sequential decision making
- Barto, Sutton, et al.
- 1989
Citation Context: ...es instead of action sequences. Much of the content of this chapter is a recapitulation of work in the operations-research literature [126, 15, 44, 46, 68, 13] and the reinforcement-learning literature [153, 173, 10, 145]. The concepts and background introduced here will be built upon in all the succeeding chapters. 2.2 Markov Decision Processes. Markov decision processes are the simplest family of models I will consid...

195 | Finite State Markovian Decision Processes
- Derman
- 1970

179 | Planning under time constraints in stochastic domains
- Dean, Kaelbling, et al.
- 1995
Citation Context: ...ses the action most appropriate in the current state. For an agent to follow such a plan, it must have access to some function that returns an action choice for every possible state. A partial policy [43] can be used to overcome the difficulty of constructing and manipulating complicated universal plans; however, few theoretical tools are available to assess the success of a partial policy. Stationary p...

162 | On the complexity of solving Markov decision problems
- Littman, Dean, et al.
- 1995

149 | Optimal Control of Markov Decision Processes with Incomplete State Estimation
- Astrom
- 1965
Citation Context: ...the agent started in state s0, and then normalizes the result. The vector xt of probabilities is an information state, which is an adequate summary of the past to allow optimal decisions to be made [5]. Since xt can be written entirely in terms of x0, which does not change from step to step, and Dt, which does, we can use the table Dt to represent the state of the system at time t. As there are (1 ...
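The information-state update described in this context is a Bayes-filter step: predict forward through the transition model, reweight by the observation likelihood, and normalize. A minimal sketch with hypothetical 2-state, 2-observation matrices (illustrative numbers, not from the thesis):

```python
import numpy as np

# T[s, s']: transition probabilities for the chosen action.
# O[s', z]: probability of observing z in next state s'.
T = np.array([[0.7, 0.3],
              [0.2, 0.8]])
O = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def belief_update(x, z):
    # Predict through T, weight by the likelihood of observation z,
    # then normalize so the belief remains a probability distribution.
    x_next = (x @ T) * O[:, z]
    return x_next / x_next.sum()

x0 = np.array([0.5, 0.5])
x1 = belief_update(x0, z=0)
```

The normalized vector x1 is exactly the "information state" the snippet refers to: a sufficient summary of the history for optimal decision making.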

137 | A Polynomial Algorithm in Linear Programming. Doklady Akademii Nauk
- Khachian
- 1979
Citation Context: ...omial in B. There are algorithms for solving rational linear programs that take time polynomial in the number of variables and constraints as well as the number of bits used to represent the coefficients [78, 79]. Thus, MDPs can be solved in time polynomial in |S|, |A|, and B. Descendants of Karmarkar's algorithm [78] are considered among the most practically efficient linear-programming algorithms. It is popula...

131 | Computationally feasible bounds for partially observed Markov decision processes
- Lovejoy
Citation Context: ...at use a piecewise-linear convex representation of the value function; for instance, in one class of methods, the infinite-horizon value function is approximated using a fixed grid of information states [99]. Sondik [150] presented a policy-iteration algorithm for finding approximate solutions to infinite-horizon POMDPs. Sawaki and Ichikawa [135] advocated the use of the value-iteration method, effectively re...

130 | Average reward reinforcement learning: Foundations, algorithms, and empirical results
- Mahadevan
- 1996
Citation Context: ...(Chapter 5), maximum worst-case discounted reward over the infinite horizon [62], minimax expected average reward over the infinite horizon [183], maximum expected average reward over the infinite horizon [102], maximum expected undiscounted reward until goal (cost-to-go) [29], minimax expected undiscounted goal probability [36], maximum expected undiscounted...

128 | Computing optimal policies for partially observable decision processes using compact representations
- Boutilier, Poole
- 1996
Citation Context: ...r this type of problem. Of course, any of the POMDP algorithms described in Chapter 6 can be used, once the complete state space is constructed. The C-Buridan [50, 49] and structured policy iteration [25] algorithms solve partially observable problems using compact representations directly. The C-Buridan algorithm discovers the DAG-structured plan in Figure 8.1. Draper et al. note that "significant sea...

126 | An input/output HMM architecture
- Bengio, Frasconi
- 1995
Citation Context: ...ber of states in the approximate model; Chrisman used a rule based on the accurate prediction of observations, and McCallum used a rule based on the accurate prediction of values. Bengio and Frasconi [14] created an algorithm for learning input/output HMMs, a model that is equivalent to a POMDP with no rewards. Abe and Warmuth [1] studied the problem of learning approximately correct probabilistic autom...

125 | Discrete-time controlled Markov processes with average cost criterion: A survey
- Arapostathis, Borkar, et al.
- 1993
Citation Context: ...hips between the undiscounted-reward and the discounted-reward criteria were discussed by Blackwell [19]. A survey of results concerning the average-reward criterion was written by Arapostathis et al. [4]. Fernandez-Gaucherand, Ghosh and Marcus [52] explored combinations of discounted and average reward as a way to better trade off short-term and asymptotic reward. In the area of reinforcement-learning, ...

119 | Overcoming incomplete perception with utile distinction memory
- McCallum
- 1993
Citation Context: ...f. Chrisman [34] showed how the Baum-Welch algorithm [11] for learning hidden Markov models (HMMs) could be adapted to learning transition and observation functions for POMDPs. He, and later McCallum [104], gave heuristic state-splitting rules to attempt to learn the smallest possible model that captures the structure of a given environment. The Baum-Welch algorithm is known to converge to locally opti...

117 | Efficient Computation of Equilibria for Extensive Two-Person Games
- Koller, Megiddo, et al.
- 1996
Citation Context: ...ugh these models are of great interest to individuals studying, for example, the coordination of multiple robots, only one group of researchers has explored algorithms for incomplete-information games [85], and only for a subset of these models. 1.3 Evaluation Criteria. The previous section began a formal treatment of environment models that I will expand upon in the coming chapters. In this section, ...

107 | Memoryless policies: Theoretical limitations and practical results
- Littman
- 1994
Citation Context: ...ic structure of memory-based solutions to POMDPs. Problems relating to finding observation-to-state mappings in POMDPs, sometimes called memoryless policies, have been studied in many different contexts [182, 91, 146, 70]. Finding the optimal memoryless policy is NP-hard [91], and it often has very poor performance. In the case of the environment of Figure 6.1, for example, no memoryless policy takes less than an infin...

102 | A subexponential randomized simplex algorithm
- Kalai
- 1992

98 | Planning under uncertainty: Structural assumptions and computational leverage
- Boutilier, Dean, et al.
- 1998
Citation Context: ...lly intractable to find ε-optimal policies using compact representations, but are there useful subclasses of MDPs that can be solved efficiently? This question is explored by Boutilier, Dean, and Hanks [23]. The dual linear-programming formulation of MDPs has a flow-like interpretation. Algorithms for finding min-cost flows have been studied intensively over the last few years. Are there any flow-like algorithm...

96 | On the computational complexity of approximating distributions by probabilistic automata
- Abe, Warmuth
- 1992
Citation Context: ...d a rule based on the accurate prediction of values. Bengio and Frasconi [14] created an algorithm for learning input/output HMMs, a model that is equivalent to a POMDP with no rewards. Abe and Warmuth [1] studied the problem of learning approximately correct probabilistic automata from experience. Their learning framework is very interesting, and worth extending to POMDPs. Hernandez-Lerma and Marcus [...

87 | Algorithms for partially observed Markov decision processes
- Cheng
- 1988
Citation Context: ...Scherer [177] extended the reward-revision method, developed for MDPs, to POMDPs. The development of the witness algorithm [32, 92, 95] was inspired most directly by Cheng's linear support algorithm [33], with the difference that standard linear programming was to be used in place of vertex enumeration to identify missing vectors. An early version was shown to be incorrect [92], and later versions int...

84 | On algorithms for simple stochastic games
- Condon
- 1993
Citation Context: ...optimal policy for the opponent given that the agent is following π1. These choices are not all equivalent; in fact, only choices 2 and 3, which are duals, lead to algorithms that converge in general [37]. We therefore base the implementation of our improvePoliciesGame subroutine in Table 4.3 on choice 2. Since we need π2 to be the optimal counter-strategy to the greedy π1, Table 4.4 shows how to compute...

70 | Memory approaches to reinforcement learning in non-Markovian domains
- Lin, Mitchell
- 1992
Citation Context: ...can be trained using backpropagation through time or some other suitable technique, and learns to retain "history features" to predict value. This approach has been studied by a number of researchers [106, 89, 137]. It seems to work effectively on simple problems, but can suffer from convergence to local optima on more complex problems. Register memory: Another short-term memory structure that has been studied in ...

68 | A new approach to linear filtering and prediction problems
- Kalman
- 1960
Citation Context: ...n, together with its knowledge of the underlying dynamics of the world, to maintain an estimate of its location. Many engineering applications follow this approach, using methods like the Kalman filter [77] to maintain a running estimate of the robot's spatial uncertainty, expressed as a Gaussian probability distribution in Cartesian space. This approach will not do for our robot, though. Its uncert...
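The running Gaussian estimate mentioned in this context reduces, in one dimension, to a two-step predict/update rule; a minimal sketch with illustrative numbers (real robot filters track multivariate Gaussians):

```python
# One predict/update step of a 1-D Kalman filter: the position estimate
# is a Gaussian summarized by (mean, variance).
def kalman_step(mean, var, motion, motion_var, z, obs_var):
    # Predict: shift the mean by the commanded motion; uncertainty grows.
    mean, var = mean + motion, var + motion_var
    # Update: blend the prediction with observation z, weighted by the
    # Kalman gain (relative precision of prediction vs. observation).
    k = var / (var + obs_var)
    return mean + k * (z - mean), (1 - k) * var

m, v = kalman_step(mean=0.0, var=1.0, motion=1.0, motion_var=0.5,
                   z=1.2, obs_var=0.5)
```

Note the variance shrinks after every observation, which is exactly why a unimodal Gaussian cannot represent the multimodal uncertainty the thesis's robot faces.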

64 | Reinforcement Learning with High-dimensional, Continuous Actions
- Baird, Klopf
- 1993
Citation Context: ...Chapter 5 examines a model in which the selected actions are actually continuous probability distributions over a finite set of choices. Continuous actions can also be discretized, although specific algorithms [8] can solve this type of environment more effectively. Episodic vs. sequential: In an episodic environment, the agent faces the same problem over and over again. I am more concerned with sequential envir...

63 | The monotone and planar circuit value problems are log-space complete for P
- Goldschlager
- 1977
Citation Context: ...games in polynomial time. However, unlike MDPs, the problem remains P-hard even when all transitions are deterministic. This can be shown by an easy reduction from the monotone circuit-value problem [57]; essentially, the opponent takes the place of the stochastic transitions in Papadimitriou and Tsitsiklis' MDP proof [116]. 4.6 Reinforcement Learning in Alternating Games. As mentioned in the introduct...

59 | Planning with external events
- Blythe
- 1994
Citation Context: ...w that the problem is somehow inherently intractable? There are representations for rewards and transitions that make it possible to specify compact models for MDPs with exponential-size state spaces [87, 21, 24, 113]. What are the complexity issues? It is probably computationally intractable to find ε-optimal policies using compact representations, but are there useful subclasses of MDPs that can be solved efficien...

58 | Risk and reinforcement learning
- Heger
- 1994
Citation Context: ...m expected discounted reward over the infinite horizon (Chapter 2), minimax expected discounted reward over the infinite horizon (Chapter 5), maximum worst-case discounted reward over the infinite horizon [62], minimax expected average reward over the infinite horizon [183], maximum expected average reward over the infinite horizon [102], maximum expected undiscounted reward until goal (cost-to-go) [29], minimax...

55 | Learning in Embedded Systems
- Kaelbling
- 1990
Citation Context: ...e current batter, and the number of balls and strikes. It probably would not include the color of the field, the number of bases, or the shape of home plate. Figure 1.1 depicts a generic embedded agent [72] interacting with its environment. The agent is represented by a robotic figure and the environment as a blob. Although the agent is a decision-making entity, it is not enough for it simply to make deci...

54 | The Witness Algorithm: Solving Partially Observable Markov Decision Processes
- Littman
- 1994
Citation Context: ...elbling and Cassandra, "Acting optimally in partially observable stochastic domains" [32] with Cassandra and Kaelbling, "The witness algorithm: Solving partially observable Markov decision processes" [92], "Learning policies for partially observable environments: Scaling up" [94] with Cassandra and Kaelbling, and "An efficient algorithm for dynamic programming in partially observable Markov decision pro...

53 | A subexponential randomized algorithm for the simple stochastic game problem
- Ludwig
- 1995
Citation Context: ...lgorithm in polynomial time with a bounded probability of error. Perhaps a randomized algorithm for alternating Markov games would be easier to find. There is a randomized subexponential-time algorithm [101]; is there one that runs in polynomial time? A connection can be made between deterministic MDPs and min-cost flow problems (see Chapter 2). Can these connections be exploited to find an efficient algorithm...

48 | A probabilistic production and inventory problem. Management Science
- d'Epenoux
- 1963
Citation Context: ...a model specified by unknown parameters; they showed how to build an asymptotically optimal non-stationary policy for such models. The linear programming formulation of MDPs was identified by D'Epenoux [45] and others (see Hoffman and Karp's paper [67] for a list). Kushner and Kleinman [88] explored reasons for preferring the dual formulation for some applications. Denardo [44] explicitly linked policy i...
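The primal linear-programming formulation referenced in this context minimizes the sum of state values subject to V(s) ≥ R(s,a) + γ Σ T(s,a,s')V(s') for every state-action pair; a sketch with illustrative 2-state, 2-action numbers (not from the thesis), solved with SciPy:

```python
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
T = np.array([[[1.0, 0.0], [0.0, 1.0]],     # T[s, a, s'] (hypothetical)
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],                   # R[s, a] (hypothetical)
              [2.0, 0.0]])

# Each constraint V(s) >= R(s,a) + gamma * T(s,a,:) . V becomes
# (-e_s + gamma * T(s,a,:)) . V <= -R(s,a) in linprog's A_ub x <= b_ub form.
A_ub, b_ub = [], []
for s in range(2):
    for a in range(2):
        A_ub.append(-np.eye(2)[s] + gamma * T[s, a])
        b_ub.append(-R[s, a])

# Minimize V(0) + V(1); the optimum is the optimal value function V*.
res = linprog(c=np.ones(2), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * 2)
V = res.x
```

At the optimum the binding constraints identify the greedy optimal actions, which is why the dual variables admit the flow-like interpretation mentioned elsewhere on this page.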

48 | Multi-armed Bandit Allocation Indices. Wiley-Interscience series in systems and optimization - Gittins - 1989 |

41 | Generalization and scaling in reinforcement learning
- Ackley, Littman
- 1990
(Show Context)
Citation Context ...e-history windows, observation prediction models, and recurrent networks for approximating the optimal value function. Meeden, McGraw, and Blank [106] applied a simple backpropagation-based algorithm [2] to a recurrent network that learned to drive a remote-controlled car. Lin and Whitehead [179] presented reinforcement-learning algorithms for learning internal representations of the state, and an al... |

40 | Exploration bonuses and dual control.
- Dayan, Sejnowski
- 1996
(Show Context)
Citation Context ...re of interest because they make it possible to consider a broader range of sequential decision-making problems; however, they are more complex mathematically and few formal models have been proposed [42]. For this reason, previous attempts at handling non-stationary environments have focused almost exclusively on empirical studies [93, 156]. In this thesis, I am most concerned with the theoretical an... |

40 |
Observation of a Markov process through a noisy channel
- Drake
- 1962
(Show Context)
Citation Context ...have been employed to solve them, developed associated complexity results, and described reinforcement-learning approaches. The fundamental mathematical structure of pomdps was developed by Drake [48] and Astrom [5]. The algorithmic foundation was laid by Sondik [149, 150]. Additional information on algorithmic approaches can be found in Section 7.8. State estimation in a type of continuous-space ... |

37 |
Linear programming and finite Markovian control problems
- Kallenberg
- 1983
(Show Context)
Citation Context ...lved exactly in polynomial time using linear programming. Given this fact, it is perhaps surprising that no finite-size linear program can express the optimal value function of an arbitrary Markov game [76]. This follows from the fact that linear programs have rational solutions given rational coefficients, while Markov games can have irrational solutions. 5.5 Complexity Results Markov games can be solved... |

36 | Modular Neural Networks for Learning Context-Dependent Game Strategies
- Boyan
- 1992
(Show Context)
Citation Context ...a flattened excerpt from a table of one-step backup operators (reconstructed; any discount factors were lost in extraction): discounted expected mdp [174]: max_u { f(x,u) + Σ_{x'} T(x,u,x') g(x') }; cost-based mdp [29]: min_u { f(x,u) + Σ_{x'} T(x,u,x') g(x') }; evaluating a policy [154]: Σ_u π(x,u) ( f(x,u) + Σ_{x'} T(x,u,x') g(x') ); alternating Markov game [26]: max_u or min_u { f(x,u) + Σ_{x'} T(x,u,x') g(x') }; risk-sensitive mdp [62]: max_u { f(x,u) + min_{x'∈N(x,u)} g(x') }; evaluating a risk-sensitive policy: Σ_u π(x,u) ( f(x,u) + min_{x'∈N(x,u)} g(x') ); exploration-sensitive mdp [71]: max... |
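The table excerpted above lists one-step backup operators that differ only in how they summarize over actions. A minimal sketch of that idea (names, data, and the omission of a discount factor are ours, not the thesis's) makes the operator pluggable:

```python
# Sketch: a generalized value backup where the action-summary operator is a
# parameter, mirroring the table's rows (max for discounted mdps, min for
# cost-based ones, and so on). Toy problem data; gamma is an assumption.
GAMMA = 0.9

def backup(g, f, T, states, actions, summarize):
    """One generalized backup of the value function g."""
    return {
        x: summarize(
            x,
            {u: f[x][u] + GAMMA * sum(T[x][u][x2] * g[x2] for x2 in states)
             for u in actions},
        )
        for x in states
    }

# Summary operators corresponding to two rows of the table.
max_over_actions = lambda x, q: max(q.values())   # discounted expected mdp
min_over_actions = lambda x, q: min(q.values())   # cost-based mdp

states, actions = [0, 1], ["a", "b"]
f = {0: {"a": 1.0, "b": 0.0}, 1: {"a": 0.0, "b": 2.0}}   # rewards f(x, u)
T = {x: {u: {x2: 0.5 for x2 in states} for u in actions} for x in states}
g0 = {0: 0.0, 1: 0.0}

g1 = backup(g0, f, T, states, actions, max_over_actions)
```

Swapping in a policy-weighted expectation or a min over a neighbor set would reproduce the other rows without touching `backup` itself.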

36 |
Adding temporary memory to ZCS
- Cliff, Ross
- 1995
(Show Context)
Citation Context ...suffer from convergence to local optima on more complex problems. Register memory Another short-term memory structure that has been studied in the reinforcement-learning framework is storage registers [72, 91, 181, 35]. The idea here is that the agent has explicit actions for saving information in non-volatile memory, and for retrieving this information at a later time. 2 The method has been used successfully when ... |

30 |
Convergence of indirect adaptive asynchronous value iteration algorithms. Advances in neural information processing systems
- Gullapalli, Barto
- 1994
(Show Context)
Citation Context ...nt has access to a model of the environment, updates can be performed at any state at any time. The convergence of model-based reinforcement learning for mdps was shown by Gullapalli and Barto [59]. In Section 3.6.4, I present a related theorem for a broader class of models. Reinforcement learning is an exciting area and new algorithms and studies are appearing every day; this section barely sc... |

28 |
Model-free reinforcement learning for non-Markovian decision problems
- Singh, Jaakkola, et al.
- 1994
(Show Context)
Citation Context ...ic structure of memory-based solutions to pomdps. Problems relating to finding observation-to-state mappings in pomdps, sometimes called memoryless policies, have been studied in many different contexts [182, 91, 146, 70]. Finding the optimal memoryless policy is NP-hard [91], and it often has very poor performance. In the case of the environment of Figure 6.1, for example, no memoryless policy takes less than an infin... |

25 | Finding Mixed Strategies with Small Supports in Extensive Form Games
- Koller, Megiddo
- 1996
(Show Context)
Citation Context ...ted type of incomplete-information game in which the players' actions are issued sequentially but are not revealed until after both players have made their decisions. Koller, Megiddo, and von Stengel [82, 84, 83] looked closely at games of partial information. They developed algorithms that run in polynomial time with respect to the size of the game tree, which roughly means that their results apply to Markov... |

18 | Adaptation in constant utility non-stationary environments
- Littman, Ackley, et al.
- 1991
(Show Context)
Citation Context ...e complex mathematically and few formal models have been proposed [42]. For this reason, previous attempts at handling non-stationary environments have focused almost exclusively on empirical studies [93, 156]. In this thesis, I am most concerned with the theoretical analysis of algorithms, and therefore restrict the discussion to environments that do not change over time. Although I focus exclusively on t... |

16 | The loss from imperfect value functions in expectation-based and minimax-based tasks - Heger - 1996 |

16 | When the Best Move Isn’t Optimal: Q-Learning with Exploration.
- John
- 1994
(Show Context)
Citation Context ...kins and Dayan [174], Tsitsiklis [163], and Jaakkola, Jordan, and Singh [69]. The latter two papers brought out the connection between Q-learning and work in the field of stochastic approximation. John [71] gave a critique of the use of the asymptotic optimal policy as a target for learning. The convergence of model-based reinforcement-learning methods was studied by Gullapalli and Barto [59]. Hernan... |

14 |
Rapid task learning for real robots
- Connell, Mahadevan
- 1993
(Show Context)
Citation Context ...ection 2.6.1). The rule itself is an extremely natural extension of Q-learning to vector-valued state spaces. In fact, an elaboration of this rule was developed independently by Connell and Mahadevan [38] for solving a distributed-representation reinforcement-learning problem in robotics. 7.6.2 Linear Q-learning Although replicated Q-learning is a generalization of Q-learning, it does not extend corre... |

9 |
Heights of convex polytopes
- Klee
- 1965
(Show Context)
Citation Context ...algorithms [33] make use of special-purpose routines that enumerate the vertices of each linear region of the value function. Bounding the number of vertices in a polyhedron is a well-studied problem [80] and it is known that there can be an exponential number. In fact, there is a family of one-stage pomdp problems such that, for every n, |S| = n+1, |A| = 2n+1, |Z| = 1, |Γ_{t−1}| = 1, |Γ_t| ≥ 2n+1, and yet th... |

8 |
The Complexity of Linear Programming
- Dobkin, Reiss
(Show Context)
Citation Context ...at the relevant parameters can be summarized by a |A1| × |A2| matrix consisting of the immediate reward values. Solving a matrix game is known to be polynomially equivalent to solving a linear program [47]. 5.2.2 Acting Optimally As with alternating Markov games, I mainly consider the problem of finding minimax-optimal policies. Once again, I consider only the discounted expected value criterion. It is po... |
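The excerpt notes that solving a matrix game is polynomially equivalent to linear programming. A minimal sketch of one direction of that equivalence (the formulation and variable names are ours, not the thesis's) computes a minimax strategy with an off-the-shelf LP solver:

```python
# Sketch: the minimax value of a zero-sum matrix game as a small LP.
# The row player maximizes v subject to sum_i p_i M[i, j] >= v for every
# column j, with p a probability vector.
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(M):
    """Return (value, row_strategy) for the zero-sum game with payoff M."""
    M = np.asarray(M, dtype=float)
    n_rows, n_cols = M.shape
    # Variables: p_1..p_n (row player's mixed strategy) followed by v.
    c = np.zeros(n_rows + 1)
    c[-1] = -1.0                                # maximize v == minimize -v
    # For every column j:  v - sum_i p_i M[i, j] <= 0
    A_ub = np.hstack([-M.T, np.ones((n_cols, 1))])
    b_ub = np.zeros(n_cols)
    # Probabilities sum to one; v is unconstrained.
    A_eq = np.concatenate([np.ones(n_rows), [0.0]]).reshape(1, -1)
    bounds = [(0, 1)] * n_rows + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=bounds)
    return res.x[-1], res.x[:-1]

value, strategy = solve_matrix_game([[1, -1], [-1, 1]])   # matching pennies
```

For matching pennies the solver recovers the familiar answer: value 0 with the uniform mixed strategy.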

8 |
Stochastic Systems, Estimation, Identification and Adaptive Control
- Kumar, Varaiya
- 1986
(Show Context)
Citation Context ... C[s,a] + 1; T̃(s,a,s') = Tc[s,a,s'] / C[s,a] and R̃(s,a) = Rs[s,a] / C[s,a]. The estimated model can be used in any of several ways to find a good policy. In the certainty-equivalence approach [86], an optimal policy for the estimated model is found at each step. This makes maximal use of the available data at the cost of high computational overhead. In the DYNA [155], prioritized-sweeping [111... |
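The excerpt's estimated model is just ratios of visit counts; a certainty-equivalence learner would then plan in this estimated model at each step. A minimal sketch (class and variable names are ours):

```python
# Sketch: maintain counts C[s, a], Tc[s, a, s'], and summed rewards
# Rs[s, a], and read off the estimated model as ratios, as in the excerpt.
from collections import defaultdict

class ModelEstimate:
    def __init__(self):
        self.C = defaultdict(int)      # C[s, a]: visits to (s, a)
        self.Tc = defaultdict(int)     # Tc[s, a, s']: observed transitions
        self.Rs = defaultdict(float)   # Rs[s, a]: total reward received

    def record(self, s, a, r, s_next):
        """Incorporate one experience tuple (s, a, r, s')."""
        self.C[s, a] += 1
        self.Tc[s, a, s_next] += 1
        self.Rs[s, a] += r

    def T_hat(self, s, a, s_next):
        return self.Tc[s, a, s_next] / self.C[s, a]

    def R_hat(self, s, a):
        return self.Rs[s, a] / self.C[s, a]

m = ModelEstimate()
m.record("s0", "a", 1.0, "s1")
m.record("s0", "a", 0.0, "s0")
```

After the two experiences above, the estimates are T̃(s0,a,s1) = 1/2 and R̃(s0,a) = 1/2; DYNA-style methods would reuse the same counts but spread planning effort over many steps.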

8 |
Mathematical programming and the control of Markov chains
- Kushner, Kleinman
- 1971
(Show Context)
Citation Context ...known as the dual, can also be used to solve mdps. One advantage of the dual formulation is that it makes it possible to express and incorporate additional constraints on the form of the policy found [88]. The dual linear program appears in Table 2.5. The f[s', a] variables can be thought of as indicating the amount of "policy flow" through state s' that exits via action a. Under this interpretation, t... |
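The excerpt describes the dual LP's f[s, a] variables as "policy flow". A small sketch of that formulation (the two-state MDP, the uniform start distribution, and all names are ours, not the thesis's Table 2.5):

```python
# Sketch: dual LP for a tiny mdp. Maximize sum_{s,a} f[s,a] R(s,a) subject
# to flow conservation: sum_a f[s',a] - gamma * sum_{s,a} f[s,a] T(s,a,s')
# equals the start-state probability of s'. A greedy policy is read off f.
import numpy as np
from scipy.optimize import linprog

gamma, n_s, n_a = 0.9, 2, 2                  # actions: 0 = stay, 1 = go
T = np.zeros((n_s, n_a, n_s))
T[:, 0, :] = np.eye(n_s)                     # "stay" keeps the state
T[:, 1, 1] = 1.0                             # "go" jumps to state 1
R = np.zeros((n_s, n_a))
R[1, 0] = 1.0                                # reward for staying in state 1

c = -R.flatten()                             # maximize flow-weighted reward
A_eq = np.zeros((n_s, n_s * n_a))
for s2 in range(n_s):
    for s in range(n_s):
        for a in range(n_a):
            A_eq[s2, s * n_a + a] = (s == s2) - gamma * T[s, a, s2]
b_eq = np.full(n_s, 1.0 / n_s)               # uniform start distribution
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (n_s * n_a))
flow = res.x.reshape(n_s, n_a)
policy = flow.argmax(axis=1)                 # act greedily in the flow
```

Here all flow at state 0 exits via "go" and all flow at state 1 via "stay", so the recovered policy heads to the rewarding state and remains there.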

7 |
Adaptive control of discounted Markov decision chains
- Hernandez-Lerma, Marcus
- 1985
(Show Context)
Citation Context ...he use of the asymptotic optimal policy as a target for learning. The convergence of model-based reinforcement-learning methods was studied by Gullapalli and Barto [59]. Hernandez-Lerma and Marcus [64] examined the closely related problem of finding an optimal policy for an mdp with a model specified by unknown parameters; they showed how to build an asymptotically optimal non-stationary policy for su... |

7 |
Fast algorithms for finding randomized strategies in game trees
- Koller, Megiddo, et al.
- 1994
(Show Context)
Citation Context ... too complex. For example, as long as the true state is revealed often enough, it ought to be possible to combine a successive-approximation algorithm with an efficient algorithm for solving game trees [84]. Would such a hybrid algorithm be of interest? Are there any applications with this structure? 5.8 Related Work Section 4.8 listed work related to alternating Markov games as well as the more general ... |

7 |
A Generalized Reinforcement-Learning Model
- Littman, Szepesvári
- 1996
(Show Context)
Citation Context ... in this class. Chapter 3: Generalized Markov Decision Processes. Portions of this chapter have appeared in earlier papers: "A generalized reinforcement-learning model: Convergence and applications" [97] with Szepesvari, and "Generalized Markov decision processes: Dynamic-programming and reinforcement-learning algorithms" [158] with Szepesvari. The Markov decision process model, discussed in the previ... |

5 |
Basic structures of modern algebra
- Bahturin
- 1993
(Show Context)
Citation Context ...he answer is just one bit), but the exact complexity is unknown. Can it be shown to be uncomputable, perhaps by relating it to the problem of finding the roots of polynomial equations and Galois theory [6]? There are criteria other than minimax that capture the competitive aspect of games, while satisfying the conditions for being a generalized mdp. Among these are rules in which agents choose randomly... |

5 |
Efficient dynamic-programming updates in partially observable Markov decision processes
- Littman, Cassandra, et al.
- 1996
(Show Context)
Citation Context ...ing policies for partially observable environments: Scaling up" [94] with Cassandra and Kaelbling, and "An efficient algorithm for dynamic programming in partially observable Markov decision processes" [95] with Cassandra and Kaelbling. In this chapter, I present a number of algorithms for solving information-state Markov decision processes. As discussed in Section 6.2.2, an information-state mdp arises ... |

4 |
Dynamic Programming: Models and Applications (Prentice-Hall
- Denardo
- 1982
(Show Context)
Citation Context |

4 |
Reinforcement learning applied to a differential game. Adaptive Behavior
- Harmon, Baird, et al.
- 1996
(Show Context)
Citation Context ...uming a synchronous environment is weaker than assuming that each action takes a fixed amount of time. It is possible to approximate an asynchronous environment by a synchronous one by discretizing time [61]. single vs. multiple agent For simplicity, I consider environments with either one or two agents. Environments with more agents are worthy of study but can be extremely complicated to analyze because... |

3 |
Adaptive aggregation methods for infinite horizon dynamic programming
- Bertsekas, Castanon
- 1989
(Show Context)
Citation Context ...thods for solving mdps, including methods that accelerate the convergence of value iteration by keeping explicit suboptimality bounds [15] and by grouping and regrouping states throughout the process [17]. A different approach is illustrated in modified policy iteration [127], which has the basic form of policy iteration with the difference that a successive-approximation algorithm (basically value itera... |

3 |
New finite pivoting rules for the simplex method
- Bland
- 1977
(Show Context)
Citation Context ...ps might not include the counterexample linear programs. Some progress has been made speeding up simplex-based methods, for instance, through the introduction of randomized versions of pivoting rules [20], some of which have been shown to result in subexponential complexity [75]. The fact that the optimal value function for an mdp can be expressed as the solution to a polynomial-size linear program ha... |

3 |
On the complexity of partially observed Markov decision processes
- Burago, de Rougemont, Slissenko
- 1996
(Show Context)
Citation Context ...kov decision processes was initiated by Papadimitriou and Tsitsiklis [116]. For pomdps, they showed that the finite-horizon problem is PSPACE-hard. More recent work by Burago, de Rougemont and Slissenko [31] showed that a class of pomdps with bounded unobservability can be solved in polynomial time. They introduced a parameter m which is a measure of how "unobservable" the environment is; given that obse... |

3 |
Introduction to the Theory of Neural Computation
- Hertz, Krogh, et al.
- 1991
(Show Context)
Citation Context ...he update rule ΔR(s, a_t) = α_t x_t[s] ( r_t − Σ_{s'} x_t[s'] R(s', a_t) ), where α_t is a learning rate, can be shown to make the reward function converge to one that predicts the immediate rewards arbitrarily accurately [66]. 6.7 Open Problems The study of algorithmic and complexity properties of pomdps is still relatively young. Although the results I presented in this chapter constitute significant progress towards unde... |
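The update rule in the excerpt above is a delta rule: the belief vector x_t weights the error between the observed reward and the belief-weighted prediction. A minimal sketch (the two-state setup and all names are ours):

```python
# Sketch: LMS-style learning of per-(state, action) reward estimates from
# belief states, as in the excerpt's update rule. alpha is the learning
# rate; x is the current belief vector over hidden states.
import numpy as np

def update_reward(R, a, x, r, alpha):
    """One delta-rule update of the reward estimates R[state, action]."""
    error = r - x @ R[:, a]          # observed minus predicted reward
    R[:, a] += alpha * x * error
    return R

rng = np.random.default_rng(0)
true_R = np.array([[0.0], [1.0]])    # 2 hidden states, 1 action
R = np.zeros((2, 1))
for _ in range(2000):
    x = rng.dirichlet([1.0, 1.0])    # a random belief state
    r = float(x @ true_R[:, 0])      # noiseless belief-weighted reward
    R = update_reward(R, 0, x, r, alpha=0.1)
```

Because the target rewards are an exact linear function of the belief, the estimates converge toward the true per-state rewards, matching the convergence claim cited to [66].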

3 |
Stochastic games
- Mertens, Neyman
- 1981
(Show Context)
Citation Context ...showed how to build an asymptotically optimal non-stationary policy for such models. The linear programming formulation of mdps was identified by D'Epenoux [45] and others (see Hoffman and Karp's paper [67] for a list). Kushner and Kleinman [88] explored reasons for preferring the dual formulation for some applications. Denardo [44] explicitly linked policy iteration to linear programming. Schrijver [14... |

3 |
The complexity of two-person zero-sum games in extensive form
- Koller, Megiddo
- 1992
(Show Context)
Citation Context ...ted type of incomplete-information game in which the players' actions are issued sequentially but are not revealed until after both players have made their decisions. Koller, Megiddo, and von Stengel [82, 84, 83] looked closely at games of partial information. They developed algorithms that run in polynomial time with respect to the size of the game tree, which roughly means that their results apply to Markov... |

2 |
The optimal search for a moving target when the search path is constrained
- Eagle
- 1984
(Show Context)
Citation Context ...lgorithms for finding Γ_t work by enumerating the set G_t of possibly useful policy trees, and then identifying which of these policy trees is useful: Monahan's algorithm [109] was the first and later Eagle [51] and Lark [176] provided improvements. However, all these algorithms, regardless of their details, build G_t, the size of which is |A||Γ_{t−1}|^{|Z|}. Thus, even if a policy tree could be identified as usefu... |

2 |
Adaptive Control of Markov Processes with Incomplete State Information and Unknown Parameters
- Hernández-Lerma, Marcus
- 1987
(Show Context)
Citation Context ...] studied the problem of learning approximately correct probabilistic automata from experience. Their learning framework is very interesting, and worth extending to pomdps. Hernandez-Lerma and Marcus [65] approach the problem of reinforcement-learning in pomdps from a different perspective; their results show that given a method for learning a parameterized model of the environment, it is possible to u... |

1 |
Imposed and learned conventions in multiagent decision processes: Extended abstract. Unpublished manuscript
- Boutilier
- 1995
(Show Context)
Citation Context ... undiscounted reward until goal (cost-to-go) [29]; minimax expected undiscounted goal probability [36]; maximum expected undiscounted goal probability [87]; maximum multiagent discounted expected reward [22] (Table 1.3: several popular objective functions). Under the discounted objective, the discount factor 0 < γ < 1 controls how much effect future rewards have on the decisions at each moment, with small v... |

1 |
Algorithms for approximating optimal value functions in acyclic domains. Unpublished manuscript
- Boyan, Moore
- 1995
(Show Context)
Citation Context ... because no more than |S1| + |S2| steps can elapse before the absorbing state is reached. These games can be solved in polynomial time using value iteration, or by a procedure referred to as DAG-SP [28], which I will describe now. In a cycle-free game, all states can be categorized by the largest possible number of transitions that can elapse between an agent occupying the state and the agent reachi... |
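The excerpt describes DAG-SP as ordering states by the largest number of transitions to the absorbing state, then backing each up once. A minimal single-agent sketch of that ordering (the three-state example and all names are ours, not the thesis's):

```python
# Sketch: in a cycle-free problem, rank each state by its longest path to
# the goal, then compute cost-to-go values in one near-goal-first sweep,
# so every state is backed up exactly once.
def dag_sp(states, goal, succ, cost):
    """succ[s]: list of (action, next_state); returns cost-to-go values."""
    depth = {goal: 0}

    def longest(s):                  # longest transition count from s to goal
        if s not in depth:
            depth[s] = 1 + max(longest(s2) for _, s2 in succ[s])
        return depth[s]

    for s in states:
        longest(s)
    V = {goal: 0.0}
    for s in sorted(states, key=lambda s: depth[s]):   # near-goal first
        if s != goal:
            V[s] = min(cost[s, a] + V[s2] for a, s2 in succ[s])
    return V

succ = {"g": [], "b": [("a1", "g")], "c": [("a1", "b"), ("a2", "g")]}
cost = {("b", "a1"): 1.0, ("c", "a1"): 1.0, ("c", "a2"): 5.0}
V = dag_sp(["g", "b", "c"], "g", succ, cost)
```

State "c" prefers the two-step route through "b" (total cost 2) over the direct but expensive action (cost 5); because the ordering guarantees every successor is finished first, no state needs a second backup.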

1 |
Controlled Markov processes on the in nite planning horizon: Weighted and overtaking cost criteria
- Fernandez-Gaucherand, Ghosh, et al.
- 1995
(Show Context)
Citation Context ... discounted-reward criteria were discussed by Blackwell [19]. A survey of results concerning the average-reward criterion was written by Arapostathis et al. [4]. Fernandez-Gaucherand, Ghosh and Marcus [52] explored combinations of discounted and average reward as a way to better trade off short-term and asymptotic reward. In the area of reinforcement learning, the average-reward criterion was studied first ... |

1 |
Ordered field property for stochastic games when the player who controls transitions changes from state to state
- Filar
- 1981
(Show Context)
Citation Context ...rkov games from a game-theory perspective have been written by Van Der Wal [166] and Vrieze [170]. A shorter survey is also available in a game-theory overview edited by Peters and Vrieze [120]. Filar [54] specifically examined the difference between simultaneous- and alternating-action games. It is interesting to note that many of the great minds of computer science worked on creating game-playing progr... |

1 |
Completely observable Markov decision processes with observation costs. Unpublished manuscript
- Hansen
- 1995
(Show Context)
Citation Context ...applications of pomdps to important real-world problems. The constraints present in these applications might be sufficient to make the corresponding pomdps solvable. Some progress has been made: Hansen [60] blended completely unobservable and completely observable mdps to form an intermediate model, and Simmons and Koenig [144] controlled a robot using a pomdp model. From the interest that has been gene... |